About this episode
Tim and Juan provide an honest no-BS summary of what they observed and learned at Big Data London and the AI Conference in San Francisco. Tune in to get the latest on data and AI, plus short interviews with Peter Norvig (Google), Nazneen Rajani (Hugging Face), Fabiana Clemente (YData), Danny Bickson (Visual Layer), Gev Sogomonian (AimStack), Yujian Tang (Zilliz), Patrick McFadin (DataStax) and Ofer Mendelevitch (Vectara).
Tim Gasper [00:00:00] Hello everyone. It's time once again for Catalog & Cocktails, your honest, no-BS, non-salesy conversation about enterprise data management with tasty beverages in hand. I'm Tim Gasper, longtime data nerd, product guy and customer guy at data.world, joined by Juan Sequeda, principal scientist and head of our AI lab. We've been on the road the last couple of weeks. Last week we were at Big Data London at the Olympia, Wednesday and Thursday, September 20th through 21st. I'm going to provide you a quick update on some of the key themes from that event. And Juan Sequeda is actually on site right now at the AI Conference over in California, and he's going to give you an update on what he's seeing on the ground there, along with interviews where he talks with people about what they're seeing in the field of AI, both from a technology standpoint and a case study standpoint. Stay tuned for just a few minutes to hear from Juan talking with the various folks at the AI Conference. First of all, just to kick us off, I want to talk a little bit about the Big Data London conference, which we both attended. And the big theme there, not surprisingly, was AI, AI, AI. Whether it was Snowflake, Microsoft, Gong, Coca-Cola, and so on, all the organizations, vendors and companies alike, were talking about the big impacts of generative AI and the disruption that's happening in every field and every technology. That trend continued at Big Data London, where probably about 50% of the talks had something to do with AI. And as you walked the vendor floor, you could look around at people's headlines and taglines, augmented analytics, AI, these things were showing up on probably at least 40% of the boards as you walked the stalls. And everyone's excited for good reason.
There's a lot of really great innovation happening, a lot of creativity, productivity and scalability coming from using generative AI in products and in daily work. And there was an obvious theme across all the different talks: be cautious about some of the security and privacy challenges around generative AI, but the impact is undeniably strong. And those who are working in the field aren't going to get replaced by AI; you're going to get replaced by people who are using AI. So get out there, use AI, become comfortable with the technology. One of my favorite quotes actually comes from the Thursday morning keynote panel, where Di Mayze, the global head of data and AI at WPP, mentioned that in order to achieve AI, it's really critical to get your underlying data foundation right: metadata management, data governance, context. That was really key to them, and it was something the entire panel agreed with. Get your data house in order if you want to take advantage of AI. Other key themes at the conference: business value. I was really happy to see that, because when we get really excited about AI, I often think back to the big data craze of about 10 years ago and how obsessed we got with the technology while losing touch with the business value piece. There was a lot around case studies and business value. Cybersecurity was also a very important topic, especially with this new lens of how do we protect ourselves from new AI agents and AI threats, and also how do we use AI to make security better? For example, there was a talk, "Enabling Data Innovation with a Data Security Platform," from Satori and the director of product security over at Gong that I thought was particularly good. Cost savings was a theme: FinOps, CloudOps, performance management. How do we optimize our data warehouses?
Folks are starting to really go all in on Snowflake and Databricks, and for good reason they're worried about the cost they're spending on those technologies. Also, as folks start to experiment with training their own AI models and things like that, they're a little worried about that too. How much are we going to spend on GPUs? That stuff isn't cheap. It's a little cheaper when you're experimenting and can do things on demand, but if you want to train something in an ongoing fashion, or do inference in an ongoing fashion, that can get quite a bit more expensive. Those are interesting things to think about from a cost savings standpoint. Real time and streaming continue to be themes, whether you're looking at things like Kafka, or StreamSets, or some of the other streaming technologies; obviously a lot of really great players there. And then one final thing I want to mention: data mesh was definitely the biggest theme of Big Data London last year, and I think we saw a much more muted approach to data mesh this year. You still heard it mentioned, often positively, sometimes a little backhandedly, but the data mesh craze has really settled down and the conversation has shifted much more toward data products. How do we create really good data and analytics products in our organizations? How do we apply data product management, and the best practices of product management, to make better products and focus the surface area of what we're doing around data? A lot of good talks around that. That's my summary of Big Data London. Really fantastic event. Juan and I were really happy to be there, and thank you to all the folks we got to hang out with. If you missed our episode last week, we talked with Chris Tabb from LEIT DATA. He's a maven in the data community over in the London, UK area, so definitely check out that episode and connect with Chris.
And now I'm going to pass it over to Juan who is onsite over at the AI conference to talk about what's going on there and all his exciting interviews. Juan, over to you.
Juan Sequeda [00:05:36] I'm here with Peter Norvig. Peter, your honest no-BS take to the data world, or to the AI data world. What is it?
Peter Norvig [00:05:43] We're at a really exciting time right now. Things are happening quickly, and things that you couldn't do a month ago now work. I guess I would advise everybody to be careful and say, let's use these tools responsibly, and let's think, in any application, about who all the stakeholders are. Are we treating them fairly? Is the data we're getting representative of them? And try to build your systems responsibly and take advantage of this great technology we have now.
Juan Sequeda [00:06:17] Yeah, I think that's a fantastic point, because what happens is that we focus so much on the damn technology and we forget about the people we're creating all these tools for. Humans, we've got to make sure they're in there.
Peter Norvig [00:06:27] Yeah, that's what we're making it for, so let's think about it.
Juan Sequeda [00:06:30] All right, thank you very much.
Peter Norvig [00:06:31] Thank you.
Nazneen Rajani [00:06:33] Hi, this is Nazneen Rajani. I'm the research lead at Hugging Face. My take on data is that for supervised fine-tuning, data quality is key, so de-duplication and filtering are definitely extremely important. But one thing that we noticed, alongside Meta's finding that less is more for alignment, is that long is also more for alignment, in the sense that you could fine-tune your model on a very small number of prompts that are really, really long, and get the same level of performance as if you did it on 10x the data with much shorter prompts.
Juan Sequeda [00:07:07] Awesome, thank you.
Fabiana Clemente [00:07:11] Hi, my name is Fabiana Clemente. My no-BS take on AI is definitely how much we sometimes overlook the context of the data, the importance of the data, and also what we want to achieve for the business. We jump into building a model and sometimes we don't even understand the basics of the context of what we want to build. And that's, for me, the no-BS take on the development of AI.
Juan Sequeda [00:07:34] I completely agree, because one of the things is that we forget that we're doing this to solve a problem.
Fabiana Clemente [00:07:40] Yeah,-
Juan Sequeda [00:07:40] And what is that problem?
Fabiana Clemente [00:07:41] To solve a real problem.
Juan Sequeda [00:07:42] And then we're just doing all this, just having fun with the model and all the tech stuff, but let's not forget about the end users.
Fabiana Clemente [00:07:47] Yeah, exactly. And well, we were just speaking about the trouble with MLOps, and that's a very interesting case of exactly that: forgetting that data has a background and has a context.
Juan Sequeda [00:08:01] Context. Love it. Thank you.
Danny Bickson [00:08:05] Hi, I'm Danny Bickson, CEO of Visual Layer. And I can summarize my philosophy in two sentences. Everything is around data. What we do is manage visual data, and the current problem is that people don't treat quality as important enough, so they jump immediately to Stable Diffusion and all the cool things they can do with the data, but no one ever cleans the data, curates it, or sees what's inside. Once they miss this crucial step, they get very poor models, and they will work on the models ten times longer. That's a bit of a missing piece, I believe, that needs to be more carefully considered.
Juan Sequeda [00:08:49] One of the things that we were talking about that we really connected here was on metadata in context. So what is your take on metadata in context right now?
Danny Bickson [00:08:56] Oh, that's really critical. We only deal with visual data, but visual data doesn't come alone. You have a lot of metadata, like where the image was taken and who took it, and there are annotations, captions, bounding boxes; a lot of unstructured information comes with the visual information. And if you ignore it, you're pretty much in trouble. You have to take the visual data and then compare, for example, the captioning, annotations and bounding boxes to the visual context, and only as a whole can you treat it as one piece of information and see whether it makes sense or not.
Juan Sequeda [00:09:37] Thank you.
Gev Sogomonian [00:09:39] Hey, my name is Gev Sogomonian. I'm co-founder at AimStack. I've been chatting with Juan about a no-BS take on the world-
Juan Sequeda [00:09:49] What's your honest no-BS take that you want to send to the data and AI world?
Gev Sogomonian [00:09:53] Well, I think in general, building software has gotten an awful lot more complex than it used to be, so in order to make sense of things, you need to pretty much track everything that moves in your software nowadays. It's not the same software anymore: Software 2.0 or whatever, layered models, a bunch of models, connected agents and all these things. You have to track them, you have to log everything that moves. That would be my no-BS-
Juan Sequeda [00:10:23] But you really have to log everything?
Gev Sogomonian [00:10:25] Everything. Everything that moves. Not everything, but everything that moves. You have been logging code that moves, right? You have been logging some of the tracebacks that sort of move. But if you add AI into your software, then a lot of other things that move emerge as a result of that.
Juan Sequeda [00:10:42] One of the interesting conversations here is the parallel that we're seeing between software systems, AI systems, data systems, all this notion of logging, it's the same thing essentially, right?
Gev Sogomonian [00:10:54] Yeah. But I think fundamentally, AI is non-deterministic, right? Which means in order to know whether your system works...
Juan Sequeda [00:11:02] That's a good point. The non-determinism of your source of [inaudible]-
Gev Sogomonian [00:11:04] Because the AI makes you log more. Your source of truth is not the code anymore, so it's not just about tracking and seeing when the software breaks. You need to proactively think about it, because it's not deterministic. If it works now, that does not guarantee that it's going-
Juan Sequeda [00:11:24] That's an excellent point. That's a good one. I like that. All right, thank you.
Gev Sogomonian [00:11:28] Cheers. Thanks. Thanks for your time.
Yujian Tang [00:11:30] Yeah. Hi, my name is Yujian Tang. My honest no-BS take about AI data is that we have to remember that all of these advanced AI large language models are just advanced stochastic concepts, advanced statistical methods; there's no magic. All they're doing is taking some data and finding patterns. And so you want to be very, very careful about the quality of your data.
Juan Sequeda [00:11:51] And then before we started recording this, you were telling me something about vector databases.
Yujian Tang [00:11:57] Yes.
Juan Sequeda [00:11:57] Give me your honest, no-BS, non-salesy take on vector databases.
Yujian Tang [00:12:01] All right, I hope no one gets mad at me about this, but there are only five vector databases out there right now. One is Milvus by Zilliz, one is Pinecone, one is Chroma, one is Weaviate and one is Qdrant. And all the other ones, Elasticsearch, Rockset, MyScale, SingleStore, what they're doing is providing vector search on top of a regular database. They're not true vector databases.
Juan Sequeda [00:12:23] Okay, so following on this, so what? If the end goal is for users to be able to just have that search, why do I need to be so pedantic on this has to be a vector database? Why?
Yujian Tang [00:12:34] That's a very good question, and actually, you don't. What you really need is something that performs at the scale you need, and that is going to differ depending on what that scale is. I can only speak about Milvus in this sense, but Milvus has a real, visible performance advantage over the others. Once you hit maybe 100,000 to a million vectors, you start seeing a real performance drop-off in those other systems, and not just in data ingestion or querying, but in the entire throughput: queries per second, queries per dollar, all of these different statistics.
Juan Sequeda [00:13:10] Is the argument that if you're trying to do search, but want to do it at scale, then you really need to have a vector database, not just vector search on top of another database?
Yujian Tang [00:13:19] Yes, that's true. Yes.
Juan Sequeda [00:13:20] Okay. All right, fair point. Great, thanks.
Yujian Tang [00:13:22] Yeah, thank you.
Juan Sequeda [00:13:23] All right.
Patrick McFadin [00:13:24] Hi, I'm Patrick McFadin. I work at DataStax. I'm an Apache Cassandra committer. What's my no-BS take? Oh God, Juan, this is going to be good. My no-BS take is on this whole thing around vector databases that are just a vector database. Dude, it's a feature. It's a feature. And so many companies are trying to build a database around a feature. That's so hard, and it's just not a differentiated thing anymore. Every database that is a database just added one thing: vector. And so if you look around and you're like, oh, I'm using database X, I bet you, if you look really hard, it's probably supporting vector now. And if it isn't, they just announced it. So yeah, that's my hot take, and I really feel bad for the vector databases that are out there, the ones that are pure-play vector, because they're going to try to become a database and they have to make up all that ground.
Juan Sequeda [00:14:17] 30 years of baggage of stuff that they'd have to come up with. This is a good point. I think we see this in so many other places too, in my space, the metadata and data place: all these features end up becoming their own categories, and then it's just complicating everybody's lives. I'm like, I've got to go buy this other tool, go procure this other stuff. So yeah, I'm with you.
Patrick McFadin [00:14:36] Yeah, I mean, that's the thing. And it's like file systems. You don't trust a file system until it's been in production for a long time; that was like btrfs. No one used btrfs until it had five or six years in production, because you don't want to lose a bit. Well, same with databases. You don't buy a database that loses data, unless you use Mongo. People that use Mongo don't seem to care that it loses data, but that's their problem, right?
Juan Sequeda [00:15:02] I got two hot takes out of that.
Patrick McFadin [00:15:03] All right, two hot takes. All right. Love it. Thank you.
Juan Sequeda [00:15:08] All right.
Ofer Mendelevitch [00:15:08] Hi, I'm Ofer. I head developer relations at Vectara. My take-home message for the world, as of today here at the wonderful AI Conference, is actually related to my talk today. We launched a new embedding model called Boomerang, and I think embedding models weren't the focus for a long time. Everybody focused on GPT and Anthropic and the bigger language models. I think embedding models are going to start flourishing and becoming important. For the last two days I've heard RAG, RAG, RAG everywhere. Of course that's part of what we do, but I think the key to that is getting the best embedding models and making them better. And I think the community, both industry and academia, will start working on that and make all of our lives much better.
Juan Sequeda [00:15:53] I totally agree with you. I come from the data space, and I always hear about whether it's this language model or that one, but we never hear about the embeddings. Now we're hearing about RAG, and I think this is something we really look forward to learning more about. I actually loved your talk, and I'm looking forward to having you on our podcast too.
Ofer Mendelevitch [00:16:09] All right, thank you. Looking forward to it as well.
Juan Sequeda [00:16:13] That's it. And that's it for this episode of Catalog & Cocktails. It has been a lot these last couple of weeks, and next week we're going to be back with a guest, John Cook. If you've been following him on LinkedIn, you'll see how he's all about data products right now, including generative AI, large language models and business value. We're going to be able to bring everything together: data, AI, and business value. That's next week at Catalog & Cocktails. Cheers everyone. Talk to y'all soon.