Demystifying Vector Databases with Bob van Luijt

Tim Gasper [00:00:07] Hello, welcome. It's time for Catalog & Cocktails. It's your honest, no-BS, non-salesy conversation about enterprise data management with tasty beverages in hand. I'm Tim Gasper, longtime product nerd at data.world, joined by Juan Cicada. Hey, Juan.

Juan Cicada [00:00:20] Hey. I'm Juan Cicada, principal scientist and the head of the AI lab here at data.world. And as always, it's a pleasure to take that break and have a conversation about data, about AI, a lot of AI stuff. And today I'm super excited because we're going to have somebody who I've realized I've been wanting to connect with for so many years, and it just hasn't happened until finally right now. And that is Bob van Luijt, who's the co-founder of Weaviate. Bob, how are you doing?

Bob van Luijt [00:00:46] I'm doing well. Thanks for having me. This is great. A little bit early for me to have a cocktail though, but for the rest I'm doing great.

Juan Cicada [00:00:54] We always say it's not only 5:00 PM somewhere, but it's sunset somewhere, right? That somebody's leaving the podcast. So. All right Bob, well, we're going to be diving into Weaviate and vector databases and all the AI stuff, but before that, quick, what are we drinking and what are we toasting for today? I think we're all on coffee right now, but any special coffee?

Bob van Luijt [00:01:14] So no, just a regular cup of coffee. I'm currently staying on a houseboat in Marin, so just north of the Bay, and it's absolutely beautiful here. So what we're toasting for: I was reading something this weekend about game theory, because I'm interested in that kind of stuff. And I was reading about games that were played to solve the prisoner's dilemma. And one thing that made me very happy, and that's something I would like to toast to, is that you can have different strategies, so you can have a nice or forgiving strategy, or you can have an aggressive strategy. And it turns out that, by all the research that has been done, the nice strategy seems to always win over time. And I think that's nice because at Weaviate our first value is kindness. And reading this about game theory, I was like, hey, isn't that a beautiful thing? So actually in nature there's proof that doing the right thing works over the long term. So that's what I would like to toast.

Tim Gasper [00:02:32] I love that. That's fascinating.

Juan Cicada [00:02:35] Can't top that. To be kind, to be nice.

Tim Gasper [00:02:38] Yeah.

Juan Cicada [00:02:39] Cheers. Cheers there. Love that.

Tim Gasper [00:02:41] Be kind.

Bob van Luijt [00:02:43] Yes.

Juan Cicada [00:02:44] All right, we got so much to touch and talk about. Let's dive in. Bob, honest. No BS. What is a vector database?

Bob van Luijt [00:02:52] A vector database is a database where, and this mostly has to do with the search index, the vector embeddings are the first-class citizen. In a time series database, everything is around time, right? In a graph database, everything is about nodes and edges. In a vector database, the first-class citizens are these vector embeddings. And the vector embedding itself is as old as math. I mean, I'm exaggerating a little bit, but they of course started to play a prominent role with the last wave of machine learning, if you will. And that's what a vector database does. There's way more to it, and we'll probably dive into that as well, because it also has to do with how developers interact with the database. But at its core, if you, as I always like to say, peel the onion, at the heart it's about the vector index that sits in the database.

Tim Gasper [00:03:57] Yeah, that makes sense. And I know that vector databases aren't super, super recent. I mean, they're maybe a little bit more recent, but they didn't only appear with this recent AI craze. But I know that because of gen AI really taking the forefront in the technology mindset in the market, there's renewed interest in and focus on vector databases. Can you talk a little bit about that? What is this renewed focus, and why are vector databases such a hot thing right now?

Bob van Luijt [00:04:29] Yeah, so for everybody who's ever worked with machine learning models, whether they're open source, or you find them on Hugging Face, or you go to one of the proprietary models, it doesn't matter. You often see three types of models. I mean, there are more, but generally speaking you see generative models, and that's the whole wave that's happening now. I mean, they already existed, but the world really saw them through ChatGPT. Then sometimes we see re-ranker models, and we'll talk about what they do in a bit. And you see these embedding models. And for a lot of people, these embedding models are a little bit weird, because you throw text at it, or nowadays with multimodal models an image or audio, and what you get back is an array of floating-point numbers. That looks a little bit awkward. I remember very well when I saw that for the first time. For me, that was 2015. And I was like, huh, what's this? But it turns out that what we can do with that is search, because it's a spatial representation of whatever data we have. And so what we saw come out of that is search and recommendation solutions. Now, where it becomes interesting is that just comparing vector embeddings with each other and doing distance calculations is not that hard. You take one vector embedding, you compare it to another one, and you say, okay, what's the distance? And then you compare it to another one and you say, what's the distance? And so on and so forth. So if you have a thousand embeddings, that's a couple of lines of Python, and you're good to go. The problem with that is that it is brute force. So it gets slower and slower, in a linear fashion, the more embeddings you add. So now these algorithms started to emerge, often known as approximate nearest neighbor algorithms, where you basically say, well, it's this space, so we have nearest neighbors in that space. 
How can we find what these nearest neighbors are in a more optimal fashion, rather than very simply comparing them to each other by brute force? And the A is interesting in this abbreviation, because the A stands for approximate. So the cost, the thing that we have to quote unquote pay, is accuracy. Now bear in mind these accuracies are often in the 99% range, but still, it's often not 100%. And you can actually tweak them. So you can say, I'm willing to accept a lower accuracy for a significant speed-up. And here it becomes interesting, because the way that those kinds of indices are built, the way that those kinds of indices sit in the database, the way that you scale and shard the database and manage the database and so on and so forth, turned out to be a unique problem. So therefore we started to see this new flavor of databases where people said, "Okay, wait a second, this is actually a significant technological challenge."
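
Editor's sketch: the "couple of lines of Python" brute-force search Bob describes could look roughly like the following. This is an illustration only, not Weaviate code; the function names, the choice of cosine distance, and the random data are assumptions made for the example. The `recall` helper shows one common way to express the accuracy that an approximate index trades away against the exact brute-force answer.

```python
import math
import random


def cosine_distance(a, b):
    """Distance between two embeddings: 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_search(query, embeddings, k=3):
    """Compare the query against every stored embedding (linear in the data size)."""
    scored = [(cosine_distance(query, vec), idx) for idx, vec in enumerate(embeddings)]
    scored.sort()
    return [idx for _, idx in scored[:k]]


def recall(exact_ids, approx_ids):
    """Fraction of the true nearest neighbors that an approximate search returned."""
    return len(set(exact_ids) & set(approx_ids)) / len(exact_ids)


# Hypothetical data: 1,000 random 8-dimensional embeddings.
random.seed(0)
embeddings = [[random.gauss(0, 1) for _ in range(8)] for _ in range(1000)]

# Searching with a stored vector as the query should return that vector first.
query = embeddings[42]
top = brute_force_search(query, embeddings, k=3)
```

Every query touches all stored vectors here, which is the linear slowdown mentioned above. An approximate nearest neighbor index avoids the full scan, and `recall(exact, approximate)` is how one would measure what that shortcut costs.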

Juan Cicada [00:07:56] So this has been the evolution of what you're describing from a technical perspective, understanding how to go manage this. If you're treating this vector as a first-class citizen and you're doing some type of query, and in this case you want to do nearest neighbor, you have to have these first-class types of query approaches. And you realize that if you treat it as first class, you're going to have to go deal with the problems that come with treating it as first class. Because I think what we're seeing right now in the market is people saying, "Oh, that's just a feature. Oh, well, these other databases can add that too." But then isn't the argument that they're not treating those as first class, and therefore they're going to be maybe good enough for some use cases, but suboptimal for other ones? So what I want to get towards is: I am out there on the street hearing about vector databases and other things. People are like, "That's just a feature." How do you respond to that?

Bob van Luijt [00:08:49] And the answer is that's true, but let me say that it is true for any other database that exists. I can store a JSON object in Postgres. Why would I use MongoDB? Well, I can tell you why you might want to use that. I can make a graph relation in Postgres. I can make you an argument for why you'd want to use a dedicated graph database, and so on and so forth. And by the way, if you look at history, anytime a database for a new data type emerged, the "but we can do this in X" argument, X being a general-purpose database, is as old as databases are. So that's fine. Actually, I would even go a step further. I think that's a beautiful thing because, as I always tell our team, it's a champagne problem. Because if somebody said, "Vector embeddings? Really? We're not going to spend the time integrating that," that means, okay, great, now you have a category in a niche with a very unique index that nobody wants. So the fact that everybody's integrating it is a great thing. But I always say building a database is a bit like an avocado. So everything that we've been talking about now is the pit of the avocado. It's a hardcore tech thing, and we can talk for hours about that. So we can keep drilling into that. But in building a database, there's the company, and there's also the developer experience around how you're interacting with that database. And it's all of that combined that makes the vector database unique.

Juan Cicada [00:10:33] That's a great T-shirt quote right there: build a database like an avocado. You have the pit. That's brilliant. And it's totally, I mean, I've lived this, right? My background is all the RDF, semantic web, graph stuff, and very early on people were like, "Well, I can do this in databases." And 10-plus years ago, that's why I always had problems publishing in the database community, because they're like, "Well, no, we can already do this. We don't need that other stuff." Here it is, right? This happens all the time. Right? Totally get the history. I've lived that part from the [inaudible]. But, okay, so let's start unpacking this, let's do the avocado, right? Okay, that's the pit. And yeah, we can dive into the pit and I'd love to do that, but let's extend it. So what? Where do vector databases fit in the broader picture of AI? And as you can imagine, let's dive into these RAG architectures and basically these AI-native applications that we want to talk about.

Bob van Luijt [00:11:30] Yeah. The database is, like any other database, a piece of the core infrastructure. So what you often see with these new paradigms is that a stack starts to emerge. People listening might be familiar with things like the LAMP stack, right? Back in the day, that's how I started my career.

Juan Cicada [00:11:48] Yeah. Linux, MySQL, PHP.

Bob van Luijt [00:11:51] Exactly, yes.

Juan Cicada [00:11:52] First company I did in 2004 or five, that was all a LAMP stack.

Bob van Luijt [00:11:56] Exactly.

Tim Gasper [00:11:56] And then you got the MEAN stack, right, with Mongo.

Bob van Luijt [00:11:58] Exactly. But now if you add, what does the A in the MEAN stack stand for?

Tim Gasper [00:12:04] Angular, right?

Bob van Luijt [00:12:05] Angular, yes. And what's very interesting about that is that it's very JavaScript heavy, for good reason, because of the developers using that kind of infrastructure. And I'll get to your answer with a little detour, but the JavaScript element there is extremely important when you're building web apps or normal apps, because it's all JSON going back and forth and God knows what. So that's a very logical thing. It turns out that the lingua franca in AI is not JavaScript, it's Python. So that means that the way the majority of people interact with the database is Python-esque, right? So in our case, the database itself is built in Go, and there are C optimizations in there. There are even assembly optimizations in there, because with a database, that's what you do. But the way that the developer interacts with it is Python-esque, right? The majority of developers, not 100%, but the majority of developers use Python. So what starts to happen is that the way that you interact with the database, the way that you bring the models into the mix, changes. And again, it's no BS, so I will definitely not do a sales pitch, but I have to explain this for the listeners to know what I'm talking about. With Weaviate, you can also directly integrate the model. So you just throw data at it and it takes care of vectorization for you if you want it to. You don't have to, but if you want to. The way that developers do that, the stack that starts to emerge on that, so that includes the language, that includes the core infrastructure, and in the case of AI, the machine learning models, is completely different from what we've seen before. And that's the flesh of the avocado, if you will. So that is how we help people to build these applications and bring them to production. And the role that the database plays in that is that it's core infrastructure, and new use cases have started to emerge. You touched upon RAG, but RAG is step one. There's so much more happening there. 
Actually it's all still quite primitive. And there's so much happening, and we can double click on all of these things to optimize it and make it better for the users. So it's a completely new stack that's emerging.

Juan Cicada [00:14:28] Okay, so let's dive into that. So what is the stack as of today, in April 2024? What does the stack look like today? And what is the approach to using that? What are the applications, what is the architecture looking like with that stack? And then where do you think this should be going towards?

Bob van Luijt [00:14:55] Yeah. We have the models of course, because it starts from the perspective of the models. And the models are super interesting in themselves in the stack, because the majority of models are currently consumed through APIs. For production cases, that is sometimes suboptimal. So what's now happening with the open source models is also super interesting. But that's part one of the stack: the models. We can double click on why that's different in a bit, but okay, the models. Then we have the infrastructure, so that's a combination of model serving and the vector database. So for example, tools like Ollama and that kind of stuff, they help serve. So we have the models, we have the database, and then sometimes we have stuff related to ETL, I guess, if you will, or the glue to glue stuff together. So more ETL-like is something like Unstructured. And of course we have LangChain, LlamaIndex, those kinds of things. So that's more how people glue things together. So that's how that stack is emerging right now. So the models and the database, yeah.

Juan Cicada [00:16:12] Okay. So the models, all the existing models you have in there. The infrastructure, so you're serving these models. So that can be using these LlamaIndex or LangChain, all that stuff. You have your vector databases. So those two are found in infrastructure. And then we talk about the glue, the ETL, the orchestration. But first, I think the assumption is that we're always talking about unstructured data. That's one question. And then second is, how is that glue, ETL, orchestration being done? Is that just Python or is that something else? Or are people starting to go create tools around that?

Bob van Luijt [00:16:50] No, so I would say the glue, that's the tooling, right? So that's where LangChain, LlamaIndex, et cetera sit, right? And what I meant more with serving is tools like Ollama and those kinds of things that just focus on serving, but you also have Ray and those kinds of things. So models, vector database, model serving, and tools, if you will, in broad buckets. So those four things start to emerge, but it's still moving about, because for example, what I said about the models, that's super interesting. So serving a model is hard, especially if it's for a production use case. However, a lot of work is being done by the community to actually optimize that. And now with the leaderboards and the quality of models starting to flatten out, we're starting to talk about percentage points, and that will probably be a couple of decimal points in the not too distant future. Other questions start to play a role: price, how much does it cost, latency, those kinds of performance questions. And we might be looking at a seismic shift there, where we say, hey, actually it now becomes easy enough to run that myself. So I'm curious to see what will happen. And business-wise, selling a model is super interesting because it's stateless. The moment the model is out in the free world, like an MP3 file, it's out there. So for these companies building these models, it's very interesting to see what will happen in the near future, how they will present these models to the world.

Tim Gasper [00:18:37] Interesting. Yeah, so each of these pieces has its kind of unique dynamics, and it's different than what we've seen with some of the stacks in the past. It's not like, hey, replace the database with something, replace the app layer with something. There are some corollaries, maybe things like LangChain are a little bit more on the app side of it. In the infrastructure, you've got database components like a vector database, but then the models are a bit of a new component, almost like a backend component that's a part of all of this. Vector databases obviously play a really key role as part of this stack. How are you looking at some of the use cases around vector databases and AI? And then also, I'm curious about how you are looking at not just unstructured data, but structured data as well. Are you seeing use cases where folks are trying to use vector databases for both unstructured and structured data use cases? Or do you see knowledge graphs or graph databases, other types of things, start to work their way into the stack as well?

Bob van Luijt [00:19:48] Yeah, maybe it's interesting to answer this in reverse order. So first of all, yes, because if all these databases start to accept embeddings, then of course we start to see use cases emerge in that domain as well. And it's just super interesting: the moment that you add an embedding to a data object, regardless of whether that's structured or unstructured, you just give yourself a different opportunity to index and to search through the data. So it's very much like in a traditional MySQL database or whatever, where you said, yes, index this column, no, don't index this column. So what we simply can do is that these arrays of floating-point numbers that we store in a column or in a graph or wherever, we can say, okay, yes, index them. Of the use cases that we see start to emerge out of that, the very first, very basic one, the one that was the most low-hanging fruit, is what we like to call better search. Because funnily enough, if you think about it, search plateaued in keyword search, and I'm talking purely about text for a second, which is very optimized. And I mean the tool to work with is Lucene of course, and we see that come back everywhere. If we deal with keyword-based search, that's under the hood of all these amazing tools that have been built over time. Now the question is, if you add vector search to that, is it an add-on, or do we actually reverse it and is that the engine? And for a vector database, the latter is true. So that brings me to the use cases. So if you have use cases that are very focused on unstructured data, and then the question is to define unstructured, that's actually not that easy. But take these unstructured data sets, and then if you look at how much unstructured data we actually have in the world, that's estimated at, what is it, 89% or something? These numbers are huge. So the first use case is very search and recommendation focused. 
However, the second use case that came out of that is something that also started to emerge. That is the RAG use case, retrieval-augmented generation. But a third one is also emerging, and that's what we like to call generative feedback loops. That's born out of RAG. So we can double click on all of those, but now these unique use cases for the vector database start to emerge. And that's a super interesting time. It's super exciting.

Juan Cicada [00:22:37] Let's double click it. I think search recommendation, got it, RAG. There's so much stuff about that already. This third one, this is new. Let's unpack that one. Generative feedback loop.

Bob van Luijt [00:22:48] So if you have a RAG pipeline, and I'm going to assume that the audience can visualize what that looks like, we augment something generated with something that comes from the vector database. That's a one-way street: we have a query, the candidates are selected from the database, these candidates become part of, in this case, the prompt, and then something comes out at the other end. That's a one-way street. The feedback loop, as the name might already suggest, is that you actually store that output back, with a vector embedding, into the vector database. So what you now can start to do is that you basically, as I always like to say, can give CRUD support to the model, oh, sorry, CRUD access to the model. So you can tell the model, "Hey, there's something in my database and it might be wrong. Can you fix that for me?" So you give an instruction to the generative feedback loop. So the example that we have on our website is with Airbnb data, where we're saying there's Airbnb data in this Weaviate instance, but some of it is missing or some of it is wrong. Fix it.
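
Editor's sketch: the loop Bob describes (retrieve a candidate, put it into a prompt, generate, then store the output back with a fresh embedding) can be illustrated with a toy in-memory store. Everything here is hypothetical and exists only for illustration: `embed` is a crude hashed bag-of-words stand-in for an embedding model, `fake_generate` is a stand-in for a generative model, and `ToyVectorStore` is not the Weaviate client API.

```python
import math


def embed(text, dims=16):
    # Hypothetical stand-in for an embedding model: hashed bag of words.
    vec = [0.0] * dims
    for word in text.lower().split():
        vec[sum(ord(c) for c in word) % dims] += 1.0
    return vec


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """In-memory store that re-embeds an object every time it is written."""

    def __init__(self):
        self.objects = {}

    def upsert(self, object_id, properties):
        text = " ".join(str(v) for v in properties.values() if v)
        self.objects[object_id] = {"properties": dict(properties), "vector": embed(text)}

    def search(self, query, k=1):
        qv = embed(query)
        ranked = sorted(
            self.objects.items(),
            key=lambda item: cosine_similarity(qv, item[1]["vector"]),
            reverse=True,
        )
        return [(oid, obj["properties"]) for oid, obj in ranked[:k]]


def fake_generate(prompt):
    # Hypothetical stand-in for a call to a generative model.
    return "Cozy houseboat listing near the bay"


def feedback_loop(store, query):
    # 1. Retrieve the best candidate (the one-way RAG part).
    object_id, props = store.search(query, k=1)[0]
    # 2. Ask the "model" to fill in what is missing.
    if not props.get("description"):
        prompt = f"Write a description for this listing: {props}"
        props = {**props, "description": fake_generate(prompt)}
        # 3. Close the loop: write the output back so it gets a new embedding.
        store.upsert(object_id, props)
    return object_id


store = ToyVectorStore()
store.upsert("listing-1", {"name": "houseboat in Marin", "description": ""})
store.upsert("listing-2", {"name": "city loft", "description": "Bright loft downtown"})
fixed_id = feedback_loop(store, "houseboat")
```

Because the generated description is written back through `upsert`, it is embedded like any other data, so a later search can find the object by the generated text, which is what makes further loops possible.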

Juan Cicada [00:24:13] Shouldn't that be fixed in the source? Because the vector database is not the source. You're going to fix it in the vector database, but the actual source is wrong. So it's a band-aid.

Bob van Luijt [00:24:24] Why can't the vector database be the source?

Juan Cicada [00:24:27] Okay, you're assuming it's a source, but usually it's not the source.

Bob van Luijt [00:24:30] No, but I mean that depends, sure. But the thing is the power sits in the fact that when you store it back that it gets an embedding. You want to be able to retrieve it again to create more loops.

Juan Cicada [00:24:48] I get this, and it is really interesting to see. I like how you started off saying, which is true, that there's this one-way street, and I think that's how it's all been going in the last year. And that's the low-hanging fruit, just to go do this. But when people start using these applications, you're like, wait, wait, wait, I want to go back. I mean, at the end of the day, it's like these agent frameworks: you're setting up a state machine, you're doing some planning, you're figuring out all these things. It's not a one-way street, it's going in circles and going all over the place, with feedback loops all over. So if you're thinking about an agent framework that you're setting up, every node, every state in there is going to be talking to something. So you're hitting that vector database, or any type of database in there, for multiple things. So I think this makes a lot of sense, and I'm glad you're bringing it up.

Bob van Luijt [00:25:34] And this goes back, if we go back to the avocado metaphor, to the flesh part. So one of the things that we are currently working on: if you store a data object in the database, let's take that in a Python-esque way, you just have this couple of lines of Python, and that's how you add a data object. Or if you create a collection in the database, that's a couple of lines too. So what we now start to do is that you can actually give an instruction while creating your collection. So the instruction might be something very simple, like every data object should be in English. So now you start to add data as you've always done in a database, but if you store it in Korean and you retrieve it, then it might return it in English. So we really start to merge, or as we like to say, weave, that's the name, Weaviate, we start to weave the database and the models together. So the usage of instructions in storing data, retrieving data, those kinds of things, that starts to become intertwined. And a generative feedback loop is a first iteration of that.

Juan Cicada [00:26:57] So now you're starting to get into issues about the semantics and the meaning, and I think one of our past histories here connect on all the knowledge graph stuff. So how did knowledge graphs fit into all of this, given your background, and how are you seeing this evolving in the future?

Bob van Luijt [00:27:17] Yeah, I've been, I think, like many people who work in the linked data space. At some point you see the light and you go like, oh, life could be so simple. We just define our ontologies, and people come up with these beautiful things like Schema.org and what have you, and then we're just going to store the data based on these schemas, and then world peace. That's how we're going to solve everything. And in my early twenties I was a big fan of that, big fan. And at some point, because before I started Weaviate I was working as a freelance software consultant, I guess, I was hired by a bank. And we were talking about this because back then in banking you had something that came out that was called PSD2, which has to do with the fact that basically the bank needs to have APIs. Long story, but that's basically what it boils down to. So I was like, this is a beautiful opportunity to just define these ontologies, how the bank is structured and those kinds of things, beautiful JSON schemas and what have you. And then the person who hired me said, "That's not possible." I was like, "How do you mean that's not possible? Of course that's possible." And he gave me an assignment and he said, "Okay, I'm going to send you to four people and you need to define a JSON object that represents a customer." And so I went to the first person, and this person defined the JSON schema. That's easy. So then I went to the second person, and that person said, "Well, I agree with this, but it's a little bit different for us." And then I went to the third person, same problem. Long story short, when I had spoken with all four of them, I was unable to capture the definition for the bank of what a customer is in a JSON object. So what I started to learn is that the problem is not the beauty that sits in linked data. The problem is us humans. We do not agree on stuff. 
And then I was in a situation, and now it's 2015, and we have a mutual friend, he was also on our Weaviate podcast, Paul Groth, and we spoke about this on the podcast so I can say that here. He was the one that introduced me to embeddings. And that was back then GloVe, if I remember that correctly, so with individual words and these embeddings. And the idea was born: hey, what if I take a paragraph of text that describes something, and rather than me as a human trying to make that connection, this semantic relation, in a linked data format, I'm going to have the model predict it based on centroids of what's in the text. And my first use case actually was an IT project for that, because I got data from different vendors for, I don't know, elevators. And different vendors had different definitions of their elevators. So I was like, can I connect that together based on these embeddings? Then I was in Mountain View for Google I/O, where in 2016 Sundar said on stage, "Hey, we're going to move from mobile first to AI first." And I was like, I know what they're doing. Of course, they do that with these embeddings. And that was kind of the history, and that's how Weaviate was born. And we learned, actually. So you can find stuff online of me talking about Weaviate and describing it as a knowledge graph. And that's because that terminology, vector database, wasn't used yet. And it was then, with my co-founder, that I was like, hey, wait a second. The opportunity we believe sits here not per se in predicting these links. The opportunity sits in building a dedicated database for these vector embeddings. So that's the role that the knowledge graph, or linked data, played in the birth of Weaviate, if you will.

Juan Cicada [00:31:49] This is really fascinating. You got into this history here, and if I unpack this, and this is my interpretation and I'd love to hear if you agree with me or not, at the end of the day, knowledge graphs exist to be able to go represent our data and knowledge in a connected way. And with vector embeddings, it's another way of managing that, where I can actually start making suggestions of what these relationships can be, to automate a lot of the stuff that, without the vector databases, without the embeddings, without the nearest neighbor type of stuff, I would be doing in a manual way or a rule-based way, which wouldn't scale. But the combination of these two is really that ideal scenario. It's not one or the other; we should figure out how to combine them together.

Bob van Luijt [00:32:42] Yeah, I mean, yes, but we came a long way for that. I remember that I was at the first linked data conference where, I forgot what the title of my talk was, but it must have been something like solving unlinkable entities with machine learning. Something like that. Well, the audience was not impressed. So that was like a, what? With machine learning, are you nuts? How can we then guarantee the relations? And I was like, well, you can't, but you are currently in a situation where it's binary, so the connection is there or it's not there. And with machine learning, you can predict one of the relations. So I can't be 100% certain that this relation is a thing, but it's better than nothing. And today that's kind of common ground. Today we're completely cool with that, but back when I started, people were not cool with that. They didn't think that was a good idea. As you guys know, the linked data community can be very, how should I say? Strict.

Juan Cicada [00:34:03] Yeah. Well, I think a lot of the research that we've been doing here comes more from the structured side. And I guess in some aspects you do want to make the suggestions and stuff, but when you're actually asking questions over your structured data, you really need the accuracy. You really need the ability around these things. So I think that's how we are seeing it, where you're looking at knowledge graphs, doing question answering over SQL databases, but then you also want to be able to have that feedback loop back to the users. Well, you ask a question, there may be some ambiguity here, and I should be able to go figure that out by looking at things in the vector database and saying, "Hey, do you mean this? Do you mean that? Do you mean A, B, or C?" "Oh, I mean this thing specifically." Okay, great. Now I know this. And then I can go down the structured, accurate route, using the semantics to be able to answer the question. So that's why I see it's both, but it's really about the use cases where accuracy is inevitable.

Bob van Luijt [00:35:03] Yeah. May I pull my be a theoretical asshole card here? May I play that card just for one minute?

Juan Cicada [00:35:12] Please do.

Tim Gasper [00:35:13] Sure.

Bob van Luijt [00:35:14] So a lot of data is textual data, right? I mean, I can make the same arguments for non-textual data, but also for the audience, it's easier to visualize with text. Say I have a data object that contains a sentence: Juan, Tim, and Bob are recording a podcast. That's the string I have. How is that unstructured? The structure in the language is very clear. So the problem with unstructured data is that the longer you keep double clicking on it, if you define it and you double click on it, it's like, well, it's not that unstructured. As long as you have a coherent sentence that tells something about the data object that you're storing, it's structured. And an embedding is just a different way of revealing that structure than a very plain A to B connection. It's just a little bit more elaborate, if you will. It's able to contain more information than a direct A to B link. So the argument is that what we've kind of done, before vector embeddings were mainstream, is that we had the status quo cutoff, and everything that we could solve before that we call structured, and everything that we could not solve we call unstructured, which is fine. But in reality, if you really double click and you look at the data you have, it's not unstructured. So once, and this was again before I started Weaviate, in my life as a software consultant, there was a data warehouse with all kinds of products, and this person was like, look, this is unstructured. So I looked at the description fields, and what this person meant was that they had factories all over the world, and the data standard that they had was that everything needs to be stored in American English, but guess what? The factory in the UK wrote it in British English, and the factory in France, well, they just wrote it down in French. They were like, no, we don't do stuff in English, we do it in French. 
So you call that column unstructured, but what it actually contains is the descriptions of the products. You say it's unstructured because it just doesn't adhere to your standard. But today with vector embeddings it doesn't matter anymore. You can store the information in multiple languages and still retrieve it based on an English query, or you can write the query in Spanish and retrieve the data from that column with vector embeddings and models. So not to be too nitpicky, but if we really double click and look at what actually is unstructured data, that's actually gibberish, noise. That's unstructured data. But for the rest, there's a formal structure in there.
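
[Editor's illustration, not part of the conversation] The cross-language retrieval Bob describes can be sketched roughly as follows. This is a toy under loud assumptions: the three-dimensional vectors are hand-written stand-ins for what a multilingual embedding model might return, the product descriptions are invented, and `cosine` and `search` are hypothetical helper names rather than any vector database's API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, written by hand for illustration only.
# A real multilingual model would return hundreds of dimensions, and it
# would place descriptions with the same meaning close together
# regardless of the language they were written in.
PRODUCTS = {
    "steel bolt, colour grey (UK factory)": [0.90, 0.10, 0.05],
    "boulon en acier, couleur grise (France factory)": [0.88, 0.12, 0.07],
    "rubber gasket (US factory)": [0.10, 0.95, 0.20],
}

def search(query_vector, k=2):
    """Return the k product descriptions most similar to the query vector."""
    ranked = sorted(
        PRODUCTS,
        key=lambda name: cosine(query_vector, PRODUCTS[name]),
        reverse=True,
    )
    return ranked[:k]

# Assumed embedding of the American English query "gray steel bolt".
query = [0.91, 0.09, 0.06]
results = search(query)
for name in results:
    print(name)
```

With these made-up numbers, the British English and French descriptions both rank above the gasket, even though neither matches the query's spelling or language.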

Tim Gasper [00:38:27] I think this is interesting, Bob, how you're characterizing this, because I think that by shorthand a lot of folks look at vector databases and they see that a primary use case is around this sort of textual information and representing it as embeddings, and they tend to then characterize it as, oh well, it's this unstructured information. And I think what I hear you saying is that that may be an oversimplification, and perhaps an unfair simplification, of the role that vector databases are playing and the use cases that they're supporting. In many ways you can think of this as A, structured information, and B, representing and revealing the structure of information in really key ways. And that different perspective is really important to understanding the role that vector databases should play. Is that kind of the right way to think about it?

Bob van Luijt [00:39:33] Yes. And I think what's getting mixed up in this, for very obvious, logical reasons, is the psychology of somebody new joining the party. So we have all these database companies and they're doing their thing, and then all of a sudden they see a competitor pop up. So another document database, another time series database, another graph database, another data warehouse, and you're kind of used to it. But now there are a couple of new kids at the party and they go, no, no, we're just doing something new. We're standing in the corner here sipping our cocktails in the infrastructure space. Don't worry about us. We're just doing something new with these embeddings. And that makes some people a little bit nervous, like, oh boy, what are they doing? How are they doing that? That is perfectly fine. That is the knee-jerk reaction that you would expect. The problem, however, that comes out of that is that unfortunately we sometimes forget a very, very important player, the most important player in this whole thing, and that's the user. Because the effect of this is that we confuse the user. And that is something you have to be careful with.

Juan Cicada [00:40:55] 100%. And by the way, this is going so fast, I wish we had some more time to do this. We've got to do another podcast around this because we haven't even finished this up. But I want to close with that because that was super key, on the user. I remind everybody that with data and knowledge management, we've been looking at this from a technical perspective, but this is really a sociotechnical kind of paradigm.

Bob van Luijt [00:41:19] Yes.

Juan Cicada [00:41:19] And we just love to think about the system, but at the end of the day, we are building systems, applications, tools so human beings can use them and answer questions and solve problems. So there's not just a one-size-fits-all silver bullet. Let's understand who those users are, understand what they're trying to accomplish, and that's how you define success. And then at the end, those users don't give a fuck if it's a vector, blah, blah, blah, blah, blah, blah.

Bob van Luijt [00:41:43] Exactly. So back to, forgive me, my avocado metaphor, I'll promise you I won't. But if we have an avocado, do we eat the flesh or do we eat the pit, right? The user eats the flesh. So it's the developer experience that we give people around that core database, how they interact, how to build applications. And that's also why I'm saying it is great that these existing databases add vector embedding support, because that shows there's a need. But if you want to build AI native, so if you want to have the tooling, if you really build an AI app, that means that if you take that model out of the app, your app is dead. That's what we do. And that's the difference in what a vector database company like us tries to accomplish as opposed to somebody just... I mean, it's great if you can use it in something else. Do it. If you have a very strong graph database, there's a new startup that I spoke to that is using both Neo4j and Weaviate next to each other to build what they're doing. Beautiful. Great. And that's what's emerging now. So I think that's a beautiful thing, but again, it's the way that people interact. And I do argue that I think that's something different than NoSQL.

Juan Cicada [00:43:09] So we got some lightning round questions here to do a quick yes or no.

Bob van Luijt [00:43:15] All right.

Juan Cicada [00:43:16] So I'll kick it off. Number one. Do you think the AI stack is becoming pretty clear, or do you think it's still very much in flux?

Bob van Luijt [00:43:24] In flux.

Tim Gasper [00:43:28] Second question. We mentioned today the LAMP stack and the MEAN stack, and just zooming in on the MEAN stack, I think that was a very interesting phenomenon. It's still around, and it very much democratized this sort of app boom for web apps and mobile apps. Do you see something similar happening with AI? Is there going to be this boom where everybody's building AI apps? Or do you think that's not the right corollary?

Bob van Luijt [00:44:01] No, absolutely. That will absolutely happen. But that is still moving. It's still moving about and, as we mentioned, it's still in flux. So will it be there? Yes. Will it have the same ingredients, with this new thing of models, infrastructure, frameworks? Yes. And a language, by the way, a core language. But we don't know yet what it'll look like, because all of a sudden you think, oh wow, look at Mistral coming up, and then boom, there's Databricks with something new. It's still happening. There's now this formal structuring of prompts with stuff like DSPy, and we're doing a lot of work on that as well. That's amazing. And that's all being incorporated and being made part of it and what have you. But I think that the format of models, infrastructure, language, frameworks, that will stay.

Juan Cicada [00:45:09] Love it. Next question. There's a lot of focus on the models now. Do you think the model layer will become commonplace and commoditized?

Bob van Luijt [00:45:16] Yes.

Juan Cicada [00:45:17] All right. I love it, we're just like your-

Tim Gasper [00:45:21] Definitiveness. Yeah. All right, final lightning round question. So obviously the vector database space, which Weaviate is a part of here, is really booming, right? It's blowing up. Do you think that several or multiple vector databases will succeed, maybe specializing in different things? Or do you see there's going to be a winner?

Bob van Luijt [00:45:45] Oh, there won't be. So it will specialize, right? You already start to see that. The open source, closed source thing is happening. That's also exciting to see unfold. I think there are like two axes to how you can look at it. Open source, closed source. Listen, okay, this is not a very enlightening answer, but if you forgive me for it, I was so fast with the previous answer, so now I'll take some more time with this one. Exactly. So you can plot on two axes, right? So open source, closed source. I always say I'm not religious about open source, but open source is three things: infrastructure, community, and deployment options, channels. That's what open source brings us. So two axes: open source, closed source, and then we have focus. So we believe that a lot of power sits, back to my avocado example, in that flesh around it. And others believe that it sits more in the pit. And that is kind of how it's evolving. But I believe that we must keep focusing on making the user successful building with AI. And it sits surprisingly often in the developer experience and how people interact with the database.

Tim Gasper [00:47:04] That's great. Thank you.

Juan Cicada [00:47:05] Again, we're closing with the user, the user, the user. All right. Wrapping up a couple more minutes here. Tim, take us away with takeaways. Here we go.

Tim Gasper [00:47:13] All right. So much good information here, Bob. I really appreciate you walking through what a vector database is and some of the key applications here. So we started off with honest, no BS, what is a vector database? And you mentioned that it is a database that really treats vector embeddings as a first-class citizen, right? You have time series databases that treat time as the main dimension and time-stamped information as the main data. Graph databases really focus on nodes and edges, and vector databases focus on vector embeddings. And you mentioned vector embeddings are as old as math, maybe not exactly literally, but also not too far off, in that as soon as vector math was created, this was really key. And you mentioned that really there are three different types of models. You have generative models, you have reranker models, and then you have embedding models, where you're getting back an array of numbers. That's a spatial representation of the data, and the embedding models especially are the ones that you're using these vector databases for. And in addition to being able to store these embeddings, you're able to run algorithms like nearest neighbor to very quickly and performantly search and interact with and do all sorts of interesting use cases and operations on those vector databases. And when we asked you, is it just a feature or is it a database, you mentioned, well, yes, it is a feature, but you could say the same thing about any database out there. Is being able to store and query JSON a feature? Yeah, it's a feature, but you look at MongoDB and they're a highly successful company focused around just that feature, optimizing not just the performance and the kernel around it, but also the actual developer experience around it. And then finally, before I pass it to you, Juan, we talked a little bit about AI native applications and the stack.
You mentioned that there's an AI stack forming, and the lingua franca in AI is not JSON and JavaScript, though, like in the MEAN stack and things like that. It's Python. And so the AI stack really is around how do we make this Python experience really, really great to create these AI experiences? Juan, what about you? What were your big takeaways?

Juan Cicada [00:49:40] Yeah, so what does the AI stack look like today? I like how you presented this. You have the models, right? The majority are now consumed via APIs, and for production use cases this can sometimes be suboptimal. That's why we're talking about these things as being kind of in flux. Infrastructure. We have model serving, we have vector databases, and then we have that glue, the ETL and orchestration tooling, things like LangChain and so forth. Really interesting. You raised questions that play key roles around the models. Think about the cost, the latency, the accuracy, and actually the business model of that model is interesting, right? It's stateless. It's like an MP3 file that's out there. So I found that really great. I really like that analogy. On the use cases here, the low-hanging fruit was better search and recommendation, and second, RAG. But I think the evolution of RAG we're seeing is in the third use case you brought up, this generative feedback loop. Because if you look at RAG, it's just like this one-way street. And you really want to be able to take what comes out of that and bring it back into the vector database, so essentially you're providing CRUD access to the vector database. So I think we've kind of concluded that vector databases are not just for unstructured data. Embeddings are a different way to reveal that structure. And following our metaphor, by the way, in several episodes of Catalog & Cocktails we have this thing where we just get into a metaphor and keep banging on that metaphor, and today was one of those, with the avocado. I love this. So do we eat the flesh or the pit? Of course we eat the flesh. It's all about the users, about giving people that developer experience and the apps to solve their problems. Let's make users successful. Bob, anything we missed?
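
[Editor's illustration, not part of the conversation] The generative feedback loop mentioned in this recap (retrieve, generate, then write the result back) might look something like the sketch below. Everything in it is an assumption made for illustration: `TinyVectorStore` is a hypothetical in-memory stand-in, not Weaviate's client API, `fake_generate` is a canned placeholder for a generative model call, and the two-dimensional vectors are invented.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class TinyVectorStore:
    """Hypothetical in-memory store with CRUD plus nearest-neighbor lookup."""

    def __init__(self):
        self._objects = {}

    def create(self, object_id, vector, properties):
        self._objects[object_id] = {"vector": vector, "properties": dict(properties)}

    def read(self, object_id):
        return self._objects[object_id]["properties"]

    def update(self, object_id, **properties):
        self._objects[object_id]["properties"].update(properties)

    def delete(self, object_id):
        del self._objects[object_id]

    def nearest(self, query_vector, k=1):
        """Brute-force search: ids of the k stored vectors closest to the query."""
        ranked = sorted(
            self._objects,
            key=lambda oid: cosine(query_vector, self._objects[oid]["vector"]),
            reverse=True,
        )
        return ranked[:k]

def fake_generate(prompt):
    """Placeholder for a generative model call; returns a canned string."""
    return f"Summary of: {prompt}"

store = TinyVectorStore()
store.create("p1", [1.0, 0.0], {"description": "steel bolt"})
store.create("p2", [0.0, 1.0], {"description": "rubber gasket"})

# 1. Retrieve: find the stored object closest to the query vector.
hit_id = store.nearest([0.9, 0.1], k=1)[0]

# 2. Generate: pass the retrieved text to the (placeholder) model.
generated = fake_generate(store.read(hit_id)["description"])

# 3. Feed back: write the generated output onto the same object,
#    which is the step that plain one-way RAG leaves out.
store.update(hit_id, summary=generated)

print(store.read(hit_id))
```

The only difference from a plain retrieve-then-generate flow is step 3, where the model's output becomes stored data that later queries can retrieve.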

Bob van Luijt [00:51:07] Nope, that was a beautiful recap. Thanks, guys.

Juan Cicada [00:51:10] There are no... Only humans involved here. To wrap up really quickly, what's your advice? Who should we invite next? What resources do you follow?

Bob van Luijt [00:51:20] The advice is be kind, like how we started, right? It's so exciting, what's happening in the space, let's help each other be successful, right? What's the second question?

Tim Gasper [00:51:34] Who should we invite?

Juan Cicada [00:51:35] Who should we invite next?

Bob van Luijt [00:51:38] Did you already have Paul on?

Juan Cicada [00:51:39] No, but we should have Paul.

Bob van Luijt [00:51:42] Yeah. Paul.

Juan Cicada [00:51:43] Paul Grove.

Bob van Luijt [00:51:44] Yes. Paul and I have been collaborating for a long time, so it's time for him to be on the podcast. And resources. What do I read a lot? Boring answer. Boring answer. YouTube.

Juan Cicada [00:52:01] Actually, I don't think a lot of people have actually brought up YouTube.

Tim Gasper [00:52:05] It's not stated often enough. And honestly, Bob, I spend a lot of time learning things on YouTube. I'm like, I want to understand LangChain a little more. Oh, this is a good video. Right?

Bob van Luijt [00:52:14] Yep. Yeah.

Juan Cicada [00:52:16] Bob, thank you. Thank you so much. This has been an awesome conversation where we went through so many different parts, and I think this is going to be part one because I really want to set up more. There's a lot more. We didn't talk about agents, we didn't talk about bringing together all the different types of databases and tools. Anyways, so much to talk about later on, and applications and users and so forth. Bob, thank you so much.

Bob van Luijt [00:52:36] Thanks so much for having me, guys. And cheers. And you know where to find me. Bye.
