AI/ML: Where Should The Focus Be?

69 minutes

About this episode

AI and ML are both at the center of many applications today, from autonomous vehicles to healthcare. But where is this ship heading?

Join Tim, Juan, and Patrick Bangert, VP of AI at Samsung SDS, as they discuss where the focus of AI is today and where it should be tomorrow.

Speaker 1: This is Catalog & Cocktails. Presented by data.world.

Tim Gasper: Hello, everyone, welcome. And it's time for Catalog & Cocktails, presented by data.world, the data catalog for leveraging agile data governance to give power to people and data. We're coming to you live from Austin, Texas. It's an honest, no BS, non-salesy conversation about enterprise data management, with a tasty beverage in hand. I'm Tim Gasper, a longtime data nerd and product guy at data.world, joined by Juan.

Juan Sequeda: Hello, everybody. It is Wednesday. It is time to take that break and chat about data, middle of the week towards the end of the day. And today we're going to have such a fantastic conversation because our guest today is somebody who I had the chance to meet a couple of months ago and it was at some event where we just... It was a small event and we just started talking and we really hit it off on the topic of AI and knowledge graphs and that need to be able to bring knowledge into machine learning and AI. My guest today, our guest today, is Patrick Bangert, who is the VP of AI at Samsung SDS. Patrick, how are you doing?

Patrick Bangert: Hey, guys, I'm Patrick. Great to be here. Thank you for having me. I'm doing well. I'm looking forward to our conversation.

Juan Sequeda: Fantastic. So Patrick, let's kick it off, so what are we drinking and what are we toasting for today? You kick us off.

Patrick Bangert: I'm drinking Coke Zero, and I'm toasting to the future of AI.

Juan Sequeda: That's always a good thing because there's so much AI goodness going on that we need to be working towards. How about you, Tim?

Tim Gasper: I am currently drinking a Deep Ellum IPA and I'll also toast to the future of AI. I saw the, what was it, Elon Musk showing off his new robot, and I was like," Wow, we've come a long way, but we've got a long way to go," so to the future of AI.

Juan Sequeda: And I'm at home and I actually made just a little twist on the old fashioned, but also with Austin Still bourbon, and I just put in a splash of this nice, refreshing, sparkling water. It's just a good classic mix of an old... let's call it a watered-down old fashioned, which is really refreshing to go do that. And I'm going to toast for not just the... The future of AI includes knowledge. That's what I'm going to go toast for. So cheers to that.

Tim Gasper: Cheers.

Patrick Bangert: Absolutely. Cheers.

Juan Sequeda: All right, so we have our funny warm up question today, which is what's the most, or what are the most, unfocused situations you have ever been in? I know, putting you all on the spot right now because I came up with this question a couple minutes ago.

Patrick Bangert: Yeah. So one of the most unfocused situations that I've been in has actually occurred multiple times. And that is when a customer says," Okay, we've got this really large data set, why don't you let your AI loose on it and tell me what patterns are in it?" And at that moment, I was foolish enough a couple of times in my life to accept that project, have a look and do all sorts of things, clustering and forecasting and what sort of stuff you can do, came back with all sorts of insights and patterns and then the customer usually looked at me kind of aghast saying," But these are completely obvious. We already know all of that." Well, dear customer, I didn't because I'm not an expert in your domain. And that's what's in your data. If you want something else, you got to ask a question, right? Because if you don't ask a specific question, you're not going to get an answer. But that occurs more often than you might think.

Juan Sequeda: Totally with you. I don't need to go with mine, which is similar to that, which is: an unfocused situation is when I have agreed to say," Let's go boil the ocean." Let's go do all these things and we're going to go do all of it at the same time, and then there's that lack of focus, it's really that. I think boiling the ocean is definitely the most unfocused situation I've been in.

Patrick Bangert: Yeah.

Juan Sequeda: How about you, Tim? You got a funny one?

Tim Gasper: It seems like there's a theme here. So if I was going to say a work related theme, it would be similar to the boil the ocean thing. It would be like," Hey, Tim, there's this company you've never talked to before. You don't know anything about them or what industry they're in, but can you give them a roadmap presentation in five minutes?" And that's like," Oh. Yeah, I guess I'll just talk about anything and whatever." I think a personal funny situation is just parenting multiple children sometimes feels like a very unfocused situation.

Patrick Bangert: Definitely.

Juan Sequeda: All right, well, let's kick it off. Let's go into our honest, no BS discussion. All right, Patrick, honest, no BS, where is the focus of AI and ML today and where should that focus be?

Patrick Bangert: Excellent question. Depends what you mean by focus. If focus means where is the money and where are the people, it's definitely autonomous vehicles. I would say more than half of, globally, all the dollars spent and all the labor hours spent on anything to do with AI is spent either directly or indirectly on autonomous driving. And that's been the case since about 2010. Enormous efforts have been sunk into that task. And as far as AI people are concerned, that task is considered more or less solved. We're polishing some things off as a global community, but it's more or less there. So now to get the vehicles on the road, you're really talking about stuff like manufacturing and scaling and legal and compliance and all the non-AI related stuff. So where should the focus be instead, since we're considering this to be a solved problem? Well, that's a really curious question because where the focus should be is going to need to be followed by the dollars and the people. So who's going to be able to provide the amount of funding that's required to take several tens of thousands of people away from autonomous vehicle work and put them to work on that something else? Obviously, a prime candidate throughout human history has, for better or for worse, been the military. They generally have good funding and they also generally come up with problems that are hard to solve and so require lots of hours. My personal opinion of where the focus should be is healthcare. So at the moment, we have a system that we call healthcare, but it actually isn't healthcare, it's actually sick care, because the system relies on you to wait until you're sick, then go in with symptoms, and then the system's KPI, the metric by which the system is judged, is how quickly it can get rid of you. So you go in and you get a drug and you're sent home and the system says," Hey, I'm done." That's not really how it should be. The ideal system that all of us would like is," Hey, doctor, I feel perfectly fine, what should I do to stay that way?" And then you would expect lifestyle advice, nutrition advice, maybe some exercise, this and that. All sorts of things to do or not to do to stay that way. AI is, I think, the prime technology that can make that happen. At the moment, it's not possible because of the amount of hours doctors would have to spend with you. It's not realistic. But if you have wearable technology, for example, at the very simplest level, it's the watches that we wear today that measure our pulse and our sleep patterns. At the more advanced end of things, it could be sophisticated devices that measure more complex things like blood sugar and so on, that could give you real-time intelligent advice for you, specifically you the individual, not you as one member of a billion people, but you the individual, based on your specific body, what to do, how to continue to be healthy, how to be healthier. I think that's where the future is. Combined, of course, with when you do feel sick, AI can help with the diagnostics, it can help the doctor be less of a data entry clerk, which is what the doctor is today, and become an actual caregiver to the human patient, with the AI taking care of things like the electronic medical record, taking notes, filling prescriptions, coding and billing for the services rendered. The doctor should be a doctor. He shouldn't necessarily be all these other things, a clerk and an accountant and a paper pusher. So AI can fill all those gaps and that's where I think the focus should be.

Juan Sequeda: Wow. I honestly thought that you were going to start off with a very technical answer, and you really took us down this route of understanding where the money is. And I think that's a very pragmatic and realistic way to go see this, and it is. I was not thinking about it this way, so I really appreciate this. Well, one side note here is I found it actually surprising that you're saying that autonomous vehicles, this is mostly a solved problem. But I would say it's a solved problem, probably, for the first world. I don't see a Tesla driving through the middle of Delhi and Mumbai or Bogota or anything like that yet, so I don't know. Maybe I'm wrong here, but that's one side note on that part. I think there's more. And then drilling a little bit down to the technical side, AI is mainly more focused on the unstructured type of data that they're analyzing a lot. I guess that side of the problem is well understood, so that's a comment I want to bring up there. And then on the healthcare side, I think this is also fascinating, where you're bringing it, on the side of we have many more devices, we have much more data that needs to be considered to be able to say," Hey, doctors, focus on your main thing. AI is here to go fill in the gaps." So I mean, two comments here on those sides. I don't know. Tim, I'm sure you want to go chime in here.

Tim Gasper: No, this is interesting to me. I actually like your commentary there. I'm curious, Patrick, what you think about that so far and where that leads you?

Patrick Bangert: Well, we shouldn't confuse the technology of autonomous vehicles with the particular company called Tesla. They're not the same thing. Every car company in the world is working on this in some description. There are numerous companies that we haven't necessarily heard of that are working on it and developing their own new vehicle that doesn't even have steering wheels and pedals and things like that on it anymore. Prototypes like this drive around the Bay Area all the time, they're not necessarily Teslas. They're working. These are vehicles that literally don't have any of the interfaces that we're all used to. The chairs face the inside, there's a little table in the middle, there's no steering wheel. These cars drive around the roads around me, every day they exist, they work. So the prototypes are there. It is not a first world application, but it is, let's say, an application for an urban environment that has relatively good quality roads that might have lane markings and things like that. If you want an autonomous vehicle to be level five, where it can drive off road in a forest or on a gravel path or something like that, then you are right and we're not there yet. But do we need that? I mean, I know that many people always want that last 1% where we go from the," Hey, we solved the 99 and now let's throw the efforts at the last 1%." But that's not really worth it. Most of us live in urban environments. Most of us drive around roads that do in fact have lane markings and are pretty well asphalted and marked and they're marked on GPS-enabled maps and all that. It's good enough. It's good enough for almost everybody at almost all times of the day and of the year, and so that's why I say the problem is solved. Now we just need to make enough of these vehicles.

Tim Gasper: That's an interesting perspective on it all. I feel like a lot of, maybe, the public is very skeptical in saying," Oh, it'll never get there. It's too confusing, these different situations and things like that." And it's actually interesting to think of it from a different perspective, which is that," Hey, actually the core technology here has been pretty well addressed and now we're really hitting the long tail here." And a little bit implied by what you're saying, I'm curious if you're kind of nodding to this, is that a lot of the core promise around the benefits of autonomous driving has likely already been achieved. And now it's just a matter of... Now we enter the different problems; societal problems, legal problems, ethical, insurance-oriented, how do we handle the finances of this, et cetera, et cetera, right?

Patrick Bangert: Oh, yeah. And those problems are gigantic. Don't underestimate that. We have the device, but we don't have any of the other stuff. For example, the question of who's to blame when there's an accident? There's no answer to that. There's not even a bad answer to that. There is no answer to that at the moment. The legal frameworks in all the various countries don't exist. And it's very possible that the general public in this country or others might say no thanks in one way or another. It could be that the infrastructure will not get built to actually house all of this. It could be that there are so many ramifications that society will change and we won't be happy with it. I can't make any statements to that. I'm not an economist or a futurist, but as far as the artificial intelligence is concerned, the stuff works.

Tim Gasper: No, that's super interesting. And yeah, I'm curious about how all these societal things get figured out because there's a lot of questions there. Countries that are willing to make more dramatic changes are probably the ones that are going to be able to embrace this kind of tech more. One thing you brought up healthcare as an example of what could be next here to really benefit. And I know we want to get a little bit more into some of the technology and some data, some specific data topics here in a second, but that healthcare comment really resonated with me. And I'm just curious from your perspective, what kinds of companies do you think are in the best position to potentially take advantage of providing that new healthcare future? Because on one end, we think of doctors working for hospitals and the health insurance companies and things like that. That's sort of one end of the spectrum. But then when you mention things like wearables and some of that, my mind immediately starts to go towards the Apples of the world and the Samsungs of the world and the companies that are developing the electronics and things like that, that help to bring that power to consumers. Just curious if you have any thoughts about who's going to be some of the primary drivers of that new healthcare world.

Patrick Bangert: Yeah, I do in fact think that it's going to be the device manufacturers that will be at the forefront of this. It's either the devices that are going to be consumer wearable devices, or the devices that are huge in the hospital, MRI, CT, ultrasound type scanning equipment. The fitness watch that everybody might be wearing, the Apple Watch, the Samsung Watch, the Fitbit, whatever, those are examples of the entry level to this that measures a pulse and that's very useful. From the pulse, you can conclude about calorie consumption and really, really useful things like that. At the other end of the scale, you can have an MRI scan and then the AI will immediately tell you whether you've got a certain medical condition or not, compared to possibly having to wait weeks to have your doctor go through everything and explain it to you. So there's a medical benefit, there's an anxiety benefit to receiving immediate feedback. And of course, the AI models are more accurate than the average human doctor. Always the focus is on the average human doctor, we're not insulting any individual people, but the average doctor has a 70% accuracy rating on medical imaging, 70%. The other 30% go wrong somehow. Well, with AI vision models, you get to about 98, 99% accuracy rating, so that's better. So if the AI gives you feedback, it's instantaneous and it's more accurate. That's not bad. You still want a human to go tell you what to do about it, but the feedback from the AI is not bad as a first line of defense. And I think through that there will be more companies making devices that will be at either end and somewhere in the middle. You can imagine having a device that's not wearable, but that you might have in your home that will help you with a variety of things. You could imagine having a blood sample testing device in your house and submitting a drop of your blood every morning and getting a sort of analysis done. And then the feedback might be that you should have a certain type of food. That sort of thing, I think, will come over the next years. And the device companies will benefit. On the doctor front, the doctor will have access to, again, devices or software programs running in a computer that will help. There is already software out there that can listen to a conversation between a doctor and a patient and distinguish who is the doctor and who is the patient, can record that conversation, transcribe it into text. The software knows about medical terminology, knows about names of drugs and is able to extract certain facts. There's a verbiage and the computer concludes that what has just happened is the doctor prescribed a certain kind of medicine to this particular patient. We know who that is. The computer can automatically fill a prescription form. The doctor doesn't have to do that anymore. That technology exists right now. This is not futuristic. We just have to deploy it more. It's of tremendous help because, if you think about it, the doctor currently spends only 20% of their time with you, the patient. The other 80% are spent on doing paperwork.
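As a concrete illustration of the last step Patrick describes, here is a minimal, hypothetical sketch of pulling a drug mention out of an already-transcribed doctor-patient exchange and pre-filling a prescription record. This is not the software he refers to: the drug list, transcript, and patterns are invented for illustration, and the diarization and speech-to-text steps are assumed to have happened upstream.

```python
# Toy sketch only: extract a drug mention from a diarized, transcribed visit
# and pre-fill a prescription record. The formulary and transcript are invented.
import re

KNOWN_DRUGS = {"metformin", "lisinopril", "atorvastatin"}  # illustrative formulary

transcript = [
    ("doctor", "Your blood pressure is still high, so I'm prescribing lisinopril, 10 milligrams daily."),
    ("patient", "Okay, should I keep taking it with food?"),
]

def extract_prescriptions(turns):
    prescriptions = []
    for speaker, text in turns:
        if speaker != "doctor":
            continue  # only the doctor's utterances can create a prescription
        for drug in KNOWN_DRUGS:
            if re.search(rf"\b{drug}\b", text, flags=re.IGNORECASE):
                dose = re.search(r"(\d+\s*(?:milligrams|mg))", text, flags=re.IGNORECASE)
                prescriptions.append({"drug": drug, "dose": dose.group(1) if dose else None})
    return prescriptions

print(extract_prescriptions(transcript))  # [{'drug': 'lisinopril', 'dose': '10 milligrams'}]
```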

Tim Gasper: I love the way that you're articulating this because it's really changing how I think about this too, where even I fall prey a little bit to the commentary of," Oh, soon the doctor's going to get replaced by the robot," or something like that. And I think a whole nother way to think about this is, at one level, the improved care that you're providing, but then at a whole nother level, it's also leverage, the fact that a single doctor now can serve a much larger community of people, can spend more time per patient, maybe you can spend more than just five minutes with your doctor. There's a lot of interesting possibilities that come together from this. And I like your way of expanding the thinking around this.

Patrick Bangert: Yeah, thank you, Tim. This is exactly the misunderstanding. People think that doctors are going to be replaced by robots. The actual fact is the opposite. Doctors to date are robots. They're data entry clerks, they're paper pushers, they're data form fillers. And a little bit, they take care of people. We need to provide them with technology to enable the current doctors who are robots to become actual caregivers.

Juan Sequeda: This is an excellent quote," Doctors today are the robots, and we want to be able to flip that." I think this is a very important way to think about it. All right, I love that. This is the 20, 21 minute mark on this. But all right, you're really painting this picture of everything's solved. You just said it, this thing already exists, we just need to go deploy it, but not everything is solved. So let's dive into what are the open issues right now? What are the opportunities? Where are the gaps right now in the technology? And where should the focus then be from a technology perspective? We already talked about it from the economics and from the business perspective, all the money we're seeing. Talking about healthcare from a technology perspective, let's dive into that.

Patrick Bangert: Yeah. Here it's very interesting, that if you look at the literature, both in terms of the popular books, the scientific books, or the really edge of research scientific publications, the focus everywhere is really on the algorithms, on the mathematics of how to make these models, which types of models we can argue, convolutional neural networks or transformers, how many layers and how many neurons. And I have a little new twist on the algorithm that gets me 0.1% more accuracy than you did," Haha, I get to be published twice more." That's the focus. But we're already at a point with that technology where the accuracies are so good that the potentials for improving them are small and therefore we can question whether it's worth it again, again on an economic front, but I won't go there. Where is the problem? The problem is that the algorithms are excellent in reproducing whatever patterns the data has. And that's why a couple years ago, famously, Andrew Ng started this phrase of data-centric AI, which I believe really is the real point of where the problem lies. The problem is not with the algorithms, not with the models, not with the mathematics. We figured that out. Again, to the point of stuff we've done. The math is done. Yes, we can improve it. The problem is on the data side, how do we first of all get enough data, how do we make sure that that data is significant, representative, clean, transformed properly, it has the right features, doesn't have disturbing features, all of these things? And then we present it to the AI. And that process of those six or seven elements of making sure that the data is all right, that's where the work is, that's where the investment is, and that's where nine out of 10 AI projects today fail. Nine out of 10 projects do fail economically today. They get attempted and then at some point during the process life cycle, they get abandoned because the data is not good enough. So, it's here where the focus has to lie and it's, unfortunately, not very sexy.

Juan Sequeda: So, does that mean that this is not an AI problem anymore? This is a data integration problem?

Patrick Bangert: Not necessarily, because you cannot separate the data from the AI so cleanly, right? Yes, you have to prepare the data for AI, but in preparing the data, there are a few questions. For example, is the data clean? Are there elements in the data that shouldn't be there? Or are there elements in the data that are missing? Do I have the right features? Features, in that sense, is a technical term referring to columns in your data set, or in the appropriate transformations of images. Then, is my data biased against something? Recently in the media, we've talked a lot about bias against certain groups of people, like African Americans or women versus men or something like that. But it could also be biased against other stuff. Is the data representative? So for example, if I'm dealing with medical data and I have 10,000 images, 9,800 of which are healthy people, and I have 200 sick people. Out of those 200 sick people, I have 30 different diseases represented. That's a really, really bad data set for trying to detect diseases because I'm over representing the healthy case, stuff like that. In order to grasp whether my data is in fact good, I need to apply numerous analysis algorithms that will tell me on various metrics how good and bad I am. That is partly AI. And so, AI is involved in here. And then, of course, if I discover that I'm missing stuff, I have to fix it. Again, AI algorithms will be involved in helping me fix and overcome these problems. How do I over and under sample correctly? How do I pick the right features? How do I choose the right model? How do I run the hyperparameter tuning experiments to get to the right parametrization of my algorithm? All these things are AI powered, but the focus is on the data.
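To make that representativeness point concrete, here is a minimal sketch of the kind of class-balance check Patrick describes with his 10,000-image example. It is not from the episode; the label names, counts, and the 1% threshold are illustrative assumptions.

```python
# Minimal sketch: flag under-represented classes in a label set.
# 9,800 "healthy" scans vs. 200 "sick" scans spread over 30 diseases
# is a poor dataset for disease detection. Names and thresholds are illustrative.
from collections import Counter

def class_balance_report(labels, min_fraction=0.01):
    """Report each class's share of the data and flag classes that are too rare."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        cls: {
            "count": n,
            "fraction": round(n / total, 4),
            "under_represented": n / total < min_fraction,
        }
        for cls, n in counts.items()
    }

# Hypothetical distribution: 9,800 healthy scans, 200 sick scans across 30 diseases.
labels = ["healthy"] * 9800
for disease_id in range(30):
    labels += [f"disease_{disease_id}"] * (7 if disease_id < 20 else 6)

report = class_balance_report(labels)
print(report["healthy"])
print(report["disease_0"])  # flagged: far below the 1% threshold
```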

Juan Sequeda: So out of the things that you were saying, we're taking notes here, so you're saying: is the data clean, are there elements in the data that shouldn't be there, are the right features and columns there, is it biased? We can go off with the list of these things on how to prepare the data. And I would argue that many of these elements that we're going through are just data integration things, regardless of whether it's AI. Hey, if I'm doing a dashboard in Power BI or whatever, I want to have those same questions answered there. Now there's other aspects that are probably not... The Power BI dashboard reporter, whatever, doesn't care about those things that you do. Your machine learning engineer would want to go do that. So I think you would separate... I think we can categorize them in two blocks. A concern I have then is that we're going to go see the entire ML, AI group doing all this data work and we're like," Hey, there's this other entire team that's already doing this and you guys are not talking to each other because you're the..." I mean, just because you're the data folks and we're the AI folks and stuff like that. This is a concern I have. I wonder if this is a concern that you have and how do we make sure... And if they are concerns, aren't there bridges that need to be made? What's your perspective?

Patrick Bangert: Yeah, that's certainly a concern. But again, in addition to all the usual data transformation tasks, there are a number that are specific to AI, for example, choosing the right features. That's a very scientific question. That doesn't have anything to do with data storage and transformation and things like that. And that's something that for a realistic AI project might take two months for a five-person team to actually figure out what are the right features. So these are not easy questions to answer. There are a few points that would not be contained in that siloed data team. But yes, there is a danger in siloing out your functions. And of course, we've all learned over the last 20 years that silos are bad, no matter what domain of business you're in. Silos are always bad. You've got to have your cross functional people and do your communication properly. Plus, you've got to rope in the domain experts. You cannot solve an AI task by just looking at the data and not having any understanding of where that data came from or what it means. So you've got to rope in the domain expert who's going to explain to you the context of where this thing came from and what you're supposed to do and what a solution means. And that typically then degenerates into two main questions. One is labeling the data. The domain expert inserts manually, typically, the domain knowledge that he or she has into the data set. Very time consuming activity. And the second activity will be the insertion of knowledge without labels. That could be a knowledge graph, it could be an ontology, it could be some other form of knowledge representation that is not labeling every single data point. And without doing either one of those things, the project will be a guaranteed failure, because it'll just be a bunch of data, the AI will provide a nice summary of that data, but without context or knowledge or guided focus on what a solution would look like, it's useless.

Tim Gasper: This is interesting. Can you go into a little bit more about this issue that we have around needing more labeled data, good labeled data, and what are some strategies that you're seeing are effective for companies to try to address some of the problems in this area?

Juan Sequeda: Because this seems to be a neglected problem, right?

Tim Gasper: Yeah.

Juan Sequeda: We always hear the success stories, but oh, there's all this effort to go label data. But let's figure out what was brushed underneath the rug here. What really happens?

Patrick Bangert: Yeah. I mean, to give you an example, let's look at images. If we take images of the road and we see that there are pedestrians and cars and road signs and traffic signs and whatnot, of course, we understand what all that is. To teach the computer, what we would have to do is come in with an image and draw an outline around the car and say," This is a car." We draw an outline around a pedestrian and say," Human." We draw an outline around the road and say," Okay, this is the road, and then this is not the road, but the sidewalk. And then this is also not the road, it's a building," and all that stuff. Now, any one road image might have enough stuff in it that you would take about an hour to draw the outlines around the various objects in that image and identify what they are. If you think that you go through a million images, that's a million hours, that's an entire working year for a group of a hundred people or more. And you multiply that by whatever salary you'd like to pay these people, it's expensive. That's the problem statement. Forget about the accuracy of these people making mistakes and all of that. It just takes a very long time. It's a very expensive process. And you would, in fact, need the millions of images to get to a reasonable accuracy. So the technology that really helps with this is called active learning. It's a technique from artificial intelligence that says," Hey, we've actually discovered, information theoretically, that in any large scale data set, the vast number, maybe 95% and more, of the images are so similar to each other that they're effectively duplicates." So out of those millions of images, it's really only maybe 3%, 4% that are actually informative, that contain all the information content. And the others are duplicates. Now the trick is, of course, in finding which files are those 3%. And that's the hard part. That's what active learning is there to do. It's a human-in-the-loop process. You start by selecting a very small number of images randomly and labeling them. You provide them to an AI system that then learns a little bit from this little bit of data and it produces a model. The model's task is not to identify people and cars and traffic signs in the image. The model is meant to assess its own reliability in judging which is a car and so on and so forth. And it will then give you a numerical score for every image that you have not labeled yet, giving the probability that it will be confident in determining all the different objects correctly. So all those images that score really poorly in that are the images I'm going to label next. I give it back to the system. It retrains again, does the same thing. I, again, label the images that the model is very confused about. Provide it, it learns, does it again, gives it back to me. I might repeat that process 10 times, right? Lather, rinse, repeat. All the time, I keep labeling the images that are most informative, aka the ones the model is most confused about. And at some point, I find myself having labeled about two, three, 4% at most of the dataset, and the accuracy of my model is now in the high 90s. Presto, I have not labeled 95-plus percent of the data set, but I already have an accurate model, which means now I can auto label the rest and basically go through a checking process, do a few corrections here and there, and at the end of that I have my million images either labeled by me or checked and corrected by me, but I've not expended a year with a hundred people.
I might have expended two months with 10 people. So the cost and the time has dropped so much that suddenly problems are within my economic horizon that before were just dreams. And that's particularly relevant, of course, again, for healthcare, where labeling is even more expensive because I need professional doctors to do it. They're not in infinite supply. And I have to make sure that I'm very efficient with the resources at my disposal.
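As a concrete illustration of the loop Patrick walks through, here is a minimal sketch of uncertainty-based active learning on synthetic data. scikit-learn, logistic regression, and the batch sizes are stand-ins chosen for brevity, not anything discussed in the episode; with real images a human would label the queried examples each round instead of reading their labels from y.

```python
# Minimal sketch of the active-learning loop described above: label a small
# random seed set, train, then repeatedly label only the examples the model
# is least confident about. Dataset, model, and batch sizes are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

labeled = list(rng.choice(len(X), size=50, replace=False))  # small random seed set
unlabeled = [i for i in range(len(X)) if i not in set(labeled)]

for _ in range(10):  # lather, rinse, repeat
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[unlabeled])
    uncertainty = 1.0 - proba.max(axis=1)        # low top-class probability = confused
    query = np.argsort(uncertainty)[-15:]        # the 15 most confusing examples
    newly_labeled = [unlabeled[i] for i in query]
    labeled += newly_labeled                     # in real life, a human labels these
    unlabeled = [i for i in unlabeled if i not in set(newly_labeled)]

final_model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
share = 100 * len(labeled) / len(X)
print(f"Labeled {len(labeled)} of {len(X)} examples ({share:.1f}%); "
      f"accuracy on the rest: {final_model.score(X[unlabeled], y[unlabeled]):.3f}")
```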

Tim Gasper: Yeah, that makes a lot of sense. This triggers a question for me. And just before I ask it, I want to say our little comment here that this episode's brought to you by data.world. data.world is the catalog for data mesh, the whole new paradigm for data empowerment. Patrick, my question for you here is that it seems like, based on what you're saying here, there's this trend to use AI to help you build the AI. And you've said that a few times, I think, over the course of our show today, where AI is helping you to assess the reliability of your data, maybe it's helping you with labeling, it's helping you determine bias and things like that. So it seems like, is that an overall trend that we're seeing a lot more here? I don't know the right terminology, whether it's ensemble or something else, but being able to have all these different AIs work together in tandem for different parts of the pipeline to help accelerate the greater result at the end?

Patrick Bangert: Absolutely, yes. Yeah. I personally jokingly refer to this as AI squared, where AI helps AI or helps make further AI. It's kind of me going to a teacher. The teacher has experience of it and gives his experience to me in the teaching process. And then I, as the student, learn. Then eventually, I get older and then I'm the teacher of some new younger person and I pass on my experience. But it's the same way. So AI helps AI, yes, absolutely. And in the old days, we had a completely human generated data set and then we trained the neural network and that was it. Those days are gone. Now we have multiple algorithms working together on a single problem, various algorithms taking various parts. And just to define the term you mentioned, ensemble modeling. What that means is that I train multiple models on the same data set and then I don't throw the bad ones away. I keep these models. And in deployment, I use all of them and I just average them out. It turns out that averaging several models out is always better than using a single model. Of course, you pay for it," pay for it", in the sense that you have to expend the effort to train, first of all, multiple models. And in the end, you have to execute on multiple models as well.
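To pin down the term, here is a minimal sketch of ensemble averaging as Patrick defines it: several models trained on the same data, all kept, and their predicted probabilities averaged at inference time. The synthetic data and the specific model choices are illustrative assumptions, not from the episode.

```python
# Minimal sketch of ensemble averaging: train several models on the same data,
# keep all of them, and average their predicted probabilities at inference time.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=200, random_state=1),
    GradientBoostingClassifier(random_state=1),
]
for m in models:
    m.fit(X_train, y_train)

# Average the predicted probabilities instead of picking a single "best" model.
avg_proba = np.mean([m.predict_proba(X_test)[:, 1] for m in models], axis=0)
ensemble_pred = (avg_proba >= 0.5).astype(int)

for m in models:
    print(type(m).__name__, round(m.score(X_test, y_test), 3))
print("Ensemble", round((ensemble_pred == y_test).mean(), 3))
```

The cost Patrick mentions shows up directly here: every model has to be trained, stored, and executed at prediction time.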

Juan Sequeda: So labeling data, we've talked about how active learning is a key here, but you also brought in this is where the experts come in. Another aspect where experts come in is inserting the knowledge. Now, we always talk about this complete automation, human in the loop, deep learning, all these things. And I think very few conversations are happening around the knowledge aspect, like inserting knowledge, representation, the symbolics, and bringing in ontologies and knowledge graphs. Why is that the case? Are people just really bullish in saying," We don't need that, and all we need is more data and some time for experts to go label more," and that's it? Or is this a missed opportunity, or is this inevitable, we're going to get there? What's your perspective around this?

Patrick Bangert: Yeah, there is a large section of the AI community that does think it's solvable via more data. And I think the recent couple of years of efforts into language models has conclusively proven them wrong. So if you look at things like remodels or the new GPT-X model series. If you have a casual chat with that model about how you feel and the weather, the output is fantastic. It would basically pass the Turing test. But if you dare ask questions that are a little bit more pointed, ask it about times of day or questions that would involve knowledge of gravity or ask it to do some arithmetic, the answers are terrible. And that's because it's been trained on a gigantic set of natural language utterances from the internet and books and whatnot, and nobody ever made the effort to teach the thing some actual logical reasoning. Nobody has explicitly taught arithmetic. Nobody has taught it what times mean, or that if I say a mouse sits on an elephant, that's okay, and the elephant sitting on the mouse is somehow not okay. You don't teach it that, and so it doesn't know that. And that's a problem. It's not a problem if we want to have a chat. But if I ask it to be a doctor's assistant program, then suddenly it matters quite a lot that it knows certain rules. Certain things are not up to interpretation. In order to prescribe you a cancer drug, I got to be sure you got cancer first. Things like that. And that's a piece of knowledge, fairly rigid and structured knowledge, that is very, very inefficient to represent in the form of data. And if you want to make sure that that knowledge is absolutely followed in the sense of rules and regulations, for example, or compliance rules, then you cannot represent it in data because data will only represent it with a certain probability. So then you must declare that knowledge. And the right way to do that is not by rules. We've found that out with things called expert systems in the 1980s. They've been a spectacular failure in the meantime. So the way to do that is with an ontology or a knowledge graph where that knowledge is hierarchically structured and it's relatively easy, easy in quotes, mind you, to insert pieces of knowledge into that knowledge graph or take them out, to make that knowledge graph either better and more encompassing or to remove an error that you've made. So that, I feel, is really a step change for the future in terms of the mathematical technology in AI to help make AI more useful for the everyday sense. Definitely in the sense of anything involving language. Those chat bots are not ready yet to be of real assistance.
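To make the contrast concrete, here is a minimal sketch of knowledge that is declared rather than learned, using Patrick's cancer-drug example as a hard rule over a toy set of triples. The triples, names, and rule are invented for illustration and are not a real medical ontology or any product's API.

```python
# Minimal sketch of "declared" knowledge enforced as a hard rule, in contrast to
# patterns a model learns statistically. Triples and rule are purely illustrative.
KNOWLEDGE_GRAPH = {
    ("chemotherapy_drug_x", "treats", "cancer"),
    ("patient_42", "diagnosed_with", "cancer"),
    ("patient_7", "diagnosed_with", "hypertension"),
}

def can_prescribe(patient: str, drug: str) -> bool:
    """Allow a prescription only if the patient has a diagnosis the drug treats."""
    treated = {o for s, p, o in KNOWLEDGE_GRAPH if s == drug and p == "treats"}
    diagnoses = {o for s, p, o in KNOWLEDGE_GRAPH if s == patient and p == "diagnosed_with"}
    return bool(treated & diagnoses)

print(can_prescribe("patient_42", "chemotherapy_drug_x"))  # True
print(can_prescribe("patient_7", "chemotherapy_drug_x"))   # False: the rule blocks it
```

Unlike a learned pattern, this check either passes or fails; there is no probability attached, which is the property Patrick is after for compliance-style knowledge.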

Tim Gasper: Yeah, it's interesting to see some of the news about chat bots and on one end of like," Oh, my god, they're so good." And then," Oh, my gosh, here's yet another disaster story about chat bots gone rogue." It's interesting to hear you differentiate between certain use cases, where underlying structured logic is going to be maybe more critical. And I think you mentioned the AI doctor assistant perhaps benefits from a lot more core logic or something like that. On the flip side, you'd mentioned that autonomous driving is something that is a largely solved problem. Is that an example of a class of problem that was able to be trained a lot more on just lots of data, lots of different situations and didn't need as much underlying structured knowledge, versus a use case that does? Or is that overthinking it?

Patrick Bangert: Absolutely. I mean, if you think back to the days when you took your driving lessons and took a driving test, the amount of knowledge, genuine knowledge, facts in your head that you're required to pass that test is not very much. Compare that to a degree in computer science or a chemistry test in high school or something. The amount of knowledge you need to drive your car is minimal at best. It's a skill. It's a skill you acquire through hours of doing it. And there's really only one major KPI, right? Don't break the law, don't hit people. Everything else is more or less okay. So we think of autonomous vehicles as being this gigantic leap forward in AI technology," Oh, my God, is it great and fantastic?" This is one of the simplest skills that human beings have. This is the bottom of the barrel, I'm sorry.

Tim Gasper: It makes me feel better about the drivers on the street, but...

Juan Sequeda: Wow, I have never thought about it this way. This is a huge aha moment I'm having right now. And I think I've always thought it's amazing what they can go do, but you've really... You've described it in a way that, with all due respect, I mean, I kind of lost a little bit of the... I mean, I guess, honest no BS, lost a lot of the respect of all the work that people do on autonomous vehicles because," Yeah, it wasn't that hard." I mean, it's about the rules. I mean, actually, it's true, if you're going to go pass your driver's ed test, the book is like this big. But if I'm going to go do my... I don't know, I'm just taking here my wife's book on APA. I mean, this is the book you had to go read. I mean, compare this to that, right? This is a very interesting perspective.

Patrick Bangert: Think of other everyday stuff; assembling an IKEA piece of furniture, changing the diapers on a baby, making dinner. That stuff is way more complex than driving a car.

Juan Sequeda: That's another excellent quote at minute 46. I want to get that down here. Changing diapers on a baby is much more complex than driving a car. Well, I'm kind of speechless. I haven't thought about it this way before. Now, one of the things that I'm thinking is, I guess, Doug Lenat and the whole Cyc project and everybody doing these common sense knowledge bases back in the eighties, I mean, they've been right all along. I mean, it's not that we just needed that, just that. I mean, that's a key piece in the big... It's a big set of puzzle pieces that I think we've ignored because it's really hard to go do. But then, I guess, I'd argue that we've realized that we can do all the easy ones now. Driving a car is easy, but then all these other things we want to go do, which is probably much bigger than the easy ones, we need the knowledge and we need to be able to start investing in this.

Patrick Bangert: Admittedly, it has to be a unity. So again, in the eighties, we did try out ontologies and expert systems and we found them to be really bad. And these days we do GPT-3 and we find it's impressive, but it doesn't solve the problems either. The solution will come when we put both of those things together. We need the flexibility of the GPT type modeling, and we need the rigidity of the knowledge graph and the logical rules because that's how we human beings are. We're pretty flexible, but we're on rails for certain things. And we put those two together and then we suddenly become productive people in society.

Juan Sequeda: This goes back to what I was thinking, which is this is about marrying data and knowledge together. We looked at knowledge so much by itself, and that was the eighties and nineties, and we realized that by itself it doesn't work. Then we started doing data. That's been working, we're hitting the bar, and this brings it together. And I think that's actually how we first met, is that we were talking about data and knowledge and the article I'd written about the history of knowledge graphs. I have to say, that was a really cool moment that we met there in the hallways, like," Hey, I've read your articles." A little bit of a cool thing. All right, there's so much... I mean, I want to keep... There's so much to go talk about. Before we go to our lightning round, there's one thing, and we talked about it before, which is where we should really be focusing. And I think a lot of the AI is kind of very general AI, when you were saying we should be really focusing on the industries, more vertical industry AI. But we're not incentivized to go do that because, hey, that's not where the VCs are throwing the money and stuff. I'd love it if you can wrap us up and close us with what's the state of that world and where should it be if it weren't for other external factors?

Patrick Bangert: That's a very, very important point. Of course, a lot of the funding in the AI world does come from venture capital. Not all of it, mind you. There's a lot of it that comes from governments, especially through things like the military. There's a lot of it that comes from established large corporations. But much of it comes from venture capital. And so, what VCs are interested in plays a large role. They are typically focused on some form of profitability, one of two ways. The one way is the Facebook way," We don't care about money, we just care about acquiring lots and lots of users." The second way is the traditional, old fashioned way," We don't care about the users. We care about getting lots and lots of revenue." So that distinction is very important, and most of the venture capitalists are in fact in the," We want users," group. And so then, this promotes companies; and I emphasize that I use the word company and not the word business, because those companies are not businesses; it promotes the growth of those companies that are simply wanting to target lots and lots of users. And that yields AI applications that are entertaining, that are funny, that are very marginally useful for anything real. And on the other end, we then have companies that actually want to be genuine businesses that want to pursue profitability and therefore have to deliver something that's of actual use. Healthcare would fall into that category, of course. And there you have compliance and regulatory and FDA approvals and things like that that make the hurdle quite high, and so lots of companies and venture capitalists are not willing to invest the requisite amount of money to take those hurdles. So those kinds of fields are underrepresented. Unfortunately, there is difficulty on the money front, but I think it is absolutely worthwhile because there are still a good number of VCs that are willing to fund truly groundbreaking, useful AI applications. If people have the right ideas, they will get that out there. And the market is ripe.

Juan Sequeda: This is an important message, I think, that where do you want to go focus all your smartness and your energy? I mean, we can go do entertaining and funny, but is it really that useful or not? I think that's a little bit of an existential question to go say," Hey, what are you interested in?"

Patrick Bangert: Yeah.

Juan Sequeda: I mean, go back to all the healthcare stuff you were saying or," Hey, how can we drive more clicks or whatever?" Well, anyways, this has been a phenomenal conversation and we're already going over here. Let's go start wrapping this up and let's move to our lightning round, which is presented by data.world, the data catalog for successful cloud migration. And I'll kick it off first. So will we see knowledge graphs incorporated much more into AI development projects in the next one or two years, or is that still going to be another three to four, five years? How soon, short term versus medium term?

Patrick Bangert: Well, they're already being used, and I will say that they will increase in their use. Before you see them hit your desk at home, it might be five to seven years.

Juan Sequeda: All right. Tim, you go.

Tim Gasper: Second question. You talked about a lot of the business value of, and of the revenue drivers around AI, and how some of that's shifting some of the landscape there. Will there be a business value anytime soon, and let soon be kind of broad here, from some sort of a general AI, or are we going to be living in more of a specialist AI for quite a while?

Patrick Bangert: We will be living in the world of specialists or narrow AI, I say, for the rest of my lifetime, possibly beyond. Artificial general intelligence, where one AI system has intelligence and breadth of capability of an average human being, that is a Hollywood pipe dream. I personally, my very own individual opinion is that we will never really get there. Some people believe that we will get there eventually, but I can absolutely say with certainty that this will not be anytime soon, because we're just way too far away from that. So anybody who's afraid of The Terminator, don't be, this is not going to happen anytime soon. Just look at where the current capabilities are and how far away that is.

Tim Gasper: That's interesting. That ties to your comment about the autonomous cars too. I think, like many others who have some misconceptions, I see general AI and then I'm like," Oh, wow, autonomous vehicles, we've come so far." And the answer is like," Well, we got a long way to go."

Patrick Bangert: Yeah.

Juan Sequeda: All right. Next question. Is there a bottleneck that we need more domain experts involved in AI, or is it just going to be we need more AI engineers, ML engineers?

Patrick Bangert: Well, you need both. That's for sure. And as we've talked about, the onus these days is more on the data side than it is on the mathematical side. We've mostly figured out the algorithms. We have frameworks like PyTorch out there to help us out on the technology side. Many, many companies are specialized on the pipeline. So I think you need... In case of doubt, you need more people on the data side, on the domain side, than on the engineering side.

Juan Sequeda: That's a great point.

Tim Gasper: All right, last question here. Lightning round. We talked a little bit about how AI can help AI and AI squared, as you mentioned. Will we get to a point where... I feel like today we're spending a lot of time talking about," Oh, data bias," and what is good data and things like that. Will we get to a point where that's actually a very simple and boring question because the AI is so good at helping us figure out what's good data, what's bad data, how to make it better, et cetera?

Patrick Bangert: Well, it's going to become easier, for sure, because questions like bias, especially in that cultural context, bias against certain groups of people, wasn't even a topic until a couple years ago. Now it is, and it's front and center. AI ethics is a big, big topic now that started out of almost nowhere, alongside explainability. That will grow at exponential rates. Those two topics are really the future of topics on the surroundings of AI. They're not mathematical questions, but they're very, very crucially important for the business ecosystem. And so, yes, you'll see those tools grow. At the moment, the tools related to bias and ethics are in their infancy, but there are multiple companies, especially startups, that are going to be focused on this. And so, over the next two to five years, you will see a suite of tools being invented that today don't exist. And that will certainly help.

Juan Sequeda: All right. This has been a phenomenal conversation. Tim, T, T, T, Tim, take us away with your takeaways. Go first.

Tim Gasper: All right. Oh, my god, my brain is going in a million directions right now. I'm going to try to bring it back home, try to land it back on Earth. You started off, we started off today like, where is AI ML focused and where should it be, versus where it is? And you talked a lot about, and very interestingly and very excitingly, around where the money is around AI. Where's the business value today and where is that investment and that benefit shifting? And if focus means where the money is, then today you mentioned it would be in autonomous vehicles, and that more than half of the labor hours for AI is either directly or indirectly in autonomous driving. But that is mostly a sort of" solved problem" now. Obviously, there's tons to figure out around manufacturing and legal and compliance and ethics and all that kind of stuff, but if that's a solved problem, then the next question is where is that money going to go and where are those people going to go, all those people that are working on this problem? You mentioned military potentially being a prime candidate for this, better or worse. But more interestingly, and perhaps to the greater benefit to society, TBD, is around healthcare. And I think that that is a very exciting topic, and so you opened into, right now it's more sick care, how do we turn this into wellness and," Hey doc, I feel fine. How do I make sure I stay that way?" And the role that AI and devices and sensors and even classes of devices that we may not even really think about yet, perhaps like home devices where you take a drop of blood and learn about your blood sugar levels and all sorts of things, things that we assume are like," Oh, diabetics do that kind of stuff." But it's like," No, actually there's a lot of benefit for holistic health on an ongoing basis." That could be a huge benefit to society and perhaps a huge opportunity as we look at the near term to medium term for AI to play a very big role. I think that was very exciting to explore that. You mentioned, let's not confuse the technology around autonomous driving simply with Tesla. And so, we talked a little bit about how it's good enough already in a lot of its applications and all the different players that are going to be playing in that. Then when we looked at the healthcare world a little bit, we swing back to that, you had mentioned that especially the device manufacturers are probably going to play a huge role here, whether it's... For those that are just listening, I'm shaking my wrist here, I've got a wearable on my wrist. Think about wearable devices, devices in the hospitals. How, in general, can technology and software and hardware work together to turn doctors into folks that can spend most of their time helping and doing wellness, as opposed to being accountants and clerks and paper pushers? And I love this money quote," Today, doctors are robots, so how do we make them so that they don't have to be robots going forward?" Let the robots do the robot work. Tons of good stuff. Juan, over to you. What did you learn today?

Juan Sequeda: Let the robots do the robot work. All right, so technology, the focus today is so much on the algorithm, the mathematics, how many layers, transformers, and your twist on the algorithm," Whoo-hoo, I get published," whatever. But really, the accuracies are so good that the potential to improve is so small that we can even question ourselves, is it even worth spending time on improving those algorithms? And I think, clearly, the problem here is around the data. I think this is the whole shift toward thinking about data-centric AI. So how do we get enough data, significant clean data, transformed properly, with the right features, without the biases, to present to the AI models? That's what we need to go do. And nine out of 10 AI projects fail because of the lack of that good data. So I think this is very clear, right now, where the industry is, we need to understand where the data is. We did discuss a little bit about preparing the data, how much of that is AI specific, versus just data integration stuff everybody needs. I think there's stuff that definitely is very data integration. Other things are very specific to AI, like selecting the right features. This is not an easy question to answer. And then we start getting to the experts. We need to bring experts around. I think there's two aspects where the experts come in that we've discussed. One is labeling the data. This is a very expensive thing to go do. A lot of time and money is spent on this, and what we've realized is that 97%, for example, of images are just duplicates and only 3% are the unique ones, so how do I identify those 3%? And this is where techniques like active learning come in, with human-in-the-loop, that by just labeling the small number of those images that the model's confused about, which are actually the most informative ones, by labeling those, you're actually getting up to 90%. And this is just an example of AI to build the AI, so AI squared. The other aspect we talked about was inserting knowledge without those labels. This is where ontologies and knowledge graphs come in. And even though a large section of the community thinks that we don't need this, that it's just solvable with more data, I think the language models have proven this wrong. You can go chat with them and you can get like," Oh, it's great." I'm talking about weather, whatever, it kind of seems like they pass the Turing test, but the moment you start getting into things about the times of day, about just common sense, it has no idea about that stuff. Because why? Because they didn't learn this logical reasoning. And this type of knowledge is really inefficient to represent in the form of just data, but it is very efficient to represent in the form of a knowledge graph and ontologies. And I think this is the step that we're going to go do. And one of the aha moments I had is that driving your car, that's easy. The knowledge around driving your car, it's not that impressive, actually. What's hard? To change the diapers of a baby. You said we think of autonomous vehicles as the most amazing thing, but that's the bottom of the barrel. It's more about what are the skills that you want to go do? So effectively, the solution here is around putting together data and knowledge, and that's something that I'm personally extremely, extremely excited about because, even us, the data catalogs should be data knowledge catalogs, that's what we really will catalog.
And finally, we closed up with talking about funding, and this comes from VCs and governments, a lot of it from VCs, and yeah, a lot of them are focused on companies, as you said, not businesses, but really companies targeting users with entertaining things, funny things, that are only marginally useful. So kind of an existential question is, are you working on something that's actually useful for mankind or AI? Patrick, how did we do? What do you think we missed on our takeaways?

Patrick Bangert: Yeah, so I think that this was just a perfect example. We've heard Tim and Juan give a fantastic summary. If you, as the audience, can just imagine what kind of AI system would we need to make a summary of a 50-minute discussion in that fashion? And I'm telling you that, in my opinion, that will not be around for another three, four, five decades to come, a system that will produce it with that quality, that clarity. So, that's where the bar is.

Juan Sequeda: All right, well, 40 years from now, we'll go see where we are. Let's go pass this exact same episode and see how close it will come to the takeaways that we just did, Tim.

Tim Gasper: When you started to say that, Patrick, for a second there, I thought you said that there would be an AI in a couple of... I was like," Oh, man, maybe we don't need to do this anymore." No, you're saying our job is intact for a while.

Patrick Bangert: No, I'm saying the opposite. I'm saying that the current level of AI is, I mean, compared to humans, to that general skill. I mean, you guys needed to have the ability to communicate, obviously, and listen and then speak, but you needed to have a lot of world knowledge and a lot of domain knowledge around these areas to be able to put those ideas together in meaningful sentences and so on that, again, just a purely natural language model just is impossible to do that. Just a knowledge graph, impossible. How big does the knowledge graph have to be to encompass AI and driving and home world and logic and economics and history and all of that, that both of you effortlessly hold in your heads, right? This is unreachable at the moment.

Tim Gasper: I have a new perspective.

Juan Sequeda: Yeah. All right. Well, Patrick, let's throw it back to you. Three questions. One, what's your advice? Second, who should we invite next? And third, what resources do you follow, people, blogs, conferences, whatever?

Patrick Bangert: Well, what's my advice? If you're interested in AI, read about AI, but don't try to learn Python or PyTorch or pick up a technical book with source code and formulas and mathematics in it. Unless, of course, you've got a year or two of your life to spare. If you do, then by all means take a math degree. But if you want to find out, so what is AI, and you have time to read a couple of books, pick up the more popular books with text and examples as opposed to with code. Learning a programming language takes a long time. Who should you invite next? Well, there are numerous other companies who have really, really great people. I can certainly give you a few names later. Some people that come to mind are Andy Hock from Cerebras Systems, or Liran Zvibel from WekaIO. These are really good people that have a phenomenal understanding of the market. What do I follow? I am mostly on YouTube and LinkedIn as a consumer, where I listen to the latest opinions there. You will never see me on Instagram and Twitter and those resources. And my news feed on LinkedIn and YouTube provides me with the latest and greatest. So for example, I saw the little shaky robot of Elon Musk's just now, and I've seen way better videos coming out of MIT, to be honest.

Juan Sequeda: All right. Well, Patrick, this has been a phenomenal conversation. I think this officially is now the longest episode that we have ever done. Thank you so much. Next week we're going to have Laura Ellis, she's a VP of Engineering at Rapid7, and we'll talk about data teams. Patrick, again, thank you, thank you, thank you. We really dove into so many different aspects of AI and it was a truly honest, no BS, thoughtful conversation. Cheers.

Patrick Bangert: Thank you very much. Cheers. Thank you, Juan.

Tim Gasper: Thanks for joining.

Patrick Bangert: Thank you, Tim.

Speaker 1: This is Catalog & Cocktails. A special thanks to data.world for supporting the show, Karli Burghoff for producing, John Williams and Diane Jacob for the show music. And thank you to the entire Catalog & Cocktails fan base. Don't forget to subscribe, rate and review, wherever you listen to your podcast.
