
Making Data Science More Accessible with Doris Lee

44 minutes

About this episode

In this episode, we chat with Doris Lee about how data science began as an exclusive, high-tech field, how it has become more accessible over time, and what's still missing. We also cover the data science market and the role of open source.

Tim Gasper [00:00:00] It is time once again for Catalog & Cocktails. It's your honest, no-BS, non-salesy conversation about enterprise data management presented by data.world with tasty beverages in our hands. I'm Tim Gasper, longtime data nerd, product [inaudible] at data.world, joined by Juan Sequeda.

Juan Sequeda [00:00:17] Hey, Tim. I'm Juan Sequeda. I'm a principal scientist at data.world, and as always, it's a pleasure. It is middle of the week, end of the day, and time to finally have a drink and chat about data, and I'm super excited. Doris Lee is here today, and we've been trying to get Doris on the show for a long time. We finally met over the summer at Snowflake, and I've been following a lot of the stuff that she's been doing because I love it when we have academics who are doing really awesome stuff and pushing it to startups, and we need more of this, I think. Doris, it is super, super awesome to have you, former CEO and co-founder of Ponder, which was recently acquired by Snowflake last month. How are you doing?

Doris Lee [00:00:59] Good, Juan and Tim. Super happy to be here. Super excited for our conversation today.

Juan Sequeda [00:01:05] Fantastic. Let's kick it off. What are we drinking and what are we toasting for today?

Doris Lee [00:01:09] Oh, well, it's early afternoon here in Pacific time, so it's just a cup of tea for me on my desk and I wish I was sipping a glass of Piña Colada on the beach, but not today, not today. But yeah, I'm toasting to a great conversation ahead of us and really excited to chat with you all today.

Tim Gasper [00:01:31] Awesome. Well, excited about it as well. We actually have our all-hands week over at data.world. So, the whole company is in town in Austin, Texas, which is a lot of fun. That means the bar cart is full and ready, and I am drinking a tequila and blackberry lemonade soda. And then, what are you drinking?

Juan Sequeda [00:01:50] I'm drinking just a traditional good old Highball, just whiskey and soda. It's a Johnnie Walker Red and some Topo Chico, and this is nice and refreshing because we'll have a lot of fun activities this week. So, cheers to getting people together and to great conversations.

Tim Gasper [00:02:06] Yeah. Cheers, Doris.

Doris Lee [00:02:07] Cheers.

Juan Sequeda [00:02:09] All right, so we got our warmup question today. So, today, we're talking about how to make data science more accessible, more intuitive. So, what is something, it could be either a product or a service that you find really intuitive and usable, or you can flip it around, something you don't find intuitive and useful?

Tim Gasper [00:02:25] Something that's terrible.

Juan Sequeda [00:02:26] But it is needed.

Tim Gasper [00:02:27] Yeah, yeah.

Doris Lee [00:02:28] Yeah, I think this is completely not data related, but I just love using Slack. We used Slack at Ponder. Slack is amazing, and I think what Slack did was raise the bar for what we thought of as workplace messaging. Workplace messaging was just, like, emails and, I don't know, Microsoft Outlook or something. It's just this really boring communication channel, and I think Slack just made it really fun for people to message each other. It was a huge part of our culture, having emojis and everything. So, it just made it really fun. I don't know if you all know the story behind Slack, but I think it started off with these founders trying to create a gaming company, and then at the end of the day they were like, "Oh, this is a great messaging platform," and they pivoted the whole company to be a messaging platform. So, it's super cool.

Juan Sequeda [00:03:27] Now that you say that, I'm like, yeah, that's actually pretty, I really enjoy Slack.

Tim Gasper [00:03:31] Yeah.

Juan Sequeda [00:03:32] I think there's people who are like, there's different sides of the fence and stuff, but I like it. It is very intuitive.

Tim Gasper [00:03:39] Yeah. Well, data.world is the first company I've worked at where actually I think we spend more time in Slack than on any other communication medium, which I find very interesting.

Juan Sequeda [00:03:49] Interesting, but you're not a big fan?

Tim Gasper [00:03:51] Well, I don't know. I mean, Slack is all real time. There are some days where I'm like, "I need to turn this off." It's too much, but it's very intuitive.

Juan Sequeda [00:04:00] How about you, Tim? Do you have a product or service you find really intuitive?

Tim Gasper [00:04:04] So, I was thinking about this question and I was like, "Oh my gosh, how am I going to answer this question?" This is sort of a cop-out answer because, I mean, we've been talking about it almost every week. But I've been really astounded by how many different ways I can use ChatGPT. Natural language, I dig it. I want more natural language. The other day, I was emailing somebody and they were like, "Hey, I'm available this time to this time and this time to this time and this time to this time," and it's all UK time, and I'm like, "Oh my God, I need to do all the translations." I'm like, "I wonder, if I just copy and paste this into ChatGPT, is it going to translate the whole thing?" And of course, it did. Of course, it did. So, anyways, I love natural language. ChatGPT has been great.

Juan Sequeda [00:04:43] Yeah. [inaudible]. I was thinking about this as: what was something that wasn't intuitive and usable that did become so? And I think my answer is about maps. Now, you have Google Maps and all that stuff. I remember the days before, when you printed it out on MapQuest and stuff. That really changed the world.

Tim Gasper [00:04:57] Yeah. Now you can only get lost if your cell connection is bad, right?

Juan Sequeda [00:05:00] Yeah, yeah. All right. All right, let's kick it off. Doris, honest, no-BS. How do you make data science accessible? What does that even mean?

Doris Lee [00:05:10] Yeah, so I think the funny thing is data science has seen a lot of shifts and changes over the past decade or so. I don't know if you all remember, I think early 2010-ish, Harvard Business Review had this famous article that talked about how data science was the sexiest job of the 21st century. People kept talking about that. There was a huge buzz around data science being a huge game changer. Now, I think data science as a field has evolved a lot since then. We've seen tooling and just the field evolve, and at that time, the 2010-ish era, it was still very nascent. Tooling was not standardized, and essentially data scientists had this really hard job of like, "Hey, you need to be an expert in your domain." So, maybe that's finance, maybe that's biology, genomics, whatever your domain is. You need to be an expert in statistics, machine learning, programming; you need to know how to program. You also need to know how to work with big data. So, oftentimes, that means you need a CS degree or you've done software engineering in the past. And so, where do you even find these really rare unicorns in the market? Many of them typically also have a PhD in the sciences; that could be astronomy, physics, genomics or whatnot. And so, I think that when we think about data science at that time, it was a field with a very high barrier to entry. And a lot of my work over the past five years, and even before then, was really around how do you lower that bar? How do you make it easier so that anyone can do data science, explore their data and really understand what's going on with their data? Because I think there is a tremendous amount of value for domain experts, people that are specialized experts in their domain, in giving them the power of working with data, because they can see the data through a lens that we, as software engineers or computer scientists, don't have, just because we aren't in those domains.
We're not genomicists, we aren't domain experts in finance, for example. And so, that is one of my dreams: what if everyone in the world, whatever their specialty is, could work with data? And so yeah, it was probably a very long answer, but basically when I think about accessibility and data, it's about enabling that 99% of the population to be able to work with the data.

Tim Gasper [00:08:08] I love that. I mean, I think that's the dream, right? Is that we don't want people to have to have all this specialized technology and tooling knowledge just to be able to answer questions about their business or about how things are going. And I do wonder though, do you see a difference between lowering the barrier for data science versus lowering the barrier for business intelligence? Or more basic kind of questions for data? Do you see those as two different things or are they actually the same thing when you look at them?

Doris Lee [00:08:46] Yeah, I think, Tim, you're definitely right that there's this spectrum when we talk about data science and BI. I think all of them are kind of in the same spirit of like, "Hey, how do we allow more people to access, understand, visualize their data?" Actually, early on in a lot of my PhD work, I focused a lot around what you just talked about, which is low-code, no-code kinds of use cases. So, your standard BI charting tools, where you have an interface and you're clicking through various different buttons to create a chart. And I think two of the best examples of that come to mind. One is spreadsheets. Microsoft Excel did a really great job in converting raw, boring numbers in a grid into something that you can actually work with, with your mouse clicks and everything, and create powerful formulas and calculations. So that is, in my sense, a very successful tool that enabled a large number of white-collar workers, anyone who is working on a computer, to be able to work with their data, their data being any table of numbers. The other tool that did a really good job at this, among BI tools, is Tableau. Tableau said, "Hey, we have this grid of numbers. Let's convert it into very beautiful charts that actually tell a story." And you don't need to know any programming languages. You don't need to be an expert in charting. You can use Tableau, point and click, drag and drop, and bam, you have a visualization. So, those, I think, are early examples, again, over the last 10, 20 years, where data has become easier and easier for people to work with. And I think now we're in an era where people are looking for more complex analyses, forecasts, even running machine learning on their data. All of those things, I think, are part of these more advanced machine learning and data science use cases.
And I think we're in a new era where we need better tooling to allow the 99%, the spreadsheet users, the BI users, to be able to use and access those technologies.

Tim Gasper [00:11:17] Yeah. Let's take the next step.

Juan Sequeda [00:11:19] So, you started off kind of giving, call it, the original data scientist definition. It's like this big unicorn with these overlaps.

Doris Lee [00:11:29] Yeah.

Juan Sequeda [00:11:29] I mean, statistics, programming, big data, machine learning expert and [inaudible], right? So, 10 years later, what is the definition of a data scientist today? I mean, it's not that big. It's not that unicorn. So, what is it then today, and how has that evolved?

Doris Lee [00:11:45] I think the line has definitely blurred between what we traditionally called a business analyst or a data analyst and data scientists, and even the new term, machine learning engineer. There is a spectrum of what people are doing across all of these different fields. I think as we lower the bar in terms of what it takes to be doing data science, we're going to see more of the blurring of the line. One example of this is actually in a lot of financial use cases, like banks: quants, people who are spreadsheet users, who typically spend 90-plus percent of their time in spreadsheets, are now learning Python. They're now learning data science so that they can do more with their data. They can do time series forecasting, they can do summarization of their data and so on. So, that's one example of where, I think, in the future, if data science does become more and more accessible, we're not going to have the title of data scientist. Everyone is going to be a data scientist. You might be a biologist. Your title might be like, "Hey, I'm a medical doctor, I'm a biologist," but you are a data scientist too.

Tim Gasper [00:13:06] Interesting. So, maybe the data science role was born around this specialized skillset originally, right? And so, folks that could learn that skillset could step into those shoes. But you mentioned some different words here. You mentioned statistics, programming, big data, machine learning. Basically, is the goal that, as each one of those things becomes more accessible, this idea of the data scientist becomes less unique? You're imbuing the power of the data scientist onto the business analyst, onto the programmer.

Juan Sequeda [00:13:44] Well, to add to this, I'm going back to the word accessible here. Is it really about the tools that are lowering the barrier, or is it more about the education, or, I mean, one or the other? Because I think we've been using spreadsheets forever. And then, people would argue, "Well, I'm doing data science. You're doing things in a spreadsheet, but you are still generating the same outcome. I happened to be using a spreadsheet, and I'm going to call myself a data scientist." That's fine. I mean, that was the tool that I used. And I guess, more and more, tooling is making it accessible, but also, how much of it is actually the training versus understanding how to go do things with data? I'm curious to get your thoughts here.

Doris Lee [00:14:23] Yeah. I think it's definitely a mix of both. You need the education, the knowledge around how you think quantitatively, how you reason with data, what common statistical fallacies you should be aware of when you're presenting data. All of those educational elements are super necessary. Even when we have automated tools, that education, that training, is still extremely important. And then, I think, the other aspect of this is, when I think about accessibility, and in particular how it pertains to my work, it's often, if you take all of the keywords I talked about earlier, more around the programming aspect, and also the aspect of, "Hey, you don't need to have a PhD in Computer Science to be able to do that." So, more around the computing aspect and how we can lower the bar there, because it's pretty rare for someone to have both a PhD in Biology and a PhD in Computer Science. And so, the goal is: how can we allow everyone to be able to work with their data?

Juan Sequeda [00:15:34] One of the things I see as basically a synonym of data science, when we think about tooling, technology, computing, is Python. Is it still the case that, to do data science, you at least need to know Python as a language to go do things? And actually, to broaden out the question: from your perspective, how have you seen something like Python come in and change the data science landscape?

Doris Lee [00:16:00] Yeah. Python has a really interesting relationship with data science. I think it was interesting because Python was one of the first languages that I learned for programming. And it was just so accessible in the sense that if you've ever learned other programming languages like Java or C++, you'll see the reason why Python is really special here: Python is a very high-level language. So, you don't have to really think about types or memory allocation or any low-level details about the computing stuff; it allows you to really focus on what you want to do with your data. So, you're not really bogged down. You can just specify, "Hey, this is what I want to do." And this is really important as it relates to data science because when you're creating data pipelines, data workflows and so on, you're always trying to iterate really quickly. No one ever writes an end-to-end data science workflow from scratch without ever trying things. You're always running stuff and then going back, changing something and then running stuff again. So, there's a bunch of trial and error. So, data science is inherently very iterative. And so, Python has some really nice elements that couple well with that. One is that high-level aspect that I just talked about. I think the other is the fact that it's very interactive, and the interactive aspect actually comes from the fact that Python is an interpreted language, which means that when you're running Python code, it's not like Java or C, where you have to compile it first and then run it, compile it first and then run it. So, there's kind of this step between compilation and being able to see your results. In Python, because it's an interpreted language, you can just run it. You can even run one line at a time. So, you run a line, you look at your results, and then you run the next line and look at your results.
So, it's very easy to actually inspect your result and then figure out, "Oh, okay, this is the next line of code that I want to write." And so, that has led to the development of things like IPython and Jupyter Notebooks, which are very interactive development environments for data scientists to be doing data cleaning, transformation, all the way to visualization, all in a single development environment. So, I think those two aspects, I mean, they're not completely unique to Python, but they're among the selling points of Python that have made it very attractive for data scientists to say, "Hey, this is a really easy way to get started." Especially coupled with the fact that we talked about, which is that most data scientists, or the people that want to be working with data, aren't really coming from a traditional computer science background, which means that if a language like Python is really easy to pick up, it makes a great starter, an intro-to-data-science course, which is what we've been seeing in the last five years with a bunch of these data science bootcamps, Python bootcamps, aimed at upskilling and helping people get into data science.
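The run-a-line, look-at-the-result loop Doris describes is easy to picture. A minimal, purely illustrative sketch (the sensor readings and the outlier cutoff below are invented for this example) of the kind of interactive session a notebook enables:

```python
# Each statement here could be run one at a time in a REPL or notebook
# cell, inspecting the output before deciding what to write next.
import statistics

# Step 1: load some data and eyeball it (hypothetical readings).
readings = [12.1, 11.8, 12.4, 35.0, 12.0, 11.9]

# Step 2: compute a summary, notice it looks suspiciously high.
mean_before = statistics.mean(readings)

# Step 3: spot the outlier, filter it out, and re-check the summary.
cleaned = [r for r in readings if r < 20]
mean_after = statistics.mean(cleaned)

print(round(mean_before, 2))  # skewed upward by the 35.0 outlier
print(round(mean_after, 2))   # closer to the typical reading
```

In a compiled language, each of those "run, look, adjust" steps would mean another edit-compile-run cycle; in an interpreted session, the intermediate results stay live between steps.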

Tim Gasper [00:19:27] Yeah. I like that example you gave of quants at a bank: maybe you know how to use a spreadsheet, and learning Python now can take what you're doing to the next level. And it's clear that notebooks have become very, very popular, and the popularity of Python obviously has a lot to do with that as well. I think it's interesting how there are sort of three trends I see happening, and I'm curious, Doris, what your thoughts on this are. There are data science features being added into places like Excel and Tableau and things like that. Then you've got specialized tools like Dataiku and DataRobot, which are trying to be low-code, no-code types of approaches to managing machine learning. And then, you have notebooks and Hex and Jupyter and things like that, really more like, "Hey, you should learn Python, and Python's easy." Python is accessible as sort of the third category here. Do you see that one of these categories is going to kind of take over, or do you see that there's space for all three of these as data science gets more accessible?

Doris Lee [00:20:29] Yeah, Tim, I think that's a really interesting way of putting the three categories. I never really thought about the categorization of these, but I think that is correct: one is the low-code, no-code tool, and then enhancing that with Python capabilities. We saw that with the Microsoft Excel feature with Anaconda, the fact that you can actually run these Python formulas or these Python procedures within Excel. And then, also, your typical, I think of DataRobot and Dataiku; Alteryx maybe also falls into that mix of maybe an AutoML tool, but also, it's doing a little bit of data prep. It has a little bit of some of the capabilities that you would expect from a data platform. And then kind of the notebook, which I still think is focused more on the programmatic audience. When I say programmatic, you're writing code to do something to your data; you're not doing point and click and so on. Now we are seeing the blurring of those different lines with, for example, Hex, which has a really nice interactive panel where you can actually look at a dashboard that is based on your notebook and so on. So, I do think that we're seeing a blurring of the lines across these three categories, dashboards to automated tools like AutoML and notebooks and so on. And I think they serve different personas and audiences. And one of the funny things about data tooling is that I think if you try to pack too much into a tool, you'll never please everyone. And so, I do think that each of these categories of tools serves its specific market very well.

Tim Gasper [00:22:27] I think that's a very astute answer. And yeah, it's pretty hard to build a spreadsheet, BI, notebook, AutoML tool, right? The scope is just too large, and how would that user experience even work anyway, right?

Juan Sequeda [00:22:42] And they're different personas and users.

Tim Gasper [00:22:44] Yeah, exactly. Yeah.

Juan Sequeda [00:22:45] So, one of the things, looking at these three categories, I'm surprised, or actually, where does SQL fit into this? And I want to broaden out the question: we're talking about data science, but then also, where does data engineering fit into this? Because I feel that there's so much overlapping work that occurs, right? You said it earlier, it's like, "Oh, data scientists, they clean the data." Well, isn't that something that now the data engineer is doing? And all that work is also happening in SQL, but then the data scientists go do all this, write all this Python code, where you're like, "All you just did was a join. That would've been a SQL query." So, you both talked about these three categories. I'm like, so where does SQL fit into all of this? And then, how does data science work fit into the data engineering work, and how are the lines getting blurred over there?

Doris Lee [00:23:39] Yeah, I think...

Juan Sequeda [00:23:42] I just dumped a bunch of stuff of the things.

Doris Lee [00:23:47] So, in terms of Python and SQL, I mean, SQL was developed when relational databases were developed. It was developed because people wanted a declarative way of saying, "This is what I want to do with my data, and then I have this very intelligent query optimizer that figures out a query plan and actually figures out a plan to execute it on my database." That's why SQL was developed in the first place. It was called Structured Query Language. And SQL was designed so that it had English phrases in it, such as SELECT and FROM and WHERE. So those were English-like clauses, and it did really well. People were able to use SQL. Data analysts and data engineers over the past two or three decades have been using SQL to work with the data in their relational databases. I think one of the reasons why Python and other languages also rose over time was the fact that, one, not everyone puts their data in a relational database. Data can be in spreadsheets. It could be images, it could be documents, it could be all sorts of different things. And if you can't fit it into a relational database, you can't use SQL on it. Now, the second reason why Python has gained popularity is the growth of machine learning and other advanced data science workflows that people want to run on their data, which traditionally would not fit very well in the SQL type of world. Now, I think we are starting to see a blurring of the lines here, with a lot of cloud data warehouse companies actually offering Python-native APIs and solutions on top of their databases. So even today, if you have your data in a database, you don't have to write SQL to be able to work with the data; maybe you're a Python user.
And the reason why we're seeing this shift in the market is, if you actually look at the growth of these programming languages, Stack Overflow does this survey every year where they survey, I think, basically programmers; I can't remember exactly what the population is here. But essentially, I think this year or last year was the first year that Python took over SQL as, I think, the third most popular programming language. So, both of them are hovering at around 48, 49%. So basically, one in every two developers is using Python or SQL. I do think...

Tim Gasper [00:26:38] What was that? Python and SQL are neck and neck right now?

Doris Lee [00:26:41] Right. Yeah, right.

Juan Sequeda [00:26:43] What's number one?

Doris Lee [00:26:44] I don't know actually.

Tim Gasper [00:26:47] Okay, we got to look it up quickly.

Juan Sequeda [00:26:50] Okay. So, at the end of the day, the lines are fully blurred here. I mean, it's like you just work in data. What does that mean? Well, everybody's definition of data is going to be very different. I may call myself a data engineer, but I'm already doing some of that work, and somebody else is calling themselves a data scientist. So, I think that all these roles and titles are just being very blurred. So, I mean, that's an observation here out of this conversation.
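Juan's quip from earlier, that a pile of Python code sometimes "would've been a SQL query," is easy to illustrate. In this toy sketch (the tables, names, and numbers are invented, using Python's standard-library sqlite3 module), the same join is written once declaratively in SQL and once by hand in Python:

```python
import sqlite3

# Set up two tiny example tables in an in-memory database.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, total REAL);
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    INSERT INTO orders VALUES (1, 10, 99.5), (2, 11, 15.0);
    INSERT INTO customers VALUES (10, 'Ada'), (11, 'Grace');
""")

# Declarative version: one SQL join; the engine plans the execution.
sql_result = con.execute("""
    SELECT c.name, o.total
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    ORDER BY o.order_id
""").fetchall()

# Imperative version: the same join hand-written in Python.
orders = con.execute(
    "SELECT order_id, customer_id, total FROM orders ORDER BY order_id"
).fetchall()
customers = dict(con.execute("SELECT customer_id, name FROM customers"))
py_result = [(customers[cust], total) for _, cust, total in orders]

print(sql_result == py_result)  # True: same answer either way
```

The point is not that one version is wrong, but that the declarative form hands the "how" to a query optimizer, which is exactly the reason SQL was designed the way Doris describes.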

Doris Lee [00:27:22] Yeah. Well, my hope is that in five years or 10 years, the language that you pick doesn't actually limit what you can or cannot do with your data, which is true today where, okay, without the Python APIs on databases, the Python stuff all lives in Python land and all the database stuff lives in database land. And my hope is that in the future, there are APIs that are agnostic to whatever backend your data is stored in. Maybe it's a data lake, maybe it's a database, maybe it's something else. Maybe it's a bunch of unstructured data or something like that. And you can use Python, or maybe you like SQL, maybe you like Julia, or whatever is your programming language of choice; you're able to work with that data, and then the platform just figures it out for you, right? It figures out what needs to be done to your data. To make data truly accessible, that's where we want to be.

Juan Sequeda [00:28:32] I'm going to put on my same hat as always, and I agree with you. I'm like, at the end of the day, this shouldn't be, "Oh, if you're going to go use this tool or this language or whatever, you've got to do it this way," because that's how we generate silos around this stuff. So, it doesn't matter. At the end of the day, you just use the tool that is most comfortable to you. But for it all to work out, the data needs to be well-defined, to have meaning. This is where I think the semantics and the knowledge are going to play an incredibly key role in order to accomplish that vision. Anyway, that's my perspective. I'm curious what you think about that.

Doris Lee [00:29:09] Yeah, I definitely agree, and I think a lot of the work around data discoverability, and obviously catalogs and others, is important to enhance your data with that semantic information, that semantic layer. I think there's a lot of exciting, cutting-edge work that's been done over the past couple of years, and nowadays, on this front.

Juan Sequeda [00:29:34] I'm looking at the question here that Malcolm has. So, this is the type of work, so you're saying you have data scientists, and are they using the outputs of these existing data management tools? So, "Oh, tools that are already doing master data management, doing data quality, doing all the semantics and stuff," or are they taking it from the source and then doing that work themselves, and maybe even repeating it over and over? I see both sides.

Tim Gasper [00:30:02] A little bit of both.

Juan Sequeda [00:30:03] A little bit of both. Again, what's your perspective? How do you see this today and where should this go?

Doris Lee [00:30:10] I think it's a little bit of both. Definitely, it depends on the workflows of the data practitioner. It depends on the organizations that they work with and what workflows they have. We do find that a lot of data scientists like to use their own tooling, but also, oftentimes, that is dictated by, "Hey, what warehouses is all your data stored in?" or, "Hey, what are the typical workflows?" So, I think it's a little bit of a mix of both. And I think for the data engineering use cases specifically, we do see a lot of people writing data pipelines and scripts to pull all of that information together in house.

Tim Gasper [00:30:53] Yeah, that makes sense. One other thing, Juan, you brought up was data science versus data engineering. And I'm wondering, when we talk about making data science more accessible, are we also talking about trying to make data engineering more accessible? And Doris, how do you feel about how those two fields are evolving together? Is this another area where the lines keep blurring?

Doris Lee [00:31:22] Yeah, I think traditionally, data engineering has been focused on the ETL use cases, where I'm taking my data, I have to do some sort of transform operations, typically in SQL, and then get it to a format that is clean, so that my BI analyst can plug it into their Tableau and look at the data. So that's kind of your typical workflow. Now, a lot of things are changing because of this trend towards the modern data stack, where you have an entire stack based on data that is in your warehouse. And so, I think some of that is evolving, but between data engineering and data science, one of the primary handoff processes that we do see is that, a lot of the time, a data scientist would go in and develop some sort of, let's say, fraud prediction workflow based on some machine learning package. Now, this would be developed in a Jupyter Notebook, on maybe a single node, on my laptop, and it works. It's like, "Okay, I have 98% accuracy. I'm done for the day." But now a data engineer is tasked with the job of, "Hey, I need to take that notebook and translate it into something like Spark, or rewrite it into SQL or something that is a little bit more robust, so that we can actually deploy that workflow into production pipelines." And that friction point is something that we see a lot. And it's a huge pain point, obviously, because you're doing the exact same work, but you're having to retranslate or rewrite those workflows into what they're calling a more robust language. And so, that's one area where we've seen the data science to data engineering handoff process being pretty high on friction.
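The notebook-to-production rewrite Doris describes can be pictured with a toy example. Below, the same entirely invented "flag large transactions" rule is written twice: once as laptop-style Python, the way it might appear in a notebook prototype, and once as the SQL a data engineer might rewrite it into, with the standard-library sqlite3 module standing in for a warehouse:

```python
import sqlite3

# The data scientist's notebook prototype: flag suspicious transactions
# with plain Python, developed interactively on a laptop.
transactions = [
    {"txn_id": 1, "amount": 40.0},
    {"txn_id": 2, "amount": 9500.0},
    {"txn_id": 3, "amount": 120.0},
]
THRESHOLD = 1000.0  # hypothetical cutoff for this sketch
flagged_py = [t["txn_id"] for t in transactions if t["amount"] > THRESHOLD]

# The data engineer's production rewrite: the same rule expressed as SQL,
# so it can run where the data lives instead of on a laptop.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE transactions (txn_id INTEGER, amount REAL)")
con.executemany("INSERT INTO transactions VALUES (?, ?)",
                [(t["txn_id"], t["amount"]) for t in transactions])
flagged_sql = [row[0] for row in con.execute(
    "SELECT txn_id FROM transactions WHERE amount > ?", (THRESHOLD,))]

print(flagged_py == flagged_sql)  # True: same rule, written twice
```

Writing the same logic twice, in two languages, maintained by two roles, is exactly the friction point being described; real cases involve full ML pipelines rather than a one-line filter, which makes the retranslation far more painful.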

Juan Sequeda [00:33:24] This is a very interesting point you're bringing up. And what's going through my head is, the way I see this is data engineering is, one, as you said, the ETL, ELT, whatever ends up in your database, data lake, all that stuff. And then, you want your analysts and also your data scientists to go find their data and go do their stuff. So then, they find the data and they do the work as you described. They're like, "Okay, I'm done. I did it on my laptop." But then there's this cycle that goes back, saying, "Okay, now we need to put this into production," but this isn't just about data engineering, go add it back to the data lake. I mean, it could be something like that, right?

Doris Lee [00:34:02] Yeah.

Juan Sequeda [00:34:02] But also, I think now the follow-up question is, we talk about data science and data engineering, so where do data science and ML engineering come in? There are all these cycles that go around. So first of all, the point I want to make is, it's really interesting how you're bringing up that the output of a data scientist, after they've done their work, comes from the work that a data engineer did. The output of the data scientist goes back into the data engineering, and you have this circle, and there's that friction there. So, that's a great observation. I never thought about it that way, so thank you. But then that also leads me to think, where do these ML engineers now come in, and where's the friction between data science and data engineering and ML engineering and whatever more roles we're going to come up with?

Doris Lee [00:34:44] Yeah, I think the use case that I referred to earlier is probably closer to what one would call a machine learning engineer. Machine learning engineers don't exist at some companies, so they're just called data engineers. So, it's kind of synonymous in that realm. But the idea there is someone needs to take these models, these pipelines, and put them into production so that they can be served and run in a scalable manner.

Tim Gasper [00:35:13] Here's an overly simplistic question for you. You're growing a new data team. Are you going to hire a data engineer first or a data scientist first?

Doris Lee [00:35:26] I guess, my question is what kind of data team, because it depends, right? Let's say I'm building a product, or I'm building a data-driven product. Is the key challenge here, or the key innovation here, a model breakthrough, a breakthrough in how we can do the modeling on this specific type of data? Or is it that we know what the model is, it's a very simple task, we're able to replicate it, but we need to serve it to a billion customers? Those are the two questions I would ask. And that fork determines the team composition, who you hire first, and what the roadmap is going forward.

Tim Gasper [00:36:11] So, the more novel it is that you're trying to do, or discovery driven... I mean, it depends on what kind of problem you're solving. If you're solving something on the simpler side, it might be an analyst. On the more complex side, it might be a data scientist or machine learning engineer, something like that, right? But let's just say you're like, "I've got customer data over here and I've got customer data over here, and I want to combine it together and I want to stick it in Snowflake so I can make a dashboard." Well, then maybe you're talking a data engineer or something like that, right?

Doris Lee [00:36:40] Yeah, yeah.

Juan Sequeda [00:36:42] All right. So, the point is that probably you're always going to start more on the data engineering side, because you need to lay the groundwork, do the basic stuff first, and then later on, you're always going to have a data scientist. I mean...

Tim Gasper [00:36:55] Well, I don't know. Now, I'm mixed on that because the other takeaway I took from you, Doris, was that a lot of times the data scientists are the trailblazers, right?

Doris Lee [00:37:04] Well, it depends on what your data is and where it sits, right?

Tim Gasper [00:37:07] Yeah. Well, that's another T- shirt for the store. It depends.

Juan Sequeda [00:37:14] inaudible It depends. Actually, that's a call-out to our good friend Sanjeev Mohan, his show is called It Depends, because you know what? That's probably the answer to most questions. It depends, and it's hybrid. It's a hybrid role. It's a hybrid structure. Well, so I'm curious, you've worked a lot in the open source world. How is the market of data science tools evolving? What's coming up next, and what's the role of open source around all of this?

Doris Lee [00:37:48] Yeah, I think open source is like, when I first discovered open source, I think when I was an undergrad at Berkeley, I was like, "Wow, this is a brand-new world. This is crazy." Because people are developing these tools based on the challenges that they have at work or something, and they're like, "Hey, I'm going to go and build this tool and it's going to solve this pain point that I have." And these tools have, over time, gained such a huge community following. I'm talking about examples like NumPy, scikit-learn, Matplotlib, these things that have become standard tooling in the data scientist's toolbox. They were developed by folks that just decided one day, "Hey, I have this pain point. Plotting is very difficult. What if we had a library that made that easier?" Or, "Hey, numerical computing." I believe that was developed by a group of physicists or a group of scientists somewhere for scientific research. This whole ecosystem of what they call the PyData ecosystem, the Python data ecosystem, has really helped Python create these higher-level abstractions and APIs for what you want to do with your data. That might be data transformation and cleaning in the case of Pandas, or it could be running machine learning models in the case of scikit-learn, or maybe you're using statsmodels to compute some sort of statistics. So, it's a really amazing avenue where we have these very rich abstractions and APIs that are created, which makes it easier for people to create complex workflows. And so, we've started seeing this explosion of tooling and open source projects. And I think ecosystems like GitHub and other community resources have definitely helped with that growth and collaboration. And I think one of the things that I've learned over time is that you might have 10 different tools trying to do the same thing, but usually one or two tools win out.
After there's a bunch of exploration in this space, users end up gravitating towards one or two tools. And oftentimes, those are not the most complex, the most technically elegant, the fastest, or the most performant tools. It's really, "What's the easiest thing to use? What's the easiest getting-started experience?" And then, I think, the other aspect of it is open standards. We've seen time and time again in the data tooling space that open standards and open source win out over proprietary solutions over time. Because as a data scientist, I want a lot of control and understanding of what exactly is being done to my data. You wouldn't want to create a pipeline where there's a giant black box and you're like, "Oh, I don't really know what it does, but it spits out a 90% accuracy." That's not great, because eventually, you want to be able to tune. I think of it as you're fixing a car: you want to understand what needs to be done to your car, the gears that need to be changed and everything. But if it's all taped up, then you can't make any changes. You can't optimize for the performance, and so on. And so, I think open source and open standards provide data scientists that peace of mind that they can always go in and make changes, make modifications, and improve on their pipelines, which is why I think, over time, open source and open standards have just blossomed over proprietary solutions, especially in the data tooling space.

Juan Sequeda [00:41:54] So, two things. One, can you give an overview of, I think the thing you talked about, right? Your NumPy and Pandas, stuff like that. This is all within the Python Ecosystem. What are the categories within the Python Ecosystem? Like, " Oh, I'm trying to go do A," and for A, you want tool X, right?

Doris Lee [00:42:16] Yes.

Juan Sequeda [00:42:17] What does that landscape look like?

Doris Lee [00:42:20] Yeah, so I think of it as, I'm a data scientist, and what's my typical data science workflow? So, typically, I would start with maybe a CSV file. Let's just take a CSV file. I need to load in my data, and then I need to do some sort of transformation on my data. So, for that, you want to use Pandas, because Pandas comes with very convenient data loading, data transformation, and data cleaning functionalities. So, you use that to clean up your data, you do the transformations. And then, you're like, "Hey, but I need to compute some sort of statistics," or, "I need to run a machine learning model," and so on. And then for that, the standard tool here is scikit-learn. You would use scikit-learn to train and fit a model. You create your train and test data, and then you run the training, you do the prediction. And then, you use something like Matplotlib or Seaborn or Altair, these visualization packages, to then say, "Okay, I have all of these model results." It's like 010101. It's all binary numbers. How do I visualize it? How do I understand what's going on with the model? And then, you would use one of these visualization libraries to do that. So, that's a very simplistic view of kind of an end-to-end workflow. Obviously, depending on the use case, maybe you want to use more complex models, maybe you would want to use XGBoost for training your decision trees, and so on. So, there's all these for...
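As a rough sketch of the load → clean → train → evaluate flow Doris walks through, here is a minimal version using Pandas and scikit-learn. The inline data, column names, and model choice are all invented for illustration; a real workflow would start from `pd.read_csv(...)` as she describes:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Load and clean with Pandas (inlined here in place of a real CSV).
df = pd.DataFrame({
    "amount":  [10, 12, 300, 11, 280, 9, 310, 14, 290, 13],
    "n_prior": [5, 6, 1, 7, 0, 4, 2, 8, 1, 6],
    "fraud":   [0, 0, 1, 0, 1, 0, 1, 0, 1, 0],
}).dropna()

# 2. Create train/test data and fit a model with scikit-learn.
X, y = df[["amount", "n_prior"]], df["fraud"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
model = LogisticRegression().fit(X_train, y_train)

# 3. Run the prediction and evaluate.
acc = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy: {acc:.2f}")

# 4. Visualize with Matplotlib (commented out to keep this non-interactive):
#   import matplotlib.pyplot as plt
#   df.plot.scatter(x="amount", y="n_prior", c="fraud", colormap="coolwarm")
#   plt.show()
```

Swapping `LogisticRegression` for an XGBoost classifier, or Matplotlib for Seaborn or Altair, changes only one step of this pipeline, which is the point of the shared abstractions she mentions.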

Special guests

Doris Lee Co-founder of Ponder (acquired by Snowflake)