About this episode

New data tools drop daily, but are they worth the hype? Some launch with overinflated expectations, while others solve a problem that doesn’t really exist. On the flip side, there are tools that transform the enterprise and have the potential to change the way we do data forever. The question is, how can you tell the difference? 

Special guest Erik Bernhardsson, formerly with Better.com and Spotify, joins Tim and Juan for a conversation on data tools; the good, bad and the ugly.

Special Guests:

Erik Bernhardsson

Bernco

This episode features
  • An overview of data tool types and who they serve
  • How to define “good, bad and ugly” 
  • What’s your favorite spaghetti western movie?
Key takeaways
  • Observe: get to know your team and their existing workflows
  • Tools: what is in your tech stack, and what do people want to use?
  • ROI: focus on key business outcomes and objectives

Episode Transcript: 

Tim Gasper:
It’s Wednesday and it’s time for Catalog and Cocktails. It’s an honest, no BS, non-salesy conversation about enterprise data management. I’m Tim Gasper, longtime data nerd and product guy, joined by co-host Juan. Hey Juan.

Juan Sequeda:
Hey, I’m Juan Sequeda. I’m the principal scientist at data.world and it’s Wednesday, middle of the week. End of Wednesday and it’s time to take a break and chat about data. I’m in a different place today. I’m in our office. I think this is the first time… We’ve done an episode one or two here in the office.

Tim:
We did one episode there, yeah. But you’re in one of the “phone booths” right?

Juan:
Exactly. Yeah. We were going to eventually actually create a Catalog and Cocktails room here at the data.world office. So this is going to be cool to go do that. And well, today we have a really cool guest. And this is a guy who, if you don’t follow him on Twitter, you better be following him on Twitter. I just love seeing everything that he is tweeting about because he’s truly asking the hard questions. He’s an honest, no BS guy. And this guy is Erik Bernhardsson. He is very well known. He’s built the music recommendation system at Spotify. You ran and grew the tech team at Better.com from one to 300, I think. And now you’re working on something new. How are you, Erik?

Erik Bernhardsson:
Hi, so excited to be here.

Juan:
Well. It’s a pleasure. Thank you so much for accepting the invitation. So, hey, before we dive into the real deal here, what are we drinking and what are we toasting for? You want to kick us off, Erik?

Erik:
I just grabbed some beer in a bodega on the way here. So I’m drinking Goose Island. I can’t drink liquor this time of day. I would just fall asleep.

Juan:
It’s 5:00 New York, right? So it should be okay.

Erik:
Yeah, it is, but I don’t know. I’m old now. If I drink liquor, I’ll have to go to bed within three hours.

Juan:
How about you, Tim?

Tim:
I am drinking a scotch and ginger, keeping it simple today. And for my ginger beer, ginger ale, I’m using Bundaberg. It’s Australian owned. I don’t know if it’s made in Australia, but Australian owned ginger beer, pretty tasty. It’s got a little bit of a funkiness to it, which is kind of interesting.

Juan:
You want a bit of fancy? So I’m in the office, and we have a bunch of whiskey and we had some Mexican whiskey, which is Abasolo. So I’m calling this a Mexican highball. So Abasolo whiskey from Mexico and Topo Chico. So that’s my drink. So let’s toast for… What do you want to go toast for? Erik, what are you toasting for?

Erik:
I don’t know. Getting old.

Tim:
Cheers to being old, but not too old.

Erik:
Cheers.

Juan:
Actually, so cheers to being old, but not too old because we got our warm up question today, which is going to show us that we’re not old because-

Tim:
Maybe not old enough, huh?

Juan:
Not old enough. So what’s your favorite spaghetti western movie? I don’t know. I guess I should watch one day The Good, The Bad, and The Ugly, but-

Erik:
I think that one is good. I feel like I watched it, but I don’t know. I really don’t know. You’re asking the wrong person.

Tim:
You don’t watch any either, right, Erik?

Erik:
I don’t watch anything. [crosstalk 00:03:10] It’s like, I don’t know. I have like zero time to watch anything.

Tim:
Yeah. I was looking up what are examples of spaghetti westerns? And a lot of the names sound familiar. I was like, oh, A Fistful of Dollars? I’ve heard of that. Never watched it, maybe I should.

Erik:
Yeah. They’re all made in Italy. Right? That was the thing. Right? Because like-

Tim:
Yeah. A bunch of them were made in Italy. Yeah. And it became just like the style, they’re so bad, it’s good kind of thing. Right?

Erik:
Oh, is that right?

Juan:
All right. So now we know. We got some homework. Eventually, one day we’ll watch some spaghetti western movies.

Erik:
It’s not musical.

Juan:
All right, well, let’s kick this off. So hey Erik, so there’s so many data tools right now. And so many tools is exciting because, hey, there’s a demand for this stuff. But this also means that a lot of tools out there are not satisfying a lot of the needs. So honest, no BS here, which are the tools that suck and which are the tools that don’t suck right now?

Erik:
Yeah. I mean, I sort of have a fear of throwing a tool under the bus. I’m kind of an anarchist. I think in a way, you should just let people use whatever tool they want. And the fact that they use that tool probably means it’s good. People using things probably means they derive some value from it. And so that tends to be for me what I look for, do people enjoy these tools? So what’s bad, I think, is to what extent data teams are wasting so much time on infrastructure stuff, stuff that’s not core business logic. And so if there’s any tool I want to call out, maybe as bad, it would be maybe Kubernetes. Maybe, I don’t know. I feel like AWS is kind of annoying. All these like Terraform, Docker, all that stuff. I just want to do data. Why do I have to write YAML files? I don’t know. YAML, I’m going to call out YAML. I hate YAML.

Tim:
It’s like a necessary evil at this point.

Juan:
That’s the point, a necessary evil. I think people hate it, but they love it. It makes… You need something like it, I guess. So here’s the point that you’re making, or how I’m interpreting it: what sucks is all the pipeline, the infrastructure stuff, and where I really want to go focus my energy is on the business logic of actually doing things with data. And so all the infrastructure stuff is what sucks right now?

Erik:
I mean, look, I’m old enough that I’ve been through a lot of… I’ve been doing this for many years and 10 years ago, everyone was doing Hadoop. And Hadoop in a way was quite good, because it was the first tool of its kind that let people operate on large data sets. And for that reason, I love Hadoop, but looking back 10 years, I also realize now, Hadoop is terrible in so many ways. There’s so many more efficient… There’s so many better ways today to work with data, that’s not Hadoop. And so to me, that’s a question I ask myself a lot, is like, what are the tools today that are like the Hadoops of tomorrow? What are the tools we’re going to look back at in 10 years and say, actually that was really bad, but I guess we didn’t have any options.

Juan:
Okay, so this is interesting. So at that point in time, you would say this was a good tool, but then 10 years later you’re like, well, this is a bad tool?

Tim:
Or maybe an exciting tool versus then becoming a bad tool?

Erik:
Yeah. I mean, I think tools often start out as enablement. The first version of every tool in the tool chain is always like a thing that lets you do a thing for the first time. And that’s exciting. You’re like, wow, finally I can operate on large data sets. I can train deep neural networks. I can do whatever. And then incrementally, newer generations of that tool end up actually transforming it from this pure enablement to actually making engineers productive. And they usually lower the cost of operating on large data sets, or whatever it is, by multiple orders of magnitude. A Hadoop query that 10 years ago would take me a whole day to write and run, now it’s like a SQL query. I just run it and I get an answer. Right?

Erik:
So that’s not to say Hadoop was bad. I think it was just like a first example of something that enabled me to operate at scale. But over time, progressively new tools came and just lowered the cost of doing that by probably like two orders of magnitude. It’s probably 100 times easier to just write a SQL query today in Snowflake or whatever and get an answer immediately.

Tim:
Right. Hadoop’s an interesting example because when it first came out, everybody was like, ah, MapReduce, and Pig and Sqoop and everybody was getting excited about the initial ecosystem there. But then it very quickly just iterated towards actually I just want to run SQL. Right? I just want to run SQL on Hadoop and there were like five ways to do it. And now there’s really only one. And then AWS hosted EMR and it got better and better. And so Hadoop disappeared. It all went back to SQL sort of. Right? Does that cycle repeat itself for these tools that are exciting, but then fade a little bit? Does it always come back to common things like the warehouse and SQL and stuff like that? Curious about what you’re seeing, where things are trending towards.

Erik:
I don’t know if it’s a cycle, but you’re right that something happened where in 2005 or whatever, all the enterprise companies were using Oracle and stuff like that. And the startups were starting to experiment with MapReduce. Maybe it was more like 2008, something like that, 2009. And so it was this weird thing where all the startups were using Hadoop and all the enterprises were using SQL. And now Snowflake was like, we’re going to build a better Oracle, sort of, but for data warehouse, for OLAP-type stuff. And so they built this tool and then I think it’s like one of the rare examples where enterprise adoption was actually earlier than startup adoption. Because startups adopted this bad thing in retrospect and then Snowflake did a really good job.

Erik:
And eventually, they won over the startups, too. I think it’s actually a rare example. I don’t know. Maybe we’ll bounce back again. We’ve seen there’s a lot of history specifically with SQL. There were three or four times in the history of SQL where people thought it was a really good idea to put business logic in SQL. A long time ago, people were experimenting with stored procedures and triggers and all this kind of stuff. And then we realized it’s a bad idea to do that, so we started taking out the business logic and putting it in the app instead.

Erik:
And the database was more like this CRUD, you just read and like… So it’s interesting on the backend side that certainly happened. I don’t know if that’s going to happen on the warehouse side. I don’t know. People are very productive in SQL, but I think there’s an argument that when I look at them, yeah, a lot of people… There’s a lot of business logic in SQL, is that the right place to put it? I don’t know. I’m not convinced, but I don’t know, people like it. So I think that’s [crosstalk 00:10:46].

Juan:
Let’s dive into this, because the business logic is something that gets moved so many places, right? It’s in triggers and views and stuff. And then we put it into the application logic, but hey, I mean, putting all this logic in the application, you have not just the silos of data, but silos of applications. People don’t know what this stuff means at all. So now we’re starting to go see things like dbt and putting a lot of the logic, the modeling in these transforms inside of dbt, which is just SQL outside of the warehouse, but eventually gets connected somewhere in the warehouse. So where do you see then, the business logic being implemented? In a YAML file?

Erik:
No, that would be a disaster. I think that’s the worst.

Juan:
That was a joke [crosstalk 00:11:31] you can’t see my face but-

Erik:
No, yeah. I do. I think, I don’t know. I mean, in a way I’m the biggest fan in the world of SQL. I wrote this long rant blog post a couple of years ago, that’s about how much I hate random DSLs and query languages and how I just want my SQL back. And then a couple years later, now there’s SQL everywhere and I’m like, oh, let’s take it a little bit slower, what’s going on? So I don’t know. Maybe it’s a good thing. Maybe it’s a bad thing. I think having a lot of SQL is certainly better than having a lot of bad code languages for sure.

Tim:
Well, we definitely see things get hyped, right? And so maybe now the train left Hadoop station, it went to TensorFlow station for a little while, and now it’s all on the SQL kind of train. Right? And I’ve seen a lot of tools that are just like dashboards. All you have to do is write SQL and obviously dbt is now something that’s really, really popular. It’s taking the world by storm right now. Even within this realm of SQL oriented tools, are you seeing certain things that are more popular, are less popular, are more useful, less useful? What’s your thought process there?

Erik:
I don’t know enough about that. I mean, I know Materialize, I think, is building something incredibly cool. I think it’s kind of a risky thing. Materialize, for the people that don’t know, basically the whole idea is you can build incrementally materialized views. So the idea is in SQL, you can define a view and that gets incrementally refreshed with new data, which I think is pretty cool. You can do a lot of real time data transformations using that. I think it’s still like TBD, that’s a lot of business logic again that we’re putting in SQL. Is that a good idea? I don’t know. I think. I don’t know. But Materialize is certainly like a tool… And full disclosure, I’m a nominal investor in that company and I know the CEO, but other than that, I think there’s a lot of exciting stuff.
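[Editor’s illustration] The idea of an incrementally refreshed view can be sketched in a few lines of Python. This is a toy under stated assumptions, not how Materialize is implemented: the `IncrementalCountView` class, the `events` rows and the `user_id` field are all hypothetical names invented for the example. The point is only that each new row updates the stored result directly instead of triggering a full recomputation.

```python
from collections import defaultdict


class IncrementalCountView:
    """Toy view (hypothetical) equivalent to:
        SELECT user_id, COUNT(*) FROM events GROUP BY user_id
    maintained row by row instead of being recomputed from scratch."""

    def __init__(self):
        self._counts = defaultdict(int)

    def on_insert(self, row):
        # A new source row only touches the one group it belongs to.
        self._counts[row["user_id"]] += 1

    def on_delete(self, row):
        # A retraction undoes the earlier contribution of that row.
        user = row["user_id"]
        self._counts[user] -= 1
        if self._counts[user] == 0:
            del self._counts[user]

    def snapshot(self):
        # Current contents of the view as a plain dict.
        return dict(self._counts)


def recompute_from_scratch(events):
    """Reference result: what a full, non-incremental refresh would return."""
    counts = defaultdict(int)
    for row in events:
        counts[row["user_id"]] += 1
    return dict(counts)


# Made-up sample data: feed rows to the view one at a time.
events = [{"user_id": "ana"}, {"user_id": "bo"}, {"user_id": "ana"}]
view = IncrementalCountView()
for event in events:
    view.on_insert(event)
```

After the loop, `view.snapshot()` matches `recompute_from_scratch(events)`, but each update cost one dictionary operation rather than a scan of every event.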

Erik:
In general, I am overall pro SQL. I think it’s still to be determined if… What I think is important to remember too, is also that I think the demand for data is just enormous. And so, there’s been this interesting shift in the last five years where basically making data available on a SQL level has been by far the easiest way to make it broadly accessible to people. I’m also quite bullish on code and I think in a way, the fact that we’ve seen the pendulum swing so far towards SQL, to me, is a reflection of how large the demand is and how desperate people are to work with data.

Erik:
They’re like, I just want to get data, do stuff on it and it’s hard in code, so I’m going to do it in SQL. So I’m also very bullish on code, like Python, stuff like operating on data, too. I think to me the success with SQL, it is maybe less of a reflection of SQL in itself. And just more of a reflection of people really want data and SQL just ended up being for now, like the fastest way to get people access to that data.

Juan:
So what’s interesting is that we’ve been talking for the last 10 minutes about SQL. I think this is some evidence in a way that SQL has transcended a test of time. Right? You’ve seen cycles of Hadoop was cool at that moment, then it goes down. Right? Whatever’s cool today probably is going to go down tomorrow. But SQL has been there forever and one could argue then that SQL is a good, cool tool, even though it’s old, it’s been around for so long. It’s just so valuable that so many people know it. What else is out there like SQL, that you’ve seen… I mean, I guess like programming languages, like Python and stuff, but what else will you include in the test of time that they’ve been around for so long and they will continue to be here?

Erik:
C. I think SQL is like 67 or something ridiculous, like C is from like 72 or something. They’re both 50 years old. That’s insane. And so, I don’t know if you know or you’re familiar with the Lindy effect. Basically the idea is the longer something’s been around, the longer it’s probably going to stick around. So when I’m looking at programming languages and whatever we’re going to have in 50 years, I think we’re far more likely to have C and SQL in 2061 than we are to have… Not to throw them under the bus, but just saying R, Python or Julia or whatever. And maybe it’s a little bit unfair to bucket them as the same thing, C is obviously a very different thing, a systems language, but it seems weird. But when you think about it, I think we’re going to have a lot of people writing C in 50 years, which is crazy when you think about it. I could be wrong. Who knows?

Juan:
No, I think I’m with you on this. I mean, we still have mainframes and people still need to go learn COBOL and stuff like that for stuff that has existed for many years. But-

Erik:
Yeah, for sure.

Tim:
But it goes back to the comment you made earlier about adoption, right? It seems like the tools and technologies that get adopted are likely to be the ones that continue to see more adoption. Right? And they continue to see the test of time. And it’s like a momentum thing.

Erik:
I think that’s right. I think there’s actually another effect too, which is I think… I was tweeting something about that this morning, actually. I think I’ve also increasingly been convinced that there’s a certain amount of conservatism that’s good for programming languages almost. I think some languages are so eager to solve every new problem in some new way. And then they just get very bloated. I look at C++ and it’s like, I can’t use this, it’s insane. I’m a little nervous about Python. The surface area of the language is enormous these days. I’ve done a lot of asyncio in the last six months. And it’s a mess. It’s so complicated.

Erik:
I love Python. It’s by far the language that I’m the most productive in, but there’s just so much surface area today. And so I think that’s another interesting point is that SQL in a way, part of why it’s successful maybe it’s also because it hasn’t really evolved much. There’s little like JSON extensions and like stuff like that. But overall, maybe part of its success is it found this very conceptual integrity… There’s nice quo in the ’60s and ’70s that solve the problem beautifully. And then just stuck to that. It just never tried to do all this other stuff. I don’t know.

Juan:
I mean, it’s a stable language and it’s very simple. If you think about SQL, right? It’s grounded in this relational algebra, which is really very simple. You’ve got tables that you apply some operations to and that generates a new table. And then you keep doing these operations over and over again. That’s pretty much it. And I think I hear you, SQL is just a beauty of a language because, one, it’s the clarity. And I think that’s something which is super important right now. That it transcends the test of time because it’s something declarative, because it’s something that’s easier to maintain. It’s easier to go infer about. And I think when we go back to the issue of the business logic, you want business logic to be associated to a declarative language. That’s my point of view.
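[Editor’s illustration] The "tables in, table out" idea can be sketched in Python. This is a minimal sketch assuming tables are plain lists of dicts; the `select`, `project` and `join` helpers and the `users` and `plays` sample data are hypothetical names made up for the example, not part of any library.

```python
def select(table, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in table if predicate(row)]


def project(table, columns):
    """Keep only the named columns, dropping duplicate rows."""
    result = []
    for row in table:
        narrowed = {col: row[col] for col in columns}
        if narrowed not in result:
            result.append(narrowed)
    return result


def join(left, right, key):
    """Pair up rows from both tables that agree on the shared key column."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


# Made-up sample tables.
users = [
    {"user_id": 1, "name": "Ana"},
    {"user_id": 2, "name": "Bo"},
]
plays = [
    {"user_id": 1, "track": "Intro"},
    {"user_id": 1, "track": "Outro"},
    {"user_id": 2, "track": "Intro"},
]

# Roughly: SELECT DISTINCT name FROM users JOIN plays USING (user_id)
#          WHERE track = 'Intro'
intro_listeners = project(
    select(join(users, plays, "user_id"), lambda row: row["track"] == "Intro"),
    ["name"],
)
```

Every helper takes tables and returns a new table, so the operations compose freely. That closure property is what lets a query engine reorder or optimize the steps without changing the result.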

Juan:
I think that’s what we want to be able to… Because that way we can very quickly trace things. You can have the lineage, you know where things come from, it’s easier to infer about it, to reason about it. So that’s why I think it’s super important to figure out how to go manage business logic in a declarative language. And yeah, that is going to be something like SQL, maybe like a DSL, but I think SQL is always going to be something we’re going to be using. And I think we should be pushing more of our business logic into SQL. That’s why I’m a big fan of dbt. And I think this is something that will be around.

Erik:
Yeah. I agree with 80% of what you’re saying. I think making sure things are declarative is quite important. I think SQL despite its strengths has many flaws. It’s not modular, it’s hard to reuse things. So there’s many issues with SQL too where I think it could be much better. But you’re right in the sense that I think this declarative nature is very, very useful. And my feeling is also languages like Python can be quite declarative if you just design it that way, if you build frameworks and libraries in such a way. I’ve been playing around with Pulumi, which is a very different application. It’s about defining infrastructure. Pulumi is like… You can actually define infrastructure declaratively in Python and in the end, it does a state-diff between what’s there in the cloud and what’s there locally.

Erik:
And then applies that diff. I guess React works in a similar way. JavaScript is a non-declarative language there. I think it’s an incredible language. But React has this beautiful, I think relational nice thing, where you declare what you want. And then the framework takes care of making that happen inside the browser. So I don’t know. I think there’s a lot of interesting directions to explore.
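[Editor’s illustration] The "declare what you want, diff it against what exists, apply the difference" pattern can be sketched in plain Python. This is not Pulumi’s or React’s API. The `diff_state` and `apply_diff` functions and the resource names are hypothetical, and real tools track far more (dependencies, ordering, failures).

```python
def diff_state(desired, actual):
    """Compare the declared state with the existing state.

    Both arguments map a resource name to its settings. Returns what has to
    be created, updated and deleted to make `actual` match `desired`.
    """
    to_create = {name: spec for name, spec in desired.items() if name not in actual}
    to_update = {
        name: spec
        for name, spec in desired.items()
        if name in actual and actual[name] != spec
    }
    to_delete = [name for name in actual if name not in desired]
    return to_create, to_update, to_delete


def apply_diff(actual, to_create, to_update, to_delete):
    """Return the new state after applying the computed changes."""
    new_state = dict(actual)
    new_state.update(to_create)
    new_state.update(to_update)
    for name in to_delete:
        del new_state[name]
    return new_state


# What the code declares (made-up resources).
desired = {
    "bucket-logs": {"versioning": True},
    "queue-jobs": {"retention_days": 7},
}
# What currently exists in the (imaginary) cloud account.
actual = {
    "bucket-logs": {"versioning": False},
    "vm-legacy": {"size": "small"},
}

to_create, to_update, to_delete = diff_state(desired, actual)
new_state = apply_diff(actual, to_create, to_update, to_delete)
```

The author of `desired` never writes the create, update or delete steps. Those fall out of the diff, which is the sense in which a general-purpose language can still be used declaratively.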

Juan:
So is the future then about finding higher levels of abstraction? Because if you think about a declarative language like SQL, it’s just a higher level of abstraction. What you’re just describing is more, I want to go define the what and not the how, is that where we’re seeing a trend, is that where the next cool good tools are going to be that will probably transcend the test of time? I don’t know.

Erik:
I think so. I mean, I think that’s always beneficial. You want to express the what, not the how, right? In SQL, the query engine takes care of that. And so I think that’s true.

Juan:
So what do you think about all this low-code and no-code?

Erik:
I think it’s fine. I mean, I think, I have a similar view on low-code and no-code that I have expressed earlier, which is that to a large extent, I think it reflects the extreme demand for software that exists in the world today. And that people are so eager to build software that they want whatever tool needed. And one very crude, reductive way to think about, what is a software engineer? Software engineer, I think is actually just… Their job is to take business goals and express them as business logic. And programming languages are just like one concise way to express business logic. So I think you’re always going to need people who are trained at taking a fuzzy objective and then think through all the edge cases and how do you make it into logic and that’s just unavoidable as a problem.

Erik:
I think you’re always going to need people to think through all the edge cases and all the things. To me that’s what a software engineer does. Right? And so, that’s why with no-code or low-code, you don’t really escape those problems, you just push them somewhere else. And so what I think we’re going to see is a lot of companies adopting those tools, that’s great to get started, but in the long run, they’re just going to reinvent software engineering. And then they’re going to realize, actually we should just hire software engineers to take care of this. Because now we have, a billion, trillion edge cases in this… Now we have to maintain this YAML file and it’s really hard. And I wish there was a way to debug it and I wish there was a way to test it.

Erik:
And a software engineer comes in and they’re like, yo, actually we figured it out, it’s called unit testing. And by the way, we have this cool thing called Git and we have this cool thing where we do continuous integration and pull requests and everyone’s going to be like, wow, is that… And hopefully by that time, we’ll have even better programming languages for expressing logic. And so I don’t know. I think to me, the success of low-code and no-code in a way is a good sign if you are a software engineer because it means the demand for your services is going to be a bit higher in the future.

Juan:
So you mean that if no-code, low-code is successful, they’re going to start building big or creating new holes that need to be filled in. So they’re going to go just hire more software engineers. So software engineers will always… I mean, I think they’re always going to be in demand, but they’re going to be in higher demand.

Erik:
Yeah. I think that’s right. And it’s not that you can have them fix all the problems caused by low-code and no-code. But I’m just saying, I think that success of low-code and no-code is to me a sign that the demand for software is very high. And I think to some extent also, when I think of no-code and low-code, I think the ones that are going to be successful are also the ones where engineers want it. There’s many tools that are built in order to go around software engineers. You see this very commonly in marketing teams. Marketing teams are like, oh my God, it’s so hard to launch a campaign. I just want a tool to do it myself, whatever, do an A/B test or something.

Erik:
But I think to some extent, that speaks to an organizational dysfunction, where in reality, the best ways to solve it is to embed software engineers with the marketing people, call it a growth team. And then have tools that both the marketing people and the engineers can work with at the same time, whether it’s Adobe or Figma that designs that, or whatever it is, tools that everyone likes and that engineers like because they can extend it in code, but also the business people like because they can just see things inside them. And so that’s what I think is the future of all these things.

Tim:
I like that perspective. It gives you an idea of like… Because as a product manager, I get excited sometimes about low-code, no-code tools. For example, about a year ago, what’s that company called? Bubble.com or bubble.io or whatever, they let you basically build your own application, WYSIWYG-style. And I was like, wow, this is so cool. But it’s like an example of a tool that goes around software engineering. Right? And so the second that you’re like, hey, I really want to build this really big robust application then you’re like, oh I just locked out the engineers when really I want them to be part of it. And I think that’s where things like dbt or things like Dagster or things like these types of tools that are like, oh, let’s bring visual aspects. Maybe we’re even evolving to some no-code aspects in the future. But in general, these things really embrace software and code as a core sort of thing. It’s nice to see tools that can do both these things.

Erik:
Yeah, for sure. And I think the key to success for a lot of these tools is actually bridging the gap between different business functions that previously didn’t work together. And I think by doing that, actually your vendor lock-in gets even tighter. Because if you have two different functions that both want to keep a tool, it’s exponentially harder to get rid of it in a way. Whereas if it’s something [crosstalk 00:27:24].

Tim:
Right. It’s something that binds us together.

Erik:
Yeah, exactly. And so those tools I think are going to see a lot of success.

Tim:
So, to take us back to tools again, before we go into a slightly different direction. There’s a lot of different parts of the data stack and Juan and I were really curious what is the Erik Bernhardsson modern data stack or preferred data stack or whatever, right? If you had your canvas of data tools, what are you pulling together? What’s your CI tool, what’s your data warehouse? And feel free to give multiple options if you don’t want to pick favorites. But curious as to what your vision of stack is.

Erik:
I think it’s funny because every startup has a blog post today and they’re like, here’s the new modern data stack. And by the way, we’re like this big box in the middle.

Tim:
My company is the big box. Right?

Erik:
And then there’s like, we’re right next to dbt because everyone loves dbt. So they’re like, here’s our startup and then right next to it is like, oh it’s dbt and then there’s like Fivetran and then there’s a bunch of all these ancillary things around it. They’re like, yeah, there’s Airflow or whatever over there. So I don’t know. I think the modern data stack, I’m increasingly skeptical that it exists. I think it’s just like whatever people want it to be, it’s like a Rorschach blot, you just see whatever you want to see in it. So I don’t know.

Tim:
I guess I’m asking you, what do you see when I show you the ink blot?

Erik:
I think I see a lot of boxes and the question to me is, I don’t think anyone is happy with that fragmentation because if you’re a company that works with data, do you really want to bring in 35 different tools and duct tape them together? Especially if you’re an enterprise company, do you want to go through 35 procurement processes? Certainly not. And so, I know I’m dodging the answer to your question, but I think to me, at least we’re probably going to see a lot fewer boxes on that. And I think over time, there’s going to be a lot of consolidation in this space and a lot of tools taking over adjacencies and doing multiple things over time. And so that’s a trend, I think we’re going to see a lot of. [crosstalk 00:29:52]

Juan:
I want to push on this because you are dodging the question, but you just said about consolidation. So what is being consolidated into what types of buckets and that way let’s see if we can figure out if you can actually answer that question.

Erik:
I mean, I think the best example is Databricks, right? Databricks is trying to do… They started out with a query engine, but now they try to do a data warehouse. They try to do data lineage, like all this other stuff. Right? So that’s a clear example of a vendor trying to do a lot of stuff. And I think we’re going to see a lot more of that.

Tim:
Do you see that as a good thing?

Erik:
I think so. I mean, good, I think it’s economically the inevitable outcome here, because as a vendor, you want to be able to cross sell and subsidize your CAC or amortize your CAC over multiple products. And as a company, you don’t want to deal with 35 different vendors like I said. So I think over time it is a good thing. I think maybe altruistically on the bad side, is it good that vendors get quasi-monopolies over the data stack? No, that’s probably not that good. Because what if they stop innovating, turn into Oracle and whatever, just charge people humongous amounts of money. I think that would be bad, but I don’t know. I think the startup world is so dynamic today that also there’s so much competition and I think it’s for the better for everyone.

Juan:
Now, today, I’ve had so many different conversations. I met this morning with Sarah Catanzaro from Amplify. And then [crosstalk 00:31:39]. Yeah. So she will be a guest next year. And then I also spoke with Andy Palmer from Tamr, who will be a guest in a couple weeks. And this was the topic of wait, there’s so many different tools right now. I mean, look at this landscape that everybody is going to be like, do I need to go buy 35 different tools? No. And Sarah’s argument’s like, the problem is actually going to be that I don’t want to go through procurement. I don’t want to go through the sales process of going again through all these tools. I literally do not have the time to go do that. Your engineering team is going to spend all their time going through these sales processes, so that can’t happen.

Juan:
There will be consolidation over there and talking to Andy was about, yeah you have these Oracles, these companies that were just this big monolith but they stop innovating. So they bring in the new thing so you go… One side, we’re like, well, you got the Oracles of the world who are going to be the model, so you can do everything. And we’re like, maybe the old school people think this is possible, but that’s not really what’s going to happen. But then we’ve taken it to the other extreme. And it’s like, well, now look at Matt Turck’s data landscape thing. I mean, this is ridiculous. So we’ve gone literally to the other extreme, and now it’s just so much stuff. So where is that sweet spot? So what is the consolidation that’s going to happen?

Juan:
And actually talking to Sarah today was, we’re saying, okay, there’s things around, don’t quote me on this but it was around metadata, around data, around workflow transformations, around analytics. And the thing is that when you look at these different buckets, not everybody will agree, well, hey, I want to be consolidated under this bucket, under this one. Right? So this is going to be such an interesting mix of what’s going on in the next couple years. And what worries me is that everybody starts buying all these tools. And then a couple of years, we’re going to have so much integration debt that it’s… I don’t know. I mean, that’s just [crosstalk 00:33:37] I don’t know how we’re preparing for that.

Erik:
Yeah, no, totally. I agree with you. I mean, I think if you look at any new industry, whether it’s financial industry or whether it’s like car manufacturers or oil producers or whatever, you start out with a thousand companies. And then like over the years, there’s fewer and fewer left and hopefully you end up with more than one in the end because that… But usually what you end up with is some stable oligopoly, like two, three, four vendors. That’s what you see in the cloud market. I mean, cloud has economies of scale I think more so than maybe data tools. But I mean, I think any industry you look at, airlines, whatever it is, you end up with three, four or five companies that control 80% of the market in the end.

Juan:
So that’s really what’s going to be, is it? Companies around metadata, around data management, around workflows, and moving the data infrastructure, some stuff around analytics. Those are four. I mean, I don’t know.

Erik:
No, I think they’re going to be more like vertical companies. It’s going to be like cloud vendors. You buy into this cloud vendor then you get the metadata management, you get the workflow engine, you get the run time, you get the query engine, you get the ETL, you get all of that. But you have to pick one. You have to pick one of these verticals.

Juan:
But that’s happening now. I mean, look at, you got AWS and you got Azure, Microsoft and-

Tim:
Erik, are you saying you’re going to be like, oh, we’re on the Snowflake stack or we’re on the Databricks stack. And that all of a sudden means a bunch of stuff?

Erik:
I think so. I think to some extent like cloud vendors, but look at them, I actually feel like there’s some who are exiting the software and like focusing more on the running of data centers and focusing on the lowest layer of it. And then the software layer above, that’s where the battle is right now. And like you said, I think there’s going to be a Snowflake. I think there’s going to be a Databricks or whatever, I don’t know. But that’s what I think it’s going to end up with. There’s going to be a few players above the cloud vendors. And maybe you can use any cloud vendor with any data provider, I don’t know, but I think there’s going to be consolidation above the cloud vendors into different layers. That’s the analytics stack or whatever.

Juan:
Now this is interesting of the… This is the modern data stack per Snowflake or Databricks. Right? And this can run on different types of engines. So yeah, we’re trying to predict the future here, which we have no idea about. So how about we switch a little bit onto another topic, which we meant to ask you about: people and teams. Okay, so you dodged Erik’s modern data stack ideal preferred data stack. So here’s another question, which is, what does an ideal data team look like?

Erik:
Yeah. I mean, I feel like maybe you’re referring to a blog post I wrote a few months ago, or an episode of tweeting about this. But generally, I think I’m anti-specialization, but to a large extent, I think, number one is I am a general believer in pretty wide skill set, which today is very hard because I think the data stack is so complex. So it’s hard to find someone who knows machine learning, and knows business intelligence, and product analytics, and also knows software engineering and infrastructure. But I think that doesn’t mean the point isn’t true, which is the more general your team is, the easier it gets to coordinate things.

Erik:
So I don’t know. That’s more a general point that I try to… When I build data teams, it is to hire people who are decent at everything and above all quite commercial. I think there’s just so much over-focus sometimes on what I think of as mathematical rigor. By the way I studied physics, I have a deep math background. But the truth is really what it looks like out there, when you’re like… I’m mostly talking about startups now, when you’re trying to build a data team is like so much of the value in doing data is going to be building internal data sets then finding obvious, stupid things the company’s doing in those data sets. So you’re probably not going to need to train any advanced machine learning. You’re not even going to have to train any machine learning.

Juan:
Can you say that louder?

Erik:
You’re not going to have to do machine learning. I think you’re going to find some… If your mindset is like you’re a journalist and then you are going to find the scoop, you’re going to find some nasty stuff going on at this company. And by the way, the company culture is ready for that because I think that’s a separate thing, is like sometimes company cultures are not ready to accept that there’s dumb things happening. But if you get to that point, then a lot of stuff is going to be a SQL query and a scatterplot or whatever. And then, you’re going to find something so embarrassing. You’re going to find 50% of people drop out on the third page of your onboarding form, because your email validation logic is broken or something terrible like that. I may be hyperbolic a little bit, but I think to me, that’s always been where I’ve seen the most value of data is finding these painfully obvious things the company’s doing a bad job at.

Tim:
You’re talking about that whole journey of descriptive analytics, to prescriptive, to predictive, you think about like a lot of companies can get a ton of value from the descriptive analytics and aren’t even doing enough of that already.

Erik:
I think so. And I think maybe there’s a little bit of wishful thinking. They’re like, yeah, we’re already doing so much… We’re doing everything so well already when we need to get to the next level it’s like AI, we need to do this deep learning and whatever. That’s how we get to the next level. And then admit the fact that no, I don’t know. Have you used some… Try to sign up for I don’t know, cable TV or whatever, car insurance or whatever. And you’re like, oh my God, this is so confusing and so bad. And I strongly doubt that there’s anyone on their data team that’s looked at the conversion funnel metrics and really instrumented everything and known every point in that journey, where customers are running into friction or whatever. On the other hand, I think those companies have all invested in AI. And I don’t know why because they have stupid things going on in their web experience, fix that first. Like-

Juan:
I truly appreciate this conversation right now because this is the honest no BS stuff that you hear all the time, these companies are investing millions of dollars in the last five, almost 10 years in creating data science teams to go do all this stuff. I’m like, you’ve put in so much money in this. How much have you actually returned off that investment? And by the way, you still have all these little shitty different problems that you have on your website that that’s the true problem, that any normal person could have gone and figured that out without having to go spend millions of dollars on some data science team to go do some fancy AI that is not actually providing value.

Juan:
I mean, the market, the pundits are all talking about AI. I have to have a team of data scientists. I got to do AI stuff and no, you don’t. You got to go look at the basics and this is the stuff that people are not realizing. And this is such a frustration and they have a checklist. I need to ask if we do AI, so do I? I don’t do it, let me go do something and I can keep myself busy. I’m totally with you on this. And there’s just so much BS out there.

Erik:
There’s so much stuff in the world where all you need is a SQL query and a scatterplot. That’s it. [crosstalk 00:42:26]

Tim:
… right. And a curious mind. It seems like there’s a lot of misunderstanding in general around data and the world of data. And I want to think that a lot of people in positions of leadership in different organizations think that maybe their data pipelines are okay and maybe that they do. Oh, yeah, we’ve got Teradata for a long time or whatever. Right? And they’re like, now the next step is AI and machine learning and all those kinds of things. Not realizing that they don’t have the foundation yet.

Erik:
I think that’s right. I mean, AI has become a checkbox that you tell investors in your company on your quarterly report, you’re like, yeah, we’re investing $100,000,000 in AI, so we’re going to be great. But no, because who cares about your chat bot, fix all these stupid bugs in your product instead, maybe you should invest $100,000,000 in that.

Tim:
We need to have an after party event where we just talk about chat bots and complain about that. That would be fun.

Juan:
I mean, yeah, chat bots that’s another thing everybody’s doing though, but no, no, this is the honest, no BS, but okay. So going back to teams, you’re more of a generalist. And something I always ask people is on the balance between centralization and decentralization. So should you have generalists who live on a centralized team in charge or decentralize this? What are your thoughts based on your experience?

Erik:
I’ve found that both extremes are quite bad because… Let’s talk through it for a second. Centralization, and I ran a centralized data team for a while. The problem with centralization is you’re just going to create so many people throughout the organization that are just mad at you because they’re like, I just need help. I’ve asked them like five times, I just need someone who can tell me what’s going on in the data and I feel like nothing’s get… You’re basically creating this gatekeeper team internally, that’s controlling the backlog and no one’s going to be happy with that. The problem with extreme decentralization on the other hand is like, let’s say, okay, you tell all these business people, just go out and build up your own data teams.

Erik:
They’re not going to have any clue what to look for. So they’re going to end up hiring probably not the best people. And then those people are not going to be… Their managers don’t really know what they’re doing, so they’re not going to get the feedback they need and they’re not going to grow. And then they’re going to leave because they want to go somewhere else and work for someone who actually knows what they’re doing. And so I think decentralization, maybe it’s a little bit better on average actually, but I think still it’s not great. So I think the only model that I’ve come to endorse is really some hybrid model and it pains me to do that because generally I think matrix models are quite bad. Spotify is experimenting with a lot of matrix models.

Erik:
I spent several years there and overall I think they’re quite bad. It should be avoided. But I think in this case, there is enough of an argument for the benefits of doing that. So what that means is like you have a centralized supporting structure, you have a data team with a strong head of data or whatever, strong set of individuals who know how to hire. They have internal tools. They have internal standards, they know what good performance looks like. They know what bad performance looks like.

Erik:
And then you have all these data scientists or data engineers or whatever who actually spend most of their day working with business people or product teams or whatever. And on a day to day basis, their backlog is driven more by what those teams need. And I think that’s really the only… There’s a couple of other things that resemble, design tends to be the same thing. QA tends to be the same thing, is a couple of engineering… I used to be a CTO for many years so I thought a lot about this. But generally, other than that, I wouldn’t recommend hybrid models for many other types of jobs. But I think data is one of the professions where it actually makes sense.

Juan:
Right. This is something we agree with. I think the two extremes is something that definitely, I mean, those won’t work. And you bring up something interesting, which is, if you’re completely decentralized, these folks don’t even know how to go hire people, right? They’re not going to go hire the best people anyways, and they’re going to go leave. So that’s another interesting point right there. Look, time flies, we can keep talking, but we need to start wrapping up and we got some lightning round questions here. I think we’re ready. And the lightning round questions are going to bring up some topics that we wanted to talk about, but we’ll leave them for the next one. I’ll kick it off. All right. First one, do you think this no-code, low-code programming is a legitimate trend? Yes or no?

Erik:
Yeah. I mean, I think so with a lot of asterisks, but I think, yes.

Tim:
This also came up a little bit in our chat today, but will companies like Snowflake and Databricks, so this data stack, data layer, do you think that’ll sustainably stay separate from companies like AWS, Google, and Microsoft?

Erik:
I think so. Because I think to what I mentioned earlier, I think the cloud vendors, this is maybe a longer thing, I could write a whole blog post about it. But I think about what they’re good at is running data centers and that’s a very different type of business than running software on data centers. And so I think increasingly, we’ll see a layer above with companies like Databricks and Snowflake who are writing the software and AWS specializing more in offering the hardware.

Juan:
Right. So we talked a lot about the business logic and where it should be. We didn’t talk about data modeling and semantic layers and stuff. So question is, is a semantic layer or the data modeling layer, is this still a big missing piece?

Erik:
Maybe. I don’t know. I think so. I don’t know what… If you’re talking about Semantic Web, I don’t think so. Because that was always a weird thing to me, but I certainly think there’s like more tooling, people talk about data mesh. I feel like it’s weird and like, I don’t really get it. But I do think there’s something there around like, okay, you have a lot of data sets. How do you classify them? Who should have access to what? What do they contain? What’s related to what? What’s clean enough that you can use it for certain things? Right. So maybe that’s like catalog in a way that I’m talking about.

Juan:
All right.

Tim:
Interesting. I’m hearing mostly yes there, but that’s interesting. And when you talk about the hybrid model, I hear mesh there, but it’s interesting the way you talk about mesh. I

Erik:
… mesh is also like it could be anything. And so my worry about mesh is all these companies are going to look at data mesh and they’re going to be like, yeah, we have databases all over the company. A lot of different data warehouses, that’s data mesh. Right? And then some consultant is going to charge them, send them a $100,000 invoice to say, yeah, that’s really good, that’s a data mesh.

Tim:
Yeah. I think that’s a good thing to be worried about. When a concept becomes more about the concept, as opposed to what it’s actually trying to explain and represent. Right? Our last question for the lightning round for you here, so this analytics engineer concept obviously is becoming more of a thing. Do you see analytics engineers versus data engineers or analysts as a legitimate role? Something that’s going to last?

Erik:
I don’t know. Generally, I think we should have fewer roles. I think data engineers should go away. Because if you think about it, data engineers are not doing any business logic. And so, I used to be a data engineer for years. I think in many companies, you should of course hire data engineers. But I think in the long run, it’s almost like the goal of a data engineer should always almost be to make yourself useless. And I think through better tools and better infrastructure, we’re probably going to see less data engineers. Analytics engineers, I don’t know. To me an analytics engineer is kind of the same thing as data science. I don’t know, maybe it’s the same thing. Maybe it’s not. I think we’re pushed into specialization that maybe we shouldn’t pursue. But I think if it’s data scientists that should go or analytics engineer that should go, or maybe analytics scientists, I don’t know. Whatever we end up calling them, I almost don’t care.

Tim:
Yeah. That’s a very honest statement by the way about… We love specializations I think, as an entire industry and we even have a couple of specializations that we like, like data product managers, for example, knowledge scientists is something that we talk about. There was actually an interesting thread on Twitter. And Erik, I don’t know if you were a part of this, you may have jumped in on it about this increasing role of SQL in the stack, and how somebody on Twitter was talking about how they’re seeing more and more data engineers actually becoming… Seeing that their role is getting shifted a little bit away from the infrastructure and the code, non-SQL code and into more of the SQL stack and things like that. Right? Things like dbt and that sort of thing. And that a lot of data engineers are actually looking now to move back into backend software engineering and things like that. And now the analytics engineers are the new data engineers. I don’t know if that has any legs or not, but it’s a very interesting thought.

Erik:
I think that’s true. And I think maybe that’s a sign that data engineers are done. Maybe mission accomplished, we built the tools we needed now let’s move on. Hopefully, analytics engineers in the long run, maybe there’s something similar. I don’t know. Maybe they mutate into data scientists. I don’t know. But I think you’re saying in a way the same thing as I just said. I think, I don’t know.

Tim:
Yeah. I think so. It’s moving up stack, right?

Juan:
You both are in agreement here. So hey, it’s our takeaway time TTT, Tim, take us away with some takeaways.

Tim:
Yeah, sure. Two key concepts Erik, that I really appreciated and took away from this conversation are when you said at the beginning, let people use whatever tools they want, and in the end, adoption is what is good. The tools that get adopted are the ones that people are using. And those are the ones that’ll stick around. I thought that was a good way of thinking about it. And when we started talking about the bad stuff you had started talking about infrastructure and it actually comes full circle with the comments we just had about data engineering, which is that it seems like a lot of data work, traditionally and still, is this sort of pushing and shoving and shoveling of data. Right?

Tim:
And the more we can start to get out of the shoveling of data and the shoveling of cloud machines, and the more we can focus on business logic and business questions, it seems like that’s good. That’s good momentum. That’s good movement. And there’s too many boxes. There will probably be fewer boxes in the future. There will be consolidation. There needs to be consolidation, but there’s a couple of things to watch out for in terms of openness and in terms of price gouging and things like that. Right?

Erik:
I think that’s right.

Tim:
What about you? Oh, yeah, go ahead.

Juan:
There’s this big theme around connecting things to the business. I think this is what you just said right now that for the data engineers, they should not be involved. They should get out of a job and they should be providing value. And that’s the stuff closer to the business, where the business logic is. I think we talked about these declarative languages, like SQL, as important for the business logic. And I think we agree that higher level abstractions are better. The test of time for what’s actually going to be good: things that have been around for 50 years will continue to be around. Right? Like you said, SQL and C was the one, right? And one very important thing you said, what you need is SQL queries, scatterplots, and a curious mind, love that. Hey, Erik, back to you. Thank you so much. What’s your advice and who should we invite next?

Erik:
Wow. What’s my advice? Is keep a curious mind, write SQL queries and scatterplots. I don’t know. I think my biggest advice is, think about what the business needs. Because I think that so many people get excited about tools and they get excited about specialization and I think people sometimes forget about, okay, I’m actually here to build a successful business. And I think that’s always been my best career advice is every morning, come in with the perspective that what is the biggest value that I can add to the business today, and then do that.

Juan:
That’s great advice.

Erik:
Who to invite next. So I was going to say Sarah Catanzaro but sounds like you’re already talking to her, so now I’m drawing a blank.

Juan:
No. Okay, great. I mean that’s fine. I finally met her in person today because she was here in Austin. So I’ll make sure that she listens to this and have her know that you called her out. So. Awesome.

Erik:
Cool. [crosstalk 00:56:23].

Juan:
Erik, thank you so much. This was an awesome conversation. So many good… I mean, very wise things have been said here. A lot of things that go back, we appreciate it. Cheers.

Tim:
Cheers.

Erik:
Cheers. It was fun. [crosstalk 00:56:37]

Juan:
And next week we have Cindi Howson, who’s a chief data strategy officer at ThoughtSpot, and the host of the Data Chief podcast, another good podcast I like to listen to. So stick around, have a great Wednesday. Thanks again.
