What holds the modern data stack together and makes it the architecture of choice for so many data-driven enterprises? Tim and Juan were joined on the Catalog & Cocktails podcast by special guest, Nick Schrock, founder of Elementl and creator of Dagster and GraphQL, to chat about all things MDS. Below are a few questions excerpted and lightly edited from the show.
Honest, no BS, is the modern data stack just a bunch of technologies?
Well, I would describe it as an emotional state. I actually think it started out as a stack of technologies, but it is a methodology and a mindset. And the way I frame it in my head is that we’re effectively rebuilding data infrastructure from the ground up in the cloud era and also what I’ll call the modern era, modern being defined as every enterprise in the universe has complex data needs. They’re ingesting from all sorts of SaaS services, and being able to effectively use data is a base-level capability and expectation. What does that mean? One is that the cloud data warehouse, or maybe a lakehouse, but some sort of centralized store like that, is the center of a company’s data universe. Two is that they use and bias towards using managed services. And three is that there’s this software engineering mindset when it comes to data. I think that’s a really interesting thing to dig into because I think there are a lot of misconceptions around that.
The short answer: I think the modern data stack started out as a fairly narrow definition where it’s like, “Okay, you choose a cloud data warehouse, you have dbt, an ingest tool, a BI tool,” and that’s a modern data stack. But when people encounter the reality of the world, their needs expand, and they grab for more tools. Recently, for example, the modern data stack has expanded to include reverse ETL, and I think that expansion is going to continue, which means it’s not a static set of technologies. It’s a mindset and a methodology about how to build data infrastructure and data platforms.
So, we start with these four: cloud data warehouse, dbt, ingest, and analytics. You say it’s not static, so it’s dynamic. How is this extending?
The way I see it is that companies are building up their data infrastructure from scratch, they’re cloud first, and they’re answering the questions that you answer in order. Meaning that the first thing you do is that you count things, like understanding very basic metrics about your company or your enterprise. How many users do we have? What is our revenue? In order to do the basic counting, it’s like, “Oh, interesting, we’ve ingested our data from all our different sources. We want to re-inject that into those SaaS products so that we can surface the right information to stakeholders in their native tool.” Then you have reverse ETL.
But then the people who build these data platforms naturally want to expand things, so maybe they want to build ML and experimentation platforms, it’s very naturally adjacent. Most of the work is in the data processing, so there’s natural bleeding between those two use cases. And then, things just expand beyond the scope of only SQL computation in general. People need to write custom code to do lots of things, etc.
So, to me, the modern data stack is simply following the natural evolution of what happens within a company when you’re starting to expand. And then it’s like, “Oh, we have so much data that we need to catalog it.” Right? And then you start looking to cataloging tools. “Oh, there are enough stakeholders here and there are enough teams that we need to start doing data lineage.” There’s a natural expansion as you invest more and have more capabilities in your data platform. And I think, effectively, what’s happening in this modern data stack landscape is that people expand and grab for more tools to solve those problems that they absolutely need to solve.
Where does orchestration come into this?
That’s a great question. So, I was actually on a different podcast. Apologies, I cheated on you. I was on a different podcast a couple of weeks ago, alongside Scott from Brooklyn Data, and he was describing a data platform without orchestration. It’s like a bunch of kids in a sandbox and they’re not even talking to each other, they’re just doing their own thing, but what they really need to do is coordinate and work together. And that’s really where orchestration comes in.
So, in my mind, there are a couple of things you need to do. One is adding operational robustness to your existing tools. So, without an orchestrator, what’s interesting is that we’ve regressed in the modern data… We’re talking about the modern data stack mostly. If someone’s using Fivetran, dbt Cloud, and a reverse ETL tool, they’re now stuck in a world where they have overlapping jobs, where you just have to hope and pray that one works after the other. If something goes wrong, you have no tool where you can debug things across those tools. You have three sandboxed operational tools, and you’re scratching through logs in each of them and figuring out, “Wait, was the error in the previous tool?” You have no single pane of operational glass. That’s a problem. The data isn’t as up to date as you want.
And then, God forbid, you want to do a computation which cannot be expressed in Fivetran or a reverse ETL tool or SQL, what do you do? Right?
You have to write some code to do any number of things. So, this is really where orchestration comes in. I’d like to say that orchestration really comes in when you need to start assembling your modern data stack into a platform where there’s a single operational pane of glass, you want things to be more robust, and you need to use a heterogeneous tool set.
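To make the “hope and pray that one works after the other” problem concrete, here is a minimal sketch in plain Python of what an orchestrator adds: explicit dependencies between steps and one place to see what happened. It is an illustration only, not Dagster’s API or any vendor’s. The names `Step`, `run_pipeline`, `ingest`, `transform`, and `reverse_etl` are hypothetical, and the transform failure is simulated.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Step:
    """One unit of work plus the names of the steps it depends on (hypothetical type)."""
    name: str
    run: Callable[[], None]
    depends_on: List[str] = field(default_factory=list)


def run_pipeline(steps: List[Step]) -> Dict[str, str]:
    """Run steps in the order given (assumed to be dependency order).

    A step is skipped unless every upstream step succeeded, and every
    outcome is recorded in a single status dictionary.
    """
    status: Dict[str, str] = {}
    for step in steps:
        if any(status.get(dep) != "success" for dep in step.depends_on):
            status[step.name] = "skipped"
            continue
        try:
            step.run()
            status[step.name] = "success"
        except Exception as exc:
            status[step.name] = f"failed: {exc}"
    return status


def ingest() -> None:
    pass  # placeholder: would trigger an ingest sync


def transform() -> None:
    raise RuntimeError("model build error")  # simulated failure for the example


def reverse_etl() -> None:
    pass  # placeholder: would push results back into SaaS tools


steps = [
    Step("ingest", ingest),
    Step("transform", transform, depends_on=["ingest"]),
    Step("reverse_etl", reverse_etl, depends_on=["transform"]),
]

status = run_pipeline(steps)
for name, outcome in status.items():
    print(f"{name}: {outcome}")
```

Because the reverse ETL step declares its dependency on the transform, it is skipped rather than run against stale data, and the one status report shows which step actually failed. A real orchestrator adds scheduling, retries, and logging on top of this idea.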
Is modern data stack just really for the smaller companies or the younger companies that are earlier on their journey?
I don’t think so. It kind of goes back to the original premise of the discussion, which is: is it a methodology or a very narrowly prescribed set of technologies? So, that’s what I think is interesting. Should companies, to use that framing, move to the cloud, move to managed services, and apply a software engineering mindset to their data processes? I would say yes. You can call it the modern data stack or not, but that’s an undeniable win. And then you also see companies incrementally adopting technologies in the modern data stack within their organizations, as well.
And then the other thing that… for example, one of the reasons why Snowflake’s doing so well is that they’re doing such a great job of lifting and shifting workloads from on-premises data warehouses to them. They were kind of playing a different game the whole time, where I think a lot of the engineers in the Valley would be like, “Oh, you’re going to get people to migrate from Hadoop to Snowflake?” It’s like, “Guys, 99% of the world is still on their on-premises data warehouse and has no ability to use Hadoop at all.” They’re jumping straight ahead. I think that there’s going to be a similar thing, because as people adopt the cloud data warehouses and they have the same problems as everyone else, they’re going to be grabbing for the next tool. So, I think that the shift towards these technologies, towards this style of data infrastructure, is inexorable, both for greenfield and existing companies.
What are the do’s and the don’ts for more mature companies to start in that modernization process?
What I have learned through my years is to have an incremental process in place so that every stage of a migration, moving from one technology to another, feels like you’re just hiking up a hill rather than jumping over a canyon. So, always construct these migration processes such that there are always intermediate checkpoints where you can stop, assess, and understand what’s going on. That way the people who are participating in the migration get value as early as possible, so that they can see the promised land, as opposed to being promised that in two years life is going to be better. I like to call this evolutionary means for revolutionary ends. Have a strategy, have a high-level vision, but have an incremental process where you can stop and check and make sure that things are on track, and then deliver value to your stakeholders as early as possible in the process.
- Assembling MDS tech into a platform.
- All you need is Simple MDS: cloud data warehouse, dbt, ingest, analytics.
- Work together for operational robustness.
Visit Catalog & Cocktails to listen to the full episode with Nick. And check out other episodes you might have missed.