Zhamak Dehghani, Director of Emerging Technologies at Thoughtworks, joined the Catalog & Cocktails podcast to chat about the emergence of the data mesh as a concept, why the approach works for eliminating architectural silos, and how it’s producing more data-driven cultures. Below are a few questions excerpted and lightly edited from the show.
Juan Sequeda: Honest, no BS, what is a data mesh?
Data mesh is an approach. It's not a thing. It's an approach for designing and architecting your big analytical data management, based on a decentralized architecture, and for governing that architecture through federated, computational governance. But it also has to address the concern of, how do you do that efficiently and effectively? So it also talks about the foundational infrastructure that you have to put in place.
That's why it confuses people. Because it started as an architectural paradigm, in managing big data and analytics data architecture. But it had to go further, and become more, to not create a mess. It also addresses how to think about the architecture of infrastructure in that space, and how to think about your organizational infrastructure in that space, and how to think about governance of that. That's the paradigm.
You can apply it using different technologies to your organization. It doesn't try to be prescriptive about what technology to use, even though I am very opinionated about that.
Tim Gasper: You don't buy a data mesh, right?
I agree. If you think about this as an ecosystem, you need to have first a set of standards and a set of conventions that we agree upon, as an interface between components and agents within that ecosystem. That doesn't exist for a large part of it. Then you have to figure out, what are the technologies that then plug in and provide different capabilities? I know I'm talking abstract, so let's put it into concrete examples.
When we defined this approach to data management and started building it, there was no prior art. There was no language, no concept I could use to describe the smallest unit of this architecture, something I can put a boundary around. We call this thing a data product, which is the smallest unit of architecture around which you can form teams. Like microservices, let's say, in operational work.
But that thing actually doesn't exist historically. For it to be a truly distributed architecture for analytical use cases, it needs to have access to storage of data in a way that scales. It needs to have the computation, an engine that you can inject computation into, because for a lot of valuable use cases you actually want to run your computation where the data is. It needs to have the APIs and interfaces to serve that polyglot data, and still have a way of injecting your policies around it.
"I want to access the data, but I don't have the access, so give me the differential privacy mode of access, so I can just do analytics, seeing the forest without seeing the trees." There's just so much new to it, encapsulated in something that can be a meaningful unit of your architecture.
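To make the idea of a data product as a single unit more concrete, here is a minimal Python sketch. It is an illustration only, not something described on the show: the `DataProduct` class, its method names, the `row_level_readers` policy, and the `orders` example are all hypothetical. It assumes an in-memory list stands in for scalable storage, and it uses Laplace noise on a count as one simple way a "differential privacy mode of access" might look.

```python
import json
import random
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DataProduct:
    """Hypothetical sketch: data, compute, and policy bundled as one unit."""

    name: str
    records: list  # stands in for storage that scales
    row_level_readers: set = field(default_factory=set)  # access policy
    epsilon: float = 1.0  # privacy parameter for noisy aggregates
    rng: random.Random = field(default_factory=random.Random)

    def run(self, user: str, computation: Callable[[list], object]) -> object:
        """Run an injected computation where the data lives.

        Only users with row-level access may run arbitrary computations.
        """
        if user not in self.row_level_readers:
            raise PermissionError(f"{user} has no row-level access to {self.name}")
        return computation(self.records)

    def serve(self, user: str, fmt: str = "json") -> object:
        """Serve the same data in more than one form (a nod to polyglot data)."""
        if fmt == "json":
            return self.run(user, lambda rows: json.dumps(rows))
        if fmt == "rows":
            return self.run(user, lambda rows: list(rows))
        raise ValueError(f"unsupported format: {fmt}")

    def private_count(self, predicate: Callable[[dict], bool]) -> float:
        """Answer a count with Laplace noise, without exposing any rows.

        A count changes by at most 1 when one record changes, so the noise
        scale is 1 / epsilon. The difference of two exponential draws with
        rate epsilon is Laplace-distributed with that scale.
        """
        true_count = sum(1 for row in self.records if predicate(row))
        noise = self.rng.expovariate(self.epsilon) - self.rng.expovariate(self.epsilon)
        return true_count + noise


# Hypothetical usage: the owning team sees rows, an analyst gets noisy totals.
orders = DataProduct(
    name="orders",
    records=[{"amount": 10}, {"amount": 250}, {"amount": 300}],
    row_level_readers={"orders-team"},
)
total = orders.run("orders-team", lambda rows: sum(r["amount"] for r in rows))
approx_large_orders = orders.private_count(lambda r: r["amount"] > 100)
```

The point of the sketch is the boundary: storage, injected computation, serving interfaces, and policy all sit behind one object that a single team could own.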
Then how do we even talk about the technology, when we don't have a language to describe the pieces of the architecture that we need to build? We've got to build a language first. We've got to build a system of defining this world. We've tried to create that language to some degree. Then we can think about, "Okay, how do I plug in the technology that exists today, underneath and above?"
Juan: Is everything decentralized? Or is some part centralized? What's the true balance here?
I always try to be pragmatic and see this as an equilibrium that we constantly have to manage and sustain. I sometimes feel that centralization and decentralization are in fact two sides of the same coin. The way I think about it is that, the moment we decentralize in terms of the data ownership around domains, and sharing through your APIs and domains, and all of those things, in that moment you realize, "Oh, now if I go and decentralize all the way down to the bottom of this stack that supports this model, to the bare metal, does it mean that now every one of my teams, and every domain, builds its own decentralized stack, and hopes that they would also talk to each other?"
But from the cost perspective, and just pragmatic reasons, is that possible? Probably not. So then what you end up doing is saying, "Okay, I'll keep a layer of utilities to these domains. The tech stack that they need to build these data products." Likely from their perspective, they're seeing this as a centralized layer of APIs. A centralized platform. Within that, you can still have decentralization. You can have different teams looking at different aspects of it.
But to have that ease of use of that technology, it's probably a centralized layer, from the perception of the user perhaps. A centralized layer of utilities that they can use. Then within that you can, again, have decentralization. "Okay, I do access management. You do encryption. I do storage. You do pipelines." Whatever it is, that fits in there.
Tim: How do you think about getting started with this kind of approach?
I would think about it very pragmatically. Why did we want to decentralize in the first place? Because we wanted to mirror how we are decentralizing our business, and other applications. If you haven't, then don't bother with data mesh perhaps. But if you have, and if you have different teams already responsible for different functions within your business, or capabilities within your business, then just use that as a starting point.
If you don't have that platform capability yet, to allow these autonomous teams, well maybe there is a point in time that you go from a centralized model to a decentralized model. Because having the economy of scale, where every team runs around and does its own thing, and has its own data, and yet these things are connected, and yet these things are monitored and understood at a global level, requires a level of maturity of the platform that enables that.
Then there is the axis of evolution: where you start within the adoption curve of data mesh within your organization, or the curve of transformation. Where you start looks very different from where you end. Then you have to be pragmatic: "Where I am today, does it make sense to have 50 of these teams running around?" Probably not. But mirror your business. Mirror how your world has been distributed.
Juan: Is a data fabric a data mesh?
No, but they're complementary. Think about data fabric, when it was created by NetApp folks, and what problem they tried to solve. They tried to solve access to data wherever it is, and being able to integrate it. That was a point when people were going to the cloud, so they had to solve the problem of hybrid. I've seen data fabric implementations where they get this data extracted from all sorts of databases placed everywhere, but at the end of the line they dump it into a lake, or a warehouse, to actually run analytics on it.
I think fabric can be the bottom layer of the stack. Your bare metal layer of the stack. Then look at it logically, with a new set of technologies as a mesh overlaid on that. I think there is synergy, and they complement, but they're not the same thing.
- Data mesh is not a “thing.” It’s an approach based on decentralization.
- Don’t overthink it. Do what makes sense for your organization.
- Compute + Policy + Data = one autonomous unit.
Visit Catalog & Cocktails to listen to the full episode with Zhamak. And check out other episodes you might have missed.