Data is from Mars. Data Science is from Venus.

by | Mar 23, 2022 | 2022, data architecture, data mesh, Data-driven cultures

  • Today’s explosion of data is unstructured, feral, even wild. It is heterogeneous, or simply diverse.
  • This contrasts with the way our technologies have long sought to organize data into usable form, by making it homogeneous – or uniform.
  • Enter ‘data mesh,’ a radical new concept reordering decades of data governance best practice.
  • These new ways to both think about data and manage it will at last align the planets of the emerging, data-driven economy.

 

Data is from Mars. Data science is from Venus. Which is a problem we need to fix for those on Earth.

So first, my apologies to author John Gray, whose book and eponymous metaphor famously argued that men and women are each acclimated to their own planet’s customs and ways – but not to those of the other. I borrow this from Gray because a similar metaphor well describes the interplanetary conundrum that bedevils our elusive command of data.

To frame this a bit more scientifically, data is heterogeneous, and it grows more so by the day. Data management, by contrast, craves homogeneity, and we struggle endlessly to shape data into homogeneous forms. Therein lies the epic dilemma of our age.

No wonder that confidence in data as a resource is falling, that our businesses and institutions are failing to integrate it, and that the silos where we store these assets are growing ever more airtight.

The situation is far from hopeless. And I’ll get to the help that’s on its way, the fast-emerging new paradigm, prosaically dubbed “data mesh.” My colleagues at data.world have been writing about this new form of data governance, and it will be center stage among the knowledge superheroes keynoting our fourth semi-annual data.world summit on April 7 at 11am CT. But without stealing their more frontline thunder, I want to walk quickly through the history that got us to this troubling juncture and where I see the offramp.

 

A world awash with data

Now, I have written extensively on the evolution of data and I won’t belabor that except to say again that we are in the earliest days of a transformation that will remake civilization atop a new economy driven by data. We all understand by now that the classics of land, labor, and capital are fast giving way to this new if amorphous class of asset. It’s easy to embrace the cliche that data is the “new oil” except for the fact that the analogy obscures much more than it illuminates: Oil’s value lies in its scarcity; data’s value lies in its ubiquity.

For it’s almost impossible to exaggerate the size and scope of this would-be asset that surrounds us. Just today, as you read this, we will create an estimated 2.5 quintillion bytes of data. Just what is 2.5 quintillion? Well, multiply the cost of the historic infrastructure bill President Joe Biden signed into law last November by 2.5 million. Or if you took 2.5 quintillion pennies and spread them over the surface of the Earth, this new skin of our planet would be five pennies deep.
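For readers who like to check such comparisons, here is a rough back-of-envelope sketch in Python. The constants are my own rounded assumptions, not figures from any source cited here: a US penny 1.905 cm across, about 510 million square kilometers for the whole globe, and about 149 million for land alone. It also assumes each penny covers exactly its own disk, ignoring the gaps between coins.

```python
import math

# Figure taken from the text above.
PENNIES = 2.5e18                 # "2.5 quintillion"

# Rounded assumptions for illustration only.
PENNY_DIAMETER_CM = 1.905        # US one-cent coin
CM2_PER_KM2 = 1e10               # 1 km = 1e5 cm, squared
EARTH_TOTAL_KM2 = 5.1e8          # whole globe, oceans included
EARTH_LAND_KM2 = 1.49e8          # land only


def penny_layers(count: float, area_km2: float) -> float:
    """Approximate depth, in pennies, if `count` coins cover `area_km2`."""
    penny_area_cm2 = math.pi * (PENNY_DIAMETER_CM / 2) ** 2
    return count * penny_area_cm2 / (area_km2 * CM2_PER_KM2)


# 2.5 quintillion is 2.5 million times one trillion (integer arithmetic).
assert 2_500_000 * 10**12 == 25 * 10**17

print(f"land only:   {penny_layers(PENNIES, EARTH_LAND_KM2):.1f} pennies deep")
print(f"whole globe: {penny_layers(PENNIES, EARTH_TOTAL_KM2):.1f} pennies deep")
```

Under these assumptions the land-only figure comes out a little under five layers, close to the "five pennies deep" above, while the same arithmetic over the whole globe, oceans included, gives between one and two layers. The infrastructure comparison likewise works out exactly only if the bill is rounded to $1 trillion.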

It’s not just the daunting volume, but the fact that most of that data is “unstructured,” as we in data science put it. It is wild, or often more specifically feral. It comes in the form of health care data from the 98 million X-rays that will be performed today worldwide. It will be in the form of the six billion text messages Americans will send today. It’s to be found in the 720,000 hours of video that will be uploaded to YouTube today – about five petabytes of data when streamed, equal to 2.5 trillion pages of printed text.

So call this single day’s work unstructured, or feral, or wild. But I find it more useful to think about it as heterogeneous, or simply diverse. Which contrasts with the way our technologies have long sought to organize data into usable form, as Excel spreadsheets, as JPEGs, as XML, or as a metadata schema. These make data useful by making it homogeneous – or uniform.

And it’s not easy. Organizing data in homogeneous fashion is complicated. Just as with so many achievements in technology. Building ENIAC, the “Electronic Numerical Integrator and Computer,” in 1944, with its 18,000 vacuum tubes, was complicated. Inventing the first data language, COBOL, for “Common Business-Oriented Language,” in 1959 was complicated. I’m dating myself, but I wrote my final MIS 333K project in COBOL in 1994 at the University of Texas at Austin. But I digress. Revolutionizing data management with SQL, for “Structured Query Language,” two decades later was complicated as well. So was the founding of Amazon Web Services in 2006, one of the primary innovations that gave us the “cloud.”

In short, all the means to manage data until now have been extraordinarily complicated. What’s been lacking, however, is an approach that is complex. Often treated as synonyms, these two terms are in fact antonyms. Said differently, the opposite of complicated is not simple. Simplicity and complicatedness are just gradients on the same scale. The opposite of complicated is complex. 


Complex solutions for complicated data problems

Which is why we need, in the phrase of data mesh inventor Zhamak Dehghani, an approach to data that respects the reality that it resides in a complex and “broad ecosystem.” It’s understandable that we would try to convert a complex ecosystem to its merely complicated components. This is the history of human endeavor. On our journey to modernity we have converted forests to orchards, wild jungles to monoculture farms, fisheries to aquaculture, and rivers conveying commerce into dams transforming kinetic energy into electricity.

Marvels to be sure, these are solutions to complicated problems, made possible by breaking down heterogeneity into manageable homogeneity, into addressable, hierarchical rules and recipes, systems, processes, and algorithms. Complex problems, of unknowns, of infinitely interrelated factors, don’t yield to our tool kit of complicated solutions.

As with all paradigm shifts, the language to fully articulate this transformation is unsettled. Often data mesh is described as a means to decentralize or “federate” data to be managed and used as a “product” by the teams who can apply it directly to their needs. The raw sum of a COVID-19 research database might be drawn upon differently and fashioned into one unique dataset by a team of epidemiologists trying to anticipate the next outbreak, and into another dataset by scientists making real-time decisions on vaccine development. That this departs from the old models of centralized data “lakes” or “warehouses” has been compared to the 1980s revolution whereby PCs disrupted the old model of computer mainframes. 
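To make the “data as a product” idea a little more concrete, here is a minimal sketch in Python. Everything in it is hypothetical and invented for illustration: the `DataProduct` class, the team names, and the four toy records. It is not how any particular data mesh platform works. It only shows two domain teams each publishing their own dataset from the same shared raw source.

```python
from dataclasses import dataclass
from typing import Callable

Record = dict  # one raw row; a toy stand-in for a real schema


@dataclass(frozen=True)
class DataProduct:
    """A dataset owned and published by one domain team (hypothetical)."""
    name: str
    owner: str
    build: Callable[[list[Record]], list[Record]]

    def serve(self, raw: list[Record]) -> list[Record]:
        return self.build(raw)


# Invented sample data standing in for a shared research database.
RAW_RECORDS = [
    {"region": "north", "week": 1, "cases": 120, "variant": "A"},
    {"region": "north", "week": 2, "cases": 310, "variant": "B"},
    {"region": "south", "week": 1, "cases": 80, "variant": "A"},
    {"region": "south", "week": 2, "cases": 95, "variant": "B"},
]


def weekly_growth(raw: list[Record]) -> list[Record]:
    """Epidemiologists' view: last-week over first-week cases per region."""
    by_region: dict[str, list[int]] = {}
    for row in sorted(raw, key=lambda r: r["week"]):
        by_region.setdefault(row["region"], []).append(row["cases"])
    return [{"region": region, "growth": cases[-1] / cases[0]}
            for region, cases in by_region.items()]


def variant_totals(raw: list[Record]) -> list[Record]:
    """Vaccine scientists' view: total cases per variant."""
    totals: dict[str, int] = {}
    for row in raw:
        totals[row["variant"]] = totals.get(row["variant"], 0) + row["cases"]
    return [{"variant": v, "cases": n} for v, n in sorted(totals.items())]


outbreak_watch = DataProduct("outbreak-watch", "epidemiology", weekly_growth)
vaccine_targets = DataProduct("vaccine-targets", "vaccine-research", variant_totals)

print(outbreak_watch.serve(RAW_RECORDS))
print(vaccine_targets.serve(RAW_RECORDS))
```

The design point is ownership: each team decides what its product looks like, and neither needs a central group to reshape the raw records on its behalf.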

A more recent example for analogy is the accelerating shift from monolithic software architecture to the modular units of code in a microservice architecture that can be deployed independently. Think of the migration by Amazon and Netflix from the lumbering systems where everything from transactions to inventories to customer profiles was stored in a single place, to the nimble models of cloud-based applications based on the humorously named “two-pizza” model – meaning no dataset should require a team too large to be fed with two pizzas.

Yet another approach is to leverage knowledge graphs to create a connected, queryable web of your data and business context that mirrors how the human brain conceives of associated thoughts and ideas. These approaches are often combined, providing easy insight into knowledge the data unveils.
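As a rough illustration of what “connected and queryable” can mean, here is a toy knowledge graph in plain Python. The triples, table names, and helper functions are all made up for this sketch. A real deployment would more likely use a graph database or an RDF store than a Python set.

```python
# Each fact is a (subject, predicate, object) triple. All names are
# hypothetical and exist only for this illustration.
TRIPLES = {
    ("orders_table", "owned_by", "sales_team"),
    ("orders_table", "describes", "Order"),
    ("Order", "placed_by", "Customer"),
    ("customers_table", "describes", "Customer"),
    ("customers_table", "owned_by", "marketing_team"),
}


def match(subject=None, predicate=None, obj=None):
    """Return every triple that fits the pattern; None matches anything."""
    return sorted(
        t for t in TRIPLES
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


def owners_of_concept(concept):
    """Follow two hops: concept <- describes - table - owned_by -> team."""
    tables = [s for s, _, _ in match(predicate="describes", obj=concept)]
    return sorted({team
                   for table in tables
                   for _, _, team in match(subject=table, predicate="owned_by")})


print(owners_of_concept("Customer"))
print(match(predicate="owned_by"))
```

The two-hop query links a business concept to the team responsible for its data, which is the sort of association a table-by-table catalog tends to leave implicit.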

 

Moving beyond the monolith

The point being that we must move beyond monolith-based solutions – data lakes or data warehouses – that slow innovation at the precise moment we need to accelerate it. To borrow a phrase from another context, scientist Gary Marcus described related challenges in artificial intelligence in an instructive way: “Imagine a world in which iron makers shouted ‘iron,’ and carbon lovers shouted ‘carbon,’ and nobody ever thought to combine the two.”

To carry forth Marcus’ analogy, data mesh seeks to combine the brilliance of the best data science with the luminance of domain expertise, the result not being steel but what we at data.world call “knowledge superheroes” who are managing and using their own data – with help from the data engineering and science teams to be sure – in precise, mission-specific ways. Data-driven enterprises can now execute at speeds 1,000 times faster, and do so without destroying the diversity and heterogeneity that create the value of massive data resources in the first place.

This is our tool set as we now move into the age of complex problems. Climate change is complex. Zoonotic disease is complex. Satoshi Nakamoto’s “Genesis Block” may prove complex. And most important of all, the emerging “data ecosystem” fast-enveloping us in endless ways is of infinite complexity. Heterogeneity is the future of data science because it reflects data’s natural order of volume, variety, and velocity. Homogeneity is the past, as are top-down, command-and-control models.

The data planets of Mars and Venus are aligning. We are beginning to harmonize the skills of data scientists with the expertise of subject matter experts. This approach aligns with the natural behavior of successful humans: to be collaborative, and to leverage each other’s strengths. It’s becoming increasingly known as data mesh, about which we’ll talk much more at our upcoming summit and on this blog. 

We’ll also be exploring data mesh at the data.world summit on April 7 at 11am CT. I’d love it if you joined us.