“The story of data is replete with contests: contests to define what is true, contests to use data to advance one’s power, and, on occasion, contests to use algorithms and data to shine a light into darkness and to empower the defenseless.” – “How Data Happened – A History from the Age of Reason to the Age of Algorithms” by Chris Wiggins and Matthew L. Jones. Norton, 367 pp.

It hardly needs to be said that artificial intelligence, especially generative AI, has burst into relevance for virtually every aspect of our lives. I’m certainly among those who believe that with AI, humanity will flourish as never before. From resolving disparities in income and education to confronting challenges led by climate change and health care, AI is a tool with wide-reaching utility. As a favorite example, my favorite TED talk at this year’s TED was Sal Khan’s on Khan Academy’s use of OpenAI’s GPT-4 to create an AI-based tutor for both students and teachers as a great equalizer. The ever more lively discussion, however, is incomplete. For the discourse is focused on the remarkable tip of a technological iceberg – the riveting apps of ChatGPT, Bard, Claude, Pi, GitHub Copilot, and their countless cousins.

Sure, this family of AI apps is impressive and there’s much to talk about. But that discussion – centered on whether to embrace, to fear, to regulate, to pause, or accelerate – will be greatly enriched by a more robust focus on the mass of the technological iceberg below the AI waterline – the underlying evolution of data and its mastery. The bad news is that this broader and deeper  discussion still eludes us; so much of the global discourse veers toward dystopian science fiction and fantasy.

The good news, however, is that this yawning gap in today’s AI debate is filled elegantly by Columbia University professors Chris Wiggins and Matthew L. Jones in their indispensable new book that explores humanity’s long journey to today’s inflection point, a moment comparable to our discovery of fire.

The AI ecosystem has evolved to become “a mechanism for leveraging human judgment at scale”, they write, propelled profoundly by the leap a quarter century ago of Google founders Sergey Brin and Larry Page, who mastered analysis of metadata to move web search beyond the constraints of mere data mining. We’re now scores of technological leaps further. But, counsel Wiggins and Jones: “No matter how powerful the algorithm or extensive the data, if one fails to embed this data analysis within broader forms of knowledge, scientific and humanistic alike, that so-called knowledge should be seen as incomplete at least, dangerous at worst.”

I would frame that slightly differently, as a kind of duality. Yes, we need to embed data analysis with broader knowledge. But we also need to listen more effectively to what the data’s knowledge is already telling us.

This duality is only becoming more vivid as we witness the unfolding of AI at dizzying speed, in which we all experience and understand this pragmatically. Who among us has not, for example, in our early use of these tools, experienced the pivot between stunning insights and frustrating “hallucinations” simultaneously served up by your AI of choice? But don’t blame the Large Language Models, or LLMs. The root dilemma, I’ve argued, is that data is from Mars while data science is from Venus. Alignment of these metaphorical planets is a work in progress, and one we need to better understand.

Will We Regulate the Forest or the Trees?

My point here – and an implicit point of How Data Happened – is that we talk endlessly about the use or abuse of so-called “training data” for AI, so much so that public revelation of training data is the newest and most debated provision of regulation the European Union proposed several weeks ago for AI. It’s a body of rules that may become a model for emulation worldwide – perhaps for better but probably for worse. Yet all but absent from this global discourse is the creation, quality, and potent dynamism of the underlying datasets themselves. My own readers know that this is a topic near to our hearts at data.world, where our navigational tool known as the data catalog is quickly becoming the most widely deployed data tool of all time, reaching far beyond the traditional data user personas and into the larger organization as a whole.

I’ll get to this detailed yet accessible masterpiece of Wiggins and Jones in a moment. But first just a bit of my own snapshot of the landscape and its history.

As I wrote in April on the opening day of the annual TED Conference I joined in Vancouver, the current AI narrative is akin to watching a debate on the impact of the transport sector’s greenhouse gas emissions with virtually no discussion of the difference between the viscous bunker fuel in cargo ships, the refined petroleum used by most cars, the aviation turbine fuel used by jets, or the role of wind power enabling EVs. As the essential means of AI cognition, all datasets are not equal; they are no more alike than the various greenhouse gasses, including methane, carbon dioxide, and nitrous oxide, which impact and interact with our biosphere in profoundly different ways to create different outcomes.

Please don’t misunderstand. This frustration doesn’t mean I diminish the importance of the moment. I certainly share the belief of such seers as visionary technologist and venture capitalist, Marc Andresssen, who recently and eloquently argued that AI will in fact save the world. As he put it, this technology “matters perhaps more than anything else ever mattered”. We underestimate how humanizing AI can be, he writes, and why our need is less to have the government rescue us from AI, but rather to have AI rescue sectors burdened by government intervention – specifically inflation-riddled sectors like housing, education, and health care.

We need to go forward toward a data-driven AI with humans in charge, a case I made nearly a year ago before the AI acronym became such a household word with the launch of ChatGPT at the end of November 2022. But to do so effectively, we also need to grasp the deep history of data’s evolution that led us to this point.

As I wrote in a series back in 2021, data’s use traces at least to the Sumerians’ cuneiform tablets 5,000 years ago, but has only recently become the “nervous system and brains” of our civilization. In another series earlier this year, I wrote of data’s nearer evolution in the past half century, a chronicle of the last “four surges” of data’s development – including the most current surge, which is the story behind the story of the unfolding AI startup boom. It is a boom enabled in no small measure by the advent of the “knowledge graph” that is driving the surge in ways analogous to the dawn of genetic science enabled by the discovery of DNA’s double helix.

It’s against this historical framing of the emerging data-driven, AI-empowered society and economy that we’ve been shaping our own engagement with the AI revolution. From the work of my co-founder Jon Loyens to pioneer the concept of Agile Data Governance (evolving into our expanded data governance product announced on June 22) to our founding of data.world’s AI Lab, an industry first led by my colleague Dr. Juan Sequeda, to our new platform with generative AI bots to enhance data discovery and analysis productivity with global customers like WPP, we’re moving at the speed of AI. To my central point above, soon we’ll have more to say on the tools we’re fast developing to track the origin, transformation, and use of the datasets used to train LLMs – connected by data lineage – that will supercharge AI.

The Renaissance, Warfare, and Venture Capitalism

But this summary of my own thoughts and reflections is a snapshot; Wiggins and Jones have given us an intellectual history of data that is truly panoramic.

As How Data Happened is a sweeping book, I won’t attempt to summarize all the insights; different readers from various disciplines will take diverse lessons. But the insights I think most useful and informative to today’s AI discourse are these three:

And a coda if you will, across all these dimensions of data’s history, is that virtually nothing was inevitable. The book’s title is in fact ironic, as data didn’t just happen. Rather, the current state of our knowledge-driven society is the result of intentions, relentless work, and infinite human creativity. Data is made, not found. 

Let’s walk through these three insights and a coda in more granular detail:

From ancient China to pre-Columbian Incan civilization in the Andes, rulers and societies everywhere have long collected information about land and the people on it. But statistical thinking in Europe, and the colonies that became the United States, took off from the 1700s onward: “Numbers haven’t always been the obvious way to understand and exercise power,” Wiggins and Jones write. But by the 18th century, “War required money; money required taxes; taxes required growing bureaucracies; and these bureaucracies needed data.” Near the close of the 18th century, the new United States enshrined the census in our founding law, the Constitution. 

“Then as now, numbers were political,” they write.

The gallery of scientists behind the new enabling science of statistics is large. But among their numbers the Belgian astronomer Adolphe Quetelet, who gave us rules for the laws of averaging, is key. In a sense, you might argue his “body mass index” – the idea of a statistically average person – was the forerunner of today’s all-important Gross National Product, or the basket of indicators the Federal Reserve used last month to add another notch upward on interest rates – one hopes the last.

Most of us know of Florence Nightingale as the founder of the modern nursing profession, the woman who heroically cared for the wounded during the Crimean War in the 1850s. But less is made of the fact she was a serious statistician, whose data on mortality, length of hospital stays, readmission rates, and evidence of hygiene affecting outcomes revolutionized health care.

There is also Francis Galton, who gave us the famous “Galton Curve” to illustrate the “central limit theorem” better known as “regression to the mean.” But it’s not all heroics. Galton also coined the term “eugenics”, the idea of quantifiable racial superiority that not only reinforced many racist laws and practices in the United States, but also supported extremist politics in Europe to include the Nazis in justifying the horrors of genocide.

Then as now, data could be used for good or evil. Note that when we launched the manifesto for data practices, cowritten alongside some of the world’s most famous data scientists, we collectively put ethics front and center.

This section also includes pioneering use of data in business, notably by brewer Edward Guinness who hired a trio of scientists – William Gosset, Ronald Fisher, and Jersey Neyman – to improve the production of beer, the growing and refinement of ingredients, and the transformation of business practices. As just one example of the impact of Guinness and his team, one can draw a direct line from this account of math, science, and data use to Measure What Matters, a book by pioneering venture capitalist John Doerr whose work on “Objectives and Key Results” is foundational to our team at data.world, as I covered in detail in Chapter 19 of my book The Entrepreneur’s Essentials (which is available for free online at that link). And data in business decision-making has never been more imperative than now, in this moment of accelerating economic headwinds. 

All of this knowledge was accumulating, of course, when we got to World War II, and the now famous logician Alan Turing. Much of this history is certainly well known: the Bletchley Park code crackers who defeated the Nazi’s Enigma machine, the first mammoth ENIAC computer at the University of Pennsylvania in 1946, and the rise of Cold War science to include the founding of the National Security Agency, or NSA, in 1952, which within three years had 2,000 data-gathering posts around the world.

“For all the world of geniuses like Turing, Bletchley mattered because it made data analysis industrial,” the authors write.  

This section on the second half of the 20th century is important not just because we witness, to the authors’ point, how data analysis became “industrial”, but because it sets the stage for the ongoing tensions with the AI discipline today. While the long debate among scientists and engineers about the primacy of data and machine learning vs. logically-driven symbolic reasoning has faded, there lingers disagreement about the synergy between the two approaches and how to “train” the underlying data, how long to train it, and how to optimize the so-called parameters – the “common sense” of LLMs, as Yejin Choi covered at TED this year. At data.world, we would argue that knowledge graphs and the logical manipulation of data as formal symbols is fundamental.

“This was not data in search of latent truths about humanity or nature. This was not data from small experiments recorded in small notebooks,” Wiggins and Jones write. “This was data motivated by a pressing need – to provide answers in short order that could spur action and save lives. Answers that could only come from industrial-scale data analysis.”

I won’t belabor the scientific complexity and philosophical difference chronicled here that have divided artificial intelligence researchers for the past 60 or 70 years. The Head of our AI Lab, Dr. Juan Sequeda, has an interesting perspective on it in this article on knowledge graphs in the Communications of the ACM journal.

Even the scientist who coined the term “artificial intelligence” in the 1950s, the mathematician John McCarthy, was skeptical of data and focused on the idea that logic and symbolic reasoning were the essence of human intelligence and could be emulated in computers. In fact, the ascendance of data-driven AI, that behind the LLMs of OpenAI’s ChatGPT or Google’s Bard, is relatively recent. Even the term “data scientist,” is relatively new, having been coined in 2011 by DJ Patil, who served as the nation’s first chief data officer in the White House of Barack Obama and is a long-time member of our advisory board at data.world.

I won’t dive deeply into that long deliberation, well-chronicled by Wiggins and Jones. In short, data analysis won. At least for the moment. But it is important as background to the fact that AI developments are moving at such a pace that we have yet to develop a language to talk about AI effectively and meaningfully. This is my assertion, not the authors, but a precise and agreed-upon language is among many challenges on the journey to effective use of AI.

As the authors do succinctly note:

“Some fields, like biology, are named after the object of study; others like calculus are named after a methodology. Artificial intelligence and machine learning, however, are named after an aspiration: the fields are defined by the goal, not the method used to get there.” This is a profound observation.

The third of my takeaways from this book –  in some ways my favorite as a committed angel investor myself – is the observations Wiggins and Jones make on venture capital, an institution whose maturation really mirrors that of AI.

There’s no doubt, as the authors point out, that not all venture funds have been funded wisely, in technology or elsewhere. Nor have we resolved the conundrums of market-hungry algorithms that build audiences through the monetization of emotion and anger (i.e., enragement drives engagement).

But the old model of innovation – in their example Ford Motor Company founded in 1916 – moves at a pace of years, in which norms, laws, markets, and even technology itself, has decades to adjust. That model of 20th century capitalism is quaint by the standards of today’s software and information technology that disrupts old patterns and norms in mere months in some cases. 

“Venture capital (VC) has helped to expedite this disruption in that massive growth can precede revenue, and, in the case of consumer-facing companies, norms (and even more so regulation) that otherwise slow adoption of a new product,” the authors write.

This is another way of restating what we investors call the “power law”, the unique reality of venture capital in that it can endure risks that other investors avoid because it assumes losses on around 90% percent of ventures will be made up by the “grand slams” of the other 10%. 

Sure, the government, particularly the so-called military industrial complex, has been a funder of venture-backed startups. But in the broad scheme of things, the AI that we want and need – in health care, education, and shelter – will be responsible and thoughtful civilian VC investment. This must be part of our global debate as well, especially as herd insight kicks in and VCs race to back “the next big thing” of AI.

The final thought of the authors is that nothing is inevitable and the future is ours to decide.

“We don’t need to build and learn the stratification of the past and present and reinforce them in the future,” they write. … “it takes a while to reorient norms, laws, architecture and markets in a way that harnesses these emerging capabilities in order to empower the defenseless – but it can be done.”

I agree. And I also believe the wisdom and insight in How Data Happened is an important part of our roadmap to get there. We’ll keep racing along here at data.world to evolve the use of data and AI for the betterment of society and our customers. The announcements we’ve made on Archie Bots, Eureka Bots, and BB Bots are just the beginning.