Today we announced the most important innovation in data.world’s history: our new AI Context Engine™. You can read the fundamentals in our press release. Here I want to explore why this really matters – and how it will shape the future of enterprise data and analytics.
For this new age of artificial intelligence (AI) is the moment when the long-sought goal of the data space pioneers – a goal with its origins in IBM’s invention of SQL (Structured Query Language) in the early 1970s, followed by Oracle’s commercialization shortly thereafter – finally comes within reach. This half-century marathon, now accelerating at a blistering pace toward its inevitable destination, means that soon you’ll be able to chat with your data as naturally as you might brainstorm with a room full of Einsteins who have access to all of the facts (i.e., the data) of your company, as I wrote about in May of last year.
A Tale of Two Phase Transitions
Let’s take stock of this incredible moment in history. In his profound and seminal book, The Inevitable – Understanding the 12 Technological Forces That Will Shape Our Future, author and seer Kevin Kelly introduced an enduring analogy – the phase transition.
I want to talk a bit about that concept today, with two objectives in mind. One, I want to illustrate just how perceptive Kelly was in his 2016 grasp of AI and the planetary change it will bring as it couples with other digital technologies. And two, I want to introduce a narrower phase transition that we are enabling for customers with our AI Context Engine. It is, I believe, the most powerful force multiplier we have ever produced for business and enterprise in our company’s eight-year history.
It is frankly nothing less than the equivalent of an expanded “frontal lobe” atop the existing “brain” and “nervous system” that animate data.world. Or as Bryon Jacob, my co-founder and our CTO, put it in a very thought-provoking discussion on the Deep Future podcast with inventor and futurist (and data.world advisor) Pablos Holman: “I truly believe there’s way more juice for the squeeze” in extracting true insights from data with AI – in particular from so-called “structured” data – atop our new platform, soon to be unveiled.
I’ll get to my problem-solving data cortex, and to Bryon’s “squeeze” of structured data – the spreadsheets, tables, and databases that typically comprise the most valuable data assets of an enterprise. And I’ll share why the impenetrability of that structured data to the now-ubiquitous Large Language Models (LLMs) of generative AI is such a pernicious problem – but one that we have solved with the reasoning of the AI Context Engine. But before we do, let’s get back to Kelly and his analogy from the physical sciences to see how his transition is interwoven with ours.
The Redefinition of Human Existence
For in its meaning for physicists, a phase transition is a fundamental change in the state or nature of matter. The classic example is H2O. Below 32 degrees Fahrenheit (at sea level), this compound is a solid – ice. Above that temperature, it becomes a liquid – water. And at 212 degrees, that water becomes a vapor – steam. Kelly’s brilliant argument, which he explained some years ago in a riveting talk with The Long Now Foundation, is that the interconnected digital world is driving us into an analogous, fundamental, and irreversible transformation that will redefine human existence. Thoughtful use of AI, from my vantage point, is what will carry us through that phase transition to the other side.
As regular readers know, the challenges of the coming metamorphosis are ones I’ve weighed in on frequently, particularly since November 30, 2022, when the advent of OpenAI’s ChatGPT inscribed the mind-bending power of AI into global consciousness. It was the fastest-growing application in human history, reaching 100 million users two months after launch – a user tally that has now reached 180 million, with 1.6 billion visits last January alone and a revenue run-rate now of $2 billion.
This broad phase transition will upend our linear, K-12 and bachelor’s-to-master’s-to-PhD higher education models into a new and non-linear ecology of learning, as I wrote in a larger series last summer. In a review of another seminal book in October, The Coming Wave by Inflection AI CEO and co-founder Mustafa Suleyman, I explored how our knowledge and life itself are merging with AI-empowered synthetic biology. We will expand our cognition while modifying life, and even create new life forms. And on November 30, on the anniversary of OpenAI’s launch of ChatGPT, entrepreneur and author Byron Reese and I teamed up on an essay about what Byron has termed the “Agora”, an emerging planetary creature whose “DNA” is the aggregate of all the information in our brains, books, and the internet. We had a blast discussing that article and his latest book, We Are Agora, on the Austin Next podcast.
It’s BIG. You can compare it to the discovery of fire, the invention of the printing press, the splitting of the atom, or the advent of the World Wide Web. But the metaphors still fall short of expressing the social, economic, and civilizational change that is just over the horizon. At the heart of it all is, of course, data. And again, metaphors barely suffice as companies move vast amounts of business to the “cloud”, consumers leave data trails at every online transaction, and “smart cities” deploy sensors to monitor traffic, energy use, and even viral outbreaks in sewage. Each day, medical teams take millions of digital X-rays and CT scans, 5G networks continue to multiply and grow worldwide, and the 15 billion-plus Internet of Things (IoT) devices now in use grow in number by 20% a year. This is the scope and speed of Kelly’s data-driven phase transition.
This is also the boundless domain of endeavor that we dove into at data.world eight years ago. As technology journalist John Battelle, the founding editor of The Industry Standard, put it at our launch back in 2016:
“... data.world sets out to solve a huge problem — one most of us haven’t considered very deeply. The world is awash in data, but nearly all of it is confined by policy, storage constraints, or lack of discoverability,” Battelle wrote of us in our infancy. “In short, data.world makes data discoverable, interoperable, and social. And that could mean an explosion of data-driven insights is at hand.”
I think we’ve fulfilled John’s forecast and justified the confidence he vested in us. For so much of the vast data ecosystem was and remains primitive, characterized by hoarding, isolation, and fragmentation in the proverbial silos of varied departments, divisions, teams, and distributed sites. In contrast, we have been making the data ecosystem sophisticated and refined.
The Mastery of Metadata
Key to all of this has been the core technology of the data catalog, a metadata management system for enterprises bulging with data. In short, our data catalog inventories and organizes data, making it easier for users to find, analyze, and trust their data (i.e., their facts), with the end result being faster and more precise decisions. Powering the catalog and all of our applications, including our new AI Context Engine, is the foundation of our knowledge graph (backed by 76 patents since our inception). Our knowledge graph is a further refinement, one that represents the entities, attributes, and relationships of data across teams and silos, enabling complex queries and analytics across this lattice of interconnected institutional knowledge. Other applications powered by our knowledge graph include data governance, data lineage, our acquisition Mighty Canary (now named Sentries), and more.
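To make that concrete, here is a minimal sketch – written in Python with the open-source rdflib library, using an entirely hypothetical schema rather than data.world’s actual implementation – of how a knowledge graph makes entities, attributes, and relationships explicit, and how a single query can then traverse data that lives in different silos:

```python
# A minimal knowledge graph sketch using the open-source rdflib library.
# The schema and data are hypothetical -- an illustration of the concept,
# not data.world's actual implementation.
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/")
g = Graph()

# Entities and attributes from two different "silos": CRM and billing.
g.add((EX.acme, RDF.type, EX.Customer))
g.add((EX.acme, EX.name, Literal("Acme Corp")))   # from the CRM system
g.add((EX.inv42, RDF.type, EX.Invoice))
g.add((EX.inv42, EX.amount, Literal(12500)))      # from the billing system

# The relationship that stitches the silos together.
g.add((EX.inv42, EX.billedTo, EX.acme))

# A query that traverses that relationship across both silos.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?name ?amount WHERE {
        ?invoice ex:billedTo ?customer ;
                 ex:amount   ?amount .
        ?customer ex:name ?name .
    }
""")
for name, amount in results:
    print(f"{name}: {amount}")   # -> Acme Corp: 12500
```

The point of those explicit triples is that the relationship itself – which invoice is billed to which customer – is machine-readable, which is precisely what most enterprise data lacks and what makes complex queries across silos possible.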
We founded our AI Lab in February of last year, an industry first, followed a few months later by Archie Bots and “Interactions with Archie”, our own internal version of a GPT that allows customers to chat seamlessly with all of our product collateral, white papers, customer case studies, and more – another industry first.
To put this in broad context, the important book Winning with Data by venture capitalist Tomasz Tunguz and Frank Bien, CEO of the business intelligence platform Looker, wrestles with the skyrocketing problem of what they describe as “the data-poor, waiting around at the end of the day on the data breadlines”. Protracted waits for data analyses slow workforce productivity. Those wait times then force employees to shoot from the hip (or give up), making decisions that, without data, are inevitably imprecise. The resulting craters in enterprise performance are easy to imagine. And they are rife.
Our knowledge graph has been shortening “data breadlines” for the better part of a decade. Now, we’ll be eliminating them with AI.
Data-Driven Digital Cognition
As I wrote in a series more than two years ago on the history and evolution of data, the best way to conceptualize the catalog and knowledge graph is as the nervous system and brain that enable an otherwise primitive corpus of data to be converted into a source of enterprise-scale cognition. This brings insights together and accelerates precise decision-making.
In the case of us humans, our evolution ultimately began with the Cambrian explosion of multicellular life, a half-billion years ago. Of course there wasn’t a lot of cognition going on then, though zoologist Andrew Parker has advanced an intriguing theory that the fuse of intelligent life was the emergence of a single capability: the photosensitivity that let the earliest organisms perceive and escape predators rather than simply take their chances as they floated through the primordial sea. We’ll save Parker’s “Blink of an Eye” hypothesis for another day, but it is the insight at the heart of ImageNet, an important project among the milestones of AI.
Our “cognition”, as we experience it today, traces back perhaps just 100,000 years, judging by the records of symbolic thought and art. That’s about the point at which we can claim to have first had a fully developed nervous system, with the sensory inputs of sight, smell, touch, hearing, and taste, and the motor outputs enabling motion, breath, and organ function. These all interact under the command and control of the brain.
So to carry the metaphor, the equivalent of the Cambrian explosion, the “Big Bang” in the realm of data, is less than a half century old. Its origins are in what I described at the beginning of last year as the four “surges” of data evolution since the 1970s. Bring this maturation to our moment, and the data catalog is the full evolution of the data ecology. It is data-driven, digital cognition. It is the central nervous system of the enterprise; the sensory organ that connects sales, customer service, marketing, development, IT, HR, finance, supply chains, operations, and accounting. The knowledge graph, meanwhile, is the semantic architecture of meaning and reasoning; it is the learning organism of the enterprise.
In the process, we’ve also developed systems, tools, and protocols to manage the diversity and complexity of data – back to what I alluded to at the outset: what we data nerds parse as “structured” or “unstructured” data. We have become adept at dealing with the reality that data is from Mars but data science is from Venus, as I once framed it.
But phase transitions move briskly. The diffusion of LLM-driven AI technologies has pushed the enterprise-level phase transition into warp speed. This is unambiguously a good thing. The productivity boost can occur in a very short period of time and quickly compound into an organizational and competitive advantage. I recently wrote about our own AI Operations, which have increased productivity by 25% or more in just the last year for our employees who use these new AI-driven superpowers. The online fintech firm Klarna just reported that its new AI assistant for customer service already handles two-thirds of its service chats and is projected to save the company $40 million per year – and this is just in the first month.
The Achilles’ Heel of Large Language Models
There is a new and abiding challenge, however. The design of LLMs around so-called “natural language” – with which we’ve all become familiar through ChatGPT and other chatbots – is both the technology’s supreme virtue and its Achilles’ heel.
LLMs can write a poem in the style of Shakespeare in seconds, quickly summarize an annual report, or even help write computer code, thanks to the adept pattern-recognition capabilities they derive from ingesting huge volumes of unstructured text. At a company or enterprise, LLMs are great at reading and synthesizing emails, documents, and other textual material. Navigating “structured” data, however – from spreadsheets to massive databases – demands an understanding of the relationships between data points, which are often neither explicitly stated nor organized in a way that LLMs can comprehend and convert to useful information. Consequently, organizations have faced huge barriers as they endeavor to provide chat interfaces to structured data in any form, often struggling to create queries that yield actionable insight.
Traditional data catalogs may have AI-powered features and functionalities. But even with those utilities, an LLM can’t chat with structured data. Until now. The AI Context Engine unlocks vast amounts of previously inaccessible corporate and organizational data for use by teams harnessing the chat interfaces of LLMs.
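To illustrate the general pattern – and this is a sketch of the concept, not the AI Context Engine’s internals – consider what it takes to ground an LLM’s query generation in explicit semantic context. The example below uses the OpenAI Python client with a hypothetical two-table schema and a made-up business definition:

```python
# A sketch of grounding LLM query generation in semantic context.
# Illustrative only: the schema, definitions, and model choice are
# assumptions, not the AI Context Engine's actual internals.
from openai import OpenAI

client = OpenAI()

# Context an LLM cannot infer from raw tables alone: the join paths
# and the business definitions behind the words people actually use.
SEMANTIC_CONTEXT = """
Table customers(id, name, segment)       -- 'segment' codes: E/M/S
Table invoices(id, customer_id, amount)  -- customer_id -> customers.id
Definition: "enterprise revenue" = SUM(invoices.amount)
            for customers where segment = 'E'
"""

def chat_with_structured_data(question: str) -> str:
    """Turn a natural-language question into SQL, grounded in the
    semantic context so the model knows the joins and definitions."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Write a single SQL query. Use only this schema "
                        "and these definitions:\n" + SEMANTIC_CONTEXT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(chat_with_structured_data("What was our enterprise revenue?"))
```

The context is what matters here: without the join paths and the definition of “enterprise revenue”, even the most capable model can only guess at the relationships buried in the tables.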
“If you don’t choose a data catalog platform on a knowledge graph architecture, and bring in all your data and knowledge, governed in one platform, then you are setting yourself up for failure in an AI future,” said one of our customers, Vip Parmar, the global head of data management for WPP, a world leader in communications, experience, commerce, and technology.
Or as another early adopter, Michael Murray, chief product officer of Power Digital, put it: “If your organization wants the speed and fluidity that generative AI can bring to question answering, a data catalog built on a knowledge graph is critical. Without the metadata, a map of the organizational knowledge, and governance, it’s extremely difficult to develop impactful applications with generative AI.”
The assessments made by Vip and Michael reflect the dilemma that Bryon and Pablos unpack in their podcast conversation mentioned above. The insight is that LLM-propelled chatbots can deliver the sage advice of every business textbook in seconds. But they can’t help you understand your own business in more than a superficial way. And without access to your structured data in a form an LLM can effectively understand, the strategic insights – from profit maximization, to pricing optimization, to the critical path of reducing pressing problems like carbon emissions – remain elusive, even though the answers are buried in your own data.
The code cracking delivered by the AI Context Engine – when coupled with the knowledge graph – makes all of that possible. Led by our Principal Scientist Juan Sequeda, our AI Lab conducted a benchmark study that revealed a 300% improvement in the accuracy of queries on enterprise data. As Bryon framed it in the conversation with Pablos: “As the complexity (of business data and decisions) goes up, the approach of using knowledge graphs becomes incredibly important really quickly. This is the cheat code for basically teaching LLMs how to deal with structured knowledge.”
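For the technically curious, the scoring behind a benchmark like this is straightforward to sketch. Here is a toy illustration of execution accuracy in Python – with hypothetical function names and no real data, not the AI Lab’s actual harness – that runs each generated query and counts how often the result matches the gold answer:

```python
# A toy sketch of how a text-to-query benchmark scores accuracy.
# Hypothetical names and data; not the AI Lab's actual harness.
from typing import Callable

def execution_accuracy(benchmark: list[dict],
                       generate_and_run: Callable[[str], object]) -> float:
    """Fraction of questions whose executed query returns the gold answer."""
    hits = sum(
        1 for item in benchmark
        if generate_and_run(item["question"]) == item["gold_answer"]
    )
    return hits / len(benchmark)

# Comparing plain text-to-SQL against knowledge-graph-grounded querying
# then comes down to two calls (the pipelines are hypothetical stand-ins):
#   acc_sql   = execution_accuracy(benchmark, run_llm_text_to_sql)
#   acc_graph = execution_accuracy(benchmark, run_llm_over_graph)
```

The ratio of those two numbers is the kind of improvement the study measured.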
The AI Context Engine will change the definition of our industry. Data cataloging, governance, and DataOps are now seen as critical for enterprises to power their AI future. Transforming your data into the true utility of insight and intelligence is the promise of AI. The AI Context Engine will revolutionize the way organizations build and scale AI applications to enable the chat-with-your-data future.
The Expanded Frontal Lobe
At data.world, we have long been giving our customers a nervous system and brain for their vast data sprawl. Now, we’re also giving them an expanded frontal lobe and cortex, a revolution that will speed their evolution. For as staggeringly impressive as LLMs are at metabolizing all the text on the internet and beyond, the dawning power of the technology will come when it can access the proprietary data now effectively walled off from even its owners. This new ‘frontal lobe’ not only enables enterprises to navigate the coming phase transition; it makes them the drivers and leaders of the transformation.
Capitalism relentlessly seeks efficiency. Winning with data – and then winning with AI – is not optional in a world where more aggressive competitors will race ahead.
To expand your margins, become more resilient, discover new products, delight your employees and customers alike, and ultimately to accelerate your company through Kelly’s phase transition and beyond, deft use of LLMs across ALL your data and knowledge will quickly become table stakes in this era of AI.
The key question is: will you get there first, and then have the compounding competitive advantage you need going forward? It’s an incredibly exciting future, and we couldn’t be more proud to announce our AI Context Engine today.