In our era of ever-growing amounts of complex data, leading businesses and organizations are turning away from the traditional relational database and toward knowledge graphs to help them make sense of their huge quantities of diverse and complicated information.
But why? What is a knowledge graph, and what advantages does it provide for data management and governance?
In this comprehensive guide, we'll explain what knowledge graphs are, how they are created, and how they benefit enterprise data teams. We’ll also discuss the role of computer science, the semantic web, and knowledge management in knowledge graph creation.
And, most importantly, we’ll dive into why it’s crucial that a modern enterprise data catalog — like data.world — be underpinned by a knowledge graph.
Introduction to knowledge graphs
So what is a knowledge graph? Knowledge graphs are a way of organizing and representing information in a machine-readable format. A knowledge graph represents a collection of real-world concepts (displayed as nodes) and relationships (displayed as edges) in the form of a graph used to link and integrate data coming from diverse sources. They bridge the “data-meaning gap,” connecting business terminology and context with data and enabling data access via a commonly understood language, dramatically improving search, findability, clarity, and accuracy.
Unlike a relational database, a knowledge graph is a type of graph database that connects data points via semantic relationships and displays them in a graph representation. Knowledge graphs are important because they allow computers to understand the context of information and make connections between seemingly disparate pieces of data.
For example, the Google knowledge graph uses semantic relationships to connect people, places, and things. Because of this, when you search for a famous person, Google’s knowledge graph allows the search engine to display a summary of their life, their works, and related people or events. This functionality, again courtesy of a knowledge graph, allows users to quickly get a comprehensive overview of a topic without having to sift through multiple search results or web pages.
data.world’s enterprise data catalog connects data the same way: semantic relationships provide a top-down view of your data and how it interacts with your other data across the enterprise. This not only gives context to the data you’re analyzing but also makes it easy to find exactly what you’re looking for, just like Google.
What is a knowledge graph?
A knowledge graph’s collection of data points and semantic, contextual relationships represents a particular domain of knowledge. The context provided via the relationships allows people and computers to understand how different pieces of information relate to each other within a data model. Knowledge graphs are often depicted using nodes and edges in a graph representation, with nodes representing entities (such as people, places, and things) and edges representing the relationships between them.
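As a rough illustration of the node-and-edge structure described above, a knowledge graph can be sketched as a list of (subject, relationship, object) triples. The entities and relationship names below are chosen only for the example; this is a minimal sketch, not how any particular product stores its graph.

```python
from collections import defaultdict

# Each edge is a (subject, relationship, object) triple.
# The names are illustrative and not taken from any real graph.
triples = [
    ("Ada Lovelace", "born_in", "London"),
    ("Ada Lovelace", "collaborated_with", "Charles Babbage"),
    ("Charles Babbage", "designed", "Analytical Engine"),
    ("London", "located_in", "England"),
]

# Nodes are every entity that appears at either end of an edge.
nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}

# Index outgoing edges by node so relationships are easy to look up.
outgoing = defaultdict(list)
for subject, relation, obj in triples:
    outgoing[subject].append((relation, obj))

for relation, obj in outgoing["Ada Lovelace"]:
    print(f"Ada Lovelace --{relation}--> {obj}")
```

The same structure works for any domain: only the triples change, not the code that reads them.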
Modern knowledge graphs can be created for any type of domain, data model, or industry, including medicine, finance, and social networks. A knowledge graph can be used for a variety of purposes, such as search engines, recommendation systems, and data integration.
The evolution of knowledge graphs and their impact on technology
The idea of a knowledge graph has been around since the 1960s, but it wasn't until the rise of the internet that it became feasible to create large-scale knowledge graphs. Additionally, over the past decade, advances in machine learning methods and natural language processing have made it easier to extract information from unstructured data sources and create more complex relationships between data points.
Today, knowledge graphs are used by companies like Google, Amazon, Netflix, and Facebook to power their search engines and recommendation systems. They are also used by enterprise businesses, governments, and academic institutions to manage large-scale datasets.
How knowledge graphs are created: Data integration and data points
Creating a knowledge graph requires compiling data from different sources and standardizing it to make it machine-readable. This is called data integration, and it can be a time-consuming process, as data might need to be cleaned, transformed, and merged to ensure consistency.
Once the data has been integrated, it can be represented as data points. A data point is a piece of information that is connected to other data points through semantic relationships. These relationships can be simple (such as "is a part of") or complex (such as "works for" or "is married to").
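The integration step described above can be sketched in a few lines. The example assumes two hypothetical sources (an HR export and a CRM export) that name the same person differently; the field names, records, and the `normalize` helper are all invented for illustration.

```python
# Two hypothetical sources describing the same employees with
# different field names and inconsistent formatting.
hr_records = [
    {"employee": "  Maria Chen ", "dept": "Finance"},
    {"employee": "Omar Diaz", "dept": "Engineering"},
]
crm_records = [
    {"name": "maria chen", "manages_account": "Acme Corp"},
]

def normalize(name: str) -> str:
    """Trim whitespace and standardize capitalization so the same
    entity gets the same identifier in every source."""
    return " ".join(part.capitalize() for part in name.split())

# A set removes duplicate facts that arrive from more than one source.
triples = set()
for row in hr_records:
    triples.add((normalize(row["employee"]), "works_in", row["dept"]))
for row in crm_records:
    triples.add((normalize(row["name"]), "manages_account", row["manages_account"]))

for triple in sorted(triples):
    print(triple)
```

After cleaning, both sources refer to the same "Maria Chen" data point, so her department and her account end up connected through a single node.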
Machine learning and natural language processing in knowledge graph development
Machine learning and natural language processing are two key technologies that are used in enterprise knowledge graph development. Machine learning algorithms can be used to extract information from unstructured data sources, such as text documents and images. Natural language processing can be used to understand the meaning of text and identify semantic relationships between entities.
For example, if you wanted to create a knowledge graph of a social network, you could use machine learning algorithms to extract information about people's relationships, interests, and activities from their social media profiles. You could then use natural language processing to identify the semantic relationships between these entities, such as "is friends with" or "likes".
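To make the extraction idea concrete, here is a deliberately simplified sketch. It uses fixed regular expressions as a stand-in for a trained language model, which is an assumption made to keep the example self-contained; the posts and names are hypothetical.

```python
import re

# Hypothetical snippets standing in for social media text.
posts = [
    "Priya is friends with Jonas.",
    "Jonas likes hiking.",
    "Priya likes photography.",
]

# Each pattern maps a surface phrase to a relationship label.
# A real pipeline would use a trained model instead of fixed patterns.
patterns = [
    (re.compile(r"^(\w+) is friends with (\w+)\.$"), "is_friends_with"),
    (re.compile(r"^(\w+) likes (\w+)\.$"), "likes"),
]

def extract(text: str):
    """Return a (subject, relationship, object) triple, or None."""
    for pattern, relation in patterns:
        match = pattern.match(text)
        if match:
            return (match.group(1), relation, match.group(2))
    return None

triples = [t for t in (extract(p) for p in posts) if t is not None]
print(triples)
```

Whatever does the extracting, the output has the same shape: entities joined by a named relationship, ready to be added to the graph.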
Understanding the role of computer science in knowledge graph creation
Computer science plays an important role in knowledge graph creation, as it provides the tools and techniques needed to process and analyze large-scale datasets. It also provides the basis for graph theory, which is used to represent and manipulate graph data models.
Computer science is used to develop algorithms and models for data integration, natural language processing, and machine learning. These algorithms and models are used to extract information from unstructured data sources and create semantic relationships between data points.
Semantic web and knowledge management in knowledge graph development
The semantic web is a set of technologies and standards that are used to create machine-readable data on the internet. It provides a way for developers to create and share data in a way that computers can understand. Knowledge management, on the other hand, is the process of creating, sharing, using, and managing knowledge and information within an organization.
Knowledge graphs are often used in conjunction with semantic web and knowledge management technologies. They allow organizations to create machine-readable representations of their data and share it with other organizations in a standardized format.
Knowledge representation and graph databases
Knowledge representation is the process of creating a formal model of knowledge in a machine-readable format. It involves representing knowledge as a set of concepts, relationships, and rules. Graph databases, on the other hand, are databases that store data in the form of nodes and edges.
Enterprise knowledge graphs can be represented using graph databases. This allows organizations to store and query their data in a way that is optimized for graph-based queries. It also allows them to create complex relationships between data points and easily navigate their data.
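A graph-based query usually means following edges from one node to another. The sketch below uses a plain in-memory adjacency list and breadth-first search; it illustrates the idea only and does not reflect the internals of any specific graph database. All names are hypothetical.

```python
from collections import deque

# Hypothetical adjacency list: node -> list of (relationship, neighbor).
graph = {
    "Alice": [("works_for", "Acme")],
    "Acme": [("headquartered_in", "Austin")],
    "Austin": [("located_in", "Texas")],
    "Texas": [],
}

def find_path(start, goal):
    """Breadth-first search returning the list of edges from start
    to goal, or None when goal cannot be reached."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None

print(find_path("Alice", "Texas"))
```

Answering "how is Alice connected to Texas?" takes one traversal here, whereas a relational model would typically need a join per hop.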
The modern knowledge graph: Acquiring new data and domain knowledge
Modern knowledge graphs are constantly growing and evolving. They are not static datasets; they continually acquire new data and domain knowledge, which allows them to remain relevant and useful in a rapidly changing world.
To acquire new data and domain knowledge, organizations can use a number of techniques. They might use web scraping to extract data from the web, or crowdsourcing to gather information from users. They might also use machine learning algorithms to automatically extract new relationships between entities.
Artificial intelligence and semantic relationships
Artificial intelligence (AI) is a broad term that refers to any technology that can perform tasks that normally require human intelligence, such as learning, reasoning, and perception. AI is used in enterprise knowledge graph development to automate the process of extracting information from unstructured data sources and creating semantic relationships between entities.
For example, AI algorithms can be used to automatically extract relationships between people based on their social media activity. These relationships could then be used to create a social network knowledge graph that would allow users to explore their connections with others.
Data management in knowledge graph development
Data management is the process of storing, protecting, and maintaining data. In enterprise knowledge graph development, data management is particularly important because knowledge graphs are often used to store large-scale datasets.
To manage their data, organizations can use a variety of techniques, such as data backup and recovery, data security, and data governance. They can also use data visualization tools to provide context and help them understand their data and identify patterns and trends.
The future of knowledge graphs: The open linked data movement and machine learning algorithms
The open linked data movement is making it easier for organizations to share their data with others in a standardized format. This will allow for more seamless integration of data across different domains.
In addition, machine learning and deep learning algorithms are becoming more sophisticated, allowing for more accurate and automated extraction of information from unstructured data sources. This will make it easier for organizations to create and maintain complex knowledge graphs.
Applications of knowledge graphs in various industries
Knowledge graphs have a wide range of applications, making them useful across numerous industries. In healthcare, for example, a knowledge graph can be used to help doctors and researchers better understand complex diseases and their treatments. For financial institutions, knowledge graphs are used to identify patterns and trends in financial data and make better investment decisions.
Other industries that use knowledge graphs include e-commerce, social networking, and transportation. The possibilities are endless, and as knowledge graphs become more sophisticated, their applications will only continue to grow as they deliver significant competitive advantage.
What are the benefits of a data catalog powered by a knowledge graph?
When used to power a data catalog, knowledge graphs help make sense of large-scale datasets and extract meaningful insights from them.
Google popularized the term “knowledge graph” in 2012 with an article introducing their “things not strings” approach to search. In it, they highlighted three main benefits of a knowledge graph:
Find the right thing – Results are more relevant because knowledge graph understands entities, and the nuances in their meaning, the way you do.
Get the right summary – Knowledge graph better understands your query, so it can summarize relevant content around that topic, including key facts you’re likely to need for that particular thing.
Go broader and deeper – Make unexpected discoveries through knowledge graph suggestions.
What is a data catalog?
Because of these benefits, one extremely effective application for a knowledge graph is as the underpinning for a data catalog. A data catalog is a metadata management tool that companies use to inventory and organize the data within their systems. The business goal of a data catalog is to empower your workforce so they can get more information from your data investments, break down data silos, gain better data insights as a whole, and make better decisions quickly.
To accomplish this goal, an enterprise data catalog needs to create and manage collections of data from multiple sources, along with the relationships among them, and provide a unified view of the data landscape to data producers (e.g., data engineers, data stewards) and data consumers (e.g., data scientists, data analysts). These collections include tables and columns of a database, business glossary terms, analyses, and reports from BI dashboards. A key takeaway is that managing relationships should be the bread and butter of data catalog tools. That is where knowledge graphs come in.
A data catalog powered by a knowledge graph delivers the same benefits within your enterprise, and builds upon them to deliver greater value to data teams in the enterprise:
Improve search accuracy – Metadata and data are logically organized and in machine-readable format, speeding search and discovery.
Activate your metadata – Analyze and traverse lineage to understand changes to metadata, connect concepts, terms, or metric definitions to “physical” tables and columns.
Enhance data governance – Map data assets to key enterprise concepts to make them discoverable and accessible for greater user self service.
Additionally, knowledge graphs enable fast, flexible, and scalable cataloging of data and metadata. You can efficiently onboard, integrate, and catalog any new data source, including semi-structured and unstructured sources, in a matter of days. Relational data catalogs, by contrast, are rigid and inflexible and can take months to do the same.
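The "activate your metadata" benefit above, connecting a business term to physical columns and then walking lineage, can be sketched as follows. The table names, glossary term, and relationship labels (`defined_by`, `derived_from`, `reads_from`) are hypothetical and are not data.world's actual model.

```python
# Hypothetical catalog metadata as (subject, relationship, object) triples.
metadata = [
    ("glossary:Revenue", "defined_by", "warehouse.sales.amount"),
    ("warehouse.sales.amount", "derived_from", "staging.orders.total"),
    ("staging.orders.total", "derived_from", "raw.orders.total_cents"),
    ("dashboard:Quarterly Sales", "reads_from", "warehouse.sales.amount"),
]

def columns_for_term(term):
    """Physical columns that implement a business glossary term."""
    return [o for s, r, o in metadata if s == term and r == "defined_by"]

def upstream(column):
    """Follow derived_from edges back toward the original source.
    This sketch assumes each column has at most one parent."""
    lineage = []
    current = column
    while True:
        parents = [o for s, r, o in metadata
                   if s == current and r == "derived_from"]
        # Stop at a source column, or if the data loops back on itself.
        if not parents or parents[0] in lineage:
            return lineage
        current = parents[0]
        lineage.append(current)

for column in columns_for_term("glossary:Revenue"):
    print(column, "<-", upstream(column))
```

Because terms, columns, and dashboards all live in one graph, a single traversal answers "where does Revenue come from?" without stitching together separate metadata stores.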
Why is it crucial for your data catalog to be powered by a knowledge graph?
As you now know, knowledge graphs enable the integration of knowledge and data at a large scale in the form of a graph data model. But unlike a traditional relational database, knowledge graphs are, by definition, flexible and agile. This means a knowledge graph can grow to accommodate not only however much data you have, but also whatever type of data you have, ad infinitum.
By building your data catalog on a knowledge graph you get the flexibility of extending that same graph model across any new sources of data that you acquire or spin up. And you can easily connect the data to your own business terms. A knowledge graph makes it easy to extend the model to represent concepts and relationships that may have not been defined before without costly and time-consuming infrastructure changes.
The very nature of a knowledge graph makes it easy to extend your catalog alongside your growing data ecosystem. That’s why data-driven leaders like Airbnb, Lyft, and LinkedIn have built their catalogs on a knowledge graph.
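One way to picture this extensibility: in a triple-based model, a new kind of source or relationship is just more triples. The sketch below is a simplified illustration with hypothetical names; real systems add validation and an ontology on top.

```python
# An existing catalog graph with two relationship types (hypothetical data).
catalog = {
    ("orders_table", "has_column", "order_id"),
    ("orders_table", "has_column", "total"),
    ("total", "described_by", "glossary:Revenue"),
}

def relationship_types(graph):
    """Distinct relationship names currently used in the graph."""
    return sorted({relation for _, relation, _ in graph})

before = relationship_types(catalog)

# Onboarding a new kind of source (a streaming topic) needs only new triples.
# No table definitions change and the existing triples are untouched.
catalog |= {
    ("clickstream_topic", "feeds", "orders_table"),
    ("clickstream_topic", "owned_by", "team:Web Analytics"),
}

after = relationship_types(catalog)
print(before)
print(after)
```

In a fixed relational schema, the same change would usually mean new tables or columns and a migration before any data could be loaded.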
Why a knowledge-graph-powered data catalog is superior to one built on a traditional relational database
Data catalogs powered by traditional relational technology are rigid and inflexible. This means it can take months to support new types of data sources.
Conversely, a knowledge graph can extend its model to cover a new type of source without costly and time-consuming infrastructure changes.
Without a knowledge graph powering your data catalog, you can’t properly integrate new knowledge and data across your organization. And beyond that, without a knowledge graph, your data environment and sources may eventually outgrow your catalog’s capabilities.
Furthermore, as data.world co-founder and CTO Bryon Jacob wrote for Forbes, knowledge graphs are well-suited to organizations with large data sets and where extracting knowledge is overly difficult:
For example, an organization might use a variety of data and content management systems, all without the ability to communicate with each other. Or data may be structured in a way that's incompatible. Or there may be little horizontal collaboration between teams and departments.
A data catalog powered by a knowledge graph addresses these issues by taking a collection of messy, unstructured and scattered data sources, unifying them, and integrating them with the knowledge that gives data meaning, so they can be better analyzed and provide more immediate value.
How can you identify a true knowledge-graph-powered data catalog?
The adoption of knowledge graph models and their myriad benefits has made plenty of noise in the world of data. So much so, in fact, that it seems every data catalog now claims to be powered by a knowledge graph, with some seemingly switching from traditional relational technology overnight!
But calling a data model a “knowledge graph” doesn’t make it one. When you look beyond the marketing hype, you’ll see true knowledge graphs possess three distinct characteristics:
Three characteristics of a true knowledge graph
1. It has an ontology
First of all, a real, legitimate knowledge graph has an ontology, which serves to give definitions and create a formal representation of the entities in the graph and explain how they’re related. In short, it tells you what everything in your knowledge graph means.
The ontology is more than just semantic metadata or a picture of a schema. It should be machine readable. It should also reuse existing best practices and standards. The ontology underlying data.world’s knowledge graph consists of:
DCAT to represent a data catalog
Dublin Core to represent metadata
SKOS to represent glossaries and thesauri
PROV to represent provenance and data lineage
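To show what reusing these standards can look like, the sketch below describes one dataset with terms from each vocabulary. The namespace URIs are the published ones; the `https://example.org/catalog/` resources and the particular choice of properties are assumptions for illustration, not data.world's actual ontology.

```python
# Published namespace URIs for each vocabulary.
DCAT = "http://www.w3.org/ns/dcat#"
DCTERMS = "http://purl.org/dc/terms/"
SKOS = "http://www.w3.org/2004/02/skos/core#"
PROV = "http://www.w3.org/ns/prov#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical resource identifiers for the example.
EX = "https://example.org/catalog/"

triples = [
    (EX + "sales", RDF + "type", DCAT + "Dataset"),
    (EX + "sales", DCTERMS + "title", "Quarterly sales"),
    (EX + "revenue", RDF + "type", SKOS + "Concept"),
    (EX + "revenue", SKOS + "prefLabel", "Revenue"),
    (EX + "sales", DCTERMS + "subject", EX + "revenue"),
    (EX + "sales", PROV + "wasDerivedFrom", EX + "raw-orders"),
]

VOCABULARIES = [("DCAT", DCAT), ("Dublin Core", DCTERMS),
                ("SKOS", SKOS), ("PROV", PROV), ("RDF", RDF)]

def vocabulary_of(uri):
    """Name the vocabulary a URI belongs to, based on its namespace."""
    for name, namespace in VOCABULARIES:
        if uri.startswith(namespace):
            return name
    return "unknown"

for subject, predicate, obj in triples:
    print(f"{vocabulary_of(predicate):12} {predicate}")
```

Because every term resolves to a shared, published definition, another tool that understands DCAT or PROV can read these statements without a custom mapping.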
2. It's extensible and flexible
Any data catalog truly powered by a knowledge graph should be able to add, integrate, and catalog any data resource to your ontology immediately. It should be completely extensible and flexible, allowing you to add new data resources and ontologies and support any kind of additional data you want, whenever you want, no ifs, ands, or buts.
If someone claiming to run their data catalog on a knowledge graph tells you they can’t support adding resources of a certain type, that some data has to be siloed, or that it’ll take months to make the necessary changes in order to do so, they don’t have a real knowledge graph.
3. Everything in your knowledge graph is queryable
If a data catalog is powered by a true knowledge graph, you should be able to query all of your metadata, everything that’s in there. This means your metadata resources should be represented in a standard graph format such as RDF (Resource Description Framework), your ontology should be built in OWL (Web Ontology Language), and you should be able to query both with SPARQL (SPARQL Protocol and RDF Query Language). You should effectively be able to write any query you want; you should be able to ask anything, like you’re on Google.
If you can’t query all the metadata within your data catalog, sorry, your data catalog isn’t powered by a knowledge graph.
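To give a feel for what "everything is queryable" means, here is a tiny in-memory pattern matcher that mimics the shape of a SPARQL basic graph pattern. It is a teaching sketch in plain Python, not a SPARQL engine, and the `ex:` resources are hypothetical.

```python
# Hypothetical catalog triples, using prefixed names for brevity.
triples = [
    ("ex:sales", "rdf:type", "dcat:Dataset"),
    ("ex:sales", "dct:title", "Quarterly sales"),
    ("ex:orders", "rdf:type", "dcat:Dataset"),
    ("ex:orders", "dct:title", "Raw orders"),
    ("ex:revenue", "rdf:type", "skos:Concept"),
]

def match(pattern, bindings):
    """Yield extended variable bindings for one triple pattern.
    Terms starting with '?' are variables; other terms must match exactly."""
    for triple in triples:
        new = dict(bindings)
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if new.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield new

def query(patterns):
    """Join several triple patterns on their shared variables."""
    results = [{}]
    for pattern in patterns:
        results = [b for bindings in results for b in match(pattern, bindings)]
    return results

# Roughly equivalent to:
#   SELECT ?d ?title WHERE { ?d rdf:type dcat:Dataset . ?d dct:title ?title }
rows = query([("?d", "rdf:type", "dcat:Dataset"),
              ("?d", "dct:title", "?title")])
for row in rows:
    print(row["?d"], row["?title"])
```

The point of the sketch is that the query touches metadata of any kind through one uniform pattern language; nothing in the graph is off limits to it.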
Conclusion: The power of knowledge graphs
Knowledge graphs are an incredibly useful and powerful tool for organizing and representing complex information. They allow people and computers to understand the context of information and make connections between seemingly disparate pieces of data. As technology evolves, so too will the possibilities for knowledge graph applications. And as we move into an increasingly data-driven world, knowledge graphs will become even more important.