Enterprises can use Large Language Models (LLMs) to explore innovative opportunities and extract more value from their data, enhancing processes and developing new products and services. However, building data interactions with LLMs brings persistent concerns about their accuracy in production environments. A major issue is that LLMs can generate responses with false information and fabricated citations, known as "hallucinations."
Experts from various fields, including academia, database companies, and industry analysts like Gartner, are researching how pairing LLMs with knowledge graphs can improve LLM response accuracy. Knowledge graphs map data to meaning, capturing both semantics and context. This flexible format provides the context that LLMs need to answer complex questions more accurately. In their first benchmark study, Juan Sequeda Ph.D., Dean Allemang Ph.D., and Bryon Jacob, CTO of data.world, found that using an LLM backed by a knowledge graph improved LLM response accuracy by 3x compared to an LLM relying solely on a SQL database.
Continuing their research in a second study, Drs. Sequeda and Allemang examined whether an LLM’s response accuracy could be further increased by leveraging the ontology of the knowledge graph to check for errors in the generated queries and using an LLM to repair incorrect queries. This innovation is a component of data.world’s AI Context Engine™. This benchmark found a significant improvement in response accuracy (4.2x) when backed by a knowledge graph and using the Ontology-based Query Check (OBQC) and LLM Repair approach.
Specifically, top-line findings include:
A knowledge graph improved LLM response accuracy by 4.2x, up from 3x in the first benchmark study
The approach of using Ontology-based Query Check (OBQC) and LLM Repair for LLM-generated queries helped to increase the overall LLM response accuracy
All high-schema-complexity questions showed substantial increases in accuracy
70% of the LLM repairs were done by rules checking the body of the query. The majority were rules related to the domain of a property.
Download: Building the Foundation for Scalable AI Whitepaper
Learn how your organization can build AI-powered applications that generate accurate, explainable, and governed responses.
A comparison: Answering complex business questions
Both this benchmark and the original use the enterprise SQL schema from the OMG Property and Casualty Data Model in the insurance domain. The OMG specification addresses the data management needs of the Property and Casualty (P&C) insurance community. Researchers measured accuracy with the metric of Execution Accuracy (EA) from the Yale Spider benchmark.
Against this metric, the benchmarks compared the accuracy of responses to 43 questions of varying complexity, ranging from simple operational reporting to key performance indicators (KPIs).
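To make the metric concrete, here is a minimal sketch of how Execution Accuracy can be computed (our own illustration in Python, not the benchmark’s actual harness): a generated query counts as correct when it returns the same rows as the reference query, and EA is the fraction of questions answered correctly.

```python
# Minimal sketch of the Execution Accuracy (EA) metric from the Spider benchmark:
# a predicted query counts as correct when it returns the same result set as the
# gold (reference) query. Names here are illustrative, not the benchmark harness.
from collections import Counter

def rows_match(predicted_rows, gold_rows):
    """Compare two query results as multisets of rows, ignoring row order."""
    return Counter(map(tuple, predicted_rows)) == Counter(map(tuple, gold_rows))

def execution_accuracy(results):
    """results: list of (predicted_rows, gold_rows) pairs, one per question."""
    correct = sum(1 for pred, gold in results if rows_match(pred, gold))
    return correct / len(results)

# Example: 2 of 3 questions return the same rows as the gold query -> EA ~ 0.67
print(execution_accuracy([
    ([("POL-1", 1200.0)], [("POL-1", 1200.0)]),   # match
    ([("POL-2", 0.0)],    [("POL-2", 150.0)]),    # mismatch
    ([],                  []),                    # match (both empty)
]))
```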
The benchmark applies two complexity vectors: question complexity and schema complexity.
Question complexity: Refers to the number of aggregations, mathematical functions, and table joins required to produce a response.
Schema complexity: Refers to the number of different data tables that must be queried in order to produce a response.
Four categories of questions are tested: low-question and low-schema complexity (day-to-day analytics), high-question and low-schema complexity (operational analytics), low-question and high-schema complexity (metrics & KPIs), and high-question and high-schema complexity (strategic planning).
Updated approach: Ontology-based Query Check (OBQC) and LLM Repair
To continue developing techniques to improve accuracy, this benchmark added Ontology-based Query Check (OBQC) and LLM Repair to the approach and examined the execution of the SPARQL query generated by an LLM for each of the 43 questions.
The OBQC determines whether a query is valid by applying rules based on the semantics of the ontology, checking both the head and the body of the query. If an error is detected, the LLM Repair process is initiated: the LLM is re-prompted to correct the query based on the OBQC’s feedback. This iterative process continues until a valid query is generated; otherwise, after three iterations an "unknown" result is reported, which is preferable to an incorrect answer (a hallucination).
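The shape of that loop is easy to visualize. The sketch below is a simplified stand-in of our own (Python, using rdflib for the ontology), not data.world’s implementation: it shows a single illustrative "body" rule, a property-domain check, and treats query generation, triple-pattern extraction, and the repair prompt as caller-supplied functions.

```python
# Sketch of an OBQC-style check-and-repair loop. The single body rule shown
# (property-domain checking), the pattern extraction, and the repair call are
# simplified stand-ins for illustration; they are not data.world's actual rules.
from rdflib import Graph, Namespace, RDFS

EX = Namespace("http://example.org/insurance#")

# Toy ontology fragment: :hasPolicyNumber may only be used on :Policy subjects.
ontology = Graph()
ontology.add((EX.hasPolicyNumber, RDFS.domain, EX.Policy))

def domain_rule(triple_patterns, variable_types):
    """One illustrative OBQC body rule: flag any triple pattern whose subject
    is typed with a class outside the property's declared rdfs:domain."""
    errors = []
    for subject_var, prop, _obj in triple_patterns:
        declared = set(ontology.objects(prop, RDFS.domain))
        used = variable_types.get(subject_var)
        if declared and used is not None and used not in declared:
            errors.append(f"{prop} expects a subject in {declared}, "
                          f"but ?{subject_var} is typed as {used}")
    return errors

def answer(question, generate_query, extract_patterns, repair_query,
           max_iterations=3):
    """generate_query, extract_patterns, and repair_query are caller-supplied
    stand-ins for LLM query generation, SPARQL parsing, and LLM repair."""
    query = generate_query(question)
    for _ in range(max_iterations):
        patterns, variable_types = extract_patterns(query)
        errors = domain_rule(patterns, variable_types)
        if not errors:
            return query                      # valid query: execute it downstream
        query = repair_query(query, errors)   # re-prompt the LLM with feedback
    return "unknown"                          # honest "I don't know" after 3 tries
```

Returning "unknown" after three failed repairs mirrors the design choice described above: an honest "I don’t know" is preferable to a hallucinated answer.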
The original benchmark showed an average 3x improvement in response accuracy and marked improvement in each category – even the high-schema complexity questions that stumped the LLM alone. The OBQC and LLM Repair approach increased the overall accuracy of all questions to 72.55%, which results in a 4.2x improvement compared with responses relying solely on a SQL database.
The improvement from first-pass query execution to query execution enhanced by OBQC and LLM Repair can be seen here:
Low Question/Low Schema (day-to-day analytics): From 51.19% accuracy → 76.67% accuracy with repairs
High Question/Low Schema (operational analytics): From 69.76% accuracy → 75.10% accuracy with repairs
Low Question/High Schema (metrics & KPIs): From 17.20% accuracy → 76.33% accuracy with repairs
High Question/High Schema (strategic planning): From 28.17% accuracy → 60.62% accuracy with repairs
The report also includes metrics for the “unknowns”: cases where the LLM cannot produce a corrected query after three repair iterations. Here, “I don’t know” is considered a valid answer, and arguably a better one than an inaccurate answer. This occurred 8% of the time, leaving a final error rate of 19.44% (the responses that were neither correct nor “unknown”).
Overall, these results support the main conclusion of the research: investment in metadata, semantics, ontologies, and knowledge graphs is a prerequisite for achieving more accurate and trusted LLM responses for AI-powered question-answering applications.
The Future of Trusted GenAI Using Knowledge Graphs
Knowledge graphs, when used with LLMs, can help shape AI innovation by removing critical barriers and increasing trust in and adoption of Generative AI. Leveraging the knowledge graph to enhance accuracy is a key component of the AI Context Engine. This benchmark study continues to demonstrate the significant impact knowledge graphs can have on the accuracy of Large Language Models (LLMs) in enterprise settings, not only by adding business knowledge to organizational data but also by identifying and repairing errors in generated queries. As LLMs continue to improve and knowledge graph techniques are refined, the authors will continue to document these advancements, reinforcing the continuous evolution of this technology.
Even as LLMs and knowledge graph techniques improve, trust and accountability remain essential. By leveraging a data catalog built on a knowledge graph, like the data.world Data Catalog Platform, and providing the necessary business context for an organization’s data with the AI Context Engine™, enterprises can bring accuracy and explainability to LLMs and govern questions, responses, and queries.
Download: Building the Foundation for Scalable AI Whitepaper
Learn how your organization can build AI-powered applications that generate accurate, explainable, and governed responses.
For more information on the study, listen to Juan Sequeda and Dean Allemang’s talk on this benchmark study at the Alan Turing Institute on May 20, or read the paper, Increasing the LLM Accuracy for Question Answering: Ontologies to the Rescue!
To learn more about the AI Context Engine™, schedule a demo.