Modern data catalogs must be able to scale up as demand for data and knowledge grows within the enterprise. But not every catalog has that capacity – many can only scale out to multiple instances.
Data and analytics ecosystems are evolving at an amazing pace. New data and analytics tools, systems, and assets come online daily, and the companies we work for are complicated, agile, and fast-moving.
As this happens, your data catalog needs to adapt and represent every part of your business so that it paints the most complete picture of your analytics ecosystem.
It’s not uncommon for data.world customers to track thousands of metadata attributes across dozens of business units with hundreds of thousands of data assets each.
As data.world Chief Product Officer Jon Loyens wrote for Towards Data Science, “If you impose limits on what gets cataloged, you risk losing potentially critical context for your data.”
Three hallmarks of an extensible data catalog
In the context of a data catalog, extensibility is the platform’s ability to quickly and easily catalog new data sources without overhauling the underlying metadata models or configuration and without forcing a redeployment of infrastructure. Your data catalog should be able to absorb new information about your data and analytics ecosystem or represent new lines of business without costly re-engineering. Here are the three hallmarks of an extensible data catalog.
As we recently wrote, if you want supreme flexibility, your data catalog needs to be cloud native. Traditionally built and deployed catalogs typically require software to be set up and/or hardware to be provisioned (either on-prem or by your cloud provider). They’re also notoriously slow to update, requiring extensive migrations when new versions appear.
Cloud-native data catalogs are fully managed, ensuring you get the latest version as soon as possible with zero-migration downtime. At data.world, we release more than 1,000 updates to our platform annually. Everything from small bug fixes to major feature releases is available to everyone – no waiting, no worry. You also don’t have to plan your catalog usage around scheduled downtime... because there isn’t any.
Powered by knowledge graph
In 2012, Demian Hess published an article in the Journal of Digital Media Management that stated:
“Digital asset metadata cannot be represented by a single, unchanging metadata model and schema, because the metadata are too variable, complex, and change too rapidly. Data architects need to embrace flexible models that allow metadata to vary across asset types and that can accommodate changes to the underlying schema.”
He concluded that although flexible models can be implemented using traditional relational databases, the best way to achieve the desired result is by leveraging graph technology, like a knowledge graph.
Knowledge graphs provide a semantic layer that your catalog uses to map complex data to familiar business terms like customer ID or city of residence. A major benefit of this is consistency of understanding: every individual throughout your organization interprets fields in the same way, which helps to minimize data brawls and increase data trust.
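To make the idea of a semantic layer concrete, here is a minimal sketch in Python. It assumes a plain dictionary stands in for the knowledge graph, and the table names, column names, and helper functions are hypothetical, chosen only for illustration. It shows two differently named physical columns resolving to the same business term.

```python
# Hypothetical mapping from (table, column) to a shared business term.
# A real catalog would hold this in its knowledge graph, not a dict.
BUSINESS_TERMS = {
    ("crm.contacts", "cust_id"): "customer ID",
    ("billing.invoices", "customer_no"): "customer ID",
    ("crm.contacts", "res_city"): "city of residence",
}


def business_term(table, column):
    """Return the business term for a physical column, or None if unmapped."""
    return BUSINESS_TERMS.get((table, column))


def columns_for(term):
    """Return every (table, column) pair mapped to the given business term."""
    return sorted(key for key, value in BUSINESS_TERMS.items() if value == term)


# Two systems name the field differently, but both resolve to one term.
crm_term = business_term("crm.contacts", "cust_id")
billing_term = business_term("billing.invoices", "customer_no")
customer_id_columns = columns_for("customer ID")
```

Because both lookups return the same term, anyone reading either table lands on the same definition, which is the consistency the semantic layer is meant to provide.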
Another advantage of a knowledge graph is that you don’t have to entirely rebuild or restructure it to accommodate the following changes:
- Adding new information or processing new data
- Updating entities or metadata
- Adding or removing relationships between content
- Updating the query that maps the taxonomy/ontology to your content
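The changes listed above can be sketched with a toy in-memory triple store. This is an illustration under stated assumptions, not data.world's implementation: the `TinyGraph` class, the asset names, and the predicate names are all hypothetical. The point it shows is that new kinds of assets and new properties are simply more triples, with no schema migration.

```python
class TinyGraph:
    """A toy triple store of (subject, predicate, object) facts."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def remove(self, subject, predicate, obj):
        # discard() is a no-op when the triple is absent
        self.triples.discard((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return sorted triples matching every argument that is not None."""
        return sorted(
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


graph = TinyGraph()
graph.add("orders_table", "type", "Table")
graph.add("orders_table", "ownedBy", "sales_team")

# Adding new information: a property and an asset type never seen before.
graph.add("orders_table", "qualityScore", "0.97")
graph.add("revenue_dashboard", "type", "Dashboard")

# Adding a relationship between content.
graph.add("revenue_dashboard", "derivedFrom", "orders_table")

# Updating metadata: swap one triple for another.
graph.remove("orders_table", "ownedBy", "sales_team")
graph.add("orders_table", "ownedBy", "finance_team")

owners = graph.match(subject="orders_table", predicate="ownedBy")
```

In a relational design, the `qualityScore` property and the `Dashboard` asset type would each typically require a new column or table; here the existing structure is untouched.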
A data catalog built on a relational database, on the other hand, requires significant time, effort, and resources to scale. That’s because relational databases are inherently designed for much smaller, less distributed quantities of data than are common in enterprises today.
Flexible metadata model and open APIs
The final hallmark of an extensible data catalog is having a flexible metadata model with an open API.
At data.world, we don’t limit your ability to represent unique systems, relationships, or properties. Just because we haven’t seen it before doesn’t mean it isn’t valuable. In fact, we make it simple for you to implement these changes by providing easy-to-use templates (like our data quality template). The beauty of powering your metadata through a semantic layer is that although your technology will change, your concepts, relationships, and meaning can persist and evolve over time.
Open APIs give you the ability to quickly and easily catalog new data sources for which an integration has not formally been built. An open API also lets you use your metadata to power downstream data applications (for example, loading the metadata into broader enterprise search applications). Finally, it guarantees the portability of your metadata and helps minimize vendor lock-in.
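As a rough sketch of what cataloging an unsupported source might involve, the Python below builds a plain JSON description of a source and round-trips it. Everything here is an assumption for illustration: the `describe_source` helper, the payload shape, and the `legacy_erp` source are hypothetical and do not reflect data.world's actual API schema. A real integration would send a document like this to the endpoints described in the vendor's API documentation.

```python
import json


def describe_source(name, tables):
    """Build a catalog-neutral description of a data source.

    `tables` maps each table name to a list of its column names.
    The output shape is hypothetical, not a real API schema.
    """
    return {
        "source": name,
        "assets": [
            {"name": table, "columns": sorted(columns)}
            for table, columns in sorted(tables.items())
        ],
    }


payload = describe_source(
    "legacy_erp",
    {"vendors": ["name", "id"], "invoices": ["id", "amount"]},
)

# Plain JSON can be sent to an API or handed to another tool, and it
# survives a round trip unchanged, which is the basis of portability.
serialized = json.dumps(payload, indent=2)
restored = json.loads(serialized)
```

Keeping the description in a neutral format such as JSON means the same metadata can feed the catalog, an enterprise search index, or a future tool without being rewritten.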
Creating an integration or application with data.world empowers your users to connect, explore, and share data from across disparate sources and systems. Whether you’re building an entirely new application using data or integrating your own data products with data.world, you can make it easy for users to fully utilize the features of your product along with the powerful data resources available from data.world.
We offer a suite of materials that will assist you in using our API and other features to do everything from completing small tasks to developing large-scale data apps – we call this the Developer Toolkit. You’ll find information ranging from embedding content with oEmbed to the specifics of each endpoint included in data.world’s API.
Tour data.world’s extensible data catalog
Now that you know the advantages of an extensible data catalog, take a look at what data.world has to offer by visiting our product tour.