Do you spend hours searching for the one person in your company who holds the keys to the data you need? Once you have the data, do you struggle to understand it? Do you even trust it?
If any of this sounds familiar, you’re in great company. IBM estimates that bad data costs the US economy around $3.1 trillion annually, while The New York Times reports that data scientists spend 50 to 80% of their time on mundane janitorial data work. So what’s the solution?
Data catalogs are one increasingly popular way to address the challenge of finding, understanding, and trusting data. But if you go down this path, you must answer the eternal enterprise software question: Should you build it or buy it? We’ve seen success and failure on both sides, so let’s consider the options.
A handful of companies such as Uber, LinkedIn, and Airbnb have successfully built and launched their own internal data catalogs (most on knowledge graphs). But they stand apart from hundreds of companies that have sunk months (often more than a year) of time and money into a catalog that fails to gain adoption.
Many companies alternatively choose to buy a catalog outright. Some multiply their business value significantly, while others waste hundreds of thousands of dollars on shelfware.
An x-ray into the building process
You’re probably wondering how Uber and Airbnb built their own catalogs. Is it possible to model your data catalog initiative after theirs? After all, those rollouts were successful. Let’s take a look at how they did it.
You need people
Building your own catalog is an expensive and labor-intensive data project because it requires a dedicated team to essentially develop a new product. They’ll need to research the best way to implement it, develop the core product, and then launch it internally.
Based on what we know of homegrown data catalogs, we estimate that most organizations need at minimum five engineers assigned to the catalog permanently. This number is even higher during the building and implementation stages, where having around a dozen full-time engineers is not unheard of.
The people cost alone puts this project well into seven digits, not to mention the productivity lost by shifting these resources away from your organization’s core projects.
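To make the arithmetic concrete, here is a rough back-of-the-envelope sketch. The headcounts come from the estimates above (around a dozen engineers during the build, five afterward). The fully loaded annual cost per engineer and the 12-month build period are placeholder assumptions, not figures from any real company, so substitute your own.

```python
# Back-of-the-envelope staffing cost for a homegrown data catalog.
# ASSUMPTION: the cost per engineer and the build duration below are
# illustrative placeholders, not data from the article or any company.

def staffing_cost(engineers: int, months: float, annual_cost_per_engineer: float) -> float:
    """Return the people cost of keeping `engineers` on a project for `months`."""
    return engineers * annual_cost_per_engineer * (months / 12)

ASSUMED_ANNUAL_COST = 200_000  # hypothetical fully loaded cost per engineer (USD)
ASSUMED_BUILD_MONTHS = 12      # hypothetical build-and-rollout period

# Around a dozen engineers while building, five permanently afterward.
build_cost = staffing_cost(12, ASSUMED_BUILD_MONTHS, ASSUMED_ANNUAL_COST)
yearly_run_cost = staffing_cost(5, 12, ASSUMED_ANNUAL_COST)

print(f"Build phase: ${build_cost:,.0f}")
print(f"Each year after launch: ${yearly_run_cost:,.0f}")
```

Under these assumed inputs, both the build phase and each subsequent year reach seven digits on salaries alone, before counting any lost productivity elsewhere.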
It takes time
Even with open source data catalogs such as CKAN, Amundsen, and others available on the market, companies continue to struggle to launch their data catalogs. These offerings might be free on paper, but they come with a lot of baggage.
We recently spoke with a major US financial company that walked us through their frustrating journey building a catalog. It took more than eight months to deploy their data catalog, a replacement for an existing platform that wasn’t getting the job done. They were stuck in a spiral and not getting anywhere despite having the budget and resources to take on the challenge.
When thinking about building, consider the competitive landscape. Can you afford to go another year without a solution to your data management problems?
Plans and designs for the present
Companies are investing more in their data strategies, but throwing additional funding at a project doesn’t always yield better results. Homegrown catalogs often run into an issue that money can’t immediately solve: they are designed and built for today’s needs, yet they won’t deploy for a year. That means on day one, the catalog is using yesterday’s technology.
They become obsolete in two ways:
- They likely won’t support the tools you’ll be using a year from now.
- Data catalog standards change quickly, and it’s tough to keep up (compare Gartner’s top data trends in 2019 vs. 2020 to see how fast the industry evolves).
You can avoid obsolescence by keeping your in-house catalog development team in place permanently, alongside your IT team.
For the few
Let’s say your do-it-yourself catalog is technologically viable. It can pull data from various sources, and provide a catalog experience similar to your computer’s file browser. But the secret sauce is still missing. That’s because your team of engineers is building a platform that meets their needs without taking others into consideration.
That is, the platform is likely built for more technical people, not the business users. There is no clean, unified UI. Consider the many internal tools that you work with, no matter your industry or your role: UX is often sidelined. Yet for a catalog to succeed, it needs users from all areas of the business, regardless of technical knowledge.
Internal catalogs often have no Facebook-, Google-, or Amazon-like experience for collaboration, search, or browsing. These user-friendly experiences are often pushed aside in development as companies seek to tick a list of feature boxes, neglecting the needs of the business users, the ultimate consumers of data.
The end result is that business users are discouraged from using the platform, and building a data-driven culture becomes impossible.
Building is just the beginning
Building a catalog is daunting enough, but construction isn’t all you’re signing up for (remember: open-source is not a free investment). You also need to manage it, maintain a development cycle, and resolve technical queries and issues that eventually pile up.
The work doesn’t end once your data catalog is up and running. We’ve heard that companies that manage their own catalogs have up to seven dedicated data engineers on the project. This alone can cost several times more than a fully managed cloud data catalog. (When weighing your options, note that not all data catalog vendors offer fully managed solutions, so watch out for this potential cost.)
The hidden cost of this long-term investment should not be ignored. You’ll likely pay 2x to 3x more to maintain a stagnant, outdated piece of software versus buying a catalog with continuous updates and support costs built in.
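As a rough illustration of how that multiple compounds, the sketch below applies the 2x to 3x range to a multi-year horizon. The annual subscription price and the three-year horizon are hypothetical placeholders chosen only to show the calculation, so plug in a real vendor quote and your own planning window.

```python
# Compare multi-year maintenance spend: bought catalog vs. in-house catalog.
# ASSUMPTION: the subscription price and time horizon are hypothetical;
# only the 2x-3x multiple comes from the article's estimate.

def maintenance_range(annual_vendor_cost: float, years: int,
                      low_multiple: float = 2.0, high_multiple: float = 3.0):
    """Return (vendor_total, in_house_low, in_house_high) over `years`."""
    vendor_total = annual_vendor_cost * years
    return vendor_total, vendor_total * low_multiple, vendor_total * high_multiple

ASSUMED_SUBSCRIPTION = 100_000  # hypothetical annual vendor price (USD)
ASSUMED_YEARS = 3               # hypothetical planning horizon

vendor, in_house_low, in_house_high = maintenance_range(ASSUMED_SUBSCRIPTION, ASSUMED_YEARS)

print(f"Bought catalog over {ASSUMED_YEARS} years: ${vendor:,.0f}")
print(f"In-house maintenance over {ASSUMED_YEARS} years: "
      f"${in_house_low:,.0f} to ${in_house_high:,.0f}")
```

The gap widens linearly with every additional year, which is why the maintenance phase, not the initial build, tends to dominate the long-run comparison.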
Data and analytics leaders require an ever-increasing velocity and scale of analysis in terms of processing and access to succeed in the face of unprecedented market shifts.
– Rita Sallam, Distinguished VP Analyst, Gartner
Among the best-known catalogs are government open data portals such as data.gov and data.gov.uk, which are built on CKAN. Do they work? Absolutely. But the way data is used and the scale of data production have changed significantly since these portals launched, and they simply can’t keep up. Here’s why:
Data is hard to find. You could get a non-obvious deprecated dataset as one of your first few results when searching.
Data is hard to use. Datasets are distributed as Excel or zip files, need to be cleaned and normalized, then plugged into another tool for analysis.
Data is hard to understand. Limited documentation exists outside of a contact email and the name of the dataset.
The cost adds up, fast.
In many ways, maintaining a data catalog for the long haul is spending on life support, not innovation. Companies like Uber, LinkedIn, and Airbnb succeeded because they had the resources and a clear point of view.
A mature, data-driven culture from top to bottom is a must-have to build a catalog successfully. For organizations still looking to introduce a shift in the way data work is done, building before investing in culture introduces significant risk.
Is buying the foolproof answer?
Designed for the user
Data catalog tools all take different approaches, and you need to find one that best positions your business to get the most value out of your data. Central to this is enabling your entire organization to work with data.
An intuitive experience for every kind of user stems directly from investing in users and design: having a dedicated UX team and regularly conducting user interviews to inform design decisions. These intentional development choices result in a data management and governance experience that is agile, resilient, and flexible enough for multiple use cases.
Don’t overlook this. Otherwise, your data catalog evaluation and purchase are destined to end up like many other expensive software purchases: shelfware.
Software and service
Most leading vendors include a service component with a data catalog purchase. Catalogs alone don’t solve data problems; you also need the human touch. This is often the most ignored part of homegrown data catalog initiatives, and ignoring it almost guarantees the investment will fail.
That’s why engagement is important: continuous training and reinforcement of best practices enable you and your team to be more successful. You need to work with a vendor that’s open to sharing expertise, is responsive, and continues to advocate for your success well beyond your initial engagement.
Empowering your hidden data workforce is the not-so-secret ingredient to success. A vendor that places user adoption at the center is sending a strong signal that it wants to help you develop a data-driven culture and ensure your business exceeds the returns you expect from a catalog investment.
Finally, you need a data catalog vendor that’s a trailblazer in setting data practices for the industry. Consider what their roadmap is and why they’ve chosen to approach it that way. Who are their advisors and do they consult with them as they make their decisions?
These are important questions to ask because they clarify the vendor’s vision for the future, how they see tomorrow’s world of data work, and the reasons and methods behind that perspective.
Build an enterprise data catalog if you’ve set appropriate expectations: the scope of what it will do technologically, and the development and maintenance resources that will be set aside specifically for this project indefinitely. UI is the differentiator: if your main users are technical, then an in-house catalog built by them will be best optimized for them to get value out of data.
On the flip side, if you need to hit the ground running with a solution that works for your entire team, buying a cloud data catalog is the faster and more agile option today. Don’t forget to lean into your catalog vendor’s expertise as you’re creating a data-driven culture, and work with data comfortably knowing you have the latest technology on the market.