What do you do first, build a data warehouse or catalog your data? It’s not quite the chicken and egg question, but it is a dilemma that many enterprise data leaders face. In this blog, we’ll explore why and how you should do both at the same time. Read on if you’re modernizing your data architecture. If you already have a data warehouse, then it’s time for a data catalog.
Choose your fighter: data warehouse or data catalog
This is a renaissance era for data management and storage technologies. Cloud services like Snowflake, Google BigQuery, and Databricks Delta Lake make it easy to create, administer, and maintain data warehouses and data lakes. Microsoft and AWS also offer a variety of storage and query tools that suit nearly any use case.
It’s not surprising that many of our customers and prospects are replacing their rigid, inflexible legacy data warehouses and lakes with scalable, low-maintenance cloud platforms. However, providing discovery and understanding of the data assets that are available in an enterprise is critical to adoption. That’s where a data catalog comes in.
It’s natural for data management professionals to ask, “should I build my new data warehouse first or focus on building a data catalog?”
The argument for starting with a data catalog is that you can understand what data is already available and what is actually of value before migration. The counter-argument is that if you stand up a high-value data warehouse first, you can build your catalog for the new environment and avoid repeated work.
Both approaches are actually wrong and run counter to agile principles for software development and data governance. By phasing these projects, you’re falling prey to old-school waterfall methods that delay or minimize ROI, fail to build community, and put adoption at risk.
If you attempt to catalog your entire enterprise data landscape first, you’ll find yourself weeding through irrelevant, unused data sources and noisy query logs before you find anything of value to move over to the new platform. If you attempt to build out your warehouse first, you’ll be on an unending requirements-gathering quest, trying to figure out the complete set of assets to be migrated. And once the migration is complete, will you really want to do all the documentation post-hoc? You’ll accrue so much knowledge debt that paying it down will be a Sisyphean task. Both of these ideas represent boil-the-ocean thinking that rarely delivers results.
Get it right by applying agile methods to data stack development
Building your new data platform at the same time as your enterprise data catalog lets you reap the benefits of the new platform from the first iteration instead of waiting for the whole project to finish. It also helps you avoid the pitfalls of typical waterfall development methods that plague data and analytics and slow innovation. Adopting agile principles in your data governance and management process will get your organization to ROI on modern tools sooner.
The methods described below have proven successful both at data.world and with our customers:
1. Build an analytics backlog
Create a list of metrics that the business needs or wants. It’s usually best to phrase the metrics as questions like, “what is the daily average session length of visitors to our website” or “what is our average order value for a certain time period?” This is the equivalent of user stories in software development. By starting with high-value questions, you’ll see patterns emerge that can help with your next step.
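As a rough illustration of what such a backlog could look like in practice, here is a minimal Python sketch. The `AnalyticsStory` class, its fields, the tags, and the third question are all hypothetical and invented for this example; in a real project the backlog would more likely live in an issue tracker. The point is only that tagging each question makes recurring subject areas visible.

```python
from dataclasses import dataclass, field


@dataclass
class AnalyticsStory:
    """One backlog item: a metric phrased as a business question."""
    question: str
    requested_by: str                    # hypothetical: the team asking
    tags: list = field(default_factory=list)  # hypothetical subject-area tags


# The first two questions come from the text; the third is made up.
backlog = [
    AnalyticsStory(
        "What is the daily average session length of visitors to our website?",
        "marketing", ["web", "sessions"]),
    AnalyticsStory(
        "What is our average order value for a certain time period?",
        "finance", ["orders", "revenue"]),
    AnalyticsStory(
        "How many orders did returning visitors place last month?",
        "marketing", ["web", "orders"]),
]


def stories_by_tag(stories):
    """Group questions by tag so recurring subject areas stand out."""
    groups = {}
    for story in stories:
        for tag in story.tags:
            groups.setdefault(tag, []).append(story.question)
    return groups


groups = stories_by_tag(backlog)
for tag, questions in sorted(groups.items()):
    print(f"{tag}: {len(questions)} question(s)")
```

Tags that collect several questions (here `web` and `orders`) hint at the subject areas your first data models will need to cover.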
2. Decide on an architectural style
Like a well-designed application or piece of software, the data in your data warehouse or data lake should conform to an architectural style. You can select the style (star schema, snowflake schema, data vault, and many others) based on the kinds of questions in your analytics backlog and the shape and types of data you predominantly have available in your enterprise. Think about layers of data models from raw data to clean data to transformed analytic models. You can compare this layering to layering software from raw API to business logic to UX.
The architectural style you choose will have a big impact on how analysts and data scientists access and use the data. Applying this style consistently will make your data platform much more usable for all data consumers. At data.world, we use a star schema layout and an ELT (extract, load, transform) architectural pattern. The star schema layout of fact and dimension tables works particularly well for tracking the activity of our membership base, and it also lets us pivot our analytics by time period or by customer org.
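To make the star schema idea concrete, here is a small sketch using Python's built-in `sqlite3` module. The table names, column names, and sample rows are invented for illustration and are not data.world's actual model. It shows one fact table joined to two dimension tables, and the same fact data pivoted by time period or by customer org.

```python
import sqlite3

# In-memory database with a tiny, made-up star schema:
# one fact table (activity) and two dimensions (org, date).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_org  (org_id  INTEGER PRIMARY KEY, org_name TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT);
CREATE TABLE fact_activity (
    org_id  INTEGER REFERENCES dim_org(org_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    events  INTEGER
);
INSERT INTO dim_org  VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO dim_date VALUES (1, '2024-01-30', '2024-01'),
                            (2, '2024-01-31', '2024-01'),
                            (3, '2024-02-01', '2024-02');
INSERT INTO fact_activity VALUES (1, 1, 5), (1, 2, 7), (2, 2, 3),
                                 (1, 3, 4), (2, 3, 6);
""")


def events_by(dimension_column):
    """Sum fact rows grouped by one dimension attribute."""
    allowed = {"org_name", "month"}  # whitelist, since the name is interpolated
    if dimension_column not in allowed:
        raise ValueError(f"unsupported dimension: {dimension_column}")
    query = f"""
        SELECT {dimension_column}, SUM(f.events)
        FROM fact_activity AS f
        JOIN dim_org  AS o ON o.org_id  = f.org_id
        JOIN dim_date AS d ON d.date_id = f.date_id
        GROUP BY {dimension_column}
        ORDER BY {dimension_column}
    """
    return conn.execute(query).fetchall()


print(events_by("month"))     # [('2024-01', 15), ('2024-02', 10)]
print(events_by("org_name"))  # [('Acme', 16), ('Globex', 9)]
```

The fact table never changes between the two queries; only the dimension you group by does. That is what makes this layout convenient for pivoting.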
3. Select a toolchain
Once you have an architectural style and a backlog of analytics stories, it’s time to choose some tools. How well these tools work together is critical to maintaining agility in a world with ever-expanding data science and analytics use cases. Different data platforms support different architectural styles. The linchpins of the toolchain are your data platform/query layer, your ETL or ELT data-integration tooling, and your data catalog.
Data quality, profiling, lineage, and other tools can be integrated as your use matures. Having a data catalog with an open and flexible metadata model is critical to adding new tools over time. It also gives you the basis to expand your BI, ML/AI, and data science toolbox as the needs of your data consumers grow. At data.world, we’ve adopted Jira to manage our analytics backlog, Snowflake for our data platform, dbt for transforms, and a variety of analytics tools. All of this is coordinated via a data.world data catalog.
4. Gather your team
Now it’s time to bring together the data consumers and producers who will be working on the initial analytics stories. Good agile processes incorporate a diversity of stakeholders at every touchpoint. This keeps feedback loops tight and might be the single most important thing that drives adoption. Consider who will coordinate your data sprints as well, and appoint someone to play the role of data product manager or owner at this point too. Data engineers, stewards, and product managers cannot go into a cave for months only to emerge and expect analysts and data scientists to start using the results.
5. Pick your first analytics stories
In classic agile/scrum fashion, now is the time to group, prioritize, and select the first set of stories to tackle. All the stakeholders should be involved in this process. Grouping can be done using traditional techniques like card sorting and affinity exercises. Sizing, business impact, and the team available can also play a role in which stories get done first.
Make sure to keep the analysis concrete, not hypothetical. Pick stories that are closely tied to jobs that the data consumers need to get done so that clear, measurable value is delivered in the end. Additionally, time-box these deliverables and set a date to measure the results. This will help you rein in the temptation to boil the ocean on your first iteration.
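One simple way to turn sizing, business impact, and a time box into a first selection is sketched below. The scoring rule (impact divided by size), the 1 to 5 impact scale, and the sample stories are assumptions made for this example, not a prescribed method; most teams will settle these in a planning conversation rather than in code.

```python
from dataclasses import dataclass


@dataclass
class Story:
    question: str
    impact: int  # hypothetical business-impact score, 1 (low) to 5 (high)
    size: int    # hypothetical effort estimate in days


def pick_first_sprint(stories, capacity_days):
    """Greedy pick: best impact-per-day first, until the time box is full."""
    ranked = sorted(stories, key=lambda s: s.impact / s.size, reverse=True)
    selected, remaining = [], capacity_days
    for story in ranked:
        if story.size <= remaining:
            selected.append(story)
            remaining -= story.size
    return selected


# Made-up candidate stories for illustration.
candidates = [
    Story("Average order value by month", impact=5, size=2),
    Story("Full customer lifetime value model", impact=3, size=5),
    Story("Daily average session length", impact=4, size=3),
    Story("Count of active orgs per week", impact=2, size=1),
]

sprint = pick_first_sprint(candidates, capacity_days=6)
for story in sprint:
    print(story.question)
```

With a six-day time box, the large lifetime-value story is left for a later sprint, which is the kind of outcome a time box is meant to force.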
6. Gather your sources in a data catalog
It’s time for data producers (typically DBAs or data engineers) to gather up raw data sources to answer the questions posed in the first set of analytics stories. As producers curate sources by story in your data catalog, consumers can evaluate and ask questions about those sources. These initial questions and findings are critical to capture in real time; they can’t be allowed to disappear into the ether of chat or email. A great data catalog makes this curation, profiling, and question process fluid and eases the overall workflow. This is the step where it becomes clear why you should build a catalog and warehouse at the same time.
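The toy sketch below shows the shape of that workflow: sources are attached to analytics stories, and consumer questions are recorded next to the source they concern. The `StoryCatalog` class and every name in it are hypothetical and exist only for this example; this is not the API of data.world or any other catalog product.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Source:
    """A raw data source curated for one or more analytics stories."""
    name: str
    owner: str
    questions: list = field(default_factory=list)  # (author, text) pairs


class StoryCatalog:
    """Toy in-memory catalog that groups sources by analytics story."""

    def __init__(self):
        self._sources = {}                  # source name -> Source
        self._by_story = defaultdict(list)  # story -> [source names]

    def curate(self, story, name, owner):
        """Register a source (once) and attach it to a story."""
        self._sources.setdefault(name, Source(name, owner))
        if name not in self._by_story[story]:
            self._by_story[story].append(name)

    def ask(self, name, author, text):
        """Record a consumer question next to the source it concerns."""
        self._sources[name].questions.append((author, text))

    def sources_for(self, story):
        """Return the sources curated for a story, with their questions."""
        return [self._sources[n] for n in self._by_story.get(story, [])]


# Made-up stories, sources, and question for illustration.
catalog = StoryCatalog()
catalog.curate("average order value", "orders_raw", owner="data-eng")
catalog.curate("average order value", "refunds_raw", owner="data-eng")
catalog.curate("daily session length", "web_events_raw", owner="web-team")
catalog.ask("orders_raw", "analyst", "Are cancelled orders included?")

for source in catalog.sources_for("average order value"):
    print(source.name, source.owner, source.questions)
```

Because the question is stored with the source rather than in a chat thread, anyone who later picks up the same story sees it alongside the data.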
7. Build and document your data assets
As data sources get refined into the architectural style you’ve selected, data consumers should be working with the data in real time and evaluating how well the models answer the metrics questions posed. Data stewards build data dictionaries and business glossaries right next to the data being used. Since you’ve curated the sources by analytics story, the appropriate data assets are now discoverable by purpose. By making the data catalog the fulcrum around which the collaboration happens for your new data platform, all this knowledge capture happens in real time. This minimizes the chore of having to go back and scrape data dictionaries from Google Sheets or write boring documentation. By incorporating your data catalog AS YOU BUILD THE ASSETS, you’re ensuring their reuse and minimizing your knowledge debt.
8. Peer review the analysis
At the end of your first data sprint, it’s time to peer review the work. The process is far more efficient with an enterprise data catalog in place. Your data catalog acts as a consumer- and SME-friendly environment for asking questions and understanding results. It also prevents the kind of data brawls that happen when people show up to decision meetings with different results and definitions. Everyone can see who has contributed to the work and what other questions have been asked. Work can be quickly and efficiently validated and extended. Your data work is all in one place: the data catalog.
9. Publish the results
Congratulations! You’ve got your first set of data models in your shiny new awesome data management platform. Everything is well-curated by analytic stories, peer-reviewed, and documented in your cloud data catalog. You’ve done something good for the business and made it reusable at the same time. Best of all, your team did it without having a massive post-hoc documentation effort because the work was done in the data catalog from the beginning.
10. Refine and expand
By working in an agile way with your data platform and data catalog at the same time, your assets will be well documented and organized by the time they’re published. With the next sprint coming up, you can now expand or refine the assets that are already published. A jumping-off point where assets are well documented and organized around use cases makes the next sprints easier and easier. You can then expand the program to include more lines of business or working groups. This expansion drives adoption, data literacy, and the data-driven culture we all aspire to.
If you’ve already started down the path of building out a new data warehouse or data lake, you can still adopt agile data governance practices and chip away at any knowledge debt you have (it’s never too late!). Adopting a data catalog that allows you to work iteratively on this will be the key to not feeling like you have an ocean to boil. If you’re interested in learning more or if you’re already working in this way, we’d love to hear from you. Please contact me at firstname.lastname@example.org.