Ever wondered why your dataset or project didn’t get recognized? Eager for collaborators but can’t seem to find the right team? You may have missed the mark on data quality.
Quality data is more approachable: it determines whether people can understand and use your data. If you focus on quality from the beginning, people will find it easier to engage with your data and will ultimately be more likely to work with it or incorporate it into their existing work. With great data, the data.world community can create a flywheel effect of data enthusiasm and collaboration.
When it comes to the data.world community, we define data quality as completeness, freshness, reputation, uniqueness, and consumability. While data quality has many definitions and each depends on context and your individual requirements, these five traits are foundational to creating useful data and analysis.
So how can you make sure your data meets the mark? This guide explains how you can use data.world to make sure your data meets each benchmark.
1. Completeness
Is your dataset or project complete? You can make your data and analysis easier to find and more useful to others by documenting it. All it takes is creating metadata. It’s easier than it sounds!
Metadata provides important contextual details. Most datasets aren’t self-describing, so you need to help others understand how to use them to their full potential. These contextual details are especially valuable to other users who happen to stumble upon your work.
We designed this checklist to help you hit all the right metadata notes:
- Description: It’s your hook! A description should be short and serve as a quick reference for the dataset or project it describes.
- Summary: Use this section to tell your data's story. Make sure to include where the data came from. Cite and link to your sources here.
- Data dictionary: Define terms and data types to remove confusion.
- Tags: Tag your dataset or project to improve discoverability. Add multiple tags using your tab key.
- License: Choose a license so others know how they may use your data. If you’d like to read our thoughts on common licenses and which ones we love, check out this blog on licensing we recommend for data.
- Data file: Make sure your dataset actually contains a data file! One of the greatest and most immediate indicators of lack of quality is the absence of any data files. Is it really a dataset if there’s no data in it?
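As a rough illustration, the checklist above can be turned into a quick self-check before you publish. This is a minimal sketch, not the data.world API: the field names in `REQUIRED_FIELDS` and the example record are hypothetical and only mirror the checklist items.

```python
# Hypothetical field names mirroring the checklist above; this is not
# the data.world API or its actual metadata schema.
REQUIRED_FIELDS = ["description", "summary", "data_dictionary", "tags", "license", "files"]


def missing_metadata(dataset: dict) -> list[str]:
    """Return the checklist fields that are absent or empty in `dataset`."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = dataset.get(field)
        # None, "", [] and {} all count as "not filled in".
        if not value:
            missing.append(field)
    return missing


# Made-up example record: no summary, no data dictionary, no data files.
example = {
    "description": "Monthly rainfall totals by weather station",
    "summary": "",
    "tags": ["weather", "rainfall"],
    "license": "CC-BY 4.0",
    "files": [],
}
print(missing_metadata(example))
```

An empty result means every item on the checklist has at least some content. It says nothing about how good that content is, so a human read-through is still worthwhile.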
Lastly, consider whether your dataset or project will be open or private to the data.world community. As a Public Benefit Corporation, data.world thrives on collaborative data sharing, and part of our mission is the proliferation of open data. We encourage collaborative data contributions whether that's across teams in your business or as part of the world's largest collaborative data community.
2. Freshness
How fresh is your dataset or project? Data becomes stale when it goes out of date and no longer reflects reality (typically, “reality” is the original source of your data). You might be asking, “So what’s the big deal?” Stale data can invalidate your data work and cause further problems, including:
- Insights become irrelevant or even misleading.
- Insights cannot be created on the fly because time-consuming manual data collection must be done first.
- Consumers of your data work lose trust. (More about trust in the following section, Reputation.)
To keep things fresh, make sure you update your data on a regular basis. This is easy on data.world. Sync your data with automatic sync options so you don’t have to worry about it again. Simply add files from URLs and define sync options, whether that is hourly, daily, or weekly. You can feel good about always having fresh data, spending less time on intense manual work.
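If you want to reason about freshness yourself, one simple rule is to treat data as stale once more than one sync interval has passed since its last update. The sketch below assumes that rule; the `SYNC_INTERVALS` mapping and the `is_stale` helper are hypothetical names for illustration and are not part of data.world.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical mapping of the sync options named above to time spans.
SYNC_INTERVALS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}


def is_stale(last_updated: datetime, schedule: str, now: datetime) -> bool:
    """True when more than one full sync interval has passed since the last update."""
    return now - last_updated > SYNC_INTERVALS[schedule]


# Made-up timestamps: the data was last refreshed two days before "now".
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
last = datetime(2024, 1, 8, 12, 0, tzinfo=timezone.utc)
print(is_stale(last, "daily", now))   # two days old on a daily schedule
print(is_stale(last, "weekly", now))  # still inside the weekly window
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check easy to test and repeat.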
3. Reputation
Can people trust your dataset or project? As in life, your reputation is everything on data.world. Most people know that evaluating sources is an important part of the research process. If there is no clear indication that your data or analysis is trustworthy, your work will likely be ignored or regarded as unreliable.
You can establish trustworthiness by answering these questions about your dataset or project:
- Does the source include citations and origins of the data? Transparent methodologies? A clear chain of provenance?
- Are data files substantiated by supplemental documentation, such as data dictionaries?
- Is the data from an unbiased source, such as a government or educational institution website?
While they’re not all mandatory, these show your data is trustworthy. The more you have, the easier it is for others to trust your data and use it, knowing it will give them the most complete, accurate, and relevant information they need to make a decision, tell a story, or understand the business.
4. Uniqueness
Is your dataset or project unique? There is a lot of data on data.world, so search first to see if the same dataset has already been uploaded. There’s nothing wrong with uploading your own copy, but improving someone else's established work through collaboration or direct linking will keep that data’s “narrative” in one place and can benefit the entire community.
If you find out another user has beaten you to uploading similar data, don’t worry! There is still an opportunity to contribute. Here are two solutions:
- Ask to become a contributor and start sharing and discussing insights directly in the dataset or project.
- Create a project and connect existing data.world data sources using the “+ Add” button in the project’s workspace.
5. Consumability
Is your dataset or project consumable? Will others be able to use your data? Here’s how to make sure:
Queryability
Ensure your data is queryable. Sometimes a file’s formatting doesn’t translate the way you think it will, making querying impossible. Watch out for potentially corrupt or incomplete parts of your data and these other common gotchas:
- File formatting errors like multiple headers, merged cells, and improperly escaped characters can result in ambiguous column names, such as “col a”, “col b”, etc. For best results, remove complex formatting from Excel workbooks and follow the definitions of the CSV format.
- Long file or column names can make tables hard for others to use. Keep names as succinct yet descriptive as possible.
- Date columns can be confusing when working in formats that store dates as sequential integers. If you have dates in your dataset, we recommend converting the file to CSV or TSV before uploading.
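Some of these gotchas can be caught before you upload. The sketch below uses only the Python standard library and is not a data.world tool. It flags blank, duplicate, or very long column names in a CSV header, and it converts a date stored as a sequential integer, assuming Excel's 1900 date system. The 30-character limit and the helper names are arbitrary choices for this example.

```python
import csv
import io
from collections import Counter
from datetime import date, timedelta

MAX_NAME_LENGTH = 30  # arbitrary threshold chosen for this sketch


def header_problems(csv_text: str) -> list[str]:
    """Report blank, duplicated, or overly long column names in the first row."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    names = [name.strip() for name in header]
    counts = Counter(names)
    problems = []
    for position, name in enumerate(names, start=1):
        if not name:
            problems.append(f"column {position}: blank name")
        elif counts[name] > 1:
            problems.append(f"column {position}: duplicate name '{name}'")
        elif len(name) > MAX_NAME_LENGTH:
            problems.append(f"column {position}: name longer than {MAX_NAME_LENGTH} characters")
    return problems


def excel_serial_to_date(serial: int) -> date:
    """Convert an Excel 1900-system serial number to a calendar date.

    Valid for serials from 61 (1 March 1900) onward; earlier values are
    affected by Excel's treatment of 1900 as a leap year.
    """
    return date(1899, 12, 30) + timedelta(days=serial)


# Made-up CSV text with one blank and one repeated column name.
sample = "station,,rainfall,rainfall\nA1,x,3.2,3.4\n"
print(header_problems(sample))
print(excel_serial_to_date(43831))  # 2020-01-01
```

Workbooks saved with a different date system, such as the 1904 system, need a different starting date, so confirm which one your file uses before converting.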
Matchability
A knowledge graph powers the data.world data catalog. As a result, we can match, enhance, and understand your data.
Read our blog to learn more: what is a data catalog?
As we process your data, we look for matches with any known data types within the data.world system, which helps find relationships across data files and align all data on our platform with industry standards.
Make sure you are getting the most out of our data matching functionality. Click on those little green triangles next to column names for our matches!
File Size & Format
Is your file too big? Try compressing large files to a size that makes it easier for others to quickly work with your data. And remember: Sampling can be useful for datasets that are too large to efficiently analyze in full.
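Both ideas can be tried locally with the Python standard library. The sketch below is one possible approach, not a data.world feature: it keeps the header plus a repeatable random sample of rows, and it gzip-compresses the full file. It splits the file on line breaks, so it assumes no quoted field contains a newline. The `sample_lines` helper and the generated rows are invented for illustration.

```python
import gzip
import random


def sample_lines(lines: list[str], k: int, seed: int = 0) -> list[str]:
    """Keep the header line plus up to k randomly chosen data lines, in original order."""
    header, body = lines[0], lines[1:]
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    chosen = sorted(rng.sample(range(len(body)), min(k, len(body))))
    return [header] + [body[i] for i in chosen]


# Made-up data: a header and 1,000 generated rows.
lines = ["id,value"] + [f"{i},{i * i}" for i in range(1000)]
sampled = sample_lines(lines, k=50)

raw = "\n".join(lines).encode("utf-8")
compressed = gzip.compress(raw)
print(len(sampled), len(raw), len(compressed))
```

A random sample preserves the overall shape of the data better than taking the first rows, which may all come from the same time period or source. Compression is lossless, so the original bytes come back unchanged when the file is decompressed.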
There are no restrictions on file types that can be uploaded or downloaded on data.world. However, certain formats, such as PDF, are not queryable, so you’ll only be able to store and view them.
One important thing to remember about file size and format is that they’re not always good indicators of quality. Bigger isn't always best for what you need, and sometimes the juiciest information is packaged up in a different format than you'd expect.