There are literally millions of datasets on the internet that are open and freely available to use. Whatever your focus, with so much data at your fingertips, it can be tempting to use these resources to improve your data processing skills, uncover new information, join a data competition, and solve real-world problems. But given the sheer volume of data available, where do you start?

When I was first learning how to work with open data, I would often get lost in the possibility of it all. Not only were there seemingly limitless datasets to choose from, the datasets themselves were often large and unexplored.

I would eventually find a dataset that piqued my interest and start “poking” at it: analyzing random questions as they popped into my head, always on the hunt for an interesting result. In many data and science spaces, diving headfirst into analysis without a focused hypothesis driving the exploration is referred to as a “fishing expedition” or “data dredging.”

While these phrases often carry a negative connotation, this type of data exploration can be useful when done correctly (i.e., using appropriate statistical methods and a large enough sample size). It can help you generate hypotheses or report on the “state of affairs” in the domain of your data. But if you hope to use an open dataset to solve a problem or make decisions, you’ll want to start your data project with a clearly defined problem statement instead.
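As one illustration of what “appropriate statistical methods” can look like when you test many questions against the same dataset, here is a minimal sketch of a Bonferroni correction. This is just one common safeguard, not a method this article prescribes, and the function name and p-values below are hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return the indices of tests that stay significant after a Bonferroni correction.

    The significance threshold is divided by the number of tests, which makes it
    harder for a chance result to look "interesting" when many questions are asked.
    """
    if not p_values:
        return []
    threshold = alpha / len(p_values)
    return [i for i, p in enumerate(p_values) if p <= threshold]


# Hypothetical p-values from five exploratory questions (made up for illustration).
p_values = [0.001, 0.04, 0.03, 0.20, 0.008]
print(bonferroni_significant(p_values))  # [0, 4]
```

Without the correction, four of these five results would pass the usual 0.05 cutoff. With it, only two do.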

Note: The rest of this article will assume that you are developing a problem statement using found data, i.e., freely available data that you did not collect yourself. The overarching process is very similar when developing a problem statement prior to finding available data or collecting it yourself, but the order of the steps may be slightly different.

What is a problem statement?

In its simplest form, a problem statement defines the pain point you hope to solve or the impact you hope to make with your work. This statement should be clear, concise, and define a measurable outcome.

Ultimately, it should answer the question: “What is the problem that you are trying to solve?”

Creating a problem statement is an iterative process. You may have a potential problem statement in mind right now, but progressing through the following steps may help you refine it as you gather more information. This is by design! Eventually, you’ll end up with a problem statement that is clear, concise, and actionable.

Understanding the domain

No one expects you to be a domain expert in everything, so don’t let your lack of experience in an area dissuade you from trying to solve a problem in that space. That being said, if you are new to a domain area, make sure that you do your research and, ideally, connect with people who are actually solving problems in it.

For instance, if you want to solve a problem about homelessness in your city, but you don’t know much about it and have never experienced homelessness yourself, you should look for some insider knowledge before defining your problem statement. You don’t want to repeat work that’s already been done or solve a “problem” that didn’t actually need solving. Folks working at homeless shelters, organizing nonprofits, or otherwise trying to solve similar problems will have a wealth of knowledge and may be able to help make sure you’re embarking on a useful, actionable mission.

Understanding the data

If you clearly define your problem statement and intend to collect the data needed to solve the problem yourself, you can design your data collection methods to align perfectly with your question.

Alternatively, when you’re working with found data, you are limited by the biases, caveats, and data collection methods that the creators employed when the data were collected. That means that if you are defining your problem statement based on an existing dataset, you need to take all of these factors into account.

Some things you’ll want to make sure you understand about your data:

In data.world, you’ll find additional information about a dataset, as supplied by its collectors in the “About this Dataset” section. Don’t see all the information you need? You can reach out to the dataset creator and other users in the “Discussions” tab of any dataset.

Note: While the above section implies that you may only be using a single dataset to solve a problem, you can incorporate multiple data resources into your project. Just make sure you understand the suggested information about each data resource.

Once you understand the data resources at your disposal, you can assess whether or not they will help you solve your problem.

For instance, if your study into homelessness requires that you understand daily entrance rates at a homeless shelter, but the data you have contains only weekly admittance rates, you may need to adjust your problem statement. Any mismatch between the needs defined by your problem statement and the information you can actually get from your data should be resolved by either adjusting your problem statement or finding another data resource.
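A quick way to catch this kind of mismatch early is to check how often your data was actually recorded. The sketch below is a rough heuristic, not a complete audit: it assumes you have already pulled the observation dates out of your dataset, and the function names and sample dates are hypothetical.

```python
from datetime import date


def finest_interval_days(observation_dates):
    """Smallest gap, in days, between consecutive distinct observation dates."""
    unique = sorted(set(observation_dates))
    if len(unique) < 2:
        raise ValueError("need at least two distinct dates to infer an interval")
    return min((later - earlier).days for earlier, later in zip(unique, unique[1:]))


def supports_resolution(observation_dates, required_days):
    """True if the data is recorded at least as often as the problem statement needs."""
    return finest_interval_days(observation_dates) <= required_days


# Hypothetical weekly admittance records (made-up dates, for illustration only).
weekly = [date(2022, 1, 3), date(2022, 1, 10), date(2022, 1, 17)]

print(finest_interval_days(weekly))    # 7
print(supports_resolution(weekly, 1))  # False: a daily question, but weekly data
```

If the check fails, you are back at the choice described above: loosen the problem statement (for example, ask about weekly trends instead) or look for a more granular data resource.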

Creating your own problem statement

Remember, the process of creating your own problem statement is iterative, so you may go through each of the above steps a few times. 

But if you make sure to truly understand the pain points in your domain area and what data you have available to you, you’ll be in a great position to create a clear, concise, actionable problem statement.

Looking for more data to help you craft or solve your problem statement? data.world is home to hundreds of thousands of freely available, open datasets.

Search our archive, sign up for a free account to add your own data, or take advantage of our in-browser SQL query engines, collaborative tools, and more.

And if you want to put your new problem statement skills to work, join TigerGraph’s Graph for All Challenge for an opportunity to uncover solutions to global environmental and social problems and a chance to win some serious prizes! (Click through for details.)