Photo: “Linderman Library, Lehigh University” by NDXP is licensed under CC BY 2.0

Can you imagine trying to find a bit of information on the Web without using a search engine? Of course not. Similarly, so many public datasets are now available online that they, too, are impossible to sort through unaided.

As the volume of online data continues to increase, the ability to locate and identify usable and useful datasets becomes a critical challenge. This is especially true for people whose work requires finding information across different domains, from different repositories, and in different formats.

We need new approaches to full-content dataset search. We need new search tools that can ultimately assist many types of users to locate data for different purposes, and enable public dataset discovery and reuse, regardless of who produced the data or where it is stored. 

A dataset search engine like this would benefit society by helping searchers accelerate their work and reduce duplicated effort. In particular, it would benefit users such as data journalists, who constantly look for data from different sources, fields, and repositories. Data promises a new source of evidence for story discovery and a new means of storytelling and fact-checking, making reporting both meaningful and trustworthy.

We are excited to announce a research partnership with Lehigh University, working with Professor Haiyan Jia from the Department of Journalism and Professors Brian Davison and Jeff Heflin from the Department of Computer Sciences. The goal of their research is to develop a prototype dataset search engine that incorporates new techniques for full-content indexing, enabling searchers to find data across the web, regardless of domain.

How you can help

In order to develop tools that better facilitate cross-domain, interdisciplinary dataset search, we need to first understand users’ needs and preferences, as well as their search strategies and habits. Therefore, we are collaborating with the Lehigh University research team on a survey. 

We believe that you, our users, have the most valuable insights to share. We want to know what the defining factors of a useful dataset search engine would be. We want to hear about your experiences searching, collecting, assessing, and utilizing data, your opinions on current dataset search tools, and your ideas on what can be improved. 

Please consider participating in this 20-minute online survey. Your expertise and input will be a valuable contribution to our understanding of cross-domain dataset search and the conceptualization, design, and development of more effective dataset search tools.

If you have any trouble opening the survey, please copy and paste the URL below into a new browser window or tab:

https://lehigh.co1.qualtrics.com/jfe/form/SV_ddszTSc3Jx8Q61v

If you would like more information or have ideas to share, we’d love to hear from you. Please contact Professor Haiyan Jia (haiyan.jia@lehigh.edu) and Juan Sequeda (juan@data.world) with any questions or comments.

(Note: As a Public Benefit Corporation, data.world’s mission is to build the most meaningful, collaborative and abundant data resource in the world in order to maximize data’s societal problem-solving utility. In the course of this mission, we encounter challenging and exciting engineering problems that our development team tackles every day. We also face problems that have no clear and immediate path to a solution: open scientific research problems. We are excited to partner with leading researchers around the world to share problems and develop novel solutions.)