

This blog is part of the 2020 Call for Code Global Challenge.

Introduction to open data sets and the importance of metadata

More data is becoming freely available through initiatives such as institutions and research publishers requiring that data sets be made freely available along with the publications that refer to them. For example, Nature magazine instituted a policy requiring authors to declare how the data behind their published research can be accessed by interested readers.

To make it easier for tools to find out what’s in a data set, authors, researchers, and suppliers of data sets are being encouraged to add metadata to their data sets. Data sets use various forms of metadata. For example, the US Government data.gov site uses the standard DCAT-US Schema v1.1, whereas the Google Dataset Search tool relies mostly on schema.org tagging. However, many data sets have no metadata at all. That’s why you won’t find all open data sets through search; instead, you need to go to known portals and explore whether portals exist for the region, city, or topic of your interest. If you are deeply curious about metadata, you can see the alignment between DCAT and schema.org in the DCAT specification dated February 2020. The data sets themselves come in various formats for download, such as CSV, JSON, GeoJSON, and .zip. Sometimes data sets can also be accessed through APIs.
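To make the metadata idea concrete, here is a minimal sketch in Python of the kind of schema.org Dataset description a publisher might embed in a data set’s web page as JSON-LD, which is what tools like Google Dataset Search pick up. The data set name, URLs, and keywords are placeholders invented for illustration, not a real data set.

```python
import json

# A minimal schema.org Dataset description. All field values are
# placeholders for illustration, not a real data set.
dataset_metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example city rainfall measurements",
    "description": "Daily rainfall readings collected by a city agency.",
    "url": "https://example.org/datasets/rainfall",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "keywords": ["rainfall", "weather", "open data"],
    "distribution": [
        {
            "@type": "DataDownload",
            "encodingFormat": "text/csv",
            "contentUrl": "https://example.org/datasets/rainfall.csv",
        }
    ],
}

# Serialized as JSON-LD, this would typically sit inside a
# <script type="application/ld+json"> tag on the data set's page.
print(json.dumps(dataset_metadata, indent=2))
```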

Another way that data sets are becoming available is through government open data initiatives. In the US, data.gov has more than 250,000 data sets available for developers to use. A similar initiative in India, data.gov.in, has more than 350,000 resources available.
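The data.gov catalog is built on CKAN, so in addition to browsing the site you can usually search it programmatically. The sketch below assumes the standard CKAN package_search endpoint at catalog.data.gov and uses the requests library; treat the endpoint path and response fields as assumptions to verify against the current API documentation.

```python
import requests

# CKAN-style catalog search against catalog.data.gov (assumed endpoint).
CATALOG_URL = "https://catalog.data.gov/api/3/action/package_search"

def search_datasets(keyword, limit=5):
    """Print matching data set titles and their download formats."""
    response = requests.get(CATALOG_URL, params={"q": keyword, "rows": limit})
    response.raise_for_status()
    for dataset in response.json()["result"]["results"]:
        formats = sorted({r.get("format", "?") for r in dataset.get("resources", [])})
        print(f"{dataset['title']}  [{', '.join(formats)}]")

search_datasets("flood")  # any keyword works: a topic, a city, a country
```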

Companies like IBM sometimes provide access to data, like weather data, or give tips on how to process freely available data. For example, NOAA weather data for JFK Airport, which comes with an introduction, is used to train the open source Model Asset eXchange Weather Forecaster (you can see the model artifacts on GitHub). You may also be interested in the IBM Data Asset eXchange (DAX), where you can explore useful data sets for enterprise data science. You can also register to access IBM's PAIRS (Physical Analytics Integrated Data Repository and Services) data sets at https://ibmpairs.mybluemix.net/. These data sets are normalized and easy to use.

Another example is Anthem, Inc., which provides researchers and developers access to its secure Digital Data Sandbox to enable solutions to some of healthcare's most complex issues. With a certified de-identified data set* covering more than 45 million unique lives and spanning more than 12 years, the Digital Data Sandbox offers the unprecedented ability to discover insights, build and train algorithms, validate solutions with Anthem experts, and deploy those solutions in the real world. To learn more about the Digital Data Sandbox, go to https://www.anthem.ai/sandbox.

If you're looking for openly available voice data to train speech-enabled applications, Mozilla's multilingual Common Voice data set might be something for you. Each entry in the data set consists of a unique MP3 and corresponding text file. Many of the currently 4,200+ recorded hours in the data set also include demographic metadata like age, gender, and accent that can help improve the accuracy of speech recognition engines. In the latest release, there are 40 languages represented, including English, French, German, Spanish, and Mandarin Chinese (Traditional), but also, for example, Welsh, Kabyle, and Kinyarwanda. As a community-driven project, people around the world who care about having a voice data set in their language have been responsible for each new launch, making Common Voice more global and inclusive with every release.
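If you want a feel for the demographic metadata before training anything, a few lines of pandas are enough. The sketch below assumes a downloaded Common Voice language archive containing a validated.tsv file with path, sentence, age, gender, and accent columns, which matches recent releases at the time of writing; verify the file and column names against the version you download.

```python
import pandas as pd

# File path and column names are assumptions based on recent
# Common Voice releases; adjust them to match your download.
clips = pd.read_csv("cv-corpus/en/validated.tsv", sep="\t")

# What fraction of validated clips carries demographic metadata?
print(clips[["age", "gender", "accent"]].notna().mean())

# Example: keep only clips with a recorded accent for accent-aware training.
with_accent = clips.dropna(subset=["accent"])
print(with_accent[["path", "sentence", "accent"]].head())
```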

When developing a prototype or training a model during a hackathon, it’s great to have access to relevant data to make your solution more convincing. There are many public data sets available to get you started. I’ll go over some of the ways to find them and cover some access considerations. Note that some data sets might require pre-processing before they can be used, for example, to handle missing data, but for a hackathon they are often good enough.
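As an illustration of the kind of light pre-processing mentioned above, here is a small pandas sketch for inspecting and handling missing values. The file name and column names are made up for the example; the same pattern applies to most tabular data sets.

```python
import pandas as pd

# "readings.csv" and its columns are hypothetical, for illustration only.
df = pd.read_csv("readings.csv")

# See how much data is missing per column before deciding what to do.
print(df.isna().mean().sort_values(ascending=False))

# Drop rows that are missing the value you want to analyze or predict ...
df = df.dropna(subset=["rainfall_mm"])

# ... and fill gaps in supporting columns with simple defaults.
df["station_name"] = df["station_name"].fillna("unknown")
df["temperature_c"] = df["temperature_c"].fillna(df["temperature_c"].median())
```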


Ways to find data sets: Use Google Dataset Search

With the Dataset Search tool, you can locate data sets through keywords such as a country or city, or a category such as medical or agriculture. There are additional filters you can apply, such as how recently the data set was updated, the download format (for example, JSON or image), usage rights (commercial or non-commercial), and whether the data set is free. Dataset Search is a great tool for data sets where metadata (such as https://schema.org/ tags) has been supplied with the data set. However, some data sets do not yet have metadata in the form that Google Dataset Search uses, which is when you go to locations that host many data sets. Of course, some data sets can be found using both methods.

Ways to find data sets: Go to locations where there are many data sets

Many governments and institutions such as the United Nations and the World Bank provide data sets. Following are some examples:

Data set aggregator sites and miscellaneous catalogs

Some sites collate data sets sourced from other locations, including the data.gov sites, into categories. It’s worth taking a look at these sites, noting that some do charge for specialized access. However, these aggregator sites do give you an idea of what’s available. Examples of sites that aggregate collections of data sets or provide introductions to open data sets include:

License and privacy considerations

It is easier to use factual data sets, such as measurements, tabular data, land mass, reservoirs, and weather, and to avoid personal data, such as names and pictures of people, which might raise privacy concerns that vary from country to country.

Occasionally, you will find data sets that state that they are for academic use only. The owners are usually fine with the data set being used in a hackathon setting, but it is best to check. An example of such a data set is a multimodal (image and text) Deep Learning For Disaster Response data set (https://gitlab.com/awadailab/crisis_multimodal), which states that it is available for download only for academic purposes. In this case, we have confirmed with the author that she agrees to the data set being used in hackathons, particularly those for social good. You can take a similar approach. And please note that if you move on to selling the software you created in the hackathon, or make it part of a product, then you should not use data sets that are marked for academic use only.

Many data sets, where a license is specified, have a Creative Commons (CC) license. An example of such a data set is the EEW earthquake data. Be aware that the CC BY-NC variant means that the data set cannot be used for commercial purposes.

* In accordance with an expert de-identification certificate issued to Anthem.

Susan Malaika