
IBM Developer Blog

Data set sharing made easy by an IBM standardization effort


Sharing and using data sets has never been a smooth job. Data set distributors usually have to write extensive documentation about their data sets, and data consumers have to read through all of it. This documentation is error-prone, and a single inadvertent mistake can cause significant confusion. Data consumers, in turn, must spend a significant amount of time understanding the structure of the data archive and writing complicated code to extract and load its contents.

[Figure: Data consumer]

[Figure: Data distributor]

At IBM, we created the Python package ParData to resolve these issues in data set sharing. A data set distributor only needs to accompany the data set with a schema that describes it, and a data set consumer can start working with the data set in a few lines of code. Let’s take the JFK weather data set as an example. It contains historical weather data for John F. Kennedy International Airport in New York City.

DATE,HOURLYDewPointTempF,HOURLYRelativeHumidity,HOURLYDRYBULBTEMPF,HOURLYWETBULBTEMPF,HOURLYPrecip,HOURLYWindSpeed,HOURLYSeaLevelPressure,HOURLYStationPressure
2015-07-25T13:51:00Z,60,46,83,68,0.00,13,30.01,29.99
2016-11-18T23:51:00Z,34,48,53,44,0.00,6,30.05,30.03
2013-01-06T08:51:00Z,33,89,36,35,0.00,13,30.14,30.12
2011-01-27T16:51:00Z,18,48,36,30,0.00,14,29.82,29.8
...

Traditionally, before starting to work with this data set, a data consumer must download it, unarchive it with the tarfile module, determine the data type of each column, and load it into a pandas.DataFrame object. With ParData, the data set distributor only needs to create a schema file in YAML format. We have already created the schema for this data set, and ParData should include it by default.

name: NOAA Weather Data – JFK Airport
published: 2019-09-12
homepage: https://developer.ibm.com/exchanges/data/all/jfk-weather-data/
download_url: https://dax-cdn.cdn.appdomain.cloud/dax-noaa-weather-data-jfk-airport/1.1.4/noaa-weather-data-jfk-airport.tar.gz
sha512sum: e3f27a8fcc0db5289df356e3f48aef6df56236798d5b3ae3889d358489ec6609d2d797e4c4932b86016d2ce4a379ac0a0749b6fb2c293ebae4e585ea1c8422ac
license: CDLA-Sharing-1.0
estimated_size: 3.2M
description: "The NOAA JFK dataset contains 114,546 hourly observations of various local climatological variables (including visibility, temperature, wind speed and direction, humidity, dew point, and pressure). The data was collected by a NOAA weather station located at the John F. Kennedy International Airport in Queens, New York."
subdatasets:
  jfk_weather_cleaned:
    name: Cleaned JFK Weather Data
    description: Cleaned version of the JFK weather data.
    format:
      id: table/csv
      options:
        columns:
          DATE: 'datetime'
    path: noaa-weather-data-jfk-airport/jfk_weather_cleaned.csv
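Fields such as sha512sum let a loader confirm that a downloaded archive is the file the distributor published. How ParData performs this check internally is not shown in this post. The following is only a minimal sketch of the idea using Python's standard hashlib module, and the helper name sha512_matches is hypothetical.

```python
import hashlib


def sha512_matches(path: str, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at `path` has the SHA-512 digest `expected_hex`.

    Hypothetical helper for illustration; not part of ParData's API.
    """
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        # Read in chunks so a large archive does not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    # Compare case-insensitively, since hex digests may be written in either case.
    return digest.hexdigest() == expected_hex.lower()
```

A consumer could call this with the local path of the downloaded archive and the sha512sum value from the schema, and refuse to load the data if it returns False.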

After importing pardata, the data consumer needs only one line of code.

import pardata
noaa_jfk_df = pardata.load_dataset('noaa_jfk')

The CSV file is now loaded into a pandas.DataFrame object, with every column assigned the correct data type. For more information, check out the documentation and the tutorial.
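For comparison, the manual workflow described earlier could look roughly like the sketch below. It assumes the archive has already been downloaded, and it uses only the standard library (tarfile, csv, datetime) rather than pandas so that it stays dependency-free. The function name load_csv_from_archive is hypothetical, and the code assumes that every column other than DATE is numeric, as in the sample rows shown above.

```python
import csv
import io
import tarfile
from datetime import datetime


def load_csv_from_archive(archive_path, member_path):
    """Read one CSV file out of a .tar.gz archive and convert its column types.

    Hypothetical helper for illustration. Assumes a DATE column formatted
    like 2015-07-25T13:51:00Z and numeric values in every other column.
    """
    with tarfile.open(archive_path, "r:gz") as tar:
        member = tar.extractfile(member_path)
        if member is None:  # the member exists but is not a regular file
            raise FileNotFoundError(member_path)
        text = io.TextIOWrapper(member, encoding="utf-8", newline="")
        rows = []
        for row in csv.DictReader(text):
            parsed = {}
            for column, value in row.items():
                if column == "DATE":
                    # Timestamps end in a literal "Z", as in the sample rows.
                    parsed[column] = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
                else:
                    parsed[column] = float(value)
            rows.append(parsed)
        return rows
```

Even this simplified version has to hard-code the member path, the timestamp format, and the column types. That is the kind of per-data-set knowledge the schema file records once, so that consumers do not have to rediscover it.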