IBM Developer Blog

Access trusted, curated open source data sets


As more companies adopt artificial intelligence (AI), placing machine learning (ML) models into the hands of developers is imperative. Today at OSCON 2019, we announced the launch of the IBM Data Asset eXchange (DAX), an online hub for developers and data scientists to find carefully curated free and open data sets under open data licenses. Developers adopting ML models need open data that they can use confidently under clearly defined open data licenses.

Where possible, data sets posted on DAX will use the Linux Foundation’s Community Data License Agreement (CDLA) open data licensing framework to enable data sharing and collaboration. Furthermore, DAX provides unique access to various IBM and IBM Research data sets. IBM plans to publish new data sets on the Data Asset eXchange regularly. The data sets on DAX will integrate with IBM Cloud and AI services as appropriate.

Trusted source of open data sets

For developers, DAX provides a trusted source for carefully curated open data sets for AI. These data sets are ready for use in enterprise AI applications, with related content such as tutorials to make getting started easier.

For staff responsible for data set usage and vetting, DAX provides curation as well as standardized data set formats and metadata, in contrast with most other open data set resources, which tend to apply fewer quality and licensing checks. As a result, DAX data sets are typically more straightforward to adopt within corporations.

Example of data sets in use

Two examples of the sorts of data sets we’re releasing are the Finance Proposition Bank and Contracts Proposition Bank data sets. These data sets are part of an active research program from IBM Research. This research project aims to improve the natural language understanding technologies behind multiple IBM product offerings, including Watson Natural Language Understanding and Watson Compare & Comply.

Our researchers created these data sets with input from Watson developers, matching the characteristics of the target text to those of the real-world documents that the system analyzes in production. The researchers used these data sets to train domain-specific versions of the parsers that extract semantic meaning from governing business documents such as legal agreements and financial reports.

IBM Research has a long history of doing this kind of work in the open, and we on the CODAIT team are proud to help IBM Research’s mission of openness by releasing this cutting-edge research data on the Data Asset eXchange.

Why DAX?

While there are many resources available online for finding open data sets, ranging from collections of links on GitHub to sites such as Kaggle data sets, DAX is unique in its high level of quality and curation. DAX helps create end-to-end deep learning workflows (from using the data to train models to deploying models in standard ways), allowing developers to consume open data with confidence under clearly defined open data licenses.

Data you need to develop AI solutions

The CODAIT team’s goal is to make it straightforward to use DAX assets in conjunction with IBM AI products as well as other hybrid, multicloud AI tooling, both proprietary and open source. We want to give data scientists and developers well-curated data starting points, so that it’s easier for them to start developing their AI applications and solutions.