
As more companies adopt artificial intelligence (AI), placing machine learning (ML) models into the hands of developers is imperative. To that end, the Center for Open-Source Data & AI Technologies (CODAIT) launched IBM Model Asset eXchange (MAX) in 2018 to help data scientists and developers easily discover ready-to-use free and open source machine learning and deep learning models.

Today at OSCON 2019, we announced the launch of the IBM Data Asset eXchange (DAX), an online hub for developers and data scientists to find carefully curated free and open datasets under open data licenses. Developers adopting ML models need open data that they can use confidently under clearly defined open data licenses.

Where possible, datasets posted on DAX will use the Linux Foundation’s Community Data License Agreement (CDLA) open data licensing framework to enable data sharing and collaboration. Furthermore, DAX provides unique access to various IBM and IBM Research datasets. IBM plans to publish new datasets on the Data Asset eXchange regularly. The datasets on DAX will integrate with IBM Cloud and AI services as appropriate.

Trusted source of open datasets

For developers, DAX provides a trusted source for carefully curated open datasets for AI. These datasets are ready for use in enterprise AI applications, with related content such as tutorials to make getting started easier.

For staff responsible for dataset usage and vetting, DAX provides curation as well as standardized dataset formats and metadata. Most other open dataset resources, by contrast, tend to apply fewer checks on quality and licensing terms. As a result, DAX datasets are typically more straightforward to adopt within corporations.

Example of datasets in use

Examples of the sorts of datasets we’re releasing are the Finance Proposition Bank and Contracts Proposition Bank datasets. These datasets are part of an active research program from IBM Research. This research project aims to improve the natural language understanding technologies behind multiple IBM product offerings, including Watson Natural Language Understanding and Watson Compare & Comply.

Our researchers created these datasets with input from Watson developers, matching the characteristics of the target text to those of the real-world documents that the system analyzes in production. The researchers used these datasets to train domain-specific versions of the parsers that extract semantic meaning from governing business documents such as legal agreements and financial reports.

IBM Research has a long history of doing this kind of work in the open, and we on the CODAIT team are proud to support IBM Research’s mission of openness by releasing this cutting-edge research data on the Data Asset eXchange.

Why DAX?

While there are many resources available online for finding open datasets – ranging from collections of links on GitHub to sites such as Kaggle Datasets – DAX is unique in its high level of quality and curation. DAX helps create end-to-end deep learning workflows (from using the data to train models to deploying models in standard ways), allowing developers to consume open data with confidence under clearly defined open data licenses.

Data you need to develop AI solutions

IBM designed the Data Asset eXchange repository to complement the Model Asset eXchange. The user interface for organizing the assets is consistent across the two platforms, and users can easily train models on MAX using data from the Data Asset eXchange.

The CODAIT team’s goal is to make it straightforward to use DAX and MAX assets in conjunction with IBM AI products as well as other hybrid, multicloud AI tooling, both proprietary and open source. We want to give data scientists and developers well-curated data starting points, so that it’s easier for them to start developing their AI applications and solutions.