IBM Data Asset eXchange launches new data sets and exploratory Watson Studio notebooks
Access trusted, curated, open source data sets and notebooks
The IBM® Data Asset eXchange (DAX) is an online hub where developers and data scientists can find free and open data sets under open data licenses. A particular focus of the exchange is data sets under the Community Data License Agreement (CDLA). Since launching the exchange in 2019, the Center for Open-Source Data & AI Technologies (CODAIT) team has steadily added new data sets to the exchange, as well as resources that help users explore these data sets.
Our latest update to the Data Asset eXchange adds a host of new data-related assets and user experience enhancements. For existing data sets, we’ve added seven new Watson Studio notebooks as well as three Watson Studio projects (a new class of data assets that package multiple notebooks together). Along with these notebooks, we’ve also added eight new data sets to the exchange, covering domains such as oil extraction, remote sensing, and speech recognition. Finally, we are improving the way DAX displays data set previews. We have begun to add data glossaries and detailed metadata sections that give users extra context about a data set’s features and use cases. We have also started adding text, image, and audio data record previews, allowing users to sample a data set without having to download the entire data set archive.
New Watson Studio projects
We are now gradually adding Watson Studio projects to data sets. Each project includes a set of notebooks that illustrate how users can extract, clean, analyze, and model the data. To import a project into Watson Studio, visit one of the three data sets discussed below on DAX and click Run dataset notebooks, or preview the code by clicking the link in the Use the Dataset section. To learn more about Watson Studio projects, review this tutorial. A fourth project is already under development, and we will be adding more projects where relevant.
Data Asset eXchange Weather Project
Our first Watson Studio project release is the DAX Weather Project, which uses the NOAA Weather – JFK Airport data set. This project has three notebooks performing different functions. The first of these is a data cleansing notebook that walks the user through how to impute missing data values and encode certain weather features for better machine learning model performance. The project also includes a data analysis notebook that visualizes the data set’s feature dependencies and trends across time. Finally, the project includes a time series forecasting notebook to build an ARIMA model and evaluate its performance using the RMSE metric.
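To make the cleansing and evaluation steps concrete, here is a minimal sketch in plain Python. It is not the code from the project notebooks: the forward-fill imputation strategy, the sample temperature readings, and the function names are assumptions chosen for illustration, and a simple persistence forecast (predict each value as the previous one) stands in for the ARIMA model so the example stays dependency-free.

```python
import math


def forward_fill(values):
    """Replace None entries with the most recent observed value.

    Assumption: forward fill is only one possible imputation strategy;
    the project notebook may use a different one.
    """
    filled = []
    last = None
    for value in values:
        if value is None:
            value = last  # stays None if there is no earlier observation
        filled.append(value)
        last = value
    return filled


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up hourly temperature readings with two missing values.
temps = [50.0, None, 54.0, 53.0, None, 57.0]
clean = forward_fill(temps)

# Persistence baseline (a stand-in for ARIMA): forecast each hour
# as the value observed in the previous hour.
predicted = clean[:-1]
actual = clean[1:]
print("RMSE of persistence baseline:", rmse(actual, predicted))
```

A baseline like this is a useful reference point: a fitted time series model should achieve a lower RMSE than the persistence forecast on the same held-out data.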
Data Asset eXchange Fashion MNIST Project
The Fashion-MNIST Project builds off its namesake DAX data set by exploring potential uses for an image data set of clothing articles. This project starts with a data exploration notebook that visualizes various clothing article categories and performs dimension reduction. The second notebook uses the scikit-learn library to compare the performance of traditional machine learning methods to classify clothing labels. The third notebook also designs a classifier, but this time using Keras to build a deep learning convolutional neural network.
Data Asset eXchange Groningen Meaning Bank Project
Our third Watson Studio project, the Groningen Meaning Bank Project, uses the DAX GMB data set to explore named entities within text. The project’s first notebook acquaints the user with the different types of entity and part-of-speech tags found in the data set, and visualizes attributes of the corpus such as the most common tokens. The second notebook walks the user through how to build a simple named entity recognition model, complete with feature engineering and model result analysis sections.
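As an illustration of what feature engineering for named entity recognition can look like, the following sketch builds a feature dictionary for each token from its spelling, its part-of-speech tag, and its neighbors. The feature names, the `token_features` helper, and the sample sentence are hypothetical and are not taken from the project notebook.

```python
def token_features(tokens, pos_tags, i):
    """Build a feature dictionary for the token at position i.

    Hypothetical feature set: word shape, suffix, POS tag, and the
    immediately neighboring tokens.
    """
    word = tokens[i]
    features = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),  # capitalized words are often entities
        "word.isupper": word.isupper(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
        "pos": pos_tags[i],
    }
    if i > 0:
        features["prev.lower"] = tokens[i - 1].lower()
        features["prev.pos"] = pos_tags[i - 1]
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        features["next.lower"] = tokens[i + 1].lower()
        features["next.pos"] = pos_tags[i + 1]
    else:
        features["EOS"] = True  # end of sentence
    return features


# Illustrative sentence with Penn Treebank style POS tags.
tokens = ["Thousands", "marched", "through", "London"]
pos_tags = ["NNS", "VBD", "IN", "NNP"]
sentence_features = [
    token_features(tokens, pos_tags, i) for i in range(len(tokens))
]
print(sentence_features[3])
```

Dictionaries like these can be passed to a sequence labeling model or to a per-token classifier after vectorization. Which features actually help depends on the corpus.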
New exploratory data analysis notebooks
Along with the Watson Studio projects, we also released a new batch of exploratory notebooks that accompany our DAX data sets. These notebooks can be accessed by clicking Try the notebook on each data set’s DAX landing page.
Contracts and Finance Proposition Banks
The Contracts Proposition Bank and Finance Proposition Bank data sets, which consist of proposition bank-style annotations of legal and finance domain sentences, now have notebooks that load and visualize these annotations. Included, for example, are graphs that visualize feature distributions such as part of speech tags, as well as graphs that visualize the CoNLL node graph format the annotations are stored in.
IBM Project Debater Wikipedia Oriented Relatedness and Category Stance
Two more recently released notebooks, which support the IBM Project Debater® data sets Wikipedia Oriented Relatedness and Wikipedia Category Stance, explore text data extracted from Wikipedia. Both notebooks walk you through how to load the data into pandas DataFrames. The Wikipedia Oriented Relatedness notebook visualizes a sample of concept relatedness data (scoring the relatedness between two Wikipedia articles), while the Wikipedia Category Stance notebook visualizes a sample of category stance data (the pro or con stance of a Wikipedia article on a certain topic).
IBM Project Debater Claims Sentence, Mentions Detection, and Sentiment Lexicons
The final three notebooks added, Claim Sentences Search, Mention Detection Benchmark, and Sentiment Composition Lexicons, also feature data sets that originate from IBM Project Debater. The Claim Sentences Search notebook visualizes the data set’s collection of debate topics using topic modeling. The Mention Detection Benchmark notebook tokenizes the included text data and explores the various entity types present in the data. The Sentiment Composition Lexicons notebook counts and visualizes the sentiment of various bigram word pairs included in the data set.
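Counting the sentiment of bigrams, as the last notebook does, can be sketched with the standard library alone. The records below are made up for illustration, and the `(bigram, label)` layout is an assumption. The actual file format of the Sentiment Composition Lexicons data set may differ.

```python
from collections import Counter

# Hypothetical records: (bigram, sentiment label). Not taken from the data set.
records = [
    ("reduce poverty", "positive"),
    ("reduce safety", "negative"),
    ("improve health", "positive"),
    ("increase crime", "negative"),
    ("increase pollution", "negative"),
]


def sentiment_counts(records):
    """Count how many bigrams carry each sentiment label."""
    return Counter(label for _, label in records)


def counts_by_first_word(records):
    """Group sentiment counts by the first word of each bigram."""
    grouped = {}
    for bigram, label in records:
        first_word = bigram.split()[0]
        grouped.setdefault(first_word, Counter())[label] += 1
    return grouped


overall = sentiment_counts(records)
by_first = counts_by_first_word(records)

# Minimal text "visualization": one bar of '#' characters per label.
for label, count in sorted(overall.items()):
    print(f"{label:<10}{'#' * count} ({count})")
```

Grouping by the first word shows how the same word can contribute to either polarity depending on what it is paired with, as with "reduce" in the sample records.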
New data sets
TF Speech Commands
The newly added TensorFlow Speech Commands data set contains over 65,000 short audio clips of 30 common spoken English words. The data set is a good fit for training an audio classifier to detect speech commands such as “Yes” or “No”. The data set also contains audio files with background noise that can be mixed with the speech clips to diversify the training data.
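Mixing background noise into a speech clip can be sketched as follows. This is a simplified illustration, not code shipped with the data set: it assumes both signals are already decoded into lists of float samples in the range [-1.0, 1.0] at the same sample rate, and the `mix_with_noise` helper and its `noise_gain` parameter are hypothetical names.

```python
import random


def mix_with_noise(speech, noise, noise_gain=0.1, rng=random):
    """Overlay a randomly positioned noise segment on a speech clip.

    Assumes both inputs are sequences of float samples in [-1.0, 1.0]
    recorded at the same sample rate.
    """
    if len(noise) < len(speech):
        raise ValueError("noise clip must be at least as long as the speech clip")
    # Pick a random window of the noise recording matching the speech length.
    start = rng.randrange(len(noise) - len(speech) + 1)
    segment = noise[start:start + len(speech)]
    mixed = [s + noise_gain * n for s, n in zip(speech, segment)]
    # Clip to the valid sample range to avoid overflow when saving.
    return [max(-1.0, min(1.0, sample)) for sample in mixed]


# Tiny made-up signals standing in for decoded audio.
speech = [0.0, 0.5, -0.5, 0.9]
noise = [0.2] * 10
augmented = mix_with_noise(speech, noise, noise_gain=0.5, rng=random.Random(0))
print(augmented)
```

Varying the noise file, the window position, and the gain for each training example yields many distinct noisy variants of the same spoken word.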
WikiText-103
The WikiText-103 data set features over 100 million text tokens extracted from the set of verified ‘Good’ and ‘Featured’ articles on Wikipedia. It is available under the CC BY-SA 3.0 license and is a good fit for long-term dependency language modeling.
Oil Reservoir Simulations
The Oil Reservoir Simulations data set consists of 60,000 physics-based simulated oil reservoirs generated by IBM researchers. The features and production-rate labels of this data set are sequence-based, making this data set a good fit for testing and validating sequential algorithms. A detailed notebook is included with the data set and provides visual explanations of the simulations run to generate the sequential data. The data set also includes the input files to the physics-based simulator as well as raw and pre-processed versions of the data, making it a good fit for novice data scientists and advanced researchers alike.
Wikipedia Entity Graph
The Wikipedia Entity Graph data set developed by IBM Research consists of a knowledge graph of entities from Wikipedia where each entity is supplemented by a context document that represents all the contexts in which the entity appears on Wikipedia. This data set can be used for problems and techniques that perform joint modeling of graph structure and textual data.
Mono Lake Surface Water Extent Landsat8 Data
The Mono Lake Surface Water Extent Landsat8 data set contains Landsat8 satellite imagery data that was post-processed by IBM Research to measure the surface water extent of Mono Lake from 2013-04-18 to 2019-12-31. Surface water extent is important to studies of land use, water management, and ecosystem health. This data can be used to predict a time series of lake water extent and condition in order to monitor how a lake is changing over time.
Taranaki Basin Curated Well Logs
The Taranaki Basin Curated Well Logs data set consists of a curated set of 407 oil wells located along the western coast of New Zealand. The underlying data was obtained from the New Zealand Petroleum & Minerals Online Exploration Database and processed and cleaned by IBM Research to produce a simple CSV file containing well logs, well coordinates, and well geological features.
SimpleQuestions and WebQSP Relation Detection
The SimpleQuestions Relation Detection and WebQSP Relation Detection data sets are sets of entity relation annotations generated by IBM Research from underlying question-answering data sets. The relation detection task deals with generating semantic relationships between entities in a text. The SimpleQuestions Relation Detection data set was derived from the SimpleQA data set developed by Facebook Research, while the WebQSP Relation Detection data set was derived from the WebQSP data set created by Microsoft Research.