Preview new natural language processing data sets and Jupyter starter notebooks on the IBM Data Asset eXchange – IBM Developer

Take a look at what natural language processing notebooks and data sets have been released this year


Join the thousands of developers who have been using the new data sets and notebooks released on the IBM® Data Asset eXchange (DAX) this fall! DAX is an online center where engineers, researchers, and data scientists can find open and licensed data sets and analyze them using Jupyter Notebooks and other technologies. Since DAX began in 2019, the Center for Open Source Data and AI Technologies (CODAIT) group has been continuously adding new content.

Let’s take a look at what natural language processing notebooks and data sets have been released in 2020.

Natural language processing

Semantic role labeling

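The following two data sets, drawn from IBM financial reports and contracts, are useful training data for semantic role labeling (SRL). SRL is the task of identifying the predicates in a sentence and the roles their arguments play (roughly, who did what to whom), which gives a structured representation of the sentence's meaning.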
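To make the idea concrete, here is a minimal sketch of how one labeled sentence might be held in Python. The example sentence, the `Proposition` class, and the role labels (`A0`, `A1`, `AM-TMP`) are illustrative assumptions in the general proposition bank style. They are not the actual file format or label set of the DAX data sets.

```python
from dataclasses import dataclass, field


@dataclass
class Proposition:
    """One predicate and its labeled arguments (hypothetical structure)."""
    predicate: str
    arguments: dict = field(default_factory=dict)  # role label -> text span

    def describe(self) -> str:
        # Render the proposition as predicate(role='span', ...)
        roles = ", ".join(
            f"{role}={span!r}" for role, span in self.arguments.items()
        )
        return f"{self.predicate}({roles})"


# Made-up sentence: "The company increased revenue in 2019."
prop = Proposition(
    predicate="increase",
    arguments={
        "A0": "The company",  # assumed label for the agent (who acted)
        "A1": "revenue",      # assumed label for the thing affected
        "AM-TMP": "in 2019",  # assumed label for a temporal modifier
    },
)

print(prop.describe())
```

A model trained for SRL would be expected to produce this kind of predicate-argument structure from raw text.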

Finance Proposition Bank

The Finance Proposition Bank data set contains approximately 1,000 English sentences taken from IBM's public annual financial reports, annotated with a layer of ‘universal’ semantic role labels.

Jupyter Notebook on the finance proposition bank data set

Preview the data set: “Text from approximately 1000 English sentences obtained from IBM public annual financial reports, annotated with a layer of ‘universal’ semantic role labels.”

Preview the Notebook: “FinProp data set consists of proposition bank-style annotations of finance domain sentences extracted from former IBM annual financial reports. Each of the ~1,000 sentences are annotated with a layer of “universal” semantic role labels covering parts of speech, argument labeling, and predicate labeling.”

Contracts Proposition Bank

The Contracts Proposition Bank data set contains approximately 1,000 English compliance sentences taken from publicly available IBM contracts, annotated with a layer of ‘universal’ semantic role labels.

Data visualization tree diagram of part-of-speech relationships

Preview the data set: “This data set contains labeled sentences from IBM publicly available contracts. The sentences were extracted from contract sections such as Business Partner descriptions, Agreement Terms/Structure, Intellectual Property Protection, Limitation of Liability, Warranty Terms, General Principles of Relationship, Terms of Agreement Termination, Withdrawal of Service, Third-Party Claims, Charges, Service Level Agreement, and many more.”

Preview the Notebook: “This notebook explores the Contracts Proposition Bank data set. ConProp consists of proposition bank-style annotations of legal domain sentences extracted from former IBM annual financial reports. Each of the ~1,000 sentences are annotated with a layer of “universal” semantic role labels covering parts of speech, argument labeling, and predicate labeling.”

N-Gram model sequences

An n-gram is a contiguous sequence of n items, such as words or characters, from a given sample of text or speech. The following IBM Debater® data sets, which cover concept abstractness and sentiment composition, are well suited to n-gram models.
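As a quick illustration, word n-grams can be extracted with a sliding window. This is a minimal sketch using a made-up sentence and whitespace tokenization. The `ngrams` helper is a hypothetical name and is not part of the DAX notebooks.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up sample text, split on whitespace for simplicity
tokens = "the quick brown fox jumps over the quick dog".split()

# Count how often each bigram (n = 2) occurs
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(1))
```

Counts like these are the raw material for an n-gram model, which estimates how likely an item is given the items that precede it.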

IBM Debater Concept Abstractness

The IBM Debater Concept Abstractness data set is a set of concepts from Wikipedia rated for their degree of abstractness. Abstractness quantifies the degree to which an expression denotes an entity that cannot be directly perceived by human senses.

Concept abstractness line chart on the degree of abstractness

Preview the data set: “300K concepts from Wikipedia comprised of 1 – 3-worded phrases/words”

Preview the Notebook: “We introduce a weakly supervised approach for inferring the property of abstractness of words and expressions in the complete absence of labeled data. Exploiting only minimal linguistic clues and the contextual usage of a concept as manifested in textual data, we train sufficiently powerful classifiers, obtaining high correlation with human labels. The released data set contains 300K Wikipedia concepts automatically rated for their degree of abstractness.”

IBM Debater Sentiment Composition Lexicons

The IBM Debater Sentiment Composition Lexicons data set addresses sentiment composition: predicting the sentiment of a phrase from the interaction between its constituents.

Sentiment analysis bar chart for first letter of words

Preview the data set: “This data set can be used to learn sentiment compositions by predicting the sentiment of a phrase from the interaction between its constituents.”

Preview the Notebook: “This resource addresses sentiment composition, predicting the sentiment of a phrase from the interaction between its constituents. For example, in the phrases “reduced bureaucracy” and “fresh injury,” both “reduced” and “fresh” are followed by a negative word.”
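The two phrases in that example can be worked through with a small sketch. Everything here is an assumption made for illustration: the toy word polarities, the idea that "reduced" flips the polarity of the word it modifies while "fresh" passes it through, and the `phrase_polarity` function. The real lexicons may be organized differently.

```python
# Hypothetical toy lexicons; the real data set's entries and format may differ.
WORD_POLARITY = {"bureaucracy": -1, "injury": -1, "growth": 1}
REVERSERS = {"reduced"}    # assumed to flip the polarity of the next word
PROPAGATORS = {"fresh"}    # assumed to pass the next word's polarity through


def phrase_polarity(modifier, head):
    """Compose the polarity of a two-word phrase from its constituents.

    Returns 1 (positive), -1 (negative), 0 (neutral head word),
    or None when the modifier is not covered by the toy lexicons.
    """
    polarity = WORD_POLARITY.get(head, 0)
    if modifier in REVERSERS:
        return -polarity
    if modifier in PROPAGATORS:
        return polarity
    return None


print(phrase_polarity("reduced", "bureaucracy"))
print(phrase_polarity("fresh", "injury"))
```

Under these assumptions, the same negative second word yields opposite phrase polarities depending on the first word, which is the interaction that sentiment composition tries to capture.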

Parts-of-speech tagging

The Groningen Meaning Bank data set can be used for parts-of-speech tagging. This tagging is the process of marking each word with its grammatical category, such as verb, noun, or adjective, based on both its definition and its context. Tags can also encode finer distinctions, such as singular versus plural.
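One simple way to see what a tagger learns from annotated text is a most-frequent-tag baseline. The sketch below is a swapped-in teaching example, not the method used in the DAX notebook. The tagged tokens are made up, the Penn Treebank-style tags are an assumption, and `most_frequent_tag` is a hypothetical helper. Unlike a real tagger, this baseline ignores context and looks only at each word's own tag counts.

```python
from collections import Counter, defaultdict

# Made-up tagged tokens with Penn Treebank-style tags; not from the data set.
tagged = [
    ("the", "DT"), ("bank", "NN"), ("reports", "VBZ"), ("profits", "NNS"),
    ("the", "DT"), ("reports", "NNS"), ("show", "VBP"), ("growth", "NN"),
    ("annual", "JJ"), ("reports", "NNS"),
]

# Count how often each word appears with each tag.
tag_counts = defaultdict(Counter)
for word, tag in tagged:
    tag_counts[word][tag] += 1


def most_frequent_tag(word, default="NN"):
    """Baseline tagger: return the tag seen most often for this word."""
    if word not in tag_counts:
        return default  # unseen words fall back to a default tag
    return tag_counts[word].most_common(1)[0][0]


print([(w, most_frequent_tag(w)) for w in ["the", "reports", "merger"]])
```

The word "reports" appears here as both a verb and a plural noun, which is why taggers that take context into account tend to do better than this baseline.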

Groningen Meaning Bank – Modified

The Groningen Meaning Bank – Modified data set is a subset of the Groningen Meaning Bank data set, consisting of documents verified to be in the public domain.

Jupyter notebook on the data set

Preview the data set: “A data set of multisentence texts, together with annotations for parts of speech, named entities, lexical categories, and other natural language structural phenomena.”

Preview the Notebook: “The data set contains tags for parts of speech and named entities in a set of sentences predominantly from news articles and other factual documents.”

The CODAIT team releases more data sets and notebooks every month. Each data set is carefully curated and checked for quality and licensing terms, so it is ready to use for AI applications and analysis. Keep up with CODAIT on Twitter so you don’t miss our next drop of assets that you can use to advance your career and business goals.