At the US Open: Machine writing and discovery with Watson

Each year, the US Open reaches over 10 million fans around the world through its digital platforms. This year marks the first “fanless” US Open stadium experience, placing more importance on the digital experience than ever before.

For this year’s US Open, IBM is introducing two new technology solutions: Match Insights with IBM® Watson™ Discovery and Open Questions with Watson Discovery.

  • The Match Insights feature surfaces factoids about players across millions of articles and blogs, and generates text from historical statistical data.

  • The Open Questions feature uses the natural language processing capabilities of Watson Discovery as well as IBM Research technology to analyze millions of sources and identify pro and con arguments for ongoing tennis debates. It can even assess the quality of those arguments, then wrap it all up into a simple cohesive summary. Fans are able to share their opinions on these iconic tennis debates, and engage on topics they are most passionate about. News, facts, and crowd input contribute arguments to the generated narrations.

Technologies behind these features include IBM Project Debater, Watson products and services on IBM Cloud, natural language generation open source technology, and natural language generation technology from IBM Research and DeepQA (Jeopardy!) projects. Each of these groundbreaking projects produced novel technologies for a new and evolving industry, Machine Writing and Narration, which is showcased at the 2020 US Open.

Open Questions with Watson Discovery

The Open Questions with Watson Discovery feature for the US Open is designed to engage tennis fans in “debates.” A set of unchanging topics about some of the greatest players in the history of tennis will be presented to fans. Alongside each topic, IBM Research technologies will synthesize a point of view from Watson Discovery-connected data sources (which include articles, blogs, and other news sources), a News Archive custom collection, a US Open custom collection, Wikipedia, and input from tennis fans. Before the start of the tournament, a fact-based narration is generated in several steps from previously written articles and blogs stored in the News Archive custom collection (10 years of tennis-related sources), the US Open Archives custom collection, and Wikipedia.

First, the corpora are queried for relevant articles about a topic such as “Chris Evert is the best women’s player in tennis.” We use natural language processing technologies provided by Watson Discovery to refine data searches based on entities, keywords, and themes. Articles that are sourced from Watson Discovery-connected data sources (which include articles, blogs, and other news sources) or the News Archive custom collection are summarized using extractive, query-based summarization technologies. Each sentence in the summary, along with the sentence before and the sentence after it, becomes a candidate argument. The large number of articles can be considered evidence that supports a topic because only the most relevant text of each article is extracted and processed.
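As a rough illustration of how a summary sentence and its neighbors could become one candidate argument, here is a minimal Python sketch. The function name and the assumption that articles are already split into sentences are hypothetical; the article does not describe the actual implementation.

```python
from typing import List


def candidate_arguments(article_sentences: List[str],
                        summary_sentences: List[str]) -> List[str]:
    """Expand each summary sentence with the sentence before and after it.

    Assumes (hypothetically) that the extractive summary reuses sentences
    verbatim from the article, so they can be located by exact match.
    """
    selected = set(summary_sentences)
    candidates = []
    for i, sentence in enumerate(article_sentences):
        if sentence in selected:
            start = max(0, i - 1)                        # previous sentence, if any
            end = min(len(article_sentences), i + 2)     # next sentence, if any
            candidates.append(" ".join(article_sentences[start:end]))
    return candidates
```

Sentences at the very start or end of an article simply get a shorter window in this sketch.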

Next, each of the Wikipedia results from the IBM Research Index Searcher API, supported by natural language processing technology from IBM Project Debater, is added to the list of candidate arguments. Filters are applied to the candidate arguments to enhance precision towards the topic. The remaining candidate arguments are sent to the IBM Research Argument Quality API. The quality assessment ensures that the detected claims are relevant to the provided topic. The highest-scoring 75% of the candidate arguments are retained for narration generation. At this point, the candidate arguments become approved arguments.
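The retention step can be sketched as a simple sort and cut. This is only an illustration: the scores would come from the Argument Quality API, which is not called here, and the choice to round the cutoff up is an assumption.

```python
import math
from typing import List, Tuple


def retain_top_fraction(scored: List[Tuple[str, float]],
                        fraction: float = 0.75) -> List[Tuple[str, float]]:
    """Keep the highest-scoring share of (argument, quality_score) pairs.

    Rounding up with math.ceil is an assumption; the article only states
    that the top 75% are retained.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    keep = math.ceil(len(ranked) * fraction)
    return ranked[:keep]
```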

The arguments are classified into pros and cons using a trained IBM Research Pro/Con API running on an IBM private cloud. The scores are boosted to create additional spread between the pros and cons. A reference to each pro and con source is maintained outside of the narrative generation client. Now, only the pro arguments, dominant concept, topic, pro scores, polarity, and custom filters are sent to the IBM Research Narrative Generation API. An automated narration organized around extracted themes is created and saved into a Cloudant® database.

During the US Open, fans will add their opinions about a topic, and media might write late-breaking articles about one or more of the topics. Both the new articles surfaced through Watson Discovery and the opinions from fans will be added to the corpora of data. Each day at 5 p.m. EDT, new narrations around each topic will be produced from arguments generated from the updated corpora (Watson Discovery-connected data sources, which include articles, blogs and other news sources, and fan opinions) and the base corpora (News Archive custom collection, US Open custom collection, and Wikipedia).

Figure 1 shows the system architecture for Open Questions.

Figure 1. Open Questions architecture

Match Insights with Watson Discovery

Throughout the tournament, each player’s tennis persona is summarized by insights that are written from natural language and statistics. Statistical data about a player’s performance is used to generate fluid facts. In addition, single sentence insights about each player are extracted from a corpus of millions of articles. Figure 2 shows the architecture of Match Insights.

Figure 2. Match Insights architecture

Match Insights natural language generation

Across the hundreds of points in a tennis match, players make both winning shots and unforced errors in various situations. This results in dozens of distinct measures for tennis analysts to consider as they explain on-court action. These statistics are particularly useful when previewing an upcoming matchup, as they can indicate the relative strengths and tendencies of each player. Does this player hit many winners from her forehand? Does the player often approach the net? Statistics can answer these questions and many more, giving fans insight into the forthcoming match. A skilled analyst can uncover the areas in which the players stand out from the field and differ from their opponents. Match Insights with Watson Discovery presents this comparative analysis in natural language, serving statisticians and casual fans alike.

IBM maintains databases that store these statistics and other relevant information using the Db2® on Cloud service. In their raw form, these stats are still difficult to interpret. Comparisons are difficult to make because matches can differ in length, from under 1 hour to over 4 hours. To normalize for this variability, IBM calculates per-point frequencies. Each frequency is then converted to a rank value with respect to that statistic among the entire tournament field of 128 competitors. To further ease comparisons, these values are also expressed in percentile terms and persisted into Db2 on Cloud in this format.
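The normalization described above can be sketched in a few lines of Python. The function names and the percentile definition (share of the rest of the field with a strictly lower frequency) are assumptions for illustration; the article does not give the exact formula or the Db2 schema.

```python
from typing import Dict


def per_point_frequency(count: int, points_played: int) -> float:
    """Normalize a raw count by the number of points the player has played."""
    return count / points_played if points_played else 0.0


def percentiles(frequencies: Dict[str, float]) -> Dict[str, float]:
    """Map each player's frequency to a 0-100 percentile within the field.

    Assumed definition: the share of the other players whose value is
    strictly lower. Tied players therefore receive the same percentile.
    """
    n = len(frequencies)
    values = list(frequencies.values())
    result = {}
    for player, value in frequencies.items():
        below = sum(1 for v in values if v < value)
        result[player] = 100.0 * below / (n - 1) if n > 1 else 100.0
    return result
```

With a full draw, `frequencies` would hold one entry per competitor for a single statistic, and the resulting percentiles are what would be persisted.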

The most extreme values in percentile terms are the items that will be most interesting to the tennis audience. Additionally, Match Insights draws contrasts by highlighting the stats with the largest percentile differences between the two players in the matchup. After these key stats are selected, the system converts the stats to natural language. To do this, the system must understand the various components of a statistical highlight. These components include the subject phrase, verb phrase, and contextual phrase. As humans generate natural language using various word choices and syntactical ordering, the AI system also varies these elements to produce human-like language. The output structures and diction are then selected according to probability. At this level of variety, the natural language generation system, which is powered by open source natural language generation and IBM Research technologies, can produce hundreds of unique texts for each match’s selected stats. Additional processing then confirms grammatical correctness such as pronoun, article, and verb agreement.
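A minimal sketch of the selection and phrasing steps follows. Everything here is hypothetical: the phrase lists, ordering weights, and function names are invented for illustration and do not reflect the open source or IBM Research components that the system uses.

```python
import random
from typing import Dict, List

# Hypothetical phrase inventories and ordering probabilities.
CONTEXT_PHRASES = [
    "more often than {pct:.0f}% of the field",
    "at a higher rate than {pct:.0f}% of the draw",
]
ORDERINGS = [
    ("{subject} {verb} {context}.", 0.7),
    ("{context}, {subject} {verb}.", 0.3),
]


def select_key_stats(p1: Dict[str, float], p2: Dict[str, float],
                     top_n: int = 2) -> List[str]:
    """Pick the stats with the largest percentile gap between two players."""
    shared = p1.keys() & p2.keys()
    return sorted(shared, key=lambda s: (-abs(p1[s] - p2[s]), s))[:top_n]


def render_insight(subject: str, verb_options: List[str],
                   percentile: float, rng: random.Random) -> str:
    """Combine subject, verb, and contextual phrases in a sampled order."""
    verb = rng.choice(verb_options)
    context = rng.choice(CONTEXT_PHRASES).format(pct=percentile)
    patterns = [pattern for pattern, _ in ORDERINGS]
    weights = [weight for _, weight in ORDERINGS]
    pattern = rng.choices(patterns, weights=weights, k=1)[0]
    sentence = pattern.format(subject=subject, verb=verb, context=context)
    return sentence[0].upper() + sentence[1:]  # capitalize the first letter
```

For example, `render_insight("Player A", ["hits forehand winners"], 92.0, random.Random())` yields one of several differently worded sentences about the same statistic. A grammar check such as the agreement pass described above would still be needed afterwards.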

The final task of the Literature Generator web service is to persist the texts and corresponding metadata to a Cloudant NoSQL database on IBM Cloud, which feeds the human review UI. After Match Insights receives approval, a serverless function on IBM Cloud Functions merges the natural language generation output with factoids and writes the joined JSON document to IBM Cloud Object Storage. IBM Cloud Object Storage serves as the back-end data source for the user-facing Match Insights with Watson Discovery feature on usopen.org.

Match Insights with Watson Discovery uses natural language summarization

When watching a tennis match, a spectator might know everything there is to know about a player, from their journey to professional status to how they developed their signature backhand shot. Alternatively, they might know little beyond the surface and might very well be seeing this person play for the first time. Millions of unstructured records such as news articles, Wikipedia pages, blog posts, and forum discussions have been authored on the topics of tennis and players of varying prominence. The type of data is diverse, ranging from biographical and statistical to commentary and opinion pieces. The challenge of effectively using these rich sources of information lies in identifying content that is relevant to a specific player, summarizing the parts of the text that are relevant to the player, and measuring the insightfulness of the resulting summarization.

First, the unstructured information gathered from Watson Discovery, the US Open Archives custom collection, and Wikipedia is queried. Querying hundreds of tennis players against the corpora yields millions of possible insights. The content must be summarized to efficiently uncover the relevant and insightful content. To begin, a set of query and filter strings for Discovery-sourced content is constructed. The query specifies the name of the player, while the filter restricts the category of results to tennis and uses the enriched natural language processing metadata to narrow the results to domain-relevant entities and concepts. However, some players have high-volume data coverage while newer players have sparse information. To mitigate this data imbalance, the querying stage of the data gathering has a two-stage fallback whereby the filter parameters are relaxed to broaden the search.
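One way to picture the fallback is as an ordered list of progressively looser filters. In this sketch the `search` callable, the filter strings, and the `min_results` threshold are all placeholders; the strings are not real Watson Discovery filter syntax, and the article does not say what triggers a fallback.

```python
from typing import Callable, List, Optional

# Placeholder filters, strictest first. NOT actual Discovery query syntax.
FILTER_TIERS: List[Optional[str]] = [
    "category=tennis AND entities=domain-relevant",  # initial, strict filter
    "category=tennis",                               # first fallback
    None,                                            # second fallback: no filter
]

SearchFn = Callable[[str, Optional[str]], List[dict]]


def query_with_fallback(player: str, search: SearchFn,
                        min_results: int = 5) -> List[dict]:
    """Relax the filter step by step until enough results come back."""
    results: List[dict] = []
    for filter_string in FILTER_TIERS:
        results = search(player, filter_string)
        if len(results) >= min_results:
            break
    return results
```

A well-covered player would be satisfied by the strict filter, while a newcomer to the tour would fall through to the broader searches.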

After an article is retrieved, a hash string based on the article’s content is produced and checked against previously seen hashes. If a hash match is found, the article is discarded to avoid duplication. If the article is unique, it is stored along with its hash in a Redis memory store.
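The deduplication check can be sketched as follows. An in-memory dictionary stands in for the Redis store so that the example is self-contained, and both the SHA-256 hash and the whitespace and case normalization are assumptions; the article does not name the hash function.

```python
import hashlib
from typing import Dict


class Deduplicator:
    """Content-hash duplicate filter (a dict stands in for Redis here)."""

    def __init__(self) -> None:
        self._seen: Dict[str, str] = {}  # hash -> stored content

    @staticmethod
    def content_hash(text: str) -> str:
        # Normalizing case and whitespace is an assumption, not from the source.
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def add_if_new(self, text: str) -> bool:
        """Store the text and return True if unseen; return False otherwise."""
        digest = self.content_hash(text)
        if digest in self._seen:
            return False
        self._seen[digest] = text
        return True
```

The same class could be reused for the sentence-level check, because each summary sentence is hashed by the same process as the source articles.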

Next, the data is normalized to match the schema and passed to a summarization feature created by IBM Research and available as part of Watson Discovery, which summarizes the article to roughly a quarter of its original length. Each “summary unit,” a single sentence, is hashed and checked for duplication by the same process as the source articles. After they have been verified as unique, the summaries are passed to the quality service to measure the insightfulness of each sentence. A sentence structure score uses an English parse tree along with a decision tree to determine whether the sentence has good syntactic form. A second score is generated using the Argument Quality API from the IBM Research program, which ensures that the sentence is relevant to the topics. The sentence structure and topic relevance scores are averaged to provide an overall quality assessment. The discovered insight and the quality score are stored in the Cloudant database for the front end to consume through a set of design documents that enable fast document indexing and retrieval.
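The final scoring step is a plain average of the two scores. The sketch below assumes both scores are already scaled to the range 0 to 1, which the article does not state.

```python
def overall_quality(structure_score: float, relevance_score: float) -> float:
    """Average the sentence structure and topic relevance scores.

    Assumes (hypothetically) that both inputs lie in [0, 1].
    """
    for score in (structure_score, relevance_score):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores are assumed to lie in [0, 1]")
    return (structure_score + relevance_score) / 2
```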

The information is handled throughout the system by a series of Redis-based queues that rate-limit outbound requests and retry any external server-side failures. Four queues determine the processing cadence with Watson Discovery, the summarization service, the quality service, and storage. Data is passed to each of these queues by a master message broker and three auxiliary brokers that handle interval hash verification, requeuing for query expansion, and event stream handling. This centralized data handling structure is a core requirement for the system as it processes gigabytes of raw unstructured information.
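To make the rate-limit and retry behavior concrete, here is a small single-process sketch. It uses an in-memory deque instead of Redis, treats `ConnectionError` as the stand-in for a server-side failure, and uses a fixed delay between requests; all of these are assumptions, as are the class and parameter names.

```python
import time
from collections import deque
from typing import Any, Callable, Deque, List, Tuple


class RateLimitedQueue:
    """Process items one at a time, pausing between calls and retrying failures."""

    def __init__(self, handler: Callable[[Any], Any], min_interval: float = 0.5,
                 max_attempts: int = 3,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._handler = handler
        self._min_interval = min_interval
        self._max_attempts = max_attempts
        self._sleep = sleep
        self._items: Deque[Tuple[Any, int]] = deque()  # (item, attempts so far)
        self.failed: List[Any] = []

    def put(self, item: Any) -> None:
        self._items.append((item, 0))

    def drain(self) -> List[Any]:
        """Run the handler over all queued items and return the successes."""
        results = []
        while self._items:
            item, attempts = self._items.popleft()
            try:
                results.append(self._handler(item))
            except ConnectionError:
                if attempts + 1 < self._max_attempts:
                    self._items.append((item, attempts + 1))  # requeue for retry
                else:
                    self.failed.append(item)                  # give up
            self._sleep(self._min_interval)  # fixed pause limits the request rate
        return results
```

In the described system there would be one such queue per downstream dependency (Watson Discovery, summarization, quality, and storage), each with its own cadence.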

US Open virtual fan engagement

During this unprecedented time, a new and engaging virtual experience will increase your presence at the US Open. Now, all you need to do is sit back, relax, and enjoy the insights alongside the tennis action. If you are daring, contribute to the discussion about some of the greatest tennis players of all time.