Behind the code: Connecting GRAMMY Artists with IBM Watson Discovery
Bring technology and pop culture together in a single, highly engaging experience. Surface hidden connections between GRAMMY-nominated artists over the years.
The GRAMMYs and IBM had one goal: to bring technology and pop culture together in a single, highly engaging experience. Working together, the Recording Academy and IBM decided to use AI to surface hidden connections between GRAMMY-nominated artists over the years. The result is GRAMMYconnect.
Here’s how it works. Most information about musicians is hidden in “dark” data, a vast universe of articles, biographies, and content across various sources that primarily contain natural language text. Identifying, reading, and understanding this content is a huge challenge. So we turned to Watson to help solve it.
Mining and analyzing unstructured data
Our biggest source of data was the Watson News database of 14 million articles, available through the Watson Discovery Service. We also mined the artist pages on GRAMMY.com as well as other publicly available data sources, like Muzooka. We used Watson Discovery to quickly ingest the unstructured data from these sources. Watson Discovery uses natural language processing to read and enrich each news article and piece of content with metadata, identifying artists, their attributes, and the primary connections between them.
GRAMMYconnect uses Watson Discovery to find uncommon connections between artists
Entity recognition is a powerful part of the Watson Discovery analysis. With it, we were able to identify and categorize entities as “Artist” or “Band,” and dig through our ingested document library to filter down to entity names that matched these criteria. This became the basis of our artist database that is built on IBM Cloudant. We then filtered this enormous data set of entities based on common mentions in the content, identified 50,000 artists and bands, and grouped the data points according to their respective associations.
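To make the filtering step concrete, here is a minimal Python sketch of the idea, not the code used for GRAMMYconnect. It assumes each enriched document carries a list of entities with "text" and "type" fields; that document shape, the collect_artists name, and the min_mentions threshold are all hypothetical choices for illustration.

```python
from collections import Counter

# Type labels taken from the description above; the exact labels in the
# enriched metadata are an assumption.
ARTIST_TYPES = {"Artist", "Band"}

def collect_artists(documents, min_mentions=2):
    """Count the documents that mention each Artist/Band entity and keep the frequent ones."""
    mentions = Counter()
    for doc in documents:
        # Use a set so an entity is counted at most once per document.
        names = {
            entity["text"]
            for entity in doc.get("entities", [])
            if entity.get("type") in ARTIST_TYPES
        }
        mentions.update(names)
    return {name: count for name, count in mentions.items() if count >= min_mentions}

# Hypothetical enriched documents.
docs = [
    {"entities": [{"text": "Artist A", "type": "Artist"},
                  {"text": "Chicago", "type": "Location"}]},
    {"entities": [{"text": "Artist A", "type": "Artist"},
                  {"text": "Band B", "type": "Band"}]},
]
artists = collect_artists(docs)
```

With the sample data, only "Artist A" clears the default threshold of two mentions; lowering min_mentions to 1 also keeps "Band B".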
We then used Knowledge Graph, a beta feature of Watson Discovery that provides the ability to “query by relationship,” allowing us to target and identify sentences where two artists were mentioned as having a specific type of relationship, like performing the same song or having a common influence. Watson Discovery Knowledge Graph gave us a quick and easy way to populate the GRAMMYconnect experience with tens of thousands of interesting data points that would inform our unique artist-to-artist relationships.
Ranking the connections
We wanted to focus the GRAMMYconnect experience on hidden connections, associations that would surprise the average music fan. To do this, we had to drill down to the connections that were rare or at least unique enough to be surprising.
We started by getting counts of basic facts, like the awards an artist had won or the albums they produced. We assessed how many artists had that fact in common, which helped us understand how rare it was. Ordinarily, it would be time-consuming to use traditional SQL stored procedures for this, but that’s exactly the purpose for which the IBM Cloudant map-reduce views are built. We developed a system of views that would emit a single count for each unique fact we stored, per Artist entity. Next, the built-in Sum-Reduce function told us the total number of entities in our database that had that fact attached. This was essential in building a ranking algorithm that was both dynamic and performant in calculating matches and ranking those connections across 50,000 artists and millions of facts in a reasonable amount of time.
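The counting pattern described above can be imitated in memory with a few lines of Python. This is only a sketch of the map and sum-reduce idea: Cloudant views themselves are written as JavaScript map functions with a built-in reducer such as _sum, and the artist document shape and fact strings below are hypothetical.

```python
from collections import defaultdict

def map_artist(doc):
    """Map step: emit (fact, 1) once for each distinct fact on an artist document."""
    for fact in set(doc["facts"]):
        yield fact, 1

def sum_reduce(emitted):
    """Reduce step: add up the emitted values per key, like a sum reducer."""
    totals = defaultdict(int)
    for key, value in emitted:
        totals[key] += value
    return dict(totals)

# Hypothetical artist documents; facts are encoded as "type:value" strings.
artist_docs = [
    {"_id": "artist-a", "facts": ["award:best-new-artist", "born:1980"]},
    {"_id": "artist-b", "facts": ["born:1980"]},
    {"_id": "artist-c", "facts": ["born:1980", "award:best-new-artist"]},
]

emitted = (pair for doc in artist_docs for pair in map_artist(doc))
fact_counts = sum_reduce(emitted)
```

Each value in fact_counts is the number of artists that share that fact, which is the rarity signal the ranking step needs.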
The basic layer of our ranking algorithm aims to undervalue the most common fact types. Simple biographical facts that occur frequently are the least interesting. So by weighting each fact with the inverse of its frequency, we automatically deprioritize the most common facts across all artists. The second layer of the algorithm down-weights the most obviously connected artists, like bandmates or an artist and their producer, by prioritizing artists with the fewest facts in common.
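One way to picture the two layers is the short Python sketch below. The article does not give the exact formula, so the scoring here is an assumption: each shared fact contributes the inverse of its frequency, and the total is divided by the number of shared facts so that heavily overlapping pairs are down-weighted. The names connection_score, fact_counts, and the sample facts are hypothetical.

```python
def connection_score(facts_a, facts_b, fact_counts):
    """Score a pair of artists by the rarity of the facts they share."""
    shared = facts_a & facts_b
    if not shared:
        return 0.0
    # Layer 1: weight each shared fact by the inverse of its frequency.
    rarity = sum(1.0 / fact_counts[fact] for fact in shared)
    # Layer 2 (assumed form): divide by the number of shared facts so that
    # pairs with many facts in common, such as bandmates, rank lower.
    return rarity / len(shared)

# Hypothetical frequencies: how many artists in the database have each fact.
fact_counts = {"born:1980": 400, "award:x": 2, "member:band-b": 4, "label:y": 50}

facts = {
    "A": {"award:x", "born:1980"},
    "B": {"award:x"},
    "C": {"born:1980", "member:band-b", "label:y"},
    "D": {"born:1980", "member:band-b", "label:y"},
}

rare_pair = connection_score(facts["A"], facts["B"], fact_counts)      # one rare shared fact
bandmates = connection_score(facts["C"], facts["D"], fact_counts)      # many shared facts
common_only = connection_score(facts["A"], facts["C"], fact_counts)    # one very common fact
```

In this toy data the pair that shares a single rare award outranks the bandmates, and both outrank the pair that shares only a birth year.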
Next, we wanted to make sure there was evidence behind each of the relationships that were presented. But after we started getting results, it became clear that not all evidence text was created equal. Sometimes we would get a well-written sentence that clearly talked about the two artists. But other times, we would have a basic list that didn’t include much information.
To prioritize the more interesting tidbits, we integrated the part-of-speech tagger from the NLTK Python library and developed a custom algorithm that evaluates sentences based on part-of-speech frequency and patterns. This let us automatically prioritize the most interesting sentences and exclude ones that were simple lists.
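The custom algorithm is not published, so the following Python sketch shows only the general idea under stated assumptions. It takes sentences that have already been tagged with Penn Treebank tags, the format NLTK's pos_tag returns, and uses a made-up heuristic: reward verbs and penalize proper nouns, commas, and conjunctions, which dominate bare lists. The example tags are hand-assigned so the sketch runs without downloading tagger models.

```python
# Tags that tend to dominate a bare list of names (an assumed heuristic).
LIST_TAGS = {"NNP", "NNPS", ",", "CC"}

def interest_score(tagged_sentence):
    """Score a POS-tagged sentence: share of verbs minus share of list-like tokens."""
    tags = [tag for _, tag in tagged_sentence]
    if not tags:
        return 0.0
    verbs = sum(tag.startswith("VB") for tag in tags)
    list_like = sum(tag in LIST_TAGS for tag in tags)
    return (verbs - list_like) / len(tags)

# Hand-tagged examples with hypothetical artist names.
prose = [("Alice", "NNP"), ("covered", "VBD"), ("a", "DT"), ("song", "NN"),
         ("written", "VBN"), ("by", "IN"), ("Bob", "NNP"), (".", ".")]
listing = [("Alice", "NNP"), (",", ","), ("Bob", "NNP"), (",", ","),
           ("Carol", "NNP"), ("and", "CC"), ("Dave", "NNP")]

prose_score = interest_score(prose)
listing_score = interest_score(listing)
```

Sorting candidate evidence sentences by a score like this pushes descriptive sentences to the top and bare name lists to the bottom.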
Creating the front-end experience
To make this experience highly engaging for our audience, we spent a lot of time on the front-end experience, from design to development. It was important to reduce load times on an experience serving up hundreds of thousands of data points in such a dynamic fashion. The data informing our connections would change very little, and the calculations for finding artist connections would not be performed live, so IBM Cloud Object Storage was the ideal service for handling extremely large loads and serving cached JSON data.
We adapted our ranking algorithm to run in Watson Studio, in a high-performance, multiprocessing-aware environment. There it calculated over 20 million connections among more than 50,000 artists, start to finish, in less than 30 minutes. Results were cached to Object Storage, where they could be made available to the front-end web application through standard HTTP requests.
Of course, the connection data is not the only content available in the GRAMMYconnect experience. We also included artist search, a way to track user preferences, and a way to manage which connections are trending based on user interest.
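As a rough illustration of this batch step, the Python sketch below spreads pairwise scoring across worker processes and then groups the results into one JSON document per artist, ready to be stored as static objects. It is a sketch under assumptions, not the Watson Studio job itself: score_pair is a placeholder that only counts shared facts, and the function names, output layout, and worker settings are hypothetical. Uploading to Object Storage is left out.

```python
import json
from itertools import combinations
from multiprocessing import Pool

def score_pair(pair):
    """Placeholder scorer: number of shared facts. A real run would apply the ranking algorithm."""
    (name_a, facts_a), (name_b, facts_b) = pair
    return name_a, name_b, len(facts_a & facts_b)

def score_all(artist_facts, workers=4, chunksize=1000):
    """Score every pair of artists in parallel and keep the pairs that share something."""
    items = sorted(artist_facts.items(), key=lambda kv: kv[0])
    pairs = combinations(items, 2)
    with Pool(workers) as pool:
        results = pool.imap_unordered(score_pair, pairs, chunksize)
        return [r for r in results if r[2] > 0]

def to_cache_documents(results):
    """Group connections per artist and serialize each group as a JSON string."""
    per_artist = {}
    for name_a, name_b, score in results:
        per_artist.setdefault(name_a, []).append({"artist": name_b, "score": score})
        per_artist.setdefault(name_b, []).append({"artist": name_a, "score": score})
    return {
        name: json.dumps(sorted(conns, key=lambda c: -c["score"]))
        for name, conns in per_artist.items()
    }
```

When run as a script, score_all should be called from inside an `if __name__ == "__main__":` guard so that worker processes can import the module safely. The strings returned by to_cache_documents can then be written out one object per artist and fetched by the front end with plain HTTP requests.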
Finally, we turned to the robust IBM Cloud Functions service to develop serverless functions that could run on-demand, at scale, to accomplish these tasks.
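For readers unfamiliar with serverless functions, here is a minimal Python sketch of what a trending task could look like. It follows the convention used by OpenWhisk-style Python actions, where a main function receives a dictionary of parameters and returns a dictionary. The "events", "connection_id", and "limit" parameter names are hypothetical, and the real GRAMMYconnect functions are not shown in this article.

```python
from collections import Counter

def main(params):
    """Return the most-viewed connections from a batch of view events."""
    events = params.get("events", [])
    limit = int(params.get("limit", 10))
    # Count views per connection, ignoring malformed events.
    counts = Counter(e["connection_id"] for e in events if "connection_id" in e)
    trending = [
        {"connection_id": connection_id, "views": views}
        for connection_id, views in counts.most_common(limit)
    ]
    return {"trending": trending}

# Local usage example with made-up events.
sample = {
    "events": [
        {"connection_id": "artist-a:artist-b"},
        {"connection_id": "artist-a:artist-b"},
        {"connection_id": "artist-c:artist-d"},
    ],
    "limit": 1,
}
result = main(sample)
```

Because the function keeps no state of its own, the platform can run many copies on demand; persistent counts would live in a database such as the Cloudant store mentioned earlier.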
Complete GRAMMYconnect solution architecture
Beyond the GRAMMYs
The GRAMMYconnect solution brings together a unique combination of IBM Cloud and Watson services to give fans an engaging and appealing experience, driving interest in both the artists and the GRAMMYs. The capabilities used in the solution – including deep understanding of natural language from a huge set of content to identify entities and relationships, performing complex scoring and ranking at scale, and managing interactive, responsive interfaces – are widely applicable across use cases beyond the music industry. Building connections can help uncover cyber threats, proactively identify support issues, or guide maintenance and manufacturing.
To find connections between your favorite artists, check out GRAMMYconnect yourself!