GRAMMY Debates With Watson: From the lab to music’s greatest stage

The journey of adapting Project Debater from IBM Research to the GRAMMYs created a resilient and componentized system that can summarize opinions on music topics into fluent narrations. Come contribute!

The Project Debater workflow combines artificial intelligence and natural language processing (NLP) techniques to summarize music fans’ point of view around the most important music conversations. The first phase of the workflow is collecting arguments made by fans. These arguments are collected from two sources: through a dedicated web page on and by mining arguments from Twitter. In the second phase, we run the Speech by Crowd natural language processing pipeline to generate a summary of all the arguments collected, highlighting the key points made by the participants and generating summary narratives for each side of the debate. Let’s walk through the process.

The music debates

Over a two-week period leading up to the 63rd annual GRAMMY awards show, we are presenting four debatable topics to the fans to provide their own unique insight. The first asks about who you think is or was the most groundbreaking artist of all time. As expected, we are receiving a variety of responses. Next, we wanted to know what music fans and experts thought about mandatory music education. The topic that states, “Music education should be mandatory in all K-12 schools” is resulting in great arguments for and against the notion. Most of the arguments supported the opinion that music education should be mandatory in schools. However, many did not agree because they thought only the most well-funded schools can afford music education, thus increasing the socioeconomic divide.

Another topic that we wanted to know your opinions about was comparing virtual concerts to live shows. This was the most evenly split topic between support and disagreement. Perhaps people are becoming acclimated to technology and enjoy watching shows in their own comfort. The crowd thought that the demand for a hybrid of virtual and physical shows will continue to emerge.

Finally, we wanted you to weigh in on who you think is the biggest style icon in music. This is your chance to tell us how music and fashion are related.

Music debates
Figure 1. The great music debates

Now let’s get into the architecture of the system.

GRAMMY Debates with Watson system architecture

Figure 2 shows the overall architecture of the GRAMMY Debates with Watson system. The system runs on a hybrid cloud that consists of IBM Cloud and an IBM Cloud private cloud. Throughout the cloud components, the system is packaged into images that can be run anywhere on top of Red Hat OpenShift that manages Kubernetes clusters. Most of the OpenShift clusters run on IBM Cloud. There are two OpenShift clusters with nine workers each. Each worker has four cores and 16 GB of memory to support our natural language processing workloads. A total of 11 apps are spread evenly across the worker nodes over three regions.

The system has several bare metal machines that support the workflow. The Redis lyric infringement detection app runs on two bare metal machines. Each bare metal machine has 36 GB of memory and 20 cores. The raw compute power can keep up with the pace of argument infringement detection.

Overall architecture
Figure 2. The overall architecture of GRAMMY Debates with Watson

The consumer-facing applications are deployed on a multitenant private cloud. The debater API interface and messenger Python microservices take all of the consumer traffic. Each is scaled out to two pods across four regions. There are test, development, and production environments to ensure that the work passes functional and quality testing before deployment to production. In parallel, the IBM Research private cloud runs on dedicated IBM Kubernetes clusters that are fronted by several NGINX ingress nodes. The pro/con, argument quality, and debater key point analysis services are isolated on separate clusters to support the natural language processing scale. Additional worker nodes can be provisioned as required.

To help with the load, we used a combination of GPUs and CPUs. For all of the offline jobs, the services used GPU-based clusters. This enabled us to handle batch jobs quickly. The online services such as the pro/con and argument quality handle real-time user responses with CPU-based clusters. However, the computationally intense online key point analysis service requires GPUs.

Next, we use the Speech by Crowd Debater platform. The Speech by Crowd platform is deployed on a Kubernetes cluster. Each of the NLP services used by the Speech by Crowd system is dedicated. The PostgreSQL and MongoDB databases that support the Speech by Crowd platform are located on IBM Cloud.

To manage the data across the entire architecture, we use IBM Cloud Object Storage and IBM Cloudant. Both use JSON document formats. The IBM Cloud Object Storage is the origin for the IBM Content Delivery Network. To handle message passing, we instantiated IBM Event Streams Kafka. Submitted arguments queue on topics, and are then processed by enrollment threads. IBM Cloud Internet Services provides a set of edge servers so that we can apply edge functions to all incoming traffic. We also had a cloud-enabled Redis that supported the traffic throttling solution.

Now, let’s break this architecture into two pieces.

Phase 1: Argument mining from Twitter

Leading up to GRAMMY night, fans can respond to Tweets about each topic such as “Virtual concerts are a better experience than live shows.” This is the crowd’s opportunity to voice their opinions on Twitter. A set of applications searches for all of the relevant responses to a top-level Tweet to include in the corpus of candidate arguments. To optimize quality and language fluency, each Tweet is processed by several algorithms to determine whether it is a fit for speech synthesis. Figure 3 highlights the components of the argument mining process.

Argument mining
Figure 3. The argument mining portion of the architecture

The architecture

The argument mining workflow is orchestrated by a Python application that runs on OpenShift. The application queries responses to Tweets from social influencers to find highly relevant and polarized opinions. Hashtags, symbols, and @ indicators are removed from the Tweet. Now, the clean text is posted to an extractive summarization capability to compress the text into a single sentence.
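The cleaning step can be pictured with a few regular expressions. This is a minimal sketch, not the production code: the function name `clean_tweet` is hypothetical, removing links is an added assumption, and it assumes that a hashtag or @ mention is dropped as a whole token rather than only losing its leading symbol.

```python
import re

def clean_tweet(text: str) -> str:
    """Strip links, @mentions, hashtags, and stray symbols from a Tweet."""
    text = re.sub(r"https?://\S+", " ", text)    # links (assumption: also removed)
    text = re.sub(r"[@#]\w+", " ", text)         # @mentions and #hashtags, whole token
    text = re.sub(r"[^\w\s.,!?'-]", " ", text)   # remaining symbols and emoji
    return re.sub(r"\s+", " ", text).strip()     # collapse leftover whitespace

print(clean_tweet("@RecordingAcad Virtual shows are fun!! #GRAMMYs"))
```

Links are removed first so that their punctuation does not leave fragments behind for the later steps.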

Next, each of the cleaned and focused sentences is paraphrased by a T5 language model. Candidates are ranked according to surface and semantic forms, producing a quality score. Only the highest-quality sentences that pass an experimentally determined threshold continue to the next step in the process. Each high-quality sentence is posted to a copyright infringement service. Now, all of the successfully passed opinions are batch ingested into the Speech by Crowd IBM Research platform and associated with a particular topic. The entire process has five steps.

Step 1. Twitter mining

The Twitter mining process is incredibly detailed. At a high level, we first timebox and search top-level Tweets for replies using Twitter’s API to set a search dictionary. We then look for all direct replies to the original author within a four-hour time window. From here, the process of mining the conversation thread begins. If an eligible Tweet is retrieved, we clean it by stripping out the text and applying quality filters.
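As a rough illustration of the timeboxing, the four-hour window can be expressed as a pair of timestamps in the `%Y%m%d%H%M` format that the search payload uses. The helper name `search_window` is hypothetical, and treating the window as starting at the Tweet's creation time is an assumption.

```python
from datetime import datetime, timedelta, timezone

def search_window(tweet_created_at: datetime, hours: int = 4) -> dict:
    """Build fromDate/toDate values that bound the reply search."""
    fmt = "%Y%m%d%H%M"
    return {
        "fromDate": tweet_created_at.strftime(fmt),
        "toDate": (tweet_created_at + timedelta(hours=hours)).strftime(fmt),
    }

created = datetime(2021, 3, 1, 12, 0, tzinfo=timezone.utc)
print(search_window(created))
```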

However, to find relevant Tweets, we use a computing technique called recursion. For example, searching for replies nested within any number of threads requires a method that calls itself on each reply that has replies of its own. As shown in the following code example, the _get_replies method has an escape if statement that limits the depth of the conversation threads that it follows. If we reach that limit, we stop descending and return, and the replies gathered so far are processed. If we do not reach the depth limit, we continue deepening.

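from datetime import datetime

import requests
from requests.auth import HTTPBasicAuth

def _get_replies(self, payload, tweet_id, screen_name, tweet_creation_datetime, counter):
    retweets = []
    # Escape condition: stop descending once the depth counter passes the limit.
    if counter > 4:
        return retweets
    # Search for Tweets addressed to the author, starting at the parent Tweet's creation time.
    payload_usr = dict.copy(payload)
    payload_usr['query'] = 'to:' + str(screen_name)
    payload_usr['fromDate'] = tweet_creation_datetime.strftime("%Y%m%d%H%M")
    if 'toDate' in payload_usr:
        del payload_usr['toDate']
    # The request call and endpoint URL are missing from the original listing.
    # requests.post and self._search_url are placeholders, not the original values.
    response = requests.post(self._search_url, json=payload_usr,
                             auth=HTTPBasicAuth(self._user, self._pwd)).json()
    if 'results' in response:
        for tweet in response['results']:
            # Keep only direct replies to the Tweet that we are expanding.
            if 'text' in tweet and 'in_reply_to_status_id' in tweet:
                if tweet['in_reply_to_status_id'] == tweet_id:
                    # Prefer the untruncated text when the API provides it.
                    if 'extended_tweet' in tweet:
                        tweet_text = tweet['extended_tweet']['full_text']
                    else:
                        tweet_text = tweet['text']
                    # Assumption: the reply text is what is collected; the original omits this step.
                    retweets.append(tweet_text)
            # Recurse one level deeper into replies that have replies of their own.
            if 'reply_count' in tweet and tweet['reply_count'] > 0 and 'user' in tweet and 'screen_name' in tweet['user']:
                pick_tweet_id = tweet['id']
                pick_screen_name = tweet['user']['screen_name']
                pick_tweet_creation_datetime = datetime.strptime(tweet['created_at'], '%a %b %d %H:%M:%S %z %Y')
                retweets = retweets + self._get_replies(payload, pick_tweet_id, pick_screen_name,
                                                        pick_tweet_creation_datetime, counter + 1)
    return retweets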

Step 2. Extractive summarization

After we have a list of Tweets, we enter the second step of the process, summarization. The algorithm that we used for summarization was extractive in that it generated summaries from existing text fragments from the original source. The other alternative, which we did not use to stay true to the source data, is abstractive summarization. With abstractive techniques, the algorithm produces new fabricated text. The algorithms are unsupervised, so we do not need any labeled data and it’s much easier to apply to any problem domain such as the GRAMMYs. Because we are using extractive summarization, the algorithm does not require any domain knowledge. The resulting sentences are now compressed, focused, and concise.
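To make the extractive idea concrete, here is a small unsupervised sketch that picks one existing sentence by word frequency. It is not the summarization algorithm used in the project, which the article does not name. The stop word list and the scoring rule are illustrative assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (assumption, not from the original system).
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "it",
             "of", "or", "so", "that", "the", "this", "to"}

def _content_words(text: str) -> list:
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def extract_key_sentence(text: str) -> str:
    """Return the single existing sentence whose words are most frequent overall."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return ""
    freq = Counter(_content_words(text))

    def score(sentence: str) -> float:
        tokens = _content_words(sentence)
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    return max(sentences, key=score)

tweet = "Wow. Music education builds confidence. Music education also helps kids focus in school."
print(extract_key_sentence(tweet))
```

Because the output is always one of the input sentences, nothing is fabricated, which is the property that distinguishes extractive from abstractive summarization.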

Step 3. Paraphrasing

After we finish the extractive summarization, we post the sentences to a paraphraser and rewriter that attempts to boost the natural language quality of the Tweets. To accomplish this, we use a T5-small model with PyTorch and the Hugging Face Transformers library.

We fine-tuned an existing transformer model for our task of Tweet rewriting. An important ingredient for transfer learning is the unlabeled data set that is used for pre-training. This helps the model to learn the encoding representation of data. For pre-training, the data set must be high quality, diverse, and large. Unfortunately, Wikipedia does not meet all of these requirements. The corpus is large and high quality but uniform in style. Another common data set is Common Crawl, where web pages have been scraped. This data set is large in scale and diverse but low in quality. As a result, the team that trained the original transformer created the Colossal Clean Crawled Corpus (C4), which is a cleaned version of the Common Crawl and two orders of magnitude larger than Wikipedia. This meets the quality, diversity, and scale data requirements. C4 is available through TensorFlow Datasets.

The original model can accept a sentence and rewrite it to a new sentence. We fine-tuned the small T5 model with only 150 labeled exemplars. For example, you can see a Tweet within the training sample along with the label. We collected 150 of these pairs and created our own fine-tuned paraphraser for the GRAMMYs.

The paraphraser application exposed a Swagger endpoint so that a sentence could be posted for paraphrasing. The service encapsulates spell checking and quality measures to return the top-ranked sentence from the set of expanded sentences and the original sentence.

Swagger interface
Figure 4. The swagger interface for text paraphrasing

Step 4. Sentence quality

Now that we have rewritten the Tweets into natural language candidates, we apply a quality pipeline. In the first step, we check for any spelling errors and correct them. Next, the clean text is passed to a surface form quality measure. The en_core_web_lg model from spaCy processes our words. This helps us to get the token positions and tags for each word. We use the part-of-speech patterns within a sentence to determine the surface form quality. We apply a set of rules in the form of case-based reasoning to give us a foundational score. Next, we trained an XGBoost tree on part-of-speech sequences with quality labels of true or false. The tree was applied to the extracted part-of-speech sequences to retrieve a machine learning score. We used a list of tags and parts of speech to discern the quality. We used 1,897 exemplars, of which 70% were used for training and 30% for testing. We achieved an accuracy of approximately 80%.

We picked the XGBoost algorithm from the lineage of decision trees. Decision trees learn branching rules for patterns. When we train many trees on resampled data and ensemble them together, we call this bagging. To go further, a random forest also uses a random subset of predictors to build each tree in the collection. Increasing in complexity, boosting builds models sequentially, with each new model concentrating on the examples that the earlier models got wrong. In gradient boosting, each new tree is fit to minimize the errors of the models built so far. Finally, we arrive at our selection, XGBoost, which prunes trees and adds regularization terms for an optimized gradient boosting algorithm.

We averaged the rule-based and model-based scores for an overall quality score. Only the highest-quality surface forms are retained and returned. This is how we pick the best paraphrased sentence.
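The scoring and selection can be sketched in a few lines. The function names, the assumption that both scores lie between 0 and 1, and the 0.6 threshold are all hypothetical. The article says only that the threshold was determined experimentally.

```python
from typing import Dict, Optional, Tuple

def overall_quality(rule_score: float, model_score: float) -> float:
    """Average the rule-based score and the model-based score."""
    return (rule_score + model_score) / 2.0

def pick_best(candidates: Dict[str, Tuple[float, float]],
              threshold: float = 0.6) -> Optional[str]:
    """candidates maps each sentence to (rule_score, model_score).

    Returns the top-ranked sentence, or None if nothing passes the threshold.
    """
    scored = {s: overall_quality(r, m) for s, (r, m) in candidates.items()}
    best = max(scored, key=scored.get, default=None)
    if best is None or scored[best] < threshold:
        return None
    return best

candidates = {
    "Music education helps children focus.": (0.9, 0.7),
    "music ed good bc focus": (0.4, 0.5),
}
print(pick_best(candidates))
```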

Next, we check on the semantic quality of the sentence. We run a Project Debater polarity detection model and pick sentences that take a clear stance toward a topic. Then, we determine how well the sentence is aligned to the meaning of the sentence. If the best paraphrased sentence survives the quality checks, it then moves on to the next step, infringement detection.

Step 5. Lyric infringement detection

A challenging and important problem we had to solve was to ensure that no arguments infringed on musical lyrics. If any text that overlaps with a song is found, the argument is removed from the candidate argument list. To detect infringement, a search payload is sent to two bare metal machines. The query searches the lyrics and, if applicable, the artist name of the topic. If the artist name is available, we search the song lyric index only. We check for verbatim matches, in-order matches, and matches within a slop, which is the number of intervening words allowed between the query terms. We experimented with different queries for each of the topics to get infringement detection coverage.
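The difference between a verbatim match and an in-order match with slop can be illustrated without a search engine. The sketch below is a pure Python stand-in for the index query, not the Redis query that the system ran. The function names and the made-up lyric line are assumptions, and slop is interpreted here as the total number of extra words allowed inside the matched span.

```python
import re

def tokenize(text: str) -> list:
    return re.findall(r"[a-z0-9']+", text.lower())

def matches_with_slop(query: str, lyric: str, slop: int = 0) -> bool:
    """True if the query words appear in order in the lyric with at most
    `slop` extra words inside the matched span (slop=0 means verbatim)."""
    q, doc = tokenize(query), tokenize(lyric)
    if not q:
        return False
    for start, token in enumerate(doc):
        if token != q[0]:
            continue
        pos, matched = start, 1
        # Greedily take the earliest next occurrence of each query word,
        # which gives the shortest span for this starting position.
        while matched < len(q):
            try:
                pos = doc.index(q[matched], pos + 1)
            except ValueError:
                break
            matched += 1
        if matched == len(q) and (pos - start + 1) - len(q) <= slop:
            return True
    return False

lyric = "sing along with me under the summer sky"  # made-up lyric line
print(matches_with_slop("under the summer sky", lyric))    # verbatim
print(matches_with_slop("sing with me", lyric, slop=1))    # one word skipped
```

A larger slop catches lightly reworded lyrics at the cost of more false positives, which is one reason to tune the queries per topic.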

The 29 GB infringement corpus came from the LyricFind company. The data was ingested on two Redis bare metal machines with 32 GB of RAM and 2 TB of disk. The raw storage and 20 cores per machine ensured that we had enough compute and memory capacity for thousands of parallel text searches. Multiple concurrent threads pushed search queries during batch jobs every 4 hours.

Finally, all of the Twitter arguments that pass all of the filters are stored in a database, together with arguments collected on

Phase 2: Speech synthesis

After we complete the Twitter argument mining step, we then proceed to speech generation.

Speech synthesis components
Figure 5. Speech synthesis components of the architecture

As part of the GRAMMY experience, music fans are invited to join the debates on a dedicated submission page. After the submission, users can see how their submission compares to other submissions in terms of argument quality and polarity. They can also see the summary of all of the arguments collected in the previous days.

Arguments are manually reviewed for inappropriate content (such as offensive or hateful comments), and spam is filtered in a dedicated admin user interface.

Managing scale in web submission

The lower half of the architecture is focused on the consumer-facing applications and the general speech synthesis pipeline. Consumers input arguments into the client-facing application. The arguments must be between 8 and 36 words long. When the length restrictions are passed, the React front end posts the argument to a scaled-out messenger application. This app enrolls the argument onto a 10-partition Kafka topic. Asynchronously, the debater enroll interface Python app on IBM Cloud has 10 threads, each listening on a Kafka topic partition. The app pulls in the argument, checks for infringement, and then posts to the Speech by Crowd platform.
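A minimal sketch of the two checks on the submission path follows. The function names are hypothetical, the word limits are assumed to be inclusive, and the CRC32 hash is only a stand-in for whatever partitioner the Kafka producer actually used.

```python
import zlib

NUM_PARTITIONS = 10  # matches the 10-partition topic described above

def is_valid_length(argument: str, min_words: int = 8, max_words: int = 36) -> bool:
    """Accept arguments of 8 to 36 words (assumed inclusive)."""
    return min_words <= len(argument.split()) <= max_words

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition with a stable hash (illustrative only)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

argument = "Music education should be mandatory because it builds lifelong confidence"
print(is_valid_length(argument), partition_for("submission-123"))
```

With one listener thread per partition, a stable key-to-partition mapping spreads submissions across the 10 threads while keeping messages with the same key in order.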

In parallel, the submitted user argument travels through the global throttling solution on IBM Cloud Internet Services to the Debater API interface. IBM Cloud Internet Services has several edge functions that are mapped to subdomains. Every request to the system flows through the functions. The logic within the function determines a traffic drop percentage to support the target requests per second. Any request above the target receives a 429 response. The remaining traffic flows to three IBM Research API endpoints. The system provides immediate pro/con, quality, and key point matching responses on the singular argument. The pro/con and quality services run on CPU-based clusters while the key point analysis endpoint runs on GPU-based clusters. Each response is aggregated together and sent back to the GRAMMYs experience.

Each endpoint within the Debater API Interface writes a count record to Redis on IBM Cloud. A throttle controller Python app reads the data from Redis and determines the percentage of traffic that must be dropped to stay under the maximum requests per second. The results are written to IBM Cloud Object Storage. The IBM Cloud Internet Services edge functions pull the drop percentage from IBM Cloud Object Storage to determine which requests receive a 429 response. This process shields the compute-intensive algorithms.
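The controller logic can be approximated with simple arithmetic. This is a sketch under stated assumptions, not the deployed code: the function names are hypothetical, the drop percentage is assumed to be the share of traffic above the target rate, and the edge function is assumed to reject requests at random with that probability.

```python
import random

def drop_percentage(observed_rps: float, target_rps: float) -> float:
    """Percentage of requests to drop so that the rest fit under the target."""
    if observed_rps <= 0 or observed_rps <= target_rps:
        return 0.0
    return 100.0 * (1.0 - target_rps / observed_rps)

def should_reject(drop_pct: float, rng=random.random) -> bool:
    """Edge-side decision: True means answer with HTTP 429."""
    return rng() * 100.0 < drop_pct

pct = drop_percentage(observed_rps=200, target_rps=100)
print(pct, should_reject(pct))
```

Keeping the decision at the edge means that rejected requests never reach the origin, so the controller only has to publish one number for the edge functions to read.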

Every day at 5 a.m. ET, batch jobs take all of the arguments from Twitter and the website input and invoke the Speech by Crowd Debater pipeline on each topic. The Debater Generator application loops through all of the open topics and pulls the argument set from the Speech by Crowd argument database. Any argument that is marked as spam or inappropriate is not used within the speech generation process. Each of the arguments is assigned to either a pro or con list using the Project Debater polarity classification model. The argument list is supplied as a parameter along with the topic to the speech generation process. The speech generation process can take anywhere from 20 minutes to a few hours depending on how many arguments we are using. The resulting speeches, key points, arguments, language statistics, and grid points are then converted to a JSON form and stored in Cloudant. Any artifacts that have been approved through the Debater Review Tool are then pushed to IBM Cloud Object Storage, which is the origin for our Content Delivery Network. All of the batched data is served through this acceleration tier.

Now let’s examine how the speech generation process works.

Debater Pipeline

The Speech by Crowd platform performs several steps to generate speeches. First, the polarity of each argument is classified by a deep learning neural network. Next, we increase the spread of the raw polarity by taking the square root of the decimal. This helps us to further remove neutral arguments. The system then removes any input text that is not aligned to the topic or that has low quality. Next, the system takes the remaining arguments and begins the key point analysis process. The system selects approved, short, and high-quality sentences as potential key points based on the sentence’s quality assessment. From there, each argument is matched to a key point. The algorithm grades and ranks the prevalence of each key point by identifying how many sentences articulate the gist of the key point. Finally, the narrative generation process selects the most prevalent key points and corresponding high-quality arguments to formulate a fluent narrative.
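Two of these steps lend themselves to a short worked example. The sketch below assumes that the raw polarity is a signed score between -1 and 1 and that the square root is applied to its magnitude. It also assumes that prevalence is the percentage of all arguments matched to a key point. Both function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple

def spread_polarity(p: float) -> float:
    """Sign-preserving square root: pushes scores in [-1, 1] away from zero."""
    return math.copysign(math.sqrt(abs(p)), p)

def key_point_prevalence(matches: List[Optional[str]]) -> List[Tuple[str, float]]:
    """matches holds the key point matched to each argument (None if unmatched).

    Returns (key point, percentage of all arguments), most prevalent first.
    """
    total = len(matches)
    counts = Counter(m for m in matches if m is not None)
    return [(kp, 100.0 * n / total) for kp, n in counts.most_common()]

print(spread_polarity(0.25), spread_polarity(-0.25))
print(key_point_prevalence(["helps children develop", "helps children develop",
                            "costs too much", None]))
```

The percentages produced this way correspond to the kind of figures that the generated speech quotes for each key point.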

GRAMMY scale

With the volume and variety of data in addition to consumer traffic, we built several interfaces between the Speech by Crowd platform and the Debater APIs. We had to ensure that the platform could scale and handle hundreds to millions of users from around the world. Next, the real-time feedback around arguments had to achieve a 2-second response time. All the while, the system had to respect a hard requests-per-second limit. These three requirements are challenging and can trade off against one another. All of the NLP artifacts such as key points and speeches had to run asynchronously. The run time depends on the number of arguments, which grows during the event. Finally, the system needs to be scalable to handle undefined peak loads.

As shown in Figure 6, the hybrid cloud infrastructure that supports the GRAMMYs project has many diverse parts. The real-time data processing part of the system handles dynamic loads. A total of 24 pods managed by Red Hat OpenShift support the workloads. The system only uses two GPUs in the real-time mode so that we can easily scale to available CPUs when the horizontal scalers sense CPU load. The real-time data flow does not have any caching, and we control load on the origin servers with global throttling.

The batch processing is very different. With fixed workloads, the system’s compute footprint can be managed without the need to handle spikes in traffic. A total of 33 pods that are spread out over nine workers process the data. Note the difference in GPU use: we use 12 GPUs for batch data so that the computation is faster, without the need to scale out on GPUs, which are less available than CPUs. Overall, we have approximately 12 AI and NLP models that ultimately create the speeches.

Real-time and batch data processing
Figure 6. Depiction of real-time and batch data processing cloud capacity

Now let’s look at the results of a topic.

GRAMMY Debates results

Figures 7-11 show initial results for the topic “Music education should be mandatory in all K-12 schools.” When we focus on the supporting arguments, you can see that many of these opinions are aligned toward brain development and self-esteem attainment. One of the arguments mentions that music training not only helps children develop fine motor skills, but aids emotional and behavioral maturation as well. You can also see another argument that states music stimulates brain development in children.

Supporting arguments
Figure 7. Supporting arguments for the music education topic

In the grid plot in Figure 8, each dot represents an argument. The spread of the arguments shows that the crowd is highly polarized around this topic. In purple, we have high-quality arguments that are against the topic. In gold, you can see the arguments that support mandatory music education. The crowd did not provide many neutral opinions. Notice how most of the arguments are of high quality. We filter out the lower-quality arguments so that they are not used in the speech.

Argument plot
Figure 8. Argument plot that shows stance versus quality

Now, let’s look at the key points we found that support and contest the notion that music education should be mandatory in all K-12 schools. The algorithms summarized the crowd’s arguments into 10 supporting key points and 6 contesting key points. Each of the key points is distinct and unique with supporting arguments. In general, the crowd thinks that music helps children develop, is important for education, increases brain capacity, encourages creativity, sharpens one’s ability to listen, and is a good indicator for academic success. One interesting key point that surfaced is that all learners should have access to music education. You will notice that one of the cons and concerns of the crowd is that music education costs too much. You can also see that some segments of the crowd thought music education will distract kids from core subjects, schools lack funding, schools have too many courses, and that it will degrade knowledge.

Pro key points
Figure 9. Pro key points that support mandatory music education

Con key points
Figure 10. Con key points that contest mandatory music education

Finally, the key points and arguments were used to construct a fluent and cohesive speech. In Figure 11, you see a speech supporting the notion that music education should be mandatory in all K-12 schools. The initial sentence of each paragraph restates a key point. We even provide the percentage of arguments that support the key point within the text. For example, you can see three highlighted themes. The first theme mentions that 21% of all arguments state that music in schools helps children develop better. As we go down further into the paragraphs, the number of arguments that support a key point gets smaller. Eleven percent of the arguments support the notion that music education is important to our schools. The supporting evidence is written as sentences from the crowd’s arguments. Next, 7% of arguments propose that music enhances brain coordination and increases brain capacity. The full speech had six total paragraphs that created a cohesive supporting narration.

Pro speech supporting mandatory music education
Figure 11. Pro speech supporting mandatory music education

Music + debating

Music and debating go together just like artificial intelligence and humans. The combination of both creates world-class experiences where we can gain a better understanding around all sides of an issue. Learn more at

Join us live!

Join the authors of this post live on March 16, 2021, at 12 p.m. (ET) as we talk about the solution we built for the GRAMMYs and show you how the argument results changed over the course of the two weeks leading up to the show. Sign up at: