
Behind the code: Machine learning pipeline for Fantasy Football insights

ESPN and IBM have teamed up to bring a new level of insight to fantasy football team owners: Watson AI. #FantasyFootballFace. Watson is built on an enterprise-grade machine learning pipeline to read, understand, and comprehend millions of documents and multimedia sources about fantasy football. The ESPN Fantasy Football with Watson system has been a significant undertaking with many components.

This article is the fourth in an eight-part series that takes you behind each component to show you how we used Watson to build a fair, world-class AI solution.

AI Fantasy Football insights with a machine learning pipeline

Watson goes deep into the data so that you can go deeper into selecting your fantasy football lineup. Managers can pick the players that will likely boom and avoid those that will bust. However, high-risk players that could bust might also have a higher ceiling. ESPN Fantasy Football with Watson provides you with AI insights so that you can swing for the fences. High risk could mean high reward. Alternatively, you could play it safe by trading boom probability for floor scoring potential. The machine learning pipeline reasons from the avalanche of data to align with your shifting strategy week over week.

[Figure: Overview of the machine learning architecture]

The machine learning architecture of the system is composed of several applications, dozens of models, and thousands of data sources and data science environments. First, the system had to be taught to read fantasy football content. A language model was designed with custom entities and relationships to fit the unique language that people use to describe players and teams in the fantasy football domain. To read for comprehension, an ontology of 13 entity types was defined that covered player-centric understanding. The entities include body part, coach, fans, gear, injury, location, player, player status, treatment, positive tone, negative tone, team, and performance metric. From the 5,568,714 articles, 1,200 documents were selected so that each entity type was evenly represented.

A team of 3 human annotators used a tool called Watson Knowledge Studio to annotate text as 1 of 13 entity types. The documents were pre-annotated from 10 dictionaries that searched for words and automatically created the annotation. The annotators corrected pre-annotations while adding others that were missed. Each day, the team met to discuss their kappa, or inter-annotator agreement score, for each entity type. Over a span of 3 weeks, the team produced a statistical entity model with a precision of 79%, recall of 73%, and an F1 score of 76%. Even with a 14% entity word density over all documents, the overall annotator agreement score was 70%, with the majority of differences being the omission of a few words in a phrase.

A statistical entity detector was trained and deployed to our system called Watson Discovery, which continually ingests content from over 50,000 sources. The Watson Discovery system is able to discover fantasy football entities, keywords, and concepts from the continually updating corpora based on our trained statistical entity model.
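The reported F1 score follows from the precision and recall figures, because F1 is their harmonic mean. A minimal check of that arithmetic is sketched here; the function name `f1_score` is illustrative and not part of any Watson tooling.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the statistical entity model.
f1 = f1_score(0.79, 0.73)
print(f"F1 = {f1:.2f}")  # prints "F1 = 0.76"
```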

Next, the system used a document-to-vector model to understand the natural text from a query. A very specific query such as “Tom Brady and Patriots and NFL and Football” was initially issued to Watson Discovery trusted sources as described in Part 3. If a query did not return at least 50 documents, a threshold that was determined experimentally, the query was broadened until it only had “Tom Brady and NFL.” If we still did not have at least 50 pieces of evidence, the precise-to-broad queries were sent to Watson Discovery neutral sources. From the query result, a list of entities, keywords, and concepts for each document was converted to numerical feature vectors. Each of the feature vector groups (that is, entities, concepts, and keywords) was averaged to represent a semantic summarization of the content. The feature vector groups from each document were then averaged across documents.

The word-to-vector model was tested with two different types of semantic meaning. First, an analogy test was provided to the model. If the relation “Travis Kelce is to the Chiefs as Coby Fleener is to X” is presented to the model, the correct answer would be the Saints. In the player-to-team analogy testing, the correct answer was in the top 1% of the data 100% of the time. The team-to-location analogy was slightly lower with a 93.48% accuracy because the natural queries were not focused around teams. The second test provided a set of keywords to the model and expected a related word. For example, if Coby Fleener is input into the model, we would expect to see the Saints as output. The model achieved 80% accuracy on player keyword tests where the correct answer was in the top 0.1% of the data. The team and location keyword tests performed similarly with 74% accuracy where the correct answer was in the top 1% of data.
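The precise-to-broad search and the two averaging steps can be sketched as follows. This is an illustration only: `search` is a hypothetical callable standing in for a Watson Discovery query, the tier names are placeholders, and falling back to the largest result set when no query reaches the threshold is an assumption that the article does not state.

```python
from typing import Callable, Dict, List, Sequence

MIN_EVIDENCE = 50  # experimentally determined threshold from the article

Vector = List[float]


def gather_evidence(queries: Sequence[str], source_tiers: Sequence[str],
                    search: Callable[[str, str], list],
                    min_docs: int = MIN_EVIDENCE) -> list:
    """Try each query (ordered precise to broad) on each source tier in turn."""
    best: list = []
    for tier in source_tiers:            # for example, trusted before neutral
        for query in queries:
            docs = search(query, tier)   # hypothetical search call
            if len(docs) >= min_docs:
                return docs
            if len(docs) > len(best):
                best = docs
    return best  # assumption: keep the largest result set seen


def average_vectors(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equal-length vectors (empty input gives [])."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def summarize(documents: Sequence[Dict[str, List[Vector]]]) -> Dict[str, Vector]:
    """Average within each document, then across documents, per group."""
    summary = {}
    for group in ("entities", "concepts", "keywords"):
        per_document = [average_vectors(doc[group])
                        for doc in documents if doc.get(group)]
        summary[group] = average_vectors(per_document)
    return summary
```

The result of `summarize` is one vector per group, which corresponds to the three feature vectors that feed the next stage of the pipeline.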
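One common way to run an analogy test on word vectors is the vector-offset method, sketched here with made-up 4-dimensional vectors. The article does not say which method was used, so the offset arithmetic, the toy vectors, and the helper names are all assumptions for illustration.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy_rank(vectors: Dict[str, List[float]], a: str, b: str, c: str,
                 expected: str) -> int:
    """Rank of `expected` for 'a is to b as c is to ?' (1 is the best match)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    ranked = sorted(candidates,
                    key=lambda word: cosine(vectors[word], target),
                    reverse=True)
    return ranked.index(expected) + 1


def in_top_fraction(rank: int, vocabulary_size: int, fraction: float) -> bool:
    """True if the rank falls within the top `fraction` of the vocabulary."""
    return rank <= max(1, math.ceil(vocabulary_size * fraction))


# Toy vectors invented for this sketch; they are not model output.
toy = {
    "Travis Kelce": [1.0, 0.0, 1.0, 0.0],
    "Chiefs":       [1.0, 0.0, 0.0, 1.0],
    "Coby Fleener": [0.0, 1.0, 1.0, 0.0],
    "Saints":       [0.0, 1.0, 0.0, 1.0],
    "Patriots":     [0.7, 0.3, 0.0, 1.0],
}
rank = analogy_rank(toy, "Travis Kelce", "Chiefs", "Coby Fleener", "Saints")
print(rank, in_top_fraction(rank, len(toy), 0.01))
```

An analogy counts as correct when the expected word ranks within the chosen fraction of the vocabulary, which is how a "top 1% of the data" criterion can be scored.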

Next, the 3 feature vectors along with player biographics are input into the deep learning portion of the pipeline.


The deep learning pipeline had 4 models that were over 98 layers deep. The nodes in the neural networks used a combination of rectified linear (ReLU), tanh, and sigmoid activation functions. The models were classifiers for each player to determine the probability of a boom, bust, play with an injury, or play meaningful minutes. The probability scores provide a confidence level for each player state so that team owners can decide their own risk tolerance.


The bust game classifier had an accuracy of 83% with a modest class separation, while the boom classifier had a very low accuracy of 39%. Even so, players with a high probability of breaking out significantly outscored their projections on average, and the boom players that the model missed and marked incorrect were very close to the binary threshold of 0.5. Further, the negative predictive value of the boom model is 85.5%, and it predicts booms at a realistic rate of 12% of players. Overpredicting booms would be worse than a low accuracy score, so accuracy is not as meaningful an evaluation metric as the negative predictive value and the percentage of players predicted to boom.
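The metrics discussed above all derive from a confusion matrix. The sketch below uses the standard definitions with hypothetical counts that were invented for illustration; they are not the counts behind the figures in this article.

```python
from typing import Dict


def classifier_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard confusion-matrix metrics for a binary classifier."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        # Share of predicted negatives that really were negative.
        "npv": tn / (tn + fn) if tn + fn else 0.0,
        # Share of predicted positives that really were positive.
        "ppv": tp / (tp + fp) if tp + fp else 0.0,
        # Share of all players that the model predicts as positive.
        "predicted_positive_rate": (tp + fp) / total,
    }


# Hypothetical counts for 100 player-weeks.
metrics = classifier_metrics(tp=5, fp=7, tn=75, fn=13)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting the negative predictive value and the predicted positive rate next to accuracy shows whether a model is flagging too many players, which a single accuracy number hides.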

The play with injury classifier had an accuracy of 67% with a positive predictive value of 68.1%. The positive predictive value is very important for this classifier because it tells us how reliable a prediction is that a player will play with an injury or hidden injury. The play meaningful minutes model produced an accuracy of 61.7%. Each of the deep learning models provides invaluable natural language text summarization for player states.

Finally, the outputs of the deep learning layers along with structured ESPN data were input into an ensemble of multiple regression models. A multiple regression ensemble based on player position provided a point projection for each player. A linear combination of deep learning player states and ESPN statistical data produced the best RMSE score of 6.78. In other words, across all positions, the projection for a player who is expected to score significant points is typically off by about 6.78 points.
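To make the RMSE figure concrete, the sketch below shows a linear combination of features and the root-mean-square error between projected and actual points. The feature layout, weights, and scores are hypothetical; the real regression ensemble and its coefficients are not published in this article.

```python
import math
from typing import Sequence


def project_points(features: Sequence[float], weights: Sequence[float],
                   intercept: float = 0.0) -> float:
    """Linear combination of feature values (hypothetical weights)."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return intercept + sum(w * x for w, x in zip(weights, features))


def rmse(projected: Sequence[float], actual: Sequence[float]) -> float:
    """Root-mean-square error between projections and actual scores."""
    if not projected or len(projected) != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(projected, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up example: boom probability, bust probability, ESPN projection.
projection = project_points([0.4, 0.1, 14.0], [5.0, -5.0, 1.0])
print(round(projection, 2))
print(round(rmse([10.0, 20.0], [13.0, 16.0]), 3))
```

Because RMSE squares each error before averaging, large misses weigh more heavily than they would in a plain average of absolute errors.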


To produce probability distribution functions (PDFs), the end of the machine learning pipeline fit 25 different PDFs to a player’s historical performance. Some example distributions include alpha, anglit, beta, Bradford, chi, Wald, von Mises, Pearson, normal, Rayleigh, and so on. If the player did not have enough historical data, similar player data was retrieved. After a PDF was created for each player, 10,000 points were sampled over the distribution within a simulation to provide discrete points for curve shape rendering on the user interface.

The merging of natural language evidence with traditional statistics produced a score projection for every player. On average, the combination of structured and unstructured data produced a better RMSE than either produced independently. A lot of the data exploration was performed within Jupyter Notebooks and IBM SPSS software. Through experimentation, we selected model hyperparameters and algorithms throughout the project.
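The fit-then-sample step can be illustrated with a much smaller candidate set. This sketch fits only two distributions (normal and Rayleigh) by maximum likelihood, keeps the one with the higher log-likelihood, and draws 10,000 points. Choosing by log-likelihood, the two-candidate set, and the sample scores are assumptions; the article does not describe how the best of the 25 fits was chosen.

```python
import math
import random
from statistics import NormalDist
from typing import Callable, List, Optional, Sequence, Tuple

Sampler = Callable[[int, random.Random], List[float]]


def fit_normal(scores: Sequence[float]) -> Tuple[float, Optional[Sampler]]:
    """Fit a normal distribution; needs at least two distinct scores."""
    dist = NormalDist.from_samples(scores)
    loglik = sum(math.log(dist.pdf(x)) for x in scores)
    return loglik, lambda n, rng: [rng.gauss(dist.mean, dist.stdev)
                                   for _ in range(n)]


def fit_rayleigh(scores: Sequence[float]) -> Tuple[float, Optional[Sampler]]:
    """Fit a Rayleigh distribution by maximum likelihood (positive data only)."""
    if min(scores) <= 0:
        return -math.inf, None
    sigma2 = sum(x * x for x in scores) / (2 * len(scores))
    loglik = sum(math.log(x / sigma2) - x * x / (2 * sigma2) for x in scores)
    sigma = math.sqrt(sigma2)
    # Inverse-CDF sampling; 1 - random() lies in (0, 1], so log() is safe.
    return loglik, lambda n, rng: [
        sigma * math.sqrt(-2 * math.log(1 - rng.random())) for _ in range(n)]


def sample_best_fit(scores: Sequence[float], n: int = 10_000,
                    seed: int = 0) -> Tuple[str, List[float]]:
    """Pick the candidate with the highest log-likelihood and sample from it."""
    rng = random.Random(seed)
    candidates = {"normal": fit_normal(scores),
                  "rayleigh": fit_rayleigh(scores)}
    name = max(candidates, key=lambda key: candidates[key][0])
    return name, candidates[name][1](n, rng)


# Made-up weekly scores for one player.
name, points = sample_best_fit([12.1, 8.4, 15.3, 9.9, 20.2, 11.0])
print(name, len(points))
```

The sampled points can then be binned or smoothed to draw the curve shape for a player.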

[Figure: image showing four quadrants]

After Watson was able to read, understand, and comprehend the data through the core machine learning pipeline, a set of post processors was applied to the outputs. Through the use of the IBM AI Fairness 360 libraries, a post processor was developed to ensure that each team was treated fairly. For example, players on the New England Patriots were predicted to boom more often than players on other teams. In reality, the variance between the number of players that boomed on each team was very low. A fair post processor model was built using the regression output with ground truth data. The output of the fair post processor corrected the overall bias in the 50,000 sources towards specific teams.

A second post processor altered the confidence values of the deep learning models. Each of the deep learning models had a different classification threshold. To make boom, bust, play with injury, and play meaningful minutes comparable, each of the values was normalized. Each of the confidence values is now centered on 0.5 to be on the same scale.
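One simple way to center confidence values on 0.5 is a piecewise-linear rescale around each model's own threshold, sketched below. The article does not describe the actual normalization, so this mapping, the function name `center_on_half`, and the example thresholds are assumptions.

```python
def center_on_half(probability: float, threshold: float) -> float:
    """Map a raw probability so that the model's threshold lands on 0.5.

    Values below the threshold are scaled into [0, 0.5) and values at or
    above it into [0.5, 1], so 0 stays 0 and 1 stays 1.
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability < threshold:
        return 0.5 * probability / threshold
    return 0.5 + 0.5 * (probability - threshold) / (1.0 - threshold)


# Hypothetical thresholds for two of the classifiers.
print(center_on_half(0.30, threshold=0.30))
print(center_on_half(0.45, threshold=0.60))
```

After this rescale, a value above 0.5 means the same thing for every classifier: the model's own decision threshold was exceeded.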

Training the Fantasy Football machine learning pipeline

Throughout the project, we used historical news articles, blogs, and so on, associated with players from the 2015 and 2016 fantasy football seasons. A third party named Webhose provided the large-scale content. In total, over 100 GB of data was ingested into Watson Discovery using our custom entity model. We correlated the article date with structured player data from ESPN to generate labeled data. The ESPN player data contained several statistics that included week result, projection, actual, percentage owned, and so on.

A week-long date span from Tuesday to Monday was associated with each player state so that a time-ranged query could be run to retrieve relevant news articles. The boom label was determined by analyzing actual and projected scores along with the percentage of team managers that owned a player. If the actual score of a player was greater than 1 standard deviation above the projection for a player $p$, the average of the differences between the actual and projected scores, weighted by the percentage owned, $perowned_{p} \geq 0.1$, is used to determine a boom standard deviation. However, the player must be owned by at least 10% in all leagues.


The standard deviation for boom, $\sigma_{bo}$, is determined by taking the square root of the boom variance, $\sigma_{bo}^{2}$.


The label boom is applied to the player if their actual score, $x$, is greater than 1 boom standard deviation above the boom mean for the player.
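The boom labeling described above can be sketched as follows. This is one possible reading of the description, not the original equations: the ownership-weighted mean and variance are taken over score differences (actual minus projected), the label compares a week's difference against that mean plus 1 boom standard deviation, and `sigma_p` is passed in as a precomputed per-player standard deviation. All three choices are assumptions.

```python
import math
from typing import Iterable, Optional, Tuple

Week = Tuple[float, float, float]  # (actual, projected, fraction owned)


def boom_stats(weeks: Iterable[Week], sigma_p: float,
               min_owned: float = 0.1) -> Optional[Tuple[float, float]]:
    """Ownership-weighted mean and standard deviation of boom differences.

    Only weeks where the player beat the projection by more than sigma_p
    and was owned by at least `min_owned` of managers are included.
    """
    sample = [(actual - projected, owned)
              for actual, projected, owned in weeks
              if owned >= min_owned and actual > projected + sigma_p]
    total_weight = sum(weight for _, weight in sample)
    if total_weight == 0:
        return None  # no qualifying weeks for this player
    mean = sum(diff * weight for diff, weight in sample) / total_weight
    variance = sum(weight * (diff - mean) ** 2
                   for diff, weight in sample) / total_weight
    return mean, math.sqrt(variance)


def is_boom(actual: float, projected: float,
            mu_bo: float, sigma_bo: float) -> bool:
    """Label a week as a boom if it exceeds the boom mean by one boom sigma."""
    return (actual - projected) > mu_bo + sigma_bo


# Made-up weeks for one player.
history = [(30.0, 15.0, 0.5), (25.0, 15.0, 0.5),
           (16.0, 15.0, 0.9), (40.0, 15.0, 0.05)]
stats = boom_stats(history, sigma_p=5.0)
if stats is not None:
    print(stats, is_boom(31.0, 15.0, *stats))
```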


The bust label is calculated by the following equations. The average bust score, $\mu_{bu}$, is determined by weighting the difference between score actuals and projection by the same player’s projection. However, only actuals that are 1 standard deviation, $\sigma_{p}$, below the projected scores are used within the sample set.


The square root of the bust variance, $\sigma_{bu}^{2}$, provides the standard deviation threshold to label a player with score $x$.


The play with injury label was generated only for players that scored greater than 15% of their projected points and were on the injury report as questionable or probable. The play meaningful minutes label was created when a player scored greater than 15% of their projected points and was either probable or not on the injury report. Each of the four labels was generated for every week of every player within the 2015 and 2016 fantasy football seasons.
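The two injury-related labeling rules translate directly into code. In this sketch, the status strings and the use of `None` for a player who is not on the injury report are representation choices made for illustration.

```python
from typing import Optional

SCORE_FRACTION = 0.15  # must score more than 15% of projected points


def scored_enough(actual: float, projected: float) -> bool:
    """True if the player scored more than 15% of the projection."""
    return actual > SCORE_FRACTION * projected


def play_with_injury(actual: float, projected: float,
                     status: Optional[str]) -> bool:
    """Label: scored enough while listed as questionable or probable."""
    return scored_enough(actual, projected) and status in {"questionable",
                                                           "probable"}


def play_meaningful_minutes(actual: float, projected: float,
                            status: Optional[str]) -> bool:
    """Label: scored enough while probable or not on the injury report."""
    return scored_enough(actual, projected) and status in {None, "probable"}


# Made-up example: 4.0 actual points against a 12.0 point projection.
print(play_with_injury(4.0, 12.0, "questionable"))
print(play_meaningful_minutes(4.0, 12.0, "questionable"))
```

A player listed as probable who clears the scoring bar receives both labels, which follows from the two rules as written.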

Watson provides fair and expert AI insights for fantasy football managers across the world. The scientific process of training and evaluating the machine learning pipeline creates highly accurate predictions to help you win. #WinWithWatson

Check back next time as I discuss the delivery of AI insights through mobile experiences. To find out more, follow Aaron Baughman on Twitter: @BaughmanAaron.

The ESPN Fantasy Football logo is a trademark of ESPN, Inc. Used with permission of ESPN, Inc.