Playing fair with ESPN Fantasy Football Watson Insights

With contributions by Stephen Hammer, Jeff Powell, and John Kent

We have all been there. You have several fantasy football choices to make, and the opportunity cost of each decision seems about equal. Naturally, we pick the choice that we like best based on what we know or can recall. Fantasy football has many choices of that type. Maintaining analytical mindfulness is very difficult when each of us has prior beliefs that can be influenced by recent events. Watson Fantasy Football Insights brings fairness back into your decision-making process. Watson has been taught how to identify bias that originates from the crowd or the system itself so that unfair results are mitigated. Everyone, from team managers to laypersons, can now view fair boom and bust opportunities for football players. Your roster will have a highly tuned upside without the risk of a bias-based downside. #WinWithWatson

In part 2 of 4 in this blog series, we describe the bias system architecture, the identification and mitigation process, and the results of fair predictions.

Overall bias system architecture

Before we go into the bias system architecture that is shown in Figure 1, let’s define some terminology. The goal of Watson is to treat each feature and value pair that describes a player as a protected attribute. A protected attribute partitions players into groups that should have parity in terms of benefit, such as the probability of booming. We want to remove privileged values, such as which team a player belongs to, that cause unequal treatment across groups of players. This way, we maintain group fairness because players receive equal treatment even if they are on a less popular team. Watson must also consider individual fairness, where similar players receive the same potential outcomes. To quantify fairness, Watson uses a fairness metric that can identify bias. A bias mitigation algorithm then helps us remove unfairness that originates from data or models.

Delivering a fair artificial intelligence (AI) experience is very challenging. We use many different types of source data, such as videos, podcasts, news articles, and blogs, that might contain unwanted bias. Further, the machine learning pipeline that was described in Part 1 was trained with supervision. The process of providing labels, especially with a small group of human annotators, can introduce polarized opinions through bias creep. Unwanted systematic errors can be found within the answer keys, training data, or test data that are used to teach Watson how to assess each football player. Even when creating a model topology with neural networks, an architect can unknowingly insert structural bias. This bias can be exacerbated by the selected type of algorithmic learning process, activation function, and weight initialization.

The architecture of the system injects bias detection and mitigation techniques into the overall machine learning pipeline. Videos and podcasts are discovered through ESPN-designated sites. The videos and podcasts are transcribed into text and enrolled into an IBM Watson Discovery collection that has been taught how to read fantasy football vocabulary. In parallel, the article and blog texts from over 50,000 sources are similarly ingested. Entities, keywords, and concepts are extracted about each player and input into a deep learning autoencoder. The autoencoder projects each word into a 300-dimensional space where a floating-point number represents each component. The feature vectors are averaged together to get a source summarization about each of the top 300 fantasy football players. The summarized vector is sent to four deep learning classifiers to determine whether a player will boom, bust, play with a hidden injury, or play meaningful minutes. The class probabilities along with game statistics are input into an ensemble of non-linear multiple regression models.
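The averaging step can be sketched in a few lines of Python. This is a minimal illustration rather than the production pipeline: the function name summarize_source is hypothetical, and toy three-component vectors stand in for the 300-component vectors described above.

```python
from statistics import fmean

def summarize_source(word_vectors):
    """Average equal-length word vectors component by component."""
    if not word_vectors:
        raise ValueError("need at least one word vector")
    dim = len(word_vectors[0])
    if any(len(vector) != dim for vector in word_vectors):
        raise ValueError("all word vectors must share one dimension")
    # zip(*...) groups the i-th component of every vector together.
    return [fmean(component) for component in zip(*word_vectors)]

# Toy 3-component vectors stand in for the 300-component ones.
vectors = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
summary = summarize_source(vectors)
```

The result has the same number of components as each input vector, so it can be passed to a downstream classifier regardless of how many words were extracted for a player.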

Figure 1. The debias identification and mitigation process

The regression models run on Watson Machine Learning. The traffic and output of Watson Machine Learning are sent to both IBM Watson OpenScale and the machine learning pipeline. Watson OpenScale accepts the traffic to discover bias within preselected protected attributes. Fantasy football users can view the dashboards and see whether bias is found in the operational data. The multiple regression models provide the estimated player projections that are inputs for the simulation. Candidate probability density functions (PDFs) are fit to the combination of historical point projections and the current projection. The best-fit PDF is used by the sampling and simulation process.
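To make the PDF selection step concrete, here is a minimal Python sketch. The article does not say which candidate distributions or which goodness-of-fit criterion were used, so this example assumes two candidates (normal and Laplace) and picks the one with the higher maximum log-likelihood. All function names and data values are hypothetical.

```python
import math
from statistics import fmean, median

def fit_normal(xs):
    """Maximum-likelihood normal fit and its log-likelihood."""
    mu = fmean(xs)
    sigma = math.sqrt(fmean([(x - mu) ** 2 for x in xs]))
    loglik = sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in xs
    )
    return {"name": "normal", "params": (mu, sigma), "loglik": loglik}

def fit_laplace(xs):
    """Maximum-likelihood Laplace fit and its log-likelihood."""
    loc = median(xs)
    scale = fmean([abs(x - loc) for x in xs])
    loglik = sum(-math.log(2 * scale) - abs(x - loc) / scale for x in xs)
    return {"name": "laplace", "params": (loc, scale), "loglik": loglik}

def best_fit(historical, current):
    """Fit every candidate to history plus the current projection."""
    xs = list(historical) + [current]
    if len(set(xs)) < 2:
        raise ValueError("need at least two distinct projections")
    candidates = [fit_normal(xs), fit_laplace(xs)]
    return max(candidates, key=lambda c: c["loglik"])

# Hypothetical weekly projections for one player.
evenly_spread = best_fit([8.0, 9.0, 11.0, 12.0], 10.0)
peaked_with_outliers = best_fit([10.0, 10.0, 10.0, 10.0, 0.0, 20.0], 10.0)
```

Both candidates have two parameters, so comparing raw log-likelihoods is reasonable here. With candidates of differing complexity, a penalized criterion would be a safer choice.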

Points are selected and sampled from the PDF. The points are approximations for a simulated game state for each of the top 300 players. Next, each of the predictions from the machine learning pipeline is shifted and normalized to be in fantasy football representations. The boom and bust values are debiased after the normalization process. This debiasing process gives parity across boom and bust predictions regardless of which team's roster a player is on. The debiased predictions are stored within Db2 on IBM Cloud. A content generation system queries Db2 to generate AI insights that are distributed across the world.

Bias identification

Bias detection was explored by focusing on several protected variables with potential privileged values. First, a group of multiple regression models was formed as an ensemble. The appropriate model was applied given a player’s position. A total of six models were trained and deployed on Watson Machine Learning. This machine-learning-as-a-service deployment was consumed within the bias pipeline.

Figure 2. Watson Machine Learning models deployed to fantasy football score projections

The predictors for the projection score included play with injury state, play with injury probability, play meaningful minutes state, play meaningful minutes probability, bust state, bust probability, boom state, and boom probability. Several other ESPN statistics were included within the model. The play states and boom or bust probabilities were used to monitor for bias. The favorable labels for each monitor were position-specific. For example, the wide receiver position has the favorable label set as any projection greater than 10 points. This enabled the system to determine whether higher scoring players were predicted to boom more than lower scoring players even though the score spread was similar. The system should not be biased towards players that have a high scoring ceiling. A player that is projected to score two points might boom with a total of four points, while a different player might boom with 26 points. The probabilities should be representative of players’ boom potential irrespective of score magnitude.
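As a rough illustration of this monitor, the following Python sketch labels wide receiver projections as favorable when they exceed 10 points and then compares the average boom probability of the two resulting groups. Only the 10-point wide receiver threshold comes from the text. The function names, the player values, and the choice to compare simple averages are assumptions made for the example.

```python
from statistics import fmean

# Only the wide receiver threshold is stated in the article; other
# positions would need their own thresholds, which are not given here.
FAVORABLE_THRESHOLDS = {"WR": 10.0}

def is_favorable(position, projection):
    """Return True when a projection exceeds the position's threshold."""
    return projection > FAVORABLE_THRESHOLDS[position]

def boom_gap(players):
    """Mean boom probability of favorable minus unfavorable players."""
    favorable = [p["boom"] for p in players if is_favorable(p["pos"], p["proj"])]
    unfavorable = [p["boom"] for p in players if not is_favorable(p["pos"], p["proj"])]
    return fmean(favorable) - fmean(unfavorable)

# Hypothetical wide receivers: projection in points, boom as a probability.
players = [
    {"pos": "WR", "proj": 14.0, "boom": 0.40},
    {"pos": "WR", "proj": 12.0, "boom": 0.30},
    {"pos": "WR", "proj": 4.0, "boom": 0.15},
    {"pos": "WR", "proj": 2.0, "boom": 0.05},
]
gap = boom_gap(players)
```

A gap far from zero would suggest that boom probabilities track score magnitude rather than each player's potential relative to their own projection.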

Several types of bias identification algorithms can be used to detect bias in a model. An equal opportunity metric compares true positive rates between protected groups. A criterion can be supplied to find discrimination with respect to a specified sensitive attribute. This metric has limited utility when the training data contains label bias. Alternatively, the disparate impact metric is the ratio of the probability of a positive classification for the unprivileged group to that of the privileged group for a selected protected attribute. The selection process can have widely different outcomes for different groups. This metric has limited use when the data has large class imbalances. Within Watson OpenScale, a combination of disparate impact with prioritized individual bias is used. Here, we measure the ratio of the probability of a positive classification for both groups and individuals.
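The two group metrics described here can be written down directly. The sketch below is a plain Python illustration with hypothetical function names and toy data. It is not the Watson OpenScale implementation, which also accounts for individual bias.

```python
def positive_rate(predictions):
    """Share of examples that received the favorable prediction (1)."""
    return sum(predictions) / len(predictions)

def disparate_impact(unpriv_preds, priv_preds):
    """Ratio of favorable prediction rates; 1.0 means parity."""
    return positive_rate(unpriv_preds) / positive_rate(priv_preds)

def true_positive_rate(labels, predictions):
    """Share of truly favorable examples that were predicted favorable."""
    hits = [p for y, p in zip(labels, predictions) if y == 1]
    return sum(hits) / len(hits)

def equal_opportunity_difference(unpriv_labels, unpriv_preds, priv_labels, priv_preds):
    """Difference in true positive rates; 0.0 means parity."""
    return (true_positive_rate(unpriv_labels, unpriv_preds)
            - true_positive_rate(priv_labels, priv_preds))

# Toy data: 1 = boom, 0 = no boom, for two hypothetical groups of players.
unpriv_labels, unpriv_preds = [1, 1, 0, 0], [1, 0, 0, 0]
priv_labels, priv_preds = [1, 1, 0, 0], [1, 1, 0, 0]

di = disparate_impact(unpriv_preds, priv_preds)
eod = equal_opportunity_difference(unpriv_labels, unpriv_preds, priv_labels, priv_preds)
```

In this toy data the unprivileged group receives the favorable prediction half as often as the privileged group, and truly booming players in that group are also recognized less often.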

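Other techniques include an index based on the generalized entropy of benefit for all individuals, which measures inequality in how benefit is allocated. The statistical parity difference looks at the difference in the rate of favorable outcomes between the unprivileged and privileged groups. In our final example, the average odds difference measures the average of the differences in false positive rates and true positive rates between the unprivileged and privileged groups.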
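For readers who prefer code to definitions, here is a minimal Python sketch of the statistical parity difference and the average odds difference. The helper names and toy data are hypothetical. Both metrics are computed as the unprivileged group's value minus the privileged group's value, so zero indicates parity.

```python
def rate(values):
    """Mean of a list of 0/1 values."""
    return sum(values) / len(values)

def conditional_rate(labels, predictions, actual):
    """Favorable prediction rate among examples whose true label is `actual`."""
    return rate([p for y, p in zip(labels, predictions) if y == actual])

def statistical_parity_difference(unpriv_preds, priv_preds):
    """Difference in favorable prediction rates between groups."""
    return rate(unpriv_preds) - rate(priv_preds)

def average_odds_difference(unpriv_labels, unpriv_preds, priv_labels, priv_preds):
    """Mean of the false positive rate gap and the true positive rate gap."""
    fpr_gap = (conditional_rate(unpriv_labels, unpriv_preds, 0)
               - conditional_rate(priv_labels, priv_preds, 0))
    tpr_gap = (conditional_rate(unpriv_labels, unpriv_preds, 1)
               - conditional_rate(priv_labels, priv_preds, 1))
    return 0.5 * (fpr_gap + tpr_gap)

# Toy data: 1 = boom, 0 = no boom.
unpriv_labels, unpriv_preds = [1, 1, 0, 0], [1, 0, 0, 0]
priv_labels, priv_preds = [1, 1, 0, 0], [1, 1, 1, 0]

spd = statistical_parity_difference(unpriv_preds, priv_preds)
aod = average_odds_difference(unpriv_labels, unpriv_preds, priv_labels, priv_preds)
```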

Bias mitigation

If bias was detected with Watson OpenScale and through experimentation, mitigation techniques were applied to the output of the models. AI Fairness 360, which was developed and released by IBM, was selected to run debiasing algorithms. As Figure 3 shows, a debiasing algorithm can run as a preprocessor, within the model, or as a post-processor.

Figure 3. Debiasing algorithm selection as re-created from

The first option is to minimize or remove bias with a preprocessor. In this case, we want to transform the data before the model is trained or applied. A training and test set are used in combination with a bias detector to determine whether the bias in the labeled data has been decreased by a trained fair preprocessor. If so, we keep the fair preprocessor and transform all data before we train, test, or apply the classifier. A new training and test set is created by the preprocessor. When the classifier is trained and applied, output data is compared to the transformed test set to ensure that we have a fair classifier. The preprocessor option can take some time to implement because this method of debiasing requires a rebuild of the original model. An example of a bias mitigation preprocessor is the reweighing method, which weights the examples in each group and label combination to ensure fairness before classification. Another example is optimized preprocessing, which learns a probabilistic transformation to modify the features and labels of the training data.
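The reweighing idea is compact enough to sketch in plain Python. In this illustration, each example's weight is the expected frequency of its group and label combination under independence, divided by the observed frequency. The function names, team names, and data are hypothetical, and this is not the AI Fairness 360 code.

```python
from collections import Counter

def reweighing_weights(groups, labels):
    """Weight = P(group) * P(label) / P(group, label) for each example."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    pair_counts = Counter(zip(groups, labels))
    return [
        (group_counts[g] * label_counts[y]) / (n * pair_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]

def weighted_positive_rate(group, groups, labels, weights):
    """Favorable label rate within one group after applying the weights."""
    total = sum(w for g, w in zip(groups, weights) if g == group)
    positive = sum(w for g, y, w in zip(groups, labels, weights)
                   if g == group and y == 1)
    return positive / total

# Hypothetical teams: team_x players boom 3 of 4 times, team_y 1 of 4.
groups = ["team_x"] * 4 + ["team_y"] * 4
labels = [1, 1, 1, 0, 1, 0, 0, 0]
weights = reweighing_weights(groups, labels)

rate_x = weighted_positive_rate("team_x", groups, labels, weights)
rate_y = weighted_positive_rate("team_y", groups, labels, weights)
```

After weighting, both teams have the same favorable label rate, which is the property a classifier trained on the weighted examples would then see.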

The second option is a fair in-process algorithm. The classifier is trained with both bias and accuracy objectives to find the best combination of hyperparameters and model topologies. After the model is trained, it can be evaluated with the original data set based on the chosen objective function. This method can be one of the most complex choices and requires a rebuild and a potential redesign of the model. However, the in-process algorithm can also be one of the most effective solutions. An example of an in-process mitigation algorithm is adversarial debiasing. This algorithm learns a classifier that maximizes accuracy and reduces an adversary’s ability to determine the protected attribute from the prediction.

The method that we selected for Fantasy Insights with Watson was the fair post-processor. The debias algorithm trains a post-processor to adjust the output of an unfair classifier so that it is fair. Like traditional machine learning, we train and test a classifier based on an objective function. The predicted data set can be used as test data, while the original data set with a validation set can be used for training with a bias objective function. With this method, we do not need to retrain the original model. An example post-processor is reject option-based classification, which changes predictions from a classifier to make them fairer.
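To give a feel for how a post-processor can change predictions without retraining, here is a simplified Python sketch of the reject option idea. Scores that fall close to the decision threshold are treated as uncertain, and in that band the unprivileged group receives the favorable label while the privileged group receives the unfavorable one. The fixed margin of 0.1, the function name, and the data are assumptions for illustration. A real implementation would search for the margin that best satisfies a fairness metric.

```python
def reject_option_classify(scores, privileged, threshold=0.5, margin=0.1):
    """Relabel uncertain predictions in favor of the unprivileged group."""
    labels = []
    for score, is_privileged in zip(scores, privileged):
        if abs(score - threshold) <= margin:
            # Uncertain band: favor the unprivileged group.
            labels.append(0 if is_privileged else 1)
        else:
            # Confident prediction: keep the classifier's decision.
            labels.append(1 if score > threshold else 0)
    return labels

# Hypothetical boom scores for four players.
scores = [0.9, 0.55, 0.45, 0.2]
privileged = [True, True, False, False]
adjusted = reject_option_classify(scores, privileged)
```

Only the two middle players, whose scores sit inside the uncertain band, have their labels changed. The confident predictions at 0.9 and 0.2 are left alone.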

The open source AI Fairness 360 toolkit provides several post-processing algorithms within the aif360.algorithms.postprocessing package. The calibrated equal odds post-processing algorithm was implemented within the overall debias pipeline that is described in Figure 1. Privileged and unprivileged groups were established with biographical information such as team, height, weight, age, and years of experience. Through experimentation, we found that team was the most acutely unfair predictor! The post-processor optimizes over the classifier's score outputs to find the probabilities with which to change the output classes.
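The core idea behind this kind of post-processor can be sketched in plain Python. The example below is a simplified illustration, not the AI Fairness 360 implementation: it equalizes only a generalized false negative rate, it fits and applies on the same data instead of using a separate validation set, and all names and values are hypothetical. The group that the classifier already serves better has a fraction of its scores replaced by the group's base rate, which moves its cost toward that of the other group.

```python
import random
from statistics import fmean

def generalized_fnr(labels, scores):
    """Mean probability mass withheld from truly favorable examples."""
    return fmean([1.0 - s for y, s in zip(labels, scores) if y == 1])

def mix_rate(labels, scores, other_cost):
    """Probability of replacing a score in this group with its base rate."""
    own_cost = generalized_fnr(labels, scores)
    trivial_cost = 1.0 - fmean(labels)  # cost of always predicting the base rate
    if other_cost <= own_cost or trivial_cost <= own_cost:
        return 0.0  # this group is not the better-served one
    return min(1.0, (other_cost - own_cost) / (trivial_cost - own_cost))

def equalize_fnr(priv, unpriv, seed=0):
    """priv and unpriv are (labels, scores) pairs; returns adjusted scores."""
    rng = random.Random(seed)
    costs = (generalized_fnr(*priv), generalized_fnr(*unpriv))
    adjusted = []
    for (labels, scores), other_cost in ((priv, costs[1]), (unpriv, costs[0])):
        rate = mix_rate(labels, scores, other_cost)
        base_rate = fmean(labels)
        adjusted.append([base_rate if rng.random() < rate else s for s in scores])
    return adjusted

# Hypothetical boom labels and scores for two groups of players.
priv = ([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
unpriv = ([1, 1, 0, 0], [0.6, 0.5, 0.3, 0.2])
priv_rate = mix_rate(priv[0], priv[1], generalized_fnr(*unpriv))
new_priv_scores, new_unpriv_scores = equalize_fnr(priv, unpriv)
```

Because the replacement is random, individual adjusted scores depend on the seed. What the method fixes is the expected cost per group, not any single player's score.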


Because we selected the team membership for each player as the protected attribute with privileged values, we were able to debias player boom and bust percentages. Through an analysis of a group of NFL teams, we found that certain team memberships caused players to be underpredicted as a boom. Most of the players from these teams were predicted as unlikely to boom. After further analysis, we found that the articles and blogs about certain teams were overwhelmingly negative despite potentially positive play. The source data that was used to train the models, and that was later applied to them, was inherently biased.

Figure 4. Boom model fairness improvement

The players from these teams were most likely not in the boom class, as shown in Figure 4. The red bar depicts bias against players. After the post-processor was applied, the players maintained parity with their counterparts. The closer the objective function of the calibrated equal odds post-processing algorithm was to zero, the fairer the result. The green bar shows a fair result. As a result, we created a fair post-processor model that produced boom and bust predictions without team bias.

Time for fair play

You can now read and research the players that interest you while letting Watson debias your selections. You can give your favorite team the attention it deserves while still winning your fantasy football league.

The next article in this 4-part series will discuss the social sharing of your player cards. Get ready to send your friends some Watson smack talk. #WinWithWatson

Aaron Baughman