# Testing the Watson Personality Insights Machine Learning Model

## Introduction and background

Clients and business partners often ask us about the accuracy of the IBM Watson Personality Insights service and about the source of the personality characteristics that we infer from input text. Watson Personality Insights uses a machine learning model, which applies a regression algorithm to predict a score for each personality characteristic. A typical way to measure the accuracy of any machine learning model is to compare the scores it produces with those obtained from ground truth, and to report the agreement or deviations. Ground truth is information obtained through direct observation rather than by inference. To test the accuracy of the Watson Personality Insights service, we compared the personality characteristic scores predicted by our models with ground truth. As ground truth, we used psychometric survey-based scores for about 500 Twitter users.

## Limitations of psychometric survey-based scores for ground truth

Survey-based personality estimation relies on self-reporting, which might not always be a true reflection of one’s personality: some respondents misinterpret the intention of a question, some describe themselves the way they would like to be perceived instead of who they actually are, and some fill out a survey without taking the time to think about their answers. We took the following measures to counter some of the known limitations of surveys: we included attention-check questions in the survey and used them to filter responses, and we filtered out responses from participants who completed the surveys too quickly, that is, in less than 15 minutes. This approach is not perfect. No model that tries to understand humans is going to be perfect; however, administering psychometric surveys is the best method we currently have for collecting ground truth. Humans are too complex to be fully captured by any model, but we need some basis for measuring the success of our machine learning model.
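The two filtering measures described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: the `SurveyResponse` record, its field names, and the rule that a response must pass every attention check are assumptions made for the example. Only the 15-minute cutoff comes from the text.

```python
from dataclasses import dataclass

# Cutoff taken from the text: responses completed in under 15 minutes are dropped.
MIN_COMPLETION_MINUTES = 15.0


@dataclass
class SurveyResponse:
    # Hypothetical record layout; the field names are illustrative only.
    participant_id: str
    minutes_to_complete: float
    attention_checks_passed: int
    attention_checks_total: int


def is_valid(response: SurveyResponse) -> bool:
    """Keep a response only if it passed every attention check and was not rushed.

    Requiring *all* attention checks to pass is an assumption of this sketch.
    """
    passed_all_checks = (
        response.attention_checks_passed == response.attention_checks_total
    )
    took_enough_time = response.minutes_to_complete >= MIN_COMPLETION_MINUTES
    return passed_all_checks and took_enough_time


def filter_responses(responses: list[SurveyResponse]) -> list[SurveyResponse]:
    """Return only the responses that survive both filters."""
    return [r for r in responses if is_valid(r)]
```

Keeping the two conditions as separate named booleans makes it easy to report how many responses each filter removed.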

## Participants and sample representativeness

We conducted the study in two phases. In the first phase, we recruited 256 colleagues within IBM who each had a personal Twitter account with at least 200 tweets posted. In the second phase, to obtain a broader and more representative sample, we recruited 237 active Twitter users from Amazon’s Mechanical Turk. These 237 subjects also had at least 200 tweets written in English. The full sample contained 493 data points.

Our sample population selection was random. Additionally, to check the representativeness of our sample, we examined the distribution of each personality trait score from the psychometric surveys across all participants. All Big Five, Values, and Needs trait scores follow a normal distribution. On that basis, we concluded that there is no sampling bias in the sample population.
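The text does not say how normality was assessed. As one rough illustration, the sketch below computes sample skewness and excess kurtosis for a trait's survey scores; both are close to zero for normally distributed data. The function names and the `tolerance` value are assumptions made for this example, and a formal normality test would be a reasonable alternative.

```python
def central_moment(values, order):
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Third standardized moment; 0 for a symmetric distribution."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth standardized moment minus 3; 0 for a normal distribution."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


def looks_roughly_normal(values, tolerance=1.0):
    """Heuristic screen only; the tolerance is an arbitrary illustrative choice."""
    return (
        abs(skewness(values)) <= tolerance
        and abs(excess_kurtosis(values)) <= tolerance
    )
```

A screen like this would be run once per trait, on the survey scores of all participants.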

## Personality traits, ground truth collection, and our approach

We administered psychometric surveys to each participant to obtain ground truth. We asked each participant to take three psychometric tests: a 50-item Big Five test (derived from the International Personality Item Pool, or IPIP), a 52-item fundamental Needs test (developed by IBM), and a 21-item basic Values test (developed by Schwartz et al.). We also collected each participant’s tweets and inferred their personality characteristics by using our machine learning model. We then compared the two sets of scores to assess the accuracy of our model.

## Results

To study the accuracy of our model, we performed correlation analyses that compared the trait scores calculated by our Personality Insights machine learning model with the corresponding psychometric measures collected by administering the surveys to all 493 subjects. Because our three models (Big Five, Values, Needs) have multiple dimensions, we used the RV coefficient to examine the overall correlation. The RV coefficient is a multivariate generalization of the squared Pearson correlation coefficient and, like it, takes values between 0 and 1. It measures the closeness of two sets of points, each of which can be represented as a matrix.
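For readers unfamiliar with the RV coefficient, here is a minimal pure-Python sketch using its standard definition: with column-centered matrices X and Y (rows are participants, columns are trait dimensions), RV = tr(SxySyx) / sqrt(tr(Sxx²) · tr(Syy²)), where Sxy = XᵀY. The function names are illustrative, and this is not the code used in the study.

```python
import math


def center_columns(matrix):
    """Subtract each column's mean so that every column has mean zero."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    return [[row[j] - means[j] for j in range(n_cols)] for row in matrix]


def cross_product(a, b):
    """Return a^T b for two matrices with the same number of rows."""
    return [
        [sum(a[i][p] * b[i][q] for i in range(len(a))) for q in range(len(b[0]))]
        for p in range(len(a[0]))
    ]


def frobenius_sq(matrix):
    """Sum of squared entries; equals tr(M M^T)."""
    return sum(v * v for row in matrix for v in row)


def rv_coefficient(x, y):
    """RV coefficient between two score matrices with matching rows."""
    xc, yc = center_columns(x), center_columns(y)
    s_xy = cross_product(xc, yc)
    s_xx = cross_product(xc, xc)
    s_yy = cross_product(yc, yc)
    # tr(Sxy Syx) is the squared Frobenius norm of Sxy; Sxx and Syy are
    # symmetric, so tr(Sxx^2) and tr(Syy^2) are squared Frobenius norms too.
    return frobenius_sq(s_xy) / math.sqrt(frobenius_sq(s_xx) * frobenius_sq(s_yy))
```

When each matrix has a single column, this reduces to the squared Pearson correlation between the two columns, which is the sense in which the RV coefficient generalizes it.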

We found that, for all three models, the derived scores are significantly correlated with the survey-based scores (p-value < 0.05). To understand whether our derived scores improve on randomly generated scores, we also compared the survey-based scores with randomly generated scores for each trait. Specifically, for each user, we replaced every trait score in Big Five, Values, and Needs with a random number between 0 and 1. The correlation between the random scores and ground truth was not statistically significant, with a p-value of around 0.5. This demonstrates that our models show a significant improvement over a random generator.
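The random-baseline comparison can be illustrated for a single trait as follows. This is a sketch under stated assumptions: the text does not say which significance test was used, so a permutation test on the Pearson correlation is swapped in here, and all function names, the seed, and the number of permutations are illustrative.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Two-sided p-value: share of shuffles with |r| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    # The add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


def random_baseline(n_users, seed=0):
    """One random score in [0, 1) per user, as in the baseline described above."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_users)]
```

In use, `permutation_p_value(survey_scores, derived_scores)` would be compared with `permutation_p_value(survey_scores, random_baseline(len(survey_scores)))` for each trait, where `survey_scores` and `derived_scores` are hypothetical lists holding one value per participant.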

We also assessed accuracy from the users’ point of view. Participants were asked to rate, on a five-point scale, how well each derived characteristic matched their perception of themselves. Their ratings suggest that the inferred characteristics largely matched their self-perceptions. Figure 1 shows that the means of all ratings were above 3 (“somewhat”) out of 5 (“perfect”). For Big Five, the mean rating was 3.3 with a standard deviation of 1.18; for Needs, it was 3.37 with a standard deviation of 1.22; and for Values, it was 3.4 with a standard deviation of 1.18.

## Conclusion

Our study demonstrated that all Big Five personality traits (openness, conscientiousness, extraversion, agreeableness, and neuroticism), plus the Needs and Values traits, that we derive from Twitter text show a statistically significant correlation with the corresponding traits measured by psychometric surveys. Based on this result, we concluded that inferring a person’s personality from Twitter text is an acceptable proxy for administering surveys. The advantages of Twitter text over surveys are that it is more scalable, is less time consuming, and removes researcher imposition, that is, the risk that a researcher misses asking an important question or that a respondent misinterprets the intention of a question.