The Analyze open medical datasets to gain insights code pattern is for anyone new to Data Science Experience (DSX) and machine learning who is also interested in social justice or health issues. The code pattern guides beginner data scientists through running various machine learning classifiers and comparing their outputs with evaluation metrics.
The pattern uses data from Kaggle’s opioid project, which supplies a small dataset from 2014’s opioid statistics including values such as deaths by opioid overdose, type of prescriber, and the prescription. Using this small dataset, the pattern explains how to clean data and use it with machine learning classifiers as well as how to apply information to data that is relevant to current issues.
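To give a flavor of the cleaning step, here is a minimal, hypothetical sketch using pandas. The column names and values below are invented for illustration; the actual Kaggle files have their own schemas, so treat this as the general shape of the work rather than the pattern's exact code.

```python
import pandas as pd

# Toy stand-in for one of the Kaggle CSVs (column names are assumptions)
raw = pd.DataFrame({
    "State": ["WV", "NM", None, "KY"],
    "Deaths": ["152", "98", "47", "n/a"],
})

# Drop rows with no state, then coerce the death counts to numbers,
# turning unparseable entries like "n/a" into NaN so they can be dropped too
clean = raw.dropna(subset=["State"]).copy()
clean["Deaths"] = pd.to_numeric(clean["Deaths"], errors="coerce")
clean = clean.dropna(subset=["Deaths"])
print(clean)
```

Cleaning like this matters because scikit-learn classifiers expect numeric, NaN-free feature matrices.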
For context on why opioid prescribers are a relevant topic, it's important to know that opioid overdoses are an increasingly overwhelming problem in the United States, where tens of thousands of Americans die every year. And the problem is not going away. Rather, it is only getting worse. This holds especially true with the widening availability and use of fentanyl, a non-methadone synthetic that is 50 to 100 times more potent than morphine, according to the National Institute on Drug Abuse (NIDA). Due to this factor, as well as the highly addictive nature of opioids and the prevalence of their prescriptions, overdoses cause a shocking number of deaths each year. Though we, as data scientists, might not be able to single-handedly fix this problem, we can dive into the data and see what exactly is going on and what elements might lead to certain outcomes.
That is what this code pattern strives to do. It dives into the aforementioned Kaggle dataset, looking at how we can apply data science skills to bring greater social good, or at least greater social understanding. In the pattern, you'll first explore the data in a DSX notebook by learning how to clean the data and visualize a few initial findings in a variety of ways, including geographically, using PixieDust. You can see two examples below:
By looking at the previous map, you see that the Appalachia region and some of the Northeast states were most affected by opioid overdoses in 2014, whereas other states such as North and South Dakota were less affected. When you look at the bar graph, you can see that the most affected state was West Virginia, followed by New Mexico, New Hampshire, Kentucky, and Rhode Island. North Dakota, South Dakota, and Nebraska were clearly the least affected. Though this code pattern focuses on predicting prescribers, viewing the data visually like this gives you a better feel for it before you create the models, and shows where in the U.S. opioid overdoses are most concentrated.
After the initial exploration is complete, you can use the machine learning library scikit-learn to train several models on the data and determine which produce the most accurate predictions of opioid prescriptions. If you're unfamiliar with scikit-learn, it's a machine learning library commonly used by data scientists due to its ease of use. Specifically, the library gives you easy access to a number of machine learning classifiers that you can implement in relatively few lines of code. What's more, scikit-learn lets you visualize your output, showcasing your findings. Because of this, the library is often used in machine learning classes to teach what different classifiers do, much like the comparative output this code pattern highlights! The models chosen for this code pattern align with Kaggle's challenge, focusing on logistic regression, naive Bayes, random forest, gradient boosting, KNN, decision trees, LDA, a bagging classifier, and an ensemble method. By applying all of these models, you can see how each one performs and decide which to move forward with. The scope of this project does not go past an introduction to implementing machine learning models, but it provides the foundation for a potentially larger project. In fact, with access to even more data, one could construct an effective predictive model for future years of opioid prescriptions and prescribers.
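The comparative step above can be sketched as follows. This is not the pattern's exact notebook code: it uses a synthetic stand-in dataset from `make_classification` (the real pattern uses the cleaned Kaggle prescriber features), and it covers a representative subset of the listed classifiers plus a simple voting ensemble.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier, VotingClassifier)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned prescriber feature matrix and
# binary target (e.g., whether a provider is a frequent opioid prescriber)
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}
# Ensemble method: a majority vote over the individual classifiers
models["Voting Ensemble"] = VotingClassifier(
    estimators=[(name, clf) for name, clf in models.items()]
)

# Train each model and print its test-set accuracy for comparison
for name, clf in models.items():
    clf.fit(X_train, y_train)
    print(f"{name}: {clf.score(X_test, y_test):.3f}")
```

Printing one accuracy per model makes the side-by-side comparison trivial; the pattern goes further by also comparing richer metrics, as described below.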
After running the various classifiers, we find that the random forest, gradient boosting, and ensemble models perform best comparatively. This means that if we were to build a larger project, we could focus on these particular classifiers, building upon them to help predict opioid prescribers (given more years of data).
Going further, we evaluate the models with a set of metrics including a precision-recall score. We use precision-recall to help determine the success of the trial. Precision-recall scores represent a balance between high recall and high precision, where each prediction falls into one of four outcomes: true positive, true negative, false positive, or false negative. The more successful your classifier, the fewer false positives and false negatives you have, and the more true positives and true negatives. The precision-recall score summarizes this balance as a single number, and the precision-recall curve helps to visualize it. In the following graph, the y axis is precision and the x axis is recall. A flat line across the middle would correspond to a no-skill classifier on a balanced dataset, that is, random guessing at roughly 50% precision. Given this, a curve falling below that middle line would mean poor performance, whereas a curve above it would mean more accurate, higher-quality performance. For example, a score of 1.0, with the curve all the way at the top, would be a perfect classifier, and a score near 0 would be completely inaccurate. Because our classifier scores 0.84, we can feel confident that our precision was of good quality. However, there is room for improvement.
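A hedged sketch of this evaluation step, again on a synthetic stand-in dataset rather than the pattern's actual prescriber data: `average_precision_score` condenses the precision-recall curve into the single score discussed above, and `precision_recall_curve` returns the points you would plot to draw the curve.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, precision_recall_curve

# Synthetic stand-in for the cleaned prescriber data
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

# Single-number summary of the precision-recall trade-off
ap = average_precision_score(y_test, probs)
print(f"average precision: {ap:.2f}")

# Points for the curve: recall on the x axis, precision on the y axis
precision, recall, _ = precision_recall_curve(y_test, probs)
```

The `precision` and `recall` arrays can then be plotted with matplotlib to produce a curve like the one shown in the notebook.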
With that information, I encourage you to build upon this pattern to help bring awareness to the opioid overdose crisis, and perhaps even to predict what is to come. I hope you'll take the time to check out my code, follow along, build upon it, and beat the score!