Open Data is freely available, which means you can modify, store, and use it without any restrictions. Governments, academic institutions, and publicly focused agencies are the most common providers of open data. They typically share things like environmental, economic, census, and health data sets. You can learn more about open data from The Open Data Institute or from wikipedia. Two great places to start browsing are data.gov and data.gov.uk where you can find all sorts of data sets. Other good sources are the World Bank, the FAO, eurostat and the bureau for labor statistics. If you’re interested in a specific country or region, just do a quick Google search, and you’ll likely uncover other sources as well.
Open data can be a powerful analysis tool, especially when you connect multiple data sets to derive new insights. This tutorial features a notebook that helps you get started with analysis using pandas. Pandas is one of my favourite data analysis packages. It’s very flexible and includes tools that make it easy to load, index, classify, and group data.
In this tutorial, you will learn how to work with a
DataFrame in 3 basic steps:
- Load data from Analytics Exchange on Bluemix.
- Launch an Apache Spark service on Bluemix.
- Work with a Python notebook on Bluemix (join dataframes, clean, check, and analyse the data using simple statistical tools).
There’s now an easier way! New features available through IBM’s Data Science Experience make this tutorial faster and easier. Switch to the version of this tutorial that uses the Data Science Experience.
Data & Analytics on Bluemix
Bluemix (IBM’s cloud platform) includes Analytics Exchange, which features a selection of open data sets that you can download and use any way you want. It’s easy to get an account and grab some data:
- Login to Bluemix (or sign up for a free trial).
- From the menu at the top of any Bluemix screen, click Dashboard.
- Click the Data & Analytics tile.
- In the menu on the left side of the screen, click Exchange.
- At the top of the screen, in the Search box, type Life Expectancy. Click the Life expectancy at birth by country in total years data set.
- On the right side on the screen, click Apps & Notebooks to request a new access key.
- Click OK to agree to terms and conditions.
- Click Request a New Access Key.
- Click the key and copy the URL that appears. You’ll use it in a minute to load data into your python notebook.
- In the menu on the left side of the screen, click Services.
- Scroll down and click the New Service button.
- Find Apache Spark and click it.
- Click Choose Apache Spark, then click Create.
- Under Work with Notebooks and Spark click the Notebooks button.
- Click New Notebook.
- Click the From URL tab, give the notebook a name and in the Notebook URL field enter:
- Click Create Notebook.
Tip: If you don’t want to run the commands yourself, you can also just open the notebook in your browser and follow along: https://github.com/ibm-cds-labs/python-notebooks/blob/master/Analyse%20Open%20Data%20Sets%20Using%20Pandas%20in%20a%20Python%20notebook.ipynb
Load Data into a DataFrame
Paste the URL link/access key you copied from the Life Expectancy data set into the following code (replacing the <LINK-TO-DATA> string). Then run the following code to load the data in a dataframe. This code keeps 3 columns and renames them.
import pandas as pd import numpy as np # life expectancy at birth in years life = pd.read_csv("<LINK-TO-DATA>",usecols=['Country or Area','Year','Value']) life.columns = ['country','year','life'] life[0:5]
Life expectancy figures might be more meaningful if we combine them with other data sets from the Analytics Exchange. Let’s start by loading data set Total Population by country and year. To do so, enter the following code and replace <LINK-TO-DATA> wherever it appears with your data URL/access key. Then run the code.
# population population = pd.read_csv("<LINK-TO-DATA>",usecols=['Country or Area', 'Year','Value']) population.columns = ['country', 'year','population'] print "Nr of countries in life:", np.size(np.unique(life['country'])) print "Nr of countries in population:", np.size(np.unique(population['country']))
Nr of countries in life: 246 Nr of countries in population: 277
These two data sets don’t fit together perfectly. For instance, one lists more countries than the other. When we join the two dataframes we’re sure to introduce nulls or NaNs into the new dataframe. We’ll use the pandas
merge function to handle this problem. This function includes many options. In the following code,
how='outer' makes sure we keep all data from
on=['country','year'] specifies which columns to perform the merge on.
df = pd.merge(life, population, how='outer', sort=True, on=['country','year']) df[400:405]
|400||Antigua and Barbuda||1998||72.973780||74206.0|
|401||Antigua and Barbuda||1999||73.186024||76041.0|
|402||Antigua and Barbuda||2000||73.397293||77648.0|
|403||Antigua and Barbuda||2001||73.606073||78972.0|
|404||Antigua and Barbuda||2002||73.813390||80030.0|
We can add more data to the dataframe in a similar way. Return to the Analytics Exchange and for each data set in the following list, look up the link/keys for the data on Analytics Exchange and copy these into the code (again replacing the <LINK-TO-DATA> string with the corresponding URL/key).
# poverty (%) poverty = pd.read_csv("<LINK-TO-DATA>",usecols=['Country or Area', 'Year','Value']) poverty.columns = ['country', 'year','poverty'] df = pd.merge(df, poverty, how='outer', sort=True, on=['country','year']) # school completion (%) school = pd.read_csv("<LINK-TO-DATA>",usecols=['Country or Area', 'Year','Value']) school.columns = ['country', 'year','school'] df = pd.merge(df, school, how='outer', sort=True, on=['country','year']) # employment employmentin = pd.read_csv("<LINK-TO-DATA>",usecols=['Country or Area','Year','Value','Sex','Subclassification']) employment = employmentin.loc[(employmentin.Sex=='Total men and women') & (employmentin.Subclassification=='Total.')] employment = employment.drop('Sex', 1) employment = employment.drop('Subclassification', 1) employment.columns = ['country', 'year','employment'] df = pd.merge(df, employment, how='outer', sort=True, on=['country','year']) # births attended by skilled staff (%) births = pd.read_csv("<LINK-TO-DATA>",usecols=['Country or Area', 'Year','Value']) births.columns = ['country', 'year','births'] df = pd.merge(df, births, how='outer', sort=True, on=['country','year']) # measles immunization (%) measles = pd.read_csv("<LINK-TO-DATA>",usecols=['Country or Area', 'Year','Value']) measles.columns = ['country', 'year','measles'] df = pd.merge(df, measles, how='outer', sort=True, on=['country','year']) df[0:50]
The resulting table looks kind of strange, as it contains incorrect values, like numbers in the
country column and text in the
year column. You can manually remove these errors from the dataframe. Also, we can now create a multiindex with country and year.
df2=df.drop(df.index[0:40]) df2 = df2.set_index(['country','year']) df2[0:10]
If you are curious about other variables, you can keep adding data sets from Analytics Exchange to this dataframe. Be aware that not all data is equally formatted and might need some clean-up before you add it. Use the code samples you just read about, and make sure you keep checking results with a quick look at each of your tables when you load or change them with commands like
Check the Data
You can run a first check of the data with
describe(), which calculates some basic statistics for each of the columns in the dataframe. It gives you the number of values (count), the mean, the standard deviation (std), the min and max, and some percentiles.
At this point, we have enough sample data to work with. Let’s start by finding the correlation between different variables. First we’ll create a scatter plot, and relate the values for two variables of each row. In our code, we also customize the look by defining the font and figure size and colours of the points with matplotlib.
import matplotlib.pyplot as plt %matplotlib inline plt.rcParams['font.size']=11 plt.rcParams['figure.figsize']=[8.0, 3.5] fig, axes=plt.subplots(nrows=1, ncols=2) df2.plot(kind='scatter', x='life', y='population', ax=axes, color='Blue'); df2.plot(kind='scatter', x='life', y='school', ax=axes, color='Red'); plt.tight_layout()
The figure on the left shows that increased life expectancy leads to higher population. The figure on the right shows that the life expectancy increases with the percentage of school completion. But the percentage ranges from 0 to 200, which is odd for a percentage. You can remove the outliers by keeping the values within a specified range
df2[df2.school>100]=float('NaN'). Even better, would be to check where these values in the original data came from. In some cases, a range like this could indicate an error in your code somewhere. In this case, the values are correct, see the description of the school completion data.
We don’t have data for all the exact same years. So we’ll group by country (be aware that we lose some information by doing so). Also because variables are percentages, we’ll convert our employment figures to percent. Probably, we no longer need the
population column, so let’s drop it. Then we create scatter plots from the dataframe using
scatter_matrix, which creates plots for all variables and also adds a histogram for each.
from pandas.tools.plotting import scatter_matrix # group by country grouped = df2.groupby(level=0) dfgroup = grouped.mean() # employment in % of total population dfgroup['employment']=(dfgroup['employment']*1000.)/dfgroup['population']*100 dfgroup=dfgroup.drop('population',1) scatter_matrix(dfgroup,figsize=(12, 12), diagonal='kde')
You can see that the data is now in a pretty good state. There are no large outliers. We can even start to see some relationships: life expectancy increases with schooling, employment, safe births, and measles vaccination. You are deriving insights from the data and can now build a statistical model–for instance, have a look at an ordinary least squares regression (OLS) from StatsModels.
In this tutorial, you learned how to use open data from Analytics Exchange in a Python notebook. You saw how to load, clean and explore data using pandas. As you can see from this example, data analysis entails lots of trial and error. This experimentation can be challenging, but is also a lot of fun!