What happens in your business depends upon what happens in the world around you. Population demographics, weather patterns, and global economic and social trends can all influence your decision making. For decades, big businesses have leveraged this knowledge, building “big data” systems that blend customer location with demographic data to show who will buy what, and how much. In the past, only big business could afford this type of analysis: data from specialized providers, hardware and software to process that data, and data science expertise are all extremely expensive.
But today, you can mitigate these costs by:
- Using free, open US Census data (pre-processed by your friendly IBM Developer Advocates)
- Deploying on the cloud to minimize hardware and software costs
- Re-using community-developed Python analytics programs, delivered in Jupyter notebooks
Sample Project: Simple Shopper Insights
I’ve built a basic analysis package in a Python Jupyter Notebook that makes customer demographic profiling accessible to startups and small-to-medium enterprises. If you know a bit of Python, you can understand this notebook, and even customize it with your own data. All you need is a CSV file with your customers’ zip codes.
This blog post walks you through data gathering, merging, and analysis. To follow along, open Simple Shopper Insights in nbviewer. This notebook’s also available on:
Tip: Contribute to this project. Pull requests welcome!
The good news is there’s a huge repository of demographics data available in the US for free. The bad news is that the repository is so huge it’s incomprehensible to a beginner. In an earlier blog post, I talked about how painful it is to get Census data directly from the US Census Bureau’s web site. Their offerings are so complex, they even offer courses. We’ve done the heavy lifting for you and compiled what we think is the most useful subset of the data into a few files aggregated to the zip code level.
You can find these data sets on IBM’s Analytics Exchange, a data portal that offers lots of open data sets that load easily into IBM analytics products, like DashDB, Watson Analytics, Cloudant, and DataWorks. This notebook uses API access, which returns the raw CSV file. This format is great for Python’s popular Pandas module, which we’ll use for statistical analysis.
US Census data in Analytics Exchange:
To customize this notebook to work with your own account, you’d get your API key (your personal URL for the data file) and enter it where the notebook calls the data. For example, you’d replace `os.environ['AE_KEY_AGE']` with your URL for the Age data CSV file.
In the notebook’s Census data section, you’ll see lists of the demographic stats (by code and name) we deemed useful and simple enough to be helpful.
The Pandas `read_csv` function lets us access each CSV and extract only the columns we want:
```python
census_age_df = pd.read_csv(os.environ['AE_KEY_AGE'], usecols=['GEOID', 'B01002e1'])
census_age_df.columns = ['GEOID', 'AGE']
...
census_income_df = pd.read_csv(os.environ['AE_KEY_INCOME'], usecols=['GEOID', 'B19049e1'])
census_income_df.columns = ['GEOID', 'INCOME']
...
census_race_df = pd.read_csv(os.environ['AE_KEY_RACE'],
    usecols=['GEOID', 'B02001e1', 'B02001e2', 'B02001e3', 'B02001e4',
             'B02001e5', 'B02001e6', 'B02001e7', 'B02001e8', 'B03001e3'])
...
census_edu_df = pd.read_csv(os.environ['AE_KEY_EDUCATION'],
    usecols=['GEOID', 'B15003e1', 'B15003e2', 'B15003e3', 'B15003e4', 'B15003e5',
             'B15003e6', 'B15003e7', 'B15003e8', 'B15003e9', 'B15003e10',
             'B15003e11', 'B15003e12', 'B15003e13', 'B15003e14', 'B15003e15',
             'B15003e16', 'B15003e17', 'B15003e18', 'B15003e19', 'B15003e20',
             'B15003e21', 'B15003e22', 'B15003e23', 'B15003e24', 'B15003e25'])
```
The median income, median age, and race properties are in good shape, but education is more complex than we want, so we do some data munging to simplify, grouping people into four categories:
- no high school diploma
- high school diploma and/or some college
- at least a bachelor’s degree
- graduate degree
Finally, we divide the education and race counts by the total population of each zip code, to get percentage values. (Doing so solves the problem of sheer numbers weighting big zip codes more heavily than small ones.) We don’t have to do anything to median age and income, since medians are already proportional values.
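Here’s a minimal sketch of that normalization step, using made-up counts and simplified column names (the real notebook works with the full B15003/B02001 column sets):

```python
import pandas as pd

# Tiny made-up Census-style frame: total population plus two derived
# counts per zip code (all values are hypothetical).
census_df = pd.DataFrame({
    'GEOID': ['86000US10001', '86000US60601'],
    'TOTAL': [1000, 2000],   # total population of the zip code
    'BA':    [300, 400],     # count with at least a bachelor's degree
    'WHITE': [600, 900],     # count identifying as white
})

# Convert raw counts to proportions so large zip codes don't dominate.
for col in ['BA', 'WHITE']:
    census_df[col] = census_df[col] / census_df['TOTAL']

print(census_df[['GEOID', 'BA', 'WHITE']])
```

After this, every education and race column is a per-capita fraction, directly comparable across zip codes of any size.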
This work generates a simple set of key census characteristics by zip code, listed in the table at the end of the Census section.
Here’s where you’ll add your sales data into the mix. For this article, I’m using a sample sales file I found online in czuriaga’s bigml data gallery. It has sales of some anonymous products and includes a customer zip code for each sale. This data set is small enough to store and access on GitHub.
In a few steps, we can glean the important info: we grab the `US Zip` and `Country` columns and filter for only `Product1`. We create a new column, `GEOID`, and populate it with the zip code prefixed with the string ‘86000US’ (to match the zip code format in the Census data). Then we drop all columns except `GEOID`.
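Those steps can be sketched like this; the rows here are made up, and the `Product` column name is my stand-in for however the bigml file actually labels products:

```python
import pandas as pd

# Hypothetical sample resembling the bigml sales file.
sales_df = pd.DataFrame({
    'US Zip':  ['10001', '60601', '94105'],
    'Country': ['United States'] * 3,
    'Product': ['Product1', 'Product2', 'Product1'],
})

# Keep only Product1 sales, then build a GEOID that matches the
# Census files' zip code format by prefixing '86000US'.
product1_df = sales_df[sales_df['Product'] == 'Product1'].copy()
product1_df['GEOID'] = '86000US' + product1_df['US Zip']

# Drop everything except the join key.
product1_df = product1_df[['GEOID']]
print(product1_df)
```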
Now we join the Census data to the sales data. As you see in the Merge sales to Census section, we use a Pandas `join` to create a data set called `saleszips`. Then, under Identify sales zips, we take only the zip codes that have strong sales. The idea here is to get rid of outliers in the data set so that you’re looking only at strong demographic trends. This sample data set is pretty sparse (only 400 sales across the entire nation), so we take any zip code with more than one sale.
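A minimal sketch of that join-and-filter idea, with made-up data (the notebook’s actual join may differ in its details):

```python
import pandas as pd

# Hypothetical per-zip Census stats, indexed by GEOID for joining.
census_df = pd.DataFrame({
    'GEOID':  ['86000US10001', '86000US60601'],
    'INCOME': [65000, 48000],
}).set_index('GEOID')

# Hypothetical Product1 sale records, one row per sale.
sales_df = pd.DataFrame({
    'GEOID': ['86000US10001', '86000US10001', '86000US60601'],
})

# Count sales per zip, then join the Census stats onto those counts.
counts = sales_df.groupby('GEOID').size().rename('SALES')
saleszips = counts.to_frame().join(census_df)

# Keep only zips with more than one sale, dropping sparse outliers.
highsaleszips = saleszips[saleszips['SALES'] > 1]
print(highsaleszips)
```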
We save the results in a data set called `highsaleszips`, then use the Pandas `describe` function to save some important descriptive statistics about `highsaleszips` (mean, min, max, and the 40th, 50th, and 60th percentile cutoffs) in a new data set we call `highsaleszipssum_df`.
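If you haven’t used it before, `describe` accepts custom percentiles; here’s how it produces those cutoffs, with made-up income values:

```python
import pandas as pd

# Hypothetical income values for five high-sales zip codes.
highsaleszips = pd.DataFrame({'INCOME': [50000, 60000, 70000, 80000, 90000]})

# describe() returns count/mean/std/min/max plus whatever percentiles
# you request -- here the 40th, 50th, and 60th cutoffs.
highsaleszipssum_df = highsaleszips.describe(percentiles=[.4, .5, .6])
print(highsaleszipssum_df)
```

The resulting frame is indexed by labels like `'40%'`, `'50%'`, and `'60%'`, which is exactly how the filter later in the notebook looks them up.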
Now we’re set up to perform some meaningful analysis. In the Insights section, we start by returning to the nationwide Census data and generating the same descriptive statistics we produced for `highsaleszips` (mean, min, max, and the 40th, 50th, and 60th percentile cutoffs). Then we subtract the nationwide values from the `highsaleszips` values, giving us a clear picture of how demographics in our high-sales zip codes diverge from national averages.
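A sketch of that subtraction, using two small made-up samples in place of the real national and high-sales data:

```python
import pandas as pd

# Hypothetical values: national incomes vs. high-sales zip incomes.
national = pd.DataFrame({'INCOME': [40000, 50000, 60000, 70000, 80000]})
highsales = pd.DataFrame({'INCOME': [50000, 60000, 70000, 80000, 90000]})

# Same descriptive statistics for both, then element-wise subtraction.
national_sum = national.describe(percentiles=[.4, .5, .6])
highsales_sum = highsales.describe(percentiles=[.4, .5, .6])

# A positive difference means high-sales zips exceed the national figure.
diff = highsales_sum - national_sum
print(diff.loc['50%', 'INCOME'])   # the median income gap
```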
We learn four main facts about zip codes in which we have sales:
- Incomes are much higher
- The education level is significantly higher
- People are a little younger
- The proportion of whites to other races is a little lower
Now that we better understand our customer profile, we can run a query to identify other zip codes that are similar to those where we’ve had success. That list can inform future marketing decisions, like where to target potential retail outlets, advertising, and direct mail campaigns.
First we pull out the zip codes whose income, age, bachelor’s-degree proportion, and white proportion all fall between the 40th and 60th percentile cutoffs of our high-sales zip codes.
```python
censusgood_df = census_df[
    (census_df.INCOME > highsaleszipssum_df.loc['40%', 'INCOME']) &
    (census_df.INCOME < highsaleszipssum_df.loc['60%', 'INCOME']) &
    (census_df.AGE    > highsaleszipssum_df.loc['40%', 'AGE']) &
    (census_df.AGE    < highsaleszipssum_df.loc['60%', 'AGE']) &
    (census_df.BA     > highsaleszipssum_df.loc['40%', 'BA']) &
    (census_df.BA     < highsaleszipssum_df.loc['60%', 'BA']) &
    (census_df.WHITE  > highsaleszipssum_df.loc['40%', 'WHITE']) &
    (census_df.WHITE  < highsaleszipssum_df.loc['60%', 'WHITE'])
]
```
Then we extract just the zip codes from this result.
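The notebook has the exact command; as a sketch, assuming GEOID values in the ‘86000US’ prefix format shown earlier:

```python
import pandas as pd

# Hypothetical rows that survived the percentile filter above.
censusgood_df = pd.DataFrame({
    'GEOID':  ['86000US10001', '86000US94105'],
    'INCOME': [72000, 75000],
})

# Strip the '86000US' prefix to recover plain five-digit zip codes.
good_zips = censusgood_df['GEOID'].str.replace('86000US', '', regex=False)
print(list(good_zips))
```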
Of course, you can save this list as a spreadsheet, but to better communicate with your team, show the results on a handy interactive map. To generate a map, our notebook uses CartoDB, a mapping-as-a-service provider that makes it easy to display an interactive map. We take our list of zip codes, create SQL INSERT statements, and execute them using the CartoDB Python module. See individual cells for detailed commands.
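To show the shape of those INSERT statements, here’s a hedged sketch: the table name, column, account URL, and API key are all placeholders, and the actual network call is commented out since it needs a real CartoDB account.

```python
# Hypothetical list of promising zip codes from the previous step.
good_zips = ['10001', '94105']

# Build one INSERT per zip for a hypothetical 'good_zips' table
# with a single 'zip' column.
inserts = [
    "INSERT INTO good_zips (zip) VALUES ('{0}')".format(z)
    for z in good_zips
]

# Posting them via CartoDB's SQL API would look roughly like this
# (placeholders for account and key; endpoint format is an assumption):
# import requests
# for sql in inserts:
#     requests.post('https://YOUR_ACCOUNT.cartodb.com/api/v2/sql',
#                   data={'q': sql, 'api_key': 'YOUR_API_KEY'})

print(inserts[0])
```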
Hopefully, this gives you a glimpse of how powerful it is to tap Census and other open data sets, combine them with your own, and analyze everything with cloud-based software tools. This notebook only scratches the surface: we used a very simple method to identify good prospect zip codes, manually choosing our significant demographic characteristics (age, race, income, and education) and our percentile cutoffs (40th and 60th).
There are more sophisticated techniques. For example, you could apply automated machine learning to algorithmically figure out the most significant variables and best cutoff points. In future posts, we’ll look at some of these analytic options and see if they do a better job. Meanwhile, run through this notebook with your own data sets to see what you can discover.