We’re working to post Census data to the cloud for public use and analysis. It’s turning out to be a daunting task. Today, I’m sharing the story of our efforts, so you’ll know what we’re up to and up against. Included are some tips and tricks for working with large data sets.

The end goal is to get Census data into dashDB, where I can leverage its SQL and R analytic capabilities to build some useful derived data products.

Open Data

I’m passionate about (ok, obsessed with) the creation of a simple, low-cost way to provide general access to public data. I’ve created some one-off data sets like Boston crimes 2013 (which powers SafetyPulse) and the US CDC flu reports, but now I wanted to tackle one of the most important open data sets available, the US Census.

Why the Census? Demographic Insights!

Imagine if you had fast and easy access to Census data on IBM’s cloud-based data services. You could use and/or combine Census data with your own. You’d be empowered to make important decisions based on data, not conjecture. For example, a store could generate demographic profiles of its customers and use their spatial distribution to gain insights into critical business decisions, like:

  • where to put new stores
  • where to close old stores
  • in which markets advertising will have the best payoff
  • how to optimize distribution and warehousing

Big companies with deep pockets have been doing this kind of demographic analysis for decades. The cost of data acquisition and shaping alone can run into the hundreds of thousands of dollars, and that’s before you get to the analysis.

We hope to put these capabilities within your reach, and we’re working to deliver Census data pre-shaped and ready-for-use.

The Census challenge

US Census data is free for download on their web site, so why aren’t there already a bunch of cloud services offering demographic analysis? The answer is: Census data is extremely complex.

  1. Breadth: The U.S. Constitution mandates an official count of the population. The number is used to determine how many congressional and Electoral College representatives different areas get and also determines federal funding of various local programs. The Census groups data into various spatial chunks, like block groups, tracts, metropolitan areas, counties, and states. Knowing which level you want and how to get only that data is the first hurdle.
  2. Depth: Census originally counted heads, but the government soon realized that while they were going door-to-door, they could capture some additional details. Today the Census data includes race, education, sex, language, family size, occupation, plumbing, distance to work, and hundreds more variables. Culling through all these data points, understanding the naming conventions, and extracting what you need is a major challenge.
  3. Size: This isn’t data you can bring into Excel. You’ll get gigabyte-sized text files, and your favorite text editor will choke on them. Mine did. Even vi and emacs nearly ground to a halt. If you want to massage this data, dust off your Perl, awk, and sed skills (see the command-line sketch just after this list).
  4. Methodology: For reasons known only to graybeard statisticians, there are hundreds of Census products. For example, I wanted the most recent data. I found the latest is from 2013. Fine. But within that group, you must choose between 2013 ACS 1-year Estimates, 2011-2013 ACS 3-year Estimates, and 2009-2013 ACS 5-year Estimates. Which is better? I don’t know yet, but I’m sure there’s some grad student writing a paper on it. And don’t even ask me what PUMS data means.
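
As promised, here’s a minimal command-line sketch for sizing up one of those gigabyte text files without opening it in an editor. The file name g20135us.csv and the column numbers are hypothetical stand-ins for whatever you’ve downloaded.

# Quick, non-destructive ways to inspect a huge text file from the shell.
du -h g20135us.csv        # how big is it on disk?
wc -l g20135us.csv        # how many rows?
head -n 5 g20135us.csv    # peek at the header and first few records

# Pull just the columns you care about (hypothetically, columns 1 and 7 of a
# comma-separated file) into a much smaller working file.
awk -F',' '{ print $1 "," $7 }' g20135us.csv > subset.csv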

Choosing the data

To avoid some of these data-processing hurdles, I sought a shortcut. I knew I wanted the most recent data, in GIS format, including only the most popular statistics—not the full list of many thousand measures. So I found the 2013 American Community Survey (ACS).

What is the ACS? Until recently, the full Census ran every 10 years and took a few years to process and publish. A combination of budget cuts and time pressure pushed the Census Bureau to survey a sample of the population instead and release that data more quickly. That’s what the ACS data is.

I wanted only the most general statistics, but with a detailed spatial resolution, so our users could understand what’s going on at the neighborhood level in dense metropolitan areas. I found data aggregated at the block group level and offered in the Esri File Geodatabase format. It checked in at a whopping 7.3Gb.
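
Before committing to a workflow, it’s worth peeking inside a File Geodatabase like this from the command line. Here’s a rough sketch using GDAL’s ogrinfo, assuming you have GDAL installed with File Geodatabase read support; the .gdb name is a hypothetical placeholder for whatever the ACS download is called on your machine.

# List every layer in the geodatabase: the block group geometries plus the
# X00_..., X01_... attribute tables.
ogrinfo ACS_2013_5YR_BG.gdb

# Summarize one layer (-so = summary only): field names, feature count, extent.
ogrinfo -so ACS_2013_5YR_BG.gdb X01_AGE_AND_SEX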

Data shaping travails

I finally had my data set, but it wasn’t yet usable. The spatial data set ships with only a few administrative properties attached, giving you the state, tract, and block group codes. The actual survey data lives in separate tables with names like “X00_COUNTS”, “X01_AGE_AND_SEX”, etc. To get meaningful data, I needed to join those tables to the spatial data set.

Census data in QGIS

I turned to QGIS, a great open source tool for working with Esri File Geodatabases (if you don’t have ArcGIS), which helped me join the right columns to the spatial data.

Next, I had to format the data for import. I tried exporting to Shapefile, an import format dashDB supports, but QGIS crashed immediately on export. I suspect the culprit is the Shapefile format itself, which can’t handle hundreds of attribute columns, rather than a bug in QGIS, but I’m not sure.

So, I found an alternate route. I was able to output to GeoJSON, a format I could get into IBM Cloudant, which has a handy feature that lets you move your JSON database into a dashDB warehouse. Great! Now I had a plan. At this point, I had a 225,000 line, 2.5Gb JSON file that looked something like this:

{
  "type": "FeatureCollection",
  "features": [
    { "type": "Feature", "properties": { "B00001e1": 119, "B00002e1": 46, ... }, ... },
    { "type": "Feature", "properties": { "B00001e1": 359, "B00002e1": 169, ... }, ... },
    ...
  ]
}

That file’s smaller than the 7.3Gb File Geodatabase I started with, but is still pretty big.
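
If you want a quick sanity check on a file this size without opening it, a couple of shell one-liners go a long way. This is a rough sketch; census.json matches the file name I use in the script below, and the grep count is only an approximation that assumes one feature per line, which is how my export was laid out.

du -h census.json                          # confirm the size on disk
wc -l census.json                          # line count (roughly one feature per line)
grep -c '"type": "Feature"' census.json    # approximate feature count
head -c 500 census.json                    # peek at the first few hundred bytes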

Import to Cloudant

To import this file to Cloudant, I reached for my colleague Glynn Bird’s handy couchimport tool, which had served me well in the past. But in this case, the program—or probably the Node.js platform on which it’s written—gave me an out-of-memory error. So I split the file into smaller pieces, using another handy tool available on the Unix command line and aptly named split. I used 10,000-line files, creating chunks that were all under 100Mb. The problem was that splitting the file left me with invalid JSON, which was not OK with the import program. To fix it, I did some fancy shell scripting to create valid GeoJSON and then imported it into my Cloudant account.

Here’s the script and what it does:

  1. In a text file called census.json, I manually delete all the JSON surrounding the individual “feature” objects—namely the FeatureCollection header and the enclosing features array brackets, leaving a text file consisting of comma-separated feature objects like this:

    { "type": "Feature", "properties": { "B00001e1": 119, ... }
    { "type": "Feature", "properties": { "B00001e1": 359, ... }
    ...
  2. I create a new sub-directory called sm and use the Unix split command to create 10,000-line subfiles of census.json in it.

  3. The problem is, this isn’t valid JSON, and couchimport requires valid JSON to process it. To make each file valid GeoJSON, I loop through all the files in that directory and add { "features": [ back to the beginning of each file.

  4. I loop through all the files again, replacing the last comma with ] } to close the features array and the object.

  5. Finally, I loop through them all once more, running couchimport to move the data to Cloudant.

#!/bin/bash
# set these env vars at the shell
# COUCH_URL=https://$USERNAME:$PW@$ACCOUNT.cloudant.com
# COUCH_DATABASE=$DATABASE

# Step 2: split census.json into 10,000-line chunks inside ./sm
mkdir sm
cd sm
split -l 10000 ../census.json
cd ..

# Step 3: prepend '{ "features": [' to each chunk so it parses as JSON again
FILES=./sm/*
for f in $FILES
do
  echo "processing $f"
  sed '1 i\
{ "features": [
' "$f" > "$f.f"
  rm "$f"
done

# Step 4: replace the trailing comma on the last line with ' ] }' to close
# the features array and the enclosing object
FILES=./sm/*.f
for f in $FILES
do
  echo "processing $f"
  sed '$s/,$/ ] }/' "$f" > "$f.cen"
  rm "$f"
done

# Step 5: import each now-valid chunk into Cloudant
FILES=./sm/*
for f in $FILES
do
  echo "importing $f"
  cat "$f" | couchimport --type json --jsonpath "features.*" --parallelism 10
done

Figuring out how to replace the last comma in each file for Step 4 seems simple, but it had me scratching my head, scouring Stack Overflow, and searching my soul. The answer—at least on Mac OS X—is the ,$ in

sed '$s/,$/ ] }/' $f > $f.cen
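
Breaking that down: the leading $ is an address that limits the substitution to the last line of the file, and the ,$ in the pattern matches a comma only at the very end of that line, so just the final trailing comma gets swapped for ] }. You can watch it work on a tiny mock chunk (the file name and contents here are hypothetical stand-ins):

printf '{ "features": [\n{ "a": 1 },\n{ "a": 2 },\n' > mock.cen
sed '$s/,$/ ] }/' mock.cen
# prints:
# { "features": [
# { "a": 1 },
# { "a": 2 } ] }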

Hope the next poor person searching for a solution finds this article!

Couchimport is best known as a great tool for importing CSV text files into Cloudant, but as you see, it also does a great job with JSON, and you don’t have to import the entire JSON file. With the

--jsonpath "features.*"

switch, I could select all the items in the features array for import.

Census data in the Cloudant dashboard

Sync with dashDB

This was the easiest part. Cloudant lets you transform its JSON into dashDB’s data warehousing format with just a few clicks. On the Cloudant dashboard, click Create a Warehouse, enter your IBM id, create a warehouse, and choose your tables to transfer.

Data Warehousing in the Cloudant dashboard

Or, if you already have a dashDB warehouse, you can convert from the dashDB side instead. From the dashDB dashboard, select Load > Sync from Cloudant.

Cloudant sync in the dashDB dashboard

What’s next

Extraction and transformation of this Census data has been quite a journey. But now that it’s done, you will soon benefit from the work. In the coming weeks, I’ll have updates on getting this data product into your own account, and using dashDB’s analysis tools to create some interesting maps and reporting projects. Until then, enjoy this population map of Cedar Rapids, Iowa I made in CartoDB using a small subset of data from this project.

Cedar Rapids, IA Population
