Think 2021: New tools have the developer ecosystem and IBM building together Learn more

Learn about Extract, Transform, and Load (ETL)

This tutorial describes the analysis of United Kingdom crime data from inception to final results. Follow along to learn about data download, data transformation, loading into a distributed data warehouse, Apache Hive, and subsequent analysis using Apache Spark. Part 1 of this series describes the Extract, Transform, and Load (ETL) activities, and Part 2 explores the analyses performed.

Spark and Hive as alternatives to traditional ETL tools

Many ETL tools exist, but often require programmers to be familiar with proprietary architectures and languages. Apache Hive is a cloud-based data warehouse that offers SQL-based tools to transform structured and semi-structured data into a schema-based cloud data warehouse. Both Hive and Spark work with data within the Hadoop ecosystem and Hadoop Distributed File System (HDFS). Spark and the SparkSQL component are well suited to fast transformations of data within the data warehouse. Given the superior performance of SparkSQL compared with MapReduce, this provides a speed improvement over traditional ETL tools.

Figure 1. U.K. crime data download
Image shows U.K. crime data download

Extracting data from the U.K. Police database

The U.K. crime database was chosen to illustrate the speed and ease of use of Spark analytics with the Hive data warehouse. The program code and scripts for this tutorial are on GitHub.

There is more than one method to retrieve data from the U.K. crime database. The API download is suitable for analysis of local areas only. For this tutorial, use the CSV file download to analyze national data. This also lets you take advantage of a nice Hive feature during data load.

The figure below shows the data download from the U.K. crime dataset. I will illustrate the data route by selecting just London, but you may choose as many as you wish. I selected the Metropolitan Police Service (London) from September 2014 to August 2017.

On the custom download tab:

  • Select the date range of September 2014 through August 2017.
  • Select the forces Cambridgeshire Constabulary and Metropolitan Police. You can choose as many counties as you’d like, but for illustration purposes in this tutorial, only two counties are used. Scroll down and select the Include stop and search checkbox.
  • Click the Generate file option. It takes a few seconds to generate the file.
  • When complete, click Download now to download the data file.

Repeat this process for the street crime checkbox and the outcomes checkbox. This results in three files, each containing all the data for street crimes, outcomes, and stop and search. The figure below shows the three available dataset checkboxes available for download.

Figure 2. U.K. crime data download
image shows U.K. crime data download

After the download completes, you can find the file in your Download folder. Double-click the file to open the standard unzip utility on your computer. Create a folder where you want the unzipped files to live and name it UK-Crime.

The next step is cleaning the data and subsequently loading the data into Hive. Create a folder and name it staging. This folder is for the cleaned data to load into Apache Hive.

Data cleaning

The initial data download, when unzipped, results in a directory hierarchy with files stored in numerous subdirectories. Hive has a useful data load feature that ingests every file in a given directory when each file has the same structure (for example, the same column types). However, the data files provided by the U.K. Police use several file types in a directory hierarchy, by month and year, and by county. Therefore, the first step is to create a staging folder for each of the different file types.

To simplify this process, I created a Java™ program that parses the directory hierarchy and moves each file of a given type to the custom staging area for that file type. This simple Java utility program traverses the folder hierarchy of the unzipped data files and moves all the files of a given type into the staging area folder I created. This program copes with minor changes in the directory hierarchy that have been implemented by the U.K. police over time. All files in a staging area folder load into Apache Hive Warehouse in one step.

public class MoveFilesUtility {
    public static void main(String[] args) {
       //String mypath = "/home/hduser/zdata/uk‑crime/crime‑stop‑search";
       String mypath = "/home/hduser/zdata/uk‑crime/crime‑street‑crime";
       MoveFilesUtility o1 = new MoveFilesUtility();

     public void listFilesAndFilesSubDirectories(String directoryName){

        String dest_dir = "/home/hduser/zdata/uk‑crime/spark‑lake/staging/";
        File directory = new File(directoryName);
        //get all the files from a directory
        File[] fList = directory.listFiles();
        for (File file : fList){
            if (file.isFile()){
                moveUkCrimeFilesToDir( dest_dir, file.getAbsolutePath());
            } else if (file.isDirectory()){
      Copy uk crime file to another directory ( for hive data load
      @param directoryNameDestination to be listed
    public void moveUkCrimeFilesToDir(String directoryNameDestination, String mypath){
           File afile =new File(mypath);
           if(!afile.isFile()) return;
           if(afile.renameTo(new File(directoryNameDestination + afile.getName()))){           
            System.out.println("File move successful!");
            System.out.println("File  failed to move!  " + afile.getName() );
        }catch(Exception e){

You can copy and paste the Java code from the listing above or pull the code from GitHub.

Apache Hive and ETL

Apache Hive is a distributed data warehouse system built to work on Hadoop. It is used to query and manage large datasets that reside in HDFS storage. Hive provides a mechanism to project structure onto the data in Hadoop and HDFS and to query that data using a SQL-like language called HiveQL (HQL). Because Hive provides SQL-based tools to enable easy data extract, transform, and load, it makes sense to use HQL scripts to load data into Hive.

Loading data into Apache Hive

With Hive, you can load all files in a given directory as long as they have the same data structure. For the street crime data, create a subdirectory named staging_area. Make a note of the path on your computer and use that path in the Hive script.

Defining Hive tables

The Hive HQL file below creates the Hive data tables and loads all the data files in the directory staging_area into the table.

For subsequent analysis, it helps to understand the nature of the police crime dataset and how the data is collected. Files made available each month include:

  • Crime and Anti-Social Behavior (ASB) or Street Crime File — Contains street crimes, such as robberies, assaults, criminal damage.
  • Police Outcomes File — Investigations often last months, and this file will contain data and updates on events months, and sometimes years, past.
  • Court Results — Matching police data submitted to the government is matched against Ministry of Justice court hearing outcomes. Some crimes can be reclassified as investigations proceed; for others, latitude and longitude data is not necessarily precise, such as where a victim is unable to say exactly where a crime took place. The exact location of a crime is approximate to protect victims’ identities. Usually, this is chosen from a list of 750,000 reference points closest to the crime.

Typical data integration issues

There is no unique identifier for crimes that run from police service through criminal prosecution service to the courts themselves. The police use a fuzzy matching process to match these. Data uploaded is a snapshot of time; crimes are often reclassified as different types or confirmed as a false report after investigation. Following are the output files from the system and file formats.

Table 1. Street Crime File
Fields Description
Reported by The force that provided data
Falls within The force that provided data
Latitude Latitude of crime
Longitude Longitude of crime
LSOA code Lower Layer Super Output Area code
LSOA name Lower Layer Super Output Area name
Crime type Standard Crime classification
Last_Outcome_category A reference to whichever of the outcomes associated with the crime occurred most recently
Table 2. Stop and Search File
Fields Description
Reported by The force that provided data
Falls within The force that provided data
Latitude Latitude of crime
Longitude Longitude of crime
LSOA code Lower Layer Super Output Area code
LSOA name Lower Layer Super Output Area name
Crime type Standard Crime classification
Last_Outcome_category A reference to whichever of the outcomes associated with the crime occurred most recently

This file contains data for outcomes reported within the past month. The crimes may have occurred many months prior.

Table 3. Outcomes File
Fields Description
Offense Reference ReferenceID
Outcome Date Date of outcome
Outcome Category Category of outcome

Run Hive SQL scripts and display results

A Hive SQL (HQL) script is just a series of Hive query language commands. They are the same ones you would use in the Hive shell. This script file will consist of the Hive commands in an external HQL script and returning a value.

create table if not exists stop_and_search (
searchType STRING, searchDate STRING, part_police_operation STRING, police_operation STRING, latitude DOUBLE, longitude DOUBLE, gender STRING, age_range STRING, self_define_ethnicity STRING, officer_defined_ethnicity STRING, legislation STRING, object_of_search STRING, outcome STRING, outcome_linked_to_object STRING, removal_clothing STRING
row format delimited fields terminated by ',' 
stored as textfile;
load data local inpath '/home/hduser/zdata/uk‑crime/spark‑lake/staging/' into table stop_and_search;

It is efficient to enter a set of Hive commands into a file. This permits reuse of common commands, which are often lengthy definitions. This file contains Hive table definitions, followed by instructions to load the data files into the Hive table. You invoke the Hive SQL script file from the bash command line by typing the following command at the bash prompt:
$hive -f load_master_stop_and_search2.sql.

The above script loaded 1,601 files into Hive. This file is available on GitHub. A similar script is available on GitHub for the outcomes data and for the street crime data. Next, we use SparkSQL to organize and catalog the data and sort the resulting set from largest to smallest. The SparkSQL program is written in the Java programming language using the Spark Java API. The code is built using Maven. The Maven Project Object Model (POM) file is available of GitHub. The program is invoked from the bash command line with spark-submit:

$mvn package
$usr/local/spark/bin/spark‑submit –class “DemoHive” ‑‑master local[4] target/demo‑hive‑1.0.jar

The first command (mvn package) builds to Java program and integrates the Spark and Hive API functions automatically. This results in a Java JAR file as output. The second command runs the program, which interrogates the Hive data warehouse, performs calculations, and displays the results. An example output run is shown below.

Figure 3. Java SparkAPI program output
Image shows Java SparkAPI program output


In this tutorial, we showed the data loading, followed by transformations within the warehouse to integrate disparate data sources. This provides greater traceability of data changes, which remains a key factor in ensuring data quality. Without data quality and traceability, you cannot trust the accuracy of subsequent analyses.

The design of our database files for the analysis was taken after the data load of different CSV files from different sources, and the decisions were taken after checks of the data from the initial data loads. These checks were performed in iterative manner within the data warehouse itself.