Object storage is much cheaper per gigabyte than a database, scales to near-limitless capacity, and makes a great choice for storing backups and archived data.

In a previous blog post, I discussed storing time-series data in time-boxed Cloudant databases, allowing recent data to be kept in Cloudant while older data is archived and deleted. This tutorial examines how to query that archived data. The examples use data backed up to Object Storage and queried with the IBM SQL Query service.

Prerequisites

To complete this tutorial, you need the following tools:

  • An IBM Cloud account.
  • An IBM Cloudant database service provisioned within your IBM Cloud account. Make a note of the URL of your Cloudant service – you’ll need it later. Make sure your Cloudant service contains a database holding the data you want to archive.
  • An IBM Cloud Object Storage service provisioned within your IBM Cloud account. Create a bucket within this service making a note of the region and bucket name you have chosen.
  • An IBM Cloud SQL Query service provisioned within your IBM Cloud account.

Cloudant, Cloud Object Storage, and SQL schematic for analyzing data

Estimated time

This tutorial takes approximately 45 minutes to complete if you have already installed the prerequisites.

Steps

The following sections walk through getting the data into object storage, exploring the data, and converting the data to the Parquet format.

Step 1: Getting the Cloudant data onto Object Storage

We’re using the official couchbackup tool to extract data from Cloudant. It is a Node.js application and is therefore installed using the npm command-line utility:

npm install -g @cloudant/couchbackup

You can then store your Cloudant URL as an environment variable:

export COUCH_URL="<your Cloudant URL goes here>"
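
The exact URL format depends on how your Cloudant service is provisioned. As an illustration only – the hostname and credentials below are placeholders – a legacy-credentials URL embeds the username and password:

# example only: substitute your own credentials and hostname
export COUCH_URL="https://myusername:mypassword@myhost.cloudantnosqldb.appdomain.cloud"

If your service uses IAM authentication instead, see the couchbackup README for the supported authentication options.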

Assuming you are backing up a database called mydb, you invoke couchbackup with the following command:

# make a shallow backup of the database
couchbackup --db mydb --mode shallow > mydb.txt

A file called mydb.txt is created. Each line of that file is a JSON array of documents – one array element per document in the database:

[{"_id":"1"...},{"_id":"2"...},{"_id":"1"...},....]
[{"_id":"101"...},{"_id":"102"...},{"_id":"103"...},....]
[{"_id":"201"...},{"_id":"202"...},{"_id":"203"...},....]
...
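
As a quick sanity check – assuming you have the jq tool installed (it is also used in the next step) – you can count the batches and the number of documents in each one:

# one line per batch of documents
wc -l mydb.txt
# number of documents in each batch
jq 'length' mydb.txt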

Your first job is to remove the design documents from this file. The IBM SQL Query service expects flat JSON documents – those with top-level key/value pairs rather than nested objects. You can use the jq tool to filter out documents whose _id field starts with _design:

# remove design documents from the backup file
cat mydb.txt | jq -c 'map(select(._id | startswith("_design") | not))' > mydb2.txt

The jq command above is fed the file containing multiple arrays of objects. For each array, it iterates through the documents and excludes any that are design documents. The output is written to a second text file: mydb2.txt.
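
To confirm that the filter worked, count the design documents remaining in each batch of the new file – every line of output should be 0:

# count design documents per batch: expect 0 on every line
jq 'map(select(._id | startswith("_design"))) | length' mydb2.txt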

The final preparation step is to write each line of the file (each array of documents) to its own file, using the split command-line tool:

# split each line of mydb2.txt into its own file
split -l 1 mydb2.txt
# remove the original files
rm mydb.txt mydb2.txt

You should now have a number of files (xaa, xab, xac, and so on) that can be bulk uploaded into a Cloud Object Storage bucket – in this case, a bucket called mydb. One way of doing the upload is sketched below.
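
Cloud Object Storage is S3-compatible, so one option for the bulk upload is the AWS CLI pointed at your bucket’s regional endpoint. This is only a sketch: the endpoint shown is for eu-gb, and it assumes you have generated HMAC credentials for your Cloud Object Storage service and configured the AWS CLI with them.

# upload every split file to the 'mydb' bucket
for f in x??; do
  aws --endpoint-url https://s3.eu-gb.cloud-object-storage.appdomain.cloud \
      s3 cp "$f" "s3://mydb/$f"
done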

Step 2: Explore the data

Now that you have your JSON files in Cloud Object Storage, accessing them from IBM Cloud SQL Query is a breeze. You can access the data as if it were a SQL database:

-- explore the data
SELECT *
FROM cos://eu-gb/mydb/ STORED AS JSON
  • SELECT * – choose all columns.
  • FROM cos://eu-gb/mydb/ – from Cloud Object Storage, in the eu-gb region, from the mydb bucket.
  • STORED AS JSON – defines the file format of the stored data.

The output data is previewed in a table in the web page, which is helpful for exploring the data. Look at the data’s column headings and data types, as shown in the following screen capture:

IBM SQL Query screen capture
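
Once you know the column names, you can narrow the projection to just the fields you care about. As an example, the sample data set used later in this tutorial includes country, population, and latitude columns:

-- preview a handful of named columns
SELECT country, population, latitude
FROM cos://eu-gb/mydb/ STORED AS JSON
LIMIT 10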

Step 3: Converting to Parquet

Before you do any serious data exploration, it’s useful to complete one last data conversion: convert the JSON data into “Parquet” format. Parquet is a compressed, column-oriented format that comes from the Hadoop project. Data in this format is ideal for ad-hoc querying as it has a smaller data size and faster query performance.

Converting data to Parquet is a one-off operation that you can complete by running a single Cloud SQL Query statement:

-- convert all data in 'mydb' bucket into Parquet format
-- to be stored in the 'mydbparquet' bucket
SELECT *
FROM cos://eu-gb/mydb/ STORED AS JSON
INTO cos://eu-gb/mydbparquet/ STORED AS PARQUET

This example keeps the raw JSON data and the derived Parquet files in different Cloud Object storage buckets for neatness.

The result of this operation isn’t a table of data in the UI. It simply writes its results to the destination bucket. Exploring the bucket in the Cloud Object Storage UI reveals the resultant objects, as shown in the following screen capture:

Cloud Object Storage screen capture

Notice that the 3.2MB of JSON data now occupies only 782KB as a Parquet file.

You can now direct queries at the Parquet version of your data:

-- get top 10 populated countries below the equator
SELECT country, SUM(population)
FROM cos://eu-gb/mydbparquet/jobid=59463bdb-cd55-4df3-af7f-fcdf75fcccc8 STORED AS PARQUET
WHERE latitude < 0
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10

You can see the result in the following screen capture:

Parquet result screen capture
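
If you want to keep query results rather than just view them, SQL Query can also write them back to Object Storage. As a sketch – the mydbresults bucket here is an assumption, and you would create it first – the aggregation above could be persisted as CSV:

-- persist the aggregation to a results bucket as CSV
SELECT country, SUM(population)
FROM cos://eu-gb/mydbparquet/jobid=59463bdb-cd55-4df3-af7f-fcdf75fcccc8 STORED AS PARQUET
WHERE latitude < 0
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
INTO cos://eu-gb/mydbresults/ STORED AS CSV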

Summary

After you get the hang of exporting Cloudant data to Object Storage, converting it to Parquet, and analyzing it with Cloud SQL Query, the whole operation can be scripted to run automatically. Your time-boxed Cloudant data can be archived periodically and added to your data lake for analysis.
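
As a starting point for such a script, the steps from this tutorial chain together naturally. This is a minimal sketch: the bucket name and endpoint are assumptions, and it expects COUCH_URL to be exported and the AWS CLI to be configured with HMAC credentials, as described in Step 1.

#!/bin/bash
# sketch: archive one Cloudant database to Object Storage
DB="mydb"
ENDPOINT="https://s3.eu-gb.cloud-object-storage.appdomain.cloud"

# 1. shallow backup of the database (uses COUCH_URL)
couchbackup --db "$DB" --mode shallow > "$DB.txt"

# 2. strip design documents
jq -c 'map(select(._id | startswith("_design") | not))' "$DB.txt" > "$DB-filtered.txt"

# 3. one file per batch of documents
split -l 1 "$DB-filtered.txt"

# 4. upload each batch file to the bucket
for f in x??; do
  aws --endpoint-url "$ENDPOINT" s3 cp "$f" "s3://$DB/$f"
done

# 5. tidy up
rm "$DB.txt" "$DB-filtered.txt" x??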

Cloud SQL Query can do much more than is outlined in this tutorial: check out its SQL Reference, Geospatial toolkit, and Timeseries functions.