Glynn Bird | Published April 15, 2019
Object storage is much cheaper per gigabyte than a database, is endlessly extensible, and makes a great choice for storing backups and archived data.
In a previous blog post, I discussed storing time-series data in time-boxed Cloudant databases so that recent data can be kept hot while older data is archived and deleted. This tutorial examines how to query that archived and backed-up data. The examples use data backed up to Object Storage, queried with the IBM SQL Query service.
To complete this tutorial, you need the following tools:
This tutorial takes approximately 45 minutes to complete if you have already installed the prerequisites.
The following sections walk through getting the data into object storage, exploring the data, and converting the data to the Parquet format.
This tutorial uses the official couchbackup tool to extract data from Cloudant. It is a Node.js application and is therefore installed using the npm command-line utility:
npm install -g @cloudant/couchbackup
You can then store your Cloudant URL as an environment variable:
export COUCH_URL="<your Cloudant URL goes here>"
Assuming you are backing up a database called mydb, you invoke couchbackup with the following commands:
# make a shallow backup of the database
couchbackup --db mydb --mode shallow > mydb.txt
A file called mydb.txt is created. Each line of that file is an array of JSON documents – one per document in the database:
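For illustration, a single line of the backup file might look like the following. The field names and values here are hypothetical; real lines contain your own documents, each with its Cloudant `_id` and `_rev`:

```json
[
  {"_id": "abc123", "_rev": "1-xyz", "temperature": 21.5, "timestamp": "2019-04-01T00:00:00Z"},
  {"_id": "def456", "_rev": "1-abc", "temperature": 22.1, "timestamp": "2019-04-01T00:01:00Z"}
]
```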
Your first job is to remove the design documents from this file. The IBM SQL Query service expects flat JSON documents – top-level key/value pairs with no nested objects. You can use the jq tool to filter out documents whose _id field starts with _design:
# remove design documents from the backup file
cat mydb.txt | jq -c 'map(select(._id | startswith("_design") | not))' > mydb2.txt
The above jq command is fed the file containing multiple arrays of objects. For each array, it iterates through each document and excludes those that are design documents. The output is fed to a second text file: mydb2.txt.
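You can verify the filter on a small, made-up sample before running it against a real backup. The documents below are hypothetical; only the `_id` convention matters:

```shell
# a hypothetical backup line: one design document and one ordinary document
echo '[{"_id":"_design/views"},{"_id":"doc1","temperature":21.5}]' \
  | jq -c 'map(select(._id | startswith("_design") | not))'
# → [{"_id":"doc1","temperature":21.5}]
```

The design document is excluded; ordinary documents pass through unchanged.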
The final step is to extract each line of the file (each array of documents) to its own file. Use the split command-line tool:
# split each line of mydb2.txt into its own file
split -l 1 mydb2.txt
# remove the original files
rm mydb.txt mydb2.txt
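To see how split behaves, you can try it on a tiny, made-up file first. The file name demo.txt is arbitrary; the output names xaa, xab, and so on are assigned by split itself:

```shell
# create a two-line file and split it into one file per line
printf 'line one\nline two\n' > demo.txt
split -l 1 demo.txt
cat xaa   # → line one
cat xab   # → line two
rm demo.txt xaa xab
```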
You should now have a number of files (xaa, xab, xac etc) which can be bulk uploaded into a Cloud Object Storage bucket, in this case into a bucket called mydb.
Now that you have your JSON files in Cloud Object Storage, accessing them from IBM Cloud SQL Query is a breeze. You can access the data as if it were a SQL database:
-- explore the data
SELECT *
FROM cos://eu-gb/mydb/ STORED AS JSON
The output data is previewed in a table in the web page, which is helpful for exploring the data. Look at the data’s column headings and data types, as shown in the following screen capture:
Before you do any serious data exploration, it’s useful to complete one last data conversion: convert the JSON data into “Parquet” format. Parquet is a compressed, column-oriented format that comes from the Hadoop project. Data in this format is ideal for ad-hoc querying as it has a smaller data size and faster query performance.
Converting data to Parquet is a one-off operation that you can complete by running a single Cloud SQL Query statement:
-- convert all data in 'mydb' bucket into Parquet format
-- to be stored in the 'mydbparquet' bucket
SELECT *
FROM cos://eu-gb/mydb/ STORED AS JSON
INTO cos://eu-gb/mydbparquet/ STORED AS PARQUET
This example keeps the raw JSON data and the derived Parquet files in different Cloud Object storage buckets for neatness.
The result of this operation isn’t a table of data in the UI. It simply writes its results to the destination bucket. Exploring the bucket in the Cloud Object Storage UI reveals the resultant objects, as shown in the following screen capture:
Notice that the 3.2MB of data is now only occupying 782KB of space as a Parquet file.
You can now direct queries towards the Parquet version of your data:
-- get top 10 populated countries below the equator
SELECT country, SUM(population)
FROM cos://eu-gb/mydbparquet/jobid=59463bdb-cd55-4df3-af7f-fcdf75fcccc8 STORED AS PARQUET
WHERE latitude < 0
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
You can see the result in the following screen capture:
After you get the hang of exporting Cloudant data to Object Storage, converting it to Parquet, and analyzing it with Cloud SQL Query, the whole operation can be scripted to run automatically. Your time-boxed Cloudant data can be archived periodically and added to your data lake for analysis.
Cloud SQL Query can do much more than outlined in this tutorial: check out its SQL Reference, Geospatial toolkit and Timeseries functions.