IBM Developer Advocacy

Importing JSON Documents with nosqlimport

Glynn Bird
9/19/16

Two years ago, I started to write couchimport, a command-line utility to allow me to import comma-separated and tab-separated files into an Apache CouchDB™ or Cloudant NoSQL database.

cat mydata.tsv | couchimport --db mydatabase

I built the tool for my own purposes but open-sourced couchimport in case anyone else would find it useful. I learned a lot from writing this project:

  • publishing the project to npm (the Node.js package manager) allows other users to easily install the code — npm install -g couchimport
  • if your library/utility would save someone an hour of effort, then it’s worth open-sourcing it
  • using the Node.js Stream API allows your application to deal with file streams and HTTP streams interchangeably
  • a decent README is essential if you expect folks to use your software. In many cases the README.md file is the documentation for your project.

The purpose of couchimport is to write data from a text file to CouchDB in chunks:

[Schematic: the couchimport data flow]

It also allows a transform function to be added to the workflow: this function is called with each new document and can modify it to cast data types, remove unwanted fields, or reorganise the structure of the document. The project turned out to be a very useful way for folks to get started with CouchDB, because pre-existing data is likely to live in a spreadsheet or a relational database, both of which can easily be exported as CSV/TSV.

I also found I needed to use couchimport’s functionality programmatically, and so I exposed some of its functions to the world so that couchimport can be npm install-ed into anyone’s Node.js project. In fact, couchimport is the importer used in our Simple Search Service project.
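The programmatic interface resembles the nosqlimport one shown at the end of this post. Here’s a minimal sketch, assuming couchimport exposes an importFile-style entry point (the exact function names and option keys are in the couchimport README):

var couchimport = require('couchimport');

// connection options: the key names here are an assumption for illustration
var opts = { url: 'http://localhost:5984', database: 'mydatabase' };

// importFile is assumed to mirror the nosqlimport call shown later in this post
couchimport.importFile('./mydata.tsv', opts, function(err, data) {
  console.log(err, data);
});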

Introducing nosqlimport

I recently refactored the couchimport code to make it work with other JSON document stores, so today I’m publishing nosqlimport:

[Schematic: the nosqlimport data flow]

It can be installed as a command-line utility:

npm install -g nosqlimport

Or as a library to be used in your own project:

npm install --save nosqlimport

On its own, nosqlimport only writes its data to the terminal, but three optional npm modules can be added for Apache CouchDB, MongoDB and Elasticsearch support:

npm install -g nosqlimport-couchdb
npm install -g nosqlimport-mongodb
npm install -g nosqlimport-elasticsearch

The type of database written to is defined by the --nosql or -n command-line switch at run-time, e.g.:

cat movies.tsv | nosqlimport -n couchdb

Importing data into CouchDB

Firstly, define your CouchDB or Cloudant URL as an environment variable:

export NOSQL_URL=http://localhost:5984

Or:

export NOSQL_URL=https://myusername:mypassword@myaccount.cloudant.com

The CouchDB or Cloudant “database” to write data to can also be defined as an environment variable:

export NOSQL_DATABASE=mydatabase

Then import a text file:

cat movies.tsv | nosqlimport -n couchdb

If you’d prefer to supply all the details as command-line switches, then that’s possible too:

cat movies.tsv | nosqlimport -n couchdb -u https://myusername:mypassword@myaccount.cloudant.com -db mydatabase
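Once the import completes, you can sanity-check the result against CouchDB’s standard _all_docs endpoint, assuming the local URL and database used above:

curl 'http://localhost:5984/mydatabase/_all_docs?limit=5'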

Importing data into MongoDB

Firstly, define your MongoDB URL as an environment variable:

export NOSQL_URL=mongodb://localhost:27017/mydatabase

The MongoDB “collection” to write data to can also be defined as an environment variable:

export NOSQL_DATABASE=mycollection

Then import a text file:

cat movies.tsv | nosqlimport -n mongodb

If you’d prefer to supply all the details as command-line switches, then that’s possible too:

cat movies.tsv | nosqlimport -n mongodb -u mongodb://localhost:27017/mydatabase -db mycollection
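A similar sanity check for MongoDB: count the documents in the collection from the mongo shell, assuming the same local MongoDB and collection as above:

mongo mydatabase --eval 'db.mycollection.count()'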

Importing data into Elasticsearch

Firstly, define your Elasticsearch URL as an environment variable:

export NOSQL_URL=http://localhost:9200/myindex

The Elasticsearch “type” to write data to can also be defined as an environment variable:

export NOSQL_DATABASE=mytype

Then import a text file:

cat movies.tsv | nosqlimport -n elasticsearch

If you’d prefer to supply all the details as command-line switches, then that’s possible too:

cat movies.tsv | nosqlimport -n elasticsearch -u http://localhost:9200/myindex -db mytype
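And for Elasticsearch, the standard _count endpoint reports how many documents were indexed, assuming the same local URL and index as above:

curl 'http://localhost:9200/myindex/_count'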

Specifying the delimiter

By default, nosqlimport expects a tab character to delimit the columns of the incoming text file, but the delimiter can be specified at run-time with the --delimiter or -d parameter:

cat movies.csv | nosqlimport -d ',' -n couchdb

Transform functions

Transform functions are entirely optional but are a very powerful way of modifying the JSON object before it is written to the database. You may need to:

  • cast data types, forcing strings to numbers or booleans prior to saving
  • remove documents that don’t need saving in the first place
  • rearrange the JSON object, e.g. generating a GeoJSON object from a text file of latitudes and longitudes

A transform function is saved to a text file before calling nosqlimport and contains a single JavaScript function exported via module.exports. The transform function is called for each row in the incoming text file (except the first line, which contains the column headings), and the document it synchronously returns is added to the write buffer. For example, if our source data looked like this:

name	latitude	longitude	description	live
Middlesbrough	54.576841	-1.234976	A large industrial town on the south bank of the River Tees	true
Boston	42.358056	-71.063611	The largest city in Massachusetts.	true
Atlantis	0	0	A fictional island	false

The documents being generated and passed to the transform function would look like this:

{
  "name": "Middlesbrough",
  "latitude": "54.576841",
  "longitude": "-1.234976",
  "description": "A large industrial town on the south bank of the River Tees",
  "live": "true"
}

Notice how:

  • the object’s keys were inferred from the incoming file’s first line
  • the values are all strings, because a CSV/TSV file carries no notion of a column’s data type.

In this example, we cast the latitude and longitude values to numbers and force the live value to be a boolean:

module.exports = function(doc) {
  doc.latitude = parseFloat(doc.latitude);
  doc.longitude = parseFloat(doc.longitude);
  doc.live = (doc.live === 'true');
  return doc;
};

To prevent certain documents from being saved, simply return {} instead of a populated object:

module.exports = function(doc) {
  if (doc.live === 'true') {
    return doc;
  } else {
    // nothing is written to the database
    return {};
  }
};

Or you can elect to craft a new JSON document in your own format based on the data being imported, in this case GeoJSON:

module.exports = function(doc) {
  if (doc.live === 'true') {
    var newdoc = {
      type: 'Feature',
      geometry: {
        type: 'Point',
        // GeoJSON stores coordinates as [longitude, latitude]
        coordinates: [ parseFloat(doc.longitude), parseFloat(doc.latitude) ]
      },
      properties: {
        name: doc.name
      }
    };
    return newdoc;
  } else {
    return {};
  }
};

A transform function is used by supplying the path to the file containing the code with the -t parameter:

cat places.tsv | nosqlimport -n mongodb -t './geojson.js'
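The switches can also be combined; for example, importing a comma-separated file into CouchDB with a transform function:

cat places.csv | nosqlimport -n couchdb -d ',' -t './geojson.js'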

Using nosqlimport in your own application

If you are building a Node.js application and need to import files, streams or HTTP streams of content into a NoSQL database, then you can use nosqlimport in your own project as a dependency. Add it to your project with:

npm install --save nosqlimport

Add the database-specific module for your chosen data store:

npm install --save nosqlimport-couchdb
npm install --save nosqlimport-mongodb
npm install --save nosqlimport-elasticsearch

And call the code:

var nosqlimport = require('nosqlimport');

// connection options
var opts = { nosql: 'couchdb', url: 'http://localhost:5984', database: 'mydb' };

// import the data
nosqlimport.importFile('./places.tsv', null, opts, function(err, data) {
    console.log(err, data);
});

Or, supply a JavaScript function to transform the data:

var nosqlimport = require('nosqlimport');

// cast lat/long to numbers and live to boolean
var transformer = function(doc) {
  doc.latitude = parseFloat(doc.latitude);
  doc.longitude = parseFloat(doc.longitude);
  doc.live = (doc.live === 'true');
  return doc;
};

// connection options
var opts = { nosql: 'couchdb', url: 'http://localhost:5984', database: 'mydb', transform: transformer };

// import the data
nosqlimport.importFile('./places.tsv', null, opts, function(err, data) {
    console.log(err, data);
});
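Because nosqlimport is built on the Node.js Stream API, streams can be fed in as well as files. The sketch below assumes a hypothetical importStream counterpart to importFile; check the project README for the real function name and signature:

var fs = require('fs');
var nosqlimport = require('nosqlimport');

// connection options
var opts = { nosql: 'couchdb', url: 'http://localhost:5984', database: 'mydb' };

// importStream is a hypothetical, stream-accepting counterpart to importFile
var rs = fs.createReadStream('./places.tsv');
nosqlimport.importStream(rs, null, opts, function(err, data) {
  console.log(err, data);
});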

Links

nosqlimport and its plugins are open-source projects, so please raise issues or contribute PRs if you can!
