Administrative Operations


JanusGraph acts as an abstraction layer on top of its storage backends and defers to them for administrative best practices. As a result, there is no centralized documentation on backend administrative tasks. I can’t cover them all here, but since this series is based on a containerized approach, I’ll cover a Docker-specific backup and restore procedure. The main use for this technique is creating development copies of your instance. We’ll also see how to interact with Cassandra from within the container and locate the Docker volumes where our data is stored.

Viewing the databases in cqlsh

JanusGraph stores data in a non-human-readable format. While we can technically use cqlsh to look under the hood, all we’re really going to see is something along the lines of, “Yeah, that’s probably an engine”. Some users have created standalone Java applications to get a better look at the tables, but at the time of this writing there is no way to do so using either cqlsh or the Gremlin Console.


While we can’t use cqlsh to directly view the data, we can use it to see our database, or keyspace as it’s called in Cassandra. In the next step we’ll be taking a backup, and it’s not a bad idea to verify that the database exists where you think it does before attempting to back it up.

$ docker exec -it jg-cassandra sh -c 'exec cqlsh'
Connected to Test Cluster at
[cqlsh 5.0.1 | Cassandra 3.11.2 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> describe keyspaces

system_auth                     system_schema  system     system_distributed
"ConfigurationManagementGraph"  janusgraph     airroutes  system_traces

cqlsh> select * from system_schema.keyspaces;

 keyspace_name                | durable_writes | replication
                   janusgraph |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'}
                  system_auth |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'}
                system_schema |           True |                             {'class': 'org.apache.cassandra.locator.LocalStrategy'}
           system_distributed |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '3'}
                       system |           True |                             {'class': 'org.apache.cassandra.locator.LocalStrategy'}
                system_traces |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '2'}
                    airroutes |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'}
 ConfigurationManagementGraph |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'}

(8 rows)

Performing a back-up and restore operation

Cassandra Container backup and restore

Once again, backend database operations are tied to whichever storage backend (Cassandra in our case) was implemented and are not JanusGraph-specific. Our data and composite indexes are stored on the storage backend. Mixed indexes are stored entirely separately in their respective index backend, be it Elasticsearch, Solr, or Lucene; only the reference to the mixed index is stored on the storage backend. I originally wanted to cover the various ways to perform backups and data migrations with Cassandra. Unfortunately, this post has already ballooned out of proportion, and the Cassandra backup and restore procedures are less than straightforward. As a result, I’m going to give a brief overview of the options and then cover some basics for copying Docker images. If interest is expressed, I can always create a separate post on data backups and migration for JanusGraph on Cassandra, as it would easily constitute its own write-up.

For the official methods to back up Cassandra, we should refer to the Cassandra docs on backing up and restoring data, which mainly cover using snapshots to back up and restore. You can also use the COPY tool for data migration. Another option for getting data onto new nodes is to simply have them join the cluster. Be aware that if you’re restoring to a cluster with a different number of nodes, extra work is involved.

Docker back-up and restore

To back up our Cassandra Docker container, the first thing we’ll need to do is commit it. I chose to name my commit airroutes_backup.

$ docker commit jg-cassandra airroutes_backup


Since the Cassandra container defaults to using volumes, we’ll need to back up and restore the volume separately from the container. Please be aware that if you’re on a Mac, Docker runs inside a VM behind the scenes, and you’ll need to access the volume from inside that VM. Here are the commands to create a screen session to the VM and find the volume info if you’re on a Mac.

screen ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/tty

$ docker inspect -f "{{json .Mounts}}" jg-cassandra | jq .
[
  {
    "Type": "volume",
    "Name": "c97bf7d1aea13b3e597b9e3423ca8a0af222494b006bdd339d32766483e64b5c",
    "Source": "/var/lib/docker/volumes/c97bf7d1aea13b3e597b9e3423ca8a0af222494b006bdd339d32766483e64b5c/_data",
    "Destination": "/var/lib/cassandra",
    "Driver": "local",
    "Mode": "",
    "RW": true,
    "Propagation": ""
  }
]
$ tar -cvf /tmp/airroute_data.tar /var/lib/docker/volumes/c97bf7d1aea13b3e597b9e3423ca8a0af222494b006bdd339d32766483e64b5c/_data


Once we get the volume id from the commands above, we can export it into a tar file on our local disk so it can be restored to a new container.

$ mkdir /tmp/backup
$ docker run --rm -v /tmp/backup:/backup -v /var/lib/docker:/docker alpine:edge tar cfz /backup/airroute_data.tgz -C /docker/volumes/c97bf7d1aea13b3e597b9e3423ca8a0af222494b006bdd339d32766483e64b5c/_data/ .


The command above is a bit dense, so let’s take a moment to review what we’re doing. First, docker run is used to run a process in an isolated container. The --rm flag tells Docker to clean up the isolated container after the command finishes. The -v flag maps a directory: it maps /var/lib/docker from the Docker daemon, while /backup comes from our normal /tmp directory. Then we define the image we want to use, which in this case is alpine:edge. Lastly, we run tar with the cfz flags to create (c) an archive at the given file name (f), gzip-compressed (z). We then provide the destination, /backup/airroute_data.tgz, and the source we got from the docker inspect command. The -C flag changes directories so we can omit the path prefix from our archive, and the trailing period says to include everything from the directory we changed into.
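To make the cfz / -C / trailing-period semantics concrete, here is a small self-contained round trip you can run anywhere. The paths are throwaway stand-ins for the real volume directory, not anything Docker creates:

```shell
# Stand-in for the volume's _data directory (hypothetical paths)
mkdir -p /tmp/demo/volume_data /tmp/demo/restore
echo "hello cassandra" > /tmp/demo/volume_data/commitlog.txt

# Create a gzipped archive of the directory's contents, without the path prefix
tar cfz /tmp/demo/data.tgz -C /tmp/demo/volume_data .

# Restore into a different directory, just as we'll do for the new container
tar xfz /tmp/demo/data.tgz -C /tmp/demo/restore

cat /tmp/demo/restore/commitlog.txt
```

Because of -C and the trailing period, the archive contains ./commitlog.txt rather than the full /tmp/demo/volume_data path, which is what lets us extract it into a volume with a completely different id later.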

If you are running on Linux, you should be able to simply tar the source path listed in the docker inspect output to wherever you want it backed up.

Now that we have our volume backed up we’ll verify our image backup was created and save it to a tar file.

$ docker images 
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
airroutes_backup    latest              e365e1e9adf4        2 minutes ago       340MB
solr                latest              49ae92f15a2c        4 weeks ago         843MB
janusgraph/server   0.3.0-SNAPSHOT      9c7caba170cb        4 weeks ago         1.32GB
$ docker save airroutes_backup > /tmp/airroute_back.tar


For transparency, I’ll admit that to test the restore I just removed the image from my machine and then re-imported it from the tar file. I have used this method multiple times in the past when creating custom images for CI pipelines.

$ docker rmi airroutes_backup
Untagged: airroutes_backup:latest
Deleted: sha256:e365e1e9adf46dda2a9e50e8e0110d7f53ead56eeb1baf2525fab34a0ed9d355
Deleted: sha256:66dd626ad459b3edb426c9814c4346804271b5ba4fc69b255532651be9896220
$ docker load < /tmp/airroute_back.tar
e47bcb48cb80: Loading layer [==================================================>]  17.32MB/17.32MB
Loaded image: airroutes_backup:latest
$ docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
airroutes_backup    latest              e365e1e9adf4        4 minutes ago       340MB


Now we can create a new container from the image we saved and imported. We’ll want to copy over the data from the volume before we start it up. Hence, we are using docker create instead of docker run.

docker create --name airroutes -e CASSANDRA_START_RPC=true -p 9160:9160 -p 9042:9042 -p 7199:7199 -p 7001:7001 -p 7000:7000 airroutes_backup


Once again we’ll want to get the data directory for the new container.

$ docker inspect -f "{{json .Mounts}}" airroutes | jq .
[
  {
    "Type": "volume",
    "Name": "bc43d2de2dd57eacb8850dcf1d39a37a2d876c75c115a5e6200ad83ccb758a1b",
    "Source": "/var/lib/docker/volumes/bc43d2de2dd57eacb8850dcf1d39a37a2d876c75c115a5e6200ad83ccb758a1b/_data",
    "Destination": "/var/lib/cassandra",
    "Driver": "local",
    "Mode": "",
    "RW": true,
    "Propagation": ""
  }
]


Then we untar the backup into the new volume source and start up the container.

$ docker run --rm -it -v /tmp/backup:/backup -v /var/lib/docker:/docker alpine:edge tar xfz /backup/airroute_data.tgz -C /docker/volumes/bc43d2de2dd57eacb8850dcf1d39a37a2d876c75c115a5e6200ad83ccb758a1b/_data/
$ docker start airroutes


If everything went well, you should be able to start up your JanusGraph server and see the same data you had before.

If you’re on a Mac and want to look around the Docker VM, you can start a screen session. You’ll find the volumes by browsing to /var/lib/docker/volumes.

screen ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/tty

Elasticsearch backup and restore

Since exporting Docker data works pretty much the same for all containers, for the Elasticsearch container I want to show a different way to export data. For snapshots and production-style backups I recommend referring to the official Elasticsearch documentation on backups. If you aren’t working with a production environment, or just want to make a copy for development, here is my favorite quick and easy method.

When I’ve had to create copies of an Elasticsearch index to work with for development in the past, I’ve used the reindex API. You can use reindex to clone an index to another instance of Elasticsearch. You can then update the index hostname in your JanusGraph properties file to point to the copy.

Here’s an example curl command that clones the janusgraph_quadtextstring index from some-other-host to localhost. This assumes that some-other-host has already been whitelisted via reindex.remote.whitelist in elasticsearch.yml.

curl -XPOST 'localhost:9200/_reindex?pretty' -H 'Content-Type: application/json' -d'
{ "source": {"remote":{ "host": "http://some-other-host:9200", "username": "my_username", "password": "my_password" }, "index": "janusgraph_quadtextstring" }, "dest": {"index": "janusgraph_quadtextstring" } }'

Wrapping Up

From an Ops perspective, the blessing of JanusGraph is that you can leverage existing knowledge of popular databases, like Cassandra. The curse is that if you don’t have any existing knowledge of the supported backends, then you have a lot of moving parts to learn about. From previous projects I already had Cassandra and Elasticsearch exposure, which was a great help. Unfortunately for me, there are a number of users at IBM that use HBase and Solr, either through Apache Ambari or standalone. I previously lacked experience with HBase, Solr, and Ambari and had to learn all three to properly provide support. If by chance you use Ambari or Hortonworks Data Platform and would be interested in a JanusGraph plugin, please let me know in the comments. If you’d like to look back at other posts in this series or other JanusGraph articles I’ve written, you can find them here. Also, once I wrap up the finishing touches on my demo app, Part 4 – Visualization will be coming out.

JanusGraph Post Series – Chris Hupman
