Getting started with JanusGraph, Part 1 – Deployment

JanusGraph is a scalable, transactional property graph database. A property graph is a set of entities (referred to as vertices) connected by their relationships (referred to as edges), where both vertices and edges can carry key-value properties. For example, in a property graph of JanusGraph developers, I am a vertex with the property mappings role: developer and org: IBM. My colleague Jason is also a vertex with the same properties. In the image below there is a third vertex for some code updates I submitted, called a Pull Request. I am connected to the Pull Request by an edge named submitted and Jason is connected to it by an edge named merged_by, while the Pull Request itself has outbound edges named created_by (pointing to me) and reviewed_by (pointing to Jason). If you queried for the vertices connected to the Pull Request by outbound edges, you would see that it was created_by me and reviewed_by Jason. If you queried for its inbound edges, you would see it was merged_by Jason and submitted by me. Query for edges going in both directions and you get all four, as the sketch below shows.

Property graph queries can traverse multiple edges and vertices to help illuminate the relationships between entities. You can find additional articles on JanusGraph here. Also, if you are interested in a cloud-hosted, production-ready JanusGraph instance, be sure to check out Compose for JanusGraph.
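To make edge direction concrete, here is a minimal sketch of that example in a local Gremlin console, using an in-memory TinkerGraph. The names and labels are assumptions made for illustration and may differ from the image:

gremlin> graph = TinkerGraph.open()
gremlin> g = graph.traversal()
gremlin> chris = graph.addVertex(label, 'person', 'name', 'Chris', 'role', 'developer', 'org', 'IBM')
gremlin> jason = graph.addVertex(label, 'person', 'name', 'Jason', 'role', 'developer', 'org', 'IBM')
gremlin> pr = graph.addVertex(label, 'pull_request', 'title', 'some code updates')
gremlin> chris.addEdge('submitted', pr)    // me -> Pull Request
gremlin> jason.addEdge('merged_by', pr)    // Jason -> Pull Request
gremlin> pr.addEdge('created_by', chris)   // Pull Request -> me
gremlin> pr.addEdge('reviewed_by', jason)  // Pull Request -> Jason
gremlin> g.V(pr).outE().label()            // outbound edges: created_by, reviewed_by
gremlin> g.V(pr).inE().label()             // inbound edges: submitted, merged_by
gremlin> g.V(pr).bothE().label()           // both directions: all four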

Before coming to JanusGraph, I had a background in relational databases. My goal is to share some of the insights I gained during my adoption of property graphs. This is going to be a three-part series, with this post covering how to deploy JanusGraph against Cassandra and Elasticsearch containers and then load an existing GraphML dataset. Part 2 will go over indexes, administrative operations, and an intro to traversals. Part 3 will touch on visualization.

Typical Architecture for a JanusGraph Deployment

The main components to focus on in the image below are highlighted in orange. The storage backend is pluggable and supports Cassandra, HBase, BerkeleyDB, Google Cloud Bigtable, and an in-memory storage option. The storage backend is where the data is actually stored, and its flexibility lets you pick a database engine that you might already have deployed or have expertise in. You can only have one storage backend.

Next is the external index backend, for which Elasticsearch, Solr, and Lucene are supported. The external index backend is optional, but it is required for indexing on multiple properties, full-text and string searches, and geo-mapping. Once again, you can only select one. Since the data set we’ll be using includes coordinates, geo-mapping support will be useful.
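Concretely, both choices end up in a single JanusGraph properties file. A minimal sketch matching the setup in this post (Cassandra over Thrift for storage, Elasticsearch as an index backend registered under the name search, both assumed to be listening locally) might look like this:

# exactly one storage backend
storage.backend=cassandrathrift
storage.hostname=127.0.0.1
# at most one external index backend, registered here under the name 'search'
index.search.backend=elasticsearch
index.search.hostname=127.0.0.1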

After the backends, the TinkerPop API – Gremlin box represents how you can interact with the graph. The Gremlin Console, which is the command-line interface we’ll be using to interact with JanusGraph, is an example of an application that invokes the TinkerPop API.

Finally, the large orange box in the middle of the image represents the JanusGraph Server. This piece is a bit confusing at first, since you run it with a script named gremlin-server.sh. Gremlin Server is part of the Apache TinkerPop project; JanusGraph essentially acts as a plugin for Gremlin Server and tells it how, and where, to store graph data.

JanusGraph Architecture Overview

How to Deploy JanusGraph

For the first part of this series, we’re going to deploy Cassandra and Elasticsearch in Docker containers. Then we’ll configure our JanusGraph server to use our containers. After that we’ll use the Gremlin Console to connect to our JanusGraph server. Lastly we’re going to actually load up some data, commit it as a transaction and verify it loaded correctly. I’ll also cover some handy commands and things to look out for.

Backend Storage Containers

The first thing to do is set up our Docker containers for Cassandra and Elasticsearch. Cassandra will be used to actually store our data and Elasticsearch will allow us to index on multiple property keys, also known as a mixed index.
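As a quick preview of part 2, a mixed index is built through JanusGraph’s management API and delegates to whichever index backend is registered under the given name. The property keys and index name below are hypothetical, just to show the shape of the API:

gremlin> mgmt = graph.openManagement()
gremlin> city = mgmt.makePropertyKey('city').dataType(String.class).make()
gremlin> country = mgmt.makePropertyKey('country').dataType(String.class).make()
gremlin> mgmt.buildIndex('cityAndCountry', Vertex.class).addKey(city).addKey(country).buildMixedIndex('search')
gremlin> mgmt.commit()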

Cassandra Configuration

If you want to look at the available versions and useful commands for getting started, you can go to the Cassandra page on Docker Hub.

It’s important to expose port 9160 for Cassandra Thrift; if it’s not accessible, JanusGraph won’t be able to connect. We’re also specifying version 3.11 because it’s supported by JanusGraph 0.2.0. Run the command below to download and run the Cassandra container. Note this will map ports 7000, 7001, 7199, 9042, and 9160 on your local machine. If you only plan to use this container for following along with this post, you can get by with mapping just port 9160.

docker run --name jg-cassandra -d -e CASSANDRA_START_RPC=true -p 9160:9160 -p 9042:9042 -p 7199:7199 -p 7001:7001 -p 7000:7000 cassandra:3.11

JanusGraph uses Cassandra Thrift to connect. Above, we set an environment variable to enable it with -e CASSANDRA_START_RPC=true. There are two other ways to enable Thrift. The first is to update the Cassandra config file to set start_rpc to true and restart the container.

docker exec -it jg-cassandra sed -i 's/start_rpc: false/start_rpc: true/' /etc/cassandra/cassandra.yaml
docker restart jg-cassandra

The other option is to use nodetool to enable Thrift:

docker exec -it jg-cassandra nodetool enablethrift

To verify the container is up and running, you can run docker ps.

docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                                                                                                      NAMES
cd3d675b54ed        cassandra:3.11      "docker-entrypoint.s…"   17 seconds ago      Up 2 seconds        0.0.0.0:7000-7001->7000-7001/tcp, 0.0.0.0:7199->7199/tcp, 0.0.0.0:9042->9042/tcp, 0.0.0.0:9160->9160/tcp   jg-cassandra

You can also use nodetool to verify that Thrift is running.

docker exec -it jg-cassandra nodetool statusthrift
running

Elasticsearch Deployment for JanusGraph

Once again we’ll be going to Docker Hub, this time for the Elasticsearch container. We’re again pinning a version, 5.6 in this case, to be consistent with what is supported in the 0.2.0 release. Run the command below to download and run the Elasticsearch container. Note this will map ports 9200 and 9300 on your local machine.

docker run --name es -d -p 9200:9200 -p 9300:9300 elasticsearch:5.6

To verify it’s running you can either open your browser to http://localhost:9200 or run a curl from the command line.

curl localhost:9200
{
  "name" : "9XjBwtd",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "Qt6r2smoSGuXo0PLB_Bfvg",
  "version" : {
    "number" : "5.6.8",
    "build_hash" : "688ecce",
    "build_date" : "2018-02-16T16:46:30.010Z",
    "build_snapshot" : false,
    "lucene_version" : "6.6.1"
  },
  "tagline" : "You Know, for Search"
}

It’s also good to check by running docker ps.

docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                                                                                                      NAMES
cd3d675b54ed        cassandra:3.11      "docker-entrypoint.s…"   17 seconds ago      Up 2 seconds        0.0.0.0:7000-7001->7000-7001/tcp, 0.0.0.0:7199->7199/tcp, 0.0.0.0:9042->9042/tcp, 0.0.0.0:9160->9160/tcp   jg-cassandra
479e743a4b25        elasticsearch:5.6   "/docker-entrypoint.…"   2 days ago          Up 13 minutes       0.0.0.0:9200->9200/tcp, 0.0.0.0:9300->9300/tcp                                                             es

Download JanusGraph 0.2.0

Go to the JanusGraph releases page on GitHub and download the janusgraph-0.2.0-hadoop2.zip file.
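If you’d rather stay on the command line, something like the following should fetch the archive. The URL assumes the v0.2.0 release asset follows the standard GitHub release naming:

curl -L -O https://github.com/JanusGraph/janusgraph/releases/download/v0.2.0/janusgraph-0.2.0-hadoop2.zip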

Configuring JanusGraph

The JanusGraph distribution has the benefit of providing sensible default settings. The only things we need to change are increasing the timeout for scripts and enabling the JanusGraph graph manager. Luckily, a configuration file with the necessary graph manager settings already exists; we simply pass its path as a parameter when we start the server. If you haven’t already, unzip the JanusGraph zip file and change your directory to janusgraph-0.2.0-hadoop2. To increase the timeout, either run the sed command below or open conf/gremlin-server/gremlin-server-configuration.yaml in an editor and change the scriptEvaluationTimeout value from 30000 to 180000.

unzip janusgraph-0.2.0-hadoop2.zip;
cd janusgraph-0.2.0-hadoop2;
sed -i "s/scriptEvaluationTimeout.*/scriptEvaluationTimeout: 180000/" conf/gremlin-server/gremlin-server-configuration.yaml;
./bin/gremlin-server.sh conf/gremlin-server/gremlin-server-configuration.yaml

Note: if you don’t specify the gremlin-server-configuration.yaml file after gremlin-server.sh, it loads conf/gremlin-server/gremlin-server.yaml instead. That default configuration also points at local Cassandra and Elasticsearch instances, but it will not dynamically support running multiple graph databases on a single JanusGraph Server.
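For reference, the part of gremlin-server-configuration.yaml that enables this behavior looks roughly like the excerpt below; the exact properties file path may differ in your copy of the distribution:

graphManager: org.janusgraph.graphdb.management.JanusGraphManager
graphs: {
  ConfigurationManagementGraph: conf/janusgraph-cassandra-es-server.properties
}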

Connect to the JanusGraph Server

In a new window, start up a Gremlin Console:

./bin/gremlin.sh

Then you’ll want to connect to the JanusGraph server and tell the console to send all commands to the remote server:

gremlin> :remote connect tinkerpop.server conf/remote.yaml session
==>Configured localhost/127.0.0.1:8182-[876749ea-af2e-491c-bb28-406a850d1525]
gremlin> :remote console
==>All scripts will now be sent to Gremlin Server - [localhost/127.0.0.1:8182]-[876749ea-af2e-491c-bb28-406a850d1525] - type ':remote console' to return to local mode
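For context, conf/remote.yaml is a small client-side configuration file; in the 0.2.0 distribution it contains settings along these lines (serializer details abbreviated here and may vary):

hosts: [localhost]
port: 8182
serializer: { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}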

Load Some Test Data For Your JanusGraph Deployment

I really like the airport route data set from my colleague Kelvin Lawrence’s book, Practical Gremlin. You can download the GraphML file from the book’s GitHub repository. If you want to copy and paste the commands below without updating the file path, save it to /tmp/.
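For example, here’s one way to fetch it from the command line (the URL assumes the file is still hosted in the book’s krlawrence/graph repository on GitHub):

curl -L -o /tmp/air-routes.graphml https://github.com/krlawrence/graph/raw/master/sample-data/air-routes.graphml
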
First connect to the JanusGraph server with a session in a Gremlin Console as we did previously and set up the remote console. Run the commands in the example below to create a configuration template for a database named airroutes. Then we’ll import the air-routes.graphml file we downloaded into our airroutes database and commit the transaction. The import can sometimes take more than 30 seconds, which is why we upped the value for scriptEvaluationTimeout.

gremlin> map = new HashMap<String, Object>();
gremlin> map.put("storage.backend", "cassandrathrift");
gremlin> map.put("storage.hostname", "127.0.0.1");
gremlin> map.put("graph.graphname", "airroutes");
gremlin> ConfiguredGraphFactory.createConfiguration(new MapConfiguration(map));

gremlin> graph=ConfiguredGraphFactory.open("airroutes");

gremlin> graph.io(graphml()).readGraph('/tmp/air-routes.graphml');

gremlin> graph.tx().commit();

gremlin> g=graph.traversal();
gremlin> :set max-iteration 1000 

Verify the Test Data on Your JanusGraph Deployment

To get a list of the graphs we have created, use the command below:

gremlin> ConfiguredGraphFactory.getGraphNames()
==>airroutes

We’re going to define the connection to our airroutes database as graph and then set up a traversal as g based on the graph variable. We’ll end with a query that shows paths from SFO to JFK with one layover. SFO is set as the departure_airport variable and JFK as arrival_airport to make the script more reusable.

gremlin> graph = ConfiguredGraphFactory.open('airroutes')
==>standardjanusgraph[cassandrathrift:[127.0.0.1]]
gremlin> g = graph.traversal()
==>graphtraversalsource[standardjanusgraph[cassandrathrift:[127.0.0.1]], standard]
gremlin> g.V().values('code').count()
==>3619
gremlin> departure_airport="SFO"
==>SFO
gremlin> arrival_airport="JFK"
==>JFK
gremlin> g.V().has('code', departure_airport).repeat(out('route').simplePath()).times(2).has('code', arrival_airport).path().by('code').limit(5)
==>[SFO, ATL, JFK]
==>[SFO, DFW, JFK]
==>[SFO, DCA, JFK]
==>[SFO, TPA, JFK]
==>[SFO, LGB, JFK]
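As a quick variation, dropping the repeat and traversing a single route edge lists the nonstop options between the same two airports:

gremlin> g.V().has('code', departure_airport).out('route').has('code', arrival_airport).path().by('code')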

One thing you’ll notice is that some of the Gremlin queries might take a few seconds to run. There is some built-in caching, so subsequent runs of a query will generally be quicker. But now we’ve got a basic working deployment of JanusGraph to build on! In part 2 of this blog series, we’ll cover adding indexes to speed up our queries. And in part 3, we will access our graph database with a simple front-end application to visualize the data.

