Introduction

Welcome to part 2 of JanusGraph tips and tricks. To provide some quick context, this series consists of overflow content from the Getting Started with JanusGraph series I wrote last year.

When part 1 of that series was published, I only had about 3 months of JanusGraph experience under my belt. Since then I’ve become a maintainer on the project, attended multiple meetups, spoken about JanusGraph a few times, and reviewed a small mountain of code. As a result, a lot of content and detail was added during my edit passes of this “overflow” content. Hopefully you find it helpful.

Troubleshooting indexes

When creating an index, if there are any stale management sessions or open transactions, the index might get stuck in the INSTALLED state. If you’re unfamiliar with the lifecycle of a JanusGraph index, there is a JanusGraph wiki page that diagrams the index states and lifecycle.

To see all the open transactions you can run graph.getOpenTransactions().

gremlin> graph.getOpenTransactions()
==>standardjanusgraphtx[0x14ba9376]
==>standardjanusgraphtx[0x477aaf55]

To roll back all the transactions, you can run graph.getOpenTransactions().getAt(0).rollback() repeatedly until all the open transactions are rolled back, or you could be elegant and write a loop to run it the correct number of times. I personally prefer pressing up and enter a few times over the extra typing.
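For anyone who does prefer the loop, here is a minimal sketch. It assumes graph is the same open JanusGraph instance used in the console session above, and it copies the open transactions into a list first so that rolling one back cannot disturb the iteration.

```groovy
// Assumes `graph` is the open JanusGraph instance from the session above.
// Snapshot the open transactions, then roll each one back in turn.
graph.getOpenTransactions().toList().each { tx -> tx.rollback() }

// Check how many transactions are still open afterwards.
println graph.getOpenTransactions().size()
```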

To see if there are any stale management instances, you can run the getOpenInstances() method. This is also documented in the failure and recovery section of the JanusGraph docs. If you see multiple management instances open, you can use the forceCloseInstance() method as shown below. You can also access a specific management instance by its array index, let’s say 0 for this example, by appending [0] or getAt(0) to the command.

gremlin> mgmt = graph.openManagement()
gremlin> mgmt.getOpenInstances()
==>0934f2eb69223-Chriss-MacBook-Pro-2-local2
==>0729845962091-remoteMachine1
gremlin> mgmt.forceCloseInstance('0729845962091-remoteMachine1')
gremlin> mgmt.commit()

Exporting to GraphML or GraphSON

In part 1 of the getting started with JanusGraph series I went over loading the air-routes data set created by Kelvin Lawrence. I created a demo app off this dataset, but to make my life easier it only supports the contiguous U.S.

So in order to create a new GraphML file that only includes the data I wanted, I had to create a subgraph that included all edges that connect to US vertices outside the excluded regions, US-HI and US-AL. (Region codes in air-routes follow ISO 3166-2, where Alaska is US-AK and US-AL is Alabama, so swap in US-AK if Alaska is the state you want to leave out.) This traversal will require some scrolling and a lengthy explanation. To give proper credit, I heavily referenced this janusgraph-users post by Jason Plurad, my colleague and JanusGraph mentor.

Create the subgraph

First we can try to create a subgraph in TinkerGraph and then traverse the subgraph to make sure it looks good.

gremlin> sg = g.V().has('country', 'US').has('region',without('US-HI', 'US-AL')).outE('route').where(inV().has('region',without('US-HI', 'US-AL')).has('country', 'US')).subgraph('us').cap('us').next()
==>tinkergraph[vertices:558 edges:6292]
gremlin> sgt = sg.traversal()
gremlin> sgt.V().has('country', 'US').count()
==>558

To walk through what’s happening in each step: first we issue g.V() and filter with has() so the traversal contains all vertices within the US, excluding Alaska and Hawaii (both of which are lovely places). Then we look at all outgoing edges that have a route label with outE('route'), and use where() to keep only those that connect to the contiguous United States. Following that, we create a subgraph named us from the edges that survive the outE step with subgraph('us').

Next we use the cap step, which acts as a barrier: it iterates the traversal up to that point and then emits the side effect stored under the given key, which in our case is the us subgraph.

This can also be seen in the usage examples in the subgraph step documentation.

To return the actual results we need a terminal step. In this case we use next() to return the result of the cap step.
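To make the walkthrough easier to follow, here is the same traversal split across lines with a comment on each step. It is only a reformatting of the one-liner above, and it assumes g is a traversal source over the full air-routes graph.

```groovy
// Same traversal as the one-liner, one step per line.
// Assumes `g` is a traversal source over the full air-routes graph.
sg = g.V().
       has('country', 'US').                       // start from US vertices
       has('region', without('US-HI', 'US-AL')).   // drop the excluded regions
       outE('route').                              // follow outgoing route edges
       where(inV().
               has('region', without('US-HI', 'US-AL')).
               has('country', 'US')).              // keep edges whose target passes the same filters
       subgraph('us').                             // collect those edges and their vertices under the key 'us'
       cap('us').                                  // emit the side effect, i.e. the subgraph
       next()                                      // terminal step: hand back the TinkerGraph
```

Because the subgraph step is edge-induced, a vertex only lands in sg if at least one of the surviving route edges touches it.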

Export to GraphML

Here’s the command appended with an export to GraphML.

g.V().has('country', 'US').has('region',without('US-HI', 'US-AL')).outE('route').where(inV().has('region',without('US-HI', 'US-AL')).has('country', 'US')).subgraph('us').cap('us').next().io(IoCore.graphml()).writeGraph("/tmp/airroutes_us.graphml")

We can also append the write operations to the subgraph we saved to sg.

sg.io(IoCore.graphml()).writeGraph("/tmp/airroutes_us.graphml")

Export to GraphSON

Exporting to GraphSON requires a few more steps. I ultimately used this Stack Overflow answer posted by Stephen Mallette to figure it out.

I also referenced the TinkerPop GraphSON documentation.

sg = g.V().has('country', 'US').has('region',without('US-HI', 'US-AL')).outE('route').where(inV().has('region',without('US-HI', 'US-AL')).has('country', 'US')).subgraph('us').cap('us').next()
file = new FileOutputStream("/tmp/airroutes_us.json")
mapper = GraphSONMapper.build().addCustomModule(org.janusgraph.graphdb.tinkerpop.io.graphson.JanusGraphSONModuleV2d0.getInstance()).create()
writer = GraphSONWriter.build().mapper(mapper).create()
writer.writeGraph(file, sg)

It’s hard to explain this without going into object-oriented programming and class inheritance, but I’ll do my best. The file variable is an output stream that tells the graph writer to write the data to a file named /tmp/airroutes_us.json. The GraphSONMapper, which we assign to mapper, registers the module we’ll use to serialize the graph elements.

In this case, it’s JanusGraphSONModuleV2d0. The GraphSONWriter object assigned to writer writes (surprise!) a graph and its elements to a JSON-based representation.

Something important to note is that unless a mapper is supplied, complex objects like Geoshape will get converted to Strings. The last line tells the writer to write the contents of the sg variable, in our case the subgraph, to file.

As I was typing the explanation of this one-liner for creating the subgraph, I had a thought. What would I be most wary of having to explain 6 months after writing it: a regex, a super dense Awk script, or a Gremlin traversal? To which I chose all of the above.

Feature preview: printSchema

When I tried to import the GraphML file I created above, I was greeted with an error that the airport vertex label was missing. I could try to figure out how to include the labels in my export, but I want to create my schema and indexes before I import my data anyway.

Also, this helps to highlight why GraphSON is the recommended format to use. To help with schema creation, a new set of commands is being added to the management API and will be available in versions 0.2.3, 0.3.2, and 0.4.0. These new commands are heavily derived from the schema describe script written by Robert Dale.

Unfortunately it can’t be used with ConfiguredGraphFactory, so I had to compile a distribution from the 0.3 branch to pull in the new printSchema() command.

gremlin> mgmt.printSchema()
==>------------------------------------------------------------------------------------------------
Vertex Label Name              | Partitioned | Static                                             |
---------------------------------------------------------------------------------------------------
version                        | false       | false                                              |
airport                        | false       | false                                              |
country                        | false       | false                                              |
continent                      | false       | false                                              |
---------------------------------------------------------------------------------------------------
Edge Label Name                | Directed    | Unidirected | Multiplicity                         |
---------------------------------------------------------------------------------------------------
route                          | true        | false       | MULTI                                |
contains                       | true        | false       | MULTI                                |
---------------------------------------------------------------------------------------------------
Property Key Name              | Cardinality | Data Type                                          |
---------------------------------------------------------------------------------------------------
dist                           | SINGLE      | class java.lang.Integer                            |
coords                         | SINGLE      | class org.janusgraph.core.attribute.Geoshape       |
code                           | SINGLE      | class java.lang.String                             |
type                           | SINGLE      | class java.lang.String                             |
desc                           | SINGLE      | class java.lang.String                             |
country                        | SINGLE      | class java.lang.String                             |
longest                        | SINGLE      | class java.lang.Integer                            |
city                           | SINGLE      | class java.lang.String                             |
elev                           | SINGLE      | class java.lang.Integer                            |
icao                           | SINGLE      | class java.lang.String                             |
lon                            | SINGLE      | class java.lang.Double                             |
region                         | SINGLE      | class java.lang.String                             |
runways                        | SINGLE      | class java.lang.Integer                            |
lat                            | SINGLE      | class java.lang.Double                             |
---------------------------------------------------------------------------------------------------
Vertex Index Name              | Type        | Unique    | Backing        | Key:           Status |
---------------------------------------------------------------------------------------------------
byCodeComposite                | Composite   | false     | internalindex  | code:         ENABLED |
---------------------------------------------------------------------------------------------------
Edge Index (VCI) Name          | Type        | Unique    | Backing        | Key:           Status |
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
Relation Index                 | Type        | Direction | Sort Key       | Order    |     Status |
---------------------------------------------------------------------------------------------------

In case you only want some of the information, instead of running mgmt.printSchema(), you can get the output of the individual commands, which are (drumroll) mgmt.printVertexLabels(), mgmt.printEdgeLabels(), mgmt.printPropertyKeys(), and mgmt.printIndexes().

Codify the schema and index creation

To create the schema, I just went down the line adding in everything I saw in printSchema(). For the indexes, I directly copied the code blocks from the getting started series. This script could also be run before the full import so re-indexing would no longer be required.

If you want the script to be idempotent and only create the schema when it doesn’t already exist, you could wrap everything after mgmt = graph.openManagement() in an if block. You just need to check whether one of the elements was already created, and use the result as your conditional.
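A minimal sketch of that check could look like the following. Using the airport vertex label as the sentinel is just my assumption for illustration; any element the script creates would work. It also assumes graph has already been opened, as in the script that follows.

```groovy
// Assumes `graph` is already open, as in the full script that follows.
mgmt = graph.openManagement()

// Hypothetical sentinel: treat the 'airport' vertex label as proof that
// the schema script has already been run against this graph.
if (!mgmt.containsVertexLabel('airport')) {
    mgmt.makeVertexLabel('airport').make()
    // ... remaining labels, property keys, and indexes go here ...
    mgmt.commit()
    println 'created schema'
} else {
    // Nothing to change, so release the management transaction.
    mgmt.rollback()
    println 'schema already exists, skipping'
}
```

Rolling back in the else branch matters: leaving the management transaction open is exactly the kind of stale session that can get an index stuck later.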

Whether you choose to use GraphML or GraphSON, you should create your schema and indexes prior to importing. As a bonus, you get the benefit of type checking when the schema exists before the import. Just set schema.default = none, which disables automatic type creation, so mismatches will fail. This is discussed in the batch loading documentation.

:remote connect tinkerpop.server conf/remote.yaml session
:remote console
graph = ConfiguredGraphFactory.open("airroutes")

g = graph.traversal()

// Create graph schema and indexes, if they haven't already been created
mgmt = graph.openManagement()

// Create vertex Labels
mgmt.makeVertexLabel('version').make();
mgmt.makeVertexLabel('airport').make();
mgmt.makeVertexLabel('country').make();
mgmt.makeVertexLabel('continent').make();
// Create edge labels
mgmt.makeEdgeLabel('route').multiplicity(MULTI).make();
mgmt.makeEdgeLabel('contains').multiplicity(MULTI).make();
// Create property keys
dist = mgmt.makePropertyKey('dist').dataType(Integer.class).cardinality(Cardinality.SINGLE).make();
coords = mgmt.makePropertyKey('coords').dataType(org.janusgraph.core.attribute.Geoshape.class).cardinality(Cardinality.SINGLE).make();
code = mgmt.makePropertyKey('code').dataType(String.class).cardinality(Cardinality.SINGLE).make();
type = mgmt.makePropertyKey('type').dataType(String.class).cardinality(Cardinality.SINGLE).make();
desc = mgmt.makePropertyKey('desc').dataType(String.class).cardinality(Cardinality.SINGLE).make();
country = mgmt.makePropertyKey('country').dataType(String.class).cardinality(Cardinality.SINGLE).make();
longest = mgmt.makePropertyKey('longest').dataType(Integer.class).cardinality(Cardinality.SINGLE).make();
city = mgmt.makePropertyKey('city').dataType(String.class).cardinality(Cardinality.SINGLE).make();
elev = mgmt.makePropertyKey('elev').dataType(Integer.class).cardinality(Cardinality.SINGLE).make();
icao = mgmt.makePropertyKey('icao').dataType(String.class).cardinality(Cardinality.SINGLE).make();
lon = mgmt.makePropertyKey('lon').dataType(Double.class).cardinality(Cardinality.SINGLE).make();
region = mgmt.makePropertyKey('region').dataType(String.class).cardinality(Cardinality.SINGLE).make();
runways = mgmt.makePropertyKey('runways').dataType(Integer.class).cardinality(Cardinality.SINGLE).make();
lat = mgmt.makePropertyKey('lat').dataType(Double.class).cardinality(Cardinality.SINGLE).make();
// Create indexes
mgmt.buildIndex('byCodeComposite', Vertex.class).addKey(code).buildCompositeIndex();
mgmt.buildIndex('coordsIndex', Vertex.class).addKey(coords, Mapping.PREFIX_TREE.asParameter()).buildMixedIndex("search");
mgmt.buildIndex('quadTextString', Vertex.class).addKey(code, Mapping.TEXTSTRING.asParameter()).addKey(icao, Mapping.TEXTSTRING.asParameter()).addKey(desc, Mapping.TEXTSTRING.asParameter()).addKey(city, Mapping.TEXTSTRING.asParameter()).buildMixedIndex("search");
mgmt.commit();
println 'waiting for byCodeComposite index to be ready';
mgmt.awaitGraphIndexStatus(graph, 'byCodeComposite').call();
println 'waiting for coordsIndex index to be ready';
mgmt.awaitGraphIndexStatus(graph, 'coordsIndex').call();
println 'waiting for quadTextString index to be ready';
mgmt.awaitGraphIndexStatus(graph, 'quadTextString').call();
println 'created schema';

If you are using JanusGraphFactory instead of ConfiguredGraphFactory, just remove the :remote lines and update the graph variable to point to a properties file.

graph = JanusGraphFactory.open("conf/janusgraph-cassandra-es.properties")

Import the GraphML or GraphSON file you just created

GraphSON import

While exporting was a bit harder, importing the GraphSON file was straightforward.

graph.io(graphson()).readGraph('/tmp/airroutes_us.json');

GraphML import

Even after creating the schema, GraphML was not straightforward to import. Since Geoshape is a JanusGraph-specific data type, it’s not supported by GraphML and gets exported as a String. As a workaround, it was easier to remove the coords property from the GraphML file with some sed commands. To make the sed regex non-greedy, I found this blog by Christoph Sieghart via a Stack Overflow question, which was extremely helpful. Without it I would have had to resort to writing a Perl or Python script.

sed -i -r 's/<data key="coords">POINT[^>]*>//g' /tmp/contigous_us_airroutes.graphml
sed -i -r 's/<key id="coords" for="node" attr.name="coords" attr.type="string"\/>//g' /tmp/contigous_us_airroutes.graphml

In case you aren’t familiar with sed, -i is for in-place and will update the file you’re working with. The -r flag enables extended regular expressions. With GNU sed the two flags need to be passed separately, because -ir is read as -i with a backup suffix of r. The [^>]* expression is what keeps the match non-greedy: it consumes everything up to the next > rather than the last one on the line.

To re-populate the coords property I just ran the command I used back in part 2.

g.V().local( __.as('x').map{ Geoshape.point(it.get().values('lat').next(), it.get().values('lon').next()) }.as('coords').select('x').property('coords', select('coords')) ).iterate()

In case it wasn’t obvious at this point, you should use GraphSON if you can.

Conclusion

Working with graph databases is an iterative process. You want to be able to fail fast and restart quickly when you’re trying something new. So write reusable scripts for imports and schema creation. As someone who has had import jobs run for an entire day only to discover that a .next() step was missing, I have another recommendation.

If you have a large dataset to ingest, use a small subset of the data to test out your import scripts. Also save those scripts in git or some other version control system, and use feature branches while troubleshooting. If you had to do a search or reference documentation to write the traversal, be kind to future you and write a comment about what’s going on.

Remember, open source is a community effort. Reach out for help and join the conversation in whichever format you prefer, including: janusgraph-users (usually this is where you’ll find me), gremlin-users, Stack Overflow, Gitter, or even GitHub if you want to help contribute.