Understanding JanusGraph Indexes

JanusGraph has three types of indexes. They are composite, mixed, and vertex-centric. In this post we’ll be covering composite indexes and mixed indexes.

(Part 2 of a 4-part JanusGraph series of posts)

Composite Indexes

Composite indexes are essentially a hash table that maps a vertex property value (the key) to the id of the vertex that holds it (the value). Matches have to be exact to use composite indexes. They are quite fast and help optimize the lookup for your traversal’s starting point when you don’t have the vertex id. In fact, if you run a query that does a full graph scan without hitting an index, JanusGraph will generate a warning.
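
To make the hash table idea concrete, here is a toy sketch in plain Python. It is not how JanusGraph stores anything internally; the vertex ids and the helper names (full_scan, build_index, index_lookup) are made up for illustration. It only shows why an exact-match lookup beats a scan.

```python
from collections import defaultdict

# Made-up sample data: vertex id -> properties (not real JanusGraph ids).
vertices = {
    101: {"code": "SFO", "city": "San Francisco"},
    102: {"code": "JFK", "city": "New York"},
    103: {"code": "SJC", "city": "San Jose"},
}


def full_scan(vertices, key, value):
    """No index: inspect every vertex, so cost grows with graph size."""
    return [vid for vid, props in vertices.items() if props.get(key) == value]


def build_index(vertices, key):
    """Build the 'index': map each property value to the ids that hold it."""
    index = defaultdict(list)
    for vid, props in vertices.items():
        if key in props:
            index[props[key]].append(vid)
    return index


def index_lookup(index, value):
    """Indexed: a single exact-match hash lookup."""
    return index.get(value, [])


by_code = build_index(vertices, "code")
print(full_scan(vertices, "code", "SFO"))
print(index_lookup(by_code, "SFO"))
print(index_lookup(by_code, "sfo"))  # exact match only, so nothing is found
```

Because the lookup is an exact match on the key, 'sfo' finds nothing here, which mirrors the exact-match restriction on composite indexes.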

Here is an example of the lookup portion of the query from the end of Part 1.
You can see that running the query without an index results in an initial lookup of around 175ms. Note that the profile() step adds overhead, so the duration is approximate. Also note that after the query runs initially, the caches in JanusGraph and the storage layer are warm, so subsequent runs will most likely perform better.

gremlin> :remote connect tinkerpop.server conf/remote.yaml session
==>Configured localhost/127.0.0.1:8182-[62101a39-934e-46ff-b11a-9fcebb5a282f]
gremlin> graph=ConfiguredGraphFactory.open("airroutes");
==>standardjanusgraph[cassandrathrift:[127.0.0.1]]
gremlin> g=graph.traversal();
==>graphtraversalsource[standardjanusgraph[cassandrathrift:[127.0.0.1]], standard]
gremlin> g.V().has('code', 'SFO').profile()
==>Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
JanusGraphStep([],[code.eq(SFO)])                                      2           2         175.756   100.00
    \_condition=(code = SFO)
    \_isFitted=false
    \_query=[]
    \_orders=[]
    \_isOrdered=true
  optimization                                                                                 0.032
  scan                                                                                         0.000
    \_condition=VERTEX
    \_query=[]
    \_fullscan=true
     >TOTAL                     -           -         175.756        -

 

Now that we have a baseline let’s create a composite index and run the query again. Here are the steps to create a composite index, named byCodeComposite, that will let us quickly locate the vertex id by the code property.

gremlin> graph.tx().rollback()
==>null
gremlin> mgmt = graph.openManagement()
==>org.janusgraph.graphdb.database.management.ManagementSystem@19472803
gremlin> code = mgmt.getPropertyKey('code')
==>code
gremlin> mgmt.buildIndex('byCodeComposite', Vertex.class).addKey(code).buildCompositeIndex()
==>byCodeComposite
gremlin> mgmt.commit()
==>null
gremlin> mgmt.awaitGraphIndexStatus(graph, 'byCodeComposite').call()
==>GraphIndexStatusReport[success=true, indexName='byCodeComposite', targetStatus=[REGISTERED], notConverged={}, converged={code=REGISTERED}, elapsed=PT0.012S]

 

It’s best practice to create your indexes before you load data. In our case we already loaded our data, so we’ll need to run a reindex procedure to populate the index. Here are the commands to reindex:

gremlin> mgmt = graph.openManagement()
==>org.janusgraph.graphdb.database.management.ManagementSystem@60f95d04
gremlin> mgmt.updateIndex(mgmt.getGraphIndex("byCodeComposite"), SchemaAction.REINDEX).get()
==>org.janusgraph.diskstorage.keycolumnvalue.scan.StandardScanMetrics@60553d2d
gremlin> mgmt.commit()
==>null

 

Now that the index is in place let’s see how our lookup times have improved.

gremlin> g.V().has('code', 'SFO').profile()
==>Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
JanusGraphStep([],[code.eq(SFO)])                                      2           2          11.221   100.00
    \_condition=(code = SFO)
    \_isFitted=true
    \_query=multiKSQ[1]@2147483647
    \_index=byCodeComposite
    \_orders=[]
    \_isOrdered=true
  optimization                                                                                 8.884
  backend-query                                                        2                       2.013
    \_query=byCodeComposite:multiKSQ[1]@2147483647
     >TOTAL                     -           -          11.221        -

 

As you can see from the _index=byCodeComposite and backend-query lines above, the query matched the byCodeComposite index and the duration decreased from about 175ms to about 11ms.

Mixed Indexes (or indices if you prefer)

If we want to support typeahead or partial string searches, we’ll need to use a mixed index. A mixed index also lets us show nearby airports as alternate options by using the geo-mapping feature. To use geo-mapping we need a property containing the coordinates. From our origin coordinates we tell JanusGraph to draw a circle with whatever radius we want for our range and return all vertices within the circle.

Enable Mixed Indexes

We’ll need to shut down our JanusGraph Server with ctrl+c and make some changes at this point. When the server is stopped with ctrl+c, JanusGraph will roll back any uncommitted transactions. If you just stop the container, you have a high likelihood of creating stale transactions or even stale management instances (which is bad). Looking back, I realized that the setup in Part 1 doesn’t actually enable Elasticsearch, so we’re going to need to create a properties file for our airroutes database and add it to our gremlin-server-configuration.yaml file to allow the use of mixed indexes.

First let’s add a reference to the new properties file we’re going to create in gremlin-server-configuration.yaml

host: 0.0.0.0
port: 8182
scriptEvaluationTimeout: 180000
channelizer: org.apache.tinkerpop.gremlin.server.channel.WebSocketChannelizer
graphManager: org.janusgraph.graphdb.management.JanusGraphManager
graphs: {
  ConfigurationManagementGraph: conf/janusgraph-cassandra-configurationgraph.properties,
  airroutes: conf/airroutes.properties
}
...

 

Now let’s copy the default properties file for Cassandra with Elasticsearch, conf/janusgraph-cassandra-es.properties

janusgraph-0.2.0-hadoop2 chupman$ cp conf/janusgraph-cassandra-es.properties conf/airroutes.properties

 

Open up the newly created airroutes.properties and add the two lines shown directly below storage.backend, gremlin.graph and graph.graphname:

...
# The primary persistence provider used by JanusGraph.  This is required.
# It should be set one of JanusGraph's built-in shorthand names for its
# standard storage backends (shorthands: berkeleyje, cassandrathrift,
# cassandra, astyanax, embeddedcassandra, cql, hbase, inmemory) or to the
# full package and classname of a custom/third-party StoreManager
# implementation.
#
# Default:    (no default value)
# Data Type:  String
# Mutability: LOCAL
storage.backend=cassandrathrift
gremlin.graph=org.janusgraph.core.ConfiguredGraphFactory
graph.graphname=airroutes
...

Review the Mixed Index Query Parameters

Now that our airroutes graph is using Elasticsearch, I suggest taking a minute to read up on how we use mixed indexes. Here are the docs on the different text search options. To give a brief overview, TEXT is case insensitive but will tokenize, or split up, the text so that each word lands in a separate bucket. Next, an index with a type of STRING will do an exact, case-sensitive match, but will not split up your string. Finally, TEXTSTRING stores both together as a single index, allowing you to search both types in a single query, but is currently only available when using Elasticsearch.
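
If the difference between TEXT and STRING is hard to picture, the rough Python model below may help. It is only a sketch: the tokenize function is my own stand-in for a full-text analyzer (lowercase, split on anything that isn’t a letter or digit), and real Elasticsearch analyzers do more than this.

```python
import re


def tokenize(text):
    """Stand-in analyzer (an assumption): lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def text_contains(stored, word):
    """TEXT-style match: the word must equal one of the tokens, case ignored."""
    return word.lower() in tokenize(stored)


def string_equals(stored, query):
    """STRING-style match: the whole value must match exactly, case included."""
    return stored == query


desc = "San Jose Airport"
print(tokenize(desc))
print(text_contains(desc, "JOSE"))              # a whole word matches, any case
print(text_contains(desc, "jo"))                # a partial word is not a token
print(string_equals(desc, "San Jose Airport"))  # exact value matches
print(string_equals(desc, "san jose airport"))  # case differs, no match
```

A TEXTSTRING mapping keeps both views of the same value, so either style of predicate can be used against it.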

Currently there is no index option that will natively let you perform a case-insensitive query on an entire string. That being said, you can easily add an extra step to convert the query string into a regular expression that is case insensitive. If I want to search for San Jose International Airport and type ‘san jo’ into the search box, a textRegex query on that raw string would return no matches. If I have my application iterate through the string ‘san jo’ and convert each letter into an uppercase and lowercase pair in brackets, like ‘[Ss][Aa][Nn] [Jj][Oo]’, I’ll get the desired results. If you are unfamiliar with regular expressions, the bracket [Ss] is saying that a single character should match uppercase S or lowercase s. There are a lot of tools that help troubleshoot regular expressions; I personally have regex101 bookmarked and use it fairly often.
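
Here is a minimal sketch of that conversion step in Python. The function name case_insensitive_pattern and the list of characters to escape are my own assumptions (the list is what I understand Lucene-style regular expressions to reserve), so adjust it for your own application.

```python
# Characters assumed to be special in Lucene-style regular expressions.
SPECIAL = set('.?+*|{}[]()"\\#@&<>~')


def case_insensitive_pattern(text):
    """Turn 'san jo' into '[Ss][Aa][Nn] [Jj][Oo]'."""
    parts = []
    for ch in text:
        lower, upper = ch.lower(), ch.upper()
        if lower != upper and len(lower) == 1 and len(upper) == 1:
            # A letter with distinct cases: accept either one.
            parts.append("[" + upper + lower + "]")
        elif ch in SPECIAL:
            # Escape regex metacharacters typed by the user.
            parts.append("\\" + ch)
        else:
            # Digits, spaces, and other characters pass through unchanged.
            parts.append(ch)
    return "".join(parts)


print(case_insensitive_pattern("san jo"))
```

Escaping the special characters matters because the text comes from a search box, and a stray ‘(’ or ‘*’ would otherwise change the meaning of the pattern.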

Create the TEXTSTRING Index

Here are the steps to create a TEXTSTRING index named quadTextString. This index will include the code, icao, desc, and city properties. The icao and code properties will let us search the two different airport codes we have in our data. The city property lets us search by the name of the city that the airport resides in. Finally, the desc property will have additional details we may wish to search on.

gremlin> graph.tx().rollback()
==>null
gremlin> mgmt = graph.openManagement()
==>org.janusgraph.graphdb.database.management.ManagementSystem@19472803
gremlin> code = mgmt.getPropertyKey('code')
==>code
gremlin> icao = mgmt.getPropertyKey('icao')
==>icao
gremlin> desc = mgmt.getPropertyKey('desc')
==>desc
gremlin> city = mgmt.getPropertyKey('city')
==>city
gremlin> mgmt.buildIndex('quadTextString', Vertex.class).addKey(code, Mapping.TEXTSTRING.asParameter()).addKey(icao, Mapping.TEXTSTRING.asParameter()).addKey(desc, Mapping.TEXTSTRING.asParameter()).addKey(city, Mapping.TEXTSTRING.asParameter()).buildMixedIndex("search")
==>quadTextString
gremlin> mgmt.commit()
==>null
gremlin> mgmt.awaitGraphIndexStatus(graph, 'quadTextString').call() // Block until status moves out of INSTALLED.
==>GraphIndexStatusReport[success=true, indexName='quadTextString', targetStatus=[REGISTERED], notConverged={}, converged={code=REGISTERED}, elapsed=PT0.012S]

 

Now that the index is created we’ll run a reindex procedure

gremlin> mgmt = graph.openManagement()
==>org.janusgraph.graphdb.database.management.ManagementSystem@60f95d04
gremlin> mgmt.updateIndex(mgmt.getGraphIndex("quadTextString"), SchemaAction.REINDEX).get()
==>org.janusgraph.diskstorage.keycolumnvalue.scan.StandardScanMetrics@60553d2d
gremlin> mgmt.commit()
==>null

 

First, I’ll query our new index with .profile() appended so we can verify that our new quadTextString index is being used. If you don’t see the query using the index, scroll down to the “To see the status of an index” section to learn how to see what state the index is currently in.

gremlin> g.V().or(has('desc', textRegex('.*[Ss][Aa][Nn] [Jj][Oo].*')), has('code', textRegex('.*[Ss][Aa][Nn] [Jj][Oo].*')), has('icao', textRegex('.*[Ss][Aa][Nn] [Jj][Oo].*')), has('city', textRegex('.*[Ss][Aa][Nn] [Jj][Oo].*'))).propertyMap('code', 'desc').profile()
==>Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
Or(JanusGraphStep([],[desc.textRegex(.*[Ss][Aa]...                     6           6           0.487    76.61
    \_condition=((desc textRegex .*[Ss][Aa][Nn] [Jj][Oo].*) OR (code textRegex .*[Ss][Aa][Nn] [Jj][Oo
               ].*) OR (icao textRegex .*[Ss][Aa][Nn] [Jj][Oo].*) OR (city textRegex .*[Ss][Aa][Nn] [Jj][Oo].
               *))
    \_isFitted=false
    \_query=[((desc textRegex .*[Ss][Aa][Nn]  [Jj][Oo].*) OR (code textRegex .*[Ss][Aa][Nn] [Jj][Oo].*
           ) OR (icao textRegex .*[Ss][Aa][Nn] [Jj][Oo].*) OR (city textRegex .*[Ss][Aa][Nn] [Jj][Oo].*))]:qu
           adTextString
    \_index=quadTextString
    \_orders=[]
    \_isOrdered=true
    \_index_impl=search
  optimization                                                                                 0.133
PropertyMapStep([code, desc],property)                                 6           6           0.148    23.39
                                            >TOTAL                     -           -           0.636        -

 

Now we’ll run the query normally so we can see what results our typeahead would show after being converted into a case insensitive regex. If the query looks confusing, don’t worry, it will be explained in detail later on.

gremlin> g.V().or(has('desc', textRegex('.*[Ss][Aa][Nn] [Jj][Oo].*')), has('code', textRegex('.*[Ss][Aa][Nn] [Jj][Oo].*')), has('icao', textRegex('.*[Ss][Aa][Nn] [Jj][Oo].*')), has('city', textRegex('.*[Ss][Aa][Nn] [Jj][Oo].*'))).propertyMap('code', 'desc')	 	 
==>{code=[vp[code->SJC]], desc=[vp[desc->Norman Y. Mineta ...]]}
==>{code=[vp[code->SJE]], desc=[vp[desc->Jorge E. Gonzalez...]]}
==>{code=[vp[code->SJO]], desc=[vp[desc->San Jose, Juan Sa...]]}
==>{code=[vp[code->SYQ]], desc=[vp[desc->Tobias Bolanos In...]]}
==>{code=[vp[code->SJD]], desc=[vp[desc->Los Cabos Interna...]]}
==>{code=[vp[code->SJI]], desc=[vp[desc->San Jose Airport]]}

Create the Geospatial Index

Now to create the geospatial index we’ll create a coords property and index.

gremlin> graph.tx().rollback()
==>null
gremlin> mgmt = graph.openManagement()
==>org.janusgraph.graphdb.database.management.ManagementSystem@5d7c66d8
gremlin> coords = mgmt.makePropertyKey('coords').dataType(Geoshape.class).make()
==>coords
gremlin> mgmt.buildIndex('coordsIndex', Vertex.class).addKey(coords, Mapping.PREFIX_TREE.asParameter()).buildMixedIndex("search")
==>coordsIndex
gremlin> mgmt.commit()
==>null

 

We’ll populate the coords property later by combining the lat and lon properties in the “Diving into Graph Traversals” section.

List the Indexes

List Composite Indexes

I wanted to link to the docs for the commands to list all the indexes on a graph, but it’s not currently documented. I had to dig a bit in the Javadoc API to find the answer, but I opened an issue and will probably work on it once I get back from my imminent paternity leave. The management class has a function named getGraphIndexes that takes a single parameter: a class that extends Element.

mgmt.getGraphIndexes(Vertex.class) // This is the one you'll want to run
mgmt.getGraphIndexes(Edge.class) // Graph indexes defined on edges (vertex-centric indexes are retrieved separately with mgmt.getRelationIndexes)

List Mixed Indexes

To view the indexes created on Elasticsearch you can browse to http://127.0.0.1:9200/_cat/indices?v . If you’re curious what that URL means, you can check out the cat API page in the Elasticsearch docs. Basically we’re pointing to the Elasticsearch endpoint, telling it to output in a human-readable format (_cat), specifying indexes (indices), and then passing ?v (verbose) to get headers to label the fields. If you see a health of yellow, it’s because you’re not running enough nodes for data replication. Adding enough Elasticsearch nodes to the cluster to hold the replica shards would turn it green.

To see the status of an index

The command below is a bit longer than strictly necessary, but it won’t run the risk of timing out. I’m explicitly naming all four index states so it’ll show me the state no matter what it is. In this case you can see all four properties are in the desired state, ENABLED.

mgmt.awaitGraphIndexStatus(graph, 'quadTextString').status(SchemaStatus.ENABLED, SchemaStatus.REGISTERED, SchemaStatus.DISABLED, SchemaStatus.INSTALLED).call()
==>GraphIndexStatusReport[success=true, indexName='quadTextString', targetStatus=[ENABLED, REGISTERED, DISABLED, INSTALLED], notConverged={}, converged={code=ENABLED, city=ENABLED, icao=ENABLED, desc=ENABLED}, elapsed=PT0.021S]

Diving into graph traversals

To get started, we’ll connect to the gremlin server and graph we created and populated in Part 1 and look up the vertex properties. You can find a description of the properties in Kelvin Lawrence’s graph book, or by opening and perusing the graphml file.

// Reminder on how to connect to the graph and setup traversals
gremlin> :remote connect tinkerpop.server conf/remote.yaml session
==>Configured localhost/127.0.0.1:8182-[ac75ae6d-a9ee-485d-9368-f5038134d895]
gremlin> :remote console
==>All scripts will now be sent to Gremlin Server - [localhost/127.0.0.1:8182]-[ac75ae6d-a9ee-485d-9368-f5038134d895] - type ':remote console' to return to local mode
gremlin> graph = ConfiguredGraphFactory.open('airroutes')
==>standardjanusgraph[cassandrathrift:[127.0.0.1]]
gremlin> g = graph.traversal()
==>graphtraversalsource[standardjanusgraph[cassandrathrift:[127.0.0.1]], standard]
// Now we can actually start looking at the data we loaded in Part 1, let's check the properties associated with a sample airport
// Note you can also see vertex properties in other formats by appending .valueMap(true) or .propertyMap() rather than .properties() 
gremlin> g.V().has('code', 'SFO').properties()
==>vp[code->SFO]  // This is the property we used to find the vertex
==>vp[type->airport]
==>vp[desc->San Francisco Intern]
==>vp[country->US]  // Country will be a very useful property for filtering
==>vp[longest->11870]
==>vp[city->San Francisco]
==>vp[elev->13]
==>vp[icao->KSFO]
==>vp[lon->-122.375]  // longitude coords
==>vp[region->US-CA]  // State the airport is contained within
==>vp[runways->4]
==>vp[lat->37.6189994812012]  // latitude coords

 

The map I’ll be using in the Part 4 demo app only has the contiguous United States. I wrote the query below to see how many airports I’ll be dealing with. It starts with g.V(), getting all vertices in the graph. Then we filter with .has('country', 'US') to only show airports in the US. Finally, .not(has('region', within('US-AL', 'US-HI'))) is meant to filter out airports in Alaska and Hawaii since they aren’t included in our map. Be aware that US-AL is the region code for Alabama; Alaska is US-AK, so swap in 'US-AK' to actually exclude Alaska, and expect a count different from the one shown below.

gremlin> g.V().has('country', 'US').not(has('region', within('US-AL', 'US-HI'))).count()
==>563 // Airports in the Contiguous United States
gremlin> g.V().has('country', 'US').not(has('type', 'airport')).count()
==>0 // Showing that all vertices in the US are airports

Routes Query

Next let’s look at the query used by the API in the demo app that is responsible for finding routes. It isn’t exactly the same as the example at the end of Part 1, but it is very similar.

gremlin> start = 'SFO';
gremlin> dest = 'JFK';
gremlin> hops = 2;
gremlin> limit = 5;
gremlin> g.V().has('code', start).repeat(out('route').simplePath()).times(hops).has('code', dest).path().by(valueMap()).limit(limit);

 

In this example I declared 4 variables that my app populates through text boxes and dropdowns. We’re looking for the vertex whose code property matches the start variable, which in this case is set to ‘SFO’. Once we find the vertex we use the repeat() step to traverse outgoing edges labeled route. simplePath() is then used to prevent visiting the same vertex a second time; otherwise, without a hop limit, we could loop indefinitely. Next, we use times() to tell repeat() how many route edges to traverse before we check for the vertex matching dest, a.k.a. ‘JFK’. The path() step returns the vertices that were traversed, and the by() step tells Gremlin to return a valueMap for each vertex in that path. Finally, the limit() step is used so that only the first 5 results received will be returned.
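
If it helps to see the same logic outside of Gremlin, here is a rough Python sketch of what the traversal does. The route data is made up for illustration, find_routes is a hypothetical helper, and the order of results will not necessarily match what Gremlin returns.

```python
# Made-up route data: airport code -> airports reachable by one route edge.
routes = {
    "SFO": ["JFK", "ORD", "LAX"],
    "ORD": ["JFK", "SFO"],
    "LAX": ["JFK", "SFO"],
    "JFK": ["SFO"],
}


def find_routes(routes, start, dest, hops, limit):
    """Paths of exactly `hops` edges from start to dest that never revisit an airport."""
    results = []

    def walk(path):
        if len(results) >= limit:            # limit(): stop once we have enough
            return
        if len(path) - 1 == hops:            # times(hops): stop after this many edges
            if path[-1] == dest:             # has('code', dest)
                results.append(list(path))   # path(): keep the vertices visited
            return
        for nxt in routes.get(path[-1], []):  # out('route')
            if nxt not in path:               # simplePath(): skip visited vertices
                path.append(nxt)
                walk(path)
                path.pop()

    walk([start])
    return results


for route in find_routes(routes, "SFO", "JFK", hops=2, limit=5):
    print(" -> ".join(route))
```

Note that with hops set to 2 the direct SFO to JFK route is not returned, because times() asks for exactly that many route edges.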

Populate the coords property

I actually needed to ask a colleague (Thanks Jason!) for help with this query. While it would be relatively simple to write a small Groovy script that iterated through all the vertices and populated the coords property with the data from the lat and lon properties, I really wanted it in a single line of Gremlin. This query is definitely beyond beginner-level Gremlin, so I’ll explain it piece by piece below.

gremlin> g.V().local( __.as('x').map{ Geoshape.point(it.get().values('lat').next(), it.get().values('lon').next()) }.as('coords').select('x').property('coords', select('coords')) ).iterate()

 

To start, in case you hadn’t noticed, all traversal steps are separated by a period. Our graph traversal is bound to g. We list all of our vertices with V(). To handle the vertices one by one we use local(). Inside local() we start an anonymous traversal (__) on the individual vertex and use as() to label it ‘x’ so we can refer back to it later. map() allows us to interact with the traverser directly; inside the closure the traverser is referred to as it. The Geoshape.point() method takes two doubles (floating point numbers) for latitude and longitude. We call get() on the traverser to get the vertex, then values() to read the lat and lon properties, and next() to return each value we retrieved. Once we’ve passed both values to Geoshape.point(), we use the as() step to label the result ‘coords’. Then we use select(‘x’) to go back to the vertex, call the property() step with the property name, coords, and use select(‘coords’) to pass in the Geoshape point we built from the latitude and longitude. iterate() runs the traversal without returning any output. Phew.

Typeahead TEXTSTRING Query

For our typeahead query we’re taking a string and checking it against four fields: code, icao, desc, and city.

gremlin> g.V().or(has('desc', textRegex('.*[Ss][Aa][Nn] [Jj][Oo].*')), has('code', textRegex('.*[Ss][Aa][Nn] [Jj][Oo].*')), has('icao', textRegex('.*[Ss][Aa][Nn] [Jj][Oo].*')), has('city', textRegex('.*[Ss][Aa][Nn] [Jj][Oo].*'))).propertyMap('code', 'desc')
==>{code=[vp[code->SJC]], desc=[vp[desc->Norman Y. Mineta ...]]}
==>{code=[vp[code->SJE]], desc=[vp[desc->Jorge E. Gonzalez...]]}
==>{code=[vp[code->SJO]], desc=[vp[desc->San Jose, Juan Sa...]]}
==>{code=[vp[code->SYQ]], desc=[vp[desc->Tobias Bolanos In...]]}
==>{code=[vp[code->SJD]], desc=[vp[desc->Los Cabos Interna...]]}
==>{code=[vp[code->SJI]], desc=[vp[desc->San Jose Airport]]}

 

We’re searching all the vertices with g.V() again. The or() step returns every vertex that matches at least one of the comma-delimited has() steps. For each has() step we’re specifying one of the four properties in our index and performing a textRegex match. The .* at the beginning and the end will match anything, which means we don’t care where ‘san jo’ is found in the string. The final propertyMap(‘code’, ‘desc’) step says to return a map (similar to JSON) with the ‘code’ and ‘desc’ properties for each match.
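
To see why the leading and trailing .* matter, here is a small Python model of the query. It assumes textRegex has to match the whole property value, which I’m imitating with re.fullmatch. Python’s regex dialect is not identical to the one Elasticsearch uses, and the typeahead helper is hypothetical, so treat this as a sketch only. The two sample airports are taken from the query output later in this post.

```python
import re

FIELDS = ("code", "icao", "desc", "city")

airports = [
    {"code": "SJC", "icao": "KSJC", "city": "San Jose",
     "desc": "Norman Y. Mineta San Jose International Airport"},
    {"code": "SFO", "icao": "KSFO", "city": "San Francisco",
     "desc": "San Francisco International Airport"},
]


def typeahead(airports, pattern):
    """or(has(...), ...): keep airports where any field fully matches the pattern."""
    regex = re.compile(pattern)
    return [a for a in airports
            if any(regex.fullmatch(a.get(field, "")) for field in FIELDS)]


with_wildcards = typeahead(airports, ".*[Ss][Aa][Nn] [Jj][Oo].*")
without_wildcards = typeahead(airports, "[Ss][Aa][Nn] [Jj][Oo]")
print([a["code"] for a in with_wildcards])
print([a["code"] for a in without_wildcards])
```

Without the wildcards nothing matches in this model, because no field is exactly ‘san jo’ and nothing more.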

Geospatial Query

To get a list of airports close by we can use geoWithin on our coords property.

gremlin> g.V().has("coords", geoWithin(Geoshape.circle(37.618,-122.375,50))).valueMap(true)
==>{country=[US], code=[SJC], longest=[11000], city=[San Jose], lon=[-121.929000854492], label=airport, type=[airport], id=24648, coord=[POINT (-121.929001 37.362598)], elev=[62], icao=[KSJC], region=[US-CA], runways=[3], coords=[POINT (-121.929001 37.362598)], lat=[37.3625984191895], desc=[Norman Y. Mineta San Jose International Airport]}
==>{country=[US], code=[OAK], longest=[10520], city=[Oakland], lon=[-122.221000671387], label=airport, type=[airport], id=33008, coord=[POINT (-122.221001 37.721298)], elev=[9], icao=[KOAK], region=[US-CA], runways=[4], coords=[POINT (-122.221001 37.721298)], lat=[37.7212982177734], desc=[Oakland]}
==>{country=[US], code=[SFO], longest=[11870], city=[San Francisco], lon=[-122.375], label=airport, type=[airport], id=12520, coord=[POINT (-122.375 37.618999)], elev=[13], icao=[KSFO], region=[US-CA], runways=[4], coords=[POINT (-122.375 37.618999)], lat=[37.6189994812012], desc=[San Francisco International Airport]}

 

In this example I passed rounded values for the latitude and longitude of SFO and a radius of 50 kilometers to Geoshape.circle(). As you can see, SFO, SJC, and OAK were returned.
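
As a sanity check on those results, the sketch below computes great-circle distances in Python with the haversine formula, using the coordinates from the output above. It assumes a spherical Earth with a radius of 6371 km, which is an approximation and not necessarily what the index backend does internally; haversine_km and within_circle are hypothetical helpers.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# (lat, lon) pairs copied from the valueMap output above.
airports = {
    "SFO": (37.618999, -122.375),
    "SJC": (37.362598, -121.929001),
    "OAK": (37.721298, -122.221001),
}


def within_circle(center, radius_km, airports):
    """Codes of airports no farther than radius_km from the center point."""
    return sorted(code for code, (lat, lon) in airports.items()
                  if haversine_km(center[0], center[1], lat, lon) <= radius_km)


center = (37.618, -122.375)  # the rounded SFO coordinates used in the query
print(within_circle(center, 50, airports))
print(within_circle(center, 20, airports))
```

SJC sits just inside the 50 km circle in this calculation, so shrinking the radius a little would drop it from the results.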

Wrapping Up

I originally planned to have some administrative operations in this post, but I decided to split them out and turn this into a four-part series. Part three will have a light overview of some administrative operations, mainly focusing on how to make copies of your environment. Part four will mainly focus on a demo app that’s currently functional, but lacking the typeahead and geospatial queries I’m hoping to implement. I’ll also look at some of the non-DIY options for visualization. Lastly, keep an eye out for JanusGraph 0.2.1 and JanusGraph 0.3.0, which should be out in the near future. Sorry for the wait and I hope you found this helpful.

 

JanusGraph Post Series – Chris Hupman

3 comments on "Getting Started with JanusGraph, Part 2 – Indexes and Traversals"

  1. Albert Lockett August 21, 2018

    Hi is there a reason you can not use (?i) to achieve case insensitive regex?

    g.V().has(‘desc’, textRegex(‘.*(?i)san jo.*’))

  2. Chris_Hupman August 21, 2018

    While I haven’t actually tested it that should work and the performance should be identical. It would also be cleaner to implement on the application. The only downside is whether or not you would remember what (?i) does down the line and would have to google it.

  3. Hi, if I want to set my indices before I load data, how to do? thank you!
    I have set my schema before I load data, but when I finished loading data, there is no indices in the graph!
    After loading data, I try to create indices, but it is very slowly.
