In this article, I will briefly introduce data governance, Apache Atlas, and JanusGraph. I will then describe Apache Atlas’ type and entity system and how it is modeled as a property graph stored in JanusGraph. After reading this article, you will have a basic understanding of graph-based metadata management in enterprise data governance, with Apache Atlas as a prime example.

Enterprise data governance is the overall management of data availability, relevancy, usability, integrity, and security in the enterprise. It enables organizations to use their data effectively and efficiently while also meeting regulatory and compliance requirements. Take a business report as an example. A business report is generated by a series of queries against data from data sources, so the quality of the report depends on the quality of that data. We can define and store metadata for the report, such as the data assets it involves and their lineage. We can then measure the quality of the report on the basis of those data assets, for example, the completeness and accuracy of their data fields. Another example is PII (Personally Identifiable Information), which must be hidden to protect privacy. We can define a classification called ‘PII’ and set up governance rules for it, for example, masking the classified fields during ETL.
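To make the PII rule described above concrete, here is a minimal Python sketch of masking classified fields while rows pass through an ETL step. The `CLASSIFICATIONS` mapping, the `mask_row` function, the sample columns, and the masking strategy (replacing the value with a fixed string) are all assumptions made for illustration; they are not part of Apache Atlas or any other product.

```python
# Hypothetical metadata: which columns carry which classifications.
CLASSIFICATIONS = {
    "customer.email": {"PII"},
    "customer.ssn": {"PII"},
    "customer.country": set(),
}

MASK = "****"  # assumed masking strategy: replace the value outright


def mask_row(table, row, classifications=CLASSIFICATIONS):
    """Return a copy of row with every PII-classified field masked."""
    masked = {}
    for column, value in row.items():
        tags = classifications.get(f"{table}.{column}", set())
        masked[column] = MASK if "PII" in tags else value
    return masked


row = {"email": "a@example.com", "ssn": "123-45-6789", "country": "DE"}
masked_row = mask_row("customer", row)
print(masked_row)
```

The point of the sketch is that the ETL code never hard-codes which columns are sensitive. It only consults the classification metadata, so tagging a new column as ‘PII’ is enough to have it masked.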

At the center of modern data governance is the metadata system, which describes the data and collects, stores, and exchanges the metadata.

image of circle marked Data Governance surrounding a smaller circle titled Metadata

Metadata is the description and consolidated catalog of the enterprise’s data assets across its various data sources. On top of the data assets, it can also include the following:

  • Classifications
  • Business Terms and Categories
  • Relationships / Lineage
  • Policies and Rules
  • Report definitions
  • Machine learning model definitions

Spider graph showing metadata at the center of the other functions

Apache Atlas is a data governance and metadata framework for Hadoop.

process flow graph

Apache Atlas can be summarized as:

  • Type and Entity system to define metadata.
  • Graph repository to store metadata (JanusGraph).
  • Search capability based on Apache Solr.
  • Notification service based on Apache Kafka.
  • APIs to populate and query metadata (REST API).

Before we come back to Apache Atlas’ type and entity system and how it is modeled as a graph in JanusGraph, I want to briefly introduce JanusGraph.

JanusGraph is a scalable graph database with pluggable storage and indexing. It supports the property graph model (vertices, edges, properties) and is fully compliant with the Apache TinkerPop graph traversal and computing framework.
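For readers new to the property graph model, the following Python sketch shows its three building blocks in memory: vertices and edges that each carry a label and a set of key-value properties. The `Vertex`, `Edge`, and `PropertyGraph` classes are hypothetical and written only for this illustration; they do not reflect how JanusGraph stores data.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)  # simple id generator for the sketch


@dataclass
class Vertex:
    label: str
    properties: dict = field(default_factory=dict)
    id: int = field(default_factory=lambda: next(_ids))


@dataclass
class Edge:
    label: str
    out_v: Vertex  # the vertex the edge leaves
    in_v: Vertex   # the vertex the edge points to
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    def __init__(self):
        self.vertices = []
        self.edges = []

    def add_vertex(self, label, **props):
        vertex = Vertex(label, dict(props))
        self.vertices.append(vertex)
        return vertex

    def add_edge(self, label, out_v, in_v, **props):
        edge = Edge(label, out_v, in_v, dict(props))
        self.edges.append(edge)
        return edge

    def out(self, vertex, label):
        """Follow outgoing edges with the given label."""
        return [e.in_v for e in self.edges
                if e.out_v is vertex and e.label == label]


graph = PropertyGraph()
jupiter = graph.add_vertex("god", name="jupiter", age=5000)
sky = graph.add_vertex("location", name="sky")
graph.add_edge("lives", jupiter, sky, reason="loves fresh breezes")
homes = [v.properties["name"] for v in graph.out(jupiter, "lives")]
print(homes)
```

Note that properties live on edges as well as on vertices, which is what distinguishes a property graph from a plain labeled graph.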

There are primarily two ways to interact with JanusGraph. First, we can use the generic TinkerPop Gremlin API to transact and traverse the graph in JanusGraph. The following is an example.

// Open the graph and obtain a traversal source
graph = GraphFactory.open(...)
g = graph.traversal()
// Add two vertices with properties
jupiter = g.addV("god").property("name", "jupiter").property("age", 5000).next()
sky = g.addV("location").property("name", "sky").next()
// Add a "lives" edge from jupiter to sky
g.V(jupiter).as("a").V(sky).addE("lives").property("reason", "loves fresh breezes").from("a").next()
// Commit the transaction, then read the vertex back
g.tx().commit()
g.V().has("name", "jupiter").valueMap(true).tryNext()

Second, JanusGraph has its own APIs that applications can use to interact with it. The JanusGraph management APIs are used to define, update, and inspect the schema of a graph, as well as to create, update, and inspect graph indexes. The following are some examples:

graph = JanusGraphFactory.open(...)
mgmt = graph.openManagement()

mgmt.makeVertexLabel(...)
mgmt.makeEdgeLabel(...)
mgmt.makePropertyKey(...)

mgmt.getVertexLabels()
mgmt.getRelationTypes(EdgeLabel)
mgmt.getRelationTypes(PropertyKey)
mgmt.getGraphIndexes(Edge.class)
mgmt.getGraphIndexes(Vertex.class)
mgmt.getPropertyKey('__superTypeNames').cardinality()
mgmt.getGraphIndex('Asset.name__typeName').isCompositeIndex()
mgmt.getGraphIndex('Asset.name__typeName').getFieldKeys()

JanusGraph direct query APIs can be used to query a graph directly. JanusGraphQuery is a graph-centric query API designed to retrieve vertices, edges, or properties from a graph. JanusGraphVertexQuery is a vertex-centric query API executed for a single vertex to query relationships for the vertex. JanusGraphIndexQuery is a direct index query API against an index backend bypassing regular graph traversal. The following is an example of a direct index query.

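// p1 and p2 are PropertyKey objects; "search" names the configured index backend
mgmt.buildIndex("mixedIndex", Vertex.class)
    .addKey(p1).addKey(p2).buildMixedIndex("search")
mgmt.commit()
// Query the index backend directly, bypassing graph traversal
graph.indexQuery("mixedIndex", "v.*:text").vertexStream().findFirst()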

Now let us discuss Apache Atlas’s type and entity system, and how it is mapped to a graph in JanusGraph.

In Atlas, a Type is the definition of a metadata object, and an Entity is an instance of a metadata object. For example, ‘hive_table’ is a type in Atlas, and ‘demo_table’ is an entity of that type. Atlas has the following system base types pre-defined:

  • Referenceable
  • Asset
  • DataSet
  • Infrastructure
  • Process

Atlas also provides the following models, which define types on top of the base types:

  • Glossary model
  • RDBMS model
  • Hive model
  • HBase model
  • HDFS model
  • Kafka model and others

For example, ‘hive_table’, which comes with the Hive model, is a subtype of DataSet.

The type system is extensible, which means custom types can be defined and added.
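The relationship between types, supertypes, and entities can be sketched in a few lines of Python. This is only a conceptual model: the `TYPES` registry and the `all_supertypes`, `register_type`, and `create_entity` functions are hypothetical names, and the supertype chain shown (DataSet, Infrastructure, and Process under Asset, and Asset under Referenceable) reflects my understanding of the Atlas base model rather than anything stated in this article. The custom type `demo_report` is invented for the example.

```python
# Hypothetical in-memory registry: type name -> direct supertypes.
TYPES = {
    "Referenceable": [],
    "Asset": ["Referenceable"],
    "DataSet": ["Asset"],
    "Infrastructure": ["Asset"],
    "Process": ["Asset"],
    "hive_table": ["DataSet"],
}


def all_supertypes(type_name, types=TYPES):
    """Collect every direct and indirect supertype of type_name."""
    result = []
    for parent in types[type_name]:
        if parent not in result:
            result.append(parent)
        for ancestor in all_supertypes(parent, types):
            if ancestor not in result:
                result.append(ancestor)
    return result


def register_type(name, supertypes, types=TYPES):
    """Add a custom type; every supertype must already exist."""
    missing = [s for s in supertypes if s not in types]
    if missing:
        raise ValueError(f"unknown supertypes: {missing}")
    types[name] = list(supertypes)


def create_entity(type_name, **attributes):
    """An entity is an instance of a type plus attribute values."""
    if type_name not in TYPES:
        raise ValueError(f"unknown type: {type_name}")
    return {
        "typeName": type_name,
        "superTypeNames": all_supertypes(type_name),
        "attributes": attributes,
    }


demo_table = create_entity("hive_table", name="demo_table")
register_type("demo_report", ["DataSet"])  # extending the type system
report = create_entity("demo_report", name="q1_sales")
print(demo_table["superTypeNames"])
```

Storing the flattened supertype list on each entity makes it cheap to answer questions such as "find every DataSet", whatever the concrete type of each entity is.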

The following is a subgraph of the base and Hive types as they are stored in JanusGraph. Each type is represented as a vertex, and the relationships among the types are represented as edges. In this subgraph, each vertex is displayed with the name of its type, and each edge is displayed with its edge label.

A hive_table type and its related types as they are stored in JanusGraph.

A subgraph of the above graph shows the ‘hive_table’ type and its related types as they are stored in JanusGraph.

A subgraph of the graph shown in the above image

Turning to entities, let’s create a Hive table called ‘demo_table’:

CREATE TABLE demo_table (
    col1 INT,
    col2 STRING)

This table will show up in Atlas as an entity of type ‘hive_table’ via the Hive hook, which connects the Hive instance with the Atlas server.

screen capture of the creation of demo_table hive_table

The ‘demo_table’ entity will show up as a vertex in JanusGraph. The following subgraph shows that vertex and the edges that connect it to the other entities (columns, database, etc.).

subgraph of hive_table

Let’s create another table to demonstrate the lineage between two entities.

CREATE TABLE demo_table2
AS
(SELECT col1 FROM demo_table)

This will show up in Atlas as a lineage between demo_table2 and demo_table.

screen capture of the demo_table lineage tab

The lineage is represented as a ‘Process’ entity that connects demo_table and demo_table2. There is also column-level lineage that connects the source column to the target column. Note that both columns are called ‘col1’, but they belong to demo_table and demo_table2 respectively.

The following subgraph adds the new vertices to the graph shown above. The vertices are displayed with the type and name of the entities.

subgraph with new vertex added to the graph shown above

Next, let’s create a classification called ‘demo_PII’ and tag demo_table’s col1 with it. This is done in the Atlas UI.

screen capture of the col1 on hive_column

In JanusGraph, a vertex called ‘demo_PII’ is created, and it is connected to the ‘col1’ vertex via an edge labeled ‘classifiedAs’.

graph showing connections to demo_PII

We enabled propagation when we tagged demo_table’s col1. Therefore, demo_table2’s col1 is also classified as ‘demo_PII’.
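Conceptually, propagation is a walk along the lineage: the classification is attached to the tagged entity and then to everything derived from it. The following Python sketch illustrates that idea only. The `LINEAGE` mapping and the `classify` function are hypothetical, and the article does not show how Atlas implements propagation internally, so this should not be read as its actual algorithm.

```python
from collections import deque

# Hypothetical column-level lineage: source column -> columns derived from it.
LINEAGE = {
    "demo_table.col1": ["demo_table2.col1"],
    "demo_table2.col1": [],
}


def classify(entity, classification, tags, lineage=LINEAGE, propagate=True):
    """Tag an entity and, if requested, everything downstream of it."""
    tags.setdefault(entity, set()).add(classification)
    if not propagate:
        return tags
    seen = {entity}
    queue = deque([entity])
    while queue:  # breadth-first walk over downstream entities
        current = queue.popleft()
        for downstream in lineage.get(current, []):
            if downstream not in seen:
                seen.add(downstream)
                tags.setdefault(downstream, set()).add(classification)
                queue.append(downstream)
    return tags


tags = classify("demo_table.col1", "demo_PII", {})
print(tags)
```

With `propagate=False`, only the column that was tagged directly carries the classification, which corresponds to leaving propagation disabled when tagging.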

Last, let’s look into the graph labels, indexes, and graph queries for Apache Atlas.

Atlas defines many edge labels to represent relationships among entities.

mgmt.getRelationTypes(EdgeLabel)
    classifiedAs
    __Process.inputs
    __Process.outputs
    ____AtlasUserProfile.savedSearches
    __hive_table.db
    __hive_table.sd
    __hive_table.partitionKeys
    __hive_table.columns
    __hive_storagedesc.table
    __hive_storagedesc.sortCols
    __hive_column.table
    __hive_column_lineage.query
    __hive_storagedesc.serdeInfo
    ...

The graph has no vertex labels. Instead, the type name is stored as a vertex property, which has the benefit that it can be indexed.

Atlas makes extensive use of graph indexes to speed up searches and queries on entities and relationships.

mgmt.getGraphIndexes(Vertex.class)

    vertex_index
    fulltext_index
    Referenceable.qualifiedName__typeName
    Referenceable.qualifiedName__superTypeNames
    Asset.name__typeName
    Asset.name__superTypeNames
    Asset.owner__typeName
    Asset.owner__superTypeNames
    __guid  
    __typeName
    __superTypeNames
    __createdBy
    __createdBy__typeName
    __createdBy__superTypeNames
    __modifiedBy
    __modifiedBy__typeName
    __modifiedBy__superTypeNames
    ...

The first index, ‘vertex_index’, is a mixed index used by direct index queries for entity search. The second index, ‘fulltext_index’, is a mixed index used by direct index queries for full-text entity search.

mgmt.getGraphIndex('fulltext_index').getFieldKeys()

    entityText

An example of an entityText is “hive_column owner jinghe qualifiedName demo_db.demo_table.col1@primary name col1 type int table”. We can see here that Atlas created a concatenated text that is used to facilitate full text search of entities.
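The shape of that concatenated text can be reproduced with a short Python sketch. This is a guess at the format based solely on the example string above: the `build_entity_text` and `matches` functions are hypothetical, and the real Atlas logic for choosing and ordering attributes may well differ.

```python
def build_entity_text(type_name, attributes):
    """Concatenate the type name with attribute names and values."""
    parts = [type_name]
    for key, value in attributes.items():
        parts.append(key)
        if value is not None:  # attributes without a value contribute only their name
            parts.append(str(value))
    return " ".join(parts)


def matches(entity_text, term):
    """Very rough stand-in for a full-text match: whole-token comparison."""
    return term.lower() in entity_text.lower().split()


text = build_entity_text("hive_column", {
    "owner": "jinghe",
    "qualifiedName": "demo_db.demo_table.col1@primary",
    "name": "col1",
    "type": "int",
    "table": None,
})
print(text)
print(matches(text, "jinghe"))
```

A real index backend such as Solr tokenizes and scores the text far more carefully. The sketch only shows why flattening an entity into one text field makes it searchable by any of its attribute values.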

The index ‘Referenceable.qualifiedName__typeName’ is a composite index with two field keys. Queries that constrain both fields can make use of this index.
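The behavior of a composite index can be pictured as a hash map keyed by the combined values of its field keys. The Python sketch below is a toy model under that assumption: the `CompositeIndex` class and the sample vertices are hypothetical and say nothing about how JanusGraph lays out its indexes on disk. It does reflect one property of JanusGraph composite indexes, namely that they apply only when the query has an equality condition on every key of the index.

```python
from collections import defaultdict


class CompositeIndex:
    """Toy composite index: one hash lookup on the combined key values."""

    def __init__(self, *keys):
        self.keys = keys
        self._entries = defaultdict(list)

    def add(self, vertex):
        # Only vertices that carry every indexed property are indexed.
        if all(key in vertex for key in self.keys):
            combined = tuple(vertex[key] for key in self.keys)
            self._entries[combined].append(vertex)

    def lookup(self, conditions):
        # The index applies only if every index key has an equality condition.
        if not set(self.keys) <= set(conditions):
            raise KeyError("need an equality condition on every index key")
        combined = tuple(conditions[key] for key in self.keys)
        candidates = self._entries.get(combined, [])
        # Any extra conditions are checked against the candidates afterwards.
        return [v for v in candidates
                if all(v.get(k) == val for k, val in conditions.items())]


index = CompositeIndex("Referenceable.qualifiedName", "__typeName")
vertices = [
    {"__typeName": "hive_table",
     "Referenceable.qualifiedName": "demo_db.demo_table@primary"},
    {"__typeName": "hive_column",
     "Referenceable.qualifiedName": "demo_db.demo_table.col1@primary"},
]
for vertex in vertices:
    index.add(vertex)

hits = index.lookup({
    "Referenceable.qualifiedName": "demo_db.demo_table@primary",
    "__typeName": "hive_table",
})
print(hits)
```

A query that supplies only `__typeName` cannot use this index in the sketch, which is why Atlas also defines single-key indexes such as ‘__typeName’.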

Atlas uses both the generic Gremlin API and the JanusGraph-specific APIs to query the graph, depending on the use case and on efficiency. For example, to get the lineage for an entity, the following Gremlin query is issued:

g.V().has('__guid', 'c1a65675-d8d9-4262-9fe4-2a7453c3e312')
    .repeat(__.inE('__Process.outputs').as('e1').outV()
        .outE('__Process.inputs').as('e2').inV())
    .times(3).emit().select('e1', 'e2').toList()
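To see what the Gremlin traversal above does step by step, here is an equivalent walk in plain Python over a tiny edge list. Each hop goes from a data set back to the process that produced it (an incoming ‘__Process.outputs’ edge), then on to that process’s inputs (an outgoing ‘__Process.inputs’ edge), for at most three hops. The `EDGES` list and the `upstream_lineage` function are hypothetical, as are the `proc_ctas`, `proc_load`, and `raw_file` vertices added to make the example two levels deep.

```python
# Hypothetical edges as (out_vertex, label, in_vertex) triples.
EDGES = [
    ("proc_ctas", "__Process.inputs", "demo_table"),
    ("proc_ctas", "__Process.outputs", "demo_table2"),
    ("proc_load", "__Process.inputs", "raw_file"),
    ("proc_load", "__Process.outputs", "demo_table"),
]


def upstream_lineage(start, edges=EDGES, max_hops=3):
    """Return (e1, e2) edge pairs, like select('e1', 'e2') in the traversal."""
    results = []
    frontier = [start]
    for _ in range(max_hops):          # times(3)
        next_frontier = []
        for vertex in frontier:
            for e1 in edges:           # inE('__Process.outputs').outV()
                if e1[1] == "__Process.outputs" and e1[2] == vertex:
                    process = e1[0]
                    for e2 in edges:   # outE('__Process.inputs').inV()
                        if e2[1] == "__Process.inputs" and e2[0] == process:
                            results.append((e1, e2))  # emit()
                            next_frontier.append(e2[2])
        frontier = next_frontier
        if not frontier:
            break
    return results


pairs = upstream_lineage("demo_table2")
sources = [e2[2] for _, e2 in pairs]
print(sources)
```

Each returned pair identifies one process together with one of its inputs, which is the information needed to draw a lineage diagram.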

On the other hand, to find entities by type and property value, Atlas uses a direct JanusGraphQuery:

AtlasGraphQuery query = AtlasGraphProvider.getGraphInstance().query()
    .has("__typeName", "hive_table")
    .has("owner", "root")
    .has("__state", "ACTIVE");
Iterator<AtlasVertex> results = query.vertices().iterator();

To get the classifications of an entity, use JanusGraphVertexQuery. It searches for relationships given a vertex.

entityVertex.query().direction(AtlasEdgeDirection.OUT)
    .label(CLASSIFICATION_LABEL).edges()

Summary

In this article, we used Apache Atlas as an example to explain and demonstrate graph-based metadata management in enterprise data governance. The graph model provides the flexibility to model and store metadata types and data assets.