In this article, I briefly introduce data governance, Apache Atlas, and JanusGraph. I then describe Apache Atlas’s type and entity system and how it is mapped to a property graph stored in JanusGraph. After reading this article, you will have a basic understanding of graph-based metadata management in enterprise data governance, with Apache Atlas as a prime example.
Enterprise data governance is the overall management of the availability, relevancy, usability, integrity, and security of data in the enterprise. It enables organizations to use their data effectively and efficiently while also meeting regulatory and compliance requirements. Take a business report as an example. A business report is generated by a series of queries against data from various data sources, so the quality of the report depends on the quality of that data. We can define and store metadata for the report, such as the data assets it involves and their lineage, and then measure the quality of the report on the basis of those data assets, for example the completeness and accuracy of their data fields. Another example is PII (Personally Identifiable Information), which must be hidden to protect privacy. We can define a classification called ‘PII’ and set up governance rules for it, for example masking the classified fields during ETL.
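As a rough illustration of such a governance rule, the sketch below masks every field tagged with a classification during a hypothetical ETL step. The `CLASSIFICATIONS` mapping, the `mask_row` helper, the masking policy, and the sample row are all assumptions made for this example; none of them is part of Apache Atlas.

```python
# Hypothetical metadata: which classification tags each column carries.
CLASSIFICATIONS = {
    "email": {"PII"},
    "ssn": {"PII"},
    "country": set(),
}


def mask_value(value):
    """Replace every character with '*' (an illustrative masking policy)."""
    return "*" * len(str(value))


def mask_row(row, classifications, tag="PII"):
    """Return a copy of the row in which all columns tagged `tag` are masked."""
    return {
        column: mask_value(value) if tag in classifications.get(column, set()) else value
        for column, value in row.items()
    }


row = {"email": "a@b.io", "ssn": "123-45-6789", "country": "DE"}
masked = mask_row(row, CLASSIFICATIONS)
print(masked)
```

The point of the sketch is that the ETL code never hard-codes column names: it only consults the metadata, so tagging a new column is enough to have it masked.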
At the center of modern data governance is the metadata system, which describes the data and collects, stores, and exchanges the metadata.
Metadata is the description and consolidated catalog of the enterprise’s data assets across its various data sources. On top of the data assets, it can also include the following:
- Business Terms and Categories
- Relationships / Lineage
- Policies and Rules
- Report definitions
- Machine learning model definitions
Apache Atlas is a data governance and metadata framework for Hadoop.
Apache Atlas can be summarized as:
- Type and Entity system to define metadata.
- Graph repository to store metadata (JanusGraph).
- Search capability based on Apache Solr.
- Notification service based on Apache Kafka.
- APIs to populate and query metadata (REST API).
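To make the last item concrete, here is a minimal sketch that builds a basic-search request for the Atlas v2 REST API using only the Python standard library. The host, port, and the `build_basic_search_url` helper are assumptions for illustration, and the request is only constructed, not sent.

```python
from urllib.parse import urlencode

# Assumed base URL of an Atlas server; adjust it to your deployment.
ATLAS_BASE_URL = "http://localhost:21000"


def build_basic_search_url(type_name, query=None, limit=10, base_url=ATLAS_BASE_URL):
    """Build the URL for a basic entity search (GET /api/atlas/v2/search/basic)."""
    params = {"typeName": type_name, "limit": limit}
    if query is not None:
        params["query"] = query
    return f"{base_url}/api/atlas/v2/search/basic?{urlencode(params)}"


url = build_basic_search_url("hive_table", query="demo_table")
print(url)
```

An actual call would also need to supply credentials and handle the JSON response, which is left out here.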
Before describing Apache Atlas’s type and entity system and how it is mapped to a graph in JanusGraph, I want to briefly introduce JanusGraph.
JanusGraph is a scalable graph database with pluggable storage and indexing. It supports the property graph model (vertices, edges, properties) and is fully compliant with the Apache TinkerPop graph traversal and computing framework.
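The property graph model itself is small enough to sketch in a few lines. The following in-memory Python classes are an illustration only, with hypothetical names (`PropertyGraph`, `add_vertex`, `add_edge`, `has`); they say nothing about how JanusGraph stores or indexes data internally.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Vertex:
    id: int
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    id: int
    label: str
    out_v: int  # id of the source vertex
    in_v: int   # id of the target vertex
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """A toy property graph: labeled vertices and edges, each with key/value properties."""

    def __init__(self):
        self._ids = count(1)
        self.vertices = {}
        self.edges = {}

    def add_vertex(self, label, **properties):
        vertex = Vertex(next(self._ids), label, properties)
        self.vertices[vertex.id] = vertex
        return vertex

    def add_edge(self, label, out_v, in_v, **properties):
        edge = Edge(next(self._ids), label, out_v.id, in_v.id, properties)
        self.edges[edge.id] = edge
        return edge

    def has(self, key, value):
        """Return all vertices whose property `key` equals `value` (a full scan)."""
        return [v for v in self.vertices.values() if v.properties.get(key) == value]


graph = PropertyGraph()
jupiter = graph.add_vertex("god", name="jupiter", age=5000)
sky = graph.add_vertex("location", name="sky")
lives = graph.add_edge("lives", jupiter, sky, reason="loves fresh breezes")
print(graph.has("name", "jupiter"))
```

Both vertices and edges carry a label and an open set of properties, which is the part of the model that Atlas relies on.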
There are primarily two ways to interact with JanusGraph. First, we can use the generic TinkerPop Gremlin API to transact and traverse the graph in JanusGraph. The following is an example.
    graph = GraphFactory.open(...)
    g = graph.traversal()
    jupiter = g.addV("god").property("name", "jupiter").property("age", 5000).next()
    sky = g.addV("location").property("name", "sky").next()
    g.V(jupiter).as("a").V(sky).addE("lives").property("reason", "loves fresh breezes").from("a").next()
    g.tx().commit()
    g.V().has("name", "jupiter").valueMap(true).tryNext()
Second, JanusGraph has its own APIs that applications can use to interact with it. The JanusGraph management APIs are used to define, update, and inspect the schema of a graph. There are also management APIs to create, update, and inspect graph indexes. The following are some examples:
    graph = JanusGraphFactory.open(…)
    mgmt = graph.openManagement()
    mgmt.makeVertexLabel(…)
    mgmt.makeEdgeLabel(…)
    mgmt.makePropertyKey(…)
    mgmt.getVertexLabels()
    mgmt.getRelationTypes(EdgeLabel)
    mgmt.getRelationTypes(PropertyKey)
    mgmt.getGraphIndexes(Edge.class)
    mgmt.getGraphIndexes(Vertex.class)
    mgmt.getPropertyKey('__superTypeNames').cardinality()
    mgmt.getGraphIndex('Asset.name__typeName').isCompositeIndex()
    mgmt.getGraphIndex('Asset.name__typeName').getFieldKeys()
JanusGraph direct query APIs can be used to query a graph directly. JanusGraphQuery is a graph-centric query API designed to retrieve vertices, edges, or properties from a graph. JanusGraphVertexQuery is a vertex-centric query API executed for a single vertex to query relationships for the vertex. JanusGraphIndexQuery is a direct index query API against an index backend bypassing regular graph traversal. The following is an example of a direct index query.
    mgmt.buildIndex("mixedIndex", Vertex.class).addKey(p1).addKey(p2).buildMixedIndex("search");
    graph.indexQuery("mixedIndex", "v.*:text").vertexStream().findFirst()
Now let us discuss Apache Atlas’s type and entity system, and how it is mapped to a graph in JanusGraph.
In Atlas, a type is the definition of a metadata object, and an entity is an instance of a type. For example, ‘hive_table’ is a type in Atlas, and ‘demo_table’ is an entity of that type. Atlas predefines a set of system base types, including Referenceable, Asset, DataSet, and Process.
Atlas also provides the following models, which define types on top of the base types:
- Glossary model
- RDBMS model
- Hive model
- HBase model
- HDFS model
- Kafka model and others
For example, ‘hive_table’, which comes with the Hive model, is a subtype of DataSet.
The type system is extensible, which means custom types can be defined and added.
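A rough sketch of this idea: types form an inheritance hierarchy, an entity references its type by name, and a custom type is just another entry in the registry. The `TypeRegistry` class and the `my_custom_table` type are hypothetical, and the hierarchy below (DataSet under Asset under Referenceable) is stated here as an assumption for the example.

```python
class TypeRegistry:
    """Toy registry mapping a type name to its direct supertypes."""

    def __init__(self):
        self._super_types = {}

    def define(self, name, super_types=()):
        for parent in super_types:
            if parent not in self._super_types:
                raise KeyError(f"unknown supertype: {parent}")
        self._super_types[name] = tuple(super_types)

    def all_super_types(self, name):
        """Return every direct and inherited supertype of `name`, nearest first."""
        result = []
        for parent in self._super_types[name]:
            if parent not in result:
                result.append(parent)
            for ancestor in self.all_super_types(parent):
                if ancestor not in result:
                    result.append(ancestor)
        return result

    def is_subtype(self, name, candidate):
        return candidate == name or candidate in self.all_super_types(name)


registry = TypeRegistry()
registry.define("Referenceable")
registry.define("Asset", ["Referenceable"])
registry.define("DataSet", ["Asset"])
registry.define("hive_table", ["DataSet"])
# A custom type added on top of the existing hierarchy (hypothetical name).
registry.define("my_custom_table", ["hive_table"])

# An entity is an instance of a type, identified here by its type name.
demo_table = {"typeName": "hive_table", "attributes": {"name": "demo_table"}}
print(registry.all_super_types(demo_table["typeName"]))
```

Resolving the full list of supertypes up front is what makes searches such as "all entities that are a DataSet" cheap, since each entity can then be matched by its type name or by any of its supertype names.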
The following is a subgraph of the base and Hive types as they are stored in JanusGraph. Each type is represented as a vertex, and the relationships among the types are represented as edges. In this subgraph, the vertices are displayed with the name of the type and the edges with their edge labels.
A subgraph of the above graph shows the ‘hive_table’ type and its related types as they are stored in JanusGraph.
Turning to entities, let’s create a Hive table called ‘demo_table’:
    CREATE TABLE demo_table (col1 INT, col2 STRING);
This table will show up in Atlas as an entity of type ‘hive_table’ via the Hive hook that connects the Hive instance with the Atlas server.
The ‘demo_table’ entity will show up as a vertex in JanusGraph. The following subgraph shows that vertex and the edges that connect it to the other entities (columns, database, etc.).
Let’s create another table to demonstrate the lineage between two entities.
    CREATE TABLE demo_table2 AS (SELECT col1 FROM demo_table);
This will show up in Atlas as a lineage between demo_table2 and demo_table.
The lineage is represented as a ‘Process’ entity that connects demo_table and demo_table2. There is also column-level lineage that connects the source column to the target column. Note that both columns are called ‘col1’, but they belong to demo_table and demo_table2 respectively.
The following subgraph adds the new vertices to the graph shown above. The vertices are displayed with the type and name of the entities.
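The shape of this lineage can be sketched as plain data: a process vertex with input and output edges, plus a second process-like vertex for the column-level lineage. The vertex names, the `upstream` helper, and the edge direction (from the process to each data set) are assumptions made for illustration; only the edge labels `__Process.inputs` and `__Process.outputs` come from Atlas.

```python
# Each edge is (out_vertex, label, in_vertex); the process points at its data sets.
edges = [
    ("process:ctas_demo_table2", "__Process.inputs", "hive_table:demo_table"),
    ("process:ctas_demo_table2", "__Process.outputs", "hive_table:demo_table2"),
    # Column-level lineage gets its own vertex between the two col1 columns.
    ("column_lineage:col1", "__Process.inputs", "hive_column:demo_table.col1"),
    ("column_lineage:col1", "__Process.outputs", "hive_column:demo_table2.col1"),
]


def upstream(entity, edges):
    """Return the entities that directly feed `entity` through a process."""
    producers = {out_v for out_v, label, in_v in edges
                 if label == "__Process.outputs" and in_v == entity}
    return sorted(in_v for out_v, label, in_v in edges
                  if label == "__Process.inputs" and out_v in producers)


print(upstream("hive_table:demo_table2", edges))
print(upstream("hive_column:demo_table2.col1", edges))
```

Because the two ‘col1’ columns are separate vertices, table-level and column-level lineage can be followed independently of each other.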
Next, let’s create a classification called ‘demo_PII’ and tag demo_table’s col1 with it. This is done in the Atlas UI.
In JanusGraph, a vertex called ‘demo_PII’ is created, and it is connected to the ‘col1’ vertex via an edge labeled ‘classifiedAs’.
We enabled propagation when we tagged demo_table’s col1. Therefore, demo_table2’s col1, which is derived from it through the lineage, is also classified as ‘demo_PII’.
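Propagation can be thought of as a walk along the lineage: every entity downstream of a tagged entity receives the same classification. The sketch below is a simplified model of that behavior, not the Atlas implementation; the `LINEAGE` mapping and the `propagate` function are hypothetical.

```python
from collections import deque

# Hypothetical downstream lineage: source entity -> entities derived from it.
LINEAGE = {
    "demo_table.col1": ["demo_table2.col1"],
    "demo_table2.col1": [],
    "demo_table.col2": [],
}


def propagate(tagged_entity, classification, lineage):
    """Return {entity: classification} for the tagged entity and everything downstream."""
    result = {}
    queue = deque([tagged_entity])
    while queue:
        entity = queue.popleft()
        if entity in result:
            continue  # already visited; this also guards against cycles
        result[entity] = classification
        queue.extend(lineage.get(entity, []))
    return result


tags = propagate("demo_table.col1", "demo_PII", LINEAGE)
print(tags)
```

Columns that are not downstream of the tagged column, such as demo_table’s col2 here, are left untouched.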
Last, let’s look into the graph labels, indexes, and graph queries for Apache Atlas.
Atlas defines many edge labels to represent relationships among entities.
    mgmt.getRelationTypes(EdgeLabel)

    classifiedAs
    __Process.inputs
    __Process.outputs
    ____AtlasUserProfile.savedSearches
    __hive_table.db
    __hive_table.sd
    __hive_table.partitionKeys
    __hive_table.columns
    __hive_storagedesc.table
    __hive_storagedesc.sortCols
    __hive_column.table
    __hive_column_lineage.query
    __hive_storagedesc.serdeInfo
    …
Atlas does not use vertex labels in the graph. Instead, the type name is stored as a vertex property (‘__typeName’). The benefit is that a vertex property can be indexed.
Atlas makes extensive use of the graph indexes to speed up search and queries of entities and relationships.
    mgmt.getGraphIndexes(Vertex.class)

    vertex_index
    fulltext_index
    Referenceable.qualifiedName__typeName
    Referenceable.qualifiedName__superTypeNames
    Asset.name__typeName
    Asset.name__superTypeNames
    Asset.owner__typeName
    Asset.owner__superTypeNames
    __guid
    __typeName
    __superTypeNames
    __createdBy
    __createdBy__typeName
    __createdBy__superTypeNames
    __modifiedBy
    __modifiedBy__typeName
    __modifiedBy__superTypeNames
    …
The first index, ‘vertex_index’, is a mixed index used by direct index queries for entity search. The second index, ‘fulltext_index’, is a mixed index used by direct index queries for full-text search over entities.
An example of an entityText value is “hive_column owner jinghe qualifiedName demo_db.demo_table.col1@primary name col1 type int table”. We can see here that Atlas creates a concatenated text to facilitate full-text search of entities.
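A concatenation of this kind could be produced along the following lines. This is only a sketch that reproduces the shape of the example string; the `build_entity_text` helper, the attribute ordering, and the treatment of the `table` reference are assumptions, not the Atlas implementation.

```python
def build_entity_text(type_name, attributes):
    """Join the type name with each attribute name and value into one searchable string."""
    parts = [type_name]
    for key, value in attributes.items():
        parts.append(key)
        if value is not None:  # attributes without a plain-text value add only their name
            parts.append(str(value))
    return " ".join(parts)


entity_text = build_entity_text(
    "hive_column",
    {
        "owner": "jinghe",
        "qualifiedName": "demo_db.demo_table.col1@primary",
        "name": "col1",
        "type": "int",
        "table": None,  # a reference to another entity, left without a value here
    },
)
print(entity_text)
```

Flattening names and values into one string lets a full-text index match an entity on any of its attributes with a single query term.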
The index ‘Referenceable.qualifiedName__typeName’ is a composite index with two field keys. Queries that constrain both fields can make use of this index.
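The behavior of a composite index can be approximated with a dictionary keyed by the tuple of indexed values, so that a query supplying both fields becomes a single lookup. The sample vertices, the property key names, and the helper functions below are illustrative assumptions.

```python
from collections import defaultdict

# Toy vertices: id -> properties (property key names are assumed for the example).
vertices = {
    1: {"Referenceable.qualifiedName": "demo_db.demo_table@primary", "__typeName": "hive_table"},
    2: {"Referenceable.qualifiedName": "demo_db.demo_table2@primary", "__typeName": "hive_table"},
    3: {"Referenceable.qualifiedName": "demo_db.demo_table.col1@primary", "__typeName": "hive_column"},
}

INDEX_KEYS = ("Referenceable.qualifiedName", "__typeName")


def build_composite_index(vertices, keys):
    """Index vertex ids by the tuple of their values for all `keys`."""
    index = defaultdict(list)
    for vertex_id, properties in vertices.items():
        if all(key in properties for key in keys):
            index[tuple(properties[key] for key in keys)].append(vertex_id)
    return index


def lookup(index, keys, conditions):
    """Answer a query from the index only if it constrains every indexed key by equality."""
    if set(conditions) != set(keys):
        raise ValueError("composite index needs an equality condition on every key")
    return index.get(tuple(conditions[key] for key in keys), [])


composite = build_composite_index(vertices, INDEX_KEYS)
hits = lookup(composite, INDEX_KEYS, {
    "Referenceable.qualifiedName": "demo_db.demo_table@primary",
    "__typeName": "hive_table",
})
print(hits)
```

A query that constrains only one of the two keys cannot be answered from this structure, which is why the sketch rejects it instead of falling back to a scan.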
Atlas uses both the generic Gremlin API and JanusGraph-specific APIs to query the graph, depending on the use case and efficiency. For example, to get the lineage for an entity, the following Gremlin query is issued:
    g.V().has('__guid', 'c1a65675-d8d9-4262-9fe4-2a7453c3e312')
        .repeat(__.inE('__Process.outputs').as('e1').outV()
                  .outE('__Process.inputs').as('e2').inV())
        .times(3).emit()
        .select('e1', 'e2').toList()
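To make the traversal easier to follow, here is a plain-Python sketch of the same pattern: step from a data set back over a `__Process.outputs` edge to its process, then forward over a `__Process.inputs` edge to the input data set, repeating up to three times and collecting the edge pair found at each hop. The toy graph and the `lineage` function are assumptions for illustration, and the sketch does not claim to reproduce Gremlin semantics exactly.

```python
EDGES = [
    # (out_vertex, label, in_vertex): t1 -> p1 -> t2 -> p2 -> t3
    ("p1", "__Process.inputs", "t1"),
    ("p1", "__Process.outputs", "t2"),
    ("p2", "__Process.inputs", "t2"),
    ("p2", "__Process.outputs", "t3"),
]


def lineage(start, edges, max_hops=3):
    """Collect (output_edge, input_edge) pairs found within `max_hops` upstream hops."""
    results = []
    frontier = [start]
    for _ in range(max_hops):
        next_frontier = []
        for vertex in frontier:
            for e1 in edges:
                if e1[1] != "__Process.outputs" or e1[2] != vertex:
                    continue
                process = e1[0]
                for e2 in edges:
                    if e2[1] == "__Process.inputs" and e2[0] == process:
                        results.append((e1, e2))
                        next_frontier.append(e2[2])
        frontier = next_frontier
    return results


pairs = lineage("t3", EDGES)
for output_edge, input_edge in pairs:
    print(output_edge, input_edge)
```

The hop limit plays the role of `times(3)` in the query: it bounds how far upstream the lineage is followed.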
On the other hand, to find entities by type and property value, a direct JanusGraphQuery is used:
    // Build a graph-centric query for active hive_table entities owned by root.
    AtlasGraphQuery query = AtlasGraphProvider.getGraphInstance().query()
            .has("__typeName", "hive_table")
            .has("owner", "root")
            .has("__state", "ACTIVE");
    Iterator<AtlasVertex> results = query.vertices().iterator();
To get the classifications of an entity, use JanusGraphVertexQuery. It searches for relationships given a vertex.
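A vertex-centric lookup of this kind can be sketched as a filter over the edges incident to one vertex. The adjacency structure and the `classifications_of` helper below are hypothetical, and the edge direction (from the entity to the classification) is an assumption; the `classifiedAs` label is the one described earlier.

```python
# Outgoing edges per vertex: vertex -> list of (edge_label, target_vertex).
OUT_EDGES = {
    "demo_table.col1": [("classifiedAs", "demo_PII"), ("__hive_column.table", "demo_table")],
    "demo_table.col2": [("__hive_column.table", "demo_table")],
}


def classifications_of(vertex, out_edges):
    """Return the classification vertices reached via 'classifiedAs' edges from `vertex`."""
    return [target for label, target in out_edges.get(vertex, []) if label == "classifiedAs"]


print(classifications_of("demo_table.col1", OUT_EDGES))
```

Only the edges of the one vertex are inspected, so the cost depends on that vertex's degree rather than on the size of the graph.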
In this article, we focused on Apache Atlas as an example to explain and demonstrate graph-based metadata management in enterprise data governance. The graph model provides the flexibility to model and store metadata types and data assets.