

Overview

Skill Level: Intermediate

IGC, SQL Replication, DB2, SQL, SQL/XML, XML, REST API, cURL

For many years, many DWH architectures have leveraged DataPropagator (a.k.a. DProp or InfoSphere Data Replication SQL Replication) to stage delta data from relational sources. Wouldn't it be nice to see those replication subscriptions in data lineage?

Ingredients

This recipe introduces a parser to transform IBM InfoSphere Data Replication SQL Replication (a.k.a. DataPropagator) metadata into the Open IGC XML format which is used to define custom assets and lineage flows in IBM InfoSphere Information Governance Catalog.

The parser uses the same techniques and ingredients as the recently introduced parser for Q Replication metadata exchange. It likewise reads replication metadata stored in relational tables in the SQL Capture database and the SQL Apply control server database and transforms it into the Open IGC XML format using SQL/XML.

Please refer to that recipe (https://developer.ibm.com/recipes/tutorials/extend-your-information-governance-catalog-with-q-replication-metadata/) for a detailed introduction to all ingredients. Here is just the brief list:

  • IBM InfoSphere Information Governance Catalog
  • IBM InfoSphere Data Replication - SQL Replication
  • Data Lineage
  • Open IGC
  • IGC REST API
  • Database Federation
  • cURL

As with the Q Replication parser, all ingredients are taken from the "DBA" shelf and the "Governance" shelf - no programming or custom programs are required. Everything is open, easily adoptable, and easy to maintain.

Step-by-step

  1. Differences between DataPropagator (SQL Replication) and Q Replication

    DataPropagator (SQL Replication) is the “mother of all replication technologies” for DB2. It uses log-based change capture to identify relevant insert, update, and delete operations and replicates them to the replication targets.

    In contrast to the real-time approach of Q Replication, DataPropagator Capture stages all identified changes in staging tables at the replication source. These staging tables are also referred to as change data (CD) tables. The replication target engine (called Apply) polls these CD tables to fetch new changes and replicate them to the replication target tables.

    This approach is not ideal for real-time replication, but it scales enormously well in terms of the number of replication targets – because each change is staged only once and can be fetched hundreds or even thousands of times with minimal additional overhead.
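
    To illustrate the staging concept, here is a conceptual sketch of what Apply logically does when it polls a CD table. The CD table name CDSCHEMA.CD_EMPLOYEE and the user columns are placeholders for this example; Capture creates one CD table per registered source table, and the IBMSNAP_* control columns record the operation and its ordering.

    -- Conceptual sketch only: fetch staged changes from a CD table.
    -- CDSCHEMA.CD_EMPLOYEE and the user columns EMPNO/SALARY are placeholders.
    SELECT IBMSNAP_COMMITSEQ,   -- sequence of the committing source transaction
           IBMSNAP_INTENTSEQ,   -- order of the change within the transaction
           IBMSNAP_OPERATION,   -- 'I' = insert, 'U' = update, 'D' = delete
           EMPNO, SALARY        -- replicated user columns
    FROM   CDSCHEMA.CD_EMPLOYEE
    ORDER  BY IBMSNAP_COMMITSEQ, IBMSNAP_INTENTSEQ;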

    As with Q Replication, the runtime processes are called Capture and Apply. Multiple Apply processes can replicate from the same Capture process. To separate these Apply processes, the term “Apply Qualifier” was introduced to uniquely identify them.

    Each Apply Qualifier maintains a number of Subscription Sets. These sets contain the actual replicated tables, called Subscription Set Members. Each Subscription Set can contain up to 250 members, which are replicated in a transactionally consistent manner.

    This results in a metadata hierarchy like this (a query sketch against the Apply control tables follows the list):

    • 1 SQL Apply Control Server
      • n Apply Qualifiers
        • m Subscription Sets per Apply Qualifier
          • x Subscription Set Members (Tables) per Set
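
    This hierarchy can be seen directly in the Apply control tables. The following is a minimal sketch (not part of the shipped scripts) that lists the Apply Qualifiers, their Subscription Sets, and the set members from ASN.IBMSNAP_SUBS_SET and ASN.IBMSNAP_SUBS_MEMBR:

    -- Sketch: list the Apply Qualifier / Subscription Set / Member hierarchy.
    SELECT s.APPLY_QUAL,
           s.SET_NAME,
           m.SOURCE_OWNER || '.' || m.SOURCE_TABLE AS SOURCE_TABLE_NAME,
           m.TARGET_OWNER || '.' || m.TARGET_TABLE AS TARGET_TABLE_NAME
    FROM   ASN.IBMSNAP_SUBS_SET   s
    JOIN   ASN.IBMSNAP_SUBS_MEMBR m
      ON   m.APPLY_QUAL    = s.APPLY_QUAL
     AND   m.SET_NAME      = s.SET_NAME
     AND   m.WHOS_ON_FIRST = s.WHOS_ON_FIRST
    ORDER BY s.APPLY_QUAL, s.SET_NAME;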

    The screen shot below shows a view of the graphical administration interface, the Replication Center, which can optionally be used to define replication subscriptions for both Q Replication and SQL Replication.

    The SQL Replication definitions shown belong to the SQL Apply Control Server TARGETDB. The Control Server contains four Subscription Sets – three for Apply Qualifier CLENKE and one for Apply Qualifier ASNCLP. All replicate data from the same Capture server: SOURCEDB. The properties of Subscription Set MYSET are displayed in the separate window; it shows that the set contains four Subscription Set Members with different target table types.

    db2rc_SQLRep

    All definitions made with the Replication Center are stored in the replication control tables at the Capture server database and the Apply control server database.

  2. See how it was done for IBM Q Replication already (and how the same techniques can be leveraged for SQL Replication)

    The developerWorks recipe “Extend your Information Governance Catalog with Q Replication Metadata” (https://developer.ibm.com/recipes/tutorials/extend-your-information-governance-catalog-with-q-replication-metadata/) introduced the techniques to access the Q Replication metadata in the Q Replication control tables, to transform that metadata into XML documents using SQL/XML, and to load the XML documents into the Information Governance Catalog using the IGC REST API and cURL. The parser for SQL Replication introduced here works in the same way.

    Access Replication metadata

    SQL Replication also stores all metadata in relational tables. The metadata tables for the Capture process are located in the replication source database. The metadata tables for the Apply process are located in an Apply control server database – which can be the target database, the source database, or a dedicated third database. The parser to transform the replication metadata into the Open IGC format needs to join the Capture and Apply control tables. 

    To be able to join the Capture and Apply control tables, nicknames for either the Capture or the Apply control tables (or both) have to be created. Usually the Capture control tables are federated into the Apply control server database, or vice versa. Exception: DB2 for z/OS does not support federation. If both the Capture and the Apply databases are on z/OS, a third DB2 LUW database is required; federate both the SQL Capture and SQL Apply control tables into that database and run the parser against this DB2 LUW database.

    Scripts are available to federate both the SQL Capture and SQL Apply control tables:

    • federation_of_source_database.sql: Federation of the control tables of the Capture Server
    • federation_of_control_server_database.sql: Federation of the control tables of the Apply Control Server

    After setting the contained variables to the correct values, the scripts can be invoked by standard DB2 means:

    db2 -tvf <scriptname>
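
    For orientation, here is a minimal sketch of the kind of statements such a federation script typically contains. The server name CAPSRC, the nickname schema CAPFED, the version string, and the credentials are placeholders for this example; the shipped scripts use variables instead. The federated database must have federation enabled (database manager configuration parameter FEDERATED=YES) and the remote database must be cataloged.

    -- Hedged sketch: federate the Capture control tables into the Apply control server database.
    -- All names and credentials below are placeholders.
    CREATE WRAPPER DRDA;

    CREATE SERVER CAPSRC TYPE DB2/UDB VERSION '11.5' WRAPPER DRDA
           AUTHORIZATION "repladm" PASSWORD "passw0rd"
           OPTIONS (DBNAME 'SOURCEDB');

    CREATE USER MAPPING FOR USER SERVER CAPSRC
           OPTIONS (REMOTE_AUTHID 'repladm', REMOTE_PASSWORD 'passw0rd');

    -- One nickname per Capture control table that the parser joins, for example:
    CREATE NICKNAME CAPFED.IBMSNAP_REGISTER FOR CAPSRC.ASN.IBMSNAP_REGISTER;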

    Transform Replication metadata into the Open IGC XML format

    The XML generator used here to transform the relational Replication metadata into the Open IGC asset and flow XML format leverages the SQL/XML functions of DB2. Two SQL statements are required (both use the same base structure and join criteria, defined as a common table expression):

    • SQL/XML script to create the Open IGC XML file for asset publishing
    • SQL/XML script to create the Open IGC XML file for flows

    Refer to the Q Replication recipe mentioned above to learn more about the techniques used.

    The generator parses Replication metadata by “topology”. A topology is determined by the Capture server, the Capture schema, and the Apply Control Server (which always has the same schema 'ASN'). Example: Server: DBA / Capture schema: ASN paired with Apply Control Server: DBB (schema: ASN). The Apply Control Server can contain metadata for multiple target databases.

    If you are replicating multiple “topologies”, run the XML generator multiple times to store all your topologies in Information Governance Catalog.
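
    The following is a minimal sketch of the SQL/XML technique, assuming the Apply control tables (or their nicknames) are accessible under schema ASN. It is not the shipped script: it omits the common table expression and the Open IGC specific element names, and only illustrates how control table rows are turned into aggregated XML.

    -- Sketch: aggregate Subscription Set metadata into one XML fragment per Apply Qualifier.
    SELECT XMLSERIALIZE(CONTENT
             XMLELEMENT(NAME "applyQualifier",
               XMLATTRIBUTES(s.APPLY_QUAL AS "name"),
               XMLAGG(
                 XMLELEMENT(NAME "subscriptionSet",
                   XMLATTRIBUTES(s.SET_NAME AS "name",
                                 s.SOURCE_SERVER AS "captureServer",
                                 s.TARGET_SERVER AS "targetServer"))))
           AS CLOB(1M)) AS ASSET_XML
    FROM ASN.IBMSNAP_SUBS_SET s
    GROUP BY s.APPLY_QUAL;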

    Load the generated XML documents into IGC

    After having transformed the Replication metadata into Open IGC XML for assets and flows in the previous step, the generated XML files can be loaded into IGC. This is done using the IGC REST API.

    The Replication XML generator calls the IGC REST API from a script leveraging cURL. With cURL, REST APIs can be called from the command line (and therefore can easily be used in scripts). Refer to this developerWorks recipe to learn more about interacting with governance metadata in IGC using REST APIs with cURL: https://developer.ibm.com/recipes/tutorials/interact-with-your-governance-metadata-in-igc-using-rest-apis-with-curl/

    Alternatively, the generated XML can be uploaded using the IGC REST Explorer. This option is recommended for checking detailed error messages if the cURL command returns an unsuccessful return code.

  3. The SQL Replication IGC asset bundle (and how to load the asset metadata)

    Open IGC – the framework to extend the standard IGC metadata categories (such as databases, tables, BI and data models, business terms, policies and rules, etc.) with custom categories – allows metadata categories to be specified in as much detail as required.

    Definition of an asset bundle for Data Propagator (SQL Replication)

    The figure below shows how the Information Governance Catalog looks after the standard IGC metadata categories have been complemented with custom metadata categories. “Data Propagator” is one of them. For Data Propagator, the metadata objects

    • SQL Replication Topologies,
    • Apply Qualifiers,
    • Subscription Sets,
    • Subscription Set Members,
    • Replication Tasks, and
    • Subscription Columns

    have been defined.

    SQLRepBundle

    Parsing of Replication metadata to generate a bundle asset XML document

    To parse the replication control tables and generate an IGC asset XML document, the following script is available:

    • select_SQLRep_xmlForAssetPublishing.sql: generating XML for IGC bundle assets

    After setting the contained variables to the correct values, the script can be invoked by standard DB2 means:

    db2 -txf <scriptname> > <xmlfilename>

    The option -x means that no column headers are written to the output, only the selected column values (an XML CLOB in this case). The output is redirected to a file.

    Note:

    The generated files can become big because of the CLOB output data type and the concatenation of multiple CLOBs due to the chosen approach. The following command can be used to shrink the files by eliminating consecutive blanks:

    sed -i -e 's/  */ /g' <xmlfilename>

    Loading the bundle asset XML document into Information Governance Catalog

    After having transformed the SQL Replication metadata into Open IGC XML for assets, the generated XML files can be loaded into IGC. This is done using the IGC REST API.

    Command to call the IGC REST API to post bundle assets using cURL:

    curl -v --cacert certdata -H "Content-Type:application/xml" \
         -H "Accept-Encoding:identity" --data "@select_SQLRep_xmlForAssetPublishing.xml" \
         -u igcuid:pwd https://<igc_server>:<port>/ibm/iis/igc-rest/v1/bundles/assets

    Reviewing the results

    After loading the asset XML for the sample topology explained in step 1 of the recipe, the custom category “Data Propagator” contains all metadata for the topology with Capture Server SOURCEDB, Capture Schema ASN, and Apply Control Server TARGETDB.

    Comparing it with the screen shot of the Replication Center in step 1, we see that both Apply Qualifiers (ASNCLP and CLENKE) and all four Subscription Sets are contained in the metadata tree. The Subscription Set MYSET is highlighted; the right-hand side of the Governance Catalog view displays all details parsed from that set. Additionally, the metadata tree contains all Subscription Set Members and all columns that are replicated for each member.

    MySetDetails

     

  4. SQL Replication Subscriptions in data lineage (and how to load the flows)

    Parsing of Replication metadata to generate a flow XML document

    To parse the replication control tables and generate an IGC flow XML document, the following script is available:

    • select_SQLRep_xmlForFlows.sql: generating XML for IGC flows

    After setting the contained variables to the correct values, the script can be invoked by standard DB2 means:

    db2 -txf <scriptname> > <xmlfilename>

    Again, the option -x means that no column headers are written to the output, only the selected column values (an XML CLOB in this case). The output is redirected to a file and can be shrunk as shown before:

    sed -i -e 's/  */ /g' <xmlfilename>

    For data lineage, the goal was to represent the Subscription Set as a node in the default lineage graph and to be able to drill down into the details from there (down to the selection of individual columns to review column lineage). This was achieved by defining expandableInLineage="true" for the subscription set class. The lowest-level lineage (on column level) is nevertheless defined in the flow XML document: the SQL/XML script selects all column mappings and aggregates the XML elements up to the Subscription Set level, as in this example:

    flow_xml
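
    The figure above shows the generated flow XML. As a complement, here is a hedged SQL/XML sketch of how such column mappings could be selected and aggregated from the control tables; it is not the shipped select_SQLRep_xmlForFlows.sql, and the element names are illustrative only.

    -- Sketch: aggregate column mappings (source expression -> target column) per Subscription Set.
    -- c.EXPRESSION holds the source column name (or expression) in SQL Replication.
    SELECT XMLSERIALIZE(CONTENT
             XMLELEMENT(NAME "subscriptionSet",
               XMLATTRIBUTES(m.APPLY_QUAL AS "applyQualifier", m.SET_NAME AS "name"),
               XMLAGG(
                 XMLELEMENT(NAME "columnMapping",
                   XMLATTRIBUTES(
                     m.SOURCE_OWNER || '.' || m.SOURCE_TABLE || '.' || c.EXPRESSION  AS "source",
                     m.TARGET_OWNER || '.' || m.TARGET_TABLE || '.' || c.TARGET_NAME AS "target"))))
           AS CLOB(1M)) AS FLOW_XML
    FROM ASN.IBMSNAP_SUBS_MEMBR m
    JOIN ASN.IBMSNAP_SUBS_COLS  c
      ON  c.APPLY_QUAL    = m.APPLY_QUAL
     AND  c.SET_NAME      = m.SET_NAME
     AND  c.WHOS_ON_FIRST = m.WHOS_ON_FIRST
     AND  c.TARGET_OWNER  = m.TARGET_OWNER
     AND  c.TARGET_TABLE  = m.TARGET_TABLE
    GROUP BY m.APPLY_QUAL, m.SET_NAME;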
     

    Loading the flow XML document into Information Governance Catalog

    After having transformed the SQL Replication metadata into Open IGC XML for flows, the generated XML files can be loaded into IGC. This is done using the IGC REST API.

    Command to call the IGC REST API to upload flows using cURL:

    curl -v --cacert certdata -H "Content-Type:application/xml" \
         -H "Accept-Encoding:identity" --data "@select_SQLRep_xmlForFlows.xml" \
         -u igcuid:pwd https://<igc_server>:<port>/ibm/iis/igc-rest/v1/flows/upload

    Reviewing the results

    After loading the flow XML for the sample topology explained in step 1, the custom assets are enriched with usage information.

    Although the XML file to load the flows contained subflows on column level, IGC automatically aggregates the lineage information up along the hierarchy. The usage information for Subscription Set “MYSET” shows that the set reads from three source tables and writes to four target tables.

    MYSET_Usage

    When clicking the lineage button, the relationship between the source and target tables is displayed as a graph:

    MYSET_Lineage

    The EMPLOYEE table seems to replicate to two target tables. To check if that is the case, just invoke lineage for the EMPLOYEE source table right from the above graph:

    EMPLOYEE_Lineage

    As with the Q Replication bundle, the user can drill even deeper, for example to display column lineage for a target table (or a source table). The following figure displays column lineage for the target table EMPLOYEE. Only a subset of columns was selected. The lineage graph shows that, for example, the source column SALARY replicates to two target columns, SALARY and XSALARY. XSALARY is the before image (the value of SALARY before an UPDATE) in SQL Replication terms.

    Lineage_Details

  5. Combined data lineage of replication and DataStage

    A lot of IBM customers have been using SQL Replication for many years to feed relational data into the staging area of a data warehouse or data lake (just one of many SQL Replication use cases). Optionally, the replicated data is not a 1:1 mirror of the source but a continuous change history. This staged history is often the source for subsequent ETL processing. A big advantage of the staging approach is that the ETL processes only have to transform changed data, so ETL resources can be used efficiently. And – another important advantage – this way replication directly contributes to historization by providing not only the changed attributes but also the exact timestamp of each change.

    Imagine an ETL process – implemented with Information Server's DataStage ETL module – that reads from a staging table provided by SQL Replication and writes to some other table (e.g. to a historized dimension table). After uploading the SQL Replication asset and flow metadata as described in this recipe, the lineage graph in Information Governance Catalog contains both 'data movers': the standard DataStage ETL jobs, which are available in data lineage immediately after saving the job, and the custom assets imported with the metadata parser.

    LineageWithDataStage

    The above figure is just a section of the bigger end-to-end lineage graph, reaching from the operational replication sources on the left to the data warehouse tables and reports on the right. Because the Open IGC framework makes it possible to define icons for custom metadata assets, the user can easily distinguish the different asset types in the graph.

  6. Wrapping it up

    IBM SQL Replication is a data replication technology contained in IBM InfoSphere Data Replication. Among the many SQL Replication use cases is data provisioning for analytics and data warehouse systems.

    Information Governance Catalog provides comprehensive capabilities for the management of business metadata, technical metadata, and operational metadata, including data lineage visualization and reporting. With the Open IGC framework, IGC's standard lineage focussing on DataStage ETL processing can be extended to include any relevant process and metadata.

    This recipe explains the techniques to leverage the Open IGC framework to load SQL Replication metadata into the IGC repository. First, a new metadata category was defined for SQL Replication (a.k.a. Data Propagator). Thereafter, SQL (especially SQL/XML) was used to transform the relational SQL Replication metadata into the XML format required by Open IGC, and the IGC REST API was used to load the generated XML. This parsing approach is easy to automate and makes it possible to load big replication scenarios into IGC without manual interaction.

    Altogether, the techniques explained here extend the standard IGC data lineage with SQL Replication subscriptions, making it possible to fully understand, govern, document, and report end-to-end data integration and provisioning.
