Overview

Skill Level: Intermediate

SQL Replication skill expected

More and more analytics use cases require real time data and go beyond relational databases. This recipe explains how real time data provisioning can be achieved with IBM SQL Replication -- and how SQL Replication can be extended to provision any target.

Ingredients

After publishing the previous recipe about provisioning real time data to any target using IBM Q Replication, I want to describe how a very similar result can be achieved with IBM SQL Replication. This approach is particularly relevant for scenarios that run SQL Replication today.

IBM SQL Replication

Many years ago, SQL Replication (formerly known as DataPropagator or DProp) was the first member of the IBM data replication portfolio that is nowadays bundled into the offering "IBM Data Replication". SQL Replication includes log based change capture for Db2 z/OS and Db2 LUW databases and supports Db2 z/OS and Db2 LUW replication targets. In contrast to Q Replication and CDC, which both focus on real time data delivery, SQL Replication follows a stage and forward approach. This staging approach is very flexible (e.g., in terms of data transformations) and simple to manage (it can be maintained, managed, and monitored with pure DBA skills, mainly with SQL). SQL Replication has a broad customer base which could use the procedures described in this recipe to integrate mission critical data with NoSQL replication targets, streaming analytics, real time decisions, etc.

Extending SQL Replication to any target

Modern data lake architectures usually contain multiple zones, which are often implemented by different data lake technologies and repositories. Some are relational, some go beyond relational. Some analyze data at rest, some analyze data in motion. Some leverage classic ETL tools for data provisioning, some use streaming and event processing built on open source technologies like Apache Kafka or Apache Spark.

One efficient data provisioning best practice is to combine data replication technology with ETL (extract / transform / load) tools like IBM DataStage: data replication to capture (and optionally historize) transactional data from relational sources in near real time with minimal intrusion, and ETL to transform the data and to move it to the analytics repository.

Today, with the number of analytics technologies growing nearly every day (Hadoop, Spark, Kafka, ...), analytics architectures are challenged to extend the reach of existing data provisioning solutions beyond relational. Solutions could be to either implement new point-to-point replication routes with dedicated technologies, or to extend existing replication solutions with some add-ons (which is the focus of this recipe).

Fetching changed data directly from SQL Replication CD tables

One way to feed new analytics solutions is to couple an existing SQL Replication setup with any appropriate technology. This recipe documents a very flexible and streamlined approach which uses SQL to fetch changed data directly from SQL Replication CD tables (change data tables). The approach opens up the mature SQL Replication technology to provision any target. Combined with micro-batching, the provisioning can even be close to real time.

Step-by-step

  1. Overview

    Standard SQL Replication procedures make it possible to replicate relational data (from Db2 z/OS or Db2 LUW source databases) to relational targets (Db2 z/OS or Db2 LUW target databases).

    To achieve this, SQL Replication

    • captures changed data by monitoring the source database’s log
    • stores all captured changes in CD (changed data) tables within the source database (one CD table per captured source table) – as soon as the commit record of the source transaction has been read from the log
    • applies the database changes to target tables

    The SQL Replication runtime comprises the SQL Capture process (monitoring the source database’s log and publishing relevant changes to CD tables) and the SQL Apply process (which reads the CD tables, converts the changes to SQL commands, and executes these SQL commands against the target database).

    [Figure: SQL Replication overview]

    But what if the designated replication target isn’t a relational database?

    A straightforward option to extend the reach of SQL Replication is to poll the changed data directly from the CD tables using SQL and to feed this data into any API.

    It is even possible to preserve the low latency of the capturing mechanism by forwarding the data from the change data tables in a micro-batching fashion. The following sections explain how.

  2. SQL Replication CD tables

    Change Data (CD) Tables

    CD tables (a.k.a. change data tables) are regular database tables. In IBM SQL Replication, CD tables usually serve as the internal staging area between SQL Capture and SQL Apply. SQL Replication uses one CD table per replicated source table.

    CD tables contain a full history of the changes for a source table detected from the source database’s log: As soon as SQL Capture has read a source transaction from the database log, it inserts all changed rows into the affected CD tables.

    A CD table has all relevant source data columns (all source table columns or only a subset of source columns) plus 3 replication specific columns:

    • IBMSNAP_COMMITSEQ: The source database’s commit LSN of the transaction
    • IBMSNAP_INTENTSEQ: The source database’s log record LSN of the specific row
    • IBMSNAP_OPERATION: I(nsert), U(pdate), or D(elete)

    CD tables can contain before image columns, which can be very relevant when the source application allows updates to the replication key columns (the columns of the business key). A before image column contains the column value before an update was executed at the source; all other columns (called after image columns) contain the column value after the update. For example, if an update changes a key column from 1 to 2, the before image column contains 1 while the after image column contains 2. A CD table can contain before image columns for all columns or just for the key columns (usually recommended).

    The Unit-of-Work (UOW) table

    The unit-of-work table (IBMSNAP_UOW) can be considered a global change data table. SQL Capture inserts a row into the UOW table for every transaction COMMIT log record which affected a replicated table. The UOW table contains transaction specific information such as

    • IBMSNAP_LOGMARKER: The source database’s commit timestamp of the captured transaction
    • IBMSNAP_AUTHID: The authorization ID committing the transaction

    The UOW table can be joined with any CD table using column IBMSNAP_COMMITSEQ in case transaction specific information (e.g., the commit timestamp) also needs to be forwarded to the replication target.
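
    For example, a minimal sketch of such a join, with CD_TABLE standing in for any CD table:

    SELECT CD.*, UOW.IBMSNAP_LOGMARKER, UOW.IBMSNAP_AUTHID
    FROM CD_TABLE CD
    INNER JOIN ASN.IBMSNAP_UOW UOW
    ON CD.IBMSNAP_COMMITSEQ = UOW.IBMSNAP_COMMITSEQ;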

    Fetching new changes from CD tables

    Although CD tables were not originally intended as an external API, they can perfectly well be used as such, in a simple way that respects the SQL Replication protocol.

    Because CD tables contain the source log sequence number (LSN) for each row, it is very easy for ETL tools or any forwarding application to

    • fetch the most recent records from a CD table — in other words, to fetch all rows from the CD table which have not been fetched before
    • preserve the source transaction order

    Use a SQL query like this to fetch rows from a CD table:

    SELECT <business columns>
    FROM CD_TABLE
    WHERE IBMSNAP_COMMITSEQ > <lower limit>
    AND IBMSNAP_COMMITSEQ <= <upper limit>
    ORDER BY IBMSNAP_COMMITSEQ, IBMSNAP_INTENTSEQ;

    The following figure shows examples of how to forward data from a SQL Replication CD table to any target.

    [Figure: fetch_from_cd]

    ETL tools or applications: One strength of modern ETL (extract / transform / load) tools is connectivity. Reading from relational sources and writing to Hadoop, Kafka, Spark, or other analytics targets can be achieved with many ETL tools. Typically, ETL processes are implemented as batch processes, forwarding data from a staging area (such as a CD table) to a target, for example, once per day.

    In-database ELT: ELT stands for in-database ETL. ELT eliminates the need to extract data into an ETL tool before writing it to a target. If both the CD table and the final target are relational, and if they are located in the same database (or can be virtually integrated into the same database by means of federation), ELT is a very convenient and fast mechanism which leverages the existing database infrastructure and the existing database skills. SQL (for example INSERT or MERGE) can be used to forward data from a CD table into a final target table.
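
    As an illustration, here is a minimal ELT sketch that forwards one batch of changes from a CD table into a relational target table with a single MERGE. All object names are hypothetical (CD_TABLE, TARGET_TABLE, a key column COL1, payload columns COL2 and COL3), and <lower limit> / <upper limit> are placeholders for the COMMITSEQ range, as in the generic query above. The inner ROW_NUMBER keeps only the latest captured change per key, so that the MERGE source contains at most one row per target row:

    -- Hypothetical ELT batch: collapse the change range to the latest
    -- change per key, then MERGE the result into the target table
    MERGE INTO TARGET_TABLE T
    USING (
    SELECT COL1, COL2, COL3, IBMSNAP_OPERATION
    FROM (
    SELECT CD.*, ROW_NUMBER() OVER (
    PARTITION BY COL1
    ORDER BY IBMSNAP_COMMITSEQ DESC, IBMSNAP_INTENTSEQ DESC) AS RN
    FROM CD_TABLE CD
    WHERE IBMSNAP_COMMITSEQ > <lower limit>
    AND IBMSNAP_COMMITSEQ <= <upper limit>
    ) LATEST
    WHERE RN = 1
    ) C
    ON T.COL1 = C.COL1
    WHEN MATCHED AND C.IBMSNAP_OPERATION = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET T.COL2 = C.COL2, T.COL3 = C.COL3
    WHEN NOT MATCHED AND C.IBMSNAP_OPERATION <> 'D' THEN
    INSERT (COL1, COL2, COL3) VALUES (C.COL1, C.COL2, C.COL3);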

    ETL or ELT micro-batching: The more operational decisions are based on data, the lower the replication latency has to be. SQL Replication feeds data into CD tables in near real time. If additional steps are required to forward the replicated data to a downstream target, these steps have to be invoked more frequently to preserve the low latency. We call frequent ETL “ETL micro-batching” or “ELT micro-batching”.

    Important:

    When forwarding data from SQL Replication CD tables, it is very important to adhere to the SQL Replication protocol: for example, to kick off the change capturing, to maintain a safe progress status, and to enable pruning of data from the CD tables after it has been fetched successfully. Refer to the following sections for details.

  3. Prerequisites - Basic setup

    This recipe expects that the basic SQL Replication setup at the source database has been accomplished already.

    SQL Replication Basic Setup

    • The SQL Replication Control Tables have been created at the SQL Capture server (SQL Replication source system / database). We will be using the standard Capture Schema name ‘ASN’.
    • SQL Capture is up and running

    If these steps have not been accomplished yet, refer to the standard documentation or to related recipes.

    Adjusting the SQL Capture COMMIT_INTERVAL

    The commit interval indicates how often the Capture program commits data to the Capture control tables, including the UOW and CD tables. A best practice is to batch replication workload with a bigger commit interval (for a long time the default has been 30 seconds). On the other hand, if low latency is to be preserved, it is recommended to reduce the commit interval to a more appropriate value. The example will use a commit interval of 10 seconds.

    -- Connect to the SQL Capture Server
    connect to sourcedb;

    -- Setting the SQL Capture Commit Interval to 10 seconds
    update asn.ibmsnap_capparms set commit_interval = 10;

    -- After updating IBMSNAP_CAPPARMS restart the SQL Capture process
  4. Registration of exemplary tables as source for SQL Replication

    This section explains how to define two exemplary tables as sources for SQL Replication. This step is also called “registration of replication source tables”.

    • Source Tables
      • CHRIS.SQLREP_ANY_TARGET_SOURCE_1
      • CHRIS.SQLREP_ANY_TARGET_SOURCE_2
    • CD Tables
      • CHRIS.SQLREP_ANY_TARGET_CD_1
      • CHRIS.SQLREP_ANY_TARGET_CD_2
    • Options
      • Include before image columns for the key columns

    Both source tables were created with these columns:

    create table CHRIS.<table_name> (
    col1 int not null,
    col2 varchar(20),
    col3 varchar(20),
    intcol1 int,
    intcol2 int
    );
    create unique index CHRIS.<table_name>_IX on CHRIS.<table_name> (col1);

    These ASNCLP commands were used to define both tables as replication sources for SQL Replication (according to the above options):

    SET OUTPUT CAPTURE SCRIPT ".\output\register.sql";
    SET LOG ".\logs\REGISTER.LOG";
    SET RUN SCRIPT NOW STOP ON SQL ERROR ON;

    SET SERVER CAPTURE TO DB SOURCEDB;
    SET CAPTURE SCHEMA SOURCE ASN;

    #drop registration (CHRIS.SQLREP_ANY_TARGET_SOURCE_1);
    #drop registration (CHRIS.SQLREP_ANY_TARGET_SOURCE_2);


    CREATE REGISTRATION (CHRIS.SQLREP_ANY_TARGET_SOURCE_1)
    DIFFERENTIAL REFRESH
    STAGE CHRIS.SQLREP_ANY_TARGET_CD_1 IN USERSPACE1
    COLS ALL IMAGE AFTER INCLUDE IMAGE BEFORE (COL1)
    OPTIONS FULL REFRESH OFF
    ;

    CREATE REGISTRATION (CHRIS.SQLREP_ANY_TARGET_SOURCE_2)
    DIFFERENTIAL REFRESH
    STAGE CHRIS.SQLREP_ANY_TARGET_CD_2 IN USERSPACE1
    COLS ALL IMAGE AFTER INCLUDE IMAGE BEFORE (COL1)
    OPTIONS FULL REFRESH OFF
    ;

    The execution of the ASNCLP script creates the two CD tables and inserts two rows into the Capture control table ASN.IBMSNAP_REGISTER.
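
    For orientation, the generated CD table for the first source table looks roughly like this (a sketch, not the exact ASNCLP output; on current Db2 LUW versions the LSN columns are VARCHAR(16) FOR BIT DATA, older setups use CHAR(10) FOR BIT DATA). Note the before image column for COL1, which is prefixed with ‘X’ by default and appears as XCOL1 in later queries:

    -- Approximate structure of the generated CD table (sketch)
    CREATE TABLE CHRIS.SQLREP_ANY_TARGET_CD_1 (
    IBMSNAP_COMMITSEQ VARCHAR(16) FOR BIT DATA NOT NULL, -- commit LSN of the transaction
    IBMSNAP_INTENTSEQ VARCHAR(16) FOR BIT DATA NOT NULL, -- log record LSN of the row
    IBMSNAP_OPERATION CHAR(1) NOT NULL,                  -- I(nsert), U(pdate), D(elete)
    XCOL1 INT,                                           -- before image of key column COL1
    COL1 INT NOT NULL,                                   -- after image columns
    COL2 VARCHAR(20),
    COL3 VARCHAR(20),
    INTCOL1 INT,
    INTCOL2 INT
    ) IN USERSPACE1;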

  5. Defining a pseudo subscription with one or more members

    In this section a pseudo subscription for the two source tables will be created. A proper SQL Replication subscription would define how the SQL Apply program processes the replication for these tables. The setup described in this recipe does not use a SQL Apply process. Nonetheless, it is required to define a pseudo subscription for 2 reasons:

    1. To be able to start replication for the source tables (CAPSTART processing)
    2. To let SQL Capture delete all rows after they were successfully fetched from the CD tables. This is called pruning.

    A SQL Replication subscription is usually stored in

    • Control tables at the Apply control server (IBMSNAP_SUBS_SET, IBMSNAP_SUBS_MEMBR, IBMSNAP_SUBS_COLS), which we can simply skip because we won’t be using SQL Apply
    • Control tables at the Capture server (IBMSNAP_PRUNE_SET, IBMSNAP_PRUNCNTL) 

    These SQL statements define a pseudo subscription for the two tables just within the Capture control tables, using the following arbitrary options:

    • Apply Qualifier: ETL
    • Subscription Set Name: ANY_TARGET

    The pseudo subscription definition consists of one INSERT into table IBMSNAP_PRUNE_SET and one INSERT per source table into table IBMSNAP_PRUNCNTL.

    -- Connect to the SQL Capture Server
    connect to SOURCEDB;

    -- 1 INSERT into PRUNE_SET to define the pseudo subscription set
    INSERT INTO ASN.IBMSNAP_PRUNE_SET (
    APPLY_QUAL, SET_NAME, TARGET_SERVER,
    SYNCHTIME, SYNCHPOINT
    ) VALUES (
    'ETL',
    'ANY_TARGET',
    'TARGETDB',
    null,
    X'00000000000000000000'
    );

    -- pseudo subscription member for table 1
    INSERT INTO ASN.IBMSNAP_PRUNCNTL (
    APPLY_QUAL, SET_NAME, CNTL_SERVER, CNTL_ALIAS,
    SOURCE_OWNER, SOURCE_TABLE, SOURCE_VIEW_QUAL,
    TARGET_OWNER, TARGET_TABLE, TARGET_SERVER,
    TARGET_STRUCTURE, MAP_ID,
    PHYS_CHANGE_OWNER, PHYS_CHANGE_TABLE
    ) SELECT
    'ETL',
    'ANY_TARGET',
    'TARGETDB',
    'TARGETDB',
    'CHRIS',
    'SQLREP_ANY_TARGET_SOURCE_1',
    0,
    'CHRIS',
    'SQLREP_ANY_TARGET_TARGET_1',
    'TARGETDB',
    9,
    coalesce ( char(max(INT(MAP_ID)+1) ), '0' ),
    'CHRIS',
    'SQLREP_ANY_TARGET_CD_1' FROM ASN.IBMSNAP_PRUNCNTL;

    -- pseudo subscription member for table 2
    INSERT INTO ASN.IBMSNAP_PRUNCNTL (
    APPLY_QUAL, SET_NAME, CNTL_SERVER, CNTL_ALIAS,
    SOURCE_OWNER, SOURCE_TABLE, SOURCE_VIEW_QUAL,
    TARGET_OWNER, TARGET_TABLE, TARGET_SERVER,
    TARGET_STRUCTURE, MAP_ID,
    PHYS_CHANGE_OWNER, PHYS_CHANGE_TABLE
    ) SELECT
    'ETL',
    'ANY_TARGET',
    'TARGETDB',
    'TARGETDB',
    'CHRIS',
    'SQLREP_ANY_TARGET_SOURCE_2',
    0,
    'CHRIS',
    'SQLREP_ANY_TARGET_TARGET_2',
    'TARGETDB',
    9,
    coalesce ( char(max(INT(MAP_ID)+1) ), '0' ),
    'CHRIS',
    'SQLREP_ANY_TARGET_CD_2' FROM ASN.IBMSNAP_PRUNCNTL;

    With the pseudo subscription defined we are now able to start capturing for the two tables in the next step.

  6. Starting the pseudo subscription

    When a source table is registered for SQL Replication the Capture program does not immediately start to process Db2 log records for that table. The log processing usually starts when the SQL Apply program initializes the replication subscription containing these replication source tables.

    The SQL Apply program kicks off the log capturing by inserting a record into the table IBMSNAP_SIGNAL. As soon as Capture reads that log record, it starts to put all changes for that table into the corresponding change data (CD) table.

    Because we are not using SQL Apply we will start the log capturing for all source tables of the pseudo set with a CAPSTART signal ourselves:

    -- Connect to the SQL Capture Server
    connect to sourcedb;

    --
    INSERT INTO ASN.IBMSNAP_SIGNAL (SIGNAL_TIME, SIGNAL_TYPE, SIGNAL_SUBTYPE, SIGNAL_INPUT_IN, SIGNAL_STATE)
    SELECT
    CURRENT TIMESTAMP,
    'CMD',
    'CAPSTART',
    MAP_ID,
    'P'
    FROM ASN.IBMSNAP_PRUNCNTL
    WHERE APPLY_QUAL = 'ETL'
    AND SET_NAME = 'ANY_TARGET'
    AND TARGET_SERVER = 'TARGETDB'
    AND TARGET_OWNER = 'CHRIS'
    AND TARGET_TABLE = 'SQLREP_ANY_TARGET_TARGET_1';


    INSERT INTO ASN.IBMSNAP_SIGNAL (SIGNAL_TIME, SIGNAL_TYPE, SIGNAL_SUBTYPE, SIGNAL_INPUT_IN, SIGNAL_STATE)
    SELECT
    CURRENT TIMESTAMP,
    'CMD',
    'CAPSTART',
    MAP_ID,
    'P'
    FROM ASN.IBMSNAP_PRUNCNTL
    WHERE APPLY_QUAL = 'ETL'
    AND SET_NAME = 'ANY_TARGET'
    AND TARGET_SERVER = 'TARGETDB'
    AND TARGET_OWNER = 'CHRIS'
    AND TARGET_TABLE = 'SQLREP_ANY_TARGET_TARGET_2';

    After the tables have been CAPSTARTed, all INSERT, UPDATE, and DELETE activity will be captured by the SQL Capture program and stored within the change data tables.
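
    To verify that Capture has picked up the signals, check the SIGNAL_STATE of the inserted rows; Capture sets it from ‘P’ (pending) to ‘C’ (complete) once it has processed a signal:

    -- Check that the CAPSTART signals have been processed by SQL Capture
    SELECT SIGNAL_TIME, SIGNAL_SUBTYPE, SIGNAL_INPUT_IN, SIGNAL_STATE
    FROM ASN.IBMSNAP_SIGNAL
    WHERE SIGNAL_SUBTYPE = 'CAPSTART'
    ORDER BY SIGNAL_TIME DESC;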

  7. Reading from CD tables - Complying with the SQL Replication protocol

    If a user wants to use CD tables as a custom source, it is important to stick to the SQL Replication protocol and to leverage the tuning options the protocol offers.

    Some basics

    SQL Capture reads the Db2 log continuously and buffers all log records of replicated tables in memory. As soon as Capture has read the COMMIT log record of a Db2 transaction which affected replicated tables, Capture inserts all changed rows from the memory buffer into the CD tables.

    Capture itself commits only infrequently, controlled by the parameter COMMIT_INTERVAL. All captured CD table records that were committed by Capture can be retrieved and forwarded from CD tables.

    The SQL Capture global SYNCHPOINT indicates which CD table records Capture has committed already.

    SQL Capture asynchronously deletes rows which are no longer required from CD tables and from the UOW table. This process is called pruning.

    Protocol to read from SQL Replication CD tables

    The protocol documented in this recipe uses a PROGRESS table to store the relevant SYNCHPOINTs retrieved from the SQL Replication control tables. It would be possible to just read the SYNCHPOINTs from the SQL Replication control tables, but using a PROGRESS table has some advantages:

    • it simplifies the protocol
    • it facilitates restartability
    • it can be enriched with temporal features or triggers to document the history of ETL executions

    The following figure visualizes the sequence of steps required when reading from SQL Replication CD tables:

    [Figure: microbatching_flow1]

    Step 1: The sequence starts with fetching the SYNCHPOINT of the global record from the SQL Replication control table IBMSNAP_REGISTER. The REGISTER table contains one row with GLOBAL_RECORD=’Y’. As described in the basics, Capture updates this global record with the highest available safe COMMIT LSN. Refer to the following figure for more details. This global SYNCHPOINT will determine the upper limit for the next ETL cycle reading from CD tables.

    The global SYNCHPOINT is stored in the PROGRESS table as MAX_REFRESH_COMMITSEQ – for all tables read by the ETL application.
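
    A minimal query for step 1 looks like this:

    -- Fetch the global SYNCHPOINT (upper limit of the next cycle)
    SELECT SYNCHPOINT, SYNCHTIME
    FROM ASN.IBMSNAP_REGISTER
    WHERE GLOBAL_RECORD = 'Y'
    WITH UR;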

    Step 2: The lower limit, which was the upper limit of the previous successful ETL cycle, is already stored in the PROGRESS table as MIN_REFRESH_COMMITSEQ.

    Step 3: Next the ETL application fetches all relevant records from all CD tables. The data range is determined by the MIN_REFRESH_COMMITSEQ and MAX_REFRESH_COMMITSEQ.

    This step needs to take the order of the captured rows into account. It is required that the ETL application selects the captured rows with ORDER BY IBMSNAP_COMMITSEQ, IBMSNAP_INTENTSEQ. This returns the captured rows in COMMIT order and, for each transaction, in the correct change sequence.

    This step can (but does not have to) take into account the captured operation in CD table column IBMSNAP_OPERATION.

    Step 4: If the ETL process was successful for a CD table, the current upper limit (MAX_REFRESH_COMMITSEQ) can be stored as the MIN_REFRESH_COMMITSEQ of the next cycle. If this step fails, or if the ETL process breaks before committing step 4, the data would simply be fetched a second time during the next cycle.

    Step 5: Last but not least, the upper limit of the current cycle is stored as the pruning SYNCHPOINT in SQL Replication control table IBMSNAP_PRUNE_SET. Refer to the following figure for more details. It is no problem if this step fails; that would just defer the pruning of already replicated rows.

    [Figure: control_tables1]

    Recommendations

    If the low latency of the SQL Replication process needs to be preserved, the ETL process can be run in a micro-batching fashion (repeating the above steps as often as required). But the ETL application should not run more often than Capture’s COMMIT_INTERVAL, because the global SYNCHPOINT will only advance when Capture commits.

    The more CD tables are used as sources, the more expensive a micro-batching approach becomes (because more cursors have to be opened against many tables frequently). SQL Replication helps to save some MIPS by indicating which CD tables received changed records and which did not. The previous figure shows that SQL Capture maintains the column CD_NEW_SYNCHPOINT for every CD table in the SQL Replication control table IBMSNAP_REGISTER. The ETL process only needs to fetch from a CD table if that CD table’s CD_NEW_SYNCHPOINT is greater than the lower limit for that table in the PROGRESS table, as sketched below.
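
    A sketch of that check, using the ETL progress table introduced in the next step:

    -- List only those registered CD tables with new committed changes
    SELECT R.SOURCE_OWNER, R.SOURCE_TABLE, R.CD_OWNER, R.CD_TABLE
    FROM ASN.IBMSNAP_REGISTER R
    INNER JOIN SQLREP.ETL_PROGRESS P
    ON P.SOURCE_OWNER = R.SOURCE_OWNER
    AND P.SOURCE_TABLE = R.SOURCE_TABLE
    WHERE R.CD_NEW_SYNCHPOINT > P.MIN_REFRESH_COMMITSEQ
    WITH UR;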

  8. Exemplary setup of ETL / ELT micro-batching step to any target leveraging SQL Replication CD tables

    The ETL application forwarding the replicated records from the CD table(s) can be implemented with any technology available. It just needs to be able to

    • Connect to the source database where the CD tables are stored – which can be a Db2 z/OS or a Db2 LUW database
    • Connect to the final destination (same database, another database, Kafka, Hadoop, Spark, whatever)
    • SELECT the CD table(s)
    • Populate the target objects
    • Record progress

    1) Setup of an ETL progress table

    The ETL processes reading from the SQL Replication CD tables need to maintain their progress somewhere. Because the COMMIT LSN of the captured transactions always advances, the COMMIT LSN is a perfect fit for tracking the ETL progress. To record progress, the ETL application could use the following table (in the source database where the CD tables are stored):

    • MIN_REFRESH_COMMITSEQ can be used to record the highest previously read COMMIT LSN
    • MAX_REFRESH_COMMITSEQ can be used to store the highest currently available COMMIT LSN. This makes sense if multiple tables need to be replicated transactionally consistent up to the same source COMMIT LSN.

    The other columns can be used to identify or group ETL tasks. The purpose of all columns is documented in the script below:

    -- Connect to the SQL Capture Server
    connect to sourcedb;

    -- drop table SQLREP.ETL_PROGRESS;

    create table SQLREP.ETL_PROGRESS (
    -- schema of IBMSNAP_REGISTER
    SQL_CAP_SCHEMA VARCHAR(128) NOT NULL,
    -- schema of the source table
    SOURCE_OWNER VARCHAR(128) NOT NULL,
    -- name of the source table
    SOURCE_TABLE VARCHAR(128) NOT NULL,
    -- name of the final destination (to be able to forward one CD table to multiple targets)
    TARGET_DBALIAS VARCHAR(18) NOT NULL,
    -- schema of the target object
    TARGET_OWNER VARCHAR(128) NOT NULL,
    -- name of the target object
    TARGET_TABLE VARCHAR(128) NOT NULL,
    -- highest successfully forwarded LSN (lower bound for next ETL cycle)
    MIN_REFRESH_COMMITSEQ VARCHAR(16) FOR BIT DATA NOT NULL WITH DEFAULT
    X'00000000000000000000000000000000',
    -- timestamp of highest successfully forwarded LSN (for information only)
    MIN_REFRESH_LOGMARKER TIMESTAMP not null with default '1900-01-01-00.00.00.000001',
    -- highest successfully replicated LSN (upper bound for next ETL cycle)
    MAX_REFRESH_COMMITSEQ VARCHAR(16) FOR BIT DATA NOT NULL WITH DEFAULT
    X'00000000000000000000000000000000',
    -- timestamp of highest successfully replicated LSN (for information only)
    MAX_REFRESH_LOGMARKER TIMESTAMP not null with default '1900-01-01-00.00.00.000001',
    -- execution timestamp of the last ETL cycle
    REFRESH_TIMESTAMP TIMESTAMP not null with default '1900-01-01-00.00.00.000001'
    -- optional columns for grouping purposes
    -- ,APPLY_QUAL VARCHAR(18)
    -- ,SET_NAME VARCHAR(18)
    ) ORGANIZE BY ROW;

    create unique index SQLREP.ETL_PROGRESS_IX on SQLREP.ETL_PROGRESS (
    SQL_CAP_SCHEMA,
    SOURCE_OWNER,
    SOURCE_TABLE,
    TARGET_DBALIAS,
    TARGET_OWNER,
    TARGET_TABLE
    );

    2) Registration of each required ETL process

    Each new ETL process can register in the above table like this:

    -- Connect to the SQL Capture Server
    connect to sourcedb;

    -- registering to extract table 1
    insert into SQLREP.ETL_PROGRESS (
    SQL_CAP_SCHEMA,
    SOURCE_OWNER,
    SOURCE_TABLE,
    TARGET_DBALIAS,
    TARGET_OWNER,
    TARGET_TABLE,
    MIN_REFRESH_COMMITSEQ,
    MIN_REFRESH_LOGMARKER,
    MAX_REFRESH_COMMITSEQ,
    MAX_REFRESH_LOGMARKER,
    REFRESH_TIMESTAMP
    -- optional columns for grouping purposes
    -- ,APPLY_QUAL
    -- ,SET_NAME
    ) VALUES (
    'ASN',
    'CHRIS',
    'SQLREP_ANY_TARGET_SOURCE_1',
    'TARGETDB',
    'CHRIS',
    'SQLREP_ANY_TARGET_TARGET_1',
    X'00000000000000000000000000000000',
    '1900-01-01-00.00.00.000001',
    X'00000000000000000000000000000000',
    '1900-01-01-00.00.00.000001',
    '1900-01-01-00.00.00.000001'
    -- optional columns for grouping purposes
    -- ,'ETL'
    -- ,'ANY_TARGET'
    );

    -- registering to extract table 2
    insert into SQLREP.ETL_PROGRESS (
    SQL_CAP_SCHEMA,
    SOURCE_OWNER,
    SOURCE_TABLE,
    TARGET_DBALIAS,
    TARGET_OWNER,
    TARGET_TABLE,
    MIN_REFRESH_COMMITSEQ,
    MIN_REFRESH_LOGMARKER,
    MAX_REFRESH_COMMITSEQ,
    MAX_REFRESH_LOGMARKER,
    REFRESH_TIMESTAMP
    -- optional columns for grouping purposes
    -- ,APPLY_QUAL
    -- ,SET_NAME
    ) VALUES (
    'ASN',
    'CHRIS',
    'SQLREP_ANY_TARGET_SOURCE_2',
    'TARGETDB',
    'CHRIS',
    'SQLREP_ANY_TARGET_TARGET_2',
    X'00000000000000000000000000000000',
    '1900-01-01-00.00.00.000001',
    X'00000000000000000000000000000000',
    '1900-01-01-00.00.00.000001',
    '1900-01-01-00.00.00.000001'
    -- optional columns for grouping purposes
    -- ,'ETL'
    -- ,'ANY_TARGET'
    );

    3) ETL pre-processing

    Before the next ETL cycle starts (for one CD table or for multiple CD tables), the ETL pre-processing fetches the current safe SYNCHPOINT from the IBMSNAP_REGISTER global record and stores it in the ETL progress table as MAX_REFRESH_COMMITSEQ (the upper limit for the next cycle). In this example, the upper limit is updated for all tables replicated from Capture schema ASN. The smallest possible unit would be the tables belonging to a pseudo subscription set:

    -- Connect to the SQL Capture Server
    connect to sourcedb;

    update SQLREP.ETL_PROGRESS P set
    (MAX_REFRESH_COMMITSEQ, MAX_REFRESH_LOGMARKER, REFRESH_TIMESTAMP)
    = ( select SYNCHPOINT, SYNCHTIME, current timestamp
    from ASN.IBMSNAP_REGISTER R
    where GLOBAL_RECORD = 'Y'
    )
    WHERE SQL_CAP_SCHEMA = 'ASN'
    ;

    4) ETL processing

    The ETL process now fetches all records from the CD tables that have a higher LSN than the highest LSN of the previous cycle (lower limit) and an LSN lower than or equal to the one fetched from the IBMSNAP_REGISTER global record (upper limit).

    We want to distinguish two different approaches for fetching from a CD table:

    1. Reading changed data from the CD table, comparable to SQL Replication target structure 9.
    2. Adding the source commit timestamp to the result set by joining the CD table with the global unit of work (UOW) table, comparable to SQL Replication target structure 3.

    SELECT to fetch data from the CD table:

    -- Connect to the SQL Capture Server
    connect to sourcedb;

    -- fetch data from a CD table
    select CD.col1, CD.xcol1, CD.col2, CD.col3, CD.intcol1, CD.intcol2,
    CD.ibmsnap_commitseq, CD.ibmsnap_intentseq, CD.ibmsnap_operation, current timestamp
    from CHRIS.SQLREP_ANY_TARGET_CD_1 CD
    where CD.IBMSNAP_COMMITSEQ >
    (SELECT MIN_REFRESH_COMMITSEQ
    FROM SQLREP.ETL_PROGRESS
    where SQL_CAP_SCHEMA = 'ASN'
    and SOURCE_OWNER = 'CHRIS'
    and SOURCE_TABLE = 'SQLREP_ANY_TARGET_SOURCE_1'
    and TARGET_DBALIAS = 'TARGETDB'
    and TARGET_OWNER = 'CHRIS'
    and TARGET_TABLE = 'SQLREP_ANY_TARGET_TARGET_1'
    )
    and CD.IBMSNAP_COMMITSEQ <=
    (SELECT MAX_REFRESH_COMMITSEQ
    FROM SQLREP.ETL_PROGRESS
    where SQL_CAP_SCHEMA = 'ASN'
    and SOURCE_OWNER = 'CHRIS'
    and SOURCE_TABLE = 'SQLREP_ANY_TARGET_SOURCE_1'
    and TARGET_DBALIAS = 'TARGETDB'
    and TARGET_OWNER = 'CHRIS'
    and TARGET_TABLE = 'SQLREP_ANY_TARGET_TARGET_1'
    )
    order by ibmsnap_commitseq, ibmsnap_intentseq
    with ur
    ;

    SELECT to fetch data from the CD table joined with the unit of work table:

    -- Connect to the SQL Capture Server
    connect to sourcedb;

    -- fetch data from a CD table joined with the unit of work table to add the source
    -- commit timestamp (IBMSNAP_LOGMARKER) to the result set
    select CD.col1, CD.xcol1, CD.col2, CD.col3, CD.intcol1, CD.intcol2,
    CD.ibmsnap_commitseq, CD.ibmsnap_intentseq, CD.ibmsnap_operation, UOW.IBMSNAP_LOGMARKER
    from CHRIS.SQLREP_ANY_TARGET_CD_2 CD
    inner join ASN.IBMSNAP_UOW UOW
    on CD.IBMSNAP_COMMITSEQ = UOW.IBMSNAP_COMMITSEQ
    where CD.IBMSNAP_COMMITSEQ >
    (SELECT MIN_REFRESH_COMMITSEQ
    FROM SQLREP.ETL_PROGRESS
    where SQL_CAP_SCHEMA = 'ASN'
    and SOURCE_OWNER = 'CHRIS'
    and SOURCE_TABLE = 'SQLREP_ANY_TARGET_SOURCE_2'
    and TARGET_DBALIAS = 'TARGETDB'
    and TARGET_OWNER = 'CHRIS'
    and TARGET_TABLE = 'SQLREP_ANY_TARGET_TARGET_2'
    )
    and CD.IBMSNAP_COMMITSEQ <=
    (SELECT MAX_REFRESH_COMMITSEQ
    FROM SQLREP.ETL_PROGRESS
    where SQL_CAP_SCHEMA = 'ASN'
    and SOURCE_OWNER = 'CHRIS'
    and SOURCE_TABLE = 'SQLREP_ANY_TARGET_SOURCE_2'
    and TARGET_DBALIAS = 'TARGETDB'
    and TARGET_OWNER = 'CHRIS'
    and TARGET_TABLE = 'SQLREP_ANY_TARGET_TARGET_2'
    )
    order by ibmsnap_commitseq, ibmsnap_intentseq
    with ur
    ;

    5) ETL post-processing

    Only when the ETL step was successful is the lower bound for the next cycle advanced:

    -- Connect to the SQL Capture Server
    connect to sourcedb;

    update SQLREP.ETL_PROGRESS P set
    MIN_REFRESH_COMMITSEQ = MAX_REFRESH_COMMITSEQ,
    MIN_REFRESH_LOGMARKER = MAX_REFRESH_LOGMARKER
    where SQL_CAP_SCHEMA = 'ASN'
    and SOURCE_OWNER = 'CHRIS'
    and SOURCE_TABLE = 'SQLREP_ANY_TARGET_SOURCE_1'
    and TARGET_DBALIAS = 'TARGETDB'
    and TARGET_OWNER = 'CHRIS'
    and TARGET_TABLE = 'SQLREP_ANY_TARGET_TARGET_1'
    ;

    update SQLREP.ETL_PROGRESS P set
    MIN_REFRESH_COMMITSEQ = MAX_REFRESH_COMMITSEQ,
    MIN_REFRESH_LOGMARKER = MAX_REFRESH_LOGMARKER
    where SQL_CAP_SCHEMA = 'ASN'
    and SOURCE_OWNER = 'CHRIS'
    and SOURCE_TABLE = 'SQLREP_ANY_TARGET_SOURCE_2'
    and TARGET_DBALIAS = 'TARGETDB'
    and TARGET_OWNER = 'CHRIS'
    and TARGET_TABLE = 'SQLREP_ANY_TARGET_TARGET_2'
    ;

    6) CD table pruning

    The Capture process regularly deletes rows from the CD and UOW tables which already have been forwarded to all targets and which are no longer required. This process is called ‘pruning’. To enable pruning, the replication progress has to be recorded in the PRUNE_SET table.

    If all CD tables that belong to the same ETL process and the same pseudo subscription have been read and forwarded successfully, the MIN_REFRESH_COMMITSEQ can be set as the pruning synchpoint in table IBMSNAP_PRUNE_SET. When Capture starts the next pruning cycle in the background, it will, for each CD table, determine a safe pruning LSN by comparing the pruning synchpoints of all processes (subscriptions) reading from the same CD table. Capture will delete all rows from the CD tables and from the UOW table which are no longer required by SQL Apply or any ETL process.

    Update the pruning synchpoint for the pseudo subscription like this:

    -- Connect to the SQL Capture Server
    connect to sourcedb;

    update asn.ibmsnap_prune_set
    set (SYNCHPOINT, SYNCHTIME) =
    (SELECT MIN(MIN_REFRESH_COMMITSEQ), MIN(MIN_REFRESH_LOGMARKER)
    from SQLREP.ETL_PROGRESS
    WHERE SQL_CAP_SCHEMA = 'ASN')
    WHERE APPLY_QUAL = 'ETL'
    AND SET_NAME = 'ANY_TARGET'
    ;
  9. Repeat

    The above steps 3) to 6) can be executed as often as required. The more often they run, the lower the latency at the final data destination. Remember that it does not make sense to execute the sequence more often than Capture’s COMMIT_INTERVAL. A sketch of one complete cycle follows.
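
    As a summary, the control table bookkeeping of one micro-batch cycle can be wrapped into a single Db2 anonymous SQL PL block. This is a sketch that recombines the statements shown above; run it with a non-default statement terminator (e.g. db2 -td@), and let the ETL application do the actual fetch and load where indicated:

    -- One micro-batch cycle of bookkeeping (sketch)
    BEGIN
    -- 3) pre-processing: advance the upper limit to Capture's global SYNCHPOINT
    UPDATE SQLREP.ETL_PROGRESS
    SET (MAX_REFRESH_COMMITSEQ, MAX_REFRESH_LOGMARKER, REFRESH_TIMESTAMP)
    = (SELECT SYNCHPOINT, SYNCHTIME, CURRENT TIMESTAMP
    FROM ASN.IBMSNAP_REGISTER WHERE GLOBAL_RECORD = 'Y')
    WHERE SQL_CAP_SCHEMA = 'ASN';
    -- 4) ETL processing: fetch from the CD tables and populate the targets here
    -- 5) post-processing: only on success, advance the lower limit
    UPDATE SQLREP.ETL_PROGRESS
    SET MIN_REFRESH_COMMITSEQ = MAX_REFRESH_COMMITSEQ,
    MIN_REFRESH_LOGMARKER = MAX_REFRESH_LOGMARKER
    WHERE SQL_CAP_SCHEMA = 'ASN';
    -- 6) allow pruning up to the new lower limit
    UPDATE ASN.IBMSNAP_PRUNE_SET
    SET (SYNCHPOINT, SYNCHTIME)
    = (SELECT MIN(MIN_REFRESH_COMMITSEQ), MIN(MIN_REFRESH_LOGMARKER)
    FROM SQLREP.ETL_PROGRESS WHERE SQL_CAP_SCHEMA = 'ASN')
    WHERE APPLY_QUAL = 'ETL' AND SET_NAME = 'ANY_TARGET';
    END@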

  10. Summary

    This article focused on how to extend IBM SQL Replication to any target by reading data directly from SQL Replication CD tables. Although CD tables were not originally intended as an external API, they can perfectly well be used as such, in a simple way that respects the SQL Replication protocol as described.

    The recipe explained in detail how to get safe ranges (upper limit, lower limit) for fetching data from SQL Replication CD tables in a micro-batching fashion. It documented, by example, how to register Db2 database tables as SQL Replication sources, how to define a pseudo subscription for those tables, and how to forward the captured data from the CD tables to any target.
