Overview

Skill Level: Beginner

Learn how to set up a command-line environment in which you can tailor extraction cartridges to your needs for StoredIQ step-up analysis.

Ingredients

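This recipe explains SystemT-based cartridges for StoredIQ. You will learn how to modify, configure, compile, and build these cartridges using command-line facilities. We will refer to the following online resources: Fix Central, the dW recipe on RegEx Cartridges, the Knowledge Center on Text Analytics, and the StoredIQ Knowledge Center.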


IBM StoredIQ uses cartridges to extract pieces of information such as bank account numbers, person names, or postal addresses from natural-language documents. Once these pieces are extracted, StoredIQ adds them to its content indexes so that they can be used as query terms. In this recipe, you start by setting up a cartridge development environment for the command line, called CartDevEnv for short. You then download and deploy the current version of a StoredIQ Advanced Cartridge into the CartDevEnv, change the cartridge's configuration, code, and assets, and finally compile and re-package it to obtain a fully functional modified Advanced Cartridge.


We assume that you are working in a Unix terminal, for example on Linux or macOS. The CartDevEnv can be used in Windows environments as well, but the command-line setup and the commands may differ from the ones described here. Italic typeface is used throughout this article to indicate commands and other system elements.

Step-by-step

  1. Set up the CartDevEnv

    Download the file CartDevEnv.zip from the latest StoredIQ version on Fix Central. Unpack it into a newly created directory, say ./SiqDevEnv/, on a file system with more than 10 GB of free space. Then, as described in Chapter 4 of the dW recipe on RegEx Cartridges, go to a StoredIQ data server, download the contents of its /usr/local/tomcat/webapps/storediq directory as a file siq_storediq.tgz, and extract that file into a newly created directory named webapps under ./SiqDevEnv/, that is, ./SiqDevEnv/webapps.
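
    The setup above could be scripted roughly as follows. This is only a sketch: it assumes that both archives have already been downloaded into the current directory and that siq_storediq.tgz contains a top-level storediq/ directory. The function names are hypothetical and not part of the CartDevEnv.

```shell
#!/bin/sh
# Sketch of step 1. Assumes CartDevEnv.zip and siq_storediq.tgz have already
# been downloaded into the current directory (hypothetical location).

DEV_DIR="./SiqDevEnv"
MIN_FREE_KB=$((10 * 1024 * 1024))   # 10 GB expressed in 1 KB blocks

# Succeeds if the file system holding directory $1 has more than $2 KB free.
has_free_space() {
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -gt "$2" ]
}

# Unpacks both archives into $DEV_DIR; fails early if an archive is missing
# or if there is too little free space.
setup_cartdevenv() {
    [ -f CartDevEnv.zip ] && [ -f siq_storediq.tgz ] || {
        echo "CartDevEnv.zip or siq_storediq.tgz not found" >&2
        return 1
    }
    mkdir -p "$DEV_DIR/webapps"
    has_free_space "$DEV_DIR" "$MIN_FREE_KB" || {
        echo "less than 10 GB free under $DEV_DIR" >&2
        return 1
    }
    unzip -q CartDevEnv.zip -d "$DEV_DIR"
    # Assumption: the archive contains a top-level storediq/ directory.
    tar -xzf siq_storediq.tgz -C "$DEV_DIR/webapps"
}
```

    After sourcing the file, calling setup_cartdevenv from the download directory performs the whole step.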

     

  2. Install a Cartridge into CartDevEnv

    When you extract the CartDevEnv package as described, the directory ./SiqDevEnv/webapps/storediq/WEB-INF/cartridges/ is created implicitly. Now download one of the two version 2.4 cartridges from Fix Central, say GDPRFocusedDataDiscoveryAdvanced.car (in the same way as the CartDevEnv.zip file), save it into the ./cartridges/ directory, and create a new directory ./GDPRFocusedDataDiscoveryAdvanced there. We will refer to this new directory as the cartridge directory.

    Navigate to the ./cartridges/ directory and unpack the archive file into the cartridge directory by running the following command:

    unzip ./GDPRFocusedDataDiscoveryAdvanced.car -d ./GDPRFocusedDataDiscoveryAdvanced/

    Now, edit the file ./SiqDevEnv/webapps/storediq/WEB-INF/uima/config/specifiers/search_document_cartridges.xml and make sure it contains the stanza:

    <delegateAnalysisEngine key="GDPRFocusedDataDiscoveryAdvanced">
        <import location="../../../cartridges/GDPRFocusedDataDiscoveryAdvanced/cartridgedescriptor.xml"/>
    </delegateAnalysisEngine>

    Note that the UIMA cartridge descriptor file cartridgedescriptor.xml is referenced not through a path setting in the development environment, but through a path relative to the location of the search_document_cartridges.xml file. Also note that cartridges in the ./cartridges/ directory that are not referenced in the search_document_cartridges.xml file are still loaded by the current UIMA setup and should therefore contain valid code.
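
    To check the registration without opening the file, a small helper along the following lines may be useful. It is a sketch, not a CartDevEnv tool: the function names are hypothetical, and it assumes that each import element sits on a single line and that import locations contain no spaces.

```shell
#!/bin/sh
# Sketch: check the registration of a cartridge in search_document_cartridges.xml.

# Succeeds if file $1 contains a delegateAnalysisEngine stanza with key $2.
has_delegate() {
    grep -qF "<delegateAnalysisEngine key=\"$2\">" "$1"
}

# Prints the import locations found in file $1, one per line.
import_locations() {
    sed -n 's/.*<import location="\([^"]*\)".*/\1/p' "$1"
}

# Succeeds if every import location in $1 points to an existing file.
# Locations are interpreted relative to the directory that contains $1.
imports_resolve() {
    base=$(dirname "$1")
    status=0
    for loc in $(import_locations "$1"); do
        [ -f "$base/$loc" ] || { echo "missing: $base/$loc" >&2; status=1; }
    done
    return "$status"
}
```

    For example, has_delegate search_document_cartridges.xml GDPRFocusedDataDiscoveryAdvanced && imports_resolve search_document_cartridges.xml succeeds only if the stanza is present and the relative path leads to a cartridgedescriptor.xml file.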

     

  3. Change the Cartridge Configuration

    The file cartridgedescriptor.xml defines the UIMA configuration of a StoredIQ cartridge, in this case the SystemT-based Advanced Cartridge. SystemT provides extractor functionality through so-called AQL modules, which are written in the text analysis language AQL and can each contain rule files and dictionaries. For detailed information on AQL, please see the Knowledge Center on Text Analytics. In this section, we explain how to configure the right set of AQL modules for a given extraction task.

    First, take a look at the XML element named ModulesToLoad in the cartridgedescriptor.xml file, as shown below.

    <nameValuePair>
        <name>ModulesToLoad</name>
        <value> <array>
            <string>AddressExtractCommon</string>
            …..
        </array> </value>
    </nameValuePair>

    This element lists all AQL modules that are executed during a cartridge run. Each listed module is applied to the text, so its cost in cartridge performance grows linearly with the amount of text being processed. The executable for a text analysis module is a so-called TAM file, which can be found under the path given in the CompileOutputURI element. TAM files are generated from the module through a compilation step (see step 5). The Advanced Cartridge as downloaded contains all TAM files needed, including those of the Basic Cartridge.

    Therefore, by uncommenting the following modules in the Advanced Cartridge's ModulesToLoad element, you can extend it with all of the Basic Cartridge's functionality. Running all extractor functionality in a single cartridge brings a small performance benefit that grows linearly with the amount of text processed.

                <string>BankAccountNumbers</string>
                <string>EmailAndIPAdresses</string> 
                <string>InternationalNumbers</string> 
                <string>dates_international</string>
                <string>numbers</string>                
                <string>PhoneNumbers</string>
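
    To see which modules are currently active, the module list can be read from the descriptor with standard tools. The following is a minimal sketch with a hypothetical function name. It assumes the layout shown above, with one string element per line, and it treats any line containing an XML comment opener as commented out. Comments that span several lines are not handled.

```shell
#!/bin/sh
# Sketch: print the active entries of a module list in cartridgedescriptor.xml.

# $1: descriptor file, $2: parameter name (e.g. ModulesToLoad).
# Prints one module per line; lines containing "<!--" are skipped.
list_modules() {
    awk -v name="$2" '
        index($0, "<name>" name "</name>") { inside = 1; next }
        inside && /<\/nameValuePair>/      { inside = 0 }
        inside && /<!--/                   { next }
        inside && match($0, /<string>[^<]*<\/string>/) {
            # strip the 8-character opening and 9-character closing tag
            print substr($0, RSTART + 8, RLENGTH - 17)
        }
    ' "$1"
}
```

    Calling list_modules cartridgedescriptor.xml ModulesToLoad from the cartridge directory prints the modules that will run. The same function works for any other named list in the descriptor that uses this layout.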

     

  4. Edit AQL Modules

    The directory ./GDPRFocusedDataDiscoveryAdvanced/aql contains the AQL modules, each consisting of rule source files with the extension .aql and/or dictionary files with the extension .dict. For information about working with this code, please see the Knowledge Center on Text Analytics. The following are some examples of possible enhancements to the AQL modules:

     

    • Add an extractor for customer numbers by copying and modifying one of the AQL files in the numbers module.
    • Add an extractor for postal addresses of a country other than UK, France, and Britain by copying and modifying one of the postal address modules. (Note that postal addresses for the US and Germany are already covered in the Advanced Cartridge, but cannot be modified.)
    • Add local phone numbers (without a country prefix such as +43 or 0043) by adding a regex extraction view to the respective national AQL file in InternationalNumbers.

    In your AQL code, you can output a text analysis view ViewName by adding the following statement to your AQL file:

    output view ViewName as 'com.ibm.systemT.ViewName';
     
    In this example, com.ibm.systemT.ViewName is an alias. The StoredIQ cartridges require these aliases (as well as any output views) to be registered UIMA types. A view name or alias is registered by adding it to the hierarchy of the UIMA type definition files namedentity_typesystem.xml (for string-type views) and numbers_typesystem.xml (for numeric views) in the cartridge directory. For example, if you add an AQL view or alias GDPRFocusedDataDiscoveryBasic.CustomerNumber, you should also add the following UIMA type definition element to the numbers_typesystem.xml file:

    <typeDescription>
        <name>GDPRFocusedDataDiscoveryBasic.CustomerNumber</name>
        <description>Customer number of company xyz</description>
        <supertypeName>GDPRFocusedDataDiscoveryBasic.Number</supertypeName>
    </typeDescription>
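
    Because a forgotten type definition only shows up at compile time, it can help to cross-check the aliases against the type system files beforehand. The sketch below uses a hypothetical function name. It assumes that each output view statement with an alias is written on one line in the form shown above, and that type names appear in the XML without surrounding whitespace. Output views without an alias are not covered.

```shell
#!/bin/sh
# Sketch: list output aliases in AQL files that lack a UIMA type definition.

# $1: directory containing .aql files (searched recursively)
# remaining arguments: one or more type system XML files
# Prints each alias for which no <name>alias</name> element is found.
missing_types() {
    aql_dir=$1; shift
    find "$aql_dir" -name '*.aql' -exec cat {} + |
    sed -n "s/^[[:space:]]*output view [^;]* as '\([^']*\)'.*/\1/p" |
    sort -u |
    while IFS= read -r type_name; do
        grep -qF "<name>$type_name</name>" "$@" || printf '%s\n' "$type_name"
    done
}
```

    Run from the cartridge directory, missing_types ./aql namedentity_typesystem.xml numbers_typesystem.xml prints nothing when every alias is registered.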

     

  5. Compile

    As soon as you have changed the configuration or AQL code of your cartridge, you can use the cartridge compiler to check the correctness of your code, including UIMA typing, and to generate TAM files. The compiler is run from the ./SiqDevEnv directory:

        bin/cartcompiler.sh 

    To define the scope of the cartridge compiler, edit the ModulesToCompile element in the cartridgedescriptor.xml file. The cartridge compiler processes all modules listed in this element for which no TAM file is found:

    <nameValuePair>
        <name>ModulesToCompile</name>
        <value> <array>
            <string>GDPRFocusedDataDiscoveryAdvanced/aql/uimaoutput</string>
            ……..
        </array> </value>
    </nameValuePair>

    Note that the compiler writes a TAM file to the CompileOutputURI directory only if the module appears in the ModulesToLoad element described above. All modules processed by the cartridge compiler are checked for AQL correctness and for compliance with the specified UIMA type system. When errors occur, problems found in the UIMA AQL wrapper code can easily be distinguished from problems detected by the AQL compiler. A missing UIMA type definition, for example, results in a runtime exception like the following:

    Caused by: java.lang.RuntimeException: No UIMA type information for output com.ibm.systemT.GenderIndicator at com.ibm.avatar.algebra.util.uima.UIMAWrapper.typeSystemInit(UIMAWrapper.java:597)
     
    The following is an example of a compiler error in the AQL code, which is reported with the AQL file name and the line and column number:

    Caused by: com.ibm.avatar.api.exceptions.CompilerException: Compiling AQL encountered 1 errors: 
    In /home/gsr/Documents/Products/StoredIQ/Cartridges/DENew/webapps/storediq/WEB-INF/cartridges/GDPRFocusedDataDiscoveryAdvanced/aql/BankAccountNumbers/BankAccountConsolidation.aql: At line 12, column 55 Encountered <EOF>. Was expecting a: "String constant"
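
    If the compiler output is long, the two error classes can be told apart mechanically. The following sketch assumes that the output of the compiler was saved to a log file, for instance by redirecting both output streams of bin/cartcompiler.sh to compile.log (a hypothetical file name). It only searches for the two message patterns quoted above, and the function names are likewise hypothetical.

```shell
#!/bin/sh
# Sketch: tell the two error classes apart in a saved compiler log.
# Assumption: the compiler output was captured in a file, for example with
#   bin/cartcompiler.sh > compile.log 2>&1

# Prints "uima-type", "aql-compiler", or "none" for log file $1.
classify_compile_log() {
    if grep -qF 'No UIMA type information for output' "$1"; then
        echo "uima-type"
    elif grep -qF 'com.ibm.avatar.api.exceptions.CompilerException' "$1"; then
        echo "aql-compiler"
    else
        echo "none"
    fi
}

# Prints the log lines that name an AQL file together with line and column.
aql_error_locations() {
    grep -E '\.aql: At line [0-9]+, column [0-9]+' "$1"
}
```

    A log with a missing type definition is classified as uima-type, in which case the type system files from step 4 are the place to look. For an aql-compiler result, aql_error_locations points to the file and position to fix.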

    Since the compilation step adds to the overall execution time, depending on the number and complexity of the AQL modules, make sure that every module for which a valid TAM file exists is removed from (or commented out of) the ModulesToCompile list. This is even more important when you build a new cartridge (see step 7) for deployment into StoredIQ.
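
    To decide which entries can be taken out of the list, you could compare the module paths with the TAM files that are already present. The sketch below rests on two assumptions that you should verify in your environment: that the TAM file of a module is named after the last component of its module path with the extension .tam, and that all TAM files sit directly in one directory (the one CompileOutputURI points to). The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: find modules that still need compiling.
# Assumption: the TAM file of module path a/b/name is <tam_dir>/name.tam.

# $1: directory holding the TAM files; remaining arguments: module paths.
# Prints every module path for which no TAM file exists.
modules_without_tam() {
    tam_dir=$1; shift
    for module_path in "$@"; do
        module=$(basename "$module_path")
        [ -f "$tam_dir/$module.tam" ] || printf '%s\n' "$module_path"
    done
}
```

    Modules printed by this function must stay in ModulesToCompile. All others already have a TAM file and are candidates for commenting out, provided that their TAM file is still valid for the current AQL code.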

  6. Testing Cartridge Changes

    Good cartridges require careful testing, with test data taken whenever possible from the same data collection that the final cartridge will be applied to. Test data should cover the following parameters: language, ASCII formatting, context of hits, all UIMA types, and all AQL views to be tested. As to formatting, remember that the CartDevEnv performs no binary deformatting, for example from a .ppt document to its ASCII format. In some cases, you cannot avoid adding test output statements to the AQL code; these must be matched by corresponding UIMA type definitions for the test.

    The following command-line tools are available in the CartDevEnv's bin directory for different testing needs:

    • batchdocumentprocessor_sample.sh – for testing large document sets
    • batchregressiontester_sample.sh – for regression testing with tabular input
    • cvd.sh – for interactively testing a small number of documents
    • batchresultanalysis.sh – for comparing two .csv output files of batchdocumentprocessor_sample.sh runs

    The first three tools are described in chapters 6, 7, and 8 of the dW recipe on RegEx Cartridges. The tool batchresultanalysis.sh is called from the ./SiqDevEnv directory with three parameters, as follows:

    • two .csv files as produced by running the batchdocumentprocessor_sample.sh tool – an “old” one, and a “new” one, and
    • a directory to which the results of the comparison are written.

    Here is an example of a call for the results directory ./SiqDevEnv/testAC: 
     
    bin/batchresultanalysis.sh ./testAC/bdpOld.csv ./testAC/bdpNew.csv ./testAC/

    Note that the batchresultanalysis.sh tool can run for a while and overwrites the result files of a previous run without warning. The output file, in the example above ./SiqDevEnv/testAC/resulttable.csv, contains descriptive statistics for the two .csv files and for the non-empty difference sets Old-New and New-Old, both on the type level and on the document level.
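
    Since earlier results are overwritten silently, you may want to keep a copy before each run. The following sketch uses a hypothetical function name and preserves only resulttable.csv, the one result file named in this recipe. Any further files that the tool writes would need the same treatment.

```shell
#!/bin/sh
# Sketch: keep a timestamped copy of the previous result table.

# $1: results directory. Copies $1/resulttable.csv to a timestamped file
# in the same directory and prints the name of the copy.
backup_result_table() {
    src="$1/resulttable.csv"
    [ -f "$src" ] || return 0          # nothing to preserve yet
    dest="$1/resulttable.$(date +%Y%m%d-%H%M%S).csv"
    cp -p "$src" "$dest" && printf '%s\n' "$dest"
}
```

    Calling backup_result_table ./testAC from the ./SiqDevEnv directory before bin/batchresultanalysis.sh leaves the previous statistics available for comparison.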

  7. Build a New Cartridge

    As we have seen above, the .car cartridge file is in fact a .zip file. If you want to run your cartridge as a new cartridge version within StoredIQ (see the StoredIQ Knowledge Center on Cartridges), increment the version property in the cartridge.properties file. Then create the cartridge's .car file by running the following command from the cartridge directory:

    zip -r ../GDPRFocusedDataDiscoveryAdvanced.car ./* 

    This command overwrites a previous .car file version in the CartDevEnv. The .car file can then be used for a cartridge upload or upgrade to StoredIQ, as described in the StoredIQ Knowledge Center. Since any customer-specific modifications of a cartridge are overwritten when the cartridge is updated, it is best practice to rename cartridges that contain such modifications.
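
    The packaging step can be wrapped so that the output name is chosen explicitly, which makes it easy to write a modified cartridge to a file of its own. This is a sketch with a hypothetical function name. It only controls the name of the archive file; whether further settings inside the cartridge must change for a rename is not covered here. Note also that zip updates an existing archive rather than recreating it, so the sketch deletes an old file of the same name first to avoid carrying over entries that no longer exist in the cartridge directory.

```shell
#!/bin/sh
# Sketch: package a cartridge directory into a .car file with a chosen name.

# $1: cartridge directory, $2: path of the .car file to create.
build_car() {
    out_dir=$(cd "$(dirname "$2")" && pwd) || return 1
    out_file="$out_dir/$(basename "$2")"
    rm -f "$out_file"                  # start from a fresh archive
    (cd "$1" && zip -qr "$out_file" ./*)
}
```

    For example, from the ./cartridges/ directory, build_car ./GDPRFocusedDataDiscoveryAdvanced ./GDPRFocusedDataDiscoveryAdvancedCustom.car writes the modified cartridge to a separate file. The name GDPRFocusedDataDiscoveryAdvancedCustom is purely illustrative.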

6 comments on "How to Modify SystemT-based Cartridges for StoredIQ"

  1. Hello Dr. Sebastian

    Thomas Hampp's dW recipe on RegEx Cartridges has an example of how to add your own Java-based validation logic (checksum calculation for IDs in particular) to the cartridge. This is an essential feature in the fight against false positives.
    Is Java-based validation possible in the SystemT cartridge implementation? Could you give an example of how to do it? Thank you.

    • Dr. Sebastian Goeser February 26, 2019

      Hi Andrejs,

      I fully agree. The programmatic means in AQL that correspond to validation in UIMA regex annotators are so-called UDFs. While (and because) UDFs are much more powerful than these validators, their integration into AQL module code requires some more consideration and, if you will, tooling. We are working on this extension and will update this recipe as soon as we possibly can.

  2. One more question Dr. Sebastian.

    The previous generation of the Basic cartridge (regex-based) switched rules on or off based on the document language.
    For example, by setting En (English) as a language in the rule that detects Latvian National IDs, I was able to force StoredIQ to look for Latvian National ID patterns in English documents (or in documents for which the language cannot be detected).

    Is the same possible with the new SystemT-based cartridge, and how can I achieve this?

    Thank you.

  3. Dr. Sebastian Goeser April 15, 2019

    Now there is a dW recipe on cartridge development including UDFs: https://developer.ibm.com/recipes/tutorials/how-to-create-custom-cartridges-for-storediq/
