Overview

Skill Level: Beginner

Starting with version 7.6.0.12, StoredIQ supports analytic plugins called cartridges. This article enables customers, partners, and IBM field personnel to create cartridges for this feature.
This article has been updated for StoredIQ 7.6.0.13.

Ingredients

  • Basic knowledge of regular expression programming
  • Basic StoredIQ knowledge
  • Access to an installation of StoredIQ version 7.6.0.13 or later
    • Note: The description and tools in this article no longer apply to or work with 7.6.0.12

Step-by-step

  1. What are cartridges (good for)?

    Starting with version 7.6.0.12, StoredIQ supports analytic plugins called cartridges. This article has been adjusted for version 7.6.0.13 and requires it as the minimum version. Cartridges are compressed files containing additional analysis logic that can be uploaded to StoredIQ. Cartridges can contain analysis logic based on different technologies, ranging from simple regular expressions (cartridge type regex) to full-blown cognitive approaches like natural language processing (NLP). By adding a cartridge to the StoredIQ application stack, you enable StoredIQ to find and index new items in documents, thus making them searchable. For example, a sensitive patterns cartridge for the General Data Protection Regulation (GDPR) can enable StoredIQ to detect passport numbers, phone numbers, and other IDs. To apply the logic of a selected cartridge to an infoset, StoredIQ introduces a new action type: step-up analytics. When you run this action, StoredIQ examines all documents in the infoset, applies the analytics contained in the cartridge to the documents, and then stores the analysis results in the StoredIQ index.

    StoredIQ 7.6.0.13 provides two country-specific cartridges for major EU countries to detect key sensitive information types. One cartridge detects basic IDs such as the national ID number, bank account number, passport number, IP address, and tax ID. A second cartridge uses cognitive natural language processing (NLP) modules to detect person names, organization and company names, locations, and addresses. You can download these two cartridges from FixCentral.

    You can create new cartridges of type regex and customize the regex cartridge provided with StoredIQ 7.6.0.13 for specific use cases. This article will explain how to do that.

     

     

  2. What does a regex cartridge contain?

    Technically, cartridges are compressed files with a .car extension, similar to the .jar, .ear, or .war files used in software development: they are ZIP files with an alternate file extension. At their heart, regex cartridges are just a set of configuration files for an instance of the Apache UIMA Regex Annotator.

    A StoredIQ regex cartridge must contain the following files:

    1. A LICENSE.TXT file that contains the license terms and conditions for the cartridge.
    2. A cartridge.properties file with some high level information about the cartridge.
    3. A cartridgedescriptor.xml file that is the UIMA analysis engine descriptor for the cartridge.
    4. A file that has the actual regex rules in the XML format specified by the Apache UIMA Regex Annotator. The name is configured within the cartridgedescriptor.xml file. We will be using the name concepts.xml in the following examples.

    A StoredIQ regex cartridge can contain an arbitrary number of other files in any nesting of subdirectories. Typically, this is not needed for regex cartridges but can be useful, for example, to split a long concepts.xml file into two separate files. All files contained in a cartridge will be deployed in StoredIQ.
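    Because a .car file is an ordinary ZIP archive, the list of required files can be checked before uploading. The following is a minimal sketch, not part of the StoredIQ tooling: the class name CartridgeContentCheck is hypothetical, and it assumes the required files sit at the top level of the archive. The rules file is not checked because its name is configurable.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of StoredIQ: checks a .car file for the
// fixed-name files that a regex cartridge must contain.
final class CartridgeContentCheck {

    // The rules file (for example concepts.xml) has a configurable name,
    // so only the three fixed names are listed here.
    static final List<String> REQUIRED = List.of(
            "LICENSE.TXT", "cartridge.properties", "cartridgedescriptor.xml");

    // Returns the required entries that are missing at the top level of the archive.
    static List<String> missingEntries(Path carFile) throws IOException {
        List<String> missing = new ArrayList<>();
        try (ZipFile zip = new ZipFile(carFile.toFile())) {
            for (String name : REQUIRED) {
                if (zip.getEntry(name) == null) {
                    missing.add(name);
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Missing: " + missingEntries(Path.of(args[0])));
    }
}
```

    Run it with the path to a .car file as the only argument; an empty list means all three fixed-name files were found.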

    Let’s have a quick look at each of the four file types in the context of a concrete example. There is a sample cartridge for illustration purposes that detects fictitious customer IDs. Because every company has different customer and employee IDs, this can’t be provided in any out-of-the-box cartridge. The sample is as minimal and simple as possible. You can download the SampleCartridge.car file following this link, rename it to a .zip extension and unpack it to a local directory. The following overview describes each file in a regex cartridge.

    File 1: LICENSE.TXT
    This is a plain text file that can have any content or just be empty. The cartridge developer can provide specific license terms for using the cartridge. When a cartridge is uploaded, you must agree to the license terms in the StoredIQ Administrator UI:

    1-license

    File 2: cartridge.properties
    This file is a simple properties file where each line is an attribute=value setting. Comment lines are prefixed with #. The cartridge.properties file provides the StoredIQ administrator installing the cartridge with the necessary information about the cartridge. The following properties are available for a regex cartridge:

    • name: Required. The name is case sensitive. It must be unique within all cartridges deployed to a StoredIQ instance, and must not contain any blanks or special characters.
    • description: Optional. One or two sentences describing the purpose of the cartridge. The description will be displayed in the StoredIQ Administrator UI.
    • version: Optional. Future versions of StoredIQ will allow updating cartridges with new versions.
    • languagesSupported: Optional. A list of ISO 639-1 two-letter, lowercase language codes representing the languages this cartridge is intended for. This setting is for informational purposes only. The actual language settings for the cartridge are defined in the cartridgedescriptor.xml UIMA descriptor. For details, see the cartridgedescriptor.xml file description.
    • supportedResults.*: One or more entries. Each entry defines a new StoredIQ search filter term that will be made available by the cartridge as an indexed annotation ia: filter term. Each result definition maps a search term to a UIMA type. For example, the entry supportedResults.customerid=SampleCartridgeTypes.CustomerNumber defines a new filter ia:customerid, which is mapped to the UIMA type SampleCartridgeTypes.CustomerNumber. This UIMA type is defined in the cartridgedescriptor.xml file and used in the regex rules defined in the concepts.xml file.

    Sample cartridge.properties file:

    name=SampleCartridge
    description=Sample cartridge for illustration purposes that detects fictitious customer IDs
    version=1.0
    supportedResults.customerid=SampleCartridgeTypes.CustomerNumber

    These settings will be shown in the StoredIQ Administrator UI as follows:

    CartridgepropertiesInSiq
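    To make the supportedResults mapping concrete, here is a small sketch that reads a cartridge.properties source with the standard java.util.Properties class and lists which ia: filter term maps to which UIMA type. The class name SupportedResults is hypothetical, and this is only an illustration of the file format, not how StoredIQ itself reads the file.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper: lists the filter terms a cartridge.properties file declares.
final class SupportedResults {

    static final String PREFIX = "supportedResults.";

    // Maps each ia: filter term to the UIMA type named in the properties.
    static Map<String, String> filterTerms(Reader propertiesSource) throws IOException {
        Properties props = new Properties();
        props.load(propertiesSource);
        Map<String, String> terms = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(PREFIX)) {
                terms.put("ia:" + key.substring(PREFIX.length()), props.getProperty(key));
            }
        }
        return terms;
    }

    public static void main(String[] args) throws IOException {
        String sample = String.join("\n",
                "name=SampleCartridge",
                "version=1.0",
                "supportedResults.customerid=SampleCartridgeTypes.CustomerNumber");
        // prints {ia:customerid=SampleCartridgeTypes.CustomerNumber}
        System.out.println(filterTerms(new StringReader(sample)));
    }
}
```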

    File 3: cartridgedescriptor.xml
    The following definitions are important in this configuration file:

    1) Which concepts XML file is used. The concepts XML is set in the following section:

    <nameValuePair>
    <name>ConceptFiles</name>
    <value>
    <array>
    <string>SampleCartridge/concepts.xml</string>
    </array>
    </value>
    </nameValuePair>

    Note that the name of the cartridge is part of the entry in the <string> section. If you change the cartridge name, you must adjust this setting as well. You can have more than one <string> entry in this <array> section.

    You need to use a relative path here. The names are resolved using the Java classpath setting. The StoredIQ runtime will add the parent directory under which all cartridges are unpacked into the classpath. Each cartridge is deployed (that is unpacked and copied) to a directory named like the cartridge. That is why the name of the cartridge needs to precede the name of the concept XML for successful resolution of the relative file name. If the cartridge is renamed the directory name will change and this setting needs to be adjusted to the new name.

    2) What UIMA types are defined. The UIMA types are set in the following section:

    <typeDescription>
    <name>SampleCartridgeTypes.CustomerNumber</name>
    <description/>
    <supertypeName>uima.tcas.Annotation</supertypeName>
    </typeDescription>

    Each concept that your regexes detect requires a separate <typeDescription> entry.
    You can have any number of them. The <name> entry can be any string. It is convention though to use a dot notation indicating namespaces. As a best practice, choose a common namespace for all types in your company or project. For simple cases, just use <supertypeName>uima.tcas.Annotation</supertypeName> for all types.

    3) What languages the cartridge should work on (and which ones it should not):

    <languagesSupported>
    <language>x-unspecified</language>
    </languagesSupported>

    You can have any number of <language> entries. Languages are specified as ISO 639-1 two-letter, lowercase codes. The special term x-unspecified states that the regex rules in the cartridge can work on documents in any language. To state that the cartridge should only work on English, German, and French, use a statement like this:

    <languagesSupported>
    <language>en</language>
    <language>de</language>
    <language>fr</language>
    </languagesSupported>

    Unlike the languagesSupported property in the cartridge.properties file, this setting is not just informational. As a best practice, start with <language>x-unspecified</language> and restrict the languages only if you find serious issues where the cartridge does not work well on documents written in other languages.
    For details about appropriate language settings, see the section on language detection and annotator skipping at the end of this article.

    For more details about UIMA analysis engine descriptors and UIMA types, see the UIMA documentation.

    File 4: concepts.xml
    This file has the actual regex rules in the XML format specified by the Apache UIMA Regex Annotator.
    The concept rule in the following example detects fictitious customer numbers like DE-12-3456-78 or XY/09/8765/43.

    <concept name="SampleCustomerNumber">
    <rules>
    <rule regEx="(?&lt;=\s|,|;|\A)[A-Z]{2}([\-\/])\d{2}\1\d{4}\1\d{2}(?=\s|,|;|\Z)"
    matchStrategy="matchAll" matchType="uima.tcas.DocumentAnnotation" confidence="1.0">
    </rule>
    </rules>
    <createAnnotations>
    <annotation id="CustNbr" type="SampleCartridgeTypes.CustomerNumber">
    <begin group="0" />
    <end group="0" />
    </annotation>
    </createAnnotations>
    </concept>

    The regEx attribute of the <rule> element has the actual regex. Note that this is within an XML file, so certain regex characters like the < character need to be escaped, as shown for &lt; in the example. The regex has the actual number pattern [A-Z]{2}([\-\/])\d{2}\1\d{4}\1\d{2} in the middle, with a look-behind on the left and a look-ahead on the right to get the boundaries right and avoid matches of this pattern in the middle of a longer pattern. You cannot just use \b as the boundary because the switch from a digit to a separator character such as a dash is also considered such a boundary. Therefore, the more complex boundary pattern \s|,|;|\Z, which allows for white space, comma, semicolon, and end of document (with \A for the start of the document on the left side), must be used.
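    The behavior of this pattern can be tried out in plain Java. The sketch below assumes that the regex semantics of the annotator match those of java.util.regex; the class name CustomerNumberRegexDemo and the test strings are made up for illustration. The pattern is the one from concepts.xml with the XML escape removed (&lt; becomes <) and backslashes doubled for the Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: runs the sample customer number pattern on plain strings.
final class CustomerNumberRegexDemo {

    // Look-behind and look-ahead check the boundaries without becoming part of the match.
    // The back-reference \1 forces the same separator (- or /) throughout the ID.
    static final Pattern CUSTOMER_NUMBER = Pattern.compile(
            "(?<=\\s|,|;|\\A)[A-Z]{2}([\\-\\/])\\d{2}\\1\\d{4}\\1\\d{2}(?=\\s|,|;|\\Z)");

    static List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = CUSTOMER_NUMBER.matcher(text);
        while (m.find()) {
            hits.add(m.group()); // group 0 is the whole match
        }
        return hits;
    }

    public static void main(String[] args) {
        // Two well-formed IDs, delimited by blank, comma, and semicolon
        System.out.println(findAll("IDs: DE-12-3456-78, XY/09/8765/43;"));
        // Mixed separators and an ID embedded in a longer token: no matches expected
        System.out.println(findAll("mixed DE-12/3456-78 and embedded 1DE-12-3456-789"));
    }
}
```

    The first call finds both IDs; the second finds none, because the back-reference rejects mixed separators and the boundary checks reject the embedded ID.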

    The type attribute of the <annotation> element is important. It must specify a type that is defined in the cartridgedescriptor.xml file in a <typeDescription> section. This is also the type used in the cartridge.properties file in a supportedResults setting, for example, supportedResults.customerid=SampleCartridgeTypes.CustomerNumber. With that setting, you are able to define a search filter ia:customerid in StoredIQ once you have deployed the cartridge.

    The <begin> and <end> group elements specify which match group or groups of the regex are used to determine the begin and the end of the annotation. Group 0 specifies the whole match (but remember that look-ahead and look-behind, as used in the example, are not part of the match). Very often, look-ahead and look-behind are too complicated, too slow, or too limited. In this case, use a regular expression that checks for your target in context and use a match group to limit the annotation to a submatch. A typical approach would be to have a pattern like (left-context-pattern)(actual-match-pattern)(right-context-pattern). With that approach, the match group for the actual target is match group number 2. So you would use the following configuration to make sure the annotation covers only the middle part.

    <begin group="2" />
    <end group="2" />

    This concludes the overview of the four file types that are required in a regex cartridge.

    An optional type of file that can be added to a regex cartridge is Java-based validation logic. This is useful in cases where regex patterns need to be augmented with checksum logic to improve their precision. The general approach for this is described in the documentation of the Apache UIMA Regex Annotator.

    Here is an example of such validation logic for British NHS IDs:

    package com.ibm.siq.personaldatavalidators;

    import org.apache.uima.annotator.regex.extension.Validation;

    public class NHRValidation implements Validation {

        /**
         * Validator for British NHS checksums as described here:
         * https://en.wikipedia.org/wiki/NHS_number
         */
        @Override
        public boolean validate(String coveredText, String ruleID) throws Exception {
            // get rid of anything but the digits
            coveredText = coveredText.replaceAll("[^\\d]", "");
            if (coveredText.length() < 9 || coveredText.length() > 10) {
                // valid NHS numbers must be 9 (without checksum) or 10 digits long
                return false;
            }
            int checksum = 0;
            for (int i = 1; i <= 9; i++) {
                int d = coveredText.charAt(i - 1) - (int) '0';
                // Ignoring the check digit, each of the first 9 digits is multiplied by 11 minus its position
                checksum = checksum + (d * (11 - i));
                // The remainder when dividing this number by 11 is calculated
                checksum = checksum % 11;
            }
            // Finally, this number is subtracted from 11
            checksum = 11 - checksum;
            // A checksum of 11 is replaced by 0 in the final NHS number
            if (checksum == 11) {
                checksum = 0;
            }
            // If the checksum is 10 then the NHS number is not valid.
            if (checksum == 10) {
                return false;
            }
            if (coveredText.length() == 9) {
                // without checksum there is nothing we can validate so let's assume it's OK
                // but there is still value, since a checksum of 10 can't exist
                return true;
            }
            int checkdigit = (int) coveredText.charAt(9) - (int) '0';
            return (checksum == checkdigit);
        }

        // Only used for local tests
        public static void main(String[] args) {
            NHRValidation v = new NHRValidation();
            try {
                System.out.println(v.validate("943-476-5919", ""));
                System.out.println(v.validate("505-798-6424\n", ""));
                System.out.println(v.validate("713-853-7107", ""));
                System.out.println(v.validate("713-463-8626", ""));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    The code above needs to be compiled into a .class file, in this case NHRValidation.class, which (following Java conventions) needs to go into a directory that corresponds to the package com.ibm.siq.personaldatavalidators. For this example, the directory structure would be com/ibm/siq/personaldatavalidators. This directory needs to be embedded directly into the .car file because the cartridge directory will be added to the Java classpath. There is no need to create a .jar file. Only .class files embedded with their directory structure are supported in StoredIQ regex cartridges.

    To use this Java logic, a regex rule can reference the class in a validate attribute of its <annotation> element, as shown here:

     <concept name="GBR_NationalID_NHS">
    <rules>
    <rule regEx="(\s|,|;|\A)(\d{3}([ -]?)\d{3}\3\d{4})(\s|,|;|\Z)" matchStrategy="matchAll" matchType="uima.tt.SentenceAnnotation" confidence="1.0"/>
    </rules>
    <createAnnotations>
    <annotation id="GBR_NationalID_NHS" type="GDPRFocusedDataDiscoveryBasic.GBR_NationalID_NHS"
    validate="com.ibm.siq.personaldatavalidators.NHRValidation">
    <begin group="2" />
    <end group="2" />
    </annotation>
    </createAnnotations>
    </concept>

     

  3. Deploying the cartridge to StoredIQ

    Uploading cartridges is documented in the StoredIQ documentation. But as a cartridge developer, you might welcome some additional tips, tricks, and best practices. StoredIQ 7.6.0.12 had a limitation in that it did not support updating or deleting cartridges. This limitation has been removed with StoredIQ 7.6.0.13: you can now update and delete cartridges. But note that deleting a cartridge is not possible while it is used in active infosets.

    During cartridge development, typically lots of different versions are created. In later sections of this article you will see how you can use a local cartridge development environment that allows you to test out cartridges without having to upload them after each change. However, several development versions are still likely to be created for upload to StoredIQ.

     

    To upload a cartridge, follow the steps shown here:

    3-uploadNewCartridgeVersion-1

    Make sure you have an action created that uses the cartridge.

    4-showNewCartridgeVersion

    Give it a name.

    5-EditNewCartridgeVersion

    In the second window, select the cartridge. Then, click Save.

    6-EditAction

     

    Best practices for cartridges and analytic step-up actions

    You can have multiple cartridges assigned to one analytic step-up action. You can even have the same cartridge assigned to several different step-up actions. While this provides for a variety of combinations, first and foremost, you should consider who is going to receive the resulting data and what the data will be used for.
    Cartridges are technical units of deployment and used to make new analytics functionality available in StoredIQ. Within StoredIQ, cartridges are visible only to and maintained by the administrator.

    Step-up analytics actions are how analytics are applied to data in infosets to achieve a given purpose or gain a given insight. They are created by the StoredIQ administrator but for use by the data expert. The only thing that a data expert sees and cares about is the data resulting from a cartridge: the list of indexed annotation ia: filter terms defined by the cartridge. But these terms are not necessarily cartridge-specific. You could have multiple cartridges define the same ia: filter term. For example, you could have two cartridges defining the same ia:DateOfBirth filter term. One cartridge could do it for French, the other for German or Japanese. One might miss DateOfBirth patterns in spreadsheets with their column structure; a second one could then be set up just to provide good DateOfBirth detection for spreadsheets.

    As a best practice, one step-up analytics action should have one single clear purpose and define the set of ia: filter terms needed to achieve that purpose. The action can very well contain multiple cartridges for that. For example, the two cartridges provided with StoredIQ 7.6.0.13 (GDPRFocusedDataDiscoveryBasic and GDPRFocusedDataDiscoveryAdvanced) have both been developed for the same purpose: to detect entities that are important to find personal data for GDPR. So, both would in most cases go into the same step-up analytics action, which could be called “Analyze data to detect key personal data”.
    If you have other cartridges that contain analytics for a different purpose, they would go into a different step-up analytic action. For example, a cartridge that can identify complaint documents by looking for complaint indicator words could go into a step-up action called “Analyze data to detect complaints”.

  4. Creating a local cartridge development environment outside of StoredIQ

    During the development of a cartridge, it is inconvenient to just try out new, improved versions of the cartridge by uploading each new version. For quicker turnaround, more detailed insight, and error tracking, it is important to have a way to run cartridges on selected text content without having to go through cartridge deployment, analytic step-up, and an ia: search cycle in StoredIQ.

    The following approach will show how cartridges can be developed and tested locally.

    For the purpose of this article we assume that you already deployed the two StoredIQ GDPR cartridges available from FixCentral and also the SampleCartridge introduced previously before you follow the steps here to create a local cartridge environment.

    Download the CartridgeDevDW.zip package and extract its contents to a local directory. You should see a directory CartridgeDevDW. We will refer to this directory as the StoredIQ Cartridge Development Environment Directory, or the SIQ_CART_DEV directory for short, in the rest of this article.

    Find the name or the IP address of one of the data servers of your StoredIQ instance and copy its web application to your local machine by following these steps:

    1. Create an SSH session to one of the StoredIQ data servers (it does not matter which one).
    2. Change to the Tomcat web application directory: cd /usr/local/tomcat/webapps
    3. Create an archive of the storediq directory contents: tar cvzf siq_webapp.tgz storediq/
    4. Download the resulting siq_webapp.tgz file to your local machine. Create a local directory named webapps and unpack the tar file to this directory. The storediq directory must be the only directory in the newly created webapps directory.

     

    You should now have the top level directory called “webapps” with the Tomcat directory structure underneath starting with webapps/storediq/WEB-INF. We will refer to yourlocalpath/webapps/storediq/WEB-INF as the SIQ_APP_ROOT directory in the rest of this article.
    Ideally place the downloaded webapps directory within the SIQ_CART_DEV directory. Then, your directory structure should in the end look like this:

    7-directorystructure

    In this configuration, SIQ_APP_ROOT is just a shorthand for SIQ_CART_DEV/webapps/storediq/WEB-INF.
    If it is not possible to locate the webapps directory within the SIQ_CART_DEV directory, define an environment variable SIQ_APP_ROOT pointing to yourlocalpath/webapps/storediq/WEB-INF at the proper location on your local machine.

  5. Deploying a cartridge into the local environment

    Deploying a new cartridge requires two steps:

    1. Installing (that is, laying down) the required files for the cartridge in the right directory
      Cartridge files are installed into the SIQ_APP_ROOT/cartridges subdirectory. To install an existing cartridge for which a .car file is available, just rename the .car extension to .zip and extract the file contents to a subdirectory with the name of the cartridge. To create a new one from scratch, create a subdirectory with the name of the new cartridge under the SIQ_APP_ROOT/cartridges directory. Then, create new files named cartridge.properties, cartridgedescriptor.xml, concepts.xml, and LICENSE.TXT as described in the section about the contents of a regex cartridge.
    2. Configuring the UIMA pipeline in SIQ_APP_ROOT to use the new cartridge
      To configure the UIMA pipeline to start using a new cartridge, you have to edit the main UIMA aggregate analysis engine. To learn more about UIMA concepts like analysis engines and their XML configuration files (aka descriptors), have a look at the UIMA documentation. You can use the UIMA Eclipse tooling to edit the descriptor but in this article we will describe how to edit the XML directly.

     

    For this example let’s assume the name of the new cartridge is SampleCartridge. If your cartridge is named differently, just replace each occurrence of SampleCartridge in the following instructions with the name of your cartridge.

    1. Open the file SIQ_APP_ROOT/uima/config/specifiers/search_document_cartridges.xml in a text or XML editor.
    2. Find the XML section <delegateAnalysisEngineSpecifiers>.
    3. Within the <delegateAnalysisEngineSpecifiers> section, add a <delegateAnalysisEngine> section as shown here:
      <delegateAnalysisEngine key="SampleCartridge">
      <import location="../../../cartridges/SampleCartridge/cartridgedescriptor.xml"/>
      </delegateAnalysisEngine>
    4. Save the search_document_cartridges.xml file.

     

    You are now ready to test your cartridge with the stand-alone tools described in the following sections.

     

  6. Testing the deployed cartridges in the local environment interactively

    Now you are ready to run the UIMA CAS Visual Debugger (CVD), which is a UIMA tool to run and inspect analysis pipelines.

    Open a command shell, change to the SIQ_CART_DEV directory, and execute the bin/cvd.sh script (or bin/cvd.bat on Windows).
    CVD should open with the warning shown here (which you can ignore):

    8-CVD1

    Click OK.

    On the very first use, the screen layout might be different (small upper left and right sections). Drag the section boundaries to look roughly like above.
    On a Mac system, CVD sometimes seems to hang on very first use. In that case, kill the CVD application. Start CVD again this time using the bin/cvd_nodescriptor.sh. Close it and try the bin/cvd.sh again.

    First let’s load a little more interesting text to process. Click File > Open Text File and select $SIQ_CART_DEV/testdata/en/testdoc1.txt

    9-CVD2

    Now click Run > Run Annotator chain for document processing at indexing time – Default.

    10-cvd3

    Under CAS Index Repository, expand AnnotationIndex. This will show all annotations. Let’s focus just on the SampleCartridgeTypes.CustomerNumber and select that one:

    11-cvd4

    You will see a list of all UIMA annotations of our selected type. Select the first annotation:

    12-cvd5

    On the left in the Text section a stretch of text that matches our pattern should now be highlighted:

    13-cvd6

    You can use the left side to focus on other types of results. To only see GDPR basic cartridge results, click GDPRFocusedDataDiscoveryBasic.BaseType (and then cycle through the results in the lower left):

    14-cvd7

    For more information about CVD, check out the Apache UIMA tooling documentation.

    Modifying concept rules and testing them locally

    Cartridge rules can be modified by editing the cartridges/<cartridgename>/concepts.xml file as described in the overview section. To try this with the sample cartridge, open the file $SIQ_APP_ROOT/cartridges/SampleCartridge/concepts.xml in a text or XML editor and find the entry <rule regEx="(?&lt;=\s|,|;|\A)[A-Z]{2}([\-\/])\d{2}\1\d{4}\1\d{2}(?=\s|,|;|\Z)". We could now edit it to match a different pattern. But instead of just changing the existing rule, we will add an additional rule. That is helpful if there are different patterns that can find the same concept. You could squeeze them into one rule with regex OR matching, but a separate rule is much easier to maintain. In the example shown next, a very naive rule was added that detects any sequence of five digits as an additional format of customer number.

     <rules>
    <rule regEx="(?&lt;=\s|,|;|\A)[A-Z]{2}([\-\/])\d{2}\1\d{4}\1\d{2}(?=\s|,|;|\Z)"
    matchStrategy="matchAll" matchType="uima.tcas.DocumentAnnotation" confidence="1.0">
    </rule>
    <rule regEx="\d{5}" matchStrategy="matchAll"
    matchType="uima.tcas.DocumentAnnotation" confidence="1.0">
    </rule>
    </rules>

    Save the modified concepts.xml file.

    To make CVD pick up this change, you now need to reload the main descriptor. The quickest way to accomplish that is by using the Run > Recently used menu and selecting the search_document_content_aggregates.xml analysis engine.

    16-cvd8

    Alternatively you could simply close and restart CVD.

    Reloading and re-running the analysis shows the new rule matching in CVD on the sample text “123456789”. Note that the match ends in the middle of a sequence of digits.

    17-cvd9

    Typically, you will want to add boundary markers to avoid matches within longer sequences. Changing the rule to regEx="\b\d{5}\b" is a good start. After that change, it will no longer match “123456789”, but once you add a blank after the 5 it will match again, as shown below:

     18-cvd10

    Using \b might not be selective enough for numeric patterns because it also matches on boundaries between numbers and separator characters like dashes or slashes. You would get two matches for “12345-67890”, which might not be what you want because this is more likely a longer ID of a different nature. This is why many rules end up using more complex look-behind and look-ahead statements like (?<=\s|,|;|\A) and (?=\s|,|;|\Z).
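    The difference between the two boundary styles can be reproduced outside of CVD. The following sketch counts matches for both variants of the five-digit rule on the three inputs discussed above. It assumes java.util.regex semantics; the class name BoundaryDemo is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: compares \b boundaries with explicit look-around boundaries.
final class BoundaryDemo {

    static final String WORD_BOUNDARY = "\\b\\d{5}\\b";
    static final String LOOK_AROUND = "(?<=\\s|,|;|\\A)\\d{5}(?=\\s|,|;|\\Z)";

    // Counts how often the regex matches in the text.
    static int count(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        for (String text : new String[] {"123456789", "12345 6789", "12345-67890"}) {
            System.out.println(text
                    + " -> word boundary: " + count(WORD_BOUNDARY, text)
                    + ", look-around: " + count(LOOK_AROUND, text));
        }
    }
}
```

    For “12345-67890” the \b variant reports two matches while the look-around variant reports none, which is the behavior described above.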

    To introduce a new concept in addition to CustomerNumber three steps are needed:

    1. Creating a UIMA type for the concept in $SIQ_APP_ROOT/cartridges/SampleCartridge/cartridgedescriptor.xml
    2. Creating a concept group in $SIQ_APP_ROOT/cartridges/SampleCartridge/concepts.xml
    3. Adding the new UIMA type to the list of supportedResults in $SIQ_APP_ROOT/cartridges/SampleCartridge/cartridge.properties

    For step 1. add a new <typeDescription> section (the second one in the listing below):

    <types>
    <typeDescription>
    <name>SampleCartridgeTypes.CustomerNumber</name>
    <description/>
    <supertypeName>uima.tcas.Annotation</supertypeName>
    </typeDescription>
    <typeDescription>
    <name>SampleCartridgeTypes.InsuranceNumber</name>
    <description/>
    <supertypeName>uima.tcas.Annotation</supertypeName>
    </typeDescription>
    </types>

    For step 2. add another <concept> section for the SampleInsuranceNumber as shown here:

    <concept name="SampleInsuranceNumber">
    <rules>
    <rule regEx="(\s|,|;|\A)([A-Z]{2}\d{5})(\s|,|;|\Z)" matchStrategy="matchAll"
    matchType="uima.tcas.DocumentAnnotation" confidence="1.0">
    </rule>
    </rules>
    <createAnnotations>
    <annotation id="InsNbr" type="SampleCartridgeTypes.InsuranceNumber">
    <begin group="2" />
    <end group="2" />
    </annotation>
    </createAnnotations>
    </concept>

    This example shows another best practice: It does have boundary markers like (\s|,|;|\A), but they are not written as look-ahead and look-behind (that is, without ?<= and ?=). They are just match groups. Usually that would mean the regex match includes the boundary, which is unwanted. To compensate for that, the <begin group="2" /> and <end group="2" /> entries have been added to limit the resulting annotation to regex match group two, which corresponds to the actual ID pattern in the middle. Look-ahead and look-behind are slower and more limited in expressiveness; therefore, it’s better to just use a longer regex including boundaries and limit the result in the <begin> and <end> statements.
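    How a match group limits the annotation span can be illustrated with plain java.util.regex. This is only a sketch under the assumption that the annotator derives begin and end offsets from the group in the same way; the class name MatchGroupDemo is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: boundaries are consumed as groups 1 and 3,
// and only group 2 is used as the span of the result.
final class MatchGroupDemo {

    static final Pattern INSURANCE_NUMBER =
            Pattern.compile("(\\s|,|;|\\A)([A-Z]{2}\\d{5})(\\s|,|;|\\Z)");

    // Returns {begin, end} of group 2 for the first match, or null if there is none.
    static int[] firstSpan(String text) {
        Matcher m = INSURANCE_NUMBER.matcher(text);
        return m.find() ? new int[] {m.start(2), m.end(2)} : null;
    }

    // Counts all matches found by repeated find() calls.
    static int countMatches(String text) {
        Matcher m = INSURANCE_NUMBER.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "ID AB12345, next";
        int[] span = firstSpan(text);
        // prints AB12345 (offsets 3 to 10), without the leading blank and trailing comma
        System.out.println(text.substring(span[0], span[1]));
        // prints 1: the blank between the IDs is consumed by the first match
        System.out.println(countMatches("AB12345 CD67890"));
    }
}
```

    One trade-off is visible in the second call: with java.util.regex, consumed boundary characters are not available to the next match, so two IDs separated by a single blank yield only one match. If that case matters for your data, a look-ahead on the right side avoids it.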

    For step 3. add a new line with its own filter term name (insuranceid is used here as an example) as shown below:

    supportedResults.customerid=SampleCartridgeTypes.CustomerNumber
    supportedResults.insuranceid=SampleCartridgeTypes.InsuranceNumber

    After reloading and rerunning, you should see a result like this (type the input text “AB12345” into CVD):

    21-cvd11

     

    Tip: In the early phase of regex development, where the focus is just on getting the actual regex pattern right, the cycle of updating concepts.xml, reloading the descriptor, and rerunning the analysis can be cumbersome. In that phase, it is more efficient to use an interactive regex tool like https://regex101.com/. You can use a tool like that to fine-tune the regex and then just copy it over to the concepts.xml file once it does what you want. It’s still important, though, to validate the final pattern in CVD because an external tool cannot process the details of boundary matches and <begin> and <end> assignments in exactly the same way as the real StoredIQ runtime that CVD allows you to use.

    22-regex101

    To try out the modified sample cartridge in StoredIQ, you need to create a new .car file. Just compress the files in the folder $SIQ_APP_ROOT/cartridges/SampleCartridge. Make sure that your compression tool creates a flat archive structure, that is, the cartridge files sit at the top level of the archive and are not wrapped in an extra directory. Then change the file extension to .car.
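    If you prefer to script the packaging step, the following sketch creates a .car file with the standard java.util.zip classes. The class name CartridgePackager is hypothetical and not part of the CartridgeDevDW package. Entry names are written relative to the cartridge directory, so the files end up at the top level of the archive, while subdirectories inside the cartridge (for example for validation classes) keep their relative paths.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: zips the contents of a cartridge directory into a .car file.
final class CartridgePackager {

    static void pack(Path cartridgeDir, Path carFile) throws IOException {
        // Collect the files first so the output file is never picked up by the walk.
        List<Path> files;
        try (Stream<Path> walk = Files.walk(cartridgeDir)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        try (OutputStream out = Files.newOutputStream(carFile);
             ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Path file : files) {
                // Names are relative to the cartridge directory: no wrapping folder.
                // ZIP entries always use '/' as the separator.
                String entryName = cartridgeDir.relativize(file).toString().replace('\\', '/');
                zip.putNextEntry(new ZipEntry(entryName));
                Files.copy(file, zip);
                zip.closeEntry();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: cartridge directory, args[1]: target .car file
        pack(Path.of(args[0]), Path.of(args[1]));
    }
}
```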

    With the information explained above you should be able to create your own rules, concepts and even new cartridges using the sample cartridge as a basis.

  7. Regression-testing the deployed cartridges locally in batch mode

    Using a tool like CVD to interactively run a cartridge on a small piece of text is essential. But an interactive tool is not a good choice for automating tests after the rule has been developed and is now being refined. In that phase, automated regression testing is important to make sure the rules still match what they were originally intended to match. Refinement of rules should not introduce changes that undo some of the old, intended behavior.

    For this purpose, a command-line regression test tool is provided with the CartridgeDevDW package. The tool takes a test specification of multiple test patterns and executes them all providing a detailed overview of success and failure for each test pattern.

    The test specification is provided in a CSV file with four columns as shown here:

    [Screenshot: sample test specification CSV file with four columns]

    • Column 1 “TestTargetType” contains the type of annotations the regex rule creates. You can find that in your concepts.xml file.
    • Column 2 “Language” contains the language to be passed into UIMA. Test strings are very short and might contain only numbers so that automatic language detection will not work reliably. Just set this to the language the rules are intended for.
    • Column 3 “TestString” contains the actual string that is supposed to be matched as “TestTargetType”.
    • Column 4 “TestGoal” allows for negative testing in addition to regular, positive testing: if the value is “MustNotMatch” the program will make sure that NO instance of “TestTargetType” is detected. If it is, the test is counted as failed.
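
    A test specification with these four columns can also be generated by a script. The Python sketch below is only an illustration under several assumptions: that the file starts with a header row carrying the column names, and that positive tests use the value “MustMatch” (only “MustNotMatch” is confirmed above). Compare with the provided testdata/batchregressiontester_sample.csv file before relying on either.

```python
import csv
import io

COLUMNS = ["TestTargetType", "Language", "TestString", "TestGoal"]


def write_spec(rows, stream):
    """Write test rows as CSV; each row must have exactly four values."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(COLUMNS)  # assumption: the tool expects a header row
    for row in rows:
        if len(row) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} values, got {len(row)}")
        writer.writerow(row)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_spec(
        [
            # "MustMatch" is a hypothetical value for positive tests.
            ("SampleCartridgeTypes.CustomerNumber", "en", "AB12345", "MustMatch"),
            ("SampleCartridgeTypes.CustomerNumber", "en", "AB1234", "MustNotMatch"),
        ],
        buffer,
    )
    print(buffer.getvalue())
```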

    In realistic document texts, the test strings will occur in the larger context of sentences, paragraphs and whole documents. It’s OK for the test to ignore the larger context and treat each test string as a mini-document. But the immediate context of the words before and after the test string can make a difference. For example, 203.0.113.19 is a valid IP address. But when it’s preceded by (+33). to form (+33).203.0.113.19, it might actually be a phone number. To take contexts into account, the test specification needs to be accompanied by a second input CSV file providing the contexts to check for. It has the form shown in this example:

    [Screenshot: sample context CSV file with four contexts]

    With this context file, the tool tests for four different contexts. For each context, the following information needs to be provided:

    • Column 1 “LeftContext“: The words to be added to the left of the test string. This will typically end with some white space. White space is not automatically added but must explicitly be entered in the CSV file.
    • Column 2 “RightContext“: The words to be added to the right of the test string. This will typically begin with some white space. White space is not automatically added but must explicitly be entered in the CSV file.
    • Column 3 “Comment“: Information about the purpose of the context. This information is used for display purposes.

    For a test string like “203.0.113.19” four tests will be run given the sample context CSV file:

    1. “left context 203.0.113.19 right context”
    2. “LEFT CONTEXT 203.0.113.19 RIGHT CONTEXT”
    3. “left 123 203.0.113.19 456 context”
    4. “left : 203.0.113.19 : context”
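
    The way the mini-documents are put together can be sketched in a few lines of Python. This is an illustration of the concatenation only, not the tool’s source code. The context values are reconstructed from the four example strings above, and the comments are taken from the tool output shown in this step.

```python
# (LeftContext, RightContext, Comment), reconstructed from the examples above.
# White space is part of the context values; nothing is added automatically.
CONTEXTS = [
    ("left context ", " right context", "regular words lower ASCII space delimited"),
    ("LEFT CONTEXT ", " RIGHT CONTEXT", "regular words upper ASCII space delimited"),
    ("left 123 ", " 456 context", "numeric context space delimited"),
    ("left : ", " : context", "special characters context space delimited"),
]


def mini_documents(test_string, contexts=CONTEXTS):
    """Return one mini-document per context: left + test string + right."""
    return [left + test_string + right for left, right, _comment in contexts]


if __name__ == "__main__":
    for document in mini_documents("203.0.113.19"):
        print(document)
```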

     

    Let’s try this with the provided sample:

    In the command shell, change to the SIQ_CART_DEV directory, then enter the following command:
    ./bin/batchregressiontester_sample.sh

    You should see output like this:

    Starting regression test
    Input tests used from: testdata/batchregressiontester_sample.csv
    Contexts used from: testdata/batchregressiontester_contexts.csv
    UIMA descriptor: webapps/storediq/WEB-INF/uima/config/specifiers/search_document_content_aggregates.xml
    Output will be written to: testoutput/batchregressiontester.csv
    Testing all test string in 4 different contexts:
    Testing with regular words lower ASCII space delimited
    Testing with regular words upper ASCII space delimited
    Testing with numeric context space delimited
    Testing with special characters context space delimited

    Test Results:
    Passed: 688
    PassedWithOverlapWarning: 56
    Failed: 12

    Output file: testoutput/batchregressiontester.csv
    Done.

    Now, open the output file testoutput/batchregressiontester.csv in a spreadsheet application. You should see something like this (formatting and alignment have to be applied manually):

    [Screenshot: regression test output CSV file opened in a spreadsheet application]

    Excel tip: Use the Data|Filter functionality to get filter columns displayed. Try filtering for failed tests to focus on issues. Use right-aligned columns to see the end of long strings.

    The first four columns just repeat the test specification file. The others have the test result information:

    • Column 5 “TestSuccess”: One of Passed, Failed, PassedWithOverlapWarning, Skipped.
    • Column 7 “OverlappingType(s)”: A comma delimited list of matches of other types overlapping with the test match: A different set of rules claims that the very same piece of text identified to be type X by this rule is identified to be of type Y by the overlapping rule. Sometimes this is OK as some patterns are ambiguous but often this is an indication that detection rules are not specific enough.
    • Column 8 “TextOnWhichTestWasRun”: The mini document text including both context parts.
    • Column 9 “ContextComment”: The comment from the context input file.
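
    Instead of filtering in a spreadsheet, you can also summarize the output file with a short script. The Python sketch below assumes that the output CSV file has a header row and that the result column is named “TestSuccess” as described above. The function name summarize is made up for illustration.

```python
import csv
from collections import Counter


def summarize(stream, status_column="TestSuccess"):
    """Count the rows per test status and collect the failed rows."""
    counts = Counter()
    failed = []
    # Assumption: the first row of the file holds the column names.
    for row in csv.DictReader(stream):
        status = row[status_column]
        counts[status] += 1
        if status == "Failed":
            failed.append(row)
    return counts, failed


if __name__ == "__main__":
    with open("testoutput/batchregressiontester.csv", newline="") as handle:
        counts, failed = summarize(handle)
    print(dict(counts))
    for row in failed:
        print(row)
```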

    The batchregressiontester_sample.sh script is just a convenience wrapper to call the real tool batchregressiontester.sh with these sample arguments:

    batchregressiontester.sh 
    ${SIQ_CART_DEV}/testdata/batchregressiontester_sample.csv
    ${SIQ_CART_DEV}/testdata/batchregressiontester_contexts.csv
    ${SIQ_CART_DEV}/testoutput/batchregressiontester.csv

    The full list of arguments for the command line tool is:
    batchregressiontester.sh DESCRIPTOR INPUT_CSV CONTEXT_CSV OUTPUT_CSVFILE [UIMA_DATA_PATH]

    The sample test specification provided in the testdata/batchregressiontester_sample.csv file is a subset of the tests used for the GDPRFocusedDataDiscoveryBasic cartridge. You will notice some failures left in there for educational purposes. There are some false positives for NHS IDs which do match the regex pattern but are not valid NHS IDs given the checksum logic. There are also some failures on British passport numbers which work fine in ASCII contexts but not in a numeric context. Have a look at the associated regexes. How could the errors be fixed? Do they even need to be fixed? (Exercise left to the reader)

    It is strongly suggested that each cartridge developer provides a test specification covering all the created output types with multiple positive and negative test strings. The strings should cover all major variations in the regex. Try to get to zero failures. Rerun the tests after each change to the concepts.xml rules to make sure rule tuning does not invalidate old behavior. If you want other people to tweak your cartridge, you can put a test specification file plus a context file that run without failures into your .car file so others get them together with the actual cartridge.

    Tip: Adding numbers and white space in CSVs can be tricky, especially when Excel is used to work with the CSV files. To avoid pitfalls like Excel interpreting a long series of digits as a number and replacing a value like “390244699470” with “3,90245E+11”, you need to precede the test value with a tab (\t). Just enclosing the value in quotes is not sufficient to have Excel treat the value as a string and not as a number. A CSV value that starts with a tab, or with a quote immediately followed by a tab, will be treated as a string by Excel. The batch tools will ignore the tabs as they always treat any CSV values as strings.
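
    If you generate the CSV files with a script, the tab prefix can be added automatically. The following Python sketch illustrates the idea. The function names protect, unprotect and round_trip are made up for this example, and the stripping step only mimics the statement above that the batch tools ignore the tabs.

```python
import csv
import io


def protect(value):
    """Prefix a value with a tab so that a spreadsheet keeps it as text."""
    return value if value.startswith("\t") else "\t" + value


def unprotect(value):
    """Remove leading tabs again, as a consumer of the file might do."""
    return value.lstrip("\t")


def round_trip(values):
    """Write protected values to CSV text and read them back unprotected."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow([protect(v) for v in values])
    buffer.seek(0)
    return [unprotect(cell) for cell in next(csv.reader(buffer))]


if __name__ == "__main__":
    print(round_trip(["390244699470", "203.0.113.19"]))
```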

  8. Mass-testing the deployed cartridges locally in batch mode

    Using a tool like CVD and regression testing are essential. They are especially helpful to ensure “correctness” of the patterns in the sense that they actually do match the samples you had in mind when composing the regular expressions. In that sense a tool like CVD helps mostly with striving for what is called high recall in pattern recognition.

    But there is a complementary aspect to developing good regex rules where running an interactive tool does not help that much: the rules should not create “noise”, that is, false positives. The regression testing tool lets you specify “MustNotMatch” tests for cases where false positives are anticipated. But it is very hard to anticipate them. Most false positives happen with unanticipated text variations. Reducing false positives is striving for high precision in pattern recognition.

    Precision and recall are often a trade-off, and achieving 100% precision and 100% recall together is often impossible. It always depends on what kind of text the rules are run against. Texts that have unanticipated instances often cause low precision (high noise).
    Ideally, you would test the rules against texts that resemble the texts the rules will be run against in production. Often, such texts are not available at rule development time. So to get at least a basic sense about the precision of rules, you should run them against a wide variety of texts.
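
    For readers who are new to these terms, the standard definitions can be written down as a small Python sketch. The counts in the example are invented and are not results from any StoredIQ test run.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) using the standard definitions.

    precision = correct hits / all reported hits
    recall    = correct hits / all instances that should have been found
    """
    reported = true_positives + false_positives
    relevant = true_positives + false_negatives
    precision = true_positives / reported if reported else 0.0
    recall = true_positives / relevant if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Invented example: 90 correct hits, 10 false positives, 30 missed instances.
    print(precision_recall(90, 10, 30))
```

    In this invented example, 10% of the reported hits are noise, and a quarter of the real instances are missed.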

    The CartridgeDevDW package contains a batch test tool called “batchdocumentprocessor” to help with that. It’s a command line tool that takes a directory with a set of test documents as argument and applies the analytics from all locally deployed cartridges to all the documents in batch mode. It produces a CSV report as output that lists all hits. To be more precise, it lists all instances of UIMA types that it was configured to report on.

    Let’s try this with the provided sample:
    In the command shell, change to the SIQ_CART_DEV directory, then enter the following command:
    ./bin/batchdocumentprocessor_sample.sh

    You should see output like this:

    Output types are: [GDPRFocusedDataDiscoveryBasic.BankAccountNumber, 
    GDPRFocusedDataDiscoveryBasic.NationalIdNumber,
    GDPRFocusedDataDiscoveryBasic.PassportNumber,
    GDPRFocusedDataDiscoveryBasic.PhoneNumber,
    GDPRFocusedDataDiscoveryBasic.IPAddress,
    GDPRFocusedDataDiscoveryBasic.EmailAddress]
    Output will be written to: testoutput/batchdocumentprocessor.csv
    Processed file testdata/en/testdoc1.txt Language: en
    ...
    Output has been written to: testoutput/batchdocumentprocessor.csv
    Done.

    Now, open the file testoutput/batchdocumentprocessor.csv in a spreadsheet program like Excel. You should see something like this (with some formatting and filtering applied):

    [Screenshot: hit list report from batchdocumentprocessor.csv opened in a spreadsheet]

    It is a so-called “hit list” report with one hit per row. A hit means an occurrence of one of the configured target types in one of the input documents processed. For each hit, the report shows the hit “text” in context and what UIMA type the hit was. In addition, the report contains information about the file containing the hit and the offset in the file. Spreadsheet filtering lets you quickly select only specific types or files. You can quickly spot false positives by looking at the hit text in context.

    The tool runs recursively over a directory tree. The files need to be in plain text format. XML format works to some degree. PDF, Word, and similar formats do not work. The CartridgeDevDW package contains a directory “testdata” with a plain text sample document. For real tests, add more test data to this directory.

    You can get the full so-called “Enron Corpus” in plain text format (May 7, 2015 version, about 423 MB, tarred and gzipped) from here (the newer PST-based formats won’t work with the tool). This is a large collection of real-world enterprise emails which provides a good test set for unanticipated patterns of all kinds. Be aware, though, that this is a huge amount of data when unpacked. If you unpack the full corpus to $SIQ_CART_DEV/testdata/en, you should configure the tool to only process subdirectories within the larger corpus. Otherwise, you will wait a long time for the processing to finish.

    The batchdocumentprocessor_sample.sh script is just a convenience wrapper to call the real tool batchdocumentprocessor.sh with these sample arguments:

    batchdocumentprocessor.sh 
    ${SIQ_CART_DEV}/config/typenames.txt
    ${SIQ_CART_DEV}/testdata/en
    en
    ${SIQ_CART_DEV}/testoutput/batchdocumentprocessor.csv

    The full list of arguments for the command line tool is:
    batchdocumentprocessor.sh TYPE_LIST_FILE INPUT_DIR INPUT_LANGUAGE OUTPUT_CSVFILE

    You can use the batchdocumentprocessor.sh script on other input, for other languages, and other output.
    You can also modify or extend the type list file. Check the provided sample file ${SIQ_CART_DEV}/config/typenames.txt. It contains a list of UIMA types like GDPRFocusedDataDiscoveryBasic.BankAccountNumber, GDPRFocusedDataDiscoveryBasic.NationalIdNumber, and so on.

    Just remove any of the types you’re not interested in, and the output CSV file will no longer contain hits for that type. If you are developing a new cartridge, you can add new types. To test our sample cartridge that searches for customer IDs, for example, we need to add SampleCartridgeTypes.CustomerNumber (or just create a separate typenamessample.txt file that contains only this one type).
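
    Trimming the type list can also be scripted. The Python sketch below assumes that the type list file contains one UIMA type name per line, so check the provided typenames.txt file for the actual layout. The function name select_types is made up for illustration.

```python
def select_types(type_list_text, wanted):
    """Keep only the wanted type names from a type list.

    Assumption: the type list holds one UIMA type name per line.
    """
    wanted = set(wanted)
    kept = [
        line.strip()
        for line in type_list_text.splitlines()
        if line.strip() in wanted
    ]
    return "\n".join(kept) + "\n" if kept else ""


if __name__ == "__main__":
    all_types = (
        "GDPRFocusedDataDiscoveryBasic.BankAccountNumber\n"
        "GDPRFocusedDataDiscoveryBasic.NationalIdNumber\n"
        "SampleCartridgeTypes.CustomerNumber\n"
    )
    print(select_types(all_types, ["SampleCartridgeTypes.CustomerNumber"]))
```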

  9. Side note on language detection and annotator skipping

    Cartridges may have rules that are specific to one country or language and don’t work well on documents in other languages. As an advanced tuning option, you can configure a whole cartridge or just a single regex concept rule to be executed only on documents that are written in a given language.

    A key prerequisite for this configuration is that StoredIQ automatically detects the language of a document. This is disabled by default (all documents are processed as English). If you need to use language-specific rules, you first need to configure automatic language detection. The StoredIQ documentation has a section on how to do this. As a best practice, you should make sure that the languages listed in your cartridge.properties file as values of the languagesSupported attribute are also present in the file siq-findex.properties as values of the attribute index.presetLanguageIDs. For example, if your cartridge has languagesSupported=fr,se,en,es,no,it,de,da,nl,fi,pl then you should have something like index.presetLanguageIDs=en,fr,se,es,no,it,de,da,nl,fi,pl. Note that “en” was selected as the first value in the list for index.presetLanguageIDs as this will determine the default language.
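
    This consistency check between the two property values is easy to automate. The Python sketch below works on the two property lines as plain strings. It is an illustration only, and the function names parse_list and missing_languages are made up for this example.

```python
def parse_list(property_line):
    """Split a 'key=a,b,c' property line into its list of values."""
    _key, _separator, value = property_line.partition("=")
    return [item.strip() for item in value.split(",") if item.strip()]


def missing_languages(cartridge_line, index_line):
    """Return cartridge languages that are absent from the index preset list."""
    preset = set(parse_list(index_line))
    return [lang for lang in parse_list(cartridge_line) if lang not in preset]


if __name__ == "__main__":
    cartridge = "languagesSupported=fr,se,en,es,no,it,de,da,nl,fi,pl"
    index = "index.presetLanguageIDs=en,fr,se,es,no,it,de,da,nl,fi,pl"
    # An empty list means every cartridge language is also preset for indexing.
    print(missing_languages(cartridge, index))
```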

    As you can see, language detection was originally needed just for proper full-text indexing of mixed-language document sets. That is why StoredIQ introduced the configuration setting above quite a while ago. With cartridge analytics, language detection becomes more important because it can be the basis on which certain analytics are skipped. In the following example, the cartridge would not be called unless the language is English, German, or French.

    <languagesSupported>
    <language>en</language>
    <language>de</language>
    <language>fr</language>
    </languagesSupported>

    Because language detection is heuristic, it can guess the language wrong. This happens especially with short documents or documents with few words but lots of numbers, such as spreadsheets. On the one hand, unwanted skipping of annotators can be an issue: if you limit your cartridge for identifying French customer IDs to run only on French documents, you might miss the (admittedly rare) case where a French customer ID is mentioned in an English document. On the other (positive) hand, specifying a language can help with reducing false positives. Ultimately, <languagesSupported> and other language-skipping means like the Match Type Filter of the UIMA Regex Annotator are another way to balance precision and recall: miss out on as little as possible but don’t get drowned in noise.

    As a best practice, just specify x-unspecified in the UIMA cartridgedescriptor.xml file and don’t include a languagesSupported= line in the cartridge.properties file. This avoids all of the complexities with annotator skipping. Only introduce <languagesSupported> carefully as a last resort if you find bad results (typically false positives) in foreign language documents that you cannot get rid of in any other way.
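
    The skipping behavior described in this step can be summarized as a simplified model in Python. This is not StoredIQ or UIMA code. It only assumes that an annotator without a language restriction, or with x-unspecified, runs on every document, and that otherwise the detected language must be in the supported list.

```python
def should_run(detected_language, supported_languages):
    """Simplified model of language-based annotator skipping.

    Assumption: no restriction or 'x-unspecified' means "run on everything".
    """
    if not supported_languages or "x-unspecified" in supported_languages:
        return True
    return detected_language in supported_languages


if __name__ == "__main__":
    # A cartridge limited to French is skipped for a document detected as English.
    print(should_run("en", ["fr"]))
    print(should_run("en", ["x-unspecified"]))
```

    The first call models the missed-hit risk: a French customer ID in a document detected as English would never be seen by the restricted cartridge.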

    In CVD, you can specify the language to be used for the next processing in the “Run” menu. If x-unspecified is configured, in CVD automatic language detection is run just like it is in StoredIQ. If a specific language is specified, automatic language detection is omitted and the language set to the value from the UI. That is more reliable but not what really happens in StoredIQ. To see in CVD what language got detected, have a look in the lower left panel at the DocumentAnnotation which has a language attribute. It will show the language value used.

  10. Source code for the sample test tools

    The CartridgeDevDW package contains the source code for the batch regression tester and the batch document processor in the src directory. They are rudimentary tools provided for illustration purposes that you might want to extend. You can use the sample source for your purposes under an unwarranted AS IS license. See the LICENSE.TXT file for the license terms and conditions. The source will need all $SIQ_APP_ROOT/lib/uima*.jar files to compile. To run it you need more jar files. Check out the scripts at $SIQ_CART_DEV/bin for details.
