Set up the CartDevEnv
Download the file CartDevEnv.zip from the latest StoredIQ version on Fix Central. Unpack it to a newly created directory, say ./SiqDevEnv/, on a volume with more than 10 GB of free space. Then, as described in Chapter 4 of the developerWorks (dW) recipe on RegEx Cartridges, go to a StoredIQ data server, download the contents of its /usr/local/tomcat/webapps/storediq directory as a file siq_storediq.tgz, and extract that file into a newly created directory named webapps under the ./SiqDevEnv/ directory, that is, ./SiqDevEnv/webapps.
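The setup steps above can be sketched as a small Python helper. This is only an illustration under stated assumptions: the function name set_up_dev_env and its parameters are hypothetical, and the sketch assumes that siq_storediq.tgz contains a top-level storediq/ directory.

```python
import shutil
import tarfile
import zipfile
from pathlib import Path

MIN_FREE_BYTES = 10 * 1024**3  # the "more than 10 GB" guideline from the text


def set_up_dev_env(dev_zip, webapp_tgz, target, min_free=MIN_FREE_BYTES):
    """Unpack both archives into the directory layout described above."""
    target = Path(target)
    target.mkdir(parents=True, exist_ok=True)

    # Refuse to continue if the volume holding the target is too small.
    free = shutil.disk_usage(target).free
    if free <= min_free:
        raise RuntimeError(f"only {free} bytes free under {target}")

    # CartDevEnv.zip is unpacked directly into the target directory.
    with zipfile.ZipFile(dev_zip) as archive:
        archive.extractall(target)

    # The data server's storediq webapp is unpacked into <target>/webapps.
    webapps = target / "webapps"
    webapps.mkdir(exist_ok=True)
    with tarfile.open(webapp_tgz, "r:gz") as archive:
        if hasattr(tarfile, "data_filter"):
            # Safer extraction mode, available in newer Python versions.
            archive.extractall(webapps, filter="data")
        else:
            archive.extractall(webapps)
    return webapps
```

A call such as set_up_dev_env("CartDevEnv.zip", "siq_storediq.tgz", "./SiqDevEnv") would then produce ./SiqDevEnv/webapps/storediq, provided the archive layout matches the assumption above.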
Install a Cartridge into CartDevEnv
When you extract the CartDevEnv package as described, the directory ./SiqDevEnv/webapps/storediq/WEB-INF/cartridges/ will be created implicitly; we refer to it below as the ./cartridges/ directory. Now download one of the two version 2.4 cartridges from Fix Central (in the same way as the CartDevEnv.zip file), say GDPRFocusedDataDiscoveryAdvanced.car, save it into the ./cartridges/ directory, and create a new directory ./GDPRFocusedDataDiscoveryAdvanced next to it. We will refer to this new directory as the cartridge directory.
Navigate to the ./cartridges/ directory and unpack the archive file into the cartridge directory by running the following command:
unzip ./GDPRFocusedDataDiscoveryAdvanced.car -d ./GDPRFocusedDataDiscoveryAdvanced/
Now, edit the file ./SiqDevEnv/webapps/storediq/WEB-INF/uima/config/specifiers/search_document_cartridges.xml and make sure it contains the stanza:
Note that the UIMA cartridge descriptor file cartridgedescriptor.xml is referred to not through a path setting in the development environment, but through a path relative to the location of the search_document_cartridges.xml file. Also note that cartridges in the ./cartridges/ directory that are not referred to in the search_document_cartridges.xml file are still loaded by the current UIMA setup, so they must contain valid code as well.
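Because unreferenced cartridges are loaded too, it can help to list them before a run. The following sketch is one possible check, not part of the CartDevEnv: the function name is hypothetical, it assumes that cartridgedescriptor.xml sits at the top level of each unpacked cartridge directory, and it uses a plain substring test so that it makes no assumption about the XML structure of the specifier file.

```python
from pathlib import Path


def unreferenced_cartridges(cartridges_dir, specifier_file):
    """Return names of unpacked cartridges not mentioned in the specifier file."""
    specifier_text = Path(specifier_file).read_text(encoding="utf-8", errors="replace")
    missing = []
    for entry in sorted(Path(cartridges_dir).iterdir()):
        # An unpacked cartridge is taken to be a directory that holds
        # a cartridgedescriptor.xml file (assumption, see lead-in).
        if not (entry / "cartridgedescriptor.xml").is_file():
            continue
        # Substring test only: the directory name must occur somewhere
        # in the specifier file to count as referenced.
        if entry.name not in specifier_text:
            missing.append(entry.name)
    return missing
```

Any directory name this returns belongs to a cartridge that will be loaded although search_document_cartridges.xml does not mention it.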
Change the Cartridge Configuration
The file cartridgedescriptor.xml defines the UIMA configuration of a StoredIQ cartridge, in this case the SystemT-based Advanced Cartridge. SystemT provides extractor functionality through so-called AQL modules, which are written in a text analysis language named AQL. Each module can contain rule files and dictionaries. For detailed information on AQL, please see the Knowledge Center on Text Analytics. In this section, we will explain how to configure the right set of AQL modules for a given extraction task.
First, take a look at the XML element named ModulesToLoad in the cartridgedescriptor.xml file, as shown below.
This element lists all AQL modules that are to be executed during a cartridge run. Each listed module therefore adds to the cartridge's run time, and this cost grows linearly with the amount of text being processed. The executable for a text analysis module is a so-called TAM file, which can be found under the path indicated in the CompileOutputURI element. TAM files are generated from the module through a compilation step (see section 5). The Advanced Cartridge as downloaded contains all TAM files needed, including those of the Basic Cartridge.
Therefore, by uncommenting the following modules in the Advanced Cartridge’s ModulesToLoad element, you can extend it with all of the Basic Cartridge’s functionality. Running all extractor functionality in a single cartridge brings a small performance benefit that grows linearly with the amount of text processed.
Editing AQL modules
In the directory ./GDPRFocusedDataDiscoveryAdvanced/aql, there is a list of AQL modules, each containing rule source files with the extension .aql, dictionary files with the extension .dict, or both. For information about working with this code, please see the Knowledge Center on Text Analytics. Following are some examples of possible enhancements to the AQL modules:
- Add an extractor for customer numbers by copying and modifying one of the AQL files in the numbers module.
- Add an extractor for postal addresses of a country other than UK, France, and Britain by copying and modifying one of the postal address modules. (Note that postal addresses for the US and Germany are already covered in the Advanced Cartridge, but cannot be modified.)
- Add local phone numbers (without a country prefix such as +43 or 0043) by adding a regex extraction view to the respective national AQL file in InternationalNumbers.
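Before writing a regex extraction view, it can be convenient to prototype the pattern outside of AQL. The sketch below does this in Python for the local phone number case. The pattern itself is a hypothetical example (a single leading trunk digit 0, an area code, and a subscriber number); it is not taken from the cartridge, and it may need adjusting when ported, since the regex flavor used in AQL may differ from Python's.

```python
import re

# Hypothetical pattern for local numbers such as "0664 1234567":
# - (?<![\d+]) : not preceded by a digit or "+", so "+43..." and "0043..." are skipped
# - 0[1-9]     : a single trunk prefix 0 followed by a non-zero digit
# - \d{0,4}    : the rest of the area code
# - [ /-]?     : an optional separator
# - \d{3,8}    : the subscriber number, not followed by another digit
LOCAL_PHONE = re.compile(r"(?<![\d+])0[1-9]\d{0,4}[ /-]?\d{3,8}(?!\d)")


def find_local_numbers(text):
    """Return all substrings of text that match the prototype pattern."""
    return LOCAL_PHONE.findall(text)
```

Running find_local_numbers over a handful of real test documents gives a quick impression of false hits before the pattern is moved into an AQL file.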
In your AQL code, you can output a text analysis view ViewName through adding the following statement to your AQL file:
output view ViewName as 'com.ibm.systemT.ViewName';
In this example, com.ibm.systemT.ViewName is an alias. The StoredIQ cartridges require these aliases (as well as any output views) to be registered UIMA types. A view name or alias is registered by adding it to the hierarchy of the UIMA type definition files namedentity_typesystem.xml (for string-type views) and numbers_typesystem.xml (for numeric views) in the cartridge directory. For example, if you add an AQL view or alias GDPRFocusedDataDiscoveryBasic.CustomerNumber, you should also add the following UIMA type definition element to the numbers_typesystem.xml file:
<description>Customer number of company xyz</description>
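A quick consistency check between AQL output statements and the type system files can catch missing registrations before the compiler does. The following is a minimal sketch under stated assumptions: all function names are hypothetical, the AQL parsing is a simple regex that only strips "--" line comments, and the type system files are assumed to follow the standard UIMA descriptor layout in which each typeDescription element has a name child.

```python
import re
import xml.etree.ElementTree as ET

# Matches: output view ViewName;  or  output view ViewName as 'alias';
OUTPUT_VIEW = re.compile(r"output\s+view\s+([\w.]+)(?:\s+as\s+'([^']+)')?\s*;")


def output_names(aql_text):
    """Names exposed by 'output view' statements: the alias if given, else the view."""
    # Drop '--' line comments so commented-out statements are ignored.
    code = re.sub(r"--[^\n]*", "", aql_text)
    return {alias or view for view, alias in OUTPUT_VIEW.findall(code)}


def local(tag):
    """Strip an XML namespace prefix of the form {uri} from a tag name."""
    return tag.rsplit("}", 1)[-1]


def declared_types(typesystem_xml):
    """Type names declared in one UIMA type system descriptor."""
    names = set()
    for element in ET.fromstring(typesystem_xml).iter():
        if local(element.tag) != "typeDescription":
            continue
        # Only direct children count; feature names are nested deeper.
        for child in element:
            if local(child.tag) == "name" and child.text:
                names.add(child.text.strip())
    return names


def unregistered(aql_text, *typesystem_xmls):
    """Output names that no given type system descriptor declares."""
    declared = set()
    for xml_text in typesystem_xmls:
        declared |= declared_types(xml_text)
    return output_names(aql_text) - declared
```

Passing the text of an .aql file together with the contents of namedentity_typesystem.xml and numbers_typesystem.xml returns the names that still need a type definition.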
As soon as you have changed the configuration or AQL code of your cartridge, you can use the cartridge compiler to check the correctness of your code, including UIMA typing, and to generate TAM files. This compiler is run from the ./SiqDevEnv directory:
To define the scope of the cartridge compiler, go to the ModulesToCompile element within the cartridgedescriptor.xml file. The cartridge compiler will process all modules listed in this element for which no TAM file is found:
Note that the compiler will write a TAM file to the CompileOutputURI directory only if the module appears in the ModulesToLoad element described above. All modules processed through the cartridge compiler are checked for AQL correctness and for compliance with the UIMA type system specified. If errors occur, problems found in the UIMA AQL wrapper code can be easily distinguished from problems detected by the AQL compiler. A missing UIMA type definition, for example, will result in a runtime exception like the following:
Caused by: java.lang.RuntimeException: No UIMA type information for output com.ibm.systemT.GenderIndicator
at com.ibm.avatar.algebra.util.uima.UIMAWrapper.typeSystemInit(UIMAWrapper.java:597)
The following is an example of a compiler error in the AQL code, which is reported with the AQL file name and the line and column number:
Caused by: com.ibm.avatar.api.exceptions.CompilerException: Compiling AQL encountered 1 errors:
In /home/gsr/Documents/Products/StoredIQ/Cartridges/DENew/webapps/storediq/WEB-INF/cartridges/GDPRFocusedDataDiscoveryAdvanced/aql/BankAccountNumbers/BankAccountConsolidation.aql: At line 12, column 55 Encountered <EOF>. Was expecting a: “String constant”
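When the compiler output is long, the two kinds of problems can be separated mechanically. The sketch below does this with two regular expressions modeled only on the sample messages shown above; the function name classify is hypothetical, other message formats are not covered, and AQL file paths that contain spaces will not be matched.

```python
import re

# Modeled on: "No UIMA type information for output <type name>"
UIMA_TYPE_ERROR = re.compile(r"No UIMA type information for output (\S+)")
# Modeled on: "In <file>.aql: At line <n>, column <m>"
AQL_ERROR = re.compile(r"In (\S+\.aql): At line (\d+), column (\d+)")


def classify(log_text):
    """Split compiler output into missing UIMA types and AQL error locations."""
    missing_types = UIMA_TYPE_ERROR.findall(log_text)
    aql_errors = [
        (path, int(line), int(column))
        for path, line, column in AQL_ERROR.findall(log_text)
    ]
    return missing_types, aql_errors
```

The first list names the types to add to the type system files; the second gives the file, line, and column of each AQL error.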
Since the compilation step adds to the overall execution time (how much depends on the number and complexity of the AQL modules), make sure that every module for which a valid TAM file exists is removed from, or commented out in, the ModulesToCompile list. This is even more important when you build a new cartridge (see section 7) for deployment into StoredIQ.
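To decide which modules can be dropped from ModulesToCompile, it helps to see which modules already have a TAM file. The following sketch lists both groups. It rests on assumptions that are not confirmed by the text: that each subdirectory of the aql directory is one module, that the compiler names its output <module>.tam, and that you pass in the directory the CompileOutputURI element points to. The function name is hypothetical.

```python
from pathlib import Path


def split_by_tam(aql_dir, tam_dir):
    """Return (modules that have a TAM file, modules that do not)."""
    compiled, pending = [], []
    modules = sorted(p.name for p in Path(aql_dir).iterdir() if p.is_dir())
    for module in modules:
        # Assumption: the compiled module is stored as <module>.tam.
        if (Path(tam_dir) / f"{module}.tam").is_file():
            compiled.append(module)
        else:
            pending.append(module)
    return compiled, pending
```

Note that the presence of a TAM file does not prove it is up to date; a module whose AQL sources were edited after its TAM file was written still needs to be recompiled.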
Testing Cartridge Changes
Good cartridges require careful testing. Whenever possible, take the test data from the same data collection that the final cartridge is to be applied to. Test data should cover the following parameters: language, ASCII formatting, context of hits, all UIMA types, and all AQL views to be tested. As to formatting, remember that the CartDevEnv performs no binary deformatting, for example from a .ppt document to its ASCII text. In some cases, it is unavoidable to add test output statements to the AQL code; these must be accompanied by corresponding test-only UIMA type definitions.
The following command line tools are available in the CartDevEnv’s bin directory for different testing needs:
- batchdocumentprocessor_sample.sh – for testing large document sets
- batchregressiontester_sample.sh – for regression testing with tabular input
- cvd.sh – for interactively testing small numbers of documents
- batchresultanalysis.sh – for comparing two .csv output files of batchdocumentprocessor_sample.sh runs
The first three tools are described in chapters 6, 7, and 8 of the dW recipe on RegEx Cartridges. The tool batchresultanalysis.sh is called from the ./SiqDevEnv directory with three parameters, as follows:
- two .csv files as produced by running the batchdocumentprocessor_sample.sh tool – an “old” one, and a “new” one, and
- a directory to which the results of the comparison will be written.
Here is an example of a call for the results directory ./SiqDevEnv/testAC:
bin/batchresultanalysis.sh ./testAC/bdpOld.csv ./testAC/bdpNew.csv ./testAC/
Note that the batchresultanalysis.sh tool can run for a while and will overwrite the result files produced by a previous run without warning. The output file, in the example above ./SiqDevEnv/testAC/resulttable.csv, will contain descriptive statistics for the two .csv files and for the non-empty difference sets Old-New and New-Old, both on the type level and on the document level.
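Since earlier results are overwritten without warning, one option is to copy the previous result table aside before each run. The sketch below does this for resulttable.csv only, the one output file named above; any other result files would need the same treatment. The function name and the timestamped naming scheme are hypothetical choices.

```python
import shutil
import time
from pathlib import Path


def keep_previous_result(results_dir, name="resulttable.csv"):
    """Copy an existing result table to a timestamped file; return the copy's path."""
    current = Path(results_dir) / name
    if not current.is_file():
        return None  # nothing to preserve yet
    # Use the file's modification time so the copy is labeled with the run it came from.
    stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(current.stat().st_mtime))
    backup = current.with_name(f"{current.stem}-{stamp}{current.suffix}")
    shutil.copy2(current, backup)
    return backup
```

Calling keep_previous_result("./testAC") just before bin/batchresultanalysis.sh leaves the new run free to write resulttable.csv while the old table survives under its timestamped name.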
Build a New Cartridge
As we have seen above, the .car cartridge file is in fact a .zip file. If you want to run your cartridge as a new cartridge version within StoredIQ (see the StoredIQ Knowledge Center on Cartridges), please increment the version property in the cartridge.properties file. Then, the cartridge’s .car file can be created through executing the following command from the cartridge directory:
zip -r ../GDPRFocusedDataDiscoveryAdvanced.car ./*
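This command will replace the contents of a previous .car file version in the CartDevEnv. Be aware that zip updates an existing archive in place: entries for files that you have since deleted from the cartridge directory remain in the archive, so delete the old .car file first if you need a clean build. The .car file can then be used for a cartridge upload or upgrade to StoredIQ, as described in the StoredIQ Knowledge Center. Since any customer-specific modifications of the cartridge will be overwritten when the cartridge is updated, it is best practice to rename cartridges with such modifications.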
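The packaging step can also be scripted, which always produces a freshly written archive. The following is a minimal sketch using Python's zipfile module; the function name build_car is hypothetical. Like the shell pattern ./*, it skips hidden entries at the top level of the cartridge directory. Unlike zip -r, it stores file entries only and no separate directory entries, which is assumed here to be acceptable for a .car file.

```python
import zipfile
from pathlib import Path


def build_car(cartridge_dir, car_path):
    """Write all files below cartridge_dir into a new .car (zip) archive."""
    cartridge_dir = Path(cartridge_dir)
    car_path = Path(car_path)
    # Mode "w" creates the archive from scratch, so no stale entries survive.
    with zipfile.ZipFile(car_path, "w", zipfile.ZIP_DEFLATED) as car:
        for path in sorted(cartridge_dir.rglob("*")):
            relative = path.relative_to(cartridge_dir)
            # Like the shell glob ./*, skip hidden top-level entries.
            if relative.parts[0].startswith("."):
                continue
            if path.is_file():
                car.write(path, relative.as_posix())
    return car_path
```

Run from the ./cartridges/ directory, build_car("GDPRFocusedDataDiscoveryAdvanced", "GDPRFocusedDataDiscoveryAdvanced.car") writes the archive next to the cartridge directory; keep the target outside the cartridge directory so the archive does not include itself.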