Overview

Skill Level: Intermediate

Familiarity with Eclipse is helpful

Learn how to set up and use an Eclipse environment to develop System-T-based StoredIQ cartridges that include user-defined functions (UDFs), so that you can build your own information extraction functionality.

Ingredients

IBM StoredIQ enables powerful information extraction of highly meaningful text elements such as person, organization or product names, dates, or telephone numbers. For this purpose, it uses so-called cartridges, that is, plug-ins into the StoredIQ servers which provision the extracted terms together with their type semantics for StoredIQ use cases such as indexing, searching, machine learning, or classification. 


This recipe explains how to create new cartridges of this kind in order to satisfy individual customer needs. You will learn how to develop customized cartridges using the AQL text analysis language, and how to compile and build them within the Eclipse programming environment. The recipe draws on the following online resources:


Recipes on modification of StoredIQ cartridges and on regex cartridges:

https://developer.ibm.com/recipes/tutorials/how-to-modify-systemtbased-cartridges-for-storediq/
https://developer.ibm.com/recipes/tutorials/developing-regular-expression-cartridges-for-storediq/

An overview of UIMA: https://uima.apache.org/index.html

Eclipse: https://www.eclipse.org/

AQL reference: https://www.ibm.com/support/knowledgecenter/en/SSPT3X_4.2.5/com.ibm.swg.im.infosphere.biginsights.aqlref.doc/doc/aql-overview.html

 

This recipe begins with setting up the Eclipse development environment and importing the Eclipse projects and StoredIQ runtime code as needed. Then you will learn how to develop a simple cartridge containing AQL rules, a dictionary, and user-defined functions, all covered through the UIMA annotator definitions. Finally, the recipe shows the compile and test cycle, and how a completed cartridge can be built and supplied to StoredIQ.

Step-by-step

  1. Get the Eclipse development environment

    For this and the following steps, we assume you are working with a Unix-based system, for example, Linux or macOS. However, Eclipse for Windows has a very similar look and feel.

    First, download and install the latest Eclipse package for your operating system from the Eclipse site or, if you already have an Eclipse installation in place, create a new workspace (File → Switch Workspace → Other). Then download the following three project .zip files from this developerWorks site and extract them to a temporary directory:

    CartridgeDev.zip – The StoredIQ tools for cartridge development
    SampleCartridge.zip – The sample cartridge to work with
    siq-war-storediq.zip – The cartridge runtime project

    Then, as described in the dW recipe on regular expression-based cartridges, log in as root to a StoredIQ data server and download the contents of its /usr/local/tomcat/webapps/storediq directory as a file siq_webapp.tgz.

    Extract the three project .zip files into a temporary directory, say ./tmp/, which will give you the subdirectories ./tmp/CartridgeDev, ./tmp/SampleCartridge, and ./tmp/siq-war-storediq. Then create another subdirectory, for example ./tmp/project-content, and extract the downloaded .tgz file within this directory using the command tar -xvzf siq_webapp.tgz. This will yield a directory ./tmp/project-content/storediq.

  2. Create Eclipse projects

    Import the three project directories in ./tmp/ as projects into Eclipse: for each directory, go to File → Import → General → Existing Projects into Workspace, select it, and import it. Make sure that Copy projects into workspace is checked and Search for nested projects is not checked for all three projects. If you see an error icon for the SampleCartridge project at this stage, you can ignore it.

    After importing the projects, select the siq-war-storediq/war folder in the Project Explorer and click File → Import → File system to import the contents of our example directory ./tmp/project-content/storediq into the project siq-war-storediq. Select the example directory ./tmp/project-content as the From directory; do not use storediq itself as the From directory at this point. Make sure that the directory storediq in the left selection panel is checked, but that our example directory project-content is not (a small minus sign appears in front of it). Also make sure that the two options Overwrite existing resources without warning and Create top-level folder are not checked, as shown below. Then click Finish and answer No to All when asked whether anything should be overwritten. If the error icon in SampleCartridge does not disappear after this import, you have to add the SystemT runtime library to your build path. The name of this library depends on your SIQ version and always starts with system-t-runtime. You will find it in the directory war/storediq/WEB-INF/lib.

    [Screenshot: settings of the File system import dialog, 2019-04-15]
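    To find out what the runtime library is called in your copy of the web application, a small helper can list the matching entries of the lib directory. The following is a minimal sketch only: the class RuntimeJarFinder and its methods are hypothetical and not part of the StoredIQ tooling, and the default path assumes that the program is started from the Eclipse workspace root.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper, not part of the StoredIQ tooling. */
final class RuntimeJarFinder {

    /** Keeps only the file names that start with the prefix named in the recipe. */
    static List<String> runtimeLibraries(List<String> fileNames) {
        return fileNames.stream()
                .filter(name -> name.startsWith("system-t-runtime"))
                .sorted()
                .collect(Collectors.toList());
    }

    /** Returns the names of the entries directly inside the given directory. */
    static List<String> listFileNames(Path libDir) {
        try (Stream<Path> entries = Files.list(libDir)) {
            return entries.map(p -> p.getFileName().toString())
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption: run from the workspace root unless a path is passed in.
        Path libDir = Paths.get(args.length > 0
                ? args[0]
                : "siq-war-storediq/war/storediq/WEB-INF/lib");
        runtimeLibraries(listFileNames(libDir)).forEach(System.out::println);
    }
}
```

    The name printed by this helper is the entry to add under Java Build Path in the properties of the SampleCartridge project.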

    In the now-completed siq-war-storediq project, there may be a cartridges subdirectory in siq-war-storediq/war/storediq/WEB-INF, possibly with cartridges inside. If so, delete this directory and its content. If you'd like to work with one of the cartridges, move its content into the SampleCartridge project and rename it, as described in the step on renaming your cartridge.

    As a last configuration step, open the Eclipse Preferences menu, go to Run/Debug → String Substitution and add the two substitution variables in the table below. Then select each of the three projects and refresh it through File → Refresh.

    Variable name    Variable value
    SIQ_APP_ROOT     ../siq-war-storediq/war/storediq/WEB-INF
    SIQ_CART_DEV     package

     

     

  3. Develop and compile AQL/UIMA code

    In the SampleCartridge project, you will find 4 AQL modules in the folder textAnalytics/src:

    • DictBasedModule
      Provides a simple dictionary list for French first names, and the dictionary code to access it
    • NamesModule
      The AQL rules for French person proper names
    • NumberModule
      AQL regex rules for recognizing French passport numbers
    • udfDefinitions
      A definition file for user defined functions 

    You can use the AQL reference to modify this code, for example to add rules or change the regex definitions.

    In the directory SampleCartridge/cartridgefiles, you will see the three files cartridge.properties, cartridgedescriptor.xml and sample_typesystem.xml which define the UIMA configuration of the cartridge, as explained in the dW recipe on cartridge modification. Note here that each view or alias which is output in one of the AQL files, e.g. 

    output view FrenchNames as 'SampleCartridge.FrenchPersonName';

    has a corresponding UIMA type entry in sample_typesystem.xml, in this case 

    <name>SampleCartridge.FrenchPersonName</name>
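    Because a missing type entry only shows up as a compiler exception, it can help to cross-check the two files before compiling. The following is a minimal sketch under stated assumptions: the class OutputAliasCheck is hypothetical and not part of the cartridge tools, it only recognizes the aliased form output view ... as '...'; and it does not skip AQL comments, so a commented-out output statement is reported as well.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical consistency check, not part of the cartridge tools. */
final class OutputAliasCheck {

    // Matches: output view <ViewName> as '<Alias>';
    private static final Pattern OUTPUT_ALIAS =
            Pattern.compile("output\\s+view\\s+[\\w.]+\\s+as\\s+'([^']+)'\\s*;");

    // Matches: <name>Some.Type.Name</name>
    private static final Pattern TYPE_NAME =
            Pattern.compile("<name>\\s*([^<\\s]+)\\s*</name>");

    /** Collects the first capture group of every match, in order of appearance. */
    private static Set<String> firstGroups(Pattern pattern, String text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    /** Returns the output aliases that have no <name> entry in the type system. */
    static Set<String> missingTypes(String aqlSource, String typeSystemXml) {
        Set<String> missing = firstGroups(OUTPUT_ALIAS, aqlSource);
        missing.removeAll(firstGroups(TYPE_NAME, typeSystemXml));
        return missing;
    }

    public static void main(String[] args) {
        String aql = "output view FrenchNames as 'SampleCartridge.FrenchPersonName';";
        String xml = "<name>SampleCartridge.FrenchPersonName</name>";
        System.out.println("Missing UIMA types: " + missingTypes(aql, xml));
    }
}
```

    In practice, the two strings would be read from the AQL files under textAnalytics/src and from cartridgefiles/sample_typesystem.xml; an empty result means that every aliased output has a matching type entry.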

    After you finish editing, don't forget to refresh the modified directories. Then, to translate your changes, including dictionary updates, rule and UIMA configuration changes, and UDF programming (see below), right-click on CartCompiler.launch and select Run As → 1 CartCompiler. Errors in your cartridge setup cause compiler runtime exceptions followed by the statement Something went wrong. For example, if you comment out the definition of SampleCartridge.FrenchPersonName above and run the cartridge compiler, you get an error message like

    Using UIMA descriptor at: ../siq-war-storediq/war/storediq/WEB-INF/uima/config/specifiers/search_document_content_aggregates.xml
    java.lang.NullPointerException
            at java.util.TreeMap.getEntry(TreeMap.java:347)
            at java.util.TreeMap.get(TreeMap.java:278) ….. 

    Similarly, if you try to output a non-existing view FrenchPersonNames2, you get the error message

    SEVERE: Exception occurred org.apache.uima.analysis_engine.AnalysisEngineProcessException
    at com.ibm.avatar.algebra.util.uima.UIMAWrapper.typeSystemInit (UIMAWrapper.java:604)

     

     

  4. Cartridge compiler details

    The CartridgeCompiler.launch is a so-called launch group, which executes a sequence of launch operations. When doing in-depth work with the development environment, it is helpful to understand the following operations in detail:

    • prepare.launch
      Compiles Java UDFs, copies your cartridge with all its content, including your modifications, into a runtime environment, and prepares it for compilation.
    • compile.launch
      Performs the compilation on both the UIMA and the AQL level. This launcher throws an exception when you try to run it without a preceding prepare.launch.
    • reset.launch
      Copies the results of compilation back into your sample cartridge, and resets the compilation environment. 

    All of these can be run by right-clicking the launcher name and selecting the Run option. The prepare.launch is needed after you have made code changes, in order to make them available for compilation. The reset.launch re-synchronizes the compilation results, such as the compiled tam files, with your cartridge project, and makes the environment ready for the next prepare-compile-reset cycle. Note that after a successful reset.launch, you can also run the cartridge compiler with another cartridge, or with the same cartridge under a new name (see below).

  5. Rename your cartridge

    When you upgrade a cartridge with the same name but a different version, StoredIQ will overwrite the entire program content of the former cartridge version. In particular, you should consider GDPRFocusedDataDiscoveryBasic and GDPRFocusedDataDiscoveryAdvanced as reserved names not to be used for any custom cartridge content. To change the name of your cartridge, insert it in the following places instead of the name SampleCartridge: 

    • the project name of the cartridge. This should be done through a refactoring of the project name so that Eclipse's .project file is also changed.
    • the file cartridge.properties, both for the name itself and for the property mappings
    • the cartridgedescriptor.xml file, where it occurs as a project name and in module names
    • launch files or Ant XML scripts, which sometimes contain the cartridge name. Use Eclipse's Navigator view to see such files if they are hidden.
    • the naming of UIMA types in sample_typesystem.xml, where the cartridge name might serve as a name space qualifier

    After having performed these changes, compile the cartridge to update the cartridge compile environment and to check for consistency.
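    Since the old name can hide in several files, a quick scan of the project directory shows where it still occurs. The following is a minimal sketch only: the class CartridgeNameScan is hypothetical and not part of the cartridge tools, and the default arguments (the current directory and the name SampleCartridge) are assumptions for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper that reports files still mentioning a cartridge name. */
final class CartridgeNameScan {

    /** Counts non-overlapping occurrences of name in text. */
    static int countOccurrences(String text, String name) {
        if (name.isEmpty()) {
            throw new IllegalArgumentException("name must not be empty");
        }
        int count = 0;
        for (int i = text.indexOf(name); i >= 0; i = text.indexOf(name, i + name.length())) {
            count++;
        }
        return count;
    }

    /** Maps each file under projectRoot that mentions name to its hit count. */
    static Map<String, Integer> scan(Path projectRoot, String name) {
        Map<String, Integer> hits = new TreeMap<>();
        try (Stream<Path> paths = Files.walk(projectRoot)) {
            List<Path> files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
            for (Path file : files) {
                // ISO-8859-1 decodes any byte sequence, so binary files do not fail.
                String text = new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1);
                int n = countOccurrences(text, name);
                if (n > 0) {
                    hits.put(projectRoot.relativize(file).toString(), n);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }

    public static void main(String[] args) {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        String name = args.length > 1 ? args[1] : "SampleCartridge";
        scan(root, name).forEach((file, n) -> System.out.println(n + "  " + file));
    }
}
```

    Files.walk also visits hidden files such as .project and the launch files, so the scan covers the places that the Project Explorer may not show. It checks file contents only; the project name itself still has to be changed through Eclipse's refactoring.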

  6. Developing user defined functions (UDFs)

    In the folder SampleCartridge/src, you will find the package com.sample.aqludfs, which contains the sample class MyUDFs.java. We will now show how to add and use another UDF.

    First, add the following method definition to the class MyUDFs: 

    public String MyLowerCase(Span s) {
        return s.getText().toLowerCase();
    }

    Note that this class refers to the Java type Span imported from com.ibm.avatar.algebra.datamodel. Now go to the file textAnalytics/src/udfDefinitions/udfdef.aql and add this function definition: 

    create function MyLowerCase(s Span)
        return String
        external_name './SampleCartridgeUDFs.jar:com.sample.aqludfs.MyUDFs!MyLowerCase'
        language java
        deterministic
        return null on null input;

    Because we want to use this function definition in another AQL module, we have to add the statement

    export function MyLowerCase;

    Now edit the file textAnalytics/src/NamesModule/FrenchNames.aql and replace the select entry with this one:

    select CombineSpans(R1.match, R2.match) as match, MyLowerCase(R2.match) as FNcanonical

    This causes the lowercase rendering of the AQL span for the family name to be returned as a string in a column named FNcanonical. Now compile your changes and make sure no errors are reported.
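    One detail worth knowing when you rely on such a canonical form: String.toLowerCase() without arguments uses the default locale of the JVM, so the result can differ between machines. The following is a minimal sketch of a locale-independent variant. The class LowerCaseSketch is hypothetical and works on the plain text of a span, so that it compiles without the StoredIQ runtime on the class path.

```java
import java.util.Locale;

/** Hypothetical helper class; not part of the sample cartridge. */
final class LowerCaseSketch {

    /** Lower-cases the text covered by a span, independent of the default locale. */
    static String canonical(String spanText) {
        if (spanText == null) {
            return null; // mirrors "return null on null input" in the AQL definition
        }
        return spanText.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(canonical("DUPONT")); // prints "dupont"
    }
}
```

    Inside MyUDFs, the body of MyLowerCase could then be written as return LowerCaseSketch.canonical(s.getText()); if you decide to adopt this variant.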

  7. Testing StoredIQ cartridges

    To test your SampleCartridge modifications, first make sure you have an appropriate set of test documents. The directory SampleCartridge/TestDocuments contains a simple test document for French person names. 
    After you have compiled your cartridge successfully (including the final reset.launch, of course), first right-click prepare.launch and run it once so that your test environment is prepared. Then right-click CartridgeDev/BatchDocumentProcessorSample.launch and run it to apply your cartridge. The output of this test run can be found in the .csv file package/cpetestoutput.csv. Before re-entering the compile cycle, make sure you reset the test environment by running reset.launch.

    Further details on the cartridge test tools can be found in the dW recipes on cartridge modification and on regex cartridges.

  8. Building StoredIQ cartridges

    After you have finished the compile-and-test cycle of your AQL program, you may want to build a cartridge that can be uploaded to StoredIQ. To do so, refresh your cartridge directory, right-click the SampleCartridge/build.xml script file, and select 2 Ant build …, making sure that all Ant targets of this file are activated, i.e. have a check mark.

    Running build.xml with all Ant targets activated will create a file named SampleCartridge.car (or whatever the name of your cartridge is) in the directory SampleCartridge/car. This is the StoredIQ cartridge, which can be uploaded immediately to StoredIQ, as described in the recipe on cartridge modification. You will also find a ./tmp sub-directory which contains exactly the contents of the SampleCartridge.car archive file.
