An SPL application that runs in a non-cloud, on-premises installation of Streams might require modification before it can run in the Streaming Analytics service in IBM Cloud. This article describes SPL application constructs and patterns that may not work or may function differently in the cloud, and provides advice on how to make SPL applications that use these techniques “cloud-ready.”


 

Passing configuration information to an SPL operator as a file

Issue

Several SPL operators have parameters that expect a file reference. One example is the RScript operator, which uses a parameter to identify the file that contains an R script, for example param rScriptFileName : "myScript";. Users of the Streaming Analytics service in IBM Cloud do not have direct access to the file system of the hosts in the cloud, so a specific technique must be used to supply operator configuration files to the cloud.

Cloud-ready technique

If you use SPL operators that require configuration files, there are a couple of alternatives for updating your application to make it cloud-ready:

  1. The most straightforward technique is to include the configuration files in the streams application bundle (SAB). The preferred location for most configuration files is the etc directory of the application; the etc directory is automatically included in the application bundle. The file reference in the SPL application should then locate the files by dynamically obtaining the bundle location, for example rScriptFileName : getThisToolkitDir() + "/etc/process.r"; (see the sketch after this list). (Alternatively, you may use a different directory in the .sab structure for your operator configuration files by specifying a <sabFiles> element in the application’s info.xml file.)
  2. If your configuration files are very large, including them in the streams application bundle (SAB) will significantly increase the size of your SAB.  Large SABs take longer to upload on job submission, and some SABs could even become too large to submit.  Instead of including the large files in your SAB, you can store them using the Object Storage service in IBM Cloud, and then access them from your Streams application.  See the Streams Object Storage Toolkit for more information about using Object Storage in a Streams application.
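
As a minimal sketch of technique 1, the following operator builds the path to a bundled configuration file at run time with getThisToolkitDir(). The file name lookup.csv is hypothetical; substitute whatever configuration file your operator needs.

      // Hypothetical sketch: read a configuration file packaged in the application's etc/ directory.
      // The etc/ directory is included in the .sab automatically, and getThisToolkitDir() resolves
      // to the location where the bundle contents are extracted at run time.
      stream<rstring line> ConfigLines = FileSource()
      {
          param
              file   : getThisToolkitDir() + "/etc/lookup.csv";
              format : line;
      }

The same pattern applies to any operator parameter that expects a file path, such as the rScriptFileName parameter shown above.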

 


 

Use of FileSource and DirectoryScan to stream new data into an SPL application

Issue

A common pattern in SPL applications is to get data into the application via a FileSource. The FileSource operator will function correctly in the cloud, but the ability to get new data into the application via a FileSource is limited because you do not have direct access to the file system on the Streams application nodes.

The files accessed by a FileSource are in the local file system of a Streams host in the cloud. An SPL application running in the cloud can access the local file system, but there is no way for an entity outside of Streams to populate a file with new data or put a new file into a directory that is being monitored by a DirectoryScan.

The FileSource model will work correctly to get one-time input, configuration, initialization or other data into an SPL application, if the file is included in the streams application bundle.

Cloud-ready technique

If your SPL application uses FileSource as the means for obtaining new data, there are a few alternatives for modifying your application to make it cloud-ready:

  1. Store your data files using the Object Storage service in IBM Cloud and access them from your Streams application.  This enables your solution to dynamically create new files or update existing files in Cloud Object Storage that your Streams application can access.  See the Streams Object Storage Toolkit for more information about using Object Storage in a Streams application.
  2. Serve your files through a web application server, and access them using the InetSource operator instead of FileSource (see the sketch after this list).
  3. Use an IBM Cloud application to access the file-based data, and stream the data from your IBM Cloud application into your SPL application via another type of adapter.
  4. Move to a non-file-based approach for input (e.g. MessageHub, Kafka, MQTT, MQ, HTTP, etc.)
  5. Create a new SPL application that includes the new file in the streams application bundle, then use a FileSource to read the file from the bundle and a FileSink to write it out to /tmp in the file system on the same Streams application node (see the example in "Use of FileSink to write output data" below).
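
The following is a minimal sketch of alternative 2, assuming the com.ibm.streamsx.inet toolkit's InetSource operator and its URIList and fetchInterval parameters; the URL is hypothetical, and parameter details may vary with the toolkit version you use.

      use com.ibm.streamsx.inet::InetSource;

      // Hypothetical sketch: periodically fetch a file served by a web application server,
      // instead of reading it from the local file system with FileSource.
      stream<rstring data> RemoteLines = InetSource()
      {
          param
              URIList       : ["https://myserver.example.com/feeds/input.csv"]; // hypothetical URL
              fetchInterval : 60.0;                                             // re-fetch every 60 seconds
      }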

 

Use of FileSink to write output data

Issue

A common pattern in SPL applications is to produce data and write it to files using a FileSink for consumption by other non-Streams applications. The FileSink operator will function correctly in the cloud, but because only the SPL application can access the file system on a Streams host in the cloud, writing data to files is of limited value (unless another Streams operator located on the same cloud host is accessing it).

 

Cloud-ready technique

If your SPL application uses FileSink as the means for producing output data, there are a few alternatives for modifying your application to make it cloud-ready.

  1. Move to a non file-based approach for output. For example use MessageHub, HTTPPost, JMSSink, or another adapter to send results out from your SPL application.
  2. Store the file written by your FileSink using the Object Storage service in IBM Cloud, so other components in your overall solution can access it.  See the Streams Object Storage Toolkit for more information about using Object Storage in a Streams application.
  3. If you were merely using the FileSink to write out debugging data, you could:
    1. Remove the FileSink and use a view annotation or create a dynamic view in the Streams Console to see the data (a sketch of the @view annotation appears after this list).
    2. Replace the FileSink with a Custom operator that prints the data and then use the Streams Console to download the job log to view the data.
      () as trace1 = Custom(strFile) {
        logic
          onTuple strFile:{
            println(strFile);
          }
      }
  4. If you are using the FileSink to produce files used by downstream Streams operators, you could:
    1. Specify a fully qualified file name to be created in the /tmp directory. For example:
      () as FileSink_2 = FileSink(inLine)
              {
                  param
                       file : "/tmp/myfile.txt";
                      format : line;
              }
      
    2. Specify /tmp as the data-directory during job submission and then specify the unqualified file name in the operator file parameter. For example:
      () as FileSink_2 = FileSink(inLine)
              {
                  param
                       file : "myfile.txt";
                      format : line;
              }
      
      
  5. Note: care must be taken to ensure that the SPL operators writing the files and the operators reading the files are co-located on the same application host.
    1. If the operators are in the same job you can use the hostColocation constraint to have the system ensure they are placed on the same host. For example:
      composite FilePutter
      {
          graph
              stream<blob myline> inLine = FileSource()
              {
                  param
                      file : getThisToolkitDir() + "/etc/putter/model.str";
                      format : block;
                      blockSize : 1024u;
                  config
                      placement : hostColocation("someId");
              }
      
              () as FileSink_2 = FileSink(inLine)
              {
                  param
                      file : "/tmp/newModel.str";
                      format : block;
                  config
                      placement : hostColocation("someId");
              }
      
      }
      
    2. If the operators are in different jobs, you can use host pools and the host placement constraint to have the system ensure that they are placed on the same host. For example:
      File writing job: 
      
      composite FilePutter
      {
          graph
              stream<blob myline> inLine = FileSource()
              {
                  param
                      file : getThisToolkitDir() + "/etc/putter/model.str";
                      format : block;
                      blockSize : 1024u;
                  config
                      placement : host(P1);
              }
      
              () as FileSink_2 = FileSink(inLine)
              {
                  param
                      file : "/tmp/newmodel.str";
                      format : block;
                  config
                      placement : host(P1);
              }
              config hostPool: 
                  P1=createPool({size=1u, tags=["host1"]}, Sys.Shared); //sized, tagged, shared
      
      }
      
      Directory scan job:
      composite DirectoryScanExample {
          .....
          
          graph
              stream<rstring strFilePath> strFile = DirectoryScan()
              {
                  param
                      directory : "/tmp";
                      pattern : "newmodel.str";
                      ignoreExistingFilesAtStartup : true;
                  config
                      placement : host(P1);
              }
      
              () as trace1 = Custom(strFile) {
                  logic
                  onTuple strFile:{
                      printStringLn("** 1 ** File notification: " + strFilePath);
                  }
              }
              ......
              config hostPool: 
                  P1=createPool({size=1u, tags=["host1"]}, Sys.Shared); //sized, tagged, shared
          }
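
As a standalone sketch of alternative 3 (viewing debug data instead of writing it to a file), the following shows a @view annotation on an operator's output port; the operator and view names are hypothetical. Once the job is running, the view can be opened from the Streams Console to inspect the tuples.

      // Hypothetical, self-contained sketch: expose an output port as a named view so its tuples
      // can be inspected from the Streams Console, replacing a debugging FileSink.
      @view(name = "ResultsView", port = Results, sampleSize = 100, bufferSize = 1000)
      stream<rstring message> Results = Beacon()
      {
          param
              period : 1.0;                    // one tuple per second
          output Results : message = "sample result";
      }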
      

 

Accessing on-premises data

Issue

SPL applications often access on-premises data; that is, enterprise or other data that is not located in the cloud. When Streams is installed on-premises, accessing that data is straightforward. When Streams is running in the cloud, additional work may be necessary to access on-premises data.

In some rare cases, on-premises data might be publicly available; that is, the data may be accessible via the public internet (with or without authentication). In most cases, however, on-premises data is protected behind a firewall, preventing access from outside environments such as IBM Cloud.

Cloud-ready technique

If your SPL application needs to access on-premises data, you have a couple of options for modifying your application to make it cloud-ready.

  • Use the Secure Gateway service in IBM Cloud to build a secure path for accessing your on-premises data. Then, access that on-premises data through a protocol that is compatible with Secure Gateway.  See Connecting Streaming Analytics to Your Enterprise Data for more information on this technique.
  • Open up ports on the on-premises firewall to allow a Streams adapter to access the data. This method works, but it is not recommended because it is not as secure as the previous technique.

Using the com.ibm.streamsx.inet.rest operators (e.g. HTTPTupleView, HTTPTupleInjection, HTTPJSONInjection, etc.)

See Connecting to Streaming Analytics in the IBM Cloud to learn under what conditions these operators (and other operators that play a server role) can be used.

 


Use of operators that are not compatible with Streaming Analytics

Issue

Some operators in the specialized toolkits that are provided with IBM Streams or provided by the IBMStreams GitHub project are not compatible with the Streaming Analytics service. Here are some examples:

  • None of the adapters in the com.ibm.streams.db toolkit are currently supported by the service.
  • The streamsx.opencv toolkit is not supported by the service because it requires open-source packages that IBM cannot install on behalf of the customer.

See the Streaming Analytics documentation in IBM Cloud for more detail about which operators are currently compatible with the service.

Please Note: If the operators that you need for your SPL applications are not supported, we would like to hear from you! Post your operator needs to the IBM Cloud Streaming Analytics forum: https://developer.ibm.com/answers/topics/streaming-analytics.html

Cloud-ready technique

If your application uses database adapters from the com.ibm.streams.db toolkit, consider using the Streams JDBC toolkit as an alternative.
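
As a rough illustration, here is a minimal sketch that writes a result stream to a database using the streamsx.jdbc toolkit's JDBCRun operator. The input stream Results, the driver jar path, the connection URL, and the table are all hypothetical, and parameter names may differ slightly depending on the toolkit version you use.

      use com.ibm.streamsx.jdbc::JDBCRun;

      // Hypothetical sketch: insert each tuple of the Results stream into a DB2 table via JDBC.
      // Results is assumed to be stream<int32 id, float64 score>.
      stream<int32 id, float64 score> Inserted = JDBCRun(Results)
      {
          param
              jdbcDriverLib : getThisToolkitDir() + "/opt/db2jcc4.jar";      // driver jar packaged in the .sab
              jdbcClassName : "com.ibm.db2.jcc.DB2Driver";
              jdbcUrl       : "jdbc:db2://myhost.example.com:50000/MYDB";    // hypothetical connection URL
              jdbcUser      : "dbuser";
              jdbcPassword  : "dbpassword";
              statement     : "INSERT INTO RESULTS (ID, SCORE) VALUES (?, ?)";
              statementParamAttrs : "id, score";
      }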

If your application uses unsupported data source or sink adapters, you can use an IBM Cloud application to access the data store, and then stream the data from that IBM Cloud application to your SPL application using another type of supported adapter.

If your SPL application uses analytics that are not currently supported, your only option is to replace that part of the processing with another analytic operator that is supported.

Reader comment: In the case of rScriptFileName, just rScriptFileName : "etc/process.r"; would work, as the operator takes any relative paths to be relative to the application root (see the operator documentation).