An SPL application that runs in a non-cloud, on-premises installation of Streams might require modification before it can run in the Streaming Analytics service in Bluemix™. This article describes SPL application constructs and patterns that may not work or may function differently in the cloud, and provides advice on how to make SPL applications that use these techniques “cloud-ready.”
 

Passing configuration information to an SPL operator as a file

Issue

Several SPL operators have parameters that expect a file reference. One example is the RScript operator, which uses a parameter to identify the file that contains an R script, for example param rScriptFileName : "myScript";. Users of the Streaming Analytics service in Bluemix do not have direct access to the file system of the hosts in the cloud, so a specific technique must be used to supply operator configuration files to the cloud.

Cloud-ready technique

If you use SPL operators that require configuration files, there are a couple of alternatives for updating your application to make it cloud-ready:
  1. The most straightforward technique is to include the configuration files in the streams application bundle (SAB). The preferred location for most configuration files is the etc directory of the application, which is automatically included in the application bundle. The file reference in the SPL application should then locate the files by dynamically obtaining the bundle location, for example rScriptFileName : getThisToolkitDir() + "/etc/process.r"; (see the sketch after this list). Alternatively, you can place your operator configuration files in a different directory in the .sab structure by specifying a <sabFiles> element in the application’s info.xml file.
  2. If your configuration files are very large, including them in the SAB will significantly increase its size. Large SABs take longer to upload on job submission, and some could even become too large to submit. Instead of including the large files in your SAB, you can store them in Bluemix Object Storage and access them from your Streams application. See Access files in Bluemix Object Storage from Streaming Analytics for more information about this technique.
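As a minimal sketch of the first alternative, the following composite reads a configuration file that was packaged in the application's etc directory. The composite name and the file name etc/config.csv are hypothetical placeholders:

    composite BundledConfigReader
    {
        graph
            // getThisToolkitDir() resolves to the location of the extracted
            // application bundle at run time, so the bundled file can be
            // found on whichever cloud host the operator runs on.
            stream<rstring configLine> ConfigLines = FileSource()
            {
                param
                    file   : getThisToolkitDir() + "/etc/config.csv";
                    format : line;
            }
    }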
 
 

Use of FileSource and DirectoryScan to stream new data into an SPL application

Issue

A common pattern in SPL applications is to get data into the application via a FileSource. The FileSource operator functions correctly in the cloud, but its usefulness for bringing new data into the application is limited because you do not have direct access to the file system on the Streams application nodes. The files accessed by a FileSource are in the local file system of a Streams host in the cloud. An SPL application running in the cloud can access that local file system, but there is no way for an entity outside of Streams to append new data to a file or place a new file into a directory that is being monitored by a DirectoryScan. The FileSource model does work correctly for getting one-time input, configuration, initialization, or other static data into an SPL application, provided the file is included in the streams application bundle.

Cloud-ready technique

If your SPL application uses FileSource as the means for obtaining new data, there are a few alternatives for modifying your application to make it cloud-ready:
  1. Store your data files in Bluemix Object Storage, and access them from your Streams application.  This enables your solution to dynamically create new files or update existing files in Bluemix Object Storage that your Streams application can access.  See Access files in Bluemix Object Storage from Streaming Analytics for more information about this technique.
  2. Serve your files through a web application server, and access them using the InetSource operator instead of FileSource (see the sketch after this list).
  3. Use a Bluemix application to access the file-based data, and stream the data from your Bluemix application into your SPL application via another type of adapter.
  4. Move to a non-file-based approach for input (for example, Kafka, MQTT, MQ, or HTTP).
  5. Create a new SPL application that includes the new file in the streams application bundle, then use a FileSource to read from the bundle and a FileSink to write the file out to /tmp in the file system on the same Streams application node (see the example in Use of FileSink to write output data below).
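As a minimal sketch of alternative 2, the following composite polls a file served over HTTP using the InetSource operator from the com.ibm.streamsx.inet toolkit. The URL, fetch interval, and output schema shown are hypothetical placeholders:

    use com.ibm.streamsx.inet::InetSource;

    composite HttpFilePoller
    {
        graph
            // Fetch the web-served file every 60 seconds and emit each
            // retrieved line as a tuple. The URL below is a placeholder.
            stream<rstring line> FetchedLines = InetSource()
            {
                param
                    URIList       : ["http://myserver.example.com/data/latest.csv"];
                    fetchInterval : 60.0;
            }
    }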

 

Use of FileSink to write output data

Issue

A common pattern in SPL applications is to produce data and write it to files using a FileSink for consumption by other non-Streams applications. The FileSink operator functions correctly in the cloud, but because only the SPL application can access the file system on a Streams host in the cloud, writing data to files is of limited value (unless another Streams operator located on the same cloud host reads it).

Cloud-ready technique

If your SPL application uses FileSink as the means for producing output data, there are a few alternatives for modifying your application to make it cloud-ready:
  1. Move to a non-file-based approach for output. For example, use KafkaProducer, HTTPPost, JMSSink, or another adapter to send results out of your SPL application (a minimal HTTPPost sketch appears after this list).
  2. Store the file written by your FileSink in Bluemix Object Storage, so other components in your overall solution can access it.  See Access files in Bluemix Object Storage from Streaming Analytics for more information about this technique.
  3. If you were merely using the FileSink to write out debugging data, you could:
    1. Remove the FileSink and use a view annotation or create a dynamic view in the Streams Console to see the data.
    2. Replace the FileSink with a Custom operator that prints the data and then use the Streams Console to download the job log to view the data.
      () as trace1 = Custom(strFile) {
        logic
          onTuple strFile:{
            println(strFile);
          }
      }
  4. If you are using the FileSink to produce files used by downstream Streams operators, you could:
    1. Specify a fully qualified file name to be created in the /tmp directory. For example:
      () as FileSink_2 = FileSink(inLine)
      {
          param
              file : "/tmp/myfile.txt";
              format : line;
      }
      
    2. Specify /tmp as the data directory during job submission and then specify the unqualified file name in the operator file parameter. For example:
      () as FileSink_2 = FileSink(inLine)
      {
          param
              file : "myfile.txt";
              format : line;
      }
      
      
  5. Note: care must be taken to ensure that the SPL operators writing the files and the operators reading the files are co-located on the same application host.
    1. If the operators are in the same job, you can use the hostColocation placement constraint to ensure they are placed on the same host. For example:
      composite FilePutter
      {
          graph
              stream<blob myline> inLine = FileSource()
              {
                  param
                      file : getThisToolkitDir() +"/etc/putter/model.str"  ;
                      format : block;
                      blockSize : 1024u;
                      config placement : hostColocation("someId");
              }
      
              () as FileSink_2 = FileSink(inLine)
              {
                  param
                      file : "/tmp/newModel.str";
                      format : block;
                      config placement : hostColocation("someId");
              }
      
      }
      
    2. If the operators are in different jobs, you can use host pools and the host placement constraint to ensure they are placed on the same host. For example:
      File writing job: 
      
      composite FilePutter
      {
          graph
              stream<blob myline> inLine = FileSource()
              {
                  param
                      file : getThisToolkitDir() +"/etc/putter/model.str"  ;
                      format : block;
                      blockSize : 1024u;
                      config placement : host(P1);
               }
      
              () as FileSink_2 = FileSink(inLine)
              {
                  param
                      file : "/tmp/newmodel.str";
                      format : block;
                      config placement : host(P1);
              }
              config hostPool: 
                  P1=createPool({size=1u, tags=["host1"]}, Sys.Shared); //sized, tagged, shared
      
      }
      
      Directory scan job:
      composite DirectoryScanExample {
          .....
          
          graph
              stream<rstring strFilePath> strFile = DirectoryScan(){
                  param
                      directory : "/tmp";
                      pattern : "newmodel.str";
                      ignoreExistingFilesAtStartup : true;
                      config placement : host(P1);
              }
      
              () as trace1 = Custom(strFile) {
                  logic
                  onTuple strFile:{
                      printStringLn("** 1 ** File notification: " + strFilePath);
                  }
              }
              ......
              config hostPool: 
                  P1=createPool({size=1u, tags=["host1"]}, Sys.Shared); //sized, tagged, shared
          }
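As a minimal sketch of alternative 1 above (replacing the FileSink with an HTTP-based sink), the following composite posts each result tuple to an external endpoint using the HTTPPost operator from the com.ibm.streamsx.inet toolkit. The result stream, its schema, and the endpoint URL are hypothetical placeholders:

    use com.ibm.streamsx.inet.http::HTTPPost;

    composite ResultPoster
    {
        graph
            // A placeholder result stream; in a real application this would
            // be produced by your analytics.
            stream<rstring result> Results = Beacon()
            {
                param
                    iterations : 1u;
                output
                    Results : result = "hello";
            }

            // Post each result tuple to an external HTTP endpoint instead of
            // writing it to a file. The URL below is a placeholder.
            () as PostResults = HTTPPost(Results)
            {
                param
                    url : "http://myserver.example.com/ingest";
            }
    }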
      

 

Accessing on-premises data

Issue

SPL applications often access on-premises data; that is, enterprise or other data that is not located in the cloud. When Streams is installed on premises, accessing on-premises data is straightforward. When Streams is running in the cloud, additional work may be necessary to access on-premises data.

In some rare cases, on-premises data might be publicly available; that is, the data may be accessible via the public internet (with or without authentication). But in most cases, on-premises data is protected behind a firewall, preventing access from outside environments such as Bluemix.

Cloud-ready technique

If your SPL application needs to access on-premises data, you have a couple of options for modifying your application to make it cloud-ready:
  • Use the Secure Gateway service in Bluemix to build a secure path for accessing your on-premises data. Then, access that on-premises data through a protocol that is supported by a Streams adapter (see the sketch after this list).
  • Open up ports on the on-premises firewall to allow a Streams adapter to access the data. This method works, but it is not recommended because it is not as secure as the previous technique.
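For example, once a Secure Gateway destination exposes your on-premises data source, a standard Streams adapter can connect to the gateway's cloud endpoint. Here is a minimal sketch using TCPSource; the host name and port are hypothetical placeholders for your gateway destination:

    composite GatewayReader
    {
        graph
            // Connect as a TCP client to the cloud endpoint that the Secure
            // Gateway maps to the on-premises data source. The address and
            // port below are placeholders.
            stream<rstring line> OnPremData = TCPSource()
            {
                param
                    role    : client;
                    address : "gateway.example.com";
                    port    : 15000u;
                    format  : line;
            }
    }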

Using the com.ibm.streamsx.inet.rest operators (e.g., HTTPTupleView, HTTPTupleInjection, HTTPJSONInjection)

The streamsx.inet toolkit provides a set of operators that are often used in Streams demos and applications.  If you use these operators in an SPL application, there are some additional issues you need to consider to use them in the Bluemix cloud.  See Using Streams Operators that listen for a connection in the Streaming Analytics Bluemix Service.

Use of operators that are not compatible with Streaming Analytics

Issue

Some operators in the specialized toolkits that are provided with IBM Streams or provided by the IBMStreams GitHub organization are not compatible with the Streaming Analytics service. Here are some examples:
  • None of the adapters in the com.ibm.streams.db toolkit are currently supported by the service.
  • The streamsx.opencv toolkit is not supported by the service because it requires open source packages that IBM cannot install on behalf of the customer.
See the Streaming Analytics documentation in Bluemix for more detail about which operators are currently compatible with the service. Please Note: If the operators that you need for your SPL applications are not supported, we would like to hear from you! Post your operator needs to the Bluemix Streaming Analytics forum: https://developer.ibm.com/answers/topics/streaming-analytics.html

Cloud-ready technique

If your application uses database adapters from the com.ibm.streams.db toolkit, consider using the Streams JDBC toolkit as an alternative (a minimal sketch follows). If your application uses unsupported data source or sink adapters, you can use a Bluemix™ application to access the data store, and then stream the data from your Bluemix application to your SPL application using another type of supported adapter. If your SPL application uses analytics that are not currently supported, your only option is to replace that part of the processing with another analytic operator that is supported.
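The following is a minimal sketch of inserting rows through the JDBCRun operator from the streamsx.jdbc toolkit. The driver jar path, JDBC class name and URL, credentials, table, and stream schema are hypothetical placeholders; confirm the parameter names against the JDBC toolkit documentation for the toolkit version you use:

    use com.ibm.streamsx.jdbc::JDBCRun;

    composite DbWriter
    {
        graph
            // A placeholder stream of rows to insert.
            stream<rstring name, int32 amount> Rows = Beacon()
            {
                param
                    iterations : 1u;
                output
                    Rows : name = "sample", amount = 1;
            }

            // Run a parameterized INSERT for each incoming tuple. All
            // connection details below are placeholders.
            stream<rstring name, int32 amount> Inserted = JDBCRun(Rows)
            {
                param
                    jdbcDriverLib       : "opt/db2jcc4.jar";
                    jdbcClassName       : "com.ibm.db2.jcc.DB2Driver";
                    jdbcUrl             : "jdbc:db2://dbhost.example.com:50000/MYDB";
                    jdbcUser            : "username";
                    jdbcPassword        : "password";
                    statement           : "INSERT INTO RESULTS (NAME, AMOUNT) VALUES (?, ?)";
                    statementParamAttrs : "name,amount";
            }
    }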

1 Comment on "Getting your SPL application ready for the cloud"

  1. In the case of rScriptFileName, just rScriptFileName: "etc/process.r"; would work, as the operator takes any relative paths to be relative to the application root. (documentation).
