Are you designing a Streams application that will connect to a database? The flowchart below gives a basic overview of which toolkits can be used and which databases are supported.

[Figure: flowchart of database toolkits for Streams applications]

General comments

  • The chart above applies to Streams applications written in SPL. If you have a Java, Python or Scala application, connect to your target database using the database client for that language.
  • In addition to Cloudant or the HBase service on Bluemix, you can integrate any database that provides REST endpoints with Streams using the HTTPPost and HTTPGet operators in the inet toolkit.  The toolkit also includes the httpPut and httpDelete native functions for PUT and DELETE requests, respectively.
  • If you would like updates to your database to be pushed to your Streams application, consider using IBM Change Data Capture (CDC). CDC monitors database logs to detect data changes, so it can notify you about changes without querying the database. It integrates with many popular database systems, including IBM DB2®, IBM i, IBM Informix®, Oracle, Sybase, Microsoft SQL Server, and more. The open source CDC toolkit connects Streams to CDC.
  • In certain scenarios involving location-based data, you could use some of the operators in the Geospatial toolkit instead of a database. The Geofence, SpatialGridIndex and PointMapMatcher operators provide in-memory storage and fast lookups for common use cases.
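
For applications written in Python rather than SPL, the REST approach mentioned above comes down to sending JSON over HTTP. As a minimal sketch, assuming a Cloudant-style endpoint where a POST to the database URL creates a document, the request could be built with the standard library as shown here. The account URL, the database name and the function name build_insert_request are hypothetical placeholders, and authentication is left out.

```python
import json
import urllib.request


def build_insert_request(base_url: str, database: str, document: dict) -> urllib.request.Request:
    """Build (but do not send) an HTTP POST that stores one JSON document."""
    url = f"{base_url.rstrip('/')}/{database}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical account URL and database name, for illustration only.
request = build_insert_request(
    "https://example-account.cloudant.com",
    "readings",
    {"sensor": "s1", "value": 21.5},
)
# To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the example free of network access and makes the URL and payload easy to inspect.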

Avoid performance problems

  • Cache table data in memory instead of issuing frequent queries. For example, your application might need to look up a customer’s information based on their phone number or account ID. If it queries the database for every tuple, the high tuple throughput could easily overload the database and create a bottleneck. To avoid this, especially with a relational database, design the application to preload the data into an in-memory cache and query the cache instead. The cache can be refreshed periodically, at a rate that matches how often the data is updated. If a single in-memory cache is insufficient for performance reasons, or if the cache is to be shared between operators, you could:
    • Partition the cache across multiple hosts, or
    • Cache the data using the DPS toolkit and a supported in-memory key-value store as a backend.
  • Using Cloudant on Bluemix? Be aware of the limit on the number of requests per second. If automatic threading is enabled for any operators that connect to the service, this might inadvertently increase the number of simultaneous write requests since the operator will become multi-threaded.
  • When inserting data, it is good practice to buffer writes for the sake of performance.  Most of the toolkits referenced here  provide support for batching inserts/writes.
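
The caching advice above can be sketched in a few lines for an application that uses its own database client. This is only an illustration under stated assumptions: RefreshingCache and its loader callback are hypothetical names, and the loader stands in for whatever query preloads the lookup table. The cache reloads the whole table once the refresh interval has passed and answers every lookup from memory in between.

```python
import time
from typing import Callable, Dict, Optional


class RefreshingCache:
    """In-memory lookup table that is reloaded after a fixed interval."""

    def __init__(
        self,
        loader: Callable[[], Dict[str, str]],
        refresh_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader              # hypothetical: runs the real query
        self._refresh_seconds = refresh_seconds
        self._clock = clock                # injectable so it can be tested
        self._data: Dict[str, str] = {}
        self._loaded_at: Optional[float] = None

    def _refresh_if_stale(self) -> None:
        now = self._clock()
        never_loaded = self._loaded_at is None
        if never_loaded or now - self._loaded_at >= self._refresh_seconds:
            # Copy the result so later changes at the source are only
            # seen at the next refresh.
            self._data = dict(self._loader())
            self._loaded_at = now

    def lookup(self, key: str) -> Optional[str]:
        """Return the cached value for key, or None if it is unknown."""
        self._refresh_if_stale()
        return self._data.get(key)
```

With a refresh interval of 60 seconds, the database sees at most one query per minute from this cache, regardless of how many tuples are processed.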

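Buffering writes follows the same pattern whichever client is used. The sketch below is a hypothetical helper, not part of any toolkit: BatchingWriter collects records and hands them to a write_batch callback once batch_size records have accumulated. The callback is an assumption standing in for a bulk insert call of the target database.

```python
from typing import Any, Callable, List


class BatchingWriter:
    """Collect records and write them in groups instead of one at a time."""

    def __init__(self, write_batch: Callable[[List[Any]], None], batch_size: int) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._write_batch = write_batch    # hypothetical bulk insert callback
        self._batch_size = batch_size
        self._buffer: List[Any] = []

    def write(self, record: Any) -> None:
        """Buffer one record and flush when the batch is full."""
        self._buffer.append(record)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Write any buffered records; call this at shutdown as well."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._write_batch(batch)
```

Larger batches reduce the number of requests, which also helps with per-second request limits, at the cost of records waiting longer in memory before they are written.
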
 More information

The following table describes each toolkit from the above diagram. Some of these toolkits are open source, so the table also indicates whether the toolkit is included in the Streams product or must be downloaded separately.

| Toolkit | Description | Separate download required? |
|---|---|---|
| JDBC toolkit | Allows Streams applications to integrate with any JDBC-compliant database using the JDBCRun operator. | Yes, for Streams versions older than 4.2. |
| Database toolkit | An alternative to the JDBC toolkit that connects using the ODBC API. Supported databases include IBM DB2, IBM InfoSphere BigInsights, IBM Netezza, Oracle Database, and Teradata Database. | No |
| DPS toolkit | The DPS (Distributed Process Store) toolkit provides a set of functions for connecting to many NoSQL databases. Data stored by the DPS toolkit in a database is only accessible from Streams applications. Official support is provided for Redis. | Yes, for Streams 3.x and 4.0. No download required for 4.1+. |
| HBase toolkit for Bluemix | Use this toolkit only if you are connecting to an instance of HBase running in Bluemix. | Yes |
| HBase toolkit | Provides integration with on-premises instances of HBase. | No |
| Cloudant via REST | You can use the HTTPPost and HTTPGet operators in version 2.x of the inet toolkit to connect to the Cloudant service on Bluemix. See the related article for details. | No |
| Cassandra toolkit* | Supports writing data to Cassandra from Streams using the CassandraSink operator. Use this toolkit instead of the DPS toolkit to communicate with Cassandra if you would like the data written from Streams to be visible to other applications and vice versa. | Yes |
| MongoDB toolkit* | If you would like the data written from Streams to be visible to other applications, use the MongoDB toolkit instead of the DPS toolkit, because data written to MongoDB by the DPS toolkit is encoded and is not visible outside of Streams. This toolkit is in incubation. | Yes |
| Parquet toolkit* | Provides the ParquetSink operator for writing to Parquet. | Yes |

Toolkits in green only provide write access to the target DB. Toolkits marked with an asterisk (*) are open source and still in incubation.
