IBM Streams V4.3 is now available and it includes a number of new features.  Some of these include dynamic and elastic scaling, simpler event time window handling, optional data types, and improved operations and management.  The new release also includes updates to the Streams Runner for Apache Beam as well as a number of enhancements to Geospatial, Timeseries and messaging related toolkits.  Streams V4.3 was released on August 3rd.  This post gives an overview of some of the highlights of the release and will be followed by more detailed articles and posts in the future.

New Features

Dynamic and Elastic Scaling

IBM Streams has been evolving over many releases to be more dynamic and elastic, and that evolution continues in V4.3 with new features to dynamically change the level of parallel processing and resources available.

Dynamic Parallel Regions

Streams introduced parallel regions in V3.2¬†and we have made a number of improvements to them since,¬† including allowing the number of parallel channels to be specified at job submission time.¬†We’ve now added support for changing the number of parallel channels and the number of resources available to the parallel region at runtime.

This means that when there is an increase in the volume of data being processed in a stream, you can increase the number of parallel channels and resources that they run on accordingly and then reduce them again when the data rates return to normal levels.

This is the initial release of these dynamic capabilities and we expect to continue evolving this over future releases to increase function and remove restrictions with a goal of providing seamless and automatic capabilities with minimal constraints and impacts on applications.

Dynamic Resources

Streams V4.3 introduces the concept of serverless workloads with dynamic resources that allow Streams applications to be deployed without pre-assigning resources to a Streams Instance.  Resources can be configured for use by a Streams Domain and then allocated to a Streams Instance when a Job is submitted and freed when the job is cancelled.  Resource assignments can be shared or exclusive to simplify workload isolation and fine grained resource allocation.

This is the initial release of these serverless concepts and we expect to continue improving this in future releases to integrate more closely with resource managers.

Optional Data Types

Streams is used to process data from many sources, and it is not uncommon for some of the data to have no value.  Streams V4.3 introduces a new optional data type that allows a tuple attribute to have no value, similar to the concept of NULL in SQL.  This improves and simplifies interactions with external systems that produce, or support the concept of, data fields which have no value. The ?? operand is used to check if the attribute has a value, and the ! operand accesses it:


//int32 attribute declared with the optional type, could be null
mutable optional myOptionalVar = null; 
//..< do some processing >
/...
if (myOptionalVar??) { //check that it is not null   
    int32 value = myOptionalVar!;  //access the value
}

Note that the expression if (myOptionalVar??) Is equivalent to:


if(myOptionalVar != null) {  ../* do something*/

This support includes the new optional data type in the Streams Processing Language (SPL) as well as updates to toolkits like the JDBC and JSON toolkits to support data which has no value.

Event Time Windowing

Streams has always supported a wide variety of windowing concepts that allow operations to be performed on a set, or window, of data from a stream based on a number of eviction and trigger policies.  Streams V4.3 introduces a new timeInterval window type for window operations based on the time of an event, that is much more intuitive than the existing attribute delta window policies on a time based attribute.  The time of event in a stream is specified by a new annotation on the stream, and the new window type greatly simplifies support for out of order and late arriving data.

Improved Java and Python Application APIs

We introduced the Java application API in Streams V4.1 and the Python application API in Streams V4.2. These are provided by the topology toolkit that is developed as an open source toolkit on GitHub. There have been a number of updates made to the topology toolkit in the open which are being included in Streams V4.3. For both Java and Python APIs we have improved the support for parallel regions including nested regions and broadcast routing. For Java we have added support for guaranteed processing with consistent regions and improved access to execution context including application configurations and custom metrics. For Python, we have added support for windowing and windowed aggregates, SPL schemas and submission parameters.

Updated Streams BEAM runner for Apache BEAM

Apache Beam is an open source unified model for defining data pipelines which allows you to develop portable applications that can be deployed to different runtime engines using a ‚Äúrunner‚ÄĚ for your runtime engine of choice.¬† We introduced the Streams runner for Apache Beam last year.¬† It allows you to run applications developed using Apache Beam on IBM Streams.¬† In Streams V4.3, we have updated the Streams runner with a number of enhancements.¬† The Runner now supports the Beam-2.4 Java SDK and Beam‚Äôs S3 IO connector to work with data in the IBM Cloud Object Storage service.

We have also added support for parallel Beam pipelines and transforms as well as tuple bundling.  Operational control and troubleshooting have also been enhanced with support for Beam application options as submission time parameters and componentized logging.  These enhancements allow translation-time and runtime trace levels to be configured separately.

Improvements for Operations and System Management

We continue to make improvements to make it easier to manage Streams and simplify operations, addressing issues we identify internally or through customer feedback and support.  In V4.3 this includes improvements to rolling upgrade support and the way that we schedule jobs to run across a cluster.

Scheduling enhancements

Streams has an advanced scheduler that decides how to deploy a Streams application’s graph of connected operators across a runtime cluster.   The scheduler makes decisions on how operators are placed into processing element (PE) processes and how the PEs are placed on cluster resources based on a combination of user defined constraints and the current load on the cluster.  Streams V4.3 enhances the scheduler to consider additional dimensions of network and memory usage in addition to CPU load.  It also allows a user to control the relative importance of these different dimensions for their applications and environment through setting thresholds for what is considered under-utilized and over utilized for each dimension.

Rolling Upgrade enhancements

We introduced support for rolling upgrades in Streams V4.2 allowing domains to upgraded to to new versions without having to stop instances or jobs and support for instances that are at different versions and upgraded independently.  In Streams V4.3 we now allow an instance to be upgraded to a new version without having to stop jobs that are running on that instance and restarting the instance.  All new jobs will run at the new version while existing jobs will continue to run uninterrupted at the prior version until it is restarted.

New and Improved Toolkits

Streams has a large number of toolkits that provide out of the box functionality to ingest, process, and output data of many types, interacting with many systems and performing rich and varied analytics. Many of these toolkits have been open sourced and are developed and released on the IBMStreams organization on github.  We continue to add new toolkits and enhance existing toolkits in the product. The number of open source toolkits, samples and utilities has grown considerably since the V4.2 release with over 80 repositories, and hundreds of operators.

New Kafka, MessageHub, MQTT and RabbitMQ Toolkits

We have added a number of new toolkits for connectors to messaging systems like Apache Kafka and RabbitMQ.  These are primarily based on existing operators that had been provided within the current messaging toolkit. Also, the Message Hub toolkit for connecting to the IBM Message Hub service, a fully managed Kafka service, by providing Kafka client based operators that simplify connectivity to the IBM Message Hub service.

New Cloud Object StorageToolkit

In V4.3 we are adding a new Cloud Object Storage toolkit that allows you to store data in the IBM Cloud Object Storage (COS) service.  It includes for CSV and Parquet formatted data with partitioning that is compatible with the IBM Cloud SQL Query service allowing Streams data written to COS to be queried using SQL.

GeoSpatial, Timeseries and Inet toolkit enhancements

The new FlightPathEncounter operator in the Geospatial toolkit adds support for altitude, as well as latitude and longitude to the toolkit.  Other enhancements include algorithm improvements in the Timeseries toolkit for forecasting.  The inet toolkit also now includes some operators that had only been available in the open source release of the toolkit.

Streams Quick Start Edition on DockerHub

Streams V4.3 Quick Start Edition is available as a Docker container on DockerHub.

 

Learn More

The Streams v4.3 knowledge center has more details about the new features.
Stay tuned to Streamsdev over the next few weeks for articles and videos about what’s new in Streams V4.3.

 

Join The Discussion