IBM Streams V4.2 is now available, with a number of new features that support edge-to-center analytics for IoT through Apache Edgent integration, development with Python and Jupyter Notebooks, and integrated development and optimized execution of business rules.  This release also brings improvements in application deployment, performance, security, serviceability and administration.  The release also includes new and improved analytics, including speech to text, newly included open source toolkits for datetime, IoT, JDBC, JSON and Network analytics, and enhanced support for the Geospatial, Cybersecurity and Telecommunications Event Data Analytics (TEDA) toolkits.  Streams V4.2 was released on September 23rd.  This post gives an overview of some of the highlights of the release and will be followed by more detailed articles and posts.

Edge to Center analytics for IoT use cases

Stream processing systems are commonly used to process data from edge devices, and there is a need to push some of the streaming analytics to the edge to reduce communication costs, react locally and offload processing from central systems. In February IBM announced an open source project to build a community for accelerating analytics at the edge; that project is now incubating at the Apache Software Foundation as Apache Edgent. Streams V4.2 integrates with Apache Edgent and the Watson IoT Platform to provide a hierarchy of edge-to-center analytics across devices.

[Image: Edge-to-center analytics for IoT (diagram 1)]

Edgent allows devices to react locally to events and helps reduce communication costs by sending data to central systems only when an event of interest occurs. All relevant data can then be delivered at the moment the event occurs, rather than relying on a periodic heartbeat that may not include the necessary signals or may arrive too late to act on. Device hubs like the Watson IoT Platform provide device management and message brokers for communication. Central analytic systems like IBM Streams provide full-featured streaming analytics with additional context and state, correlation across devices, and access to data-of-record systems.

[Image: Edge-to-center analytics for IoT (diagram 2)]

The Streams integration with Edgent and the Watson IoT Platform enables automatic application connectivity, central job management and health summaries of edge applications.  Streams applications can also control the Edgent applications on edge devices based on analytics across the devices and the central system. A public, pluggable device hub interface allows users to develop custom implementations for their own device hubs and messaging systems such as Kafka.
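Edgent itself exposes this pattern through a functional Java API of device-side streams, filters and sinks, which is beyond the scope of this post. Purely as a language-neutral illustration of the idea, and not of the Edgent or Watson IoT Platform APIs, the edge-side logic amounts to something like the sketch below, where read_temperature() and send_to_hub() are hypothetical stand-ins for a real sensor and a real device hub connection:

    import random
    import time

    THRESHOLD = 75.0  # hypothetical alert threshold

    def read_temperature():
        # Stand-in for reading a real device sensor.
        return 70.0 + random.random() * 10.0

    def send_to_hub(event):
        # Stand-in for publishing an event of interest to a device hub
        # such as the Watson IoT Platform; here we simply print it.
        print("event of interest:", event)

    def edge_loop(polls=50, period=0.1):
        # Poll the sensor and forward only readings that cross the
        # threshold, rather than streaming every reading upstream.
        for _ in range(polls):
            value = read_temperature()
            if value > THRESHOLD:
                send_to_hub({"temperature": value, "ts": time.time()})
            time.sleep(period)

    if __name__ == "__main__":
        edge_loop()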

Developing in Python

The new Python support allows Python developers to create Streams applications and operators in Python without any knowledge of Streams' declarative language SPL, C++ or Java.  The Python Application API is similar to the Java Application API released in V4.1: it uses a functional style with operations on streams and windows, which feels natural to Python developers while keeping powerful stream processing concepts simple to use.  It also gives Python developers the full benefit of Streams, with access to existing features and analytics such as parallel regions, checkpointing, and PE and host placement, as well as support for SPL streams, composites and toolkits.

As with the Streams support for Java applications and operators, the Python support is not a replacement for SPL or for C++ and Java operators.  It is an additional way to develop Streams applications and operators, recognizing that many developers know Python and can be productive in it more quickly than by learning SPL, Java or C++.  Python has become a very popular language for data scientists, with a large number of analytics available in modules like the SciKits that can be invoked within Streams applications and operators using the new Python support.

Another popular feature of Python is the interactive notebook development environment.  The Streams Python support includes the ability to develop, build and submit Python Streams applications directly from a Jupyter notebook.  You can also view Streams data directly in the notebook and plot dynamically updating visualizations, as shown in the simple example below:

[Image: A very simple Streams application in Python with a live plot of the stream in a notebook]
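For readers who want to see roughly what such an application looks like in code, here is a minimal sketch using the open source streamsx.topology Python API. The random-readings source, the outlier threshold and the standalone submission are illustrative assumptions rather than the exact notebook example pictured above; in a notebook you would typically submit to a distributed instance and attach a view to the stream for live plotting.

    import random
    import time

    from streamsx.topology.topology import Topology
    from streamsx.topology import context

    def readings():
        # Endless source of simulated sensor readings.
        while True:
            time.sleep(0.1)
            yield random.gauss(0.0, 1.0)

    topo = Topology("RandomReadings")

    # Declare a source stream of floats and keep only the outliers.
    src = topo.source(readings)
    outliers = src.filter(lambda v: abs(v) > 2.0)

    # Print the surviving tuples; in a notebook you could instead attach
    # a view to the stream and plot it as it updates.
    outliers.print()

    # Submit as a standalone, single-process run for illustration;
    # notebooks typically submit to a distributed Streams instance.
    context.submit("STANDALONE", topo)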

The Python application and operator APIs are open source, and we will continue to develop them in the open with frequent releases in the IBMStreams streamsx.topology project on GitHub, where contributions are welcome.

Business Rules Integration

The Streams Rules toolkit, which enables IBM Operational Decision Manager (ODM) rules to be executed as part of a Streams application flow, was introduced in Streams V3.2.  ODM provides a rules language for easily describing business rules and a rich set of tooling to create and manage them.  Streams provides a scalable, distributed platform for rich analytics on streams of big data with massive volume, wide variety and high velocity.  The Rules toolkit lets business rules be applied to big data with the rich language and tooling of ODM and the scaling, full context across many data sources and rich analytics of Streams.  V4.2 adds a deeper integration of the ODM rules tooling and a rules compiler.  Streams Studio now supports an embedded business rules project, with the ODM rules designer embedded in Streams Studio for editing rules, as shown below.

[Image: ODM rules designer embedded in Streams Studio]

The rules compiler generates SPL types from the business rules and compiles the rules to Streams C++ operators for high-performance, low-latency execution.  The compiled rules can also be updated dynamically in Streams applications while they are running.

Performance and Application Deployment

Streams has many large enterprise customers who push the limits on performance and manage the deployment of mission-critical applications, and we have made significant improvements in these areas.

Parallel Region enhancements

Parallel regions were introduced in V3.2, allowing a developer to indicate, with a simple @parallel annotation, that a region of the application should be replicated to process the stream in parallel. V4.2 adds support for nested parallel regions and the ability to broadcast all tuples on a stream to every channel without partitioning.
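In SPL the annotation is written as @parallel(width=N) on an operator invocation. The Python Application API exposes the same concept directly on streams; the sketch below is illustrative only, assuming the parallel() and end_parallel() methods of streamsx.topology, with heavy_score standing in for whatever per-tuple work you want replicated:

    from streamsx.topology.topology import Topology
    from streamsx.topology import context

    def heavy_score(value):
        # Stand-in for an expensive per-tuple computation.
        return value * value

    topo = Topology("ParallelExample")
    numbers = topo.source(lambda: range(10000))

    # Replicate the scoring logic across three channels; tuples are
    # distributed across the channels rather than broadcast.  (Broadcast
    # routing, sending every tuple to every channel, is the new V4.2
    # capability mentioned above.)
    scored = numbers.parallel(3).map(heavy_score).end_parallel()

    scored.print()
    context.submit("STANDALONE", topo)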

Consistent Region enhancements

Consistent regions were introduced in V4.0 to support exactly-once and at-most-once processing guarantees.  Consistent regions use the Chandy-Lamport algorithm, with Streams operators checkpointing their state to establish a globally consistent state.  We have made several improvements that raise performance and reduce the latency of establishing a consistent state, including asynchronous non-blocking checkpoints, increased concurrency when establishing and restoring checkpoints, and support for the Streams Hyperstate Accelerator, a hardware-accelerated store, for checkpoints.
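The essence of the marker-based approach can be seen in a toy, single-process sketch (an illustration of the general Chandy-Lamport idea on a simple pipeline, not Streams' actual implementation): as the marker flows through, each operator checkpoints a state that reflects exactly the tuples before the marker, so the set of checkpoints forms a globally consistent cut.

    class Op:
        """A pipeline operator keeping a running sum as its state."""

        def __init__(self, name, downstream=None):
            self.name = name
            self.downstream = downstream
            self.state = 0
            self.checkpoint = None

        def process(self, item):
            if item == "MARKER":
                # Snapshot local state, then forward the marker so the next
                # operator checkpoints a state consistent with this one.
                self.checkpoint = self.state
            else:
                self.state += item
            if self.downstream is not None:
                self.downstream.process(item)

    sink = Op("sink")
    middle = Op("middle", downstream=sink)

    for item in [1, 2, 3, "MARKER", 4, 5]:
        middle.process(item)

    # Both checkpoints reflect exactly the tuples before the marker (1+2+3),
    # even though processing has moved on since: a consistent state.
    print(middle.checkpoint, sink.checkpoint)  # 6 6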

Scheduler and Configuration enhancements

Improvements to the scheduler, which determines how Streams applications are distributed over a cluster, provide better utilization across different-sized resources and allow resources to be reserved for specific application purposes such as ingest.

We have also introduced two configuration improvements.  Job configuration overlays allow application configuration and parameters to be provided in a separate configuration file at submission time, so the same application can be run with different configurations without any code change or recompilation; an application can be tested with one configuration and moved to production with another, knowing that the application code has not changed between test and production.  Secure application configurations allow a set of properties to be created at the domain or instance level and then accessed by name from an application, so applications can use logical names like SensorDataQueue to reference external systems without knowing their details.
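At the platform level a job configuration overlay is a JSON document supplied at submission time. The same idea is also surfaced in the Python API through the JobConfig class in streamsx.topology.context; availability of this class depends on the streamsx.topology release you have installed, so treat the following as a hedged sketch rather than a guaranteed 4.2 interface:

    from streamsx.topology.topology import Topology
    from streamsx.topology import context
    from streamsx.topology.context import JobConfig

    topo = Topology("ConfigurableApp")
    topo.source(lambda: range(100)).print()

    # Build a submission-time configuration without touching application
    # code: the same application can be submitted to test and production
    # with different job configurations.
    cfg = {}
    JobConfig(job_name="SensorIngest.test", tracing="info").add(cfg)

    context.submit("DISTRIBUTED", topo, cfg)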

Automatic fusion and dynamic threading

Perhaps the largest change in performance and application deployment is the introduction of automatic fusion and dynamic threading.  Streams applications are made up of a graph of operators that execute in processing element (PE) processes, and the assignment of operators to PE processes is called fusion.  Prior to V4.2, fusion was determined at compile time, with a default of one operator per PE; changing fusion required a recompile, and threads had to be defined manually using threaded ports.  With V4.2, fusion is determined automatically by default, based on the resources available.  Manual placement constraints take precedence, and a user can also specify the number of PEs they want at submission time or ask for the legacy one-operator-per-PE behavior.  Threading is now automatic by default, with a pool of worker threads whose size is determined at runtime and adjusted dynamically as the throughput and load of the application change.  This can dramatically reduce the number of PE processes and the load on a system, as well as improving out-of-the-box performance.

Security Enhancements

Streams has extensive security capabilities that are further strengthened in V4.2 with the ability to encrypt all inter-host communication using TLS/SSL, and with Kerberos support including single sign-on and Microsoft Active Directory authentication through Kerberos tokens.

Serviceability and Administration Enhancements

We are constantly working to improve serviceability, making it easier to manage Streams and develop applications and addressing issues we identify internally or through customer feedback and support.  In V4.2 this includes improvements to defensive coding, error handling, logging and troubleshooting.  We have also improved the metrics we provide and the monitoring and usage of ZooKeeper.  Streams V4.2 now supports rolling upgrades of domains to new versions without having to stop instances or jobs, and a Streams domain can support instances that are at different versions and upgraded independently.

The Streams Console has been improved to support custom templates for hyperlink targets in the hover boxes used to drill down into more detailed views of an entity such as a PE.  This allows users to set up custom dashboards with the information they find most useful for a given entity type.  The streamtool command has also been improved to provide persistent history in console mode and to allow jobs to be managed by name in addition to ID.

New and Improved Toolkits

Streams has a large number of toolkits that provide out-of-the-box functionality to ingest, process and output data of many types, interacting with many systems and performing rich and varied analytics.

New Watson Speech to Text Toolkit

In V4.2 we are adding a new Speech to Text toolkit that converts an input audio stream of speech into text output that can then be analyzed with text analytics to extract elements of interest and determine sentiment.  The toolkit embeds the IBM Watson Speech to Text capability in a Streams operator for in-line execution on a stream, without calling out to the Watson cloud service.

Inclusion of Open Source Toolkits

The IBMStreams open source community on GitHub has over 40 toolkits as well as samples, benchmarks and demos.  We have added a number of these open source toolkits to the product in V4.2, including the datetime, IoT, JDBC, JSON and Network toolkits.  The IoT toolkit provides connectivity to the Watson IoT Platform.  The Network toolkit enables network packets to be ingested, parsed and processed, including mapping IP addresses to locations and reassembling files from packets.  Toolkits that are not included in the V4.2 release can be downloaded directly from the IBMStreams community on GitHub, and we welcome contributions of new toolkits or enhancements in the community.

GeoSpatial, CyberSecurity, TEDA and DPS toolkit enhancements

Some of the existing toolkits have also been enhanced, including support for shared maps in the Geospatial toolkit, QRadar integration and algorithm improvements in the Cybersecurity toolkit, support for the Streams Hyperstate Accelerator hardware-accelerated store in the Distributed Process Store (DPS) toolkit, and a number of enhancements to the Telecommunications Event Data Analytics (TEDA) toolkit.
