InfoSphere Streams V4 is a major new release with significant advances in high availability and ease of use. This release includes a number of new features which makes Streams simpler to manage and more resilient as well as providing integration with Microsoft Excel.  This release also adds new and improved analytics and connectivity including toolkits developed in our IBMStreams open source community on github.  These include support for Apache HBase and Apache Kafka as well as enhanced Geospatial and Timeseries analytics.  Streams V4 was released on March 13th.  This post gives an overview of some of the highlights of the release and will be followed by more detailed articles and posts in the future.

Automated High Availability and Next Generation Architecture

InfoSphere Streams has had many high availability features for some time but in previous releases these could be complex to setup and require manual intervention in some cases to recover from failures.  Streams V4 is self healing with automatic recovery from failures without manual intervention. The Streams V4 release includes a new next generation architecture allowing an administrator without any specialized HA skills to quickly and easily configure Streams to be resilient and use a single console to manage multiple instances with common users and hosts.  Streams is now simpler to setup and administer, more resilient, more secure, more automatic and more dynamic.

New Streams Domain Concept

We have introduced a new Streams domain concept which is a container for Streams instances which provides a single point for configuring and managing common resources, security and instances. A domain provides a single management console for monitoring and managing the domain and all of its instances. A domain manager helps users create and manage domains.

The domain is responsible for the following:

  • Configuration : Global configuration for the Streams domain and defaults for new instances.
  • Instance management : Allow users to configure and manage instances.
  • Resources : Allow users to configure and manage the host resources available for instances in the domain.
  • Security : Users are configured and managed by the domain.  The domain is responsible for authenticating users and checking that they are authorized to perform actions against the domain and instances.
  • Public API : Provide JMX and REST apis to manage and monitor the domain and instances.

See the documentation to learn more about Streams domains.

Faster, Simpler Setup

The Configuration of management service placement has been simplified using a tag based system. The user can allocate management services to resources using simple “management” and “application” system tags or let Streams place them fully automatically based on the resources available and the requested redundancy.  More specific control is possible using system tags (audit, authentication, jmx, sws, view) for services which may require specific resources and firewall configurations.  We have also removed dependencies so that shared file system and ssh are not required and DB2 is no longer used for recovery.

 Improved Resiliency and Self Healing

Streams Management services are now more resilient with automatic recovery from failures . Recovery is always on using zookeeper and support for multiple redundant copies of services removes single points of failure.  Here are a few examples of how Streams handles failures automatically:

  • Management Service Failure : A standby copy of the service takes over as the leader if the failed service was the leader.  Restart the failed service.
  • Management Host Resource Failure : Standby copy of services takes over as the leader if any services on the failed host resource were the leader. Restart the failed services on alternative resources.  The failed Host resource will automatically rejoin the Streams domain and instance on recovery.  Management services may automatically move back to the host resource or to newly added resources with appropriate tags to balance the load and increase resiliency to additional host resource failures.
  • Application Host Resource Failure : Application PEs are automatically restarted on alternative resources.

All New Admin Console

The Streams admin console has been completely updated with a new look and feel and customizable dashboard. A single console for each domain allows users to view all the resources, instances and jobs that they have permissions for. A summary of the system health is always visible at the top of the screen. This summary allows the user to filter what system objects are shown in the console and provides context based actions on the objects like “stop instance” for a running instance. A tree based view of all system objects similar to the Streams Studio explorer is also available. Dashboard widgets can be “flipped” for graphical and tabular views of the system data.

StreamsConsolev4

 Simpler and Improved Security

Streams users no longer require an Operating System user account on the Streams Resources and ssh is also no longer required.   Streams can authenticate users against LDAP and then authorizes their actions based on the Streams permissions. The support for LDAP is more flexible to handle multi-part lookup and supports Microsoft Active Directory. Authentication and authorization checks are made for all api and tooling requests. Firewall configuration is also simplified with tag based placement of services which need to be accessed by tooling (jmx, sws, view) and configurable port ranges. Streams audit log capabilities have also been improved with a new audit service which uses configurable Log4j appenders to write the log. Audit log entries are created for all user commands and results including details like the Streams User, action (start instance, submit job etc.) the timestamp of the action and the command parameters.

Roles and Job Groups

Roles and Job Groups have been added to simplify managing permissions. User permissions can now be managed by user defined roles like “administrator” or “developer”. A role is a group of permissions that are managed by streams and are separate from user groups which are managed by LDAP or the OS. Permissions can be assigned to roles and then roles can be assigned to users or groups. Job permissions can now be managed by user defined Job Groups like “LogAnalytics” or “Ingest”. A Job is assigned to a job group when it is submitted for execution and permissions for that job are resolved based on the job group it is assigned to.

A Few More Things

The command line utility “streamtool” has been extensively updated to provide an interactive shell mode with command completion and history. This can be very useful to have open in a terminal or to pass more complex sets of commands using linux here documents. The install is now fully versioned which will simplify running multiple versions of Streams on the same cluster. Where Streams no longer requires a shared file system we have introduced application bundles and automatic provisioning to get management services and application code to the required resources to run. Applications now build to a self-contained relocatable bundle which can be submitted to a Streams instance for execution without any of the application code needing to already be on the Instance resources.

Application Resiliency

InfoSphere Streams has been able to detect Application Processing Element (PE) failures and automatically restart and reconnect the PE since its first release but this could result in tuples not being processed. We are introducing new consistent regions which allow a developer to guarantee that all data is processed in that region with a simple annotation (@consistent ) and HA compliant operators. Many systems for guaranteeing processing add an overhead to the processing of every tuple, we have taken a more lightweight approach where we periodically establish a consistent state that we can recover from on failure.  A consistent state is a point in time where all tuples for all streams in a consistent region have been fully processed by the operators in the consistent region.

SampleTopology consistentregions

 

A controller monitors for failures and manages region consistency.  Establishing consistent state is coordinated across the consistent region with tuple processing being drained and then checkpointing state.  Recoveries from failure is coordinated with the reset of operator state by suspending the submission of new tuples and resetting operator state to the last consistent state and then replaying tuples from the last consistent state.  The status of a consistent region is shown in the application graph with indications of the operators in a consistent region and when resets are occuring and retry thresholds are being reached.  Streams guarantees that a consistent region will process all data at least once as long as the operators at the start of a consistent region can replay data.  A new replay operator can be used to replay tuples for operators that cannot replay tuples themselves.  Exactly once semantics can be achieved when all operators in a consistent region have at least one of the following characteristics:

  • Can reset their state and the state of any external system they interact with to the last consistent state on reset
  • Can detect duplicates tuples being replayed since the last consistent state and do not process them again
  • Are idempotent (tuples can be processed multiple times without changing the result beyond the initial processing of the tuple)

IBM InfoSphere Streams for Microsoft Excel

The Streams for Excel feature allows an Excel user to quickly and easily identify and access streaming data, to enable analysis and visualization on continually updating data with the full power of Excel.  Streams for Excel is an Excel add-in which uses the Excel Real Time Data (RTD) functionality to make Stream data available in Excel. A simple annotation (@view) is added to the Streams source code on the stream you want to make available.  Excel users will then see the stream in the Streams Excel add-in and can drag and drop the stream (or individual attributes) onto an Excel spreadsheet and the data will automatically begin to flow into Excel.  Excel will get continuously updating data that can be used by the full  functionality of Excel including charts, formulas and cut & paste.

SpreeOpenBetaStreamsForExcel2 1024x537 Whats New in the Streams Open Beta

The add-in also allows pause and resume of the stream data within Excel without affecting the Streams application, sharing spreadsheets where the streams automatically connect when the new user opens the spreadsheet, automatic connection when a job is started and many other features.  These features will be described in more detail in a future article on the Streams for Excel Add-In.

New and Enhanced Toolkits

Open Source toolkits including support for Apache HBase and Apache Kafka

We started the IBMStreams open source community on github last year and we now have over 25 toolkits as well as samples, benchmarks and demos.  These include toolkits which had been developed as part of the product, like inet and messaging, as well as new toolkits from the community for working with mongoDB, HBase, JSON, Parquet and many more.  We included a number of enhancements and new toolkits developed in our github community in the Streams V4 release.  Support for Apache Hbase is provided by the new streamsx.hbase toolkit. Support for Apache kafka has been added to the streamsx.messaging toolkit.  Support for compressed binary files has been added to the streamsx.hdfs toolkit.  Lastly additional operators for working with http have been added to the streamsx.inet toolkit.  Toolkits which have not been included in the v4 release can be downloaded directly from the IBMStreams community on github, and we also welcome any contributions for new toolkits or enhancements in the community.

Timeseries and GeoSpatial toolkit enhancements

The Timeseries toolkit includes new operators for anomaly detection, K-Means clustering, Cross Correlation of multiple timeseries and DSP filtering. It also includes new functions for calculating the distance between two timeseries and enhancements to a number of operators.  The Geospatial toolkit includes new operators for geofencing, hangout detection, spatial routing and spatial grid indexing which make it easier to create powerful applications for location based services like targeted marketing and fraud detection.

 

Join The Discussion