IBM has recently made a series of announcements showing its commitment to Big Data communities, including the Open Data Platform (ODP) and Spark. As part of that commitment, Streams is adding a number of new capabilities that make it easier to use Streams within your Big Data environment. These include the ability to write Streams applications in Java, to use Spark MLlib analytics in Streams, and to run Streams in an Apache Hadoop YARN cluster.

Streams applications in Java

The Java Application API enables a developer to create streaming applications for IBM Streams entirely in Java. The Java API is part of the topology toolkit, which is intended to let developers build streaming topologies (applications) for IBM Streams in different programming languages. Java is the first language supported, and we are looking at other languages such as Python and Scala for the future. The API employs a functional style of programming, allowing a graph’s flow and its data manipulation to be defined together. The excerpt below, from the sample grep application, builds a simple topology that watches a specified directory for files, reads each file, and outputs the lines that contain the search term. As each file is added to the directory, the application reads it and outputs the matching lines.

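// Create the topology (application graph) named "Grep".
Topology topology = new Topology("Grep");
// Emit the path of each file that appears in the watched directory.
TStream<String> filePaths = FileStreams.directoryWatcher(topology, directory);
// Read each file and emit its contents one line at a time.
TStream<String> lines = FileStreams.textFileReader(filePaths);
// Keep only the lines that contain the search term.
TStream<String> matching = StringStreams.contains(lines, term);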

As you can see, a Java developer can build the application without needing to know SPL, the IBM Streams Processing Language. Even though they are not written in SPL, Java applications can use the rich suite of analytics and connectors provided by Streams toolkits. In addition to the functions provided by the API and the toolkits, you can easily apply your own Java analytics to the stream, which lets you share common analytics across streaming and batch applications. The new streamsx.topology toolkit is available now. Continued development will take place in our IBMStreams open source community on GitHub, allowing both us and the community to evolve this capability rapidly.
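As a rough illustration of sharing one analytic between batch and streaming code, the sketch below expresses the grep-style match as an ordinary Java function and applies it to a list of stored lines. The class and method names (SharedAnalytics, containsTerm, grepBatch) are hypothetical and are not part of the streamsx.topology toolkit; the example uses only the Java standard library and assumes that a streaming application would call the same logic once per tuple.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical helper class; not part of the streamsx.topology toolkit.
class SharedAnalytics {

    // The analytic itself: true when a line contains the search term.
    static Predicate<String> containsTerm(String term) {
        return line -> line.contains(term);
    }

    // Batch use: apply the analytic to lines that are already at rest.
    static List<String> grepBatch(List<String> lines, String term) {
        return lines.stream()
                .filter(containsTerm(term))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("error: disk full", "ok", "error: timeout");
        // A streaming application could invoke the same containsTerm logic
        // on each tuple as it arrives instead of on a stored list.
        System.out.println(grepBatch(lines, "error"));
    }
}
```

Keeping the analytic in a small standalone function like this is one way to avoid maintaining separate implementations for data at rest and data in motion.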

Spark MLlib toolkit

MLlib is Apache Spark’s scalable machine learning library. The new Streams SparkMLlib toolkit allows a developer to run Spark’s machine learning analytics on Streams. The toolkit wraps analytics from Spark MLlib and applies them to a stream in IBM Streams. Models can be built in Spark and then executed in IBM Streams to analyze the data on the stream. This allows a developer to use the same analytics on data at rest and on data in motion, with the real-time capabilities of Streams. The new streamsx.SparkMLlib toolkit is available now and is being developed in the open in our IBMStreams open source community on GitHub.
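The build-offline, score-online pattern can be sketched in plain Java. The example below is only an illustration under stated assumptions: it does not use the Spark MLlib or streamsx.SparkMLlib APIs, and the NearestCenterModel class is a hypothetical stand-in for a model trained elsewhere (here, a set of cluster centers such as a k-means run might produce). Each incoming tuple is scored by assigning it to the closest center.

```java
// Hypothetical stand-in for a model trained offline (for example, k-means centers).
// It is not part of Spark MLlib or the streamsx.SparkMLlib toolkit.
class NearestCenterModel {
    private final double[][] centers;

    NearestCenterModel(double[][] centers) {
        this.centers = centers;
    }

    // Score one tuple: return the index of the closest center
    // by squared Euclidean distance.
    int predict(double[] point) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centers.length; i++) {
            double dist = 0.0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centers[i][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Centers assumed to have been computed ahead of time on data at rest.
        NearestCenterModel model =
                new NearestCenterModel(new double[][] {{0.0, 0.0}, {10.0, 10.0}});
        // Tuples arriving one at a time, as they would on a stream.
        double[][] incoming = {{1.0, 2.0}, {9.0, 8.5}};
        for (double[] tuple : incoming) {
            System.out.println(model.predict(tuple));
        }
    }
}
```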

Support for the YARN Resource Manager

Hadoop 2.0 introduced the YARN resource manager, which allows a Hadoop cluster to run different application frameworks. YARN lets application frameworks request cluster resources to execute tasks and release those resources when the tasks finish. Containers ensure that cluster resources can be shared, while preventing applications from consuming resources that have not been allocated to them. This allows a Hadoop cluster to be used by different workloads with centralized management and control. Streams has supported YARN since V3.2.1, but we have improved the support so that Streams can be more tightly integrated with resource managers. A Streams domain can be created in a YARN cluster, with all Streams host requests handled through YARN. Streams allows host tags to be mapped to YARN container configurations so that a Streams host can request the type of resources it needs. For example, a domain might request two resources with 4 cores and 32 GB of memory each and two resources with 16 cores and 64 GB of memory each. The latest YARN resource manager is provided in the IBMStreams resourceManager repository and is also being developed in the open in our IBMStreams open source community on GitHub. YARN is the first resource manager supported, and we are looking at supporting other resource managers such as Mesos in the future.
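To make the host-tag example concrete, the sketch below models the mapping as a small Java table and totals what the domain would ask YARN for. This is not the Streams or YARN configuration format; the class names (ContainerPlan, ContainerSpec) and the tag names "small" and "large" are hypothetical, chosen only to work through the arithmetic.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of host tags mapped to container sizes;
// not the actual Streams or YARN configuration format.
class ContainerPlan {

    static final class ContainerSpec {
        final int cores;
        final int memoryGb;
        final int count;

        ContainerSpec(int cores, int memoryGb, int count) {
            this.cores = cores;
            this.memoryGb = memoryGb;
            this.count = count;
        }
    }

    // The example from the text: two 4-core/32 GB hosts and two 16-core/64 GB hosts.
    static Map<String, ContainerSpec> examplePlan() {
        Map<String, ContainerSpec> plan = new LinkedHashMap<>();
        plan.put("small", new ContainerSpec(4, 32, 2));   // tag name is an assumption
        plan.put("large", new ContainerSpec(16, 64, 2));  // tag name is an assumption
        return plan;
    }

    // Total cores requested across all tags.
    static int totalCores(Map<String, ContainerSpec> plan) {
        int total = 0;
        for (ContainerSpec spec : plan.values()) {
            total += spec.cores * spec.count;
        }
        return total;
    }

    // Total memory requested across all tags, in GB.
    static int totalMemoryGb(Map<String, ContainerSpec> plan) {
        int total = 0;
        for (ContainerSpec spec : plan.values()) {
            total += spec.memoryGb * spec.count;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, ContainerSpec> plan = examplePlan();
        System.out.println(totalCores(plan) + " cores, " + totalMemoryGb(plan) + " GB");
    }
}
```

For the example in the text, the totals come to 40 cores (2 × 4 + 2 × 16) and 192 GB of memory (2 × 32 + 2 × 64).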
