IBM has recently made a series of announcements showing its commitment to Big Data communities, including the Open Data Platform (ODP) and Spark. As part of that commitment, Streams is adding a number of new capabilities that make it easier to use Streams within your Big Data environment. These include the ability to write Streams applications in Java, use Spark MLlib analytics in Streams, and run Streams in an Apache Hadoop YARN cluster.
Streams applications in Java
The Java Application API enables a developer to create streaming applications for IBM Streams entirely in Java. The Java API is part of the topology toolkit, which is intended to let developers build streaming topologies (applications) for IBM Streams in different programming languages. Java is the first language supported, and we are looking at other languages such as Python and Scala for the future. The API employs a functional style of programming, allowing a graph’s flow and its data manipulation to be defined together. The excerpt below, from the sample grep application, builds a simple topology that watches a specified directory for files, reads each file, and outputs the lines that contain the search term. As each file is added to the directory, the application reads it and outputs the matching lines.
Topology topology = new Topology("Grep");
TStream<String> filePaths = FileStreams.directoryWatcher(topology, directory);
TStream<String> lines = FileStreams.textFileReader(filePaths);
TStream<String> matching = StringStreams.contains(lines, term);
matching.print();
As you can see, a Java developer can build the application without knowing the IBM Streams Processing Language, SPL. Even though they are not written in SPL, Java applications can use the rich suite of analytics and connectors provided by Streams toolkits. In addition to the functions provided by the API and toolkits, you can easily apply your own Java analytics to the stream, which lets you share common analytics across streaming and batch applications. The new streamsx.topology toolkit is available now. Continued development will take place in our IBMStreams open source community on GitHub, allowing both us and the community to evolve this capability rapidly.
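To illustrate the idea of sharing analytics between batch and streaming code, here is a minimal sketch in plain Java. The class name SharedAnalytics and the helpers containsTerm and grepBatch are hypothetical and are not part of the toolkit; the example uses only the standard library and exercises the batch side. In a topology application, the same matching logic could presumably be supplied to a stream filter, but the exact functional interfaces should be checked against the streamsx.topology Javadoc.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical helper class for illustration; not part of streamsx.topology.
public class SharedAnalytics {

    // The "analytic": a plain Java predicate that any caller can reuse.
    public static Predicate<String> containsTerm(String term) {
        return line -> line.contains(term);
    }

    // Batch use: apply the predicate to data at rest (an in-memory list).
    public static List<String> grepBatch(List<String> lines, String term) {
        return lines.stream()
                .filter(containsTerm(term))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("alpha stream", "beta batch", "gamma stream");
        // Prints the lines that contain the search term.
        System.out.println(grepBatch(lines, "stream"));
    }
}
```

Keeping the matching logic in one small function means the batch job and the streaming application cannot drift apart in how they decide what counts as a match.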
Spark MLlib toolkit
MLlib is Apache Spark’s scalable machine learning library. The new Streams SparkMLlib toolkit allows a developer to run Spark’s machine learning analytics on Streams. The toolkit wraps analytics from Spark MLlib and applies them to a stream in IBM Streams. Models can be built in Spark and then executed in IBM Streams to analyze the data on the stream. This allows a developer to use the same analytics on data at rest and on data in motion, with the real-time capabilities of Streams. The new streamsx.SparkMLlib toolkit is available now and is being developed in the open in our IBMStreams open source community on GitHub.
Support for the YARN Resource Manager
Hadoop 2.0 included the new resource manager YARN, which allows a Hadoop cluster to run different application frameworks. YARN lets application frameworks request cluster resources to execute tasks and release those resources when the tasks are finished. The application framework uses containers to ensure that cluster resources can be shared, while preventing applications from consuming resources that have not been allocated to them. This allows a Hadoop cluster to be used by different workloads with centralized management and control. Streams has supported YARN since V3.2.1, but we have improved the support so that Streams can be more tightly integrated with resource managers. A Streams domain can be created in a YARN cluster, with all Streams host requests being handled through YARN. Streams allows host tags to be mapped to YARN container configurations so that a Streams host can request the type of resources it needs. For example, a request might ask for 2 resources with 4 cores and 32 GB of memory and 2 resources with 16 cores and 64 GB of memory. The latest YARN resource manager is provided in the IBMStreams resourceManager repository and is also being developed in the open in our IBMStreams open source community on GitHub. YARN is the first resource manager supported, and we are looking at supporting other resource managers such as Mesos in the future.
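The mapping from host tags to container configurations can be pictured as a small lookup table. The sketch below is purely illustrative: the HostTagMapping and ContainerSpec classes and the tag names "small" and "large" are hypothetical and do not reflect the actual Streams or YARN configuration format. The core and memory figures come from the example in the text, and the sketch assumes they are per resource.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of a host-tag-to-container mapping;
// this is not the real Streams or YARN configuration format.
public class HostTagMapping {

    // Illustrative description of the containers requested for one host tag.
    public static final class ContainerSpec {
        final int count;     // number of resources requested
        final int cores;     // cores per resource (assumed per-resource figure)
        final int memoryGb;  // memory per resource in GB (assumed per-resource figure)

        ContainerSpec(int count, int cores, int memoryGb) {
            this.count = count;
            this.cores = cores;
            this.memoryGb = memoryGb;
        }
    }

    // Builds the mapping for the example in the text; tag names are made up.
    public static Map<String, ContainerSpec> exampleMapping() {
        Map<String, ContainerSpec> mapping = new LinkedHashMap<>();
        mapping.put("small", new ContainerSpec(2, 4, 32));
        mapping.put("large", new ContainerSpec(2, 16, 64));
        return mapping;
    }

    // Total cores that would be requested across all tags.
    public static int totalCores(Map<String, ContainerSpec> mapping) {
        int total = 0;
        for (ContainerSpec spec : mapping.values()) {
            total += spec.count * spec.cores;
        }
        return total;
    }

    // Total memory in GB that would be requested across all tags.
    public static int totalMemoryGb(Map<String, ContainerSpec> mapping) {
        int total = 0;
        for (ContainerSpec spec : mapping.values()) {
            total += spec.count * spec.memoryGb;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, ContainerSpec> mapping = exampleMapping();
        System.out.println("cores=" + totalCores(mapping)
                + " memoryGb=" + totalMemoryGb(mapping));
    }
}
```

Under the per-resource assumption, this request adds up to 40 cores and 192 GB of memory, which is the amount the cluster would need to have free to satisfy it in full.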