Collections

A collection is an IBM® Streams Processing Language (SPL) component that can be used to represent complex arrangements of streaming data. Collections are supported by built-in SPL functions to access and manipulate the collection contents.

Quick links

Overview
The list collection
The set collection
The map collection
Working with collections
Using primitive operators and collections

Overview

The following collection types are supported:

Set
Map
List

Collections are further extended to represent complex nesting of composite types, including lists of lists. In other words, their elements and, in the case of maps, even their keys can be of any type, including other composite types. Empty list, set, and map literals can occur only where their type is clear from the context: casts, variable initializers, assignments, and typed operator parameters.

As with primitive types such as strings, SPL implements collection types of bounded length. Bounded-length variants follow the same rules as bounded strings. For example, list<int32>[2] can store a maximum of two integers. You can declare bounded-size types with unbounded elements, such as list<rstring>[3], though doing so does not offer the same marshalling optimization opportunities. SPL limits all bounded or unbounded collections to a maximum of 2³¹-1 elements. For types with bounded size, the bounds must be compile-time constants.

You can also declare bounded collections that contain unbounded types, such as a bounded list of rstring. From a developer perspective, there is little difference between bounded and unbounded types. However, there are some compile-time optimization advantages when bounded types are used.

The literal syntax for list values and map values is the same as in Python or JavaScript (including JSON), and the literal syntax for set values is similar to Python or JavaScript.

listLiteral ::= ‘[' expr^*, ‘]'                   # for example, [0, x, x+1, x+2]
mapLiteral  ::= ‘{' ( expr ‘:' expr )^*, ‘}'      # for example, { "Mon": −1, "Fri": 1 }
setLiteral  ::= ‘{ ' expr^*, ‘ }'                 # for example, { "IBM", "GOOG"}

Composite types have their own operators. For more information, see Expression language. You can subscript a list or map with l[i], check membership in a collection with "x in s", iterate over a collection with "for(T x in s) ...", and so on. There are also various functions to work on collections; for example, setUnion(set1, set2) or setIntersection(set1, set2).For maps, the left operand of the in operator is the key. In other words, "Oz" in m checks whether "Oz" is a key of the map m. Value (as opposed to key) membership tests use functions, not the in operator.

Note: Lists are implemented with arrays, but unlike the C language, SPL uses static or dynamic checks to protect against out-of-bounds accesses. Sets and maps are implemented with hash tables. They are unordered and support constant-time lookup.

The list collection

Lists are sequence containers with their contents ordered linearly. Lists allow random, zero-based access to their contents. In simple terms, they can be thought of as dynamic arrays, allowing access to individual list items; but at the same time, they allow for dynamic allocation of new entries where the list is expanded or contracted as needed.

One more advantage over traditional arrays is that lists in SPL provide for tighter boundary checking, ensuring that requests to access list elements beyond the list boundary are prohibited.

The list collection type has the constructors that are shown in the following table, where T is the type of the elements, and n is the size for bounded lists.

Constructor	Description
`list<`T`>`	unbounded
`list<`T`>[`n`]`	bounded

The following example of a list of fruits shows a one-dimensional list that contains six rstring elements.

The following listing shows how to declare and access that simple list.

...
mutable rstring myFruit;
 
// Immutable - must be initialized upon declaration 
 
list<rstring> fruit = ["apple", "orange", "banana", "pear", "pineapple", "apple"];
myFruit = fruit[1];     // orange
...

Note: Lists don't need to be unique (note the two apple elements). Individual elements are accessed by their position index.

The set collection

Sets are unordered containers that enable the retrieval of their contents based on their value (the value of a set element is also its key). Sets store unique items (no duplicates). After an element is added to a set, it cannot be modified. However, list elements can be added or removed and the set expands and contracts as needed.

The following table shows the set collection type constructors, where T is the type of the elements, and n is the size for bounded sets.

Constructor	Description
`set<`T`>`	unbounded
`set<`T`>[`n`]`	bounded

The following example of a set of vehicles shows a simple set of five rstring elements.

The following listing shows how to declare an unordered set.

set <rstring> vehicles = {"car", "truck", "airplane", "boat", "bicycle"};

The map collection

Maps are implemented as associative containers that store elements in a key-to-value pairing. Like sets, a map collection requires that all keys are unique. Key and value items can be declared as different types with full support for nested types (even keys can be of any type, including other composite types).

The following table shows the map collection type constructors, where T is the type of the elements, and n is the size for bounded maps.

Constructor	Description
`map<`K,V`>`	unbounded
`map<`K,V`>[`n`]`	bounded

The following example of a map of months shows a map that uses both an int:rstring and an rstring:int element.

The following listing shows how to declare the map of months by using an int32 key.

map <int32, rstring> monthsYear = {1 : "January", 2 : "February", 3 : "March", ...};

The following example shows a map of days in each month.

The following listing shows how to declare the map of days in each month.

map <rstring, int32> daysMonth = {"January" : 31, "February" : 28, "March" : 31, ...};

Working with collections

Access operators

After you declare the members that are contained in a collection, Streams provides a number of simple access operators. The three basic mechanisms to work with collection items are as follows.

l[i] to access items that are stored in a list.
in to check membership of a collection.
for(T x in s) ... to iterate over the members of a collection.
Note: As a for-loop iterates over a collection, that collection becomes immutable for the body of the for-loop.

Note: When lists or maps are accessed by using the in operator, the operator on the left is the access key. If you want to work with the value rather than the key, you must use the appropriate SPL collection function rather than the in operator.

SPL collection functions

In addition to the access operators, the Streams Processing Language standard toolkit provides several functions that can be used to manipulate collections as part of SPL expressions. SPL functions are either declared as mutable or immutable (non-mutating). A mutating function attempts to modify the contents of a passed input parameter (a collection, in this case). A non-mutating function returns a new object (a collection) that contains the function results. The following listing shows an example.

<list T> public T concat(T values1, T values2)
// Returns a new list of values concatenating the contents of values1 and values2
 
<list T> public void concatM(mutable T values1, T values2)
// Appends values2 to the end of values1

The functions cover a wide range of capabilities that cover basic element access, find functions, sorting, and set-specific manipulation. The following table gives a detailed view of the functions and their application to the various collection types.

Group	Function	Description	List	Set	Map
Adding, joining, removing	`appendM`	Append an item to a list	X
	`clearM`	Clear (empty) a collection	X	X	X
	`concat / concatM`	Concatenate elements.	X	X
	`insert / insertM`	Insert new elements	X	X	X
	`remove / removeM`	Remove elements	X	X	X
Manipulation	`reverse / reverseM`	Reverse list of values	X
Manipulation	`makeSequence`	Make an element sequence	X
Element access	`at`	Access multiple elements	X
	`slice`	Extract elements	X
	`selectFromList`	Select elements from two lists	X
Finding elements	`find`	Find matching values	X
	`has`	Find whether a specified value exists in a collection	X	X	X
	`findFirst`	Find the first occurrence of a matching value in a list	X
	`findFirstNotOf`	Find the first occurrence of a non-matching value in a list	X
	`findFirstOf`	Find the first occurrence of a matching value in a list	X
	`findLast`	Find the last occurrence of a matching value sequence in a list	X
	`findLastNotOf`	Find the last occurrence of a non-matching value in a list	X
	`findLastOf`	Find the last occurrence of a matching value in a list	X
	`findNotOf`	Find non-matching values in a list	X
	`findOf`	Find matching values in a list	X
Size	`size`	Get the size of a list	X	X	X
Size	`countDistinct`	Compute distinct count of a list	X
Comparison	`lexicographicalCompare`	Compare two lists in lexicographical order	X
	`pairwiseCompare`	Compare based on element	X
	`pairwiseMax`	Compare and fetch the larger	X
	`pairwiseMin`	Compare and fetch the smaller	X
Set manipulation	`setDifference`	Compute set difference		X
	`setIntersection`	Compute set intersection		X
	`setUnion`	Compute the union of two sets		X
Element sorting	`sort / sortM`	Sort value	X
Element sorting	`sortIndices`	Sort values and return indices	X
Conversion	`toSet`	Convert a list to a set	X

For more information about a specific function, see Namespace spl.collection.

Using primitive operators and collections

Streams provides seamless exchange of data between native SPL and primitive operators by using its mapping of SPL types to C++ types (normally achieved through extending common C++ classes.) This exchange is true for the standard primitive and more complex (and nested) composite types. The following table shows the mapping of SPL and C++ types. The C++ base implementation gives some guidance on the behaviors of each SPL type.

SPL type	C++ type	C++ base implementation	C++ reflective type
`list<T>`	`SPL::list<T>`	`std::vector<T>`	`SPL::List`
`set<T>`	`SPL::set<T>`	`std::tr1::unordered_set<T>`	`SPL::Set`
`map<T,K>`	`SPL::map<T,K>`	`std::tr1::unordered_map<T,K>`	`SPL::Map`
`list<T>[N]`	`SPL::blist<T,N>`	N/A	`SPL::BList`
`set<T>[N]`	`SPL::bset<T,N>`	N/A	`SPL::BSet`
`map<T,K>[N]`	`SPL::bmap<T,K,N>`	N/A	`SPL::BMap`

Example

This example demonstrates the passing of both unbounded and bounded composite types between SPL and a C++ primitive operator. The following figure shows the application flow at a high level.

The following listing shows the simple scenario of passing a nested composite type to a Streams C++ primitive operator. The type is declared as a simple, two-dimensional list of integers, which is averaged and returned as a one-dimensional list of float values. For simplicity, the initial list is populated from a file source that reads a tuple of the correct format.

composite Main {
 
     // Tuple schema definitions for re-use later
     type
          // Unbounded list of list of int32       
          listSchema2d = tuple <list<list<int32> > inputMatrix>;
 
          // Bounded list               
          listSchema1d = tuple <list<float64>[100] outputList>; 
         
     ...
 
    graph
 
     // Read a CSV file containing 100x100 matrix of random integers
     stream <listSchema2d> matrixData = FileSource() {
          param 
               format : csv;
               file : "matrix.txt";
     }
 
     // Pass the matrix data to the Primitive Operator for processing
     stream <listSchema1d> listData = myOp(matrixData) {} 
 
     // Sink the results of the Primitive Operator to a new file
     () as sinkListData = FileSink(listData) {
          param 
               format : csv;
               file : "list.txt";
     }
     ...
}

The following listing describes the next step, which iterates over the input matrix to compute the average for each row. A list that contains the average for each row is added to the output tuple..

...
     
// Tuple processing for non-mutating ports
void MY_OPERATOR::process(Tuple const & tuple, uint32_t port)
{
 
     // Define the input and output port tuples
     IPort0Type const & tp = static_cast<IPort0Type const&>(tuple);
     OPort0Type outputTuple;
 
     // Get a reference to the 2d list of list of int32
     SPL::list<SPL::list<SPL::int32> > const & inputMatrix = tp.get_inputMatrix();
     // Could also use IPort0Type::inputMatrix_type const& ....
 
     // Loop over the input matrix and calculate the average of each row
     for (uint32_t dim1=0, dim1Max = inputMatrix.size(); dim1 < dim1Max; dim1++) {
 
          SPL::list<SPL::int32> const& row = inputMatrix[dim1];            
          // Could also use IPort0Type::inputMatrix_type::value_type const& ...
         
          // Sum the data values in the row
          SPL::float64 sum = 0.0;
          for (uint32_t dim2=0, dim2Max = row.size(); dim2 < dim2Max; dim2++)
               sum += row[dim2];
 
          // Append the average of the row to the output tuple
          outputTuple.get_outputList().push_back (sum / row.size());
     }
 
     // Submit output tuple
     submit(outputTuple, 0); 
}
...