Collections

A collection is an IBM® Streams Processing Language (SPL) component that can be used to represent complex arrangements of streaming data. Collections are supported by built-in SPL functions to access and manipulate the collection contents.

Quick links

Overview

The following collection types are supported:
  • Set
  • Map
  • List
Collections are further extended to represent complex nesting of composite types, including lists of lists. In other words, their elements and, in the case of maps, even their keys can be of any type, including other composite types. Empty list, set, and map literals can occur only where their type is clear from the context: casts, variable initializers, assignments, and typed operator parameters.
Streams primitive and composite types

As with primitive types such as strings, SPL implements collection types of bounded length. Bounded-length variants follow the same rules as bounded strings. For example, list<int32>[2] can store a maximum of two integers. You can declare bounded-size types with unbounded elements, such as list<rstring>[3], though doing so does not offer the same marshalling optimization opportunities. SPL limits all bounded or unbounded collections to a maximum of 231-1 elements. For types with bounded size, the bounds must be compile-time constants.

You can also declare bounded collections that contain unbounded types, such as a bounded list of rstring. From a developer perspective, there is little difference between bounded and unbounded types. However, there are some compile-time optimization advantages when bounded types are used.

The literal syntax for list values and map values is the same as in Python or JavaScript (including JSON), and the literal syntax for set values is similar to Python or JavaScript.
listLiteral ::= ‘[' expr*,]'                   # for example, [0, x, x+1, x+2]
mapLiteral  ::= ‘{' ( expr:' expr )*,}'      # for example, { "Mon": −1, "Fri": 1 }
setLiteral  ::= ‘{ ' expr*, }'                 # for example, { "IBM", "GOOG"}

Composite types have their own operators. For more information, see Expression language. You can subscript a list or map with l[i], check membership in a collection with "x in s", iterate over a collection with "for(T x in s) ...", and so on. There are also various functions to work on collections; for example, setUnion(set1, set2) or setIntersection(set1, set2).For maps, the left operand of the in operator is the key. In other words, "Oz" in m checks whether "Oz" is a key of the map m. Value (as opposed to key) membership tests use functions, not the in operator.

Note: Lists are implemented with arrays, but unlike the C language, SPL uses static or dynamic checks to protect against out-of-bounds accesses. Sets and maps are implemented with hash tables. They are unordered and support constant-time lookup.

The list collection

Lists are sequence containers with their contents ordered linearly. Lists allow random, zero-based access to their contents. In simple terms, they can be thought of as dynamic arrays, allowing access to individual list items; but at the same time, they allow for dynamic allocation of new entries where the list is expanded or contracted as needed.

One more advantage over traditional arrays is that lists in SPL provide for tighter boundary checking, ensuring that requests to access list elements beyond the list boundary are prohibited.

The list collection type has the constructors that are shown in the following table, where T is the type of the elements, and n is the size for bounded lists.
Constructor Description
list<T> unbounded
list<T>[n] bounded

The following example of a list of fruits shows a one-dimensional list that contains six rstring elements.

A list of fruits
The following listing shows how to declare and access that simple list.
...
mutable rstring myFruit;
 
// Immutable - must be initialized upon declaration 
 
list<rstring> fruit = ["apple", "orange", "banana", "pear", "pineapple", "apple"];
myFruit = fruit[1];     // orange
...
Note: Lists don't need to be unique (note the two apple elements). Individual elements are accessed by their position index.

The set collection

Sets are unordered containers that enable the retrieval of their contents based on their value (the value of a set element is also its key). Sets store unique items (no duplicates). After an element is added to a set, it cannot be modified. However, list elements can be added or removed and the set expands and contracts as needed.

The following table shows the set collection type constructors, where T is the type of the elements, and n is the size for bounded sets.
Constructor Description
set<T> unbounded
set<T>[n] bounded

The following example of a set of vehicles shows a simple set of five rstring elements.

A set of vehicles
The following listing shows how to declare an unordered set.
set <rstring> vehicles = {"car", "truck", "airplane", "boat", "bicycle"};

The map collection

Maps are implemented as associative containers that store elements in a key-to-value pairing. Like sets, a map collection requires that all keys are unique. Key and value items can be declared as different types with full support for nested types (even keys can be of any type, including other composite types).

The following table shows the map collection type constructors, where T is the type of the elements, and n is the size for bounded maps.
Constructor Description
map<K,V> unbounded
map<K,V>[n] bounded
The following example of a map of months shows a map that uses both an int:rstring and an rstring:int element.
A map of days in each month
The following listing shows how to declare the map of months by using an int32 key.
map <int32, rstring> monthsYear = {1 : "January", 2 : "February", 3 : "March", ...};
The following example shows a map of days in each month.
A map of months
The following listing shows how to declare the map of days in each month.
map <rstring, int32> daysMonth = {"January" : 31, "February" : 28, "March" : 31, ...};

Working with collections

Access operators
After you declare the members that are contained in a collection, Streams provides a number of simple access operators. The three basic mechanisms to work with collection items are as follows.
  • l[i] to access items that are stored in a list.
  • in to check membership of a collection.
  • for(T x in s) ... to iterate over the members of a collection.
    Note: As a for-loop iterates over a collection, that collection becomes immutable for the body of the for-loop.
Note: When lists or maps are accessed by using the in operator, the operator on the left is the access key. If you want to work with the value rather than the key, you must use the appropriate SPL collection function rather than the in operator.
SPL collection functions
In addition to the access operators, the Streams Processing Language standard toolkit provides several functions that can be used to manipulate collections as part of SPL expressions. SPL functions are either declared as mutable or immutable (non-mutating). A mutating function attempts to modify the contents of a passed input parameter (a collection, in this case). A non-mutating function returns a new object (a collection) that contains the function results. The following listing shows an example.
<list T> public T concat(T values1, T values2)
// Returns a new list of values concatenating the contents of values1 and values2
 
<list T> public void concatM(mutable T values1, T values2)
// Appends values2 to the end of values1
The functions cover a wide range of capabilities that cover basic element access, find functions, sorting, and set-specific manipulation. The following table gives a detailed view of the functions and their application to the various collection types.
Group Function Description List Set Map
Adding, joining, removing appendM Append an item to a list X    
clearM Clear (empty) a collection X X X
concat / concatM Concatenate elements. X X  
insert / insertM Insert new elements X X X
remove / removeM Remove elements X X X
Manipulation reverse / reverseM Reverse list of values X    
makeSequence Make an element sequence X    
Element access at Access multiple elements X    
slice Extract elements X    
selectFromList Select elements from two lists X    
Finding elements find Find matching values X    
has Find whether a specified value exists in a collection X X X
findFirst Find the first occurrence of a matching value in a list X    
findFirstNotOf Find the first occurrence of a non-matching value in a list X    
findFirstOf Find the first occurrence of a matching value in a list X    
findLast Find the last occurrence of a matching value sequence in a list X    
findLastNotOf Find the last occurrence of a non-matching value in a list X    
findLastOf Find the last occurrence of a matching value in a list X    
findNotOf Find non-matching values in a list X    
findOf Find matching values in a list X    
Size size Get the size of a list X X X
countDistinct Compute distinct count of a list X    
Comparison lexicographicalCompare Compare two lists in lexicographical order X    
pairwiseCompare Compare based on element X    
pairwiseMax Compare and fetch the larger X    
pairwiseMin Compare and fetch the smaller X    
Set manipulation setDifference Compute set difference   X  
setIntersection Compute set intersection   X  
setUnion Compute the union of two sets   X  
Element sorting sort / sortM Sort value X    
sortIndices Sort values and return indices X    
Conversion toSet Convert a list to a set X    
For more information about a specific function, see Namespace spl.collection.

Using primitive operators and collections

Streams provides seamless exchange of data between native SPL and primitive operators by using its mapping of SPL types to C++ types (normally achieved through extending common C++ classes.) This exchange is true for the standard primitive and more complex (and nested) composite types. The following table shows the mapping of SPL and C++ types. The C++ base implementation gives some guidance on the behaviors of each SPL type.
SPL type C++ type C++ base implementation C++ reflective type
list<T> SPL::list<T> std::vector<T> SPL::List
set<T> SPL::set<T> std::tr1::unordered_set<T> SPL::Set
map<T,K> SPL::map<T,K> std::tr1::unordered_map<T,K> SPL::Map
list<T>[N] SPL::blist<T,N> N/A SPL::BList
set<T>[N] SPL::bset<T,N> N/A SPL::BSet
map<T,K>[N] SPL::bmap<T,K,N> N/A SPL::BMap

Example

This example demonstrates the passing of both unbounded and bounded composite types between SPL and a C++ primitive operator. The following figure shows the application flow at a high level.
Application flow
The following listing shows the simple scenario of passing a nested composite type to a Streams C++ primitive operator. The type is declared as a simple, two-dimensional list of integers, which is averaged and returned as a one-dimensional list of float values. For simplicity, the initial list is populated from a file source that reads a tuple of the correct format.
composite Main {
 
     // Tuple schema definitions for re-use later
     type
          // Unbounded list of list of int32       
          listSchema2d = tuple <list<list<int32> > inputMatrix>;
 
          // Bounded list               
          listSchema1d = tuple <list<float64>[100] outputList>; 
         
     ...
 
    graph
 
     // Read a CSV file containing 100x100 matrix of random integers
     stream <listSchema2d> matrixData = FileSource() {
          param 
               format : csv;
               file : "matrix.txt";
     }
 
     // Pass the matrix data to the Primitive Operator for processing
     stream <listSchema1d> listData = myOp(matrixData) {} 
 
     // Sink the results of the Primitive Operator to a new file
     () as sinkListData = FileSink(listData) {
          param 
               format : csv;
               file : "list.txt";
     }
     ...
}
The following listing describes the next step, which iterates over the input matrix to compute the average for each row. A list that contains the average for each row is added to the output tuple..
...
     
// Tuple processing for non-mutating ports
void MY_OPERATOR::process(Tuple const & tuple, uint32_t port)
{
 
     // Define the input and output port tuples
     IPort0Type const & tp = static_cast<IPort0Type const&>(tuple);
     OPort0Type outputTuple;
 
     // Get a reference to the 2d list of list of int32
     SPL::list<SPL::list<SPL::int32> > const & inputMatrix = tp.get_inputMatrix();
     // Could also use IPort0Type::inputMatrix_type const& ....
 
     // Loop over the input matrix and calculate the average of each row
     for (uint32_t dim1=0, dim1Max = inputMatrix.size(); dim1 < dim1Max; dim1++) {
 
          SPL::list<SPL::int32> const& row = inputMatrix[dim1];            
          // Could also use IPort0Type::inputMatrix_type::value_type const& ...
         
          // Sum the data values in the row
          SPL::float64 sum = 0.0;
          for (uint32_t dim2=0, dim2Max = row.size(); dim2 < dim2Max; dim2++)
               sum += row[dim2];
 
          // Append the average of the row to the output tuple
          outputTuple.get_outputList().push_back (sum / row.size());
     }
 
     // Submit output tuple
     submit(outputTuple, 0); 
}
...