To quote Solr’s webpage, “Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene.” When you’re doing searches, you want results based on current data, not yesterday’s; that’s why IBM Streams integration with Solr makes so much sense. Solr brings an easy-to-use, powerful search capability that makes it simple to explore your data, while Streams makes sure the data is there when you need it.

The Solr toolkit has started incubation with three operators:

  • SolrDocumentSink – Add Streams tuple data to your Solr collections in the form of Solr documents.
  • SolrQuery – Enhance your tuple data streams with search results from Solr.
  • LuceneStemmer – Stem words to their roots using Lucene’s stemming capabilities.

You can check out our source here. As always, we would love your feedback, contributions, and ideas; just open a GitHub issue. To learn the Solr basics, the Quick Overview is a good starting point.

SolrDocumentSink – Add and Update Data in Solr

Adding data to Solr from IBM Streams gives you the power to leverage Lucene’s near real-time indexing capabilities.

The SolrDocumentSink operator allows you to easily add documents to your Solr collections. By default, every incoming tuple is transformed into a Solr document and individually committed to Solr. For high ingest rates this can be inefficient, so we provide the option to buffer documents and send them to Solr based on buffer size or age. Here is an example where we commit to Solr every 30 tuples or every 5 seconds, whichever comes first:

stream<rstring message> ErrorPort = SolrDocumentSink(Beacon_out)
{
   param
      uniqueKeyAttribute : id ;       // tuple attribute used as the Solr unique key
      solrURL : $solrServer ;
      collection : $solrCollection ;
      documentCommitSize : 30 ;       // commit once 30 documents are buffered...
      maxDocumentBufferAge : 5000 ;   // ...or once the buffer is 5000 ms old, whichever comes first
}

The SolrDocumentSink operator also gives you the ability to update existing documents. When updating documents, the default action is to overwrite existing fields. However, we also give you the option to set, add, remove, remove by regex, or increment fields on a per-attribute basis. This is done by providing an atomicUpdateMap.

Solr Document Update Example: 

To understand this better, let’s look at an illustrative example. We have document “Tech-123-456”, which currently contains the id, name, features, and popularity fields:

<doc>
  <str name="id">Tech-123-456</str>
  <str name="name">42" LCD TV</str>
  <arr name="features">
     <str>Internet</str>
     <str>Webcam</str>
  </arr>
  <int name="popularity">20</int>
</doc>

Let’s assume we want to make the following changes to the document:

  • Set the name to: 42" LCD Smart TV
  • Add the feature: Voice Command Enabled
  • Increment the popularity by 20

The tuple we send to the SolrDocumentSink operator will look like this:

{
 id = "Tech-123-456",
 name = "42\" LCD Smart TV",
 features = "Voice Command Enabled",
 popularity = 20,
 atomicUpdateMap = { "name" : "set", "features" : "add", "popularity" : "inc" }
}

*Notice that the id field is not included in the atomicUpdateMap attribute.
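For reference, a stream type that could carry this tuple might look like the sketch below. The attribute names mirror the example above, but the exact types the toolkit expects (in particular for atomicUpdateMap) are an assumption here, so check the sample for the real schema:

// Sketch of a stream type for the update tuple above.
// Assumption: atomicUpdateMap is a map<rstring,rstring> attribute; verify against the sample.
type UpdateTuple = tuple<
   rstring id,                            // unique key; never listed in atomicUpdateMap
   rstring name,                          // "set" – overwrite the existing value
   rstring features,                      // "add" – append to the multi-valued field
   int32 popularity,                      // "inc" – increment by the tuple value
   map<rstring, rstring> atomicUpdateMap  // field name -> update action
> ;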

After the update, the document will look like this:

<doc>
  <str name="id">Tech-123-456</str>
  <str name="name">42" LCD Smart TV</str>
  <arr name="features">
     <str>Internet</str>
     <str>Webcam</str>
     <str>Voice Command Enabled</str>
  </arr>
  <int name="popularity">40</int>
</doc>

For more SolrDocumentSink examples, look at the sample and the tests.

SolrQuery – Searching Solr

The SolrQuery operator lets you take advantage of Solr’s powerful search capabilities. The operator itself is quite simple: each incoming tuple carries a solr_query attribute containing the query to run against a specific Solr collection. To get started with Solr’s query syntax, take a look at the CommonQueryParameters Wiki. To help simplify the queries that you need to pass in, we provide three helper parameters:

  • numberOfRows – How many rows to return in a query response.
  • omitHeader – Whether or not to omit the response header. Default is true.
  • responseFormat – Format of query response: XML or JSON (Solr default: XML).

Once you get more comfortable with Solr queries, things get really interesting when you start using faceting and highlighting. The response to your Solr query is output via a solr_response rstring/ustring attribute in XML or JSON format.
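For instance, a query that counts documents per category and highlights matches in the name field could be passed in like this (facet, facet.field, hl, and hl.fl are standard Solr parameters; the cat and name fields come from the techproducts example used below):

q=memory&facet=true&facet.field=cat&hl=true&hl.fl=name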

Solr Query Example:

Let us assume that we have a Solr collection called techproducts with the following characteristics:

  1. The collection is full of documents containing tech products and the fields defining them (id, name, category, popularity, etc.).
  2. The popularity of products is constantly changing based on the sales rate over the last hour.

Our goal is to maintain a list of the 5 most popular products in our Solr collection from the electronics category. We want to know the id, name, and popularity of each item.

Here is the query we will need: q=*:*&rows=5&sort=popularity desc&fq=cat:electronics&fl=id,name,popularity

Breaking apart the query:

  • q=*:*&rows=5&sort=popularity desc – Main query (q=) that matches all documents (*:*), returns 5 rows (rows=5), and sorts them in descending order based on popularity (sort=popularity desc)
  • fq=cat:electronics – Filter query (fq=) that limits results to documents in the electronics category (cat:electronics)
  • fl=id,name,popularity – Field list (fl=) that specifies that the only fields we return from each document are the id, name, and popularity (id,name,popularity)

To try this example out, you will need to start Solr locally with the techproducts example, using the following command:

$ bin/solr start -p 8983 -e techproducts

The best way to quickly test your queries and get a feel for Solr’s query syntax is to point a web browser at your Solr server and start playing. For the query above, the full browser URL will look like this: http://localhost:8983/solr/techproducts/select/?q=*:*&rows=5&sort=popularity%20desc&fq=cat:electronics&fl=id,name,popularity
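If you prefer the command line, the same query can be issued with curl (the quotes keep your shell from interpreting the & characters):

$ curl "http://localhost:8983/solr/techproducts/select/?q=*:*&rows=5&sort=popularity%20desc&fq=cat:electronics&fl=id,name,popularity"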

Once you understand the query side, the actual Streams code is quite simple. In the following SPL, a beacon sends the query to the SolrQuery operator every 60 seconds, updating our popularity list every minute. Notice that the numberOfRows parameter is used to limit the number of results to 5.

 stream<rstring solr_query> QueryBeacon = Beacon()
 {
    param
       period : 60.0 ;
    output
       QueryBeacon : solr_query = "q=*:*&sort=popularity%20desc&fq=cat:electronics&fl=id,name,popularity" ;
 }
 
 //URL to do same query: http://localhost:8983/solr/techproducts/select/?q=*:*&rows=5&sort=popularity%20desc&fq=cat:electronics&fl=id,name,popularity
 stream<rstring solr_response> SolrQueryResponse = SolrQuery(QueryBeacon)
 {
    param 
       solrURL : $solrServer;
       collection : "techproducts";
       numberOfRows : 5; 
 }
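If you switch the response format to JSON via the responseFormat parameter, each solr_response tuple carries Solr’s standard JSON wrapper (with the header omitted by default). The values below are illustrative only, with just the first of the five rows shown:

{
  "response": {
    "numFound": 57,
    "start": 0,
    "docs": [
      { "id": "Tech-123-456", "name": "42\" LCD Smart TV", "popularity": 40 },
      ...
    ]
  }
}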

For more SolrQuery examples, look at the sample and the tests.

 

LuceneStemmer – Stem words to their roots

The LuceneStemmer operator lets you reduce full words to a common root. For example, “walking” and “walked” are both reduced to the stem “walk”, allowing your analytics to easily recognize them as words with similar meaning. This can be very helpful when applying rules in text processing. Besides stemming, you can also define a stop-word filter and a synonyms filter:

  • Stop-word filter – Removes any words defined in a stop-word file; if no file is specified, the default Lucene StopAnalyzer word set is used. For example, you may not want words like “a”, “is”, and “the” to be analyzed, so you filter them out. You can see the default stop-word set here.
  • Synonyms filter – Consolidates words to a base synonym using user-defined synonyms. For example, if we define “fast” to be a synonym of “quick”, then every time we observe “fast”, it is converted to “quick”. This can also be useful for auto-correcting misspellings: by defining “acheive” as a synonym of “achieve”, we can catch spelling mistakes. For an example of the synonym file format, look here.

Lucene Stemmer Example:

Let’s assume we are processing customer service text and we want to be notified of the situations where people are having issues with their new television.
That business rule might look like this: “problem” + “television”

That rule would pick up “I am having a problem with my television”, but what about “I am having trouble with my TV”? To solve this problem, we can use the synonyms filter. Our synonyms.txt file might look something like this:

Problem, Trouble, Broken
Television, Televisions, TV, TVs
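Lucene’s synonym file format also supports explicit mappings with =>, which makes the direction of the consolidation unambiguous. Assuming the toolkit uses the standard Solr synonym parser, the same file could be written as:

Trouble, Broken => Problem
Televisions, TV, TVs => Television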

Now both sentences are converted to essentially the same token stream: [i, am, have, problem, with, my, television]. This makes writing rules much easier.

Depending on the rules we create, we may know specific words we want to filter out, either to improve accuracy or simply to limit the amount of text being processed. Suppose we define a stopwords.txt file with the following words:

I 
am 
having 
with 
my

With the stop-word filter in place, the token stream for both sentences is reduced to: [problem, television]
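Putting the pieces together in SPL, an invocation might look like the sketch below. The parameter names (synonymsFile, stopwordsFile) and the output schema are assumptions for illustration only; check the LuceneStemmer sample for the operator’s actual API.

// A minimal sketch; parameter names here are assumptions, not the documented API.
stream<list<rstring> tokens> StemmedText = LuceneStemmer(CustomerServiceText)
{
   param
      synonymsFile  : "etc/synonyms.txt" ;  // hypothetical parameter: maps "trouble"/"TV" to "problem"/"television"
      stopwordsFile : "etc/stopwords.txt" ; // hypothetical parameter: drops "i", "am", "having", "with", "my"
}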

For a full Streams example, take a look at the LuceneStemmer sample.

 
