Solr, available out of the box in the IBM Open Platform with Apache Hadoop, is a powerful indexing engine that can discover a schema automatically. Once your data is indexed, you can run fast queries against Solr to find insights in your data. In this blog, we share a one-line command that indexes tweets.

Tweets are JSON documents, and their schema is fairly involved (a partial schema is listed below). You could transform the tweets into Solr’s document structure yourself, load them, and then index the data, but that is time-consuming and error-prone. Alternatively, you can send the raw JSON to Solr and include some parameters with the update request. These parameters tell Solr how to split a single JSON file into multiple Solr documents and how to map JSON fields to fields in Solr’s schema. Solr discovers the data types of these fields itself and assigns the most appropriate one to each. The update request handler also adds new fields to an existing collection if they do not exist yet (as shown in this blog).

So, let’s take a look at the tweet JSON:

root
 |-- actor: struct (nullable = true)
 |    |-- displayName: string (nullable = true)
 |    |-- favoritesCount: integer (nullable = true)
 |    |-- followersCount: integer (nullable = true)
 |    |-- friendsCount: integer (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- image: string (nullable = true)
      ...
 |-- id: string (nullable = true)
 ...
 |    |-- links: array (nullable = true)
 |    |    |-- element: struct (containsNull = false)
 |    |    |    |-- href: string (nullable = true)
 |    |    |    |-- rel: string (nullable = true)
 |    |-- listedCount: integer (nullable = true)
 |    |-- location: struct (nullable = true)
 ...  
 |    |-- twitterTimeZone: string (nullable = true)
 |    |-- utcOffset: string (nullable = true)
 |    |-- verified: boolean (nullable = true)
 |-- body: string (nullable = true)
 |-- favoritesCount: integer (nullable = true)
 |-- generator: struct (nullable = true)
 |    |-- displayName: string (nullable = true)
 |    |-- link: string (nullable = true)
 |-- geo: struct (nullable = true)
 |    |-- coordinates: array (nullable = true)
 |    |    |-- element: double (containsNull = false)
 |    |-- type: string (nullable = true)
 |-- postedTime: string (nullable = true)
 ...

Mapping this to a real tweet:

[Image: sample tweet]

In the sample tweet above, /actor/displayName is “Brian Sutorius”, /postedTime is “1:29 PM – 21 Feb 2012”, /body is the tweet text itself, and /actor/favoritesCount is 2,350. A few more fields that may be interesting to include are /id, a unique identifier across all tweets; /actor/image, the profile image URL of the Twitter user; and /actor/followersCount, an indication of the user’s popularity.

Now, let’s index tweets!

Say we want to index these fields across millions of tweets. You simply issue the following curl command (one line):

curl 'http://myhost.svl.ibm.com:8983/solr/distributedtweets/update/json/docs''?split=/''&f=who:/actor/displayName''&f=posteddt:/postedTime''&f=tweet:/body''&f=tweetid:/id''&f=pic:/actor/image''&f=fc:/actor/followersCount''&f=favc:/actor/favoritesCount' -H 'Content-type:application/json' --data-binary @/mylocaldir/twitter_data/2015/08/31/2015_08_31_03_10_activity.json 

Let’s break it down:

curl 'http://myhost.svl.ibm.com:8983/solr/distributedtweets/update/json/docs'
'?split=/'
'&f=who:/actor/displayName'
'&f=posteddt:/postedTime'
'&f=tweet:/body'
'&f=tweetid:/id'
'&f=pic:/actor/image'
'&f=fc:/actor/followersCount'
'&f=favc:/actor/favoritesCount'
-H 'Content-type:application/json'
--data-binary @/mylocaldir/twitter_data/2015/08/31/2015_08_31_03_10_activity.json
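
When the list of fields grows, it can be easier to assemble the same request URL programmatically. The following is a minimal Python sketch, not part of the original one-liner: the helper name build_update_url is hypothetical, and it only builds the URL string without sending anything to Solr.

```python
from urllib.parse import quote

# Solr field name -> JSON path in the tweet, as used in the curl command.
FIELD_MAP = {
    "who": "/actor/displayName",
    "posteddt": "/postedTime",
    "tweet": "/body",
    "tweetid": "/id",
    "pic": "/actor/image",
    "fc": "/actor/followersCount",
    "favc": "/actor/favoritesCount",
}


def build_update_url(base_url, collection, field_map, split="/"):
    """Return the /update/json/docs URL with the split and f= parameters."""
    # Leave '/' and ':' unescaped so the result reads like the hand-written URL.
    params = ["split=" + quote(split, safe="/")]
    for solr_field, json_path in field_map.items():
        params.append("f=" + quote(f"{solr_field}:{json_path}", safe="/:"))
    return f"{base_url}/solr/{collection}/update/json/docs?" + "&".join(params)


url = build_update_url("http://myhost.svl.ibm.com:8983", "distributedtweets", FIELD_MAP)
print(url)
```

The printed URL can be passed to curl in place of the quoted pieces, together with the same -H and --data-binary options.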

distributedtweets is the name of the index, which Solr calls a collection (or a core, when Solr is not running in SolrCloud mode). You can create the collection with the following command:

./bin/solr create_collection -c distributedtweets -d data_driven_schema_configs -shards 1 -replicationFactor 1

?split=/ tells the indexer to split the incoming JSON at '/', the root, so that each top-level JSON object becomes one Solr document. The f= parameters that follow map Solr fields to the tweet’s JSON fields: 'who', 'posteddt', 'tweet', 'tweetid', 'pic', 'fc', and 'favc' map to '/actor/displayName', '/postedTime', '/body', '/id', '/actor/image', '/actor/followersCount', and '/actor/favoritesCount', respectively.
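
To make the mapping concrete, here is a rough Python sketch of what the f= parameters do to a single tweet. It only illustrates the idea and is not Solr’s implementation: the helpers resolve_path and map_tweet are hypothetical, the sample tweet is made up to match the shape of the schema, and skipping missing paths is an assumption of this sketch.

```python
import json

# Solr field name -> JSON path, mirroring the f= parameters.
FIELD_MAP = {
    "who": "/actor/displayName",
    "posteddt": "/postedTime",
    "tweet": "/body",
    "tweetid": "/id",
    "pic": "/actor/image",
    "fc": "/actor/followersCount",
    "favc": "/actor/favoritesCount",
}


def resolve_path(doc, path):
    """Follow a path such as '/actor/displayName' through nested dicts."""
    node = doc
    for key in path.strip("/").split("/"):
        if not isinstance(node, dict) or key not in node:
            return None  # the path is absent in this tweet
        node = node[key]
    return node


def map_tweet(tweet, field_map):
    """Build a flat dict of Solr field -> value, skipping missing paths."""
    flat = {}
    for solr_field, json_path in field_map.items():
        value = resolve_path(tweet, json_path)
        if value is not None:
            flat[solr_field] = value
    return flat


# Made-up sample line shaped like the tweet schema; not real data.
sample_line = json.dumps({
    "id": "tag:search.twitter.com,2005:1",
    "body": "hello solr",
    "postedTime": "2015-08-31T03:10:00.000Z",
    "actor": {"displayName": "Example User", "followersCount": 10, "favoritesCount": 3},
})

doc = map_tweet(json.loads(sample_line), FIELD_MAP)
print(doc)
```

The nested tweet collapses into a flat document that carries only the mapped fields, which is the shape the search results take later on.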

-H 'Content-type:application/json' is the extra header included in the curl request; it tells Solr that the request body is JSON.

And finally, the JSON file, /mylocaldir/twitter_data/2015/08/31/2015_08_31_03_10_activity.json, must contain one tweet per line (this is important), with entries such as:

{"id":"tag:search.twitter.com,2005:628827778926800896","objectType":"activity","actor":{"objectType":"person","id":"id:twitter.com:173153898"...
{"id":"tag:search.twitter.com,2005:628827774724214784","objectType":"activity","actor":{"objectType":"person","id":"id:twitter.com:2896787807"...
{"id":"tag:search.twitter.com,2005:628827773897850880","objectType":"activity","actor":{"objectType":"person","id":"id:twitter.com:2701721294"...
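
Because the one-tweet-per-line layout matters, it can be worth checking a file before uploading it. The following is a minimal Python sketch under that assumption; the helper name find_bad_lines is hypothetical, and the example runs on a small in-memory sample rather than the real file.

```python
import io
import json


def find_bad_lines(lines):
    """Return 1-based numbers of non-empty lines that are not one JSON object."""
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # blank lines are ignored
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            bad.append(number)
            continue
        if not isinstance(parsed, dict):
            bad.append(number)  # valid JSON, but not an object
    return bad


# Made-up sample: the second line is missing its closing brace.
sample = io.StringIO(
    '{"id": "a", "body": "first"}\n'
    '{"id": "b", "body": "second"\n'
    '{"id": "c", "body": "third"}\n'
)
bad_lines = find_bad_lines(sample)
print(bad_lines)
```

For a real file, the same function accepts an open file handle, for example the result of open(path, encoding="utf-8").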

When you run the one-line curl command, you should see output in your terminal similar to the following:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1499M    0    46  100 1499M      0  20.3M  0:01:13  0:01:13 --:--:-- 18.0M
{"responseHeader":{"status":0,"QTime":73571}}
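
The last line of the output is Solr’s JSON response. A status of 0 means the request succeeded, and QTime is reported in milliseconds, so 73571 is roughly 73.6 seconds, in line with the 0:01:13 that curl reports. If you script the upload, a small check like this Python sketch can confirm the result (the function name is hypothetical):

```python
import json


def summarize_update_response(raw):
    """Return (succeeded, elapsed seconds) from a Solr update response."""
    header = json.loads(raw)["responseHeader"]
    # status 0 signals success; QTime is in milliseconds.
    return header["status"] == 0, header["QTime"] / 1000.0


ok, seconds = summarize_update_response('{"responseHeader":{"status":0,"QTime":73571}}')
print(ok, seconds)
```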

OK, tweets are indexed, let’s search!

The Solr fields 'who', 'posteddt', 'tweet', 'tweetid', 'pic', 'fc', and 'favc' are now usable in your Solr queries. For example, the following request searches all tweets for those containing both “star” and “wars”:

http://myhost.svl.ibm.com:8983/solr/distributedtweets/select?q=star+AND+wars&wt=json&indent=true&rows=100
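
If you build such query URLs from code, the parameters need to be URL-encoded (the space in "star AND wars" becomes "+"). Here is a minimal Python sketch using only the standard library; the helper name build_select_url is hypothetical.

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, query, rows=100):
    """Return a /select URL asking for indented JSON output."""
    params = {"q": query, "wt": "json", "indent": "true", "rows": rows}
    # urlencode escapes each value and joins the pairs with '&'.
    return f"{base_url}/solr/{collection}/select?" + urlencode(params)


url = build_select_url("http://myhost.svl.ibm.com:8983", "distributedtweets", "star AND wars")
print(url)
```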

The search returns results in JSON format, indented, and limited to the first 100 rows:

{
  "responseHeader":{
    "status":0,
    "QTime":5,
    "params":{
      "q":"star AND wars",
      "indent":"true",
      "rows":"100",
      "wt":"json"}},
  "response":{"numFound":16230,"start":0,"docs":[
      {
        "tweetid":"tag:search.twitter.com,2005:628243496017768449",
        "tweet":"Star Wars Lego Movie - Star Wars Lego Movie Lego star wars – etc etc.. http://t.co/jtJrPtMXIV",
        "who":"bike speed",
        "pic":"https://pbs.twimg.com/profile_images/1413706949/01052011243_normal.jpg",
        "posteddt":"2015-08-03T16:38:16Z",
        "fc":309,
        "favc":168,
        "id":"278536b6-801e-4d7f-a22a-7d1558c6b00a",
        "_version_":1515704704655425539},
...
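
A response like this is easy to consume from a script. The following Python sketch pulls the total hit count and a few mapped fields out of a response; the function name summarize_hits is hypothetical, and the input is a trimmed copy of the response above (one document, with shortened tweet text).

```python
import json

# Trimmed copy of the search response: one document, tweet text shortened.
raw_response = json.dumps({
    "responseHeader": {"status": 0, "QTime": 5},
    "response": {
        "numFound": 16230,
        "start": 0,
        "docs": [
            {
                "tweetid": "tag:search.twitter.com,2005:628243496017768449",
                "tweet": "Star Wars Lego Movie",
                "who": "bike speed",
                "posteddt": "2015-08-03T16:38:16Z",
                "fc": 309,
                "favc": 168,
            }
        ],
    },
})


def summarize_hits(raw):
    """Return (total matches, list of (who, followers, tweet) tuples)."""
    response = json.loads(raw)["response"]
    # .get() tolerates documents that lack one of the mapped fields.
    rows = [(d.get("who"), d.get("fc"), d.get("tweet")) for d in response["docs"]]
    return response["numFound"], rows


total, rows = summarize_hits(raw_response)
print(total, rows)
```

Note that numFound counts every match in the index, while docs holds only the rows that were requested.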

More…

Thanks to Carita Ou for contributing to this blog!

We are working on more blogs on Solr in the areas of indexing HDFS files, tuning for performance using shards, and complex Solr queries to find insights in your data. Stay tuned!
