IBM Support

Index and Search Hadoop Documents with Apache Solr

Technical Blog Post


Abstract

Index and Search Hadoop Documents with Apache Solr

Body

This post applies to BigInsights releases earlier than v4.3 and/or Solr releases earlier than v5.5.

Introduction

Apache Solr is an open source search platform written in Java. Solr is scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying. However, Solr is not an analytics tool like IBM Text Analytics (i.e., SystemT) or a machine learning library (e.g., SystemML).

IBM Open Platform (IOP) is a 100% open source platform. It uses Apache Ambari for provisioning and managing Hadoop clusters. IOP contains many open source components, including Apache Solr. By default, the Apache Solr service in IOP is set up in a distributed SolrCloud configuration.

Use Case Objective

Create Solr indexes on existing HDFS documents, including CSV and binary formats (e.g. txt, csv, doc, xls, ppt, pdf).

Version Tested

  • IOP v4.x
  • Apache Solr v5.1.0, v5.2.1

Preparation / Setup (You can skip this section if there is an existing Solr core instance to use)

  1. Create a Solr core from the existing template files
    • $ su - solr
    • $ mkdir ~/test (Note: The directory name becomes your Solr instance name, and the directory stores the Solr configuration files.)
    • $ cp -R /usr/iop/current/solr-server/server/solr/configsets/data_driven_schema_configs/* ~/test/
    • $ vi ~/test/conf/solrconfig.xml
    • Modify two parameter values ("directoryFactory" and "lockType") at lines #118 and #244. The new values are shown in the screenshot below.
    • Add a new "requestHandler" parameter with class "solr.RichDocumentRequestHandler" for indexing binary documents such as PDF, Microsoft Word, and PowerPoint files.
      Note: Make sure to replace the HDFS URL with your cluster's URL and port, and append the HDFS directory where the Solr indexes will be stored, e.g. hdfs://bigivm.localdomain:8020/solr

      [Screenshots: solrconfig.xml]
  2. Choose or create an HDFS directory for storing the Solr indexes, e.g. $ su - solr -c "hadoop dfs -mkdir /solr"
  3. Register the new instance in ZooKeeper. (Start the IOP ZooKeeper service via Ambari if it is not already running.) This step needs to be performed only once for each new Solr core.
    • /usr/iop/current/solr-server/server/scripts/cloud-scripts/zkcli.sh -cmd upconfig -zkhost [host_FQDN:port] -confname [core_name] -confdir [core_configuration_location]
    • Ex. /usr/iop/current/solr-server/server/scripts/cloud-scripts/zkcli.sh -cmd upconfig -zkhost bigivm.localdomain:2181 -confname test -confdir ~/test/conf
  4. Create the new Solr core. This step needs to be performed only once for each new Solr core.
    • Start Apache Solr service if not already
    • $ solr create -c test -d ~/test/conf -n test -s 1 -rf 1
      Note: Change the shard count and replication factor to match your needs. This example creates only one shard with a replication factor of one (i.e. -s 1 -rf 1), which is not sufficient for a production environment.
    • $ cp -R ~/test/* /var/lib/solr/data/test_shard1_replica1/
    • Restart the Apache Solr service. The newly created core now shows up in the Cloud Graph.
      [Screenshot: SolrCloud web interface]
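
The new "directoryFactory" and "lockType" values in step 1 appear only in the screenshot, so it may help to check the edited file before uploading it to ZooKeeper. The sketch below is not part of the original procedure. It assumes the values follow Solr's documented HDFS support: a directoryFactory of class solr.HdfsDirectoryFactory, a lockType of hdfs, and an hdfs:// URL for the index location. The function name check_hdfs_config is hypothetical.

```shell
#!/bin/sh
# Sketch: verify that a solrconfig.xml carries the HDFS settings from step 1.
# Assumption: the expected values are those of Solr's documented HDFS support
# (solr.HdfsDirectoryFactory, lockType "hdfs", an hdfs:// index location).
check_hdfs_config() {
    conf="$1"
    rc=0
    if ! grep -q 'HdfsDirectoryFactory' "$conf"; then
        echo "missing: directoryFactory is not solr.HdfsDirectoryFactory"
        rc=1
    fi
    # Accept both a literal value and a ${solr.lock.type:hdfs} style default.
    if ! grep -Eq '<lockType>[^<]*hdfs[^<]*</lockType>' "$conf"; then
        echo "missing: lockType is not hdfs"
        rc=1
    fi
    if ! grep -q 'hdfs://' "$conf"; then
        echo "missing: no hdfs:// URL for the index location"
        rc=1
    fi
    if [ "$rc" -eq 0 ]; then
        echo "ok: $conf"
    fi
    return "$rc"
}
```

Calling check_hdfs_config ~/test/conf/solrconfig.xml prints one line per missing setting and returns a non-zero status if any check fails. It only looks for the expected strings and does not validate the XML.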

Indexing and Searching documents

  1. Index Text CSV Files
    • $ hadoop jar /tmp/hadoop-lws-job-2.0.1-0-0-hadoop2.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -DcsvDelimiter="," -DcsvFieldMapping=0=id,1=cat,2=name,3=price,4=instock,5=author -cls com.lucidworks.hadoop.ingest.CSVIngestMapper -c test -i /tmp/data2/* -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -s http://bigivm.localdomain:8983/solr
    • Search for documents with id="0812550706"
      [Screenshot: Solr search, CSV data example]
  2. Index Binary Documents via MapReduce Job
    • $ hadoop jar /tmp/hadoop-lws-job-2.0.1-0-0-hadoop2.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -c test -i /tmp/data/* -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -s http://bigivm.localdomain:8983/solr
      Notes: "-c test" specifies the core name. "-i /tmp/data/*" specifies the HDFS directory where the documents are located. "-s http://hostname_FQDN:8983/solr" is the SolrCloud URL with port. The "solr" directory is the location of the Solr indexes.
      [Screenshot: Solr indexing, HDFS example]
      [Screenshot: Solr search, binary data example]
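
The searches above are shown only as screenshots of the Solr web interface. The same lookup can also be issued from the command line. The sketch below builds a query URL for Solr's standard select handler and assumes that handler is enabled on the "test" collection. The function name solr_select_url is hypothetical, and the host and port are the example values used in this post.

```shell
#!/bin/sh
# Sketch: build a Solr select URL for a simple field:value lookup.
# Assumption: the collection exposes the standard /select request handler.
# Values containing spaces or special characters would need URL encoding,
# which this helper does not attempt.
solr_select_url() {
    base="$1"    # SolrCloud URL, e.g. http://bigivm.localdomain:8983/solr
    core="$2"    # core/collection name, e.g. test
    field="$3"   # field to match, e.g. id
    value="$4"   # value to look for, e.g. 0812550706
    # ${base%/} drops a single trailing slash so the path is not doubled.
    printf '%s/%s/select?q=%s:%s&wt=json\n' "${base%/}" "$core" "$field" "$value"
}
```

For the CSV example, curl -s "$(solr_select_url http://bigivm.localdomain:8983/solr test id 0812550706)" should return the matching document as JSON, provided the ingest job has completed and committed.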


UID

ibm16260125