Overview

Solr is a fast search engine built on top of Apache Lucene. It provides fast, scalable document indexing and powerful search capabilities. Solr 5.5.0 is available in IOP 4.2. This document walks through the basic CRUD operations for Solr collections and indexes.

Solr Terminology Simplified

[Figure: SolrCloud logical and physical concepts]

  • A Collection is a logical index made up of one or more shards.
  • A Shard is a logical slice of a collection, replicated over a number of servers.
  • Each logical Shard corresponds to a physical Core.
  • A Core is a physical index of documents. A collection can have multiple cores, where each core contains a subset of documents in the index.
  • A Replica is a copy of a core located in another node.
  • SolrCloud is a cluster of Solr servers managed by Zookeeper as a single unit.

Details on logical and physical concepts of SolrCloud can be found at Apache Solr Reference Guide: Solr Cloud.

Solr in IOP 4.2

Solr services and environment configurations are managed by Ambari. By default, Solr starts in SolrCloud mode when started from Ambari, and it is configured to read and write indexes to HDFS. Solr instances installed on additional nodes within the cluster automatically join the running SolrCloud instance when started.
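To confirm that a local Solr node is up and has joined SolrCloud, the bin/solr script's status command can be used. A quick check, assuming the install path used throughout this article:

# /usr/iop/current/solr-server/bin/solr status

The output reports each running Solr node along with its cloud status, including the ZooKeeper ensemble it is connected to.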

The LAPD Crime and Collision sample data set from catalog.data.gov will be used in the examples for this article. It is available at data.gov: https://catalog.data.gov/dataset/lapd-crime-and-collision-raw-data-for-2015.

Create Collections

A few things to know before creating a Solr collection:

  • The SOLR_INCLUDE environment script overrides the default values used by the bin/solr script. This file is located in /usr/iop/current/solr-server/conf.
  • A configset is used when creating a collection. It is a set of shared configuration files under a base directory. More about configsets comes in the next section.

To create a Solr collection named “lapd” with 2 shards and a replication factor of 2, using the data_driven_schema_configs configset, run the following command:

# SOLR_INCLUDE=/etc/solr/conf/solr.in.sh /usr/iop/current/solr-server/bin/solr create -c lapd -s 2 -d data_driven_schema_configs -rf 2

Connecting to ZooKeeper at hostname1.abc.com,hostname2.abc.com/solr ...
Uploading /usr/iop/current/solr-server/server/solr/configsets/data_driven_schema_configs/conf for config lapd to ZooKeeper at hostname1.abc.com,hostname2.abc.com/solr

Creating new collection 'lapd' using command:
http://hostname1.abc.com:8983/solr/admin/collections?action=CREATE&name=lapd&numShards=2&replicationFactor=2&maxShardsPerNode=2&collection.configName=lapd

{
  "responseHeader":{
    "status":0,
    "QTime":3152},
  "success":{"":{
      "responseHeader":{
        "status":0,
        "QTime":2971},
      "core":"lapd_shard1_replica2"}}}

The equivalent command using curl:

curl 'http://hostname1.abc.com:8983/solr/admin/collections?action=CREATE&name=lapd&numShards=2&replicationFactor=2&maxShardsPerNode=2&collection.configName=lapd'

Notice that the configset was not specified in the curl command above. By default, Solr uses data_driven_schema_configs for all new collections. When creating a collection through the Collections API, the configset must already be uploaded to ZooKeeper; refer to the upconfig command explained in the next section.

[Figure: state.json for the lapd collection]

In SolrCloud, under the newly created collection, state.json shows the current state of the collection. Two shards were created, each with two cores (one the replica of the other) distributed across the Solr nodes. The core names were automatically generated as <collection_name>_shard<n>_replica<n>.
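The same state can be inspected from the command line with the Collections API’s CLUSTERSTATUS action (a sketch, reusing the hostnames from the create command above):

curl 'http://hostname1.abc.com:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=lapd&wt=json'

This returns the shards, replicas, and their states for the lapd collection, mirroring what state.json shows in the tree view.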

ConfigSets

Solr ships with three configsets to use out of the box. These are located under /usr/iop/current/solr-server/server/solr/configsets.

  • basic_configs: This configset contains the minimal Solr configuration required for a collection.
  • data_driven_schema_configs: This configset auto-populates the schema with guessed field types.
  • sample_techproducts_configs: This configset comes with many features enabled to take advantage of Solr’s power.

Create ConfigSets

In SolrCloud, when a new collection is created, the specified configset is uploaded into ZooKeeper for management and linked to the collection. Configsets in ZooKeeper may be shared among several collections by linking the same configset to another collection. Be cautious when linking the data_driven_schema_configs configset: since its schema and fields are auto-generated, it may not satisfy the requirements of the other collection.
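As a sketch of linking, the zkcli.sh script’s linkconfig command points an existing collection at a configset already stored in ZooKeeper (my_other_collection is a hypothetical collection name):

# /usr/iop/current/solr-server/server/scripts/cloud-scripts/zkcli.sh -zkhost hostname1.abc.com,hostname2.abc.com/solr -cmd linkconfig -collection my_other_collection -confname lapd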

To upload the default data_driven_schema_configs configset to ZooKeeper, run the following command:

# /usr/iop/current/solr-server/server/scripts/cloud-scripts/zkcli.sh -zkhost hostname1.abc.com,hostname2.abc.com/solr -cmd upconfig -confname default_data_driven_schema_config -confdir /usr/iop/current/solr-server/server/solr/configsets/data_driven_schema_configs


The three out-of-the-box configsets can be copied to another directory to form a new customized configset, then configured to suit your needs. When creating a new collection, pass in the location of the customized configset: -d /tmp/path/to/myconfig. The new configset can also be uploaded to ZooKeeper (independent of collection creation) via the zkcli.sh script, located at /usr/iop/current/solr-server/server/scripts/cloud-scripts/zkcli.sh. Note that this script is not the same as ZooKeeper’s zkCli.sh; it is specific to Solr. Solr also has a ConfigSets API to create, delete, and list configsets. Details and examples can be found here: https://cwiki.apache.org/confluence/display/solr/ConfigSets+API.
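For example, a new configset can be created on top of an existing one with the ConfigSets API’s CREATE action, which copies a base configset already stored in ZooKeeper (myconfig is a hypothetical name):

curl 'http://hostname1.abc.com:8983/solr/admin/configs?action=CREATE&name=myconfig&baseConfigSet=data_driven_schema_configs&wt=json'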

Read Configsets

There are two ways to see the configuration files of a configset in ZooKeeper: in the Solr Admin UI, on the left panel, navigate to Cloud > Tree > /configs/lapd; or download the files via zkcli.sh using the get or getfile command. get prints the file contents to the console, whereas getfile saves the contents into a local file.

# /usr/iop/current/solr-server/server/scripts/cloud-scripts/zkcli.sh -zkhost hostname1.abc.com,hostname2.abc.com/solr -cmd getfile /configs/lapd/managed-schema managed-schema.local
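Similarly, a sketch using the get command to print a file straight to the console, handy for a quick look at solrconfig.xml:

# /usr/iop/current/solr-server/server/scripts/cloud-scripts/zkcli.sh -zkhost hostname1.abc.com,hostname2.abc.com/solr -cmd get /configs/lapd/solrconfig.xml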

Another option is to download the entire conf directory with downconfig to get all the files at once.

# /usr/iop/current/solr-server/server/scripts/cloud-scripts/zkcli.sh -zkhost hostname1.abc.com,hostname2.abc.com/solr -cmd downconfig -confname lapd -confdir /path/to/save/directory/lapdConf.local

Update Configsets

The data_driven_schema_configs configset runs in schemaless mode and is configured to use the managed schema in solrconfig.xml:

<schemaFactory class="ManagedIndexSchemaFactory">
   <bool name="mutable">true</bool>
   <str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>

Schemaless mode allows users to automatically build an effective schema by indexing sample data. Users should not modify the schema manually; instead, the Schema REST API should be used. The schema can only be modified when mutable is set to true. Details and examples can be found here: https://cwiki.apache.org/confluence/display/solr/Schema+API. When the schema is modified with the Schema API, all previously indexed documents have to be reindexed to pick up the changes; documents indexed after the schema change will use the updated schema.
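The Schema API also exposes read-only endpoints, so the current field definitions can be fetched without touching the files in ZooKeeper. A quick sketch against the lapd collection:

curl 'http://hostname1.abc.com:8983/solr/lapd/schema/fields?wt=json'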

Two files to take note of are solrconfig.xml and managed-schema.

Delete Configsets

Similar to upconfig and downconfig, a configset can be deleted from ZooKeeper using the zkcli.sh script’s clear command. To remove the default_data_driven_schema_config configset created earlier, run the following command:

# /usr/iop/current/solr-server/server/scripts/cloud-scripts/zkcli.sh -zkhost hostname1.abc.com,hostname2.abc.com/solr -cmd clear /configs/default_data_driven_schema_config

Index Files

The LAPD Crime and Collision sample data set (from data.gov) comes in multiple formats: CSV, JSON, and XML. Any of them can be indexed with the same results. Sample data:

Date Rptd,DR. NO,DATE OCC,TIME OCC,AREA,AREA NAME,RD,Crm Cd,Crm Cd Desc,Status,Status Desc,LOCATION,Cross Street,Location 1
12/02/2015 12:00:00 AM,150126705,12/02/2015 12:00:00 AM,0150,01,Central,0145,946,OTHER MISCELLANEOUS CRIME,IC,Invest Cont,  400 S  LOS ANGELES ST,,"(34.0473, -118.2462)"
12/02/2015 12:00:00 AM,150126706,12/02/2015 12:00:00 AM,0220,01,Central,0145,330,BURGLARY FROM VEHICLE,IC,Invest Cont, LOS ANGELES, WINSTON,"(34.0467, -118.2470)"
12/02/2015 12:00:00 AM,150126763,12/02/2015 12:00:00 AM,1110,01,Central,0162,442,SHOPLIFTING - PETTY THEFT ($950 & UNDER),IC,Invest Cont, 700 W 7TH ST,,"(34.0480, -118.2577)"

Looking at this data, we can see it provides information on the area where each crime took place, the date it occurred, when it was reported, and so on. There are over 200,000 rows of data. Manually looking over each row to determine which area has the highest crime rate would be tedious and error-prone. Solr can do the majority of this work; we just need to know what question to ask it.

First, index the CSV data using the following command with the SimplePostTool:

# /usr/iop/current/solr-server/bin/post -c lapd LAPD_Crime_and_Collision_Raw_Data_2015.csv

228,017 documents were indexed, matching the number of rows in the CSV file excluding the header row.
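A quick way to verify the count is a match-all query that returns no rows, only the response header (a sketch using the collection name):

curl 'http://hostname1.abc.com:8983/solr/lapd/select?q=*:*&rows=0&wt=json'

The numFound value in the response should be 228017.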

Now that the documents are indexed, let’s revisit the managed-schema file in ZooKeeper. The majority of the content in this file was rearranged from its original format. In addition, new fields were added:

  <field name="AREA" type="tlongs"/>
  <field name="AREA_NAME" type="strings"/>
  <field name="Crm_Cd" type="tlongs"/>
  <field name="Crm_Cd_Desc" type="strings"/>
  <field name="Cross_Street" type="strings"/>
  <field name="DATE_OCC" type="strings"/>
  <field name="DR._NO" type="tlongs"/>
  <field name="Date_Rptd" type="strings"/>
  <field name="LOCATION" type="strings"/>
  <field name="Location_1" type="strings"/>
  <field name="RD" type="tlongs"/>
  <field name="Status" type="strings"/>
  <field name="Status_Desc" type="strings"/>
  <field name="TIME_OCC" type="tlongs"/>

These fields correspond to the column names of the CSV file that was indexed; Solr automatically guessed the type of each field. Run a simple query to see how the data looks: http://hostname1.abc.com:8983/solr/lapd_shard1_replica1/select?q=*&rows=1&wt=json&indent=true

{
  "responseHeader":{
    "status":0,
    "QTime":13,
    "params":{
      "q":"*",
      "indent":"true",
      "rows":"1",
      "wt":"json"}},
  "response":{"numFound":228017,"start":0,"maxScore":1.0,"docs":[
      {
        "AREA":[1],
        "RD":[145],
        "Status":["IC"],
        "LOCATION":["  400 S  LOS ANGELES                  ST"],
        "id":"256b834b-4dbf-4c75-942a-55c10576850c",
        "_version_":1537877550409187328,
        "Date_Rptd":["12/02/2015 12:00:00 AM"],
        "DR._NO":[150126705],
        "DATE_OCC":["12/02/2015 12:00:00 AM"],
        "TIME_OCC":[150],
        "AREA_NAME":["Central"],
        "Crm_Cd":[946],
        "Crm_Cd_Desc":["OTHER MISCELLANEOUS CRIME"],
        "Status_Desc":["Invest Cont"],
        "Location_1":"(34.0473, -118.2462)"}]
  }}

For more complex documents, such as Twitter JSON files, indexing with a custom schema is more suitable. A tutorial on how to do so can be found here: How-to: Index Tweets with Solr Using 1 Line of Code.

Update Index

Suppose we want to change the field “AREA” from a long type to a string type. This can be done by using the Schema API to replace the field definition:

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "replace-field":{
     "name":"AREA",
     "type":"strings" }
}' http://hostname.abc.com:8983/solr/lapd/schema

Verify by going to the Solr UI and viewing the managed-schema file under lapd’s configset:

  <field name="AREA" type="strings"/>
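The same check can be done without the UI by reading the single field definition back through the Schema API (a sketch):

curl 'http://hostname.abc.com:8983/solr/lapd/schema/fields/AREA?wt=json'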

Reindex the data with the updated schema by restarting Solr from the Ambari UI or by reloading the cores; when the Schema API is used to modify the schema, a core reload occurs automatically. Run the simple query again to see the updated AREA value returned as a string.

{
  "responseHeader":{
    "status":0,
    "QTime":22,
    "params":{
      "q":"*",
      "indent":"true",
      "rows":"1",
      "wt":"json"}},
  "response":{"numFound":228017,"start":0,"maxScore":1.0,"docs":[
      {
        "AREA":["1"],
        "RD":[174],
        "Status":["IC"],
        "LOCATION":["  100 E  9TH                          ST"],
        "id":"1bac9dcd-6ffb-46a5-b00d-7562b3c51965",
        "_version_":1537877550412333056,
        "Date_Rptd":["12/02/2015 12:00:00 AM"],
        "DR._NO":[150126766],
        "DATE_OCC":["12/02/2015 12:00:00 AM"],
        "TIME_OCC":[1600],
        "AREA_NAME":["Central"],
        "Crm_Cd":[442],
        "Crm_Cd_Desc":["SHOPLIFTING - PETTY THEFT ($950 & UNDER)"],
        "Status_Desc":["Invest Cont"],
        "Location_1":"(34.0416, -118.2550)"}]
  }}
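If an explicit reload is ever needed (for example, after editing a configset outside the Schema API), the Collections API’s RELOAD action reloads all cores of a collection in one call (a sketch):

curl 'http://hostname1.abc.com:8983/solr/admin/collections?action=RELOAD&name=lapd'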

Update Schema

Define Fields

Notice that Location_1 was indexed as type string, since its values are in the format (latitude, longitude). For this field to be useful, the data first has to be cleansed so that each location value is latitude,longitude. Solr does support spatial search and has several field types for this purpose. The LatLonType field type is the current default spatial field, where values take the form latitude,longitude (without spaces).

Create a new collection called lapd_geo. Then add a new field “Location_1” with type “location” using the Schema API. For more information on field type properties, visit the Apache Solr Reference Guide: Defining Fields.

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field":{
     "name":"Location_1",
     "type":"location",
     "stored":true,
     "indexed":true,
     "multiValued":false }
}' http://hostname.abc.com:8983/solr/lapd_geo/schema

From the Solr UI, look at the managed-schema file under lapd_geo’s configset. There should now be a field “Location_1”:

  <field name="Location_1" type="location" multiValued="false" indexed="true" stored="true"/>

Modify the data so that all values under “Location 1” are in the form latitude,longitude.

Date Rptd,DR. NO,DATE OCC,TIME OCC,AREA,AREA NAME,RD,Crm Cd,Crm Cd Desc,Status,Status Desc,LOCATION,Cross Street,Location 1
2015-12-02T00:00:00Z,150126705,2015-12-02T00:00:00Z,150,1,Central,145,946,OTHER MISCELLANEOUS CRIME,IC,Invest Cont,  400 S  LOS ANGELES ST,,"34.0473,-118.2462"
2015-12-02T00:00:00Z,150126706,2015-12-02T00:00:00Z,220,1,Central,145,330,BURGLARY FROM VEHICLE,IC,Invest Cont,         LOS ANGELES, WINSTON,"34.0467,-118.2470"
2015-12-02T00:00:00Z,150126763,2015-12-02T00:00:00Z,1110,1,Central,162,442,SHOPLIFTING - PETTY THEFT ($950 & UNDER),IC,Invest Cont, 700 W 7TH  ST,,"34.0480,-118.2577"

Index the modified CSV file:

# /usr/iop/current/solr-server/bin/post -c lapd_geo LAPD_Crime_and_Collision_Raw_Data_2015_geo.csv

Run the simple query for the new collection and look at the Location_1 field.

"Location_1":"34.0473,-118.2462"

Previously we could not query by location, since it was stored as type string. Run the following query to return the first 10 rows where the location is within a 1 km radius of the coordinates 34.0473,-118.246. Looking at the output, numFound shows a total of 5722 documents match this query.

http://hostname1.abc.com:8983/solr/lapd_geo_shard1_replica1/select?q=*:*&fq={!geofilt%20sfield=Location_1}&pt=34.0473,-118.246&d=1&wt=json&indent=true
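The same query can be issued with curl; note the -g flag, which turns off curl’s brace globbing so the {!geofilt} local-params syntax is passed through literally (a sketch):

curl -g 'http://hostname1.abc.com:8983/solr/lapd_geo_shard1_replica1/select?q=*:*&fq={!geofilt%20sfield=Location_1}&pt=34.0473,-118.246&d=1&wt=json&indent=true'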

Solr is quite good at dynamic field definitions if the data is in the correct format. The LAPD Crime and Collision data set does not adhere to Solr’s data formats by default; for example, the location and date fields need to be modified before we can take full advantage of Solr’s indexing power. Modify the date fields to be in this format: YYYY-MM-DDThh:mm:ssZ.

Date Rptd,DR. NO,DATE OCC,TIME OCC,AREA,AREA NAME,RD,Crm Cd,Crm Cd Desc,Status,Status Desc,LOCATION,Cross Street,Location 1
2015-12-02T00:00:00Z,150126705,2015-12-02T00:00:00Z,150,1,Central,145,946,OTHER MISCELLANEOUS CRIME,IC,Invest Cont,  400 S  LOS ANGELES  ST,,"34.0473,-118.2462"
2015-12-02T00:00:00Z,150126706,2015-12-02T00:00:00Z,220,1,Central,145,330,BURGLARY FROM VEHICLE,IC,Invest Cont,         LOS ANGELES, WINSTON,"34.0467,-118.2470"
2015-12-02T00:00:00Z,150126763,2015-12-02T00:00:00Z,1110,1,Central,162,442,SHOPLIFTING - PETTY THEFT ($950 & UNDER),IC,Invest Cont,  700 W 7TH ST,,"34.0480,-118.2577"

Using the data_driven_schema_configs configset, create a new collection lapd_dynamic and index the data. Notice that Solr was able to recognize the dates, but Location_1 was still set to type string:

  <field name="AREA" type="tlongs"/>
  <field name="AREA_NAME" type="strings"/>
  <field name="Crm_Cd" type="tlongs"/>
  <field name="Crm_Cd_Desc" type="strings"/>
  <field name="Cross_Street" type="strings"/>
  <field name="DATE_OCC" type="tdates"/>
  <field name="DR._NO" type="tlongs"/>
  <field name="Date_Rptd" type="tdates"/>
  <field name="LOCATION" type="strings"/>
  <field name="Location_1" type="strings"/>
  <field name="RD" type="tlongs"/>
  <field name="Status" type="strings"/>
  <field name="Status_Desc" type="strings"/>
  <field name="TIME_OCC" type="tlongs"/>

Dynamic Fields

Instead of explicitly defining a field type for each column, update the header column names with a suffix that Solr can recognize dynamically. The managed-schema file already has many dynamicField rules defined. Here are a few examples:

  <fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
  <dynamicField name="*_dts" type="date" multiValued="true" indexed="true" stored="true"/>
  <dynamicField name="*_ds" type="doubles" indexed="true" stored="true"/>
  <dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
  <dynamicField name="*_ti" type="tint" indexed="true" stored="true"/>
  <dynamicField name="*_i" type="int" indexed="true" stored="true"/>
  <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
  <dynamicField name="*_tl" type="tlong" indexed="true" stored="true"/>
  <dynamicField name="*_tf" type="tfloat" indexed="true" stored="true"/>
  <dynamicField name="*_td" type="tdouble" indexed="true" stored="true"/>

Create a new collection lapd_suffix using the default data_driven_schema_configs configSet. Since a fieldType for location already exists, add a dynamic field rule for location.

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-dynamic-field":{
     "name":"*_latlon",
     "type":"location",
     "stored":true,
     "indexed":true,
     "multiValued":false }
}' http://hostname1.abc.com:8983/solr/lapd_suffix/schema

Verify by going to the Solr UI and viewing the managed-schema file under lapd_suffix’s configset:

  <dynamicField name="*_latlon" type="location" multiValued="false" indexed="true" stored="true"/>

Modify the column names in the CSV file with the corresponding suffix for each desired type, then index the modified data with the new header using the post tool, as sketched after the sample header below.

Date_Rptd_dts,DR._NO_l,DATE_OCC_dt,TIME_OCC_i,AREA_i,AREA_NAME_strings,RD_i,Crm_Cd_i,Crm_Cd_Desc_ss,Status_ss,Status_Desc_ss,LOCATION_ss,Cross_Street_ss,Location_latlon
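Then index the file as before with the post tool (the _suffix.csv file name is hypothetical; substitute whatever name the modified file was saved under):

# /usr/iop/current/solr-server/bin/post -c lapd_suffix LAPD_Crime_and_Collision_Raw_Data_2015_suffix.csv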

The managed-schema file did not gain any new field definitions; instead, the field names are taken directly from the suffixed header and matched by the dynamicField rules. For example, run the earlier query to return the first 10 rows where the location is within a 1 km radius of the coordinates 34.0473,-118.246, after updating the shard and field name.

http://hostname1.abc.com:8983/solr/lapd_suffix_shard1_replica1/select?q=*:*&fq={!geofilt%20sfield=Location_latlon}&pt=34.0473,-118.246&d=1&wt=json&indent=true

{
  "responseHeader":{
    "status":0,
    "QTime":118,
    "params":{
      "q":"*:*",
      "pt":"34.0473,-118.246",
      "d":"1",
      "indent":"true",
      "fq":"{!geofilt sfield=Location_latlon}",
      "wt":"json"}},
  "response":{"numFound":5722,"start":0,"maxScore":1.0,"docs":[
      {
        "Date_Rptd_dts":["2015-12-02T00:00:00Z"],
        "DR._NO_l":150126705,
        "DATE_OCC_dt":"2015-12-02T00:00:00Z",
        "TIME_OCC_i":150,
        "AREA_i":1,
        "AREA_NAME_strings":["Central"],
        "RD_i":145,
        "Crm_Cd_i":946,
        "Crm_Cd_Desc_ss":["OTHER MISCELLANEOUS CRIME"],
        "Status_ss":["IC"],
        "Status_Desc_ss":["Invest Cont"],
        "LOCATION_ss":["  400 S  LOS ANGELES                  ST"],
        "Location_latlon":"34.0473,-118.2462",
        "id":"7348ee9a-afba-4a45-bc16-f5482627d0db",
        "_version_":1538040177258135552},
...

Which area has the highest crime rate? Solr says “77th Street”:

http://hostname1.abc.com:8983/solr/lapd_suffix_shard1_replica1/select?q=*&fl=Location_latlon&wt=json&indent=true&facet=true&facet.field=AREA_NAME_strings

{
  "responseHeader":{
    "status":0,
    "QTime":242,
    "params":{
      "q":"*",
      "facet.field":"AREA_NAME_strings",
      "indent":"true",
      "fl":"Location_latlon",
      "wt":"json",
      "facet":"true"}},
  "response":{"numFound":228017,"start":0,"maxScore":1.0,"docs":[
      ...
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "AREA_NAME_strings":[
        "77th Street",15308,
        "Southwest",14733,
        "Pacific",12376,
        "N Hollywood",12229,
        "Southeast",11399,
        "Van Nuys",11220,
        "Mission",10906,
        "Northeast",10831,
        "Olympic",10790,
        "Central",10750,
        "West LA",10645,
        "Devonshire",10594,
        "Newton",10518,
        "Hollywood",10477,
        "Topanga",10154,
        "Harbor",9816,
        "Rampart",9735,
        "Wilshire",9632,
        "West Valley",9417,
        "Hollenbeck",8564,
        "Foothill",7923]},
    "facet_dates":{},
    "facet_ranges":{},
    "facet_intervals":{},
    "facet_heatmaps":{}}}

Delete Index

Be cautious when deleting collections if multiple collections are linked to the same configset. Use the command below to delete the Solr collection and its corresponding configset from ZooKeeper:

# SOLR_INCLUDE=/etc/solr/conf/solr.in.sh /usr/iop/current/solr-server/bin/solr delete -c lapd

The REST API call below deletes the collection, but not its corresponding configset:

curl 'http://hostname1.abc.com:8983/solr/admin/collections?action=DELETE&name=lapd'

Conclusion

This article showed some of the basic functionality of Solr. There are many more features and complex queries that were not covered here: Solr offers rich query syntax and parsing, faceting, highlighting, spell checking, result grouping and clustering, and more. To learn more about Solr, visit the Apache Solr Reference Guide for 5.5.0.

Don’t forget: according to Solr, stay away from 77th Street in LA!
