Solrini: Anserini Integration with Solr

This page documents code for replicating results from the following paper:

We provide instructions for setting up a single-node SolrCloud instance running locally and indexing into it from Anserini. Instructions for setting up SolrCloud clusters can be found by searching the web.

Setting up a Single-Node SolrCloud Instance

From the Solr archives, download the Solr (non -src) version that matches Anserini's Lucene version to the anserini/ directory.

Extract the archive:

mkdir solrini && tar -zxvf solr*.tgz -C solrini --strip-components=1

Start Solr:

solrini/bin/solr start -c -m 8G

Adjust memory usage (i.e., -m 8G) as appropriate.

Run the Solr bootstrap script to copy the Anserini JAR into Solr's classpath and upload the configsets to Solr's internal ZooKeeper:

pushd src/main/resources/solr && ./solr.sh ../../../../solrini localhost:9983 && popd

Solr should now be available at http://localhost:8983/ for browsing.

The Solr index schema can also be modified using the Schema API. This is useful for specifying field types and other properties including multiValued fields.

JSON files for setting up specific Solr index schemas can be found in the src/main/resources/solr/schemas/ folder.

To set the schema, we can make a request to the Schema API:

curl -X POST -H 'Content-type:application/json' --data-binary @src/main/resources/solr/schemas/SCHEMA_NAME.json http://localhost:8983/solr/COLLECTION_NAME/schema

For the Robust04 example below, this isn't necessary.
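For scripted setups, the same Schema API call can be issued from Python. The following is a minimal sketch using only the standard library; the helper name `build_schema_request` and the `authors` field definition are hypothetical and for illustration only, and whether a `string` field type exists depends on the configset in use.

```python
import json
import urllib.request


def build_schema_request(collection, schema_commands, base_url="http://localhost:8983"):
    """Build (but do not send) a POST request to the Solr Schema API.

    `schema_commands` is a dict of Schema API commands, i.e. the same JSON
    that the curl example passes via --data-binary.
    """
    url = f"{base_url}/solr/{collection}/schema"
    body = json.dumps(schema_commands).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical field definition, not taken from the Anserini schemas.
    commands = {
        "add-field": {
            "name": "authors",
            "type": "string",
            "stored": True,
            "multiValued": True,
        }
    }
    request = build_schema_request("COLLECTION_NAME", commands)
    print(request.get_method(), request.full_url)
    # To apply the change against a running Solr instance, uncomment:
    # with urllib.request.urlopen(request) as response:
    #     print(response.read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to inspect the URL and payload before touching a live collection.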

Indexing into SolrCloud from Anserini

We can use Anserini as a common "front-end" for indexing into SolrCloud, thus supporting the same range of test collections that's already included in Anserini (when directly building local Lucene indexes). Indexing into Solr is similar to indexing to disk with Lucene, with a few added parameters. Most notably, we replace the -index parameter (which specifies the Lucene index path on disk) with Solr parameters. Alternatively, Solr can also be configured to read a prebuilt Lucene index, since Solr uses Lucene indexes under the hood.

We'll index Robust04 as an example. First, create the robust04 collection in Solr:

solrini/bin/solr create -n anserini -c robust04

Run the Solr indexing command for robust04:

sh target/appassembler/bin/IndexCollection -collection TrecCollection -generator DefaultLuceneDocumentGenerator \
  -threads 8 -input /path/to/disk45 \
  -solr -solr.index robust04 -solr.zkUrl localhost:9983 \
  -storePositions -storeDocvectors -storeRaw

Make sure /path/to/disk45 is updated with the appropriate path for the Robust04 collection.

Once indexing has completed, you should be able to query robust04 from the Solr query interface.
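Queries can also be issued programmatically against the collection's select handler. The sketch below only builds the request URL; the helper name `build_select_url` is hypothetical, and the field names `contents` and `id` are assumptions about the Anserini schema that should be checked against the collection's actual fields.

```python
from urllib.parse import urlencode


def build_select_url(collection, query, rows=10, field="contents",
                     base_url="http://localhost:8983"):
    """Return a URL for the collection's /select handler.

    The `field` default ("contents") and the returned "id" field are
    assumptions; adjust them to match the schema of your collection.
    """
    params = urlencode({
        "q": f"{field}:({query})",  # search the given field
        "rows": rows,               # number of hits to return
        "fl": "id,score",           # return document id and score only
    })
    return f"{base_url}/solr/{collection}/select?{params}"


if __name__ == "__main__":
    url = build_select_url("robust04", "black bear attacks")
    print(url)
    # To fetch results from a running Solr instance, uncomment:
    # import urllib.request
    # with urllib.request.urlopen(url) as response:
    #     print(response.read().decode("utf-8"))
```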

You can also run the following command to replicate Anserini BM25 retrieval:

sh target/appassembler/bin/SearchSolr -topicreader Trec \
  -solr.index robust04 -solr.zkUrl localhost:9983 \
  -topics src/main/resources/topics-and-qrels/topics.robust04.txt \
  -output runs/run.solr.robust04.bm25.topics.robust04.txt

Evaluation can be performed using trec_eval:

$ tools/eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.robust04.txt runs/run.solr.robust04.bm25.topics.robust04.txt
map                   	all	0.2531
P_30                  	all	0.3102
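When comparing runs automatically, it can help to turn the trec_eval output into numbers. The following is a small sketch that parses summary lines of the form shown above (metric, topic, value); the function name `parse_trec_eval` is hypothetical.

```python
def parse_trec_eval(output):
    """Parse trec_eval summary lines into a {metric: value} dict.

    Only lines with exactly three whitespace-separated columns and the
    topic column equal to "all" are kept; anything else is skipped.
    """
    metrics = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue
        name, topic, value = parts
        if topic != "all":
            continue
        try:
            metrics[name] = float(value)
        except ValueError:
            continue  # non-numeric value; ignore it in this sketch
    return metrics


if __name__ == "__main__":
    sample = "map                   \tall\t0.2531\nP_30                  \tall\t0.3102\n"
    print(parse_trec_eval(sample))
```

The resulting values can then be compared against the expected scores with a small tolerance rather than by string equality.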

Solrini has also been verified to work with the following collections:

See the run_solr_regression.py regression script for more details.

Solr with a Pre-built Lucene Index

It is possible for Solr to read pre-built Lucene indexes. To achieve this, some housekeeping is required to "install" the pre-built indexes. The following uses Robust04 as an example. Let's assume the pre-built index is stored at indexes/lucene-index.robust04.pos+docvectors+raw/.

First, a Solr collection must be created to house the index. Here, we create a collection robust04 with configset anserini.

solrini/bin/solr create -n anserini -c robust04

Along with the collection, Solr will create a core instance, whose name can be found in the Solr UI under collection overview. It'll look something like <collection_name>_shard<id>_replica_<id> (e.g., robust04_shard1_replica_n1). Solr stores configurations and data for the core instances under Solr home, which for us is solrini/server/solr/ by default.
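If you prefer not to look up the core name in the UI, the naming pattern above can be matched against the directories under Solr home. This is a rough sketch: the helper names are hypothetical, and the regular expression simply encodes the `<collection_name>_shard<id>_replica_<id>` pattern described above.

```python
import re
import tempfile
from pathlib import Path

# Encodes <collection_name>_shard<id>_replica_<id>, e.g. robust04_shard1_replica_n1
CORE_NAME = re.compile(
    r"^(?P<collection>.+)_shard(?P<shard>\d+)_replica_(?P<replica>\w+)$"
)


def parse_core_name(core_name):
    """Split a core name into collection, shard, and replica, or return None."""
    match = CORE_NAME.match(core_name)
    return match.groupdict() if match else None


def find_core_dirs(solr_home, collection):
    """List core directories under Solr home that belong to `collection`."""
    found = []
    for path in sorted(Path(solr_home).iterdir()):
        parsed = parse_core_name(path.name)
        if path.is_dir() and parsed and parsed["collection"] == collection:
            found.append(path)
    return found


if __name__ == "__main__":
    print(parse_core_name("robust04_shard1_replica_n1"))
    # Demonstrate on a throwaway directory instead of a real Solr home.
    with tempfile.TemporaryDirectory() as home:
        (Path(home) / "robust04_shard1_replica_n1").mkdir()
        (Path(home) / "core18_shard1_replica_n1").mkdir()
        print([p.name for p in find_core_dirs(home, "robust04")])
```

With the setup in this guide, `find_core_dirs("solrini/server/solr", "robust04")` would be the corresponding call.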

Second, make proper Solr schema adjustments if necessary. Here, robust04 is a TREC collection whose schema is already handled by managed-schema in the Solr configset. However, for a collection such as cord19, remember to make proper adjustments to the Solr schema (also see above):

curl -X POST -H 'Content-type:application/json' --data-binary @src/main/resources/solr/schemas/SCHEMA_NAME.json http://localhost:8983/solr/COLLECTION_NAME/schema

Finally, we can copy the pre-built index to the location where Solr expects it. Start by removing the data that's already there:

rm solrini/server/solr/robust04_shard1_replica_n1/data/index/*

Then, simply copy the pre-built Lucene indexes into that location:

cp indexes/lucene-index.robust04.pos+docvectors+raw/* solrini/server/solr/robust04_shard1_replica_n1/data/index
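The remove-then-copy steps above can also be scripted. The following is a minimal sketch with a hypothetical helper, `install_prebuilt_index`; it handles regular files only and, unlike `rm dir/*`, also removes hidden files in the target directory.

```python
import shutil
import tempfile
from pathlib import Path


def install_prebuilt_index(prebuilt_dir, core_index_dir):
    """Replace the files in a core's data/index directory with a pre-built index.

    Returns the sorted list of file names that were copied.
    """
    src = Path(prebuilt_dir)
    dst = Path(core_index_dir)
    for directory in (src, dst):
        if not directory.is_dir():
            raise FileNotFoundError(f"not a directory: {directory}")

    # Step 1: remove whatever index files Solr created for the empty core.
    for entry in dst.iterdir():
        if entry.is_file():
            entry.unlink()

    # Step 2: copy the pre-built Lucene index files into place.
    copied = []
    for entry in sorted(src.iterdir()):
        if entry.is_file():
            shutil.copy2(entry, dst / entry.name)
            copied.append(entry.name)
    return copied


if __name__ == "__main__":
    # Demonstrate on throwaway directories instead of a real Solr home.
    with tempfile.TemporaryDirectory() as tmp:
        src, dst = Path(tmp) / "prebuilt", Path(tmp) / "index"
        src.mkdir()
        dst.mkdir()
        (src / "segments_1").write_text("new")
        (dst / "stale_file").write_text("old")
        print(install_prebuilt_index(src, dst))
```

For the example in this guide, the arguments would be `indexes/lucene-index.robust04.pos+docvectors+raw` and `solrini/server/solr/robust04_shard1_replica_n1/data/index`.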

Restart Solr to make sure changes take effect:

solrini/bin/solr stop
solrini/bin/solr start -c -m 8G

You can confirm that everything works by performing a retrieval run and checking the results (see above).

Solr integration test

We have an end-to-end integration testing script, run_solr_regression.py. Example usage for Robust04 is shown below:

# Check if Solr server is on
python src/main/python/run_solr_regression.py --ping

# Check if robust04 exists
python src/main/python/run_solr_regression.py --check-index-exists robust04

# Create robust04 if it does not exist
python src/main/python/run_solr_regression.py --create-index robust04

# Delete robust04 if it exists
python src/main/python/run_solr_regression.py --delete-index robust04

# Insert documents from /path/to/disk45 into robust04
python src/main/python/run_solr_regression.py --insert-docs robust04 --input /path/to/disk45

# Search and evaluate on robust04
python src/main/python/run_solr_regression.py --evaluate robust04

To run end-to-end, issue the following command:

python src/main/python/run_solr_regression.py --regression robust04 --input /path/to/disk45

The regression script has been verified to work for robust04, core18, msmarco-passage, and msmarco-doc.
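To drive the individual steps from another script (for example, in a CI job), the invocations can be assembled as argument lists. This is only a sketch: the helper names are hypothetical, the flags are the ones shown in the examples above, and the step order chosen here is an assumption rather than a description of what --regression does internally.

```python
import subprocess

SCRIPT = "src/main/python/run_solr_regression.py"


def regression_commands(collection, input_path, python="python"):
    """Return the step-by-step invocations as argument lists (nothing is run)."""
    base = [python, SCRIPT]
    return [
        base + ["--ping"],
        base + ["--delete-index", collection],
        base + ["--create-index", collection],
        base + ["--insert-docs", collection, "--input", input_path],
        base + ["--evaluate", collection],
    ]


def run_steps(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for command in regression_commands("robust04", "/path/to/disk45"):
        print(" ".join(command))
```

Calling `run_steps(regression_commands(...))` from the anserini/ directory would execute the steps against the running Solr instance.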

Replication Log