Skip to content

Latest commit

 

History

History
147 lines (113 loc) · 5.36 KB

experiments-kilt.md

File metadata and controls

147 lines (113 loc) · 5.36 KB

Pyserini: Reproducing BM25 Baselines for KILT

Note: Currently, this guide requires at least ~100GB of disk space available, since we will be working with snapshots of Wikipedia.

Setup env

# Create a virtual env
conda create -n kilt37 -y python=3.7 && conda activate kilt37

# Get the development installation of pyserini
git clone https://github.com/castorini/pyserini.git
pip install pyserini

# Get KILT scripts, input and gold data, and install the package
git clone https://github.com/facebookresearch/KILT.git
cd KILT
pip install -r requirements.txt
pip install .
mkdir data
python scripts/donwload_all_kilt_data.py
python scripts/get_triviaqa_input.py
cd ..

# Get NLTK dependencies
python -m nltk.downloader punkt
python -m nltk.downloader stopwords

# Get the KILT knowledge source / wikipedia dump (34.76GiB)
cd pyserini/collections/
wget http://dl.fbaipublicfiles.com/KILT/kilt_knowledgesource.json

# We'll split it in multiple files to make processing faster
mkdir kilt_knowledge_split
cd kilt_knowledge_split
split -l500000 ../kilt_knowledgesource.json kilt_ks.
cd ../../..

# Feel free to delete the kilt_knowledgesource.json file now if you need more disk space.

The rest of the instructions assume you are working at the following directory:

<dir>/ (*) <- here
    KILT/
    pyserini/

Index the knowledge source

Convert to passage or document level JSONL format indexable by Pyserini. You can inspect the individual nohup output files using tail -f <file>:

Document-level

mkdir pyserini/collections/kilt_document
for filename in pyserini/collections/kilt_knowledge_split/kilt_ks.??; do
    [ -e "$filename" ] || continue
    nohup python pyserini/scripts/kilt/convert_kilt_to_document_jsonl.py \
        --input "$filename" \
        --output pyserini/collections/kilt_document/$(basename "$filename") \
        --flen 500000 \
        > nohup_$(basename "$filename").out &
done

# Once it's done, convert back into 1 file:
cat pyserini/collections/kilt_document/kilt_ks.?? > pyserini/collections/kilt_document/dump.jsonl
rm pyserini/collections/kilt_document/kilt_ks.??
# Sanity check (should give the same # of lines):
wc -l pyserini/collections/kilt_knowledgesource.json
wc -l pyserini/collections/kilt_document/dump.jsonl

# Finally, index into Anserini (about 1hr):
nohup python -m pyserini.index -collection JsonCollection -generator DefaultLuceneDocumentGenerator \
 -threads 40 -input pyserini/collections/kilt_document/ \
 -index pyserini/indexes/kilt_document -storePositions -storeDocvectors -storeContents &

Passage-level

mkdir pyserini/collections/kilt_passage
for filename in pyserini/collections/kilt_knowledge_split/kilt_ks.??; do
    [ -e "$filename" ] || continue
    nohup python pyserini/scripts/kilt/convert_kilt_to_passage_jsonl.py \
        --input "$filename" \
        --output pyserini/collections/kilt_passage/$(basename "$filename") \
        --sections --bigrams --stem \
        --flen 500000 \
        > nohup_$(basename "$filename").out &
done

# Once it's done, convert back into 1 file:
cat pyserini/collections/kilt_passage/kilt_ks.?? > pyserini/collections/kilt_passage/dump.jsonl
rm pyserini/collections/kilt_passage/kilt_ks.??

# Finally, index into Anserini (about 1hr):
nohup python -m pyserini.index -collection JsonCollection -generator DefaultLuceneDocumentGenerator \
 -threads 40 -input pyserini/collections/kilt_passage/ \
 -index pyserini/indexes/kilt_passage -storePositions -storeDocvectors -storeContents &

Create runs

Compute a run for a given index. Tasks can be configured using --config. You can increase the number of threads, but you may encounter OOM issues. I find that 8-20 is usually a good amount. This will take a 1-2 hours.

nohup python pyserini/scripts/kilt/run_retrieval.py \
 --config pyserini/scripts/kilt/dev_data.json \
 --index_dir pyserini/indexes/kilt_document \
 --output_dir pyserini/runs \
 --threads 8 \
 --topk 1000 \
 --name kilt_document &

You can use kilt_passage instead to run retrieval using the passage-level index.

Evaluate

# Evaluate runs (takes a few minutes if you used topk 1000)
nohup ./pyserini/scripts/kilt/eval_runs.sh pyserini/runs/kilt_document 1,100,1000 > results.out &

You can use kilt_passage instead to run evaluation using the passage-level run.

Results

Your results should look like this:

For R-Precision:

model FEV AY2 WnWi WnCw T-REx zsRE NQ HoPo TQA ELI5 WoW
baseline drqa (tfidf + bigram hashing) 50.75 2.44 0.15 1.27 43.43 60.63 28.59 34.63 45.70 11.02 41.82
anserini (document) 38.21 3.43 0.09 2.71 44.64 50.08 29.93 38.37 36.76 7.17 22.27
anserini (passage) 43.04 3.18 0.15 2.75 55.06 67.50 24.64 41.43 24.95 5.84 24.85

For Recall@100/1000:

model FEV AY2 WnWi WnCw T-REx zsRE NQ HoPo TQA ELI5 WoW
baseline drqa (tfidf + bigram hashing) 91.87/96.54 - - - 84.82/94.16 94.12/97.29 70.98/84.99 62.32/80.57 87.04/94.95 39.98/56.77 91.53/96.47
anserini (document) 88.41/95.65 - - - 83.24/92.36 91.83/97.82 75.12/87.59 59.66/78.59 81.21/92.36 34.00/53.12 69.95/83.58
anserini (passage) 91.99/95.79 - - - 88.03/94.25 98.01/99.25 75.55/87.08 61.52/77.80 80.18/91.32 32.50/47.85 65.96/78.45