Dwell in the Beginning: How Language Models Embed Documents for Retrieval

This repository contains code to reproduce the models and experiments associated with the paper Dwell in the Beginning: How Language Models Embed Documents for Retrieval

Installation

First, start by building the conda environment, downloading and processing the data.

./prepare_environment.sh env.yaml

./prepare_data.sh

Note: you may need to install torch versions that are supported by your machine. The provided environment defaults 2.0

Data formats

Exemplifying the MS-MARCO corpus in OpenMatch format.

Corpus: <doc_id>\t<title>\t<text> Example:

head -1 ./data/marco_documents_processed/corpus_firstp_2048.tsv
>>> "0       The hot glowing surfaces of stars emit energy in the form of electromagnetic radiation.?        Science & Mathematics Physics The hot glowing surfaces of stars emit energy ...."

Queries: <query_id>\t<text>

head -1 ./data/marco_documents_processed/train.query.txt
>>> "1185869 )what was the immediate impact of the success of the manhattan project?"

Qrels: <query_id>\t<Q0>\t<doc_id>\t<relevancy>

head -1 ./data/marco_documents_processed/qrels.train.tsv
>>> "3       0       895028  1"

Negatives: <query_id>\t<neg_doc_id_1, ..., neg_doc_id_N>

head -1 ./data/marco_documents_processed/train.negatives.tsv
>>> "3       1896535,3054358,527877,2312708,1606470,1680606,2574509,2591560,712805,1678190"

Note: OpenMatch expects sequential integers on document ids. Hence the need for the processing step.

Model inference

./scripts/eval_dr.sh jmvcoelho/t5-base-marco-2048

This will create the results folder, with the trec-formatted run and trec-eval output for the model.

Model training

wandb login #if you haven't already

./scripts/train_dr.sh jmvcoelho/t5-base-marco-crop-pretrain-2048

This will:

pretokenize the data
do 1 episode on ance negatives, then 3 more episodes with negative refeshing
evaluate the final model
hard negatives also computed for final model in case you wish to pickup training

Training starts of crop pretrained model.

You may want to set $DATA_PATH on the script, as this writes to the current directory.

Experiments

The folder ./scripts/experiments contains the scripts used to run the positional biases experiments. (under refactorization)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Dwell in the Beginning: How Language Models Embed Documents for Retrieval

Installation

Data formats

Model inference

Model training

Experiments

Files

README.md

Latest commit

History

README.md

File metadata and controls

Dwell in the Beginning: How Language Models Embed Documents for Retrieval

Installation

Data formats

Model inference

Model training

Experiments