
Commit

update readmes
nreimers committed Oct 19, 2020
1 parent 3824a19 commit 3d12b0c
Showing 6 changed files with 43 additions and 5 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -2,6 +2,8 @@
*.pyc
*.gz
*.tsv
tmp_*.py
/examples/**/output/*
examples/datasets/*/
sentence_transformers.egg-info
dist/
2 changes: 1 addition & 1 deletion README.md
@@ -285,7 +285,7 @@ If you use one of the multilingual models, feel free to cite our publication [Ma
If you use the code for [data augmentation](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/data_augmentation), feel free to cite our publication [Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks](https://arxiv.org/abs/2010.08240):
```
@article{thakur-2020-AugSBERT,
title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
title = "Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks",
author = "Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and Gurevych, Iryna",
journal= "arXiv preprint arXiv:2010.08240",
month = "10",
26 changes: 26 additions & 0 deletions docs/pretrained-models/msmarco.md
@@ -0,0 +1,26 @@
# MSMARCO Models
[MS MARCO](https://microsoft.github.io/msmarco/) is a large-scale information retrieval corpus created from real user search queries on the Bing search engine. The provided models can be used for semantic search, i.e., given keywords / a search phrase / a question, the model finds passages that are relevant for the search query.

The training data consists of over 500k examples, while the complete corpus consists of over 8.8 million passages.



## Version History
As we work on the topic, we will publish updated (and improved) models.

### v1
Version 1 models were trained on the training set of the MS MARCO Passage Retrieval task. The models were trained using in-batch negative sampling via the MultipleNegativesRankingLoss with a scaling factor of 20 and a batch size of 128.
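As a rough illustration, the training setup may have looked along these lines (a minimal sketch, not the exact training script; the base model name and the example pair are assumptions):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumed base model; the released model is distilroberta-based
model = SentenceTransformer('distilroberta-base')

# Each example pairs a query with a relevant passage; all other passages
# in the same batch serve as in-batch negatives.
train_examples = [
    InputExample(texts=['[QRY] How big is London',
                        '[DOC] London has 9,787,426 inhabitants at the 2011 census']),
    # ... over 500k such pairs in the full MS MARCO training set
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=128)
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```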

They can be used like this:
```python
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('distilroberta-base-msmarco-v1')

query_embedding = model.encode('[QRY] ' + 'How big is London')
passage_embedding = model.encode('[DOC] ' + 'London has 9,787,426 inhabitants at the 2011 census')

print("Similarity:", util.pytorch_cos_sim(query_embedding, passage_embedding))
```

**Models**:
- **distilroberta-base-msmarco-v1** - Performance on the MS MARCO dev dataset (queries.dev.small.tsv): MRR@10: 23.28
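
For reference, MRR@10 is the mean reciprocal rank of the first relevant passage within the top-10 results, averaged over all queries. A minimal sketch of the computation (`mrr_at_10` is a hypothetical helper, not part of the library):

```python
def mrr_at_10(rankings, relevant):
    """rankings: per query, the retrieved passage ids (best first).
    relevant: per query, the set of relevant passage ids."""
    total = 0.0
    for ranked_ids, relevant_ids in zip(rankings, relevant):
        for rank, pid in enumerate(ranked_ids[:10], start=1):
            if pid in relevant_ids:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(rankings)

# e.g. mrr_at_10([['p7', 'p2', 'p9']], [{'p2'}]) == 0.5
```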
6 changes: 4 additions & 2 deletions docs/pretrained_models.md
@@ -15,7 +15,7 @@ Sadly there cannot exist a universal model that performs great on all possible t

## Paraphrase Identification

The following models were trained on Millions of paraphrase sentences. They create extremely good results for various similarity and retrieval tasks. They are currently under development, better versions and more details will be released in future.
The following models **are recommended for various applications**, as they were trained on millions of paraphrase examples. They produce extremely good results for various similarity and retrieval tasks. They are currently under development; better versions and more details will be released in the future. For many tasks they work better than the NLI / STSb models. A short usage sketch follows the model list below.

- **distilroberta-base-paraphrase-v1** - Trained on large scale paraphrase data.
- **xlm-r-distilroberta-base-paraphrase-v1** - Multilingual version of distilroberta-base-paraphrase-v1, trained on parallel data for 50+ languages.
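
A quick sketch of how these models can be used for similarity scoring (the sentence pair is a made-up example):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('distilroberta-base-paraphrase-v1')

emb1 = model.encode('The cat sits on the mat', convert_to_tensor=True)
emb2 = model.encode('A cat is resting on a rug', convert_to_tensor=True)

print("Similarity:", util.pytorch_cos_sim(emb1, emb2))
```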
@@ -31,7 +31,7 @@ The following models were optimized for [Semantic Textual Similarity](usage/sema

[» Full List of STS Models](https://docs.google.com/spreadsheets/d/14QplCdTCDwEmTqrn1LH4yrbKvdogK4oQvYO1K1aPR5M/edit#gid=0)

I can recommend the **distilbert-base-nli-stsb-mean-tokens** model, which gives a nice balance between speed and performance.


## Duplicate Questions Detection

@@ -60,6 +60,8 @@ print("Similarity:", util.pytorch_cos_sim(query_embedding, passage_embedding))

You can index the passages as shown [here](https://www.sbert.net/docs/usage/semantic_search.html).
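
A minimal sketch of such a search over a small in-memory corpus (the passages are made-up placeholders; for large corpora an approximate nearest-neighbor index is preferable):

```python
import torch
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('distilroberta-base-msmarco-v1')

passages = [
    'London has 9,787,426 inhabitants at the 2011 census',
    'Paris is the capital and most populous city of France',
]
passage_embeddings = model.encode(['[DOC] ' + p for p in passages],
                                  convert_to_tensor=True)

query_embedding = model.encode('[QRY] ' + 'How big is London',
                               convert_to_tensor=True)

# Rank all passages by cosine similarity to the query
scores = util.pytorch_cos_sim(query_embedding, passage_embeddings)[0]
top = torch.topk(scores, k=2)
for score, idx in zip(top.values, top.indices):
    print(f"{score:.4f}\t{passages[idx]}")
```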

[More details](pretrained-models/msmarco.md)


## Multi-Lingual Models
The following models generate aligned vector spaces, i.e., similar inputs in different languages are mapped close in vector space. You do not need to specify the input language. Details are in our publication [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813):
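
As a quick illustration of the aligned vector space (a sketch; distiluse-base-multilingual-cased is used here as an example model, and the sentences are made up):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('distiluse-base-multilingual-cased')

# The same sentence in English, German, and Spanish
embeddings = model.encode(['The weather is nice today',
                           'Das Wetter ist heute schön',
                           'El clima es agradable hoy'])

# Cross-lingual pairs should receive high cosine similarity
print(util.pytorch_cos_sim(embeddings[0], embeddings[1]))
print(util.pytorch_cos_sim(embeddings[0], embeddings[2]))
```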
2 changes: 1 addition & 1 deletion docs/publications.md
@@ -31,7 +31,7 @@ If you use one of the multilingual models, feel free to cite our publication [Ma
If you use the code for [data augmentation](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/data_augmentation), feel free to cite our publication [Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks](https://arxiv.org/abs/2010.08240):
```
@article{thakur-2020-AugSBERT,
title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
title = "Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks",
author = "Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and Gurevych, Iryna",
journal= "arXiv preprint arXiv:2010.08240",
month = "10",
10 changes: 9 additions & 1 deletion examples/training/data_augmentation/README.md
@@ -86,5 +86,13 @@ The [examples/training/data_augmentation](https://github.com/UKPLab/sentence-tra

## Citation
If you use the code for augmented sbert, feel free to cite our publication [Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks](https://arxiv.org/abs/2010.08240):
```
```
@article{thakur-2020-AugSBERT,
title = "Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks",
author = "Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and Gurevych, Iryna",
journal= "arXiv preprint arXiv:2010.08240",
month = "10",
year = "2020",
url = "https://arxiv.org/abs/2010.08240",
}
```
