
Commit

update readmes
nreimers committed Oct 19, 2020
1 parent 3824a19 commit 3d12b0c
Showing 6 changed files with 43 additions and 5 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -2,6 +2,8 @@
*.pyc
*.gz
*.tsv
tmp_*.py
/examples/**/output/*
examples/datasets/*/
sentence_transformers.egg-info
dist/
2 changes: 1 addition & 1 deletion README.md
@@ -285,7 +285,7 @@ If you use one of the multilingual models, feel free to cite our publication [Ma
If you use the code for [data augmentation](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/data_augmentation), feel free to cite our publication [Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks](https://arxiv.org/abs/2010.08240):
```
@article{thakur-2020-AugSBERT,
title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
title = "Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks",
author = "Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and Gurevych, Iryna",
journal= "arXiv preprint arXiv:2010.08240",
month = "10",
26 changes: 26 additions & 0 deletions docs/pretrained-models/msmarco.md
@@ -0,0 +1,26 @@
# MSMARCO Models
[MS MARCO](https://microsoft.github.io/msmarco/) is a large-scale information retrieval corpus created from real user search queries on the Bing search engine. The provided models can be used for semantic search, i.e., given keywords / a search phrase / a question, the model finds passages that are relevant for the search query.

The training data consists of over 500k examples, while the complete corpus consists of over 8.8 million passages.



## Version History
As we work on the topic, we will publish updated (and improved) models.

### v1
Version 1 models were trained on the training set of the MS MARCO Passage Retrieval task. The models were trained using in-batch negative sampling via the MultipleNegativesRankingLoss with a scaling factor of 20 and a batch size of 128.
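As a rough illustration, the training setup may have looked along these lines (a minimal sketch, not the exact training script; the base model name and the example pair are assumptions):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumed base model; the released model is distilroberta-based
model = SentenceTransformer('distilroberta-base')

# Each example pairs a query with a relevant passage; all other passages
# in the same batch serve as in-batch negatives.
train_examples = [
    InputExample(texts=['[QRY] How big is London',
                        '[DOC] London has 9,787,426 inhabitants at the 2011 census']),
    # ... over 500k such pairs in the full MS MARCO training set
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=128)
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```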

They can be used like this:
```python
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('distilroberta-base-msmarco-v1')

query_embedding = model.encode('[QRY] ' + 'How big is London')
passage_embedding = model.encode('[DOC] ' + 'London has 9,787,426 inhabitants at the 2011 census')

print("Similarity:", util.pytorch_cos_sim(query_embedding, passage_embedding))
```

**Models**:
- **distilroberta-base-msmarco-v1** - Performance on the MS MARCO dev dataset (queries.dev.small.tsv): MRR@10: 23.28
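
For reference, MRR@10 is the mean reciprocal rank of the first relevant passage within the top-10 results, averaged over all queries. A minimal sketch of the computation (`mrr_at_10` is a hypothetical helper, not part of the library):

```python
def mrr_at_10(rankings, relevant):
    """rankings: per query, the retrieved passage ids (best first).
    relevant: per query, the set of relevant passage ids."""
    total = 0.0
    for ranked_ids, relevant_ids in zip(rankings, relevant):
        for rank, pid in enumerate(ranked_ids[:10], start=1):
            if pid in relevant_ids:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(rankings)

# e.g. mrr_at_10([['p7', 'p2', 'p9']], [{'p2'}]) == 0.5
```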
6 changes: 4 additions & 2 deletions docs/pretrained_models.md
@@ -15,7 +15,7 @@ Sadly there cannot exist a universal model that performs great on all possible t

## Paraphrase Identification

The following models were trained on Millions of paraphrase sentences. They create extremely good results for various similarity and retrieval tasks. They are currently under development, better versions and more details will be released in future.
The following models **are recommended for various applications**, as they were trained on millions of paraphrase examples. They produce extremely good results for various similarity and retrieval tasks. They are currently under development; better versions and more details will be released in the future. For many tasks they work better than the NLI / STSb models. A short usage sketch follows the model list below.

- **distilroberta-base-paraphrase-v1** - Trained on large scale paraphrase data.
- **xlm-r-distilroberta-base-paraphrase-v1** - Multilingual version of distilroberta-base-paraphrase-v1, trained on parallel data for 50+ languages.
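
A quick sketch of how these models can be used for similarity scoring (the sentence pair is a made-up example):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('distilroberta-base-paraphrase-v1')

emb1 = model.encode('The cat sits on the mat', convert_to_tensor=True)
emb2 = model.encode('A cat is resting on a rug', convert_to_tensor=True)

print("Similarity:", util.pytorch_cos_sim(emb1, emb2))
```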
@@ -31,7 +31,7 @@ The following models were optimized for [Semantic Textual Similarity](usage/sema

[» Full List of STS Models](https://docs.google.com/spreadsheets/d/14QplCdTCDwEmTqrn1LH4yrbKvdogK4oQvYO1K1aPR5M/edit#gid=0)

I can recommend the **distilbert-base-nli-stsb-mean-tokens** model, which gives a nice balance between speed and performance.


## Duplicate Questions Detection

@@ -60,6 +60,8 @@ print("Similarity:", util.pytorch_cos_sim(query_embedding, passage_embedding))

You can index the passages as shown [here](https://www.sbert.net/docs/usage/semantic_search.html).
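
A minimal sketch of such a search over a small in-memory corpus (the passages are made-up placeholders; for large corpora an approximate nearest-neighbor index is preferable):

```python
import torch
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('distilroberta-base-msmarco-v1')

passages = [
    'London has 9,787,426 inhabitants at the 2011 census',
    'Paris is the capital and most populous city of France',
]
passage_embeddings = model.encode(['[DOC] ' + p for p in passages],
                                  convert_to_tensor=True)

query_embedding = model.encode('[QRY] ' + 'How big is London',
                               convert_to_tensor=True)

# Rank all passages by cosine similarity to the query
scores = util.pytorch_cos_sim(query_embedding, passage_embeddings)[0]
top = torch.topk(scores, k=2)
for score, idx in zip(top.values, top.indices):
    print(f"{score:.4f}\t{passages[idx]}")
```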

[More details](pretrained-models/msmarco.md)


## Multi-Lingual Models
The following models generate aligned vector spaces, i.e., similar inputs in different languages are mapped close in vector space. You do not need to specify the input language. Details are in our publication [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813):
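
As a quick illustration of the aligned vector space (a sketch; distiluse-base-multilingual-cased is used here as an example model, and the sentences are made up):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('distiluse-base-multilingual-cased')

# The same sentence in English, German, and Spanish
embeddings = model.encode(['The weather is nice today',
                           'Das Wetter ist heute schön',
                           'El clima es agradable hoy'])

# Cross-lingual pairs should receive high cosine similarity
print(util.pytorch_cos_sim(embeddings[0], embeddings[1]))
print(util.pytorch_cos_sim(embeddings[0], embeddings[2]))
```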
2 changes: 1 addition & 1 deletion docs/publications.md
@@ -31,7 +31,7 @@ If you use one of the multilingual models, feel free to cite our publication [Ma
If you use the code for [data augmentation](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/data_augmentation), feel free to cite our publication [Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks](https://arxiv.org/abs/2010.08240):
```
@article{thakur-2020-AugSBERT,
title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
title = "Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks",
author = "Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and Gurevych, Iryna",
journal= "arXiv preprint arXiv:2010.08240",
month = "10",
10 changes: 9 additions & 1 deletion examples/training/data_augmentation/README.md
@@ -86,5 +86,13 @@ The [examples/training/data_augmentation](https://github.com/UKPLab/sentence-tra

## Citation
If you use the code for augmented sbert, feel free to cite our publication [Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks](https://arxiv.org/abs/2010.08240):
```
```
@article{thakur-2020-AugSBERT,
title = "Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks",
author = "Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and Gurevych, Iryna",
journal= "arXiv preprint arXiv:2010.08240",
month = "10",
year = "2020",
url = "https://arxiv.org/abs/2010.08240",
}
```
