Merge branch 'dev-0.4.1'

UKPLab · Jan 4, 2021 · de558ab
2 parents 10ed2a8 + 195784b
commit de558ab

Showing 81 changed files with 1,031 additions and 766 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,4 +1,5 @@
 .idea
+.vscode
 *.pyc
 *.gz
 *.tsv

diff --git a/docs/package_reference/cross_encoder.md b/docs/package_reference/cross_encoder.md
@@ -10,6 +10,7 @@ For an introduction to Cross-Encoders, see [Cross-Encoders](../usage/cross-encod
 CrossEncoders have their own evaluation classes, which are in `sentence_transformers.cross_encoder.evaluation`.
 
 ```eval_rst
+.. autoclass:: sentence_transformers.cross_encoder.evaluation.CEBinaryAccuracyEvaluator
 .. autoclass:: sentence_transformers.cross_encoder.evaluation.CEBinaryClassificationEvaluator
 .. autoclass:: sentence_transformers.cross_encoder.evaluation.CECorrelationEvaluator
 .. autoclass:: sentence_transformers.cross_encoder.evaluation.CESoftmaxAccuracyEvaluator

diff --git a/docs/package_reference/datasets.md b/docs/package_reference/datasets.md
@@ -2,11 +2,6 @@
 `sentence_transformers.datasets` contains classes to organize your training input examples.
 
 
-## SentencesDataset
-`SentencesDataset` is the main class to store training classes for training. For details, see [training overview](../training/overview.md). 
-```eval_rst
-.. autoclass:: sentence_transformers.datasets.SentencesDataset
-```
 
 ## ParallelSentencesDataset
 `ParallelSentencesDataset` is used for multilingual training. For details, see [multilingual training](../../examples/training/multilingual/README.md).

diff --git a/docs/package_reference/models.md b/docs/package_reference/models.md
@@ -10,6 +10,7 @@
 
 ## Further Classes
 ```eval_rst
+.. autoclass:: sentence_transformers.models.Asym
 .. autoclass:: sentence_transformers.models.BoW
 .. autoclass:: sentence_transformers.models.CNN
 .. autoclass:: sentence_transformers.models.LSTM

diff --git a/docs/pretrained_cross-encoders.md b/docs/pretrained_cross-encoders.md
@@ -7,41 +7,77 @@ This page lists available **pretrained Cross-Encoders**. Cross-Encoders require
 
 ## STSbenchmark
 The following models can be used like this:
-```
+```python
 from sentence_transformers import CrossEncoder
 model = CrossEncoder('model_name')
 scores = model.predict([('Sent A1', 'Sent B1'), ('Sent A2', 'Sent B2')])
 ```
 
 They return a score  0...1 indicating the semantic similarity of the given sentence pair.
-- **sentence-transformers/ce-distilroberta-base-stsb** - STSbenchmark test performance: 87.92
-- **sentence-transformers/ce-roberta-base-stsb** - STSbenchmark test performance: 90.17
-- **sentence-transformers/ce-roberta-large-stsb** - STSbenchmark test performance: 91.47 
+- **cross-encoder/stsb-TinyBERT-L-4** - STSbenchmark test performance: 85.50
+- **cross-encoder/stsb-distilroberta-base** - STSbenchmark test performance: 87.92
+- **cross-encoder/stsb-roberta-base** - STSbenchmark test performance: 90.17
+- **cross-encoder/stsb-roberta-large** - STSbenchmark test performance: 91.47 
 
 ## Quora Duplicate Questions
 These models have been trained on the [Quora duplicate questions dataset](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs). They can be used like the STSb models and give a score 0...1 indicating the probability that two questions are duplicate questions.
 
-- **sentence-transformers/ce-distilroberta-base-quora** - Average Precision dev set: 87.48
-- **sentence-transformers/ce-roberta-base-quora** - Average Precision dev set: 87.80
-- **sentence-transformers/ce-roberta-large-quora** - Average Precision dev set: 87.91
+- **cross-encoder/quora-distilroberta-base** - Average Precision dev set: 87.48
+- **cross-encoder/quora-roberta-base** - Average Precision dev set: 87.80
+- **cross-encoder/quora-roberta-large** - Average Precision dev set: 87.91
 
+Note: These models don't work for question similarity. The questions *How to learn Java* and *How to learn Python* will get a low score, as these questions are not duplicates. For question similarity, the respective bi-encoder trained on the Quora dataset yields much more meaningful results.
 
 ## Information Retrieval
 
 The following models are trained for Information Retrieval: Given a query (like key-words or a question), and a paragraph, can the query be answered by the paragraph? The models have been trained on MS Marco, a large dataset with real-user queries from the Bing search engine.
 
 The models can be used like this:
-```
+```python
 from sentence_transformers import CrossEncoder
 model = CrossEncoder('model_name', max_length=512)
-scores = model.predict([('Query', 'Paragraph1'), ('Query', 'Paragraph2')])
+scores = model.predict([('Query1', 'Paragraph1'), ('Query2', 'Paragraph2')])
+
+#For Example
+scores = model.predict([('How many people live in Berlin?', 'Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.'), 
+                        ('What is the size of New York?', 'New York City is famous for the Metropolitan Museum of Art.')])
 ```
 
 This returns a score 0...1 indicating if the paragraph is relevant for a given query.
 
-- **sentence-transformers/ce-ms-marco-TinyBERT-L-2** - MRR@10 on MS Marco Dev Set: 30.15
-- **sentence-transformers/ce-ms-marco-TinyBERT-L-4** -  MRR@10 on MS Marco Dev Set: 34.50
-- **sentence-transformers/ce-ms-marco-TinyBERT-L-6** - MRR@10 on MS Marco Dev Set: 36.13
-- **sentence-transformers/ce-ms-marco-electra-base** - MRR@10 on MS Marco Dev Set: 36.41
 
-For details on the usage, see [Applications - Information Retrieval](../examples/applications/information-retrieval/README.md)
+For details on the usage, see [Applications - Information Retrieval](../examples/applications/information-retrieval/README.md)
+
+
+### MS MARCO
+[MS MARCO Passage Retrieval](https://github.com/microsoft/MSMARCO-Passage-Ranking) is a large dataset with real user queries from the Bing search engine, annotated with relevant text passages.
+- **cross-encoder/ms-marco-TinyBERT-L-2** - MRR@10 on MS Marco Dev Set: 30.15
+- **cross-encoder/ms-marco-TinyBERT-L-4** - MRR@10 on MS Marco Dev Set: 34.50
+- **cross-encoder/ms-marco-TinyBERT-L-6** - MRR@10 on MS Marco Dev Set: 36.13
+- **cross-encoder/ms-marco-electra-base** - MRR@10 on MS Marco Dev Set: 36.41
+
+### SQuAD (QNLI)
+
+QNLI is based on the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/) and was introduced by the [GLUE Benchmark](https://arxiv.org/abs/1804.07461). Given a passage from Wikipedia, annotators created questions that are answerable by that passage.
+
+- **cross-encoder/qnli-distilroberta-base** - Accuracy on QNLI dev set: 90.96
+- **cross-encoder/qnli-electra-base** - Accuracy on QNLI dev set: 93.21
+
+
+
+## NLI
+Given two sentences, do they contradict each other, does one entail the other, or are they neutral? The following models were trained on the [SNLI](https://nlp.stanford.edu/projects/snli/) and [MultiNLI](https://cims.nyu.edu/~sbowman/multinli/) datasets.
+- **cross-encoder/nli-distilroberta-base** - Accuracy on MNLI mismatched set: 83.98
+- **cross-encoder/nli-roberta-base** - Accuracy on MNLI mismatched set: 87.47
+- **cross-encoder/nli-deberta-base** - Accuracy on MNLI mismatched set: 88.08
+
+```python
+from sentence_transformers import CrossEncoder
+model = CrossEncoder('model_name')
+scores = model.predict([('A man is eating pizza', 'A man eats something'), ('A black race car starts up in front of a crowd of people.', 'A man is driving down a lonely road.')])
+
+#Convert scores to labels
+label_mapping = ['contradiction', 'entailment', 'neutral']
+labels = [label_mapping[score_max] for score_max in scores.argmax(axis=1)]
+```
+
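Reviewer sketch (not part of the commit): the label conversion in the NLI snippet above relies on `model.predict` returning one row of three scores per sentence pair. Assuming that shape and the label order given in the docs, the mapping can be checked without downloading a model. The score values and the `argmax` helper below are made up for illustration.

```python
# Hypothetical scores for two sentence pairs; each row is assumed to follow
# the order used in the docs: [contradiction, entailment, neutral].
scores = [
    [-2.1, 4.3, 0.2],   # made-up values
    [3.7, -1.5, 0.9],   # made-up values
]

label_mapping = ['contradiction', 'entailment', 'neutral']


def argmax(row):
    """Return the index of the largest value in a list of floats."""
    return max(range(len(row)), key=row.__getitem__)


# Pick the label whose score is highest in each row
labels = [label_mapping[argmax(row)] for row in scores]
print(labels)
```

With a real model the rows would come from `model.predict(...)`, and `scores.argmax(axis=1)` as in the docs does the same job on a NumPy array.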
diff --git a/docs/training/overview.md b/docs/training/overview.md
@@ -57,19 +57,16 @@ For all available building blocks see [» Models Package Reference](../package_r
  To represent our training data, we use the `InputExample` class to store training examples. As parameters, it accepts texts, which is a list of strings representing our pairs (or triplets). Further, we can also pass a label (either float or int). The following shows a simple example, where we pass text pairs to `InputExample` together with a label indicating the semantic similarity.
 
  ```python
-from sentence_transformers import SentenceTransformer, SentencesDataset, InputExample
+from sentence_transformers import SentenceTransformer, InputExample
 from torch.utils.data import DataLoader
 
 model = SentenceTransformer('distilbert-base-nli-mean-tokens')
 train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
     InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
-train_dataset = SentencesDataset(train_examples, model)
-train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=16)
+train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
  ```
 
-To prepare the examples for training, we provide a custom `SentencesDataset`, which is a [custom PyTorch dataset](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html). It accepts as parameters the list with `InputExamples` and the `SentenceTransformer` model.
-
-We can wrap `SentencesDataset` with the standard PyTorch `DataLoader`, which produces for example batches and allows us to shuffle the data for training.
+We wrap our `train_examples` with the standard PyTorch `DataLoader`, which shuffles our data and produces batches of the specified size.
 
 
 
@@ -92,7 +89,7 @@ For each sentence pair, we pass sentence A and sentence B through our network wh
 
 A minimal example with `CosineSimilarityLoss` is the following:
 ```python
-from sentence_transformers import SentenceTransformer, SentencesDataset, InputExample, losses
+from sentence_transformers import SentenceTransformer, InputExample, losses
 from torch.utils.data import DataLoader
 
 #Define the model. Either from scratch or by loading a pre-trained model
@@ -103,8 +100,7 @@ train_examples = [InputExample(texts=['My first sentence', 'My second sentence']
     InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
 
 #Define your train dataset, the dataloader and the train loss
-train_dataset = SentencesDataset(train_examples, model)
-train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=16)
+train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
 train_loss = losses.CosineSimilarityLoss(model)
 
 #Tune the model
@@ -142,7 +138,7 @@ model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_st
 
 
 ### Continue Training on Other Data
-[training_stsbenchmark_continue_training.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training_transformers/training_stsbenchmark_continue_training.py) shows an example where training on a fine-tuned model is continued. In that example, we use a sentence transformer model that was first fine-tuned on the NLI dataset and then continue training on the training data from the STS benchmark.
+[training_stsbenchmark_continue_training.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark_continue_training.py) shows an example where training on a fine-tuned model is continued. In that example, we use a sentence transformer model that was first fine-tuned on the NLI dataset and then continue training on the training data from the STS benchmark.
 
 First, we load a pre-trained model from the server:
 ```python
@@ -152,9 +148,7 @@ model = SentenceTransformer('bert-base-nli-mean-tokens')
 
 The next steps are as before. We specify training and dev data:
 ```python
-sts_reader = STSBenchmarkDataReader('datasets/stsbenchmark', normalize_scores=True)
-train_data = SentencesDataset(sts_reader.get_examples('sts-train.csv'), model)
-train_dataloader = DataLoader(train_data, shuffle=True, batch_size=train_batch_size)
+train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=train_batch_size)
 train_loss = losses.CosineSimilarityLoss(model=model)
 
 evaluator = EmbeddingSimilarityEvaluator.from_input_examples(sts_reader.get_examples('sts-dev.csv'))

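Reviewer sketch (not part of the commit): the docs change above drops `SentencesDataset` and passes the plain list of `InputExample` objects to `DataLoader`. That only works if a later step turns each batch of examples into model inputs. The following is a conceptual sketch of such a collate step, not the library's implementation; `Example`, `collate`, and `batches` are hypothetical names, and tokenization is omitted.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple, Union


@dataclass
class Example:
    """Hypothetical stand-in for InputExample: a list of texts plus a label."""
    texts: List[str]
    label: Union[int, float]


def collate(batch: List[Example]) -> Tuple[List[List[str]], List[Union[int, float]]]:
    """Regroup a batch into one list of texts per column, plus a list of labels."""
    num_columns = len(batch[0].texts)
    columns = [[example.texts[i] for example in batch] for i in range(num_columns)]
    labels = [example.label for example in batch]
    return columns, labels


def batches(examples: List[Example], batch_size: int) -> Iterator[List[Example]]:
    """Yield consecutive slices of at most batch_size examples (no shuffling)."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


train_examples = [
    Example(texts=['My first sentence', 'My second sentence'], label=0.8),
    Example(texts=['Another pair', 'Unrelated sentence'], label=0.3),
]

for batch in batches(train_examples, batch_size=16):
    columns, labels = collate(batch)
    print(columns, labels)
```

In a real pipeline each column would then be tokenized separately, so that sentence A and sentence B of every pair pass through the network as two aligned batches.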
diff --git a/examples/applications/computing-embeddings/computing_embeddings.py b/examples/applications/computing-embeddings/computing_embeddings.py
@@ -19,7 +19,7 @@
 
 
 # Load pre-trained Sentence Transformer Model (based on DistilBERT). It will be downloaded automatically
-model = SentenceTransformer('paraphrase-distilroberta-base-v1')
+model = SentenceTransformer('average_word_embeddings_glove.6B.300d')
 
 # Embed a list of sentences
 sentences = ['This framework generates embeddings for each input sentence',

diff --git a/examples/applications/cross-encoder/cross-encoder_reranking.py b/examples/applications/cross-encoder/cross-encoder_reranking.py
@@ -22,7 +22,7 @@
 
 # To refine the results, we use a CrossEncoder. A CrossEncoder gets both inputs (input_question, retrieved_question)
 # and outputs a score 0...1 indicating the similarity.
-cross_encoder_model = CrossEncoder('sentence-transformers/ce-roberta-base-stsb')
+cross_encoder_model = CrossEncoder('cross-encoder/roberta-base-stsb')
 
 # Dataset we want to use
 url = "http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv"

diff --git a/examples/applications/cross-encoder/cross-encoder_usage.py b/examples/applications/cross-encoder/cross-encoder_usage.py
@@ -7,7 +7,7 @@
 import numpy as np
 
 # Pre-trained cross encoder
-model = CrossEncoder('sentence-transformers/ce-distilroberta-base-stsb')
+model = CrossEncoder('cross-encoder/distilroberta-base-stsb')
 
 # We want to compute the similarity between the query sentence
 query = 'A man is eating pasta.'

diff --git a/examples/applications/information-retrieval/README.md b/examples/applications/information-retrieval/README.md
@@ -78,10 +78,10 @@ In the following table, we provide various pre-trained Cross-Encoders together w
 
 | Model-Name        | NDCG@10 (TREC DL 19) | MRR@10 (MS Marco Dev)  | Docs / Sec (BertTokenizerFast) | Docs / Sec |
 | ------------- |:-------------| -----| --- | --- |
-| sentence-transformers/ce-ms-marco-TinyBERT-L-2  | 67.43 | 30.15  | 9000 | 780
-| sentence-transformers/ce-ms-marco-TinyBERT-L-4  | 68.09 | 34.50  | 2900 | 760
-| sentence-transformers/ce-ms-marco-TinyBERT-L-6 |  69.57 | 36.13  | 680 | 660
-| sentence-transformers/ce-ms-marco-electra-base | 71.99 | 36.41 | 340 | 340
+| cross-encoder/ms-marco-TinyBERT-L-2  | 67.43 | 30.15  | 9000 | 780
+| cross-encoder/ms-marco-TinyBERT-L-4  | 68.09 | 34.50  | 2900 | 760
+| cross-encoder/ms-marco-TinyBERT-L-6 |  69.57 | 36.13  | 680 | 660
+| cross-encoder/ms-marco-electra-base | 71.99 | 36.41 | 340 | 340
 | *Other models* | | | |
 | nboost/pt-tinybert-msmarco | 63.63 | 28.80 | 2900 | 760
 | nboost/pt-bert-base-uncased-msmarco | 70.94 | 34.75 | 340 | 340|
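Reviewer sketch (not part of the commit): code that still refers to the old MS MARCO model names could translate them through a small lookup table. The pairs below are copied from the table rows changed in this diff; `RENAMED_MODELS` and `resolve_model_name` are hypothetical names, and whether the old names remain downloadable is not stated in the commit.

```python
# Old name -> new name, copied from the table rows changed in this commit.
RENAMED_MODELS = {
    'sentence-transformers/ce-ms-marco-TinyBERT-L-2': 'cross-encoder/ms-marco-TinyBERT-L-2',
    'sentence-transformers/ce-ms-marco-TinyBERT-L-4': 'cross-encoder/ms-marco-TinyBERT-L-4',
    'sentence-transformers/ce-ms-marco-TinyBERT-L-6': 'cross-encoder/ms-marco-TinyBERT-L-6',
    'sentence-transformers/ce-ms-marco-electra-base': 'cross-encoder/ms-marco-electra-base',
}


def resolve_model_name(name: str) -> str:
    """Return the new name for a renamed model, or the name unchanged."""
    return RENAMED_MODELS.get(name, name)


print(resolve_model_name('sentence-transformers/ce-ms-marco-TinyBERT-L-2'))
```

An explicit table is used instead of a string rewrite because the STSb and Quora models in this commit were renamed with a different word order.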

diff --git a/examples/applications/information-retrieval/in_document_search_crossencoder.py b/examples/applications/information-retrieval/in_document_search_crossencoder.py
@@ -7,7 +7,7 @@
 
 The CrossEncoder takes the search query and scores how relevant every passage is for the given query. The five passages with the highest score are then returned.
 
-As CrossEncoder, we use sentence-transformers/ce-ms-marco-TinyBERT-L-2, a BERT model with only 2 layers trained on the MS MARCO dataset. This is an extremely quick model able to score up to 9000 passages per second (on a V100 GPU). You can also use a larger model, which gives better results but is also slower.
+As CrossEncoder, we use cross-encoder/ms-marco-TinyBERT-L-2, a BERT model with only 2 layers trained on the MS MARCO dataset. This is an extremely quick model able to score up to 9000 passages per second (on a V100 GPU). You can also use a larger model, which gives better results but is also slower.
 
 Note: As we score the [query, passage]-pair for every new query, this search method
 becomes at some point inefficient if the document gets too large.
@@ -61,7 +61,7 @@
 
 
 ## Load our cross-encoder. Use fast tokenizer to speed up the tokenization
-model = CrossEncoder('sentence-transformers/ce-ms-marco-TinyBERT-L-2')
+model = CrossEncoder('cross-encoder/ms-marco-TinyBERT-L-2')
 
 ## Some queries we want to search for in the document
 queries = ["How large is Europe?",

diff --git a/examples/applications/information-retrieval/qa_retrieval_simple_wikipedia.py b/examples/applications/information-retrieval/qa_retrieval_simple_wikipedia.py
@@ -7,7 +7,7 @@
 For semantic search, we use SentenceTransformer('msmarco-distilroberta-base-v2') and retrieve
 100 passages that potentially answer the input query.
 
-Next, we use a more powerful CrossEncoder (cross_encoder = CrossEncoder('sentence-transformers/ce-ms-marco-TinyBERT-L-6')) that
+Next, we use a more powerful CrossEncoder (cross_encoder = CrossEncoder('cross-encoder/ms-marco-TinyBERT-L-6')) that
 scores the query and all retrieved passages for their relevancy. The cross-encoder is necessary to filter out certain noise
 that might be retrieved from the semantic search step.
 """
@@ -22,7 +22,7 @@
 top_k = 100     #Number of passages we want to retrieve with the bi-encoder
 
 #The bi-encoder will retrieve 100 documents. We use a cross-encoder, to re-rank the results list to improve the quality
-cross_encoder = CrossEncoder('sentence-transformers/ce-ms-marco-TinyBERT-L-6')
+cross_encoder = CrossEncoder('cross-encoder/ms-marco-TinyBERT-L-6')
 
 # As dataset, we use Simple English Wikipedia. Compared to the full English wikipedia, it has only
 # about 170k articles. We split these articles into paragraphs and encode them with the bi-encoder

diff --git a/examples/evaluation/evaluation_stsbenchmark.py b/examples/evaluation/evaluation_stsbenchmark.py
@@ -7,7 +7,7 @@
 python evaluation_stsbenchmark.py model_name
 """
 from torch.utils.data import DataLoader
-from sentence_transformers import SentenceTransformer,  SentencesDataset, LoggingHandler
+from sentence_transformers import SentenceTransformer,  LoggingHandler
 from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
 from sentence_transformers.readers import STSBenchmarkDataReader
 import logging

diff --git a/examples/evaluation/evaluation_stsbenchmark_sbert-wk.py b/examples/evaluation/evaluation_stsbenchmark_sbert-wk.py
@@ -7,7 +7,7 @@
 Hence, WKPooling runs on the GPU, which makes it rather inefficient.
 """
 from torch.utils.data import DataLoader
-from sentence_transformers import SentenceTransformer,  SentencesDataset, LoggingHandler, models
+from sentence_transformers import SentenceTransformer, LoggingHandler, models
 from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
 from sentence_transformers.readers import STSBenchmarkDataReader
 import logging

diff --git a/examples/evaluation/evaluation_translation_matching.py b/examples/evaluation/evaluation_translation_matching.py
@@ -34,6 +34,8 @@
                     level=logging.INFO,
                     handlers=[LoggingHandler()])
 
+logger = logging.getLogger(__name__)
+
 model_name = sys.argv[1]
 filepaths = sys.argv[2:]
 inference_batch_size = 32
@@ -51,7 +53,7 @@
                 src_sentences.append(splits[0])
                 trg_sentences.append(splits[1])
 
-    logging.info(os.path.basename(filepath)+": "+str(len(src_sentences))+" sentence pairs")
+    logger.info(os.path.basename(filepath)+": "+str(len(src_sentences))+" sentence pairs")
     dev_trans_acc = evaluation.TranslationEvaluator(src_sentences, trg_sentences, name=os.path.basename(filepath), batch_size=inference_batch_size)
     dev_trans_acc(model)
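Reviewer sketch (not part of the commit): the change above swaps calls on the `logging` module, which go to the root logger, for a module-level `logger`. A minimal standalone version of that pattern might look as follows; `report_pairs` and the file path are hypothetical, and the format string is an assumption rather than the one used in the example script.

```python
import logging
import os

# Configure the root handler once, at program start.
logging.basicConfig(level=logging.INFO,
                    format='%(name)s - %(levelname)s - %(message)s')

# Module-level logger: records carry this module's name instead of 'root'.
logger = logging.getLogger(__name__)


def report_pairs(filepath: str, num_pairs: int) -> None:
    """Log how many sentence pairs were read from a file (hypothetical helper)."""
    logger.info("%s: %d sentence pairs", os.path.basename(filepath), num_pairs)


report_pairs("data/dev.en-de.tsv", 1000)
```

Because each record is tagged with the logger name, output from this script can be filtered or silenced separately from library logging.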