
Releases: UKPLab/sentence-transformers

v0.4.1 - Faster Tokenization & Asymmetric Models

04 Jan 14:04

Refactored Tokenization

  • Faster tokenization speed: using batched tokenization for training & inference - now, all sentences in a batch are tokenized simultaneously.
  • Usage of the SentencesDataset is no longer needed for training. You can pass your train examples directly to the DataLoader:
from torch.utils.data import DataLoader
from sentence_transformers import InputExample

train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
    InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
  • If you use a custom torch Dataset class: the dataset class must now return InputExample objects instead of tokenized texts (see the sketch below this list)
  • The SentenceLabelDataset class has been updated to the new tokenization flow: it always returns two or more InputExamples with the same label
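
For reference, a minimal sketch (not from the release notes) of such a custom Dataset; the PairDataset helper class below is hypothetical and only illustrates returning InputExamples:

from torch.utils.data import Dataset, DataLoader
from sentence_transformers import InputExample

class PairDataset(Dataset):  # hypothetical helper, for illustration only
    def __init__(self, sentence_pairs, labels):
        self.sentence_pairs = sentence_pairs
        self.labels = labels

    def __len__(self):
        return len(self.sentence_pairs)

    def __getitem__(self, idx):
        text_a, text_b = self.sentence_pairs[idx]
        # Return an InputExample instead of tokenized texts; tokenization now happens per batch
        return InputExample(texts=[text_a, text_b], label=self.labels[idx])

train_dataset = PairDataset([('My first sentence', 'My second sentence')], [0.8])
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=16)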

Asymmetric Models
Adds a new models.Asym class that allows different encoding of sentences based on some key (e.g. query vs. paragraph). Minimal example:

from torch import nn
from sentence_transformers import SentenceTransformer, InputExample, models

base_model = 'distilroberta-base'  # placeholder: any Huggingface transformer model name
word_embedding_model = models.Transformer(base_model, max_seq_length=250)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
d1 = models.Dense(word_embedding_model.get_word_embedding_dimension(), 256, bias=False, activation_function=nn.Identity())
d2 = models.Dense(word_embedding_model.get_word_embedding_dimension(), 256, bias=False, activation_function=nn.Identity())
asym_model = models.Asym({'QRY': [d1], 'DOC': [d2]})
model = SentenceTransformer(modules=[word_embedding_model, pooling_model, asym_model])

# Your input examples have to look like this:
inp_example = InputExample(texts=[{'QRY': 'your query'}, {'DOC': 'your document text'}], label=1)

# Encoding (note: mixed inputs are not allowed)
model.encode([{'QRY': 'your query1'}, {'QRY': 'your query2'}])

Inputs that have the key 'QRY' will be passed through the d1 dense layer, while inputs with the key 'DOC' will be passed through the d2 dense layer.
More documentation on how to design asymmetric models will follow soon.

New Namespace & Models for Cross-Encoder
Cross-Encoder models are now hosted at https://huggingface.co/cross-encoder. Also, new pre-trained models have been added for NLI & QNLI (see the example below).
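
A minimal usage sketch; the concrete model name below is only an assumption, check https://huggingface.co/cross-encoder for the actual list of released models:

from sentence_transformers.cross_encoder import CrossEncoder

# Model name is an assumed example from the cross-encoder namespace
model = CrossEncoder('cross-encoder/nli-distilroberta-base')
scores = model.predict([('A man is eating food.', 'A man is eating a meal.')])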

Logging
Log messages now use a custom logger from logging thanks to PR #623. This allows you to configure which log messages you want to see from which components.
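
For example, the standard logging configuration can now be used to control the verbosity per component (the logger name below is assumed to follow the module path):

import logging

# Global logging setup for your script
logging.basicConfig(format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
                    level=logging.INFO)
# Reduce verbosity of a single component (logger name assumed to match the module path)
logging.getLogger('sentence_transformers.SentenceTransformer').setLevel(logging.WARNING)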

Unit tests
A lot more unit tests have been added, which test the different components of the framework.

v0.4.0 - Upgrade Transformers Version

22 Dec 13:42
  • Updated the dependencies so that it works with Huggingface Transformers version 4. Sentence-Transformers still works with huggingface transformers version 3, but an update to version 4 of transformers is recommended. Future changes might break with transformers version 3.
  • New naming of pre-trained models. Models will be named: {task}-{transformer_model}. So 'bert-base-nli-stsb-mean-tokens' becomes 'stsb-bert-base'. Models will still be available under their old names, but newer models will follow the updated naming scheme (see the example below).
  • New application examples for information retrieval and question answering retrieval, together with respective pre-trained models.
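
For example, using the renamed model mentioned above:

from sentence_transformers import SentenceTransformer

# New naming scheme: {task}-{transformer_model}
model = SentenceTransformer('stsb-bert-base')
# The old name still works:
# model = SentenceTransformer('bert-base-nli-stsb-mean-tokens')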

v0.3.9 - Small updates

18 Nov 08:25

This release only includes some smaller updates:

  • The code was tested with transformers 3.5.1; the requirement was updated so that it works with transformers 3.5.1
  • As some parts and models require PyTorch >= 1.6.0, the requirement was updated to at least PyTorch 1.6.0. Most of the code and models will still work with older PyTorch versions.
  • model.encode() stored the embeddings on the GPU, which required quite a lot of GPU memory when encoding millions of sentences. The embeddings are now moved to CPU once they are computed.
  • The CrossEncoder class now accepts a max_length parameter to control the truncation of inputs
  • The CrossEncoder predict method now has an apply_softmax parameter that applies softmax on top of a multi-class output (see the example below)
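
A minimal sketch of both parameters; the model name is only an illustrative placeholder:

from sentence_transformers.cross_encoder import CrossEncoder

# max_length truncates long inputs; the model name is an illustrative placeholder
model = CrossEncoder('cross-encoder/nli-distilroberta-base', max_length=256)

# apply_softmax converts the multi-class logits into probabilities
probs = model.predict([('A soccer game.', 'Some men are playing a sport.')],
                      apply_softmax=True)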

v0.3.8 - CrossEncoder, Data Augmentation, new Models

19 Oct 14:23
  • Added support for training and using CrossEncoder models
  • Data Augmentation method AugSBERT added
  • New models trained on large-scale paraphrase data: distilroberta-base-paraphrase-v1 and xlm-r-distilroberta-base-paraphrase-v1. They perform much better on internal benchmarks than previous models.
  • New model for Information Retrieval trained on MS Marco: distilroberta-base-msmarco-v1
  • Improved MultipleNegativesRankingLoss loss function: the similarity function can be changed and is now cosine similarity by default (it was dot-product before); further, similarity scores can be multiplied by a scaling factor. This allows the usage of NTXentLoss / InfoNCE loss (see the sketch after this list).
  • New MegaBatchMarginLoss, inspired by the ParaNMT paper.
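
A rough sketch of the updated loss; the exact parameter names (scale, similarity_fct) and default values are assumptions based on later documentation:

from sentence_transformers import SentenceTransformer, losses, util

model = SentenceTransformer('distilroberta-base-paraphrase-v1')

# Cosine similarity is now the default; the scale factor multiplies the similarity
# scores before the cross-entropy, which reproduces the NTXentLoss / InfoNCE setup.
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0,
                                                 similarity_fct=util.pytorch_cos_sim)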

Smaller changes:

  • Updated InformationRetrievalEvaluator so that it can work with large corpora (millions of entries). Removed the query_chunk_size parameter from the evaluator
  • The SentenceTransformer.encode method detaches tensors from the compute graph
  • SentenceTransformer.fit() method: the parameter output_path_ignore_not_empty is deprecated; the method no longer checks that the target folder is empty

v0.3.7 - Upgrade transformers, Model Distillation Example, Multi-Input to Transformers Model

29 Sep 20:17
  • Upgraded the transformers dependency; transformers 3.1.0, 3.2.0, and 3.3.1 are working
  • Added example code for model distillation: sentence embedding models can be drastically reduced to e.g. only 2-4 layers while keeping 98+% of their performance. Code can be found in examples/training/distillation
  • Transformer models can now accept two inputs ['sentence 1', 'context for sent1'], which are encoded as the two inputs for BERT

Minor changes:

  • Tokenization in the multi-processes encoding setup now happens in the child processes, not in the parent process.
  • Added models.Normalize() to allow the normalization of embeddings to unit length
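
A short sketch of where the new module fits in the usual pipeline (the base model name is only a placeholder):

from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer('distilroberta-base')  # placeholder base model
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
normalize = models.Normalize()  # scales each sentence embedding to unit length

model = SentenceTransformer(modules=[word_embedding_model, pooling_model, normalize])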

v0.3.6 - Update transformers to v3.1.0

11 Sep 08:06

Huggingface Transformers version 3.1.0 had a breaking change compared to the previous version 3.0.2.

This release fixes the issue so that Sentence-Transformers is compatible with Huggingface Transformers 3.1.0. Note that this and future versions will not be compatible with transformers < 3.1.0.

v0.3.5 - Automatic Mixed Precision & Bugfixes

01 Sep 13:09
  • The old FP16 training code in model.fit() was replaced by PyTorch 1.6.0 automatic mixed precision (AMP). When setting model.fit(use_amp=True), AMP will be used (see the sketch after this list). On suitable GPUs, this leads to a significant speed-up while requiring less memory.
  • Performance improvements in paraphrase mining & semantic search by replacing np.argpartition with torch.topk
  • If a sentence-transformers model is not found, it will fall back to the huggingface transformers repository and create it with mean pooling.
  • Pinned huggingface transformers to version 3.0.2. The next release will make it compatible with huggingface transformers 3.1.0
  • Several bugfixes: downloading of files, multi-GPU encoding
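
A minimal training sketch with AMP enabled (the model name and data are placeholders):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, SentencesDataset, InputExample, losses

model = SentenceTransformer('bert-base-nli-stsb-mean-tokens')
train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8)]
train_dataset = SentencesDataset(train_examples, model)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

# use_amp=True enables PyTorch 1.6 automatic mixed precision during training
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, use_amp=True)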

v0.3.4 - Improved Documentation, Improved Tokenization Speed, Multi-GPU Encoding

24 Aug 16:24
  • The documentation is substantially improved and can be found at: www.SBERT.net - Feedback welcome
  • The dataset to hold training InputExamples (datasets.SentencesDataset) now uses lazy tokenization, i.e., examples are tokenized once they are needed for a batch. If you set num_workers to a positive integer in your DataLoader, tokenization will happen in background worker processes. This substantially decreases the start-up time for training.
  • model.encode() also uses a PyTorch Dataset + DataLoader. If you set num_workers to a positive integer, tokenization will happen in the background, leading to faster encoding speed for large corpora.
  • Added functions and an example for multi-GPU encoding - this method can be used to encode a corpus with multiple GPUs in parallel (see the sketch after this list). No multi-GPU support for training yet.
  • Removed the parallel_tokenization parameter from encode & SentencesDataset - no longer needed with lazy tokenization and DataLoader workers.
  • Smaller bugfixes
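
A sketch of the multi-GPU encoding flow, assuming the multi-process pool API (start_multi_process_pool / encode_multi_process / stop_multi_process_pool); the model name is a placeholder:

from sentence_transformers import SentenceTransformer

if __name__ == '__main__':  # guard required for multiprocessing
    model = SentenceTransformer('bert-base-nli-stsb-mean-tokens')
    sentences = ['This is sentence {}'.format(i) for i in range(100000)]

    pool = model.start_multi_process_pool()              # one worker per available GPU
    embeddings = model.encode_multi_process(sentences, pool)
    model.stop_multi_process_pool(pool)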

Breaking changes:

  • Renamed evaluation.BinaryEmbeddingSimilarityEvaluator to evaluation.BinaryClassificationEvaluator

v0.3.3 - Multi-Process Tokenization and Information Retrieval Improvements

06 Aug 08:16

New Functions

  • Multi-process tokenization (Linux only) for the model encode function. Significant speed-up when encoding large sets
  • Tokenization of datasets for training can now run in parallel (Linux only)
  • New example for Quora Duplicate Questions Retrieval: See examples-folder
  • Many small improvements for training better models for Information Retrieval
  • Fixed LabelSampler (can be used to get batches with a certain number of matching labels; used for BatchHardTripletLoss). Moved it to the datasets folder
  • Added new Evaluators for ParaphraseMining and InformationRetrieval
  • evaluation.BinaryEmbeddingSimilarityEvaluator no longer assumes a 50-50 split of the dataset. It computes the optimal threshold and measures accuracy
  • model.encode - When the convert_to_numpy parameter is set, the method returns a numpy matrix instead of a list of numpy vectors
  • New function: util.paraphrase_mining to perform paraphrase mining in a corpus (see the sketch after this list). For an example see examples/training_quora_duplicate_questions/
  • New function: util.information_retrieval to perform information retrieval / semantic search in a corpus. For an example see examples/training_quora_duplicate_questions/
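
A short sketch of paraphrase mining; the model name is a placeholder, and the function is assumed to return score/index triplets as in the current API:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('bert-base-nli-stsb-mean-tokens')
sentences = ['The cat sits outside',
             'A man is playing guitar',
             'The feline rests in the garden']

# Each entry is [score, index_1, index_2], sorted by decreasing similarity score
paraphrases = util.paraphrase_mining(model, sentences)
for score, i, j in paraphrases[:3]:
    print(sentences[i], '<->', sentences[j], round(score, 3))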

Breaking Changes

  • The evaluators (like EmbeddingSimilarityEvaluator) no longer accept a DataLoader as argument. Instead, the sentences and scores are passed directly. Old code that uses the previous evaluators needs to be changed; it can use the class method from_input_examples() (see the sketch below). See examples/training_transformers/training_nli.py for how to use the new evaluators.
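
A short sketch of constructing an evaluator via from_input_examples():

from sentence_transformers import InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

dev_examples = [InputExample(texts=['A sentence', 'A very similar sentence'], label=0.9),
                InputExample(texts=['Another sentence', 'Something unrelated'], label=0.1)]

# The evaluator is built from InputExamples instead of a DataLoader
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_examples, name='dev')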

v0.3.2 - Lazy tokenization for Parallel Sentence Training & Improved Semantic Search

23 Jul 15:03

This is a minor release. There should be no breaking changes.

  • ParallelSentencesDataset: Datasets are tokenized on-the-fly, saving some start-up time
  • New util.pytorch_cos_sim method to compute cosine similarity with PyTorch. About 100 times faster than scipy cdist. The semantic_search.py example has been updated accordingly.
  • SentenceTransformer.encode: new parameter convert_to_tensor. If set to True, encode returns one large PyTorch tensor with your embeddings (see the sketch below).
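
A short sketch combining both additions (the model name is a placeholder):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('bert-base-nli-stsb-mean-tokens')

query_embeddings = model.encode(['How big is London?'], convert_to_tensor=True)
corpus_embeddings = model.encode(['London has over 8 million inhabitants',
                                  'Berlin is the capital of Germany'], convert_to_tensor=True)

# Cosine-similarity matrix of shape [num_queries, num_corpus_sentences]
cos_scores = util.pytorch_cos_sim(query_embeddings, corpus_embeddings)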