
Releases: UKPLab/sentence-transformers

v0.4.1 - Faster Tokenization & Asymmetric Models

04 Jan 14:04

Refactored Tokenization

  • Faster tokenization speed: using batched tokenization for training & inference - now, all sentences in a batch are tokenized simultaneously.
  • Usage of the SentencesDataset is no longer needed for training. You can pass your train examples directly to the DataLoader:
from torch.utils.data import DataLoader
from sentence_transformers import InputExample

train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
    InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
  • If you use a custom torch Dataset class: the dataset class must now return InputExample objects instead of tokenized texts (see the sketch below this list)
  • The SentenceLabelDataset class has been updated to the new tokenization flow: it always returns two or more InputExamples with the same label
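
For reference, a minimal sketch (not from the release notes) of such a custom Dataset; the PairDataset helper class below is hypothetical and only illustrates returning InputExamples:

from torch.utils.data import Dataset, DataLoader
from sentence_transformers import InputExample

class PairDataset(Dataset):  # hypothetical helper, for illustration only
    def __init__(self, sentence_pairs, labels):
        self.sentence_pairs = sentence_pairs
        self.labels = labels

    def __len__(self):
        return len(self.sentence_pairs)

    def __getitem__(self, idx):
        text_a, text_b = self.sentence_pairs[idx]
        # Return an InputExample instead of tokenized texts; tokenization now happens per batch
        return InputExample(texts=[text_a, text_b], label=self.labels[idx])

train_dataset = PairDataset([('My first sentence', 'My second sentence')], [0.8])
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=16)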

Asymmetric Models
Adds a new models.Asym class that allows different encoding of sentences based on some key (e.g. query vs. paragraph). Minimal example:

from torch import nn
from sentence_transformers import SentenceTransformer, InputExample, models

base_model = 'distilroberta-base'  # placeholder: any Huggingface transformer model name
word_embedding_model = models.Transformer(base_model, max_seq_length=250)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
d1 = models.Dense(word_embedding_model.get_word_embedding_dimension(), 256, bias=False, activation_function=nn.Identity())
d2 = models.Dense(word_embedding_model.get_word_embedding_dimension(), 256, bias=False, activation_function=nn.Identity())
asym_model = models.Asym({'QRY': [d1], 'DOC': [d2]})
model = SentenceTransformer(modules=[word_embedding_model, pooling_model, asym_model])

# Your input examples have to look like this:
inp_example = InputExample(texts=[{'QRY': 'your query'}, {'DOC': 'your document text'}], label=1)

# Encoding (note: mixed inputs are not allowed)
model.encode([{'QRY': 'your query1'}, {'QRY': 'your query2'}])

Inputs that have the key 'QRY' will be passed through the d1 dense layer, while inputs with the key 'DOC' will be passed through the d2 dense layer.
More documentation on how to design asymmetric models will follow soon.

New Namespace & Models for Cross-Encoder
Cross-Encoder models are now hosted at https://huggingface.co/cross-encoder. Also, new pre-trained models have been added for NLI & QNLI (see the example below).
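
A minimal usage sketch; the concrete model name below is only an assumption, check https://huggingface.co/cross-encoder for the actual list of released models:

from sentence_transformers.cross_encoder import CrossEncoder

# Model name is an assumed example from the cross-encoder namespace
model = CrossEncoder('cross-encoder/nli-distilroberta-base')
scores = model.predict([('A man is eating food.', 'A man is eating a meal.')])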

Logging
Log messages now use a custom logger from logging thanks to PR #623. This allows you to configure which log messages you want to see from which components.
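
For example, the standard logging configuration can now be used to control the verbosity per component (the logger name below is assumed to follow the module path):

import logging

# Global logging setup for your script
logging.basicConfig(format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
                    level=logging.INFO)
# Reduce verbosity of a single component (logger name assumed to match the module path)
logging.getLogger('sentence_transformers.SentenceTransformer').setLevel(logging.WARNING)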

Unit tests
A lot more unit tests have been added, which test the different components of the framework.

v0.4.0 - Upgrade Transformers Version

22 Dec 13:42
  • Updated the dependencies so that it works with Huggingface Transformers version 4. Sentence-Transformers still works with huggingface transformers version 3, but an update to version 4 of transformers is recommended. Future changes might break with transformers version 3.
  • New naming of pre-trained models. Models will be named: {task}-{transformer_model}. So 'bert-base-nli-stsb-mean-tokens' becomes 'stsb-bert-base'. Models will still be available under their old names, but newer models will follow the updated naming scheme (see the example below).
  • New application examples for information retrieval and question answering retrieval, together with respective pre-trained models.
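
For example, using the renamed model mentioned above:

from sentence_transformers import SentenceTransformer

# New naming scheme: {task}-{transformer_model}
model = SentenceTransformer('stsb-bert-base')
# The old name still works:
# model = SentenceTransformer('bert-base-nli-stsb-mean-tokens')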

v0.3.9 - Small updates

18 Nov 08:25

This release only includes some smaller updates:

  • The code was tested with transformers 3.5.1; the requirement was updated so that it works with transformers 3.5.1
  • As some parts and models require PyTorch >= 1.6.0, the requirement was updated to at least PyTorch 1.6.0. Most of the code and models will still work with older PyTorch versions.
  • model.encode() stored the embeddings on the GPU, which required quite a lot of GPU memory when encoding millions of sentences. The embeddings are now moved to CPU once they are computed.
  • The CrossEncoder class now accepts a max_length parameter to control the truncation of inputs
  • The CrossEncoder predict method now has an apply_softmax parameter that applies softmax on top of a multi-class output (see the example below)
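
A minimal sketch of both parameters; the model name is only an illustrative placeholder:

from sentence_transformers.cross_encoder import CrossEncoder

# max_length truncates long inputs; the model name is an illustrative placeholder
model = CrossEncoder('cross-encoder/nli-distilroberta-base', max_length=256)

# apply_softmax converts the multi-class logits into probabilities
probs = model.predict([('A soccer game.', 'Some men are playing a sport.')],
                      apply_softmax=True)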

v0.3.8 - CrossEncoder, Data Augmentation, new Models

19 Oct 14:23
  • Added support for training and using CrossEncoder models
  • Data Augmentation method AugSBERT added
  • New models trained on large-scale paraphrase data: distilroberta-base-paraphrase-v1 and xlm-r-distilroberta-base-paraphrase-v1. They perform much better on internal benchmarks than previous models.
  • New model for Information Retrieval trained on MS Marco: distilroberta-base-msmarco-v1
  • Improved MultipleNegativesRankingLoss loss function: the similarity function can be changed and is now cosine similarity by default (it was dot-product before); further, similarity scores can be multiplied by a scaling factor. This allows the usage of NTXentLoss / InfoNCE loss (see the sketch after this list).
  • New MegaBatchMarginLoss, inspired by the ParaNMT paper.
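
A rough sketch of the updated loss; the exact parameter names (scale, similarity_fct) and default values are assumptions based on later documentation:

from sentence_transformers import SentenceTransformer, losses, util

model = SentenceTransformer('distilroberta-base-paraphrase-v1')

# Cosine similarity is now the default; the scale factor multiplies the similarity
# scores before the cross-entropy, which reproduces the NTXentLoss / InfoNCE setup.
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0,
                                                 similarity_fct=util.pytorch_cos_sim)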

Smaller changes:

  • Updated InformationRetrievalEvaluator so that it can work with large corpora (millions of entries). Removed the query_chunk_size parameter from the evaluator
  • The SentenceTransformer.encode method detaches tensors from the compute graph
  • SentenceTransformer.fit() method: the parameter output_path_ignore_not_empty is deprecated; the method no longer checks that the target folder is empty

v0.3.7 - Upgrade transformers, Model Distillation Example, Multi-Input to Transformers Model

29 Sep 20:17
  • Upgraded the transformers dependency; transformers 3.1.0, 3.2.0, and 3.3.1 are working
  • Added example code for model distillation: sentence embedding models can be drastically reduced to e.g. only 2-4 layers while keeping 98+% of their performance. Code can be found in examples/training/distillation
  • Transformer models can now accept two inputs ['sentence 1', 'context for sent1'], which are encoded as the two inputs for BERT

Minor changes:

  • Tokenization in the multi-processes encoding setup now happens in the child processes, not in the parent process.
  • Added models.Normalize() to allow the normalization of embeddings to unit length
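
A short sketch of where the new module fits in the usual pipeline (the base model name is only a placeholder):

from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer('distilroberta-base')  # placeholder base model
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
normalize = models.Normalize()  # scales each sentence embedding to unit length

model = SentenceTransformer(modules=[word_embedding_model, pooling_model, normalize])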

v0.3.6 - Update transformers to v3.1.0

11 Sep 08:06

Huggingface Transformers version 3.1.0 had a breaking change compared to the previous version 3.0.2.

This release fixes the issue so that Sentence-Transformers is compatible with Huggingface Transformers 3.1.0. Note that this and future versions will not be compatible with transformers < 3.1.0.

v0.3.5 - Automatic Mixed Precision & Bugfixes

01 Sep 13:09
  • The old FP16 training code in model.fit() was replaced by PyTorch 1.6.0 automatic mixed precision (AMP). When setting model.fit(use_amp=True), AMP will be used (see the sketch after this list). On suitable GPUs, this leads to a significant speed-up while requiring less memory.
  • Performance improvements in paraphrase mining & semantic search by replacing np.argpartition with torch.topk
  • If a sentence-transformers model is not found, it will fall back to the huggingface transformers repository and create it with mean pooling.
  • Pinned huggingface transformers to version 3.0.2. The next release will make it compatible with huggingface transformers 3.1.0
  • Several bugfixes: downloading of files, multi-GPU encoding
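
A minimal training sketch with AMP enabled (the model name and data are placeholders):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, SentencesDataset, InputExample, losses

model = SentenceTransformer('bert-base-nli-stsb-mean-tokens')
train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8)]
train_dataset = SentencesDataset(train_examples, model)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

# use_amp=True enables PyTorch 1.6 automatic mixed precision during training
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, use_amp=True)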

v0.3.4 - Improved Documentation, Improved Tokenization Speed, Multi-GPU Encoding

24 Aug 16:24
  • The documentation is substantially improved and can be found at: www.SBERT.net - Feedback welcome
  • The dataset to hold training InputExamples (datasets.SentencesDataset) now uses lazy tokenization, i.e., examples are tokenized once they are needed for a batch. If you set num_workers to a positive integer in your DataLoader, tokenization will happen in background worker processes. This substantially decreases the start-up time for training.
  • model.encode() also uses a PyTorch Dataset + DataLoader. If you set num_workers to a positive integer, tokenization will happen in the background, leading to faster encoding speed for large corpora.
  • Added functions and an example for multi-GPU encoding - this method can be used to encode a corpus with multiple GPUs in parallel (see the sketch after this list). No multi-GPU support for training yet.
  • Removed the parallel_tokenization parameter from encode & SentencesDataset - no longer needed with lazy tokenization and DataLoader workers.
  • Smaller bugfixes
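
A sketch of the multi-GPU encoding flow, assuming the multi-process pool API (start_multi_process_pool / encode_multi_process / stop_multi_process_pool); the model name is a placeholder:

from sentence_transformers import SentenceTransformer

if __name__ == '__main__':  # guard required for multiprocessing
    model = SentenceTransformer('bert-base-nli-stsb-mean-tokens')
    sentences = ['This is sentence {}'.format(i) for i in range(100000)]

    pool = model.start_multi_process_pool()              # one worker per available GPU
    embeddings = model.encode_multi_process(sentences, pool)
    model.stop_multi_process_pool(pool)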

Breaking changes:

  • Renamed evaluation.BinaryEmbeddingSimilarityEvaluator to evaluation.BinaryClassificationEvaluator

v0.3.3 - Multi-Process Tokenization and Information Retrieval Improvements

06 Aug 08:16

New Functions

  • Multi-process tokenization (Linux only) for the model encode function. Significant speed-up when encoding large sets
  • Tokenization of datasets for training can now run in parallel (Linux only)
  • New example for Quora Duplicate Questions Retrieval: See examples-folder
  • Many small improvements for training better models for Information Retrieval
  • Fixed LabelSampler (can be used to get batches with a certain number of matching labels; used for BatchHardTripletLoss). Moved it to the datasets folder
  • Added new Evaluators for ParaphraseMining and InformationRetrieval
  • evaluation.BinaryEmbeddingSimilarityEvaluator no longer assumes a 50-50 split of the dataset. It computes the optimal threshold and measures accuracy
  • model.encode - When the convert_to_numpy parameter is set, the method returns a numpy matrix instead of a list of numpy vectors
  • New function: util.paraphrase_mining to perform paraphrase mining in a corpus (see the sketch after this list). For an example see examples/training_quora_duplicate_questions/
  • New function: util.information_retrieval to perform information retrieval / semantic search in a corpus. For an example see examples/training_quora_duplicate_questions/
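
A short sketch of paraphrase mining; the model name is a placeholder, and the function is assumed to return score/index triplets as in the current API:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('bert-base-nli-stsb-mean-tokens')
sentences = ['The cat sits outside',
             'A man is playing guitar',
             'The feline rests in the garden']

# Each entry is [score, index_1, index_2], sorted by decreasing similarity score
paraphrases = util.paraphrase_mining(model, sentences)
for score, i, j in paraphrases[:3]:
    print(sentences[i], '<->', sentences[j], round(score, 3))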

Breaking Changes

  • The evaluators (like EmbeddingSimilarityEvaluator) no longer accept a DataLoader as argument. Instead, the sentences and scores are passed directly. Old code that uses the previous evaluators needs to be changed; it can use the class method from_input_examples() (see the sketch below). See examples/training_transformers/training_nli.py for how to use the new evaluators.
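
A short sketch of constructing an evaluator via from_input_examples():

from sentence_transformers import InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

dev_examples = [InputExample(texts=['A sentence', 'A very similar sentence'], label=0.9),
                InputExample(texts=['Another sentence', 'Something unrelated'], label=0.1)]

# The evaluator is built from InputExamples instead of a DataLoader
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_examples, name='dev')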

v0.3.2 - Lazy tokenization for Parallel Sentence Training & Improved Semantic Search

23 Jul 15:03

This is a minor release. There should be no breaking changes.

  • ParallelSentencesDataset: Datasets are tokenized on-the-fly, saving some start-up time
  • New util.pytorch_cos_sim method to compute cosine similarity with PyTorch. About 100 times faster than scipy cdist. The semantic_search.py example has been updated accordingly.
  • SentenceTransformer.encode: new parameter convert_to_tensor. If set to True, encode returns one large PyTorch tensor with your embeddings (see the sketch below).
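
A short sketch combining both additions (the model name is a placeholder):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('bert-base-nli-stsb-mean-tokens')

query_embeddings = model.encode(['How big is London?'], convert_to_tensor=True)
corpus_embeddings = model.encode(['London has over 8 million inhabitants',
                                  'Berlin is the capital of Germany'], convert_to_tensor=True)

# Cosine-similarity matrix of shape [num_queries, num_corpus_sentences]
cos_scores = util.pytorch_cos_sim(query_embeddings, corpus_embeddings)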