fully support float16 and bfloat16 for embeddings (or at least float16) #3155
Correct me if I'm mistaken, but it seems that the only way to get embeddings in bfloat16 or float16 would be through the complex process of specifying None for the "precision" parameter and so on, according to the visualization below. I'd be interested in hearing from you, for better or for worse, whether you'll explicitly support outputting float16 and/or bfloat16 embeddings, not merely being able to specify the "dtype" of the embedding model. Thanks. Mermaid graph as a courtesy:

```mermaid
graph TD
A[The embeddings will initially have the same datatype as the embedding model, but this can be controlled with the 'dtype' parameter] --> B{Initial Datatype<br>e.g. float32/float16/bfloat16<br><br>Initially a PyTorch tensor}
B -->|Initially float32| C{convert_to_numpy parameter?}
B -->|Initially float16| D{convert_to_numpy parameter?}
B -->|Initially bfloat16| E{convert_to_numpy parameter?}
C -->|True| F[Convert to float32 numpy array]
C -->|False| G[Keep as float32 PyTorch tensor]
C -->|Not Specified| G
D -->|True| H[Convert to float32 numpy array]
D -->|False| I[Keep as float16 PyTorch tensor]
D -->|Not Specified| I
E -->|True| J[Convert to float32 numpy array]
E -->|False| K[Keep as bfloat16 PyTorch tensor]
E -->|Not Specified| K
F --> L{Precision parameter within the 'encode' method?}
G --> L
H --> L
I --> L
J --> L
K --> L
L -->|'None' Value Used| M[Keep current datatype and format]--> S
L -->|Parameter not used| N[Originally float16 embedding converted to float16 numpy array.<br><br>Originally float32 or bfloat16 embeddings converted to float32 numpy array]--> S
L -->|Explicit value used| O{Precision parameter accepts 'float32,' 'int8,' 'uint8,' 'binary,' and 'ubinary'}
O -->|float32| P[Converted to float32 numpy array]--> S
O -->|int8/uint8| Q[Converted to float32 numpy array then Linear Quantization to 8-bit integers]--> S
O -->|binary/ubinary| R[Converted to float32 numpy array then Binary Quantization packed bits]--> S
S{convert_to_tensor parameter?}
S -->|True| T[Converted to a single stacked PyTorch tensor<br><br>Overrides any 'convert_to_numpy' setting]
S -->|False| U{convert_to_numpy parameter?}
S -->|Not Specified| U
U -->|True| V[Convert to a single numpy array]
U -->|False| W[Remain a list of PyTorch tensors]
U -->|Not Specified| W
style A fill:#2C3E50,stroke:#fff,color:#fff
style B fill:#34495E,stroke:#fff,color:#fff
style C fill:#34495E,stroke:#fff,color:#fff
style D fill:#34495E,stroke:#fff,color:#fff
style E fill:#34495E,stroke:#fff,color:#fff
style F fill:#2980B9,stroke:#fff,color:#fff
style G fill:#2980B9,stroke:#fff,color:#fff
style H fill:#2980B9,stroke:#fff,color:#fff
style I fill:#2980B9,stroke:#fff,color:#fff
style J fill:#2980B9,stroke:#fff,color:#fff
style K fill:#2980B9,stroke:#fff,color:#fff
style L fill:#34495E,stroke:#fff,color:#fff
style M fill:#16A085,stroke:#fff,color:#fff
style N fill:#16A085,stroke:#fff,color:#fff
style O fill:#34495E,stroke:#fff,color:#fff
style P fill:#16A085,stroke:#fff,color:#fff
style Q fill:#16A085,stroke:#fff,color:#fff
style R fill:#16A085,stroke:#fff,color:#fff
style S fill:#34495E,stroke:#fff,color:#fff
style T fill:#27AE60,stroke:#fff,color:#fff
style U fill:#34495E,stroke:#fff,color:#fff
style V fill:#27AE60,stroke:#fff,color:#fff
style W fill:#27AE60,stroke:#fff,color:#fff
```
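For anyone who wants to check the flow above locally, here is a minimal sketch. It assumes model.half() (inherited from torch.nn.Module) is how the model gets cast to float16, as in the snippet later in this thread; the printed dtypes aren't asserted here, the point is just to inspect what encode() returns under each setting. Note that float16 inference on CPU may not work on older torch versions.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
model.half()  # cast the underlying torch modules to float16

# Default path: convert_to_numpy=True, no explicit precision.
emb_default = model.encode("Hello!")
print(type(emb_default), emb_default.dtype)

# Skip the numpy conversion and keep the raw torch tensor.
emb_torch = model.encode("Hello!", convert_to_numpy=False)
print(type(emb_torch), emb_torch.dtype)

# Explicit precision: only float32, int8, uint8, binary, ubinary are accepted today.
emb_f32 = model.encode("Hello!", precision="float32")
print(type(emb_f32), emb_f32.dtype)
```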
Since bfloat16 support is a little more difficult and might require TensorFlow, can we at least get direct support for saving embeddings as float16? That would halve the storage required without having to rely on quantized embeddings, for those of us who want a middle ground. See, e.g., this pymilvus example for the TensorFlow issue: https://github.com/milvus-io/pymilvus/blob/master/examples/datatypes/bfloat16_example.py
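As a stopgap, the float32 output can be cast down to float16 before saving, which does halve the storage, although the embeddings are still computed and returned in float32 first, so it is not the same as native float16 support. A rough sketch (the 384-dimension and byte counts assume all-MiniLM-L6-v2):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["first sentence", "second sentence"]

emb32 = model.encode(sentences)       # float32 numpy array, shape (2, 384)
emb16 = emb32.astype(np.float16)      # same values, half the bytes

print(emb32.nbytes, emb16.nbytes)     # 3072 vs. 1536 for these two sentences
np.save("embeddings_fp16.npy", emb16)
```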
Any thoughts on this?
Apologies for the delay, I've been a bit distracted with my Static Embeddings blogpost. I'm a little unsure about what you'd like to change. These are the outputs:
The value marked with a * is the only one that's perhaps a bit unexpected. In short, I believe that this is not true:
Snippets:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
# model.half()
model.bfloat16()

output = model.encode("Hello!", convert_to_tensor=True)
print(type(output))
print(output.dtype)
```
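For comparison, here is the same experiment sketched with the default numpy path instead of convert_to_tensor=True. NumPy has no bfloat16 dtype, so the array can't stay in bfloat16 regardless of what the model was cast to; the prints simply show what it ends up as:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
model.bfloat16()

# convert_to_numpy defaults to True; numpy cannot represent bfloat16,
# so the returned array necessarily has some other dtype.
output = model.encode("Hello!")
print(type(output))
print(output.dtype)
```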
I'll review this and may or may not follow up... but coincidentally, I was actually just reading your blog post about the new kinds of embedding models.
Currently, sentence-transformers' encode method only supports 'float32' and select quantized formats (int8, uint8, binary, ubinary) when returning embeddings. However, many embedding models can create embeddings in float16 and bfloat16, and in fact Sentence Transformers supports loading models with those dtypes.
Further, I understand that float16 is supported by NumPy but bfloat16 is not.
Therefore, even if a model is loaded into Sentence Transformers with a dtype of float16 or bfloat16 and initially produces embeddings in that dtype, the embeddings themselves end up converted to float32 or one of the quantized formats. There is currently no way to keep them in float16 or bfloat16 throughout the entire process, even though many vector databases can handle these datatypes.
NumPy apparently already supports float16, and the ml_dtypes library (https://github.com/jax-ml/ml_dtypes) could in theory be used to also support bfloat16.
Can we please at least get "float16" accepted as a "precision" parameter argument, since NumPy already supports it? It would be nice to support bfloat16 as well, though.
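To illustrate the ml_dtypes point, here is a small sketch of what the library adds on the NumPy side; this only shows how bfloat16 can be represented in numpy arrays, not anything sentence-transformers does today:

```python
import numpy as np
import ml_dtypes  # https://github.com/jax-ml/ml_dtypes

# ml_dtypes provides a numpy-compatible bfloat16 dtype.
x = np.array([0.1, 0.2, 0.3], dtype=np.float32)
x_bf16 = x.astype(ml_dtypes.bfloat16)

print(x_bf16.dtype)               # bfloat16
print(x_bf16.itemsize)            # 2 bytes per element, vs. 4 for float32
print(x_bf16.astype(np.float32))  # cast back up, with the usual rounding loss
```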