📢 Spark NLP: Enhanced Embeddings, Fixed Attention Masks, and Performance Optimizations

Introduction

We’re excited to introduce Spark NLP 5.5.3, a release featuring critical enhancements and bug fixes for several of our Text Embeddings annotators. These improvements make your NLP workflows more reliable and efficient.

But that’s not all: we’re also celebrating a major milestone, crossing 100,000 truly free and open models on our Models Hub! This achievement underscores our commitment to making state-of-the-art NLP accessible to everyone, forever.

Upgrade today to take advantage of these enhancements, and thank you for being part of the Spark NLP community. Your support and contributions continue to drive innovation forward!

🔥 Highlights

Enhanced BGE Embeddings with configurable pooling strategies
Fixed attention mask padding across multiple embedding models
Major performance optimizations for transformer models
Improved model default configurations and traits

🚀 New Features

Enhanced BGE Embeddings

Previously, BGE embeddings used a single fixed pooling strategy that did not match every model variant. For some models this produced embeddings with a cosine similarity of only about 0.97 against the original implementation. Different BGE models are trained with different pooling strategies: some use CLS token pooling, while others use attention-based average pooling.

Added new useCLSToken parameter to control embedding pooling strategy
Changed default pretrained model from "bge_base" to "bge_small_en_v1.5"

val embeddings = BGEEmbeddings.pretrained("bge_small_en_v1.5")
  .setUseCLSToken(true)  // Use CLS token pooling (default)
  .setInputCols("document")
  .setOutputCol("embeddings")
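
To make the difference between the two pooling strategies concrete, here is a minimal, library-independent sketch in plain Scala. It is not Spark NLP's internal implementation: the object and method names (PoolingSketch, clsPooling, averagePooling) and the toy vectors are hypothetical and exist only for illustration. It assumes the model emits one vector per token, with the [CLS] token first.

```scala
object PoolingSketch {
  // Hypothetical token-level model output: one vector per token, [CLS] first.
  type TokenVectors = Array[Array[Float]]

  /** CLS pooling: the vector of the first ([CLS]) token is the sentence embedding. */
  def clsPooling(tokens: TokenVectors): Array[Float] = {
    require(tokens.nonEmpty, "need at least one token")
    tokens.head
  }

  /** Attention-based average pooling: average only the tokens whose mask entry is 1. */
  def averagePooling(tokens: TokenVectors, attentionMask: Array[Int]): Array[Float] = {
    require(tokens.nonEmpty, "need at least one token")
    require(tokens.length == attentionMask.length, "one mask entry per token")
    val dim = tokens.head.length
    val sum = new Array[Float](dim)
    var count = 0
    for ((vec, m) <- tokens.zip(attentionMask) if m == 1) {
      for (i <- 0 until dim) sum(i) += vec(i)
      count += 1
    }
    // Guard against an all-zero mask to avoid dividing by zero.
    sum.map(_ / math.max(count, 1))
  }

  def main(args: Array[String]): Unit = {
    val tokens: TokenVectors = Array(
      Array(1.0f, 0.0f), // [CLS]
      Array(3.0f, 2.0f), // real token
      Array(9.0f, 9.0f)  // padding slot, excluded by the mask
    )
    val mask = Array(1, 1, 0)
    println(clsPooling(tokens).mkString(", "))           // prints "1.0, 0.0"
    println(averagePooling(tokens, mask).mkString(", ")) // prints "2.0, 1.0"
  }
}
```

The two strategies return different vectors for the same input, which is why a model trained with one strategy should be served with that same strategy.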

🛠 Improvements & Bug Fixes

Attention Mask Fixes

Fixed incorrect padding in attention mask calculations for multiple models:

MPNet
BGE
E5
Mxbai
Nomic
SnowFlake
UAE

This fix ensures consistent results between native implementations and ONNX versions.
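
For readers unfamiliar with attention masks, the sketch below shows what a mask is meant to encode when sequences of different lengths are padded into one batch. The release notes describe the defect only as a wrong attention mask calculation, so this does not reproduce the actual bug or the actual fix. All names (AttentionMaskSketch, padBatch, attentionMasks), the token ids, and the pad id of 0 are assumptions made for illustration.

```scala
object AttentionMaskSketch {
  /** Pads every sequence of token ids to the length of the longest one. */
  def padBatch(batch: Seq[Array[Int]], padId: Int): Seq[Array[Int]] = {
    require(batch.nonEmpty, "batch must not be empty")
    val maxLen = batch.map(_.length).max
    batch.map(ids => ids.padTo(maxLen, padId))
  }

  /** Builds one mask per sequence: 1 for each real token, 0 for each padded slot. */
  def attentionMasks(batch: Seq[Array[Int]]): Seq[Array[Int]] = {
    require(batch.nonEmpty, "batch must not be empty")
    val maxLen = batch.map(_.length).max
    batch.map(ids => Array.fill(ids.length)(1) ++ Array.fill(maxLen - ids.length)(0))
  }

  def main(args: Array[String]): Unit = {
    // Made-up token ids; the pad id 0 is an assumption, not a Spark NLP constant.
    val batch = Seq(Array(11, 12, 13), Array(11, 12, 14, 15, 13))
    val padded = padBatch(batch, padId = 0)
    val masks = attentionMasks(batch)
    padded.zip(masks).foreach { case (ids, mask) =>
      println(s"ids=${ids.mkString(" ")} mask=${mask.mkString(" ")}")
    }
    // prints "ids=11 12 13 0 0 mask=1 1 1 0 0"
    // prints "ids=11 12 14 15 13 mask=1 1 1 1 1"
  }
}
```

If padded slots are marked as real tokens, they leak into attention and into any mask-weighted pooling, so the resulting embedding depends on how much padding the batch happened to need.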

Other Fixes

Fixed Llama3 download issues in Python
Optimized OpenVINO and ONNX inference paths
Cleaned up and standardized code

🔄 Breaking Changes

BGE Embeddings Updates

Default Model Change:
- Old default: "bge_base"
- New default: "bge_small_en_v1.5"
- Action required: Explicitly specify "bge_base" if needed
Pooling Strategy:
- New useCLSToken parameter defaults to True
- May affect embedding calculations
- Action required: Verify existing implementations and set parameter explicitly if needed
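
One possible way to verify an existing implementation is to compare embeddings stored before the upgrade with freshly computed ones for the same text. The helper below is a sketch under stated assumptions: EmbeddingDriftCheck, cosineSimilarity, hasDrifted, and the 0.999 threshold are hypothetical choices, not part of Spark NLP, and the vectors in main are toy values.

```scala
object EmbeddingDriftCheck {
  /** Cosine similarity of two vectors; returns 0.0 if either vector is all zeros. */
  def cosineSimilarity(a: Array[Float], b: Array[Float]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    for (i <- a.indices) {
      dot += a(i) * b(i)
      normA += a(i) * a(i)
      normB += b(i) * b(i)
    }
    if (normA == 0.0 || normB == 0.0) 0.0
    else dot / (math.sqrt(normA) * math.sqrt(normB))
  }

  /** True when the fresh embedding differs noticeably from the stored one.
    * The default threshold is an illustrative assumption; tune it for your data. */
  def hasDrifted(stored: Array[Float], fresh: Array[Float], threshold: Double = 0.999): Boolean =
    cosineSimilarity(stored, fresh) < threshold

  def main(args: Array[String]): Unit = {
    val stored = Array(0.10f, 0.20f, 0.30f) // toy embedding saved before the upgrade
    val fresh  = Array(0.10f, 0.25f, 0.28f) // toy embedding computed after the upgrade
    println(f"cosine similarity = ${cosineSimilarity(stored, fresh)}%.4f")
    println(s"drifted = ${hasDrifted(stored, fresh)}")
  }
}
```

If a stored index shows drift, either set the model name and useCLSToken explicitly to match the configuration the index was built with, or recompute the index with the new settings.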

💡 Usage Examples

Specifying BGE Model Version

// Using new default
val embeddingsNew = BGEEmbeddings.pretrained()

// Using previous default explicitly
val embeddingsOld = BGEEmbeddings.pretrained("bge_base")

Configuring Pooling Strategy

// Using CLS token pooling
val embeddingsCLS = BGEEmbeddings.pretrained()
  .setUseCLSToken(true)

// Using attention-based average pooling
val embeddingsAvg = BGEEmbeddings.pretrained()
  .setUseCLSToken(false)

❤️ Community support

Slack: For live discussion with the Spark NLP community and the team
GitHub: Bug reports, feature requests, and contributions
Discussions: Engage with other community members, share ideas, and show off how you use Spark NLP!
Medium: Spark NLP articles
JohnSnowLabs: Official Medium
YouTube: Spark NLP video tutorials

Installation

Python

#PyPI

pip install spark-nlp==5.5.3

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.3

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.5.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.5.3

Apple Silicon (M1 & M2)

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.5.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.5.3

AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.5.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.5.3

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>5.5.3</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>5.5.3</version>
</dependency>

spark-nlp-silicon:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>5.5.3</version>
</dependency>

spark-nlp-aarch64:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>5.5.3</version>
</dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.5.3.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-5.5.3.jar
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-5.5.3.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-5.5.3.jar

What's Changed

Models hub by @maziyarpanahi in #14490
LLama3 download fix by @C-K-Loan in #14508
Bugfix: wrong attention mask calculation resulted in wrong embeddings by @maziyarpanahi in #14496
Models hub by @maziyarpanahi in #14513
Release/553 release candidate by @maziyarpanahi in #14511

Full Changelog: 5.5.2...5.5.3
