Spark NLP 5.5.3: Enhanced Embeddings, Fixed Attention Masks, Performance Optimizations, and 100K Free Models
📢 Spark NLP: Enhanced Embeddings, Fixed Attention Masks, and Performance Optimizations
Introduction
We’re excited to announce Spark NLP 5.5.3, featuring critical enhancements and bug fixes for several of our Text Embeddings annotators. These improvements make your NLP workflows more reliable and efficient.
But that’s not all—we’re also celebrating a major milestone: crossing 100,000 truly free and open models on our Models Hub! This achievement underscores our commitment to making state-of-the-art NLP accessible to everyone, forever.
Upgrade today to take advantage of these enhancements, and thank you for being part of the Spark NLP community. Your support and contributions continue to drive innovation forward!
🔥 Highlights
- Enhanced BGE Embeddings with configurable pooling strategies
- Fixed attention mask padding across multiple embedding models
- Major performance optimizations for transformer models
- Improved model default configurations and traits
🚀 New Features
Enhanced BGE Embeddings
Previously, BGE embeddings used a single fixed pooling strategy that did not match every model variant, so some models produced embeddings that diverged from the original implementation (cosine similarity of around 0.97). Different BGE models are trained with different pooling strategies: some use CLS token pooling, while others use attention-based average pooling.
- Added new useCLSToken parameter to control the embedding pooling strategy
- Changed default pretrained model from "bge_base" to "bge_small_en_v1.5"
import com.johnsnowlabs.nlp.embeddings.BGEEmbeddings
val embeddings = BGEEmbeddings.pretrained("bge_small_en_v1.5")
  .setUseCLSToken(true) // Use CLS token pooling (default)
  .setInputCols("document")
  .setOutputCol("embeddings")
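To make the two pooling strategies concrete, here is a minimal sketch in plain Scala with no Spark NLP dependency. It is not the library's implementation: the object name PoolingSketch, the helper names, and the toy 2-dimensional token states are all made up for illustration.

```scala
// Illustrative sketch only; this is NOT the Spark NLP implementation.
object PoolingSketch {
  type Vec = Array[Double]

  // CLS pooling: use the hidden state of the first token ([CLS]) as the sentence embedding.
  def clsPool(tokenStates: Array[Vec]): Vec = tokenStates.head

  // Average pooling: mean of the token states whose attention mask entry is 1.
  def meanPool(tokenStates: Array[Vec], attentionMask: Array[Int]): Vec = {
    require(tokenStates.length == attentionMask.length, "one mask entry per token")
    val dim = tokenStates.head.length
    val sum = Array.fill(dim)(0.0)
    var count = 0
    for ((state, m) <- tokenStates.zip(attentionMask) if m == 1) {
      for (i <- 0 until dim) sum(i) += state(i)
      count += 1
    }
    // Guard against an all-zero mask to avoid dividing by zero.
    sum.map(_ / math.max(count, 1))
  }

  def main(args: Array[String]): Unit = {
    // Three real tokens followed by one padding position (mask = 0).
    val states = Array(
      Array(1.0, 0.0), // [CLS]
      Array(0.0, 2.0),
      Array(2.0, 4.0),
      Array(9.0, 9.0)  // padding, ignored by meanPool
    )
    val mask = Array(1, 1, 1, 0)
    println(clsPool(states).mkString(", "))        // prints "1.0, 0.0"
    println(meanPool(states, mask).mkString(", ")) // prints "1.0, 2.0"
  }
}
```

The two strategies generally give different vectors for the same input, which is why a model trained with one strategy should be served with that same strategy.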
🛠 Improvements & Bug Fixes
Attention Mask Fixes
Fixed incorrect padding in attention mask calculations for multiple models:
- MPNet
- BGE
- E5
- Mxbai
- Nomic
- SnowFlake
- UAE
This fix ensures consistent results between native implementations and ONNX versions.
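The release notes do not spell out the exact bug, so the following is only a generic illustration of why padding must be excluded by the attention mask. It is plain Scala with hypothetical names (AttentionMaskSketch, buildMask, maskedMean) and uses one scalar per token for brevity; it does not reproduce the Spark NLP code.

```scala
// Illustrative sketch only; names and values are hypothetical.
object AttentionMaskSketch {
  // Mask for a sequence padded to maxLen: 1 for real tokens, 0 for padding.
  def buildMask(realTokens: Int, maxLen: Int): Array[Int] =
    Array.tabulate(maxLen)(i => if (i < realTokens) 1 else 0)

  // Mean of the values whose mask entry is 1.
  def maskedMean(values: Array[Double], mask: Array[Int]): Double = {
    val kept = values.zip(mask).collect { case (v, 1) => v }
    if (kept.isEmpty) 0.0 else kept.sum / kept.length
  }

  def main(args: Array[String]): Unit = {
    val values = Array(2.0, 4.0, 0.0, 0.0)               // two real tokens, two padding slots
    val goodMask = buildMask(realTokens = 2, maxLen = 4) // 1, 1, 0, 0
    val badMask = Array(1, 1, 1, 1)                      // padding wrongly treated as real tokens
    println(maskedMean(values, goodMask)) // prints 3.0
    println(maskedMean(values, badMask))  // prints 1.5
  }
}
```

With the wrong mask the padding slots dilute the result, so the same text can yield a different embedding depending on how much padding its batch happens to contain.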
Other Fixes
- Fixed Llama3 download issues in Python
- Optimized OpenVINO and ONNX inference paths
- General code cleanup and standardization
🔄 Breaking Changes
BGE Embeddings Updates
- Default Model Change:
  - Old default: "bge_base"
  - New default: "bge_small_en_v1.5"
  - Action required: Explicitly specify "bge_base" if needed
- Pooling Strategy:
  - New useCLSToken parameter defaults to True
  - May affect embedding calculations
  - Action required: Verify existing implementations and set the parameter explicitly if needed
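One way to carry out the verification step is to compare embeddings for the same text before and after the upgrade using cosine similarity, the metric quoted earlier in these notes. The sketch below is plain Scala with no Spark NLP dependency; the object name CosineCheck and the sample vectors are hypothetical, and in practice the inputs would be embedding arrays taken from your own pipeline output.

```scala
// Illustrative sketch only; CosineCheck and the sample vectors are hypothetical.
object CosineCheck {
  // Cosine similarity between two vectors of equal dimension.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    dot / (normA * normB)
  }

  def main(args: Array[String]): Unit = {
    val before = Array(3.0, 4.0) // stand-in for an embedding computed before upgrading
    val after = Array(4.0, 3.0)  // stand-in for the same text after upgrading
    println(cosine(before, before)) // identical vectors give 1.0
    println(cosine(before, after))  // 24 / 25, i.e. about 0.96
  }
}
```

The comparison is only meaningful for the same model under different settings. Vectors from different models, such as the old and new defaults, generally have different dimensions and cannot be compared this way.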
💡 Usage Examples
Specifying BGE Model Version
// Using new default
val embeddingsNew = BGEEmbeddings.pretrained()
// Using previous default explicitly
val embeddingsOld = BGEEmbeddings.pretrained("bge_base")
Configuring Pooling Strategy
// Using CLS token pooling
val embeddingsCLS = BGEEmbeddings.pretrained()
  .setUseCLSToken(true)
// Using attention-based average pooling
val embeddingsAvg = BGEEmbeddings.pretrained()
  .setUseCLSToken(false)
❤️ Community support
- Slack: For live discussion with the Spark NLP community and the team
- GitHub: Bug reports, feature requests, and contributions
- Discussions: Engage with other community members, share ideas, and show off how you use Spark NLP!
- Medium: Spark NLP articles
- JohnSnowLabs: official Medium
- YouTube: Spark NLP video tutorials
Installation
Python
#PyPI
pip install spark-nlp==5.5.3
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x (Scala 2.12):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.3
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.5.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.5.3
Apple Silicon (M1 & M2)
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.5.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.5.3
AArch64
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.5.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.5.3
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>5.5.3</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>5.5.3</version>
</dependency>
spark-nlp-silicon:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-silicon_2.12</artifactId>
<version>5.5.3</version>
</dependency>
spark-nlp-aarch64:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-aarch64_2.12</artifactId>
<version>5.5.3</version>
</dependency>
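For sbt projects, the same Maven coordinates can be expressed as a build.sbt line. This is a sketch that assumes your build sets a Scala 2.12 scalaVersion, so that %% resolves to the spark-nlp_2.12 artifact listed above; swap the artifact name for spark-nlp-gpu, spark-nlp-silicon, or spark-nlp-aarch64 as needed.

```scala
// build.sbt (sketch): assumes scalaVersion is a 2.12.x release
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.5.3"
```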
FAT JARs
- CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.5.3.jar
- GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-5.5.3.jar
- M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-5.5.3.jar
- AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-5.5.3.jar
What's Changed
- Models hub by @maziyarpanahi in #14490
- LLama3 download fix by @C-K-Loan in #14508
- Bugfix: wrong attention mask calculation resulted in wrong embeddings by @maziyarpanahi in #14496
- Models hub by @maziyarpanahi in #14513
- Release/553 release candidate by @maziyarpanahi in #14511
Full Changelog: 5.5.2...5.5.3