
Spark NLP 5.5.3: Enhanced Embeddings, Fixed Attention Masks, Performance Optimizations, and 100K Free Models

Released by @maziyarpanahi · 30 Jan 16:15 · commit 7d2bed7

📢 Spark NLP: Enhanced Embeddings, Fixed Attention Masks, and Performance Optimizations

Introduction

We’re excited to introduce the latest release of Spark NLP 5.5.3, featuring critical enhancements and bug fixes for several of our Text Embeddings annotators. These improvements ensure even more reliable and efficient performance for your NLP workflows.

But that’s not all—we’re also celebrating a major milestone: crossing 100,000 truly free and open models on our Models Hub! This achievement underscores our commitment to making state-of-the-art NLP accessible to everyone, forever.

Upgrade today to take advantage of these enhancements, and thank you for being part of the Spark NLP community. Your support and contributions continue to drive innovation forward!

🔥 Highlights

  • Enhanced BGE Embeddings with configurable pooling strategies
  • Fixed attention mask padding across multiple embedding models
  • Major performance optimizations for transformer models
  • Improved model default configurations and traits

🚀 New Features

Enhanced BGE Embeddings

Previously, BGE embeddings used a fixed pooling strategy that did not match all model variants, resulting in suboptimal performance for some models (cosine similarity around 0.97 compared to the original implementation). Different BGE models are trained with different pooling strategies: some use CLS token pooling, while others use attention-based average pooling.

  • Added new useCLSToken parameter to control embedding pooling strategy
  • Changed default pretrained model from "bge_base" to "bge_small_en_v1.5"

import com.johnsnowlabs.nlp.embeddings.BGEEmbeddings

val embeddings = BGEEmbeddings.pretrained("bge_small_en_v1.5")
  .setUseCLSToken(true)  // Use CLS token pooling (default); set to false for attention-based average pooling
  .setInputCols("document")
  .setOutputCol("embeddings")
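
For context, the annotator plugs into a regular Spark NLP pipeline. The sketch below uses the standard DocumentAssembler stage and assumes a running session (e.g. spark-shell started with the spark-nlp package); the column names are illustrative:

import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.embeddings.BGEEmbeddings
import org.apache.spark.ml.Pipeline
import spark.implicits._

// Raw text -> "document" annotations consumed by the embeddings stage
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val bgeEmbeddings = BGEEmbeddings.pretrained("bge_small_en_v1.5")
  .setInputCols("document")
  .setOutputCol("embeddings")
  .setUseCLSToken(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, bgeEmbeddings))
val data     = Seq("query: how do BGE embeddings work?").toDF("text")
val result   = pipeline.fit(data).transform(data)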

🛠 Improvements & Bug Fixes

Attention Mask Fixes

Fixed incorrect padding in attention mask calculations for multiple models:

  • MPNet
  • BGE
  • E5
  • Mxbai
  • Nomic
  • SnowFlake
  • UAE

This fix ensures consistent results between native implementations and ONNX versions.
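
To make the issue concrete: with right-padded batches, every padding position should carry a mask value of 0 so it neither attends nor contributes to average pooling; an incorrectly padded mask lets pad embeddings leak into the pooled sentence vector. A minimal sketch of the intended behaviour, in plain Scala rather than Spark NLP internals:

// Assumed pad id and token ids, for illustration only
val padTokenId = 0
val tokenIds   = Seq(101, 2023, 2003, 102, 0, 0) // a sentence padded to length 6
val attentionMask = tokenIds.map(id => if (id == padTokenId) 0 else 1)
// attentionMask == Seq(1, 1, 1, 1, 0, 0)

// Average pooling must ignore padded positions and divide by the real token count
def maskedMean(vectors: Seq[Array[Float]], mask: Seq[Int]): Array[Float] = {
  val kept = vectors.zip(mask).collect { case (v, 1) => v }
  val dim  = kept.head.length
  val sum  = kept.foldLeft(Array.fill(dim)(0f)) { (acc, v) =>
    acc.zip(v).map { case (a, b) => a + b }
  }
  sum.map(_ / kept.length)
}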

Other Fixes

  • Fixed Llama3 download issues in Python
  • Optimized OpenVINO and ONNX inference paths
  • Enhanced code cleanup and standardization

🔄 Breaking Changes

BGE Embeddings Updates

  1. Default Model Change:

    • Old default: "bge_base"
    • New default: "bge_small_en_v1.5"
    • Action required: Explicitly specify "bge_base" if needed
  2. Pooling Strategy:

    • New useCLSToken parameter defaults to true
    • May affect embedding calculations
    • Action required: Verify existing implementations and set parameter explicitly if needed

💡 Usage Examples

Specifying BGE Model Version

// Using new default
val embeddingsNew = BGEEmbeddings.pretrained()

// Using previous default explicitly
val embeddingsOld = BGEEmbeddings.pretrained("bge_base")

Configuring Pooling Strategy

// Using CLS token pooling
val embeddingsCLS = BGEEmbeddings.pretrained()
  .setUseCLSToken(true)

// Using attention-based average pooling
val embeddingsAvg = BGEEmbeddings.pretrained()
  .setUseCLSToken(false)
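
To check how much the pooling choice actually moves your vectors before committing to one, comparing the two outputs with cosine similarity is usually enough. The helper below is an illustrative snippet, not part of the Spark NLP API:

// Compare an embedding produced with useCLSToken = true against one produced with false
def cosine(a: Array[Float], b: Array[Float]): Double = {
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  dot / (normA * normB)
}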

❤️ Community support

  • Slack: live discussion with the Spark NLP community and the team
  • GitHub: bug reports, feature requests, and contributions
  • Discussions: engage with other community members, share ideas, and show off how you use Spark NLP!
  • Medium: Spark NLP articles
  • JohnSnowLabs: official Medium
  • YouTube: Spark NLP video tutorials

Installation

Python

# PyPI

pip install spark-nlp==5.5.3

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.3

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.5.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.5.3

Apple Silicon (M1 & M2)

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.5.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.5.3

AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.5.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.5.3

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>5.5.3</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>5.5.3</version>
</dependency>

spark-nlp-silicon:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>5.5.3</version>
</dependency>

spark-nlp-aarch64:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>5.5.3</version>
</dependency>
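
With the dependency on the classpath, a Scala application can start a preconfigured session through the SparkNLP entry point; a minimal sketch:

import com.johnsnowlabs.nlp.SparkNLP

// Starts (or reuses) a SparkSession configured for Spark NLP
val spark = SparkNLP.start()
println(SparkNLP.version())  // expected to print 5.5.3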

FAT JARs

What's Changed

Full Changelog: 5.5.2...5.5.3