Integrate with Sentence Transformers v5.4

#6
by tomaarsen HF Staff - opened

Hello!

Pull Request overview

  • Integrate this model as a Sentence Transformers SentenceTransformer

Details

This PR adds the configuration files needed to load this model directly as a SentenceTransformer via Sentence Transformers. The model uses a feature-extraction Transformer with ["hidden_states", -1] output extraction (since the custom LlamaNemotronVLModel returns CausalLMOutputWithPast without last_hidden_state), followed by mean pooling to produce 2048-dimensional embeddings.
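
For readers unfamiliar with the pooling step, here is a minimal sketch of masked mean pooling over the last hidden layer. It uses plain Python lists instead of tensors so it runs without dependencies; the function name `mean_pool` and the toy 2-dimensional vectors are illustrative assumptions, not code from this PR (the real model produces 2048-dimensional embeddings).

```python
from typing import List


def mean_pool(hidden_states: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose attention mask is 1 (hypothetical helper)."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:  # skip padding tokens
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [t / count for t in total]


# Toy input: 3 tokens with 2 dimensions each; the last token is padding.
last_layer = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(mean_pool(last_layer, mask))  # [2.0, 3.0]
```

The padding token is ignored, so only the two real tokens contribute to the average.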

The model's custom processor (LlamaNemotronVLProcessor) didn't expose __call__, which Sentence Transformers needs to pass inputs. I've added a __call__ method that delegates to process_documents for image and image+text inputs (which handles the custom image tiling, <IMG_CONTEXT> token creation, and passage: prefix), and falls back to direct tokenization for text-only inputs (where Sentence Transformers has already applied the prompt prefix). This gives full support for all three modalities: text, image, and image+text. See also https://huggingface.co/nvidia/llama-nemotron-rerank-vl-1b-v2/discussions/9 which has similar changes.
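
Roughly, the dispatch described above could look like the sketch below. This is not the code added in the PR: `SketchProcessor`, the `text`/`images` argument names, and the stubbed `tokenize` and `process_documents` callables are all hypothetical stand-ins, and the real `process_documents` signature may differ.

```python
from typing import Any, Callable, Dict, List, Optional


class SketchProcessor:
    """Hypothetical stand-in for the custom processor; not the PR's implementation."""

    def __init__(
        self,
        tokenize: Callable[[List[str]], Dict[str, Any]],
        process_documents: Callable[[List[Dict[str, Any]]], Dict[str, Any]],
    ) -> None:
        self.tokenize = tokenize
        self.process_documents = process_documents

    def __call__(
        self,
        text: Optional[List[str]] = None,
        images: Optional[List[Any]] = None,
    ) -> Dict[str, Any]:
        if images is not None:
            # Image and image+text inputs go through the document path
            # (tiling, image tokens, passage prefix in the real processor).
            texts = text if text is not None else [""] * len(images)
            docs = [{"image": img, "text": txt} for img, txt in zip(images, texts)]
            return self.process_documents(docs)
        if text is None:
            raise ValueError("Provide text, images, or both")
        # Text-only inputs: the prompt prefix is assumed to be applied already.
        return self.tokenize(text)


# Stub callables that only report which path was taken.
processor = SketchProcessor(
    tokenize=lambda texts: {"route": "tokenize", "n": len(texts)},
    process_documents=lambda docs: {"route": "process_documents", "n": len(docs)},
)
print(processor(text=["query: hello"]))
print(processor(images=["<image>"]))
print(processor(text=["caption"], images=["<image>"]))
```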

Added files:

  • modules.json: pipeline with Transformer & Pooling(mean) modules
  • sentence_bert_config.json: feature-extraction task with text, image, and image+text modality configs
  • config_sentence_transformers.json: SentenceTransformer model type with query: / passage: prompts and cosine similarity
  • 1_Pooling/config.json: mean pooling with 2048 embedding dimension
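
To give a feel for what these files contain, here is a sketch that builds plausible contents as Python dicts and prints them as JSON. The layout follows what earlier Sentence Transformers releases write for a Transformer + Pooling pipeline; the exact keys in this PR (especially for v5.4 and the prompt names) are an assumption and may differ, and sentence_bert_config.json is left out because its modality settings are not spelled out here.

```python
import json

# Assumed layout of modules.json: one entry per pipeline module.
modules = [
    {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
    {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"},
]

# Assumed subset of 1_Pooling/config.json: mean pooling over 2048-dim token vectors.
pooling_config = {
    "word_embedding_dimension": 2048,
    "pooling_mode_cls_token": False,
    "pooling_mode_mean_tokens": True,
    "pooling_mode_max_tokens": False,
}

# Assumed subset of config_sentence_transformers.json; the prompt key names are a guess.
st_config = {
    "prompts": {"query": "query: ", "document": "passage: "},
    "similarity_fn_name": "cosine",
}

for name, content in [
    ("modules.json", modules),
    ("1_Pooling/config.json", pooling_config),
    ("config_sentence_transformers.json", st_config),
]:
    print(f"# {name}")
    print(json.dumps(content, indent=2))
```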

Changed files:

  • processing_llama_nemotron_vl.py: added __call__ for Sentence Transformers compatibility
  • README.md: added sentence-transformers tag, sentence-similarity pipeline tag, and a "Sentence Transformers Usage" section showing all three modalities

Once the Sentence Transformers v5.4 release is out, the model can be used immediately like so:

from sentence_transformers import SentenceTransformer
from transformers.image_utils import load_image

model = SentenceTransformer("nvidia/llama-nemotron-embed-vl-1b-v2", trust_remote_code=True, revision="refs/pr/6")

query = "How is AI improving the intelligence and capabilities of robots?"
documents = [
    "AI enables robots to perceive, plan, and act autonomously.",
    "AI is transforming autonomous vehicles by enabling safer, smarter, and more reliable decision-making on the road.",
    "A biological foundation model designed to analyze and generate DNA, RNA, and protein sequences.",
]
# Images can be URLs, file paths, or PIL Images
images = [
    load_image("https://developer.download.nvidia.com/images/isaac/nvidia-isaac-lab-1920x1080.jpg"),
    load_image("https://blogs.nvidia.com/wp-content/uploads/2018/01/automotive-key-visual-corp-blog-level4-av-og-1280x680-1.png"),
    load_image("https://developer-blogs.nvidia.com/wp-content/uploads/2025/02/hc-press-evo2-nim-25-featured-b.jpg"),
]

# Text-only encoding
query_embeddings = model.encode_query([query])
document_embeddings = model.encode_document(documents)
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[0.4142, 0.4046, 0.0421]])

# Image-only encoding
image_embeddings = model.encode(images)
similarities = model.similarity(query_embeddings, image_embeddings)
print(similarities)
# tensor([[ 0.2766,  0.1378, -0.0062]])

# Image+text encoding (e.g. page image + OCR text)
multimodal_docs = [{"image": img, "text": txt} for img, txt in zip(images, documents)]
multimodal_embeddings = model.encode(multimodal_docs)
similarities = model.similarity(query_embeddings, multimodal_embeddings)
print(similarities)
# tensor([[0.3385, 0.2688, 0.0077]])

And after merging, the revision argument can be dropped.

Note that none of the old behaviour is affected/changed. It only adds an additional way to run this model in a familiar and common format.

  • Tom Aarsen
tomaarsen changed pull request status to open

cc @BoLiu @nvidia-oliver-holworthy

I wanted to reach out with a bit of extra context over Slack, but it seems that we're not in any of the same Hugging Face <> NVIDIA Slack Channels, so I'll just message you here!
It has been my goal for some time to be able to integrate all models from https://huggingface.co/collections/nvidia/nemotron-rag in Sentence Transformers, not just the text ones like https://huggingface.co/nvidia/llama-nemotron-embed-1b-v2. Tomorrow, that should become a lot more viable, as I'll be releasing multimodality support for Sentence Transformers, and I've already done the integration work for your nemotron-(embed/rerank)-vl and omni-embed-nemotron models.

As you'll notice in the PRs, the changes are almost exclusively additive (except increasing the model max_length from 900 to 32768 for https://huggingface.co/nvidia/omni-embed-nemotron-3b, matching what the README says). This means that there won't be any breaks of the existing functionality and integrations, but just a new simple interface for accessing these models.

Here's all PRs ready for review:

Ideally, I'd like to extend that to also include these:

In terms of timing: the release blogpost goes live in about 24 hours, and I plan to showcase these 3 models as integrated. If the PRs are merged, users would get a seamless day-0 experience, and otherwise I'll have to stick with revision="refs/pr/...", but this adds some friction that I'd love to avoid.

Happy to answer any questions!

  • Tom Aarsen
NVIDIA org

Hi @tomaarsen , apologies for the late reply. We just noticed this PR. Neither Oliver nor I got an email notification about your message for some reason.

We will review and merge them today. Have you published the blogpost yet?

All good! The blogpost is indeed live as of ~1 hour ago (https://huggingface.co/blog/multimodal-sentence-transformers), but I would be more than happy to edit it to remove the revision mentions in both the blogpost and the documentation (e.g. https://sbert.net/docs/cross_encoder/pretrained_models.html#multimodal-rerankers) the moment that the PRs are merged.

P.s. I also noticed in https://huggingface.co/nvidia/llama-nemotron-embed-vl-1b-v2/commit/4fb4c433f6e7eaa7c8a5502dade83c5656564534 that your variant of load_image here (https://huggingface.co/nvidia/llama-nemotron-embed-vl-1b-v2/blob/main/processing_llama_nemotron_vl.py#L73-L74) doesn't pass any user-agent alongside the request, and so the model currently only works with PIL Images, local paths, etc., but not (most) URLs. For example, https://developer.download.nvidia.com/images/isaac/nvidia-isaac-lab-1920x1080.jpg or Wikipedia URLs are blocked without a user-agent.
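
For illustration, a minimal standard-library sketch of sending a User-Agent header with the image request. The helper names and the `example-image-loader/0.1` string are hypothetical, and this is not the code in processing_llama_nemotron_vl.py; it only shows the general idea.

```python
import urllib.request


def build_image_request(url: str, user_agent: str = "example-image-loader/0.1") -> urllib.request.Request:
    """Build a request that carries an explicit User-Agent header (hypothetical helper)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_image_bytes(url: str, timeout: float = 10.0) -> bytes:
    """Download the raw image bytes; the caller can then open them with PIL."""
    with urllib.request.urlopen(build_image_request(url), timeout=timeout) as response:
        return response.read()


# Inspect the header without touching the network.
# urllib normalizes the stored header name to "User-agent".
request = build_image_request(
    "https://developer.download.nvidia.com/images/isaac/nvidia-isaac-lab-1920x1080.jpg"
)
print(request.get_header("User-agent"))
```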

  • Tom Aarsen

Hi @tomaarsen ,

Thanks for this work. Really appreciate you taking the time to integrate this into Sentence Transformers, and the detailed write-up here.

I've reviewed and tested the changes and everything looks good. The __call__ in the preprocessor is a nice approach to make this standardized across models.

The new multimodal support in SentenceTransformers is a great addition. Exciting to see first-class support for image and image+text embedding alongside text!

Good catch on the load_image user-agent issue too, we can get that fixed on our end.

Merging this PR. Apologies for not getting to them before the blog post went live.

nvidia-oliver-holworthy changed pull request status to merged

Perfect, thanks for reviewing, I appreciate it! The model inference looks very solid. The blogpost and docs are now updated to remove the revision for your models 🤗

  • Tom Aarsen
