Tags: Sentence Similarity, sentence-transformers, Safetensors, Transformers, qwen3_vl, image-text-to-text, multimodal embedding, qwen, embedding
How to use Qwen/Qwen3-VL-Embedding-2B with sentence-transformers:

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")

    sentences = [
        "That is a happy person",
        "That is a happy dog",
        "That is a very happy person",
        "Today is a sunny day"
    ]
    embeddings = model.encode(sentences)

    similarities = model.similarity(embeddings, embeddings)
    print(similarities.shape)
    # [4, 4]

How to use Qwen/Qwen3-VL-Embedding-2B with Transformers:

    # Load model directly
    from transformers import AutoProcessor, AutoModelForImageTextToText

    processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-Embedding-2B")
    model = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen3-VL-Embedding-2B")
Add pipeline_tag, library_name and paper link (#11)
Opened by nielsr (HF Staff)
README.md changed
Old `README.md` (hunk `@@ -1,30 +1,31 @@`):

     ---
    -license: apache-2.0
     base_model:
     - Qwen/Qwen3-VL-2B-Instruct
     tags:
     - transformers
     - multimodal embedding
     ---
     # Qwen3-VL-Embedding-2B

     <p align="center">
     <img src="https://model-demo.oss-cn-hangzhou.aliyuncs.com/Qwen3-VL-Embedding.png" width="400"/>
     </p>

    -

    -The

    -

    -

     - **Unified Representation Learning (Embedding)**: By leveraging the Qwen3-VL architecture, the Embedding model generates semantically rich vectors that capture both visual and textual information in a shared space. This facilitates efficient similarity computation and retrieval across different modalities.
    -
    -- **High-Precision Reranking (Reranker)**: We also introduce the Qwen3-VL-Reranker series to complement the embedding model. The reranker takes a (query, document) pair as input—where both query and document may contain arbitrary single or mixed modalities—and outputs a precise relevance score. In retrieval pipelines, the two models are typically used in tandem: the embedding model performs efficient initial recall, while the reranker refines results in a subsequent re-ranking stage. This two-stage approach significantly boosts retrieval accuracy.
    -
    -- **Exceptional Practicality**: Inheriting Qwen3-VL’s multilingual capabilities, the series supports over 30 languages, making it ideal for global applications. It is highly practical for real-world scenarios, offering flexible vector dimensions, customizable instructions for specific use cases, and strong performance even with quantized embeddings. These capabilities enable developers to seamlessly integrate both models into existing pipelines, unlocking powerful cross-lingual and cross-modal understanding.

     ## Model Overview
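The High-Precision Reranking bullet in the old README describes a two-stage pipeline: embedding-based recall followed by pairwise reranking. A minimal sketch of that flow is shown here, assuming toy 2-D vectors in place of real embeddings and a hypothetical lookup table (`toy_relevance`) in place of an actual reranker model. The helper names `recall_top_k` and `rerank` are illustrative and are not part of any Qwen or Hugging Face API.

```python
import numpy as np


def recall_top_k(query_vec, doc_vecs, k):
    """Stage 1: rank documents by cosine similarity and keep the k best indices."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores)[:k]
    return [int(i) for i in order]


def rerank(candidates, score_pair):
    """Stage 2: re-order the recalled candidates with a pairwise relevance scorer."""
    return sorted(candidates, key=score_pair, reverse=True)


# Toy 2-D embeddings standing in for real model outputs (assumption).
doc_vecs = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]])
query_vec = np.array([1.0, 0.1])

# Hypothetical reranker scores keyed by document index; a real reranker
# would score each (query, document) pair with a model.
toy_relevance = {0: 0.2, 1: 0.9, 2: 0.5, 3: 0.1}

candidates = recall_top_k(query_vec, doc_vecs, k=2)
final = rerank(candidates, lambda i: toy_relevance[i])
print(candidates, final)
```

The split keeps the expensive pairwise scorer off the full corpus: it only sees the `k` candidates that the cheap vector search returns.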
New `README.md`:

     ---
     base_model:
     - Qwen/Qwen3-VL-2B-Instruct
    +license: apache-2.0
    +library_name: transformers
    +pipeline_tag: feature-extraction
     tags:
     - transformers
     - multimodal embedding
     ---
    +
     # Qwen3-VL-Embedding-2B

     <p align="center">
     <img src="https://model-demo.oss-cn-hangzhou.aliyuncs.com/Qwen3-VL-Embedding.png" width="400"/>
     </p>

    +Qwen3-VL-Embedding-2B is part of the latest extensions of the Qwen family, built on the Qwen3-VL foundation model. It provides an end-to-end pipeline for high-precision multimodal search by mapping diverse modalities—including text, images, document images, and video—into a unified representation space.

    +The model was presented in the paper [Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking](https://huggingface.co/papers/2601.04720).

    +## Highlights

    +The **Qwen3-VL-Embedding** and **Qwen3-VL-Reranker** model series are specifically designed for multimodal information retrieval and cross-modal understanding. This suite accepts diverse inputs including text, images, screenshots, and videos, as well as inputs containing a mixture of these modalities.

    +- **Multimodal Versatility**: Both models seamlessly handle a wide range of inputs—including text, images, screenshots, and video—within a unified framework. They deliver state-of-the-art performance across diverse multimodal tasks such as image-text retrieval, video-text matching, visual question answering (VQA), and multimodal content clustering.
     - **Unified Representation Learning (Embedding)**: By leveraging the Qwen3-VL architecture, the Embedding model generates semantically rich vectors that capture both visual and textual information in a shared space. This facilitates efficient similarity computation and retrieval across different modalities.
    +- **Exceptional Practicality**: Inheriting Qwen3-VL’s multilingual capabilities, the series supports over 30 languages, making it ideal for global applications. It offers flexible vector dimensions (Matryoshka Representation Learning), customizable instructions for specific use cases, and strong performance even with quantized embeddings.

     ## Model Overview
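The Exceptional Practicality bullet mentions flexible vector dimensions via Matryoshka Representation Learning. With embeddings trained this way, a common pattern is to keep only the leading components of each vector and renormalize before computing cosine similarity. The sketch below illustrates that pattern on random stand-in vectors. The width of 8, the target dimension of 4, and the helper names are assumptions for illustration; this excerpt does not say which dimensions the model supports.

```python
import numpy as np


def truncate_and_normalize(embeddings, dim):
    """Keep the first `dim` components of each row and rescale to unit length."""
    cut = embeddings[:, :dim]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)


def cosine_matrix(a, b):
    """Cosine similarity between rows of a and b (both assumed unit-norm)."""
    return a @ b.T


rng = np.random.default_rng(0)
full = rng.normal(size=(3, 8))  # stand-in for 3 embeddings of width 8 (assumption)

small = truncate_and_normalize(full, dim=4)
sims = cosine_matrix(small, small)
print(small.shape, sims.shape)
```

Renormalizing after truncation matters because a slice of a unit vector is generally shorter than unit length, so a plain dot product on the slice would no longer be a cosine similarity.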