Qwen3-VL-Embedding-2B trained on vdr-multilingual-train

This is a sentence-transformers model finetuned from tomaarsen/Qwen3-VL-Embedding-2B on the vdr-multilingual-train dataset. It maps text, images, and video to a shared 2048-dimensional dense vector space and can be used for semantic search, cross-modal retrieval, semantic textual similarity, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: tomaarsen/Qwen3-VL-Embedding-2B
  • Maximum Sequence Length: 262144 tokens
  • Output Dimensionality: 2048 dimensions
  • Similarity Function: Cosine Similarity
  • Supported Modalities: Text, Image, Video, Message
  • Training Dataset: vdr-multilingual-train
  • Language: en
  • License: apache-2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'transformer_task': 'feature-extraction', 'modality_config': {'text': {'method': 'forward', 'method_output_name': 'last_hidden_state'}, 'image': {'method': 'forward', 'method_output_name': 'last_hidden_state'}, 'video': {'method': 'forward', 'method_output_name': 'last_hidden_state'}, 'message': {'method': 'forward', 'method_output_name': 'last_hidden_state'}}, 'module_output_name': 'token_embeddings', 'message_format': 'structured', 'processing_kwargs': {'chat_template': {'add_generation_prompt': True}}, 'unpad_inputs': False, 'architecture': 'Qwen3VLModel'})
  (1): Pooling({'embedding_dimension': 2048, 'pooling_mode': 'lasttoken', 'include_prompt': True})
  (2): Normalize({})
)
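The Pooling module above uses 'lasttoken' pooling: the hidden state of the final non-padding token is taken as the sequence representation, then L2-normalized by the Normalize module. A minimal numpy sketch of that two-step pipeline, using hypothetical toy tensors rather than real model outputs:

```python
import numpy as np

def last_token_pool(token_embeddings, attention_mask):
    """Pick the last non-padded token embedding per sequence ('lasttoken' pooling)."""
    last_idx = attention_mask.sum(axis=1) - 1  # index of last real token, per row
    return token_embeddings[np.arange(len(last_idx)), last_idx]  # (batch, dim)

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length, as the Normalize module does."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

# toy batch: 2 sequences, 4 token positions, 3-dim embeddings
emb = np.arange(24, dtype=np.float64).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 0],   # 3 real tokens -> take position 2
                 [1, 1, 1, 1]])  # 4 real tokens -> take position 3
pooled = l2_normalize(last_token_pool(emb, mask))
print(pooled.shape)  # (2, 3)
```

With the real model, the same steps run over 2048-dimensional hidden states, so every output embedding has unit norm and cosine similarity reduces to a dot product.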

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("tomaarsen/qwen3-vl-2b-vdr")
# Run inference
queries = [
    'What is the quarter-on-quarter growth rate of Klook in Asia-Pacific as of October 2022?',
]
documents = [
    'https://huggingface.co/tomaarsen/qwen3-vl-2b-vdr/resolve/main/assets/image_0.jpg',
    'https://huggingface.co/tomaarsen/qwen3-vl-2b-vdr/resolve/main/assets/image_1.jpg',
    'https://huggingface.co/tomaarsen/qwen3-vl-2b-vdr/resolve/main/assets/image_2.jpg',
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# [1, 2048] [3, 2048]

# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[0.5789, 0.0973, 0.0304]])
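Because the model was trained with MatryoshkaLoss (see Training Details), embeddings can also be truncated to a prefix (e.g. 256 dimensions) with only a modest quality drop; Sentence Transformers exposes this via the truncate_dim argument when loading the model. The truncation-plus-renormalization it performs can be sketched in numpy with toy vectors:

```python
import numpy as np

def truncate_and_renormalize(embeddings, dim):
    """Keep the first `dim` Matryoshka dimensions and re-normalize to unit length."""
    truncated = embeddings[:, :dim]
    return truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

rng = np.random.default_rng(0)
full = rng.normal(size=(3, 2048))
full /= np.linalg.norm(full, axis=1, keepdims=True)  # model outputs are unit-norm

small = truncate_and_renormalize(full, 256)
print(small.shape)  # (3, 256)
sims = small @ small.T  # cosine similarity still works in the truncated space
```

With the real model, `SentenceTransformer("tomaarsen/qwen3-vl-2b-vdr", truncate_dim=256)` yields 256-dimensional embeddings directly, cutting index size roughly 8x.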

Evaluation

Metrics

Information Retrieval

Metric Value
cosine_accuracy@1 0.9533
cosine_accuracy@3 0.99
cosine_accuracy@5 0.9933
cosine_accuracy@10 0.9933
cosine_precision@1 0.9533
cosine_precision@3 0.33
cosine_precision@5 0.1987
cosine_precision@10 0.0993
cosine_recall@1 0.9533
cosine_recall@3 0.99
cosine_recall@5 0.9933
cosine_recall@10 0.9933
cosine_ndcg@10 0.9764
cosine_mrr@10 0.9707
cosine_map@100 0.9709
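Note that with a single relevant document per query, as in this evaluation set, accuracy@k and recall@k coincide (which is why the two rows match above). A small sketch of how such a metric is computed from a query-document cosine similarity matrix, using toy data:

```python
import numpy as np

def accuracy_at_k(similarities, relevant_idx, k):
    """Fraction of queries whose relevant document appears in the top-k results."""
    topk = np.argsort(-similarities, axis=1)[:, :k]  # document indices, best first
    hits = [rel in row for rel, row in zip(relevant_idx, topk)]
    return float(np.mean(hits))

sims = np.array([[0.9, 0.2, 0.1],
                 [0.3, 0.8, 0.4],
                 [0.2, 0.6, 0.5]])
relevant = [0, 1, 2]  # document i is the relevant one for query i
print(accuracy_at_k(sims, relevant, 1))  # 0.6666666666666666 (query 2 ranks its doc 2nd)
print(accuracy_at_k(sims, relevant, 3))  # 1.0
```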

Training Details

Training Dataset

vdr-multilingual-train

  • Dataset: vdr-multilingual-train at 6b92b5c
  • Size: 10,000 training samples
  • Columns: query, image, and negative_0
  • Approximate statistics based on the first 1000 samples:
    • query (string): min 26, mean 36.31, max 62 tokens
    • image (image): min 700x709 px, mean 1416x1648 px, max 2100x2064 px
    • negative_0 (image): min 827x709 px, mean 1438x1633 px, max 2583x1897 px
  • Samples (each row pairs a query with a relevant image and a hard-negative image; only the queries are shown here):
    • What are the new anthropological perspectives on development as discussed by Quarles Van Ufford and Giri in 2003?
    • What are the three main positions anthropologists have taken in relation to development, as discussed by David Lewis?
    • Who are the three sisters known as the Fates in Greek mythology?
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "CachedMultipleNegativesRankingLoss",
        "matryoshka_dims": [
            2048,
            1024,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
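MatryoshkaLoss evaluates the inner loss on the embedding prefix at each listed dimensionality and combines the results with the given weights (all 1 here). A simplified sketch with a stand-in InfoNCE-style ranking loss, not the exact library implementation:

```python
import numpy as np

def info_nce(q, d, scale=20.0):
    """Stand-in multiple-negatives ranking loss: in-batch softmax cross-entropy,
    where document i is the positive for query i and all others are negatives."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = scale * (q @ d.T)                   # (batch, batch)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # positives sit on the diagonal

def matryoshka_loss(q, d, dims=(2048, 1024, 512, 256, 128, 64), weights=None):
    """Sum the inner loss over each embedding prefix, weighted per dimensionality."""
    weights = weights or [1.0] * len(dims)
    return sum(w * info_nce(q[:, :k], d[:, :k]) for w, k in zip(weights, dims))

rng = np.random.default_rng(0)
queries, docs = rng.normal(size=(8, 2048)), rng.normal(size=(8, 2048))
loss = matryoshka_loss(queries, docs)
print(loss > 0)  # True
```

Training every prefix jointly is what makes the truncated embeddings discussed under Usage remain useful on their own.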
    

Evaluation Dataset

vdr-multilingual-test

  • Dataset: vdr-multilingual-test at 9e26ae1
  • Size: 300 evaluation samples
  • Columns: query and image
  • Approximate statistics based on the first 300 samples:
    • query (string): min 27, mean 34.26, max 65 tokens
    • image (image): min 827x1125 px, mean 1371x1709 px, max 2045x2045 px
  • Samples (each row pairs a query with its relevant image; only the queries are shown here):
    • What is the quarter-on-quarter growth rate of Klook in Asia-Pacific as of October 2022?
    • When should spinach be planted and harvested?
    • How does the discharge of sewage into a river affect the concentration of dissolved oxygen?
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "CachedMultipleNegativesRankingLoss",
        "matryoshka_dims": [
            2048,
            1024,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 64
  • num_train_epochs: 1
  • learning_rate: 2e-05
  • warmup_steps: 0.1
  • bf16: True
  • eval_strategy: steps
  • per_device_eval_batch_size: 64
  • batch_sampler: no_duplicates

All Hyperparameters

  • per_device_train_batch_size: 64
  • num_train_epochs: 1
  • max_steps: -1
  • learning_rate: 2e-05
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: None
  • warmup_steps: 0.1
  • optim: adamw_torch_fused
  • optim_args: None
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • optim_target_modules: None
  • gradient_accumulation_steps: 1
  • average_tokens_across_devices: True
  • max_grad_norm: 1.0
  • label_smoothing_factor: 0.0
  • bf16: True
  • fp16: False
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • use_liger_kernel: False
  • liger_kernel_config: None
  • use_cache: False
  • neftune_noise_alpha: None
  • torch_empty_cache_steps: None
  • auto_find_batch_size: False
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • include_num_input_tokens_seen: no
  • log_level: passive
  • log_level_replica: warning
  • disable_tqdm: False
  • project: huggingface
  • trackio_space_id: trackio
  • eval_strategy: steps
  • per_device_eval_batch_size: 64
  • prediction_loss_only: True
  • eval_on_start: False
  • eval_do_concat_batches: True
  • eval_use_gather_object: False
  • eval_accumulation_steps: None
  • include_for_metrics: []
  • batch_eval_metrics: False
  • save_only_model: False
  • save_on_each_node: False
  • enable_jit_checkpoint: False
  • push_to_hub: False
  • hub_private_repo: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_always_push: False
  • hub_revision: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • restore_callback_states_from_checkpoint: False
  • full_determinism: False
  • seed: 42
  • data_seed: None
  • use_cpu: False
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • dataloader_prefetch_factor: None
  • remove_unused_columns: True
  • label_names: None
  • train_sampling_strategy: random
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • ddp_backend: None
  • ddp_timeout: 1800
  • fsdp: []
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • deepspeed: None
  • debug: []
  • skip_memory_metrics: True
  • do_predict: False
  • resume_from_checkpoint: None
  • warmup_ratio: None
  • local_rank: -1
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Epoch Step Training Loss Validation Loss vdr-eval_cosine_ndcg@10
-1 -1 - - 0.9790
0.0510 8 7.9663 - -
0.1019 16 5.9054 4.6686 0.9826
0.1529 24 5.6008 - -
0.2038 32 5.6521 4.5979 0.9810
0.2548 40 5.7503 - -
0.3057 48 5.5388 4.6358 0.9802
0.3567 56 5.5883 - -
0.4076 64 5.4430 4.6014 0.9812
0.4586 72 5.4762 - -
0.5096 80 5.4937 4.6229 0.9785
0.5605 88 5.4991 - -
0.6115 96 5.2465 4.5517 0.9781
0.6624 104 5.1596 - -
0.7134 112 5.2998 4.6642 0.9777
0.7643 120 5.4130 - -
0.8153 128 5.2071 4.5448 0.9781
0.8662 136 5.1424 - -
0.9172 144 5.1973 4.6617 0.9764
0.9682 152 5.3651 - -
-1 -1 - - 0.9764

Environmental Impact

Carbon emissions were measured using CodeCarbon.

  • Energy Consumed: 2.882 kWh
  • Carbon Emitted: 0.771 kg of CO2
  • Hours Used: 9.675 hours

Training Hardware

  • On Cloud: No
  • GPU Model: 1 x NVIDIA GeForce RTX 3090
  • CPU Model: 13th Gen Intel(R) Core(TM) i7-13700K
  • RAM Size: 31.78 GB

Framework Versions

  • Python: 3.11.6
  • Sentence Transformers: 5.4.0.dev0
  • Transformers: 5.3.0.dev0
  • PyTorch: 2.10.0+cu128
  • Accelerate: 1.13.0.dev0
  • Datasets: 4.3.0
  • Tokenizers: 0.22.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

CachedMultipleNegativesRankingLoss

@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}