This is a sentence-transformers model finetuned from tomaarsen/Qwen3-VL-Embedding-2B on the vdr-multilingual-train dataset. It maps sentences & paragraphs to a 2048-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
(0): Transformer({'transformer_task': 'feature-extraction', 'modality_config': {'text': {'method': 'forward', 'method_output_name': 'last_hidden_state'}, 'image': {'method': 'forward', 'method_output_name': 'last_hidden_state'}, 'video': {'method': 'forward', 'method_output_name': 'last_hidden_state'}, 'message': {'method': 'forward', 'method_output_name': 'last_hidden_state'}}, 'module_output_name': 'token_embeddings', 'message_format': 'structured', 'processing_kwargs': {'chat_template': {'add_generation_prompt': True}}, 'unpad_inputs': False, 'architecture': 'Qwen3VLModel'})
(1): Pooling({'embedding_dimension': 2048, 'pooling_mode': 'lasttoken', 'include_prompt': True})
(2): Normalize({})
)
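The Pooling module above uses last-token pooling: the sentence embedding is the hidden state of the final non-padding token, which is then length-normalized by the Normalize module. A minimal NumPy sketch of these two steps (the token embeddings and attention mask here are toy values, not the model's actual tensors):

```python
import numpy as np

# Toy inputs: batch of 2 sequences, 4 tokens each, 3-dim token embeddings.
token_embeddings = np.arange(24, dtype=np.float64).reshape(2, 4, 3)
# 1 = real token, 0 = padding (right-padded here for illustration).
attention_mask = np.array([[1, 1, 1, 0],
                           [1, 1, 1, 1]])

# Last-token pooling: pick the embedding of the last non-padding token.
last_idx = attention_mask.sum(axis=1) - 1          # [2, 3]
pooled = token_embeddings[np.arange(2), last_idx]  # shape (2, 3)

# Normalize to unit length, as the Normalize module does.
embeddings = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
print(np.round(np.linalg.norm(embeddings, axis=1), 6))  # [1. 1.]
```

With unit-normalized embeddings, cosine similarity reduces to a plain dot product, which is what the similarity scores in the usage example below rely on.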
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("tomaarsen/qwen3-vl-2b-vdr")
# Run inference
queries = [
'What is the quarter-on-quarter growth rate of Klook in Asia-Pacific as of October 2022?',
]
documents = [
'https://huggingface.co/tomaarsen/qwen3-vl-2b-vdr/resolve/main/assets/image_0.jpg',
'https://huggingface.co/tomaarsen/qwen3-vl-2b-vdr/resolve/main/assets/image_1.jpg',
'https://huggingface.co/tomaarsen/qwen3-vl-2b-vdr/resolve/main/assets/image_2.jpg',
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# [1, 2048] [3, 2048]
# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[0.5789, 0.0973, 0.0304]])
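Because the model is trained with Matryoshka Representation Learning, the 2048-dim embeddings can be truncated to a shorter prefix (e.g. 256 dims) and re-normalized, trading a small amount of quality for much cheaper storage and search. A minimal sketch with random stand-in vectors (in practice you would use the outputs of `model.encode_query` / `model.encode_document`):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for model.encode_query / model.encode_document outputs.
query_embeddings = rng.normal(size=(1, 2048))
document_embeddings = rng.normal(size=(3, 2048))

def truncate_and_normalize(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` Matryoshka dimensions and re-normalize."""
    emb = emb[:, :dim]
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

q = truncate_and_normalize(query_embeddings, 256)
d = truncate_and_normalize(document_embeddings, 256)

# On unit vectors, cosine similarity is a plain dot product.
similarities = q @ d.T
print(similarities.shape)  # (1, 3)
```

Alternatively, Sentence Transformers can truncate for you at load time via `SentenceTransformer(..., truncate_dim=256)`.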
Evaluated with InformationRetrievalEvaluator on the vdr-eval dataset:

| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.9533 |
| cosine_accuracy@3 | 0.99 |
| cosine_accuracy@5 | 0.9933 |
| cosine_accuracy@10 | 0.9933 |
| cosine_precision@1 | 0.9533 |
| cosine_precision@3 | 0.33 |
| cosine_precision@5 | 0.1987 |
| cosine_precision@10 | 0.0993 |
| cosine_recall@1 | 0.9533 |
| cosine_recall@3 | 0.99 |
| cosine_recall@5 | 0.9933 |
| cosine_recall@10 | 0.9933 |
| cosine_ndcg@10 | 0.9764 |
| cosine_mrr@10 | 0.9707 |
| cosine_map@100 | 0.9709 |
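For reference, these are standard ranked-retrieval measures. With exactly one relevant document per query (as here), accuracy@k and recall@k coincide and precision@k is recall@k divided by k — which is why the table shows e.g. precision@10 ≈ recall@10 / 10. A small self-contained sketch of accuracy@k, precision@k, and recall@k:

```python
def metrics_at_k(ranked_ids, relevant_ids, k):
    """accuracy@k, precision@k, recall@k averaged over queries.

    ranked_ids:   one ranked list of doc ids per query
    relevant_ids: one set of relevant doc ids per query
    """
    acc = prec = rec = 0.0
    for ranking, relevant in zip(ranked_ids, relevant_ids):
        hits = sum(1 for doc in ranking[:k] if doc in relevant)
        acc += 1.0 if hits > 0 else 0.0
        prec += hits / k
        rec += hits / len(relevant)
    n = len(ranked_ids)
    return acc / n, prec / n, rec / n

# Two toy queries, one relevant document each.
ranked = [["d1", "d2", "d3"], ["d9", "d4", "d5"]]
relevant = [{"d1"}, {"d4"}]
print(metrics_at_k(ranked, relevant, 3))  # (1.0, 0.333..., 1.0)
```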
Training dataset columns: query, image, and negative_0

| | query | image | negative_0 |
|---|---|---|---|
| type | string | image | image |

Examples:

| query | image | negative_0 |
|---|---|---|
| What are the new anthropological perspectives on development as discussed by Quarles Van Ufford and Giri in 2003? | (image) | (image) |
| What are the three main positions anthropologists have taken in relation to development, as discussed by David Lewis? | (image) | (image) |
| Who are the three sisters known as the Fates in Greek mythology? | (image) | (image) |
Loss: MatryoshkaLoss with these parameters:

{
"loss": "CachedMultipleNegativesRankingLoss",
"matryoshka_dims": [
2048,
1024,
512,
256,
128,
64
],
"matryoshka_weights": [
1,
1,
1,
1,
1,
1
],
"n_dims_per_step": -1
}
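Conceptually, MatryoshkaLoss evaluates the wrapped ranking loss on several embedding prefixes (here 2048 down to 64 dims) and sums the equally-weighted results, so every prefix is trained to be a usable embedding on its own. A toy NumPy sketch of that weighting scheme, with a simple in-batch InfoNCE as a stand-in for CachedMultipleNegativesRankingLoss:

```python
import numpy as np

def info_nce(q, d):
    """In-batch contrastive loss: the i-th query matches the i-th document."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = q @ d.T * 20.0  # scaled cosine similarities
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))

def matryoshka_loss(q, d, dims, weights):
    """Sum the base loss over truncated embedding prefixes."""
    return sum(w * info_nce(q[:, :m], d[:, :m]) for m, w in zip(dims, weights))

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 2048))
d = q + 0.1 * rng.normal(size=(4, 2048))  # noisy positives

dims = [2048, 1024, 512, 256, 128, 64]
loss = matryoshka_loss(q, d, dims, weights=[1] * len(dims))
print(round(float(loss), 4))
```

The real implementation wraps any Sentence Transformers loss and additionally caches gradients (the "Cached" variant) so large batches fit in memory; this sketch only illustrates the multi-dimension weighting.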
Evaluation dataset columns: query and image

| | query | image |
|---|---|---|
| type | string | image |

Examples:

| query | image |
|---|---|
| What is the quarter-on-quarter growth rate of Klook in Asia-Pacific as of October 2022? | (image) |
| When should spinach be planted and harvested? | (image) |
| How does the discharge of sewage into a river affect the concentration of dissolved oxygen? | (image) |
Loss: MatryoshkaLoss with these parameters:

{
"loss": "CachedMultipleNegativesRankingLoss",
"matryoshka_dims": [
2048,
1024,
512,
256,
128,
64
],
"matryoshka_weights": [
1,
1,
1,
1,
1,
1
],
"n_dims_per_step": -1
}
Non-default training hyperparameters:

- per_device_train_batch_size: 64
- num_train_epochs: 1
- learning_rate: 2e-05
- warmup_steps: 0.1
- bf16: True
- eval_strategy: steps
- per_device_eval_batch_size: 64
- batch_sampler: no_duplicates

All hyperparameters:

- per_device_train_batch_size: 64
- num_train_epochs: 1
- max_steps: -1
- learning_rate: 2e-05
- lr_scheduler_type: linear
- lr_scheduler_kwargs: None
- warmup_steps: 0.1
- optim: adamw_torch_fused
- optim_args: None
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- optim_target_modules: None
- gradient_accumulation_steps: 1
- average_tokens_across_devices: True
- max_grad_norm: 1.0
- label_smoothing_factor: 0.0
- bf16: True
- fp16: False
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- use_liger_kernel: False
- liger_kernel_config: None
- use_cache: False
- neftune_noise_alpha: None
- torch_empty_cache_steps: None
- auto_find_batch_size: False
- log_on_each_node: True
- logging_nan_inf_filter: True
- include_num_input_tokens_seen: no
- log_level: passive
- log_level_replica: warning
- disable_tqdm: False
- project: huggingface
- trackio_space_id: trackio
- eval_strategy: steps
- per_device_eval_batch_size: 64
- prediction_loss_only: True
- eval_on_start: False
- eval_do_concat_batches: True
- eval_use_gather_object: False
- eval_accumulation_steps: None
- include_for_metrics: []
- batch_eval_metrics: False
- save_only_model: False
- save_on_each_node: False
- enable_jit_checkpoint: False
- push_to_hub: False
- hub_private_repo: None
- hub_model_id: None
- hub_strategy: every_save
- hub_always_push: False
- hub_revision: None
- load_best_model_at_end: False
- ignore_data_skip: False
- restore_callback_states_from_checkpoint: False
- full_determinism: False
- seed: 42
- data_seed: None
- use_cpu: False
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- parallelism_config: None
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- dataloader_prefetch_factor: None
- remove_unused_columns: True
- label_names: None
- train_sampling_strategy: random
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- ddp_backend: None
- ddp_timeout: 1800
- fsdp: []
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- deepspeed: None
- debug: []
- skip_memory_metrics: True
- do_predict: False
- resume_from_checkpoint: None
- warmup_ratio: None
- local_rank: -1
- prompts: None
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: proportional
- router_mapping: {}
- learning_rate_mapping: {}

Training logs:

| Epoch | Step | Training Loss | Validation Loss | vdr-eval_cosine_ndcg@10 |
|---|---|---|---|---|
| -1 | -1 | - | - | 0.9790 |
| 0.0510 | 8 | 7.9663 | - | - |
| 0.1019 | 16 | 5.9054 | 4.6686 | 0.9826 |
| 0.1529 | 24 | 5.6008 | - | - |
| 0.2038 | 32 | 5.6521 | 4.5979 | 0.9810 |
| 0.2548 | 40 | 5.7503 | - | - |
| 0.3057 | 48 | 5.5388 | 4.6358 | 0.9802 |
| 0.3567 | 56 | 5.5883 | - | - |
| 0.4076 | 64 | 5.4430 | 4.6014 | 0.9812 |
| 0.4586 | 72 | 5.4762 | - | - |
| 0.5096 | 80 | 5.4937 | 4.6229 | 0.9785 |
| 0.5605 | 88 | 5.4991 | - | - |
| 0.6115 | 96 | 5.2465 | 4.5517 | 0.9781 |
| 0.6624 | 104 | 5.1596 | - | - |
| 0.7134 | 112 | 5.2998 | 4.6642 | 0.9777 |
| 0.7643 | 120 | 5.4130 | - | - |
| 0.8153 | 128 | 5.2071 | 4.5448 | 0.9781 |
| 0.8662 | 136 | 5.1424 | - | - |
| 0.9172 | 144 | 5.1973 | 4.6617 | 0.9764 |
| 0.9682 | 152 | 5.3651 | - | - |
| -1 | -1 | - | - | 0.9764 |
Carbon emissions were measured using CodeCarbon.
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{gao2021scaling,
title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
year={2021},
eprint={2101.06983},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
Base model: Qwen/Qwen3-VL-2B-Instruct