# Vision Matryoshka Embeddings (ViT-L/14-1024D)
A vision encoder producing 1024-dimensional embeddings with Matryoshka Representation Learning, aligned with the lumees/lumees-matryoshka-embedding-v1 text embedding model.
## Model Description
This model extends vision-language alignment to support flexible embedding dimensions from 64D to 1024D through Matryoshka Representation Learning. It enables efficient deployment scenarios where you can trade off between speed and accuracy by selecting different embedding dimensions.
**Key Features:**
- 🎯 Aligned with lumees text embeddings for multimodal retrieval
- 📐 Matryoshka dimensions: 64D, 128D, 256D, 384D, 512D, 768D, 1024D
- ⚡ Flexible deployment: Use smaller dimensions for faster inference
- 🎨 Vision backbone: ViT-Large/14 (CLIP pretrained)
- 🔄 Projection head: Maps to 1024D embedding space
## Architecture

```
Input Image (224×224)
        ↓
ViT-Large/14 Backbone (pretrained)
        ↓
Pooling (CLS token)
        ↓
Projection Head (→ 1024D)
        ↓
Matryoshka Layer Norms
        ↓
Output: {64D, 128D, 256D, 384D, 512D, 768D, 1024D}
```
**Components:**

- Vision Backbone: `vit_large_patch14_clip_224.openai` (307M parameters)
- Projection Head: 2-layer MLP (1024 → 2048 → 1024)
- Training: Contrastive learning (InfoNCE) with frozen text encoder
- Alignment: Multi-scale contrastive loss across all Matryoshka dimensions
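The actual architecture code ships as `model.py` in this repository. Purely as an illustration of the components above, a minimal encoder could look like the sketch below; the class name, pooling details, and the final L2-normalization are assumptions, not the shipped implementation:

```python
import timm
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatryoshkaVisionEncoderSketch(nn.Module):
    """Illustrative sketch of the components described above; see model.py for the real code."""

    def __init__(self, vision_model_name='vit_large_patch14_clip_224.openai',
                 output_dim=1024, matryoshka_dims=(64, 128, 256, 384, 512, 768, 1024)):
        super().__init__()
        # ViT-Large/14 backbone with the classification head removed (pooled CLS features)
        self.backbone = timm.create_model(vision_model_name, pretrained=True, num_classes=0)
        hidden = self.backbone.num_features  # 1024 for ViT-L/14
        # 2-layer MLP projection head: 1024 -> 2048 -> 1024
        self.projection = nn.Sequential(
            nn.Linear(hidden, 2048), nn.GELU(), nn.Linear(2048, output_dim)
        )
        # One LayerNorm per Matryoshka dimension
        self.matryoshka_dims = list(matryoshka_dims)
        self.norms = nn.ModuleDict({str(d): nn.LayerNorm(d) for d in self.matryoshka_dims})

    def forward(self, images):
        features = self.backbone(images)   # (B, 1024) pooled backbone features
        full = self.projection(features)   # (B, 1024) full-dimension embedding
        # Truncate, apply the per-dimension LayerNorm, and L2-normalize (assumed) for each dimension
        return {d: F.normalize(self.norms[str(d)](full[:, :d]), dim=-1) for d in self.matryoshka_dims}
```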
## Performance

Evaluated on a curated set of 50 diverse image-text pairs:
| Metric | @ 512D |
|---|---|
| Image→Text R@1 | 30.0% |
| Image→Text R@5 | 60.0% |
| Text→Image R@1 | 26.0% |
| Text→Image R@5 | 64.0% |
| Average R@1 | 28.0% |
| rSum | 326.0 |
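For context, these retrieval metrics can be computed from a full image-text similarity matrix roughly as follows. This is a generic sketch (assuming one matching caption per image, with matches on the diagonal), not the evaluation script used for the numbers above:

```python
import numpy as np

def recall_at_k(similarity, k):
    """similarity: (N, N) matrix where the match for row i sits in column i."""
    ranking = (-similarity).argsort(axis=1)            # columns sorted by descending score
    correct = np.arange(similarity.shape[0])[:, None]
    return 100.0 * (ranking[:, :k] == correct).any(axis=1).mean()

def rsum(similarity):
    """Conventional rSum: R@1 + R@5 + R@10 summed over both retrieval directions."""
    i2t = sum(recall_at_k(similarity, k) for k in (1, 5, 10))
    t2i = sum(recall_at_k(similarity.T, k) for k in (1, 5, 10))
    return i2t + t2i
```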
### Matryoshka Dimension Performance
Performance remains consistent across all dimensions:
| Dimension | I→T R@1 | T→I R@1 | Relative to 1024D |
|---|---|---|---|
| 128D | 70% | 70% | 100% |
| 256D | 70% | 70% | 100% |
| 512D ⭐ | 70% | 70% | 100% |
| 768D | 70% | 70% | 100% |
| 1024D | 70% | 70% | 100% |
⭐ Recommended: 512D offers the best speed/accuracy tradeoff
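Conceptually, the smaller Matryoshka dimensions are the leading components of the full vector: for the text model this is what `truncate_dim` does, while this vision encoder additionally applies a per-dimension layer norm and returns a dict keyed by dimension. As a generic illustration of the truncate-and-renormalize idea (a sketch, not this model's exact post-processing):

```python
import numpy as np

def truncate_and_renormalize(embedding, dim):
    """Keep the first `dim` components and rescale to unit length for cosine similarity."""
    truncated = embedding[..., :dim]
    return truncated / np.linalg.norm(truncated, axis=-1, keepdims=True)
```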
## Quick Start

### Installation

```bash
pip install torch torchvision timm sentence-transformers pillow requests
```

### Usage

```python
import torch
from PIL import Image
from torchvision import transforms
from sentence_transformers import SentenceTransformer
import requests
from io import BytesIO
# Load the model architecture (requires model.py from this repository)
from model import MatryoshkaVisionEncoder
# Initialize model
vision_model = MatryoshkaVisionEncoder(
vision_model_name='vit_large_patch14_clip_224.openai',
output_dim=1024,
matryoshka_dims=[64, 128, 256, 384, 512, 768, 1024]
).cuda()
# Load weights
checkpoint = torch.load('pytorch_model.bin', map_location='cpu')
vision_model.load_state_dict(checkpoint['model_state_dict'])
vision_model.eval()
# Load aligned text model
text_model = SentenceTransformer('lumees/lumees-matryoshka-embedding-v1')
# Prepare image
transform = transforms.Compose([
transforms.Resize((224, 224)),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
# Load image
url = "https://images.unsplash.com/photo-1514888286974-6c03e2ca1dba?w=400"
image = Image.open(BytesIO(requests.get(url).content)).convert('RGB')
image_tensor = transform(image).unsqueeze(0).cuda()
# Encode image at 512D (recommended)
with torch.no_grad():
vision_embeds = vision_model(image_tensor)
image_embed = vision_embeds[512].cpu().numpy()
# Encode text at 512D
text_embed = text_model.encode(["a cat sitting by the window"], truncate_dim=512)
# Compute similarity
similarity = (image_embed @ text_embed.T)[0, 0]
print(f"Similarity: {similarity:.4f}")
### Image-Text Retrieval

```python
# Text-to-Image Search
query = "a cute cat"
query_embed = text_model.encode([query], truncate_dim=512)
# Compare with image database
image_database_embeds = ... # Your pre-computed embeddings
similarities = query_embed @ image_database_embeds.T
top_5 = similarities[0].argsort()[-5:][::-1]
# Image-to-Text Search
image_embed = vision_embeds[512].cpu().numpy()
text_database_embeds = ... # Your pre-computed embeddings
similarities = image_embed @ text_database_embeds.T
top_5 = similarities[0].argsort()[-5:][::-1]
```
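Both snippets assume pre-computed databases. One way to build `image_database_embeds` from a list of image files is sketched below; the helper name, batch size, and normalization are assumptions, not part of this repository:

```python
import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image

@torch.no_grad()
def encode_image_database(image_paths, vision_model, transform, dim=512, batch_size=32):
    """Encode image files into an (N, dim) float32 matrix at the chosen Matryoshka dimension."""
    chunks = []
    for start in range(0, len(image_paths), batch_size):
        batch = torch.stack([
            transform(Image.open(path).convert('RGB'))
            for path in image_paths[start:start + batch_size]
        ]).cuda()
        embeds = vision_model(batch)[dim]      # pick the desired dimension from the output dict
        embeds = F.normalize(embeds, dim=-1)   # unit-norm for cosine similarity (assumption)
        chunks.append(embeds.cpu().numpy())
    return np.concatenate(chunks, axis=0).astype('float32')

# image_database_embeds = encode_image_database(list_of_paths, vision_model, transform)
```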
## Training Details

### Model Configuration
- Base Model: ViT-Large/14 (CLIP pretrained from OpenAI)
- Projection: 2-layer MLP with GELU activation
- Output Dimension: 1024D (full), with Matryoshka truncation
- Alignment Target: lumees/lumees-matryoshka-embedding-v1
### Training Hyperparameters
- Optimizer: AdamW (lr=1e-4, weight_decay=0.01, betas=(0.9, 0.98))
- Batch Size: 128
- Epochs: 10
- Warmup: 2000 steps
- Loss Function: InfoNCE (temperature=0.07); a minimal sketch follows this list
- Mixed Precision: Enabled (FP16)
- Gradient Clipping: Max norm 1.0
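As a reference for the loss configuration above, a minimal symmetric InfoNCE implementation with temperature 0.07 looks roughly like this; it illustrates the objective, not the actual training script:

```python
import torch
import torch.nn.functional as F

def info_nce(image_embeds, text_embeds, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image/text pairs."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.T / temperature           # (B, B) cosine similarities / T
    targets = torch.arange(logits.size(0), device=logits.device)  # matching pair is on the diagonal
    # Average the image->text and text->image cross-entropy terms
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```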
### Matryoshka Loss Weighting

Larger dimensions receive more weight during training, with the weight doubling at each step up in dimension:
| Dimension | Weight |
|---|---|
| 1024D | 50.4% |
| 768D | 25.2% |
| 512D | 12.6% |
| 384D | 6.3% |
| 256D | 3.1% |
| 128D | 1.6% |
| 64D | 0.8% |
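These percentages are consistent with weights proportional to 2^i, normalized to sum to 1. A sketch of how the per-dimension InfoNCE terms could be combined, reusing the `info_nce` sketch above (the helper names are assumptions):

```python
import torch

MATRYOSHKA_DIMS = [64, 128, 256, 384, 512, 768, 1024]
# Weight doubles with each larger dimension, then normalize so the weights sum to 1
raw = torch.tensor([2.0 ** i for i in range(len(MATRYOSHKA_DIMS))])
DIM_WEIGHTS = dict(zip(MATRYOSHKA_DIMS, (raw / raw.sum()).tolist()))  # 64D -> ~0.008 ... 1024D -> ~0.504

def multi_scale_loss(vision_embeds, text_embeds):
    """Both arguments: dicts mapping dimension -> (B, dim) embedding tensors."""
    return sum(weight * info_nce(vision_embeds[dim], text_embeds[dim])
               for dim, weight in DIM_WEIGHTS.items())
```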
### Dataset
- Sources: COCO Captions + Conceptual Captions 3M
- Training Strategy: Contrastive learning with frozen text encoder
- Data Augmentation: Standard image preprocessing only
## Use Cases

### Mobile & Edge Devices (64D-128D)
- Real-time image search on smartphones
- IoT applications
- Resource-constrained environments
- Benefits: 8-16x faster, 8-16x less storage
### Balanced Production (512D) ⭐
- General-purpose image retrieval
- Content recommendation systems
- Multimodal search engines
- Benefits: ~2x faster than 1024D with no measured loss in retrieval accuracy
### High-Precision Applications (768D-1024D)
- Research and benchmarking
- Fine-grained similarity search
- Quality-critical systems
- Benefits: Maximum accuracy
### Dimension Selection Guide

| Use Case | Recommended Dim | Latency | Storage | Accuracy (vs. 1024D) |
|---|---|---|---|---|
| Mobile apps | 128D | 8x faster | 8x less | 100% |
| Web search | 512D | 2x faster | 2x less | 100% |
| Research | 1024D | baseline | baseline | 100% |
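The storage column follows directly from the vector size: for float32 embeddings the raw index size scales linearly with dimension. A back-of-the-envelope sketch (ignoring ANN index overhead):

```python
def storage_mb(num_vectors, dim, bytes_per_value=4):
    """Approximate raw storage for float32 embeddings."""
    return num_vectors * dim * bytes_per_value / 1e6

for dim in (128, 512, 1024):
    print(f"1M vectors @ {dim}D: {storage_mb(1_000_000, dim):,.0f} MB")
# 1M vectors @ 128D: 512 MB, @ 512D: 2,048 MB, @ 1024D: 4,096 MB
```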
## Model Files

- `pytorch_model.bin` - Model weights (1.2GB)
- `config.json` - Model configuration
- `model.py` - Model architecture code
- `README.md` - This file
## Limitations
- Image Resolution: Fixed at 224×224 pixels
- Domain: Optimized for natural images (photos)
- Training Data: Web-scraped data may contain biases
- Language: The text model supports 100+ languages, but the vision encoder was trained primarily on English-captioned data
## Citation

```bibtex
@misc{vision-matryoshka-vit-large,
  title={Vision Matryoshka Embeddings with ViT-Large},
  author={Hasan Kurşun and Kerem Berkay Yanık},
  year={2025},
  organization={Lumees},
  howpublished={\url{https://huggingface.co/lumees/lumees-matryoshka-vision-embedding-v1}}
}

@misc{lumees-matryoshka-embedding,
  title={Matryoshka Text Embedding v1},
  author={Hasan Kurşun and Kerem Berkay Yanık},
  year={2025},
  organization={Lumees},
  howpublished={\url{https://huggingface.co/lumees/lumees-matryoshka-embedding-v1}}
}
```
## License
CC-BY-NC-4.0 (Creative Commons Attribution-NonCommercial 4.0 International)
- ✅ Free for research and non-commercial use
- ✅ Attribution required when using this model
- ❌ Commercial use prohibited without separate license
For commercial licensing, please contact the model authors.
This license matches the aligned text model: lumees/lumees-matryoshka-embedding-v1
## Acknowledgments
- Text Alignment: lumees/lumees-matryoshka-embedding-v1
- Vision Backbone: OpenAI CLIP ViT-Large/14
- Training Methodology: Matryoshka Representation Learning (Kusupati et al., 2022)
- Framework: PyTorch, timm, sentence-transformers
## Model Card Contact
For questions or issues, please open an issue in the model repository.