TomoroAI/tomoro-ai-colqwen3-embed-4b-awq
Overview
This is a W4A16-quantized version of TomoroAI/tomoro-colqwen3-embed-4b, a state-of-the-art ColPali-style multimodal embedding model. Quantization was performed with AutoRound using the AutoAWQ backend.
The quantized model uses ~3.5 GB of memory (vs. 8.4 GB for the original), enabling deployment on consumer GPUs while maintaining competitive retrieval performance.
Model Details
| Property | Value |
|---|---|
| Original Model | TomoroAI/tomoro-colqwen3-embed-4b |
| Parameters | 4.0B |
| Quantization | W4A16 (4-bit weights, 16-bit activations) |
| Quantization Method | AutoRound with AutoAWQ backend |
| Calibration Sequence Length | 1024 |
| Memory Usage (Quantized) | ~3.5 GB |
| Memory Usage (Original) | 8.4 GB |
| Embedding Dimension | 320 |
| Max Visual Tokens | 1280 |
Quantization Configuration
| Parameter | Value |
|---|---|
| Bits | 4 |
| Group Size | 128 |
| Symmetric | True |
| Calibration Dataset | NeelNanda/pile-10k (AutoRound default) |
| Calibration Sequence Length | 1024 |
| Iterations | 1000 |
| Number of Samples | 560 |
| Batch Size | 80 |
| Quantized Layers | 252 |
| FP16 Layers (Vision) | 105 |
Note: Only the text tower (language model) is quantized. The vision encoder remains in FP16/BF16 to preserve visual feature quality.
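For reference, the table's settings map onto AutoRound's Python API roughly as follows. This is a minimal sketch, not the exact script used for this release: loading via AutoModel and quantizing with the plain AutoRound class are assumptions (AutoRound also ships a multimodal variant), and the step that keeps the vision encoder in FP16 is omitted.
```python
# Hypothetical sketch of this card's quantization recipe with AutoRound.
# Not the release script; the vision-encoder exclusion (kept in FP16) is omitted.
from transformers import AutoModel, AutoTokenizer
from auto_round import AutoRound

model_id = "TomoroAI/tomoro-colqwen3-embed-4b"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,                        # W4A16: 4-bit weights, 16-bit activations
    group_size=128,
    sym=True,
    iters=1000,
    nsamples=560,
    batch_size=80,
    seqlen=1024,                   # calibration sequence length
    dataset="NeelNanda/pile-10k",  # AutoRound's default calibration set
)
autoround.quantize()
autoround.save_quantized("./tomoro-colqwen3-embed-4b-awq", format="auto_awq")
```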
Performance
NDCG@5 on ViDoRe Benchmark (All Languages)
| Model | Average NDCG@5 | Change |
|---|---|---|
| Original (FP16) | 0.70023 | - |
| This Model (W4A16, seqlen=1024) | 0.69768 | -0.36% |
NDCG@5 on ViDoRe Benchmark (English Only)
| Model | Average NDCG@5 | Change |
|---|---|---|
| Original (FP16) | 0.74743 | - |
| This Model (W4A16, seqlen=1024) | 0.74582 | -0.21% |
Performance Summary
- Benchmarks improved: 17 of 40
- Benchmarks degraded: 23 of 40
- Overall quality retention: ~99.6% (0.69768 / 0.70023 ≈ 0.996 on the all-languages average)
Benchmark Comparison Charts
Note: Here, "seqlen" refers to the calibration dataset sequence length used during quantization, not the maximum sequence length supported by the original model. The model retains the full sequence length of the original, but quantization statistics are collected with the calibration seqlen shown.
(Charts omitted from this text version: performance comparison and per-benchmark difference vs. the original, for all languages and for English only.)
Memory Efficiency
The quantized model enables deployment on GPUs with limited memory:
| GPU Memory | Original (FP16) | Quantized (W4A16) |
|---|---|---|
| 8 GB | Marginal | Fits with batch size ~64 |
| 12 GB | Fits comfortably | Fits with batch size ~256 |
| 16 GB | Fits comfortably | High batch sizes possible |
| 24 GB | Fits comfortably | High batch sizes possible |
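The batch sizes above are rough guidelines; actual headroom depends on image resolution and sequence lengths. A quick way to check the real peak footprint on your own hardware is to wrap an encoding call (e.g. encode_docs from the Usage section below) in PyTorch's memory statistics. A minimal sketch:
```python
import torch

def peak_memory_gb(fn, *args, **kwargs):
    """Run fn once and return the peak GPU memory it allocated, in GiB."""
    torch.cuda.reset_peak_memory_stats()
    fn(*args, **kwargs)
    return torch.cuda.max_memory_allocated() / 1024**3

# Example, after loading the model as in the Usage section:
# print(f"peak: {peak_memory_gb(encode_docs, doc_urls):.2f} GiB")
```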
Usage
Prerequisites
```bash
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
pip install auto-round==0.9.2
pip install autoawq==0.2.9
pip install transformers pillow requests
pip install flash-attn --no-build-isolation  # Optional but recommended
```
Inference Code
```python
import torch
from transformers import AutoModel, AutoProcessor
from PIL import Image
import requests
from io import BytesIO

# Configuration
MODEL_ID = "TomoroAI/tomoro-ai-colqwen3-embed-4b-awq"
DTYPE = torch.bfloat16
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load model & processor
processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    max_num_visual_tokens=1280,
)
model = AutoModel.from_pretrained(
    MODEL_ID,
    dtype=DTYPE,
    attn_implementation="sdpa",  # Use "flash_attention_2" if available
    trust_remote_code=True,
    device_map=DEVICE,
).eval()

# Sample queries and documents
queries = [
    "Retrieve the city of Singapore",
    "Retrieve the city of Beijing",
]
doc_urls = [
    "https://upload.wikimedia.org/wikipedia/commons/2/27/Singapore_skyline_2022.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/6/61/Beijing_skyline_at_night.JPG",
]

def load_image(url: str) -> Image.Image:
    headers = {"User-Agent": "Mozilla/5.0"}
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()
    return Image.open(BytesIO(resp.content)).convert("RGB")

def encode_queries(texts):
    # Tokenize queries and produce multi-vector (per-token) embeddings
    batch = processor.process_texts(texts=texts)
    batch = {k: v.to(DEVICE) for k, v in batch.items()}
    with torch.inference_mode():
        out = model(**batch)
    return out.embeddings.to(torch.bfloat16).cpu()

def encode_docs(urls):
    # Encode document page images into multi-vector embeddings
    images = [load_image(url) for url in urls]
    features = processor.process_images(images=images)
    features = {k: v.to(DEVICE) if isinstance(v, torch.Tensor) else v for k, v in features.items()}
    with torch.inference_mode():
        out = model(**features)
    return out.embeddings.to(torch.bfloat16).cpu()

# Encode and score
query_embeddings = encode_queries(queries)
doc_embeddings = encode_docs(doc_urls)
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
print(scores)
```
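score_multi_vector computes ColBERT-style late-interaction (MaxSim) scores between the per-token query and document embeddings. A minimal sketch of that operation for illustration (the processor's version additionally handles padding and batching; the shape comments assume this model's 320-dimensional embeddings):
```python
import torch

def maxsim(query_embs, doc_embs):
    """For each query token, take the max dot-product similarity over all
    document tokens, then sum over query tokens (late interaction)."""
    scores = torch.zeros(len(query_embs), len(doc_embs))
    for i, q in enumerate(query_embs):     # q: [n_query_tokens, 320]
        for j, d in enumerate(doc_embs):   # d: [n_doc_tokens, 320]
            sim = q.float() @ d.float().T  # [n_query_tokens, n_doc_tokens]
            scores[i, j] = sim.max(dim=1).values.sum()
    return scores
```
With the sample inputs above, each row's largest score should fall on the matching query-document pair (the diagonal of the 2×2 score matrix).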
Comparison with Other Calibration Lengths
| Calibration Length | Avg NDCG@5 | Delta | Best For |
|---|---|---|---|
| seqlen=256 | 0.69611 | -0.59% | Short document retrieval |
| seqlen=512 | 0.69696 | -0.47% | Balanced use cases |
| seqlen=1024 | 0.69768 | -0.36% | Long document retrieval |
Limitations
- Reduced Precision: 4-bit quantization introduces some accuracy loss compared to the original FP16 model.
- Vision Encoder: The vision encoder is not quantized to preserve visual feature quality.
- Inference Backend: Performance depends on the inference backend (AutoAWQ, vLLM, etc.).
License
This model is released under the Apache 2.0 License, consistent with the original model.
Acknowledgements
- Original Model: TomoroAI/tomoro-colqwen3-embed-4b by Tomoro AI
- Quantization Tool: AutoRound by Intel
- Base Architecture: Qwen3-VL by Alibaba
Citation
If you use this model, please cite the original model and the AutoRound quantization tool:
```bibtex
@misc{huang2025beyond,
  author    = {Huang, Xin and Tan, Kye Min},
  title     = {Beyond Text: Unlocking True Multimodal, End-to-end RAG with Tomoro ColQwen3},
  year      = {2025},
  url       = {https://tomoro.ai/insights/beyond-text-unlocking-true-multimodal-end-to-end-rag-with-tomoro-colqwen3},
  publisher = {Tomoro.ai}
}

@misc{autoround,
  author = {Intel Corporation},
  title  = {AutoRound: Advanced Weight-Only Quantization Algorithm},
  year   = {2024},
  url    = {https://github.com/intel/auto-round}
}
```