TomoroAI/tomoro-ai-colqwen3-embed-4b-awq

Overview

This is a W4A16-quantized version of TomoroAI/tomoro-colqwen3-embed-4b, a state-of-the-art ColPali-style multimodal embedding model. Quantization was performed using AutoRound with the AutoAWQ backend.

Quantization reduces memory usage to ~3.5 GB (vs. 8.4 GB for the original), enabling deployment on consumer GPUs while maintaining competitive retrieval performance.
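
As a rough sanity check on those figures, the weight footprint can be estimated from the parameter counts. A back-of-envelope sketch follows; the vision-tower size used below is an assumption, and the measured numbers above include runtime overhead on top of raw weights:

```python
# Back-of-envelope weight-memory estimate for W4A16 quantization.
# VISION_PARAMS is an assumed split; only the text tower is quantized.
TOTAL_PARAMS = 4.0e9
VISION_PARAMS = 0.4e9                      # assumption: vision tower kept in FP16
TEXT_PARAMS = TOTAL_PARAMS - VISION_PARAMS
GROUP_SIZE = 128                           # one FP16 scale per group of 128 weights

fp16_weights = TOTAL_PARAMS * 2 / 1e9                  # 2 bytes per param
int4_text    = TEXT_PARAMS * 0.5 / 1e9                 # 0.5 bytes per param
scales       = TEXT_PARAMS / GROUP_SIZE * 2 / 1e9      # per-group FP16 scales
fp16_vision  = VISION_PARAMS * 2 / 1e9

print(f"FP16 weights:      ~{fp16_weights:.1f} GB")                      # ~8.0 GB
print(f"Quantized weights: ~{int4_text + scales + fp16_vision:.1f} GB")  # ~2.7 GB
```

Raw weights come out around 2.7 GB under these assumptions; the reported ~3.5 GB and 8.4 GB figures additionally include activations, buffers, and framework overhead.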

Model Details

| Property | Value |
|----------|-------|
| Original Model | TomoroAI/tomoro-colqwen3-embed-4b |
| Parameters | 4.0B |
| Quantization | W4A16 (4-bit weights, 16-bit activations) |
| Quantization Method | AutoRound with AutoAWQ backend |
| Calibration Sequence Length | 1024 |
| Memory Usage (Quantized) | ~3.5 GB |
| Memory Usage (Original) | 8.4 GB |
| Embedding Dimension | 320 |
| Max Visual Tokens | 1280 |

Quantization Configuration

| Parameter | Value |
|-----------|-------|
| Bits | 4 |
| Group Size | 128 |
| Symmetric | True |
| Calibration Dataset | NeelNanda/pile-10k (AutoRound default) |
| Calibration Sequence Length | 1024 |
| Iterations | 1000 |
| Number of Samples | 560 |
| Batch Size | 80 |
| Quantized Layers | 252 |
| FP16 Layers (Vision) | 105 |

Note: Only the text tower (language model) is quantized. The vision encoder remains in FP16/BF16 to preserve visual feature quality.
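
For reference, a quantization run with the configuration above might look roughly like the sketch below using the auto-round API (argument names follow the auto-round README). The exact script used for this model, in particular how the vision tower is excluded, is not published, so treat this as an illustration only:

```python
# Hypothetical reproduction sketch, not the exact script used for this model.
from transformers import AutoModel, AutoTokenizer
from auto_round import AutoRound

MODEL_ID = "TomoroAI/tomoro-colqwen3-embed-4b"
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    sym=True,
    iters=1000,
    nsamples=560,
    batch_size=80,
    seqlen=1024,                     # calibration sequence length
    dataset="NeelNanda/pile-10k",    # AutoRound's default calibration set
)
autoround.quantize()
# Export weights in an AutoAWQ-compatible W4A16 format.
autoround.save_quantized("tomoro-colqwen3-embed-4b-awq", format="auto_awq")
```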

Performance

NDCG@5 on ViDoRe Benchmark (All Languages)

| Model | Average NDCG@5 | Change |
|-------|----------------|--------|
| Original (FP16) | 0.70023 | - |
| This Model (W4A16, seqlen=1024) | 0.69768 | -0.36% |

NDCG@5 on ViDoRe Benchmark (English Only)

| Model | Average NDCG@5 | Change |
|-------|----------------|--------|
| Original (FP16) | 0.74743 | - |
| This Model (W4A16, seqlen=1024) | 0.74582 | -0.21% |

Performance Summary

  • Benchmarks Improved: 17
  • Benchmarks Degraded: 23
  • Overall Quality Retention: ~99.6%
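
For context on the metric: NDCG@5 scores the top five retrieved documents, discounting relevance logarithmically by rank and normalizing by the best possible ordering. A minimal reference implementation over a judged result list (binary relevance assumed):

```python
import math

def ndcg_at_k(relevances, k=5):
    """NDCG@k for a single query.
    `relevances` holds the graded relevance of each ranked result
    (e.g. 1 for relevant, 0 for not), in retrieval order."""
    def dcg(rels):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([1, 0, 0, 0, 0]))  # 1.0  (relevant doc ranked first)
print(ndcg_at_k([0, 0, 1, 0, 0]))  # 0.5  (relevant doc ranked third)
```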

Benchmark Comparison Charts

Note: Here, "seqlen" refers to the calibration dataset sequence length used during quantization, not the maximum sequence length supported by the original model. The model retains the full sequence length of the original, but quantization statistics are collected with the calibration seqlen shown.

The repository includes four per-benchmark charts:

  • Performance Comparison (All Languages)
  • Performance Difference vs Original (All Languages)
  • Performance Comparison (English Only)
  • Performance Difference vs Original (English Only)

Memory Efficiency

The quantized model enables deployment on GPUs with limited memory:

| GPU Memory | Original Model | Quantized Model |
|------------|----------------|-----------------|
| 8 GB | Marginal | Fits with batch size ~64 |
| 12 GB | Fits comfortably | Fits with batch size ~256 |
| 16 GB | Fits comfortably | High batch sizes possible |
| 24 GB | Fits comfortably | High batch sizes possible |
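
To check the footprint on your own hardware, peak allocated GPU memory can be read back after loading the model. A sketch; exact numbers vary with backend, attention implementation, and batch size:

```python
import torch
from transformers import AutoModel

MODEL_ID = "TomoroAI/tomoro-ai-colqwen3-embed-4b-awq"

torch.cuda.reset_peak_memory_stats()
model = AutoModel.from_pretrained(
    MODEL_ID,
    dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="cuda",
).eval()
print(f"Peak after load: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
# Run a representative encoding batch, then re-check the peak to size batches.
```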

Usage

Prerequisites

```bash
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
pip install auto-round==0.9.2
pip install autoawq==0.2.9
pip install transformers pillow requests
pip install flash-attn --no-build-isolation  # Optional but recommended
```

Inference Code

```python
import torch
from transformers import AutoModel, AutoProcessor
from PIL import Image
import requests
from io import BytesIO

# Configuration
MODEL_ID = "TomoroAI/tomoro-ai-colqwen3-embed-4b-awq"
DTYPE = torch.bfloat16
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load Model & Processor
processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    max_num_visual_tokens=1280,
)
model = AutoModel.from_pretrained(
    MODEL_ID,
    dtype=DTYPE,
    attn_implementation="sdpa",  # Use "flash_attention_2" if available
    trust_remote_code=True,
    device_map=DEVICE,
).eval()

# Sample queries and documents
queries = [
    "Retrieve the city of Singapore",
    "Retrieve the city of Beijing",
]
doc_urls = [
    "https://upload.wikimedia.org/wikipedia/commons/2/27/Singapore_skyline_2022.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/6/61/Beijing_skyline_at_night.JPG",
]

def load_image(url: str) -> Image.Image:
    headers = {"User-Agent": "Mozilla/5.0"}
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()
    return Image.open(BytesIO(resp.content)).convert("RGB")

def encode_queries(texts):
    batch = processor.process_texts(texts=texts)
    batch = {k: v.to(DEVICE) for k, v in batch.items()}
    with torch.inference_mode():
        out = model(**batch)
    return out.embeddings.to(torch.bfloat16).cpu()

def encode_docs(urls):
    images = [load_image(url) for url in urls]
    features = processor.process_images(images=images)
    features = {k: v.to(DEVICE) if isinstance(v, torch.Tensor) else v for k, v in features.items()}
    with torch.inference_mode():
        out = model(**features)
    return out.embeddings.to(torch.bfloat16).cpu()

# Encode and score
query_embeddings = encode_queries(queries)
doc_embeddings = encode_docs(doc_urls)
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
print(scores)
```
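
Under the hood, `score_multi_vector` performs ColBERT-style late interaction (MaxSim): each query token embedding takes its maximum dot-product similarity over the document's token embeddings, and these maxima are summed per document. A minimal equivalent for equal-length, unpadded tensors (the processor's implementation additionally handles padding and batching):

```python
import torch

def maxsim_scores(query_embs: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) scoring.
    query_embs: (num_queries, query_tokens, dim)
    doc_embs:   (num_docs,    doc_tokens,   dim)
    returns:    (num_queries, num_docs) similarity matrix
    """
    # Pairwise token similarities: (num_queries, num_docs, query_tokens, doc_tokens)
    sims = torch.einsum("qtd,psd->qpts", query_embs, doc_embs)
    # Best-matching document token per query token, summed over query tokens.
    return sims.amax(dim=-1).sum(dim=-1)
```

This is also why document-side cost scales with the number of visual tokens per page (up to 1280 per image here).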

Comparison with Other Calibration Lengths

| Calibration Length | Avg NDCG@5 | Delta vs Original | Best For |
|--------------------|------------|-------------------|----------|
| seqlen=256 | 0.69611 | -0.59% | Short document retrieval |
| seqlen=512 | 0.69696 | -0.47% | Balanced use cases |
| seqlen=1024 | 0.69768 | -0.36% | Long document retrieval |

Limitations

  • Reduced Precision: 4-bit quantization introduces some accuracy loss compared to the original FP16 model.
  • Vision Encoder: The vision encoder is not quantized to preserve visual feature quality.
  • Inference Backend: Performance depends on the inference backend (AutoAWQ, vLLM, etc.).

License

This model is released under the Apache 2.0 License, consistent with the original model.

Citation

If you use this model, please cite the original model and the AutoRound quantization toolkit:

```bibtex
@misc{huang2025beyond,
  author = {Huang, Xin and Tan, Kye Min},
  title = {Beyond Text: Unlocking True Multimodal, End-to-end RAG with Tomoro ColQwen3},
  year = {2025},
  url = {https://tomoro.ai/insights/beyond-text-unlocking-true-multimodal-end-to-end-rag-with-tomoro-colqwen3},
  publisher = {Tomoro.ai}
}

@misc{autoround,
  author = {Intel Corporation},
  title = {AutoRound: Advanced Weight-Only Quantization Algorithm},
  year = {2024},
  url = {https://github.com/intel/auto-round}
}
```