TomoroAI/tomoro-ai-colqwen3-embed-4b-awq

Overview

This is a W4A16-quantized version of TomoroAI/tomoro-colqwen3-embed-4b, a state-of-the-art ColPali-style multimodal embedding model. Quantization was performed using AutoRound with the AutoAWQ backend.

Quantization reduces memory usage to ~3.5 GB (vs. 8.4 GB for the original), enabling deployment on consumer GPUs while maintaining competitive retrieval performance.
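
As a rough sanity check on those figures, the weight footprint can be estimated from the parameter counts. A back-of-envelope sketch follows; the vision-tower size used below is an assumption, and the measured numbers above include runtime overhead on top of raw weights:

```python
# Back-of-envelope weight-memory estimate for W4A16 quantization.
# VISION_PARAMS is an assumed split; only the text tower is quantized.
TOTAL_PARAMS = 4.0e9
VISION_PARAMS = 0.4e9                      # assumption: vision tower kept in FP16
TEXT_PARAMS = TOTAL_PARAMS - VISION_PARAMS
GROUP_SIZE = 128                           # one FP16 scale per group of 128 weights

fp16_weights = TOTAL_PARAMS * 2 / 1e9                  # 2 bytes per param
int4_text    = TEXT_PARAMS * 0.5 / 1e9                 # 0.5 bytes per param
scales       = TEXT_PARAMS / GROUP_SIZE * 2 / 1e9      # per-group FP16 scales
fp16_vision  = VISION_PARAMS * 2 / 1e9

print(f"FP16 weights:      ~{fp16_weights:.1f} GB")                      # ~8.0 GB
print(f"Quantized weights: ~{int4_text + scales + fp16_vision:.1f} GB")  # ~2.7 GB
```

Raw weights come out around 2.7 GB under these assumptions; the reported ~3.5 GB and 8.4 GB figures additionally include activations, buffers, and framework overhead.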

Model Details

| Property | Value |
|----------|-------|
| Original Model | TomoroAI/tomoro-colqwen3-embed-4b |
| Parameters | 4.0B |
| Quantization | W4A16 (4-bit weights, 16-bit activations) |
| Quantization Method | AutoRound with AutoAWQ backend |
| Calibration Sequence Length | 1024 |
| Memory Usage (Quantized) | ~3.5 GB |
| Memory Usage (Original) | 8.4 GB |
| Embedding Dimension | 320 |
| Max Visual Tokens | 1280 |

Quantization Configuration

| Parameter | Value |
|-----------|-------|
| Bits | 4 |
| Group Size | 128 |
| Symmetric | True |
| Calibration Dataset | NeelNanda/pile-10k (AutoRound default) |
| Calibration Sequence Length | 1024 |
| Iterations | 1000 |
| Number of Samples | 560 |
| Batch Size | 80 |
| Quantized Layers | 252 |
| FP16 Layers (Vision) | 105 |

Note: Only the text tower (language model) is quantized. The vision encoder remains in FP16/BF16 to preserve visual feature quality.
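
For reference, a quantization run with the configuration above might look roughly like the sketch below using the auto-round API (argument names follow the auto-round README). The exact script used for this model, in particular how the vision tower is excluded, is not published, so treat this as an illustration only:

```python
# Hypothetical reproduction sketch, not the exact script used for this model.
from transformers import AutoModel, AutoTokenizer
from auto_round import AutoRound

MODEL_ID = "TomoroAI/tomoro-colqwen3-embed-4b"
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    sym=True,
    iters=1000,
    nsamples=560,
    batch_size=80,
    seqlen=1024,                     # calibration sequence length
    dataset="NeelNanda/pile-10k",    # AutoRound's default calibration set
)
autoround.quantize()
# Export weights in an AutoAWQ-compatible W4A16 format.
autoround.save_quantized("tomoro-colqwen3-embed-4b-awq", format="auto_awq")
```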

Performance

NDCG@5 on ViDoRe Benchmark (All Languages)

| Model | Average NDCG@5 | Change |
|-------|----------------|--------|
| Original (FP16) | 0.70023 | - |
| This Model (W4A16, seqlen=1024) | 0.69768 | -0.36% |

NDCG@5 on ViDoRe Benchmark (English Only)

| Model | Average NDCG@5 | Change |
|-------|----------------|--------|
| Original (FP16) | 0.74743 | - |
| This Model (W4A16, seqlen=1024) | 0.74582 | -0.21% |

Performance Summary

  • Benchmarks Improved: 17
  • Benchmarks Degraded: 23
  • Overall Quality Retention: ~99.6%
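
For context on the metric: NDCG@5 scores the top five retrieved documents, discounting relevance logarithmically by rank and normalizing by the best possible ordering. A minimal reference implementation over a judged result list (binary relevance assumed):

```python
import math

def ndcg_at_k(relevances, k=5):
    """NDCG@k for a single query.
    `relevances` holds the graded relevance of each ranked result
    (e.g. 1 for relevant, 0 for not), in retrieval order."""
    def dcg(rels):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([1, 0, 0, 0, 0]))  # 1.0  (relevant doc ranked first)
print(ndcg_at_k([0, 0, 1, 0, 0]))  # 0.5  (relevant doc ranked third)
```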

Benchmark Comparison Charts

Note: Here, "seqlen" refers to the calibration dataset sequence length used during quantization, not the maximum sequence length supported by the original model. The model retains the full sequence length of the original, but quantization statistics are collected with the calibration seqlen shown.

The repository includes four per-benchmark charts:

  • Performance Comparison (All Languages)
  • Performance Difference vs Original (All Languages)
  • Performance Comparison (English Only)
  • Performance Difference vs Original (English Only)

Memory Efficiency

The quantized model enables deployment on GPUs with limited memory:

| GPU Memory | Original Model | Quantized Model |
|------------|----------------|-----------------|
| 8 GB | Marginal | Fits with batch size ~64 |
| 12 GB | Fits comfortably | Fits with batch size ~256 |
| 16 GB | Fits comfortably | High batch sizes possible |
| 24 GB | Fits comfortably | High batch sizes possible |
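
To check the footprint on your own hardware, peak allocated GPU memory can be read back after loading the model. A sketch; exact numbers vary with backend, attention implementation, and batch size:

```python
import torch
from transformers import AutoModel

MODEL_ID = "TomoroAI/tomoro-ai-colqwen3-embed-4b-awq"

torch.cuda.reset_peak_memory_stats()
model = AutoModel.from_pretrained(
    MODEL_ID,
    dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="cuda",
).eval()
print(f"Peak after load: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
# Run a representative encoding batch, then re-check the peak to size batches.
```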

Usage

Prerequisites

```bash
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
pip install auto-round==0.9.2
pip install autoawq==0.2.9
pip install transformers pillow requests
pip install flash-attn --no-build-isolation  # Optional but recommended
```

Inference Code

```python
import torch
from transformers import AutoModel, AutoProcessor
from PIL import Image
import requests
from io import BytesIO

# Configuration
MODEL_ID = "TomoroAI/tomoro-ai-colqwen3-embed-4b-awq"
DTYPE = torch.bfloat16
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load Model & Processor
processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    max_num_visual_tokens=1280,
)
model = AutoModel.from_pretrained(
    MODEL_ID,
    dtype=DTYPE,
    attn_implementation="sdpa",  # Use "flash_attention_2" if available
    trust_remote_code=True,
    device_map=DEVICE,
).eval()

# Sample queries and documents
queries = [
    "Retrieve the city of Singapore",
    "Retrieve the city of Beijing",
]
doc_urls = [
    "https://upload.wikimedia.org/wikipedia/commons/2/27/Singapore_skyline_2022.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/6/61/Beijing_skyline_at_night.JPG",
]

def load_image(url: str) -> Image.Image:
    headers = {"User-Agent": "Mozilla/5.0"}
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()
    return Image.open(BytesIO(resp.content)).convert("RGB")

def encode_queries(texts):
    batch = processor.process_texts(texts=texts)
    batch = {k: v.to(DEVICE) for k, v in batch.items()}
    with torch.inference_mode():
        out = model(**batch)
    return out.embeddings.to(torch.bfloat16).cpu()

def encode_docs(urls):
    images = [load_image(url) for url in urls]
    features = processor.process_images(images=images)
    features = {k: v.to(DEVICE) if isinstance(v, torch.Tensor) else v for k, v in features.items()}
    with torch.inference_mode():
        out = model(**features)
    return out.embeddings.to(torch.bfloat16).cpu()

# Encode and score
query_embeddings = encode_queries(queries)
doc_embeddings = encode_docs(doc_urls)
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
print(scores)
```
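
Under the hood, `score_multi_vector` performs ColBERT-style late interaction (MaxSim): each query token embedding takes its maximum dot-product similarity over the document's token embeddings, and these maxima are summed per document. A minimal equivalent for equal-length, unpadded tensors (the processor's implementation additionally handles padding and batching):

```python
import torch

def maxsim_scores(query_embs: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) scoring.
    query_embs: (num_queries, query_tokens, dim)
    doc_embs:   (num_docs,    doc_tokens,   dim)
    returns:    (num_queries, num_docs) similarity matrix
    """
    # Pairwise token similarities: (num_queries, num_docs, query_tokens, doc_tokens)
    sims = torch.einsum("qtd,psd->qpts", query_embs, doc_embs)
    # Best-matching document token per query token, summed over query tokens.
    return sims.amax(dim=-1).sum(dim=-1)
```

This is also why document-side cost scales with the number of visual tokens per page (up to 1280 per image here).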

Comparison with Other Calibration Lengths

| Calibration Length | Avg NDCG@5 | Delta vs Original | Best For |
|--------------------|------------|-------------------|----------|
| seqlen=256 | 0.69611 | -0.59% | Short document retrieval |
| seqlen=512 | 0.69696 | -0.47% | Balanced use cases |
| seqlen=1024 | 0.69768 | -0.36% | Long document retrieval |

Limitations

  • Reduced Precision: 4-bit quantization introduces some accuracy loss compared to the original FP16 model.
  • Vision Encoder: The vision encoder is not quantized to preserve visual feature quality.
  • Inference Backend: Performance depends on the inference backend (AutoAWQ, vLLM, etc.).

License

This model is released under the Apache 2.0 License, consistent with the original model.

Citation

If you use this model, please cite the original model and the AutoRound quantization toolkit:

```bibtex
@misc{huang2025beyond,
  author = {Huang, Xin and Tan, Kye Min},
  title = {Beyond Text: Unlocking True Multimodal, End-to-end RAG with Tomoro ColQwen3},
  year = {2025},
  url = {https://tomoro.ai/insights/beyond-text-unlocking-true-multimodal-end-to-end-rag-with-tomoro-colqwen3},
  publisher = {Tomoro.ai}
}

@misc{autoround,
  author = {Intel Corporation},
  title = {AutoRound: Advanced Weight-Only Quantization Algorithm},
  year = {2024},
  url = {https://github.com/intel/auto-round}
}
```