Naming notice (2026-04-10). The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). The change is only the name; the algorithm and the weights in this repository are unchanged.

The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ addresses weight quantization with a deterministic Walsh-Hadamard rotation and Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a random polar rotation. The two methods are technically distinct.
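As background, a deterministic Walsh-Hadamard rotation is an orthogonal transform, so it can be undone exactly at dequantization time. A minimal NumPy sketch of the idea (illustrative only; `hadamard` and `rotate` are not the repository's actual functions):

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of the n x n Walsh-Hadamard matrix (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def rotate(w: np.ndarray) -> np.ndarray:
    """Rotate weight rows by the normalized Hadamard matrix (orthogonal, invertible)."""
    n = w.shape[-1]
    H = hadamard(n) / np.sqrt(n)  # H @ H.T == I
    return w @ H.T

w = np.random.default_rng(0).normal(size=(4, 8))
w_rot = rotate(w)
# An orthogonal rotation preserves norms, so nothing is lost that the
# inverse rotation cannot map back after dequantization.
assert np.allclose(np.linalg.norm(w, axis=-1), np.linalg.norm(w_rot, axis=-1))
```

Because the rotation is deterministic, only the codes and centroids need to be stored; no random seed or rotation matrix ships with the weights.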

Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.

Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).

🧊 Gemma-4-26B-A4B-it-PolarQuant-Q5

25.2B-parameter MoE (3.8B active) with vision — PQ5-quantized weights, including ALL MoE experts.

Download: 26.9 GB (vs 51.6 GB BF16 original)

| Metric | Value |
|---|---|
| Download | 26.9 GB (1.9× smaller) |
| Quantized | 427 linear layers + 7,680 MoE experts |
| Architecture | 30 layers, 128 experts (top-8) |
| Vision | ✅ Image+Text → Text |
| Routers | FP16 (exact expert selection) |

📊 Charts

Chart images: download size, quantization coverage, and Gemma family comparison.

🚀 Quick Start

Expert Offloading (8.6 GB GPU — best for consumer GPUs)

```shell
pip install vllm --upgrade
```

```python
from vllm import LLM, SamplingParams

# moe_expert_cache_size and kernel_config come from the expert-offload vLLM
# fork used by the notebooks below; they are not upstream vLLM arguments.
llm = LLM('google/gemma-4-26B-A4B-it', dtype='bfloat16',
          moe_expert_cache_size=8, enforce_eager=True,
          kernel_config={'moe_backend': 'triton'})

outputs = llm.generate(['Hello, how are you?'], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```

Streaming Loader (PQ5 dequant + INT4)

See `POLARQUANT_GEMMA4_26B_A4B_VISION.ipynb`

πŸ† GPU Support

| GPU | Method | VRAM |
|---|---|---|
| T4 (16 GB) | Expert offloading | 8.6 GB |
| RTX 4090 (24 GB) | Expert offloading | 8.6 GB |
| A100 (80 GB) | Full load + PQ5 dequant | ~50 GB |

📓 Notebooks

| Notebook | Description |
|---|---|
| MoE Quantize | PQ5-quantize all experts and save codes |
| Vision Inference | Multimodal streaming loader |
| Expert Offload | vLLM fork, 14.8 tok/s |

🔧 Technical Details

- MoE experts: 3D `nn.Parameter` of shape (128, out, in) — each expert quantized independently
- `gate_up_proj`: (128, 1408, 2816) per layer
- `down_proj`: (128, 2816, 704) per layer
- 128 experts × 2 params × 30 layers = 7,680 expert quantizations
- Quantization time: 50 seconds on an A100
- PQ5 codes: int8 indices + FP16 norms + Hadamard rotation + Lloyd-Max centroids
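A minimal sketch of what "each expert quantized independently" with int8 codes plus FP16 norms could look like. All names, shapes, and the evenly spaced toy codebook are illustrative stand-ins, not the repository's actual code (the real method uses Lloyd-Max centroids after a Hadamard rotation):

```python
import numpy as np

def quantize_expert(w: np.ndarray, centroids: np.ndarray):
    """Quantize one expert's 2D weight matrix against a shared scalar codebook.

    Stores int8 codebook indices plus a per-row FP16 norm so each row can be
    rescaled at dequantization time.
    """
    norms = np.linalg.norm(w, axis=1, keepdims=True).astype(np.float16)
    unit = w / norms.astype(np.float32)  # unit-norm rows
    # Nearest-centroid assignment; 32 levels = 5 bits per weight.
    codes = np.abs(unit[..., None] - centroids).argmin(-1).astype(np.int8)
    return codes, norms

def dequantize_expert(codes, norms, centroids):
    return centroids[codes] * norms.astype(np.float32)

# Toy stand-ins: a small (n_experts, out, in) tensor and evenly spaced levels.
centroids = np.linspace(-0.6, 0.6, 32)
experts = np.random.default_rng(0).normal(0.0, 0.05, size=(8, 16, 32))
quantized = [quantize_expert(e, centroids) for e in experts]  # one pass per expert
```

Quantizing each expert against its own statistics keeps outlier experts from distorting the codebook for the others, which is why the count is 7,680 separate quantizations rather than one per layer.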

📖 Citation

```bibtex
@article{polarquant2025,
  title={PolarQuant: Hadamard-Rotated Lloyd-Max Quantization for LLM Compression},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2025}
}
```

📄 Paper · 💻 GitHub · 📦 `pip install polarquant`


🚀 Quick Start (polarengine-vllm)

Install

```shell
pip install git+https://github.com/caiovicentino/polarengine-vllm.git
```

Load & Generate (1 line!)

```python
from polarengine_vllm import PolarQuantModel

model = PolarQuantModel.from_pretrained("caiovicentino1/Gemma-4-26B-A4B-it-PolarQuant-Q5")
print(model.generate("Hello, how are you?", max_new_tokens=100))
```

With KV Cache Compression (5.3x more context)

```python
model = PolarQuantModel.from_pretrained(
    "caiovicentino1/Gemma-4-26B-A4B-it-PolarQuant-Q5", kv_cache_nbits=3)
# The KV cache now uses 5.3x less memory — fit longer conversations!
print(model.generate("Explain quantum computing in detail.", max_new_tokens=500))
```
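The 5.3× figure is consistent with replacing 16-bit FP16 cache entries by 3-bit codes, ignoring the small overhead of per-group scales:

```python
# An FP16 KV-cache entry is 16 bits; a 3-bit code replaces it.
fp16_bits, code_bits = 16, 3
ratio = fp16_bits / code_bits  # ignores small per-group scale overhead
print(f"{ratio:.1f}x")         # → 5.3x
```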

Benchmark

```shell
polarquant bench caiovicentino1/Gemma-4-26B-A4B-it-PolarQuant-Q5 --ppl --chart
```

Gradio Demo

```shell
polarquant demo caiovicentino1/Gemma-4-26B-A4B-it-PolarQuant-Q5 --share
```

📦 Method: PolarQuant

Hadamard Rotation + Lloyd-Max Optimal Centroids

Unlike GGUF's uniform quantization grids, PolarQuant places its quantization levels where weight density is highest: Lloyd-Max centroids are the MSE-optimal scalar quantizer for a given distribution, and the Hadamard rotation makes the weight distribution approximately Gaussian.

At equal file size, PolarQuant Q5 reaches cosine similarity > 0.996 with the original weights, versus ~0.99 for GGUF Q5_K_M.
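For reference, Lloyd-Max alternates nearest-level assignment with moving each level to the mean of its assigned samples, which converges to the MSE-optimal scalar codebook for the empirical distribution. A toy sketch (illustrative, not the released quantizer):

```python
import numpy as np

def lloyd_max(samples: np.ndarray, n_levels: int = 32, iters: int = 50) -> np.ndarray:
    """MSE-optimal scalar codebook for the empirical sample distribution."""
    # Initialize levels at evenly spaced quantiles of the data.
    centroids = np.quantile(samples, np.linspace(0, 1, n_levels + 2)[1:-1])
    for _ in range(iters):
        # Assignment step: each sample goes to its nearest level.
        codes = np.abs(samples[:, None] - centroids).argmin(1)
        # Update step: each level moves to the mean of its assigned samples.
        for k in range(n_levels):
            if np.any(codes == k):
                centroids[k] = samples[codes == k].mean()
    return centroids

rng = np.random.default_rng(0)
levels = lloyd_max(rng.normal(size=20_000), n_levels=32)
# Levels come out denser near zero, where Gaussian weight mass concentrates.
```

With 32 levels this yields the 5-bit codebook; a uniform grid of the same size wastes levels in the low-density tails, which is the gap the cosine-similarity comparison above reflects.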
