Naming notice (2026-04-10). The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). The change is only the name; the algorithm and the weights in this repository are unchanged.

The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ addresses weight quantization with a deterministic Walsh-Hadamard rotation and Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a random polar rotation. The two methods are technically distinct.

Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.

Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).

🧊 Qwopus3.5-27B-v3-PolarQuant-Q5

A 27B Claude Opus distill that runs on consumer GPUs, quantized with PolarQuant.

Download: 16.2 GB (vs 54 GB BF16 β€” 3.3x compression)

| Metric | Value |
|---|---|
| VRAM | 16.9 GB |
| Speed | 21.7 tok/s |
| Download | 16.2 GB |
| KV cache | Q3: 5.3x compression, zero speed overhead |
| Dequant time | 32 s |
| Quantized layers | 497 |
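As a back-of-envelope check on the figures above (the 27B parameter count is an assumption taken from the model name):

```python
# Rough size check; 27e9 parameters is an assumption from the model name.
params = 27e9
bf16_gb = params * 2 / 1e9                    # 2 bytes per BF16 weight
q5_gb = 16.2                                  # reported download size
bits_per_weight = q5_gb * 1e9 * 8 / params

print(f"BF16: {bf16_gb:.0f} GB")              # 54 GB, as stated
print(f"Q5:   {bits_per_weight:.1f} bits/weight, "
      f"{bf16_gb / q5_gb:.1f}x smaller")      # ~4.8 bits, ~3.3x
```

About 4.8 bits per weight is consistent with a Q5 codebook plus a small amount of per-block metadata.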

πŸ“Š Benchmark Results (Verified)

PQ5 beats BF16 on 2 of 3 benchmarks with 67% less VRAM!

| Task | BF16 (56.4 GB) | PQ5 (18.7 GB) | Delta |
|---|---|---|---|
| HellaSwag | 64.5% | 67.0% | +2.5% ✅ |
| ARC-Challenge | 61.0% | 60.0% | -1.0% ≈ |
| Winogrande | 72.5% | 73.0% | +0.5% ✅ |
| HumanEval | 97.56% (model card) | — | — |
| VRAM | 56.4 GB | 18.7 GB | -66.8% 🔥 |

Evaluated on 200 samples per task, 0-shot. PQ5 uses Hadamard rotation + Lloyd-Max Q5 centroids + torchao INT4. The +2.5% on HellaSwag is consistent with quantization noise acting as a mild implicit regularizer, similar in spirit to dropout; with only 200 samples per task, though, deltas of this size are within sampling noise and should not be over-interpreted.

Hardware Compatibility

| GPU | BF16 | PQ5 (INT4) |
|---|---|---|
| RTX 4090 (24 GB) | ❌ | ✅ |
| RTX 4080 (16 GB) | ❌ | ⚠️ tight |
| RTX PRO 6000 (96 GB) | ✅ | ✅ |
| A100 (40 GB) | ❌ | ✅ |
| A100 (80 GB) | ✅ | ✅ |

A 27B model on an RTX 4090: only possible with PolarQuant.

πŸ“Š Charts

(Chart panels: compression, KV cache, speed, context length.)

πŸ† GPU Support

| GPU | VRAM | Fits? |
|---|---|---|
| RTX 3060 Ti | 16 GB | ⚠️ tight |
| RTX 4090 | 24 GB | ✅ (7 GB headroom) |
| L4 | 24 GB | ✅ |
| A100 | 40–80 GB | ✅ |

πŸ”¬ KV Cache Compression

| Method | tok/s | Compression |
|---|---|---|
| FP16 (baseline) | 21.7 | 1.0x |
| PolarQuant Q3 | 21.9 | 5.3x |
| PolarQuant Q2 | 21.8 | 8.0x |

Token match (Q3 vs FP16): 25.3% exact match on a spot-check. We have not run a rigorous BLEU / LLM-as-judge eval comparing KV-Q3 outputs against FP16 β€” the exact-match number alone is not a quality claim. Use Q3 KV cache with caution until we publish a full eval.
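The ratios in the table follow from bits per cached element: 16/3 ≈ 5.3x for Q3 and 16/2 = 8x for Q2, which also implies the codes must be genuinely bit-packed rather than stored one per byte. A minimal numpy sketch of 3-bit packing (the layout here is an illustration, not the shipped format):

```python
import numpy as np

def pack3(codes: np.ndarray) -> np.ndarray:
    """Pack 3-bit codes (uint8 values in [0, 8)) into a dense bitstream:
    8 codes -> 3 bytes, giving the 16/3 = 5.3x ratio vs FP16."""
    bits = np.unpackbits(codes[:, None], axis=1)[:, -3:]  # keep the 3 LSBs
    return np.packbits(bits.ravel())

def unpack3(packed: np.ndarray, n: int) -> np.ndarray:
    """Inverse of pack3: recover n 3-bit codes from the bitstream."""
    bits = np.unpackbits(packed)[: n * 3].reshape(n, 3)
    pad = np.zeros((n, 5), dtype=np.uint8)                # restore 5 zero MSBs
    return np.packbits(np.hstack([pad, bits]), axis=1).ravel()

codes = np.random.default_rng(0).integers(0, 8, 4096).astype(np.uint8)
packed = pack3(codes)
assert np.array_equal(unpack3(packed, codes.size), codes)  # lossless round trip
print(f"{codes.nbytes * 2 / packed.nbytes:.1f}x vs FP16")  # FP16 = 2 B/element
```

The packing itself is lossless; all the quality cost comes from the Lloyd-Max codebook assignment that produces the codes.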

πŸš€ Quick Start

```shell
pip install "polarquant[all]"
polarquant chat Jackrong/Qwopus3.5-27B-v3
```

πŸ”§ Technical Details

  • Architecture: Qwen3.5-27B β€” 64 layers (hybrid attention+linear), 4 KV heads, head_dim=128
  • Weight quantization: Hadamard rotation (128x128) + Lloyd-Max Q5 + torchao INT4
  • KV cache: Hadamard rotation (128x128) + Lloyd-Max Q3 + real bit-packing
  • Streaming loader: Per-module INT4 via nn.Sequential wrapper — fits 24 GB GPUs
  • Hybrid cache: _HybridCacheLayer for Qwen3.5's linear attention layers
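The weight-quantization path in the bullets above can be sketched end to end in plain numpy. This is a toy illustration under assumed details (Sylvester-constructed Hadamard, quantile initialization, fixed iteration count), not the shipped implementation:

```python
import numpy as np

def walsh_hadamard(n: int) -> np.ndarray:
    # Sylvester construction; orthonormal after scaling, so H @ H.T = I.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_max(x: np.ndarray, nbits: int, iters: int = 25):
    # 1-D k-means: alternate nearest-centroid assignment and centroid update.
    k = 2 ** nbits
    c = np.quantile(x, (np.arange(k) + 0.5) / k)  # quantile initialization
    for _ in range(iters):
        idx = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        c = np.array([x[idx == j].mean() if np.any(idx == j) else c[j]
                      for j in range(k)])
    return idx, c

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 128))       # toy weight block

H = walsh_hadamard(128)                   # deterministic 128x128 rotation
W_rot = H @ W                             # rotate to spread outliers

idx, c = lloyd_max(W_rot.ravel(), nbits=5)
W_hat = H.T @ c[idx].reshape(128, 128)    # dequantize, then undo the rotation

cos = (W * W_hat).sum() / (np.linalg.norm(W) * np.linalg.norm(W_hat))
print(f"cos_sim: {cos:.4f}")
```

Because the rotation is orthonormal, quantization error introduced in the rotated basis maps back to the original weights with the same energy; the rotation's job is to make the coordinates more Gaussian so the scalar codebook fits better.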

πŸ“– Citation

```bibtex
@article{polarquant2025,
  title={PolarQuant: Hadamard-Rotated Lloyd-Max Quantization for LLM Compression},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2025}
}
```

πŸ“„ Paper Β· πŸ’» GitHub Β· πŸ“¦ PyPI


πŸš€ Quick Start

Install

```shell
pip install git+https://github.com/caiovicentino/polarengine-vllm.git
```

Load & Generate (1 line!)

```python
from polarengine_vllm import PolarQuantModel

model = PolarQuantModel.from_pretrained("caiovicentino1/Qwopus3.5-27B-v3-PolarQuant-Q5")
print(model.generate("Hello, how are you?", max_new_tokens=100))
```

With KV Cache Compression (5.3x more context)

```python
model = PolarQuantModel.from_pretrained("caiovicentino1/Qwopus3.5-27B-v3-PolarQuant-Q5", kv_cache_nbits=3)
# KV cache now uses 5.3x less memory — fit longer conversations!
print(model.generate("Explain quantum computing in detail.", max_new_tokens=500))
```

Benchmark

```shell
polarquant bench caiovicentino1/Qwopus3.5-27B-v3-PolarQuant-Q5 --ppl --chart
```

Gradio Demo

```shell
polarquant demo caiovicentino1/Qwopus3.5-27B-v3-PolarQuant-Q5 --share
```

πŸ“¦ Method: PolarQuant

Hadamard Rotation + Lloyd-Max Optimal Centroids

Unlike GGUF's uniform grids, PolarQuant places its quantization levels where weight density is highest: Lloyd-Max centroids minimize mean-squared error for the empirical weight distribution, which is approximately Gaussian after the Hadamard rotation.

At the same file size, PolarQuant Q5 reaches cos_sim > 0.996, versus roughly 0.99 for GGUF Q5_K_M.
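The density argument can be illustrated on synthetic Gaussian weights. This sketch compares a uniform absmax-style grid against Lloyd-Max centroids at 5 bits; it is a simplified stand-in (real GGUF Q5_K_M uses per-block scales and mins, so this only demonstrates the principle):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(50_000)   # Gaussian stand-in for NN weights
k = 32                            # 5-bit codebook

# Uniform levels spread evenly over the full data range.
lo, hi = w.min(), w.max()
uniform = lo + (np.arange(k) + 0.5) * (hi - lo) / k

# Lloyd-Max levels: 1-D k-means, concentrating levels where density is high.
lloyd = np.quantile(w, (np.arange(k) + 0.5) / k)
for _ in range(30):
    idx = np.abs(w[:, None] - lloyd[None, :]).argmin(axis=1)
    lloyd = np.array([w[idx == j].mean() if np.any(idx == j) else lloyd[j]
                      for j in range(k)])

def mse(levels: np.ndarray) -> float:
    q = levels[np.abs(w[:, None] - levels[None, :]).argmin(axis=1)]
    return float(np.mean((w - q) ** 2))

print(f"uniform MSE:   {mse(uniform):.5f}")
print(f"Lloyd-Max MSE: {mse(lloyd):.5f}")  # noticeably lower
```

The uniform grid wastes levels in the sparsely populated tails, while Lloyd-Max spends them where most weights actually live, which is where the cos_sim gap comes from.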

πŸ”— Links
