Naming notice (2026-04-10). The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). The change is only the name; the algorithm and the weights in this repository are unchanged.

The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ addresses weight quantization with a deterministic Walsh-Hadamard rotation and Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a random polar rotation. The two methods are technically distinct.

Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.

Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).

🧊 Qwopus3.5-27B-v3-PolarQuant-Q5

A 27B Claude Opus distill that runs on consumer GPUs, quantized with PolarQuant.

Download: 16.2 GB (vs 54 GB BF16 β€” 3.3x compression)

| Metric | Value |
|---|---|
| VRAM | 16.9 GB |
| Speed | 21.7 tok/s |
| Download | 16.2 GB |
| KV cache | Q3: 5.3x compression, zero speed overhead |
| Dequant time | 32 s |
| Quantized layers | 497 |
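As a back-of-envelope check on the figures above (the 27B parameter count is an assumption taken from the model name):

```python
# Rough size check; 27e9 parameters is an assumption from the model name.
params = 27e9
bf16_gb = params * 2 / 1e9                    # 2 bytes per BF16 weight
q5_gb = 16.2                                  # reported download size
bits_per_weight = q5_gb * 1e9 * 8 / params

print(f"BF16: {bf16_gb:.0f} GB")              # 54 GB, as stated
print(f"Q5:   {bits_per_weight:.1f} bits/weight, "
      f"{bf16_gb / q5_gb:.1f}x smaller")      # ~4.8 bits, ~3.3x
```

About 4.8 bits per weight is consistent with a Q5 codebook plus a small amount of per-block metadata.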

πŸ“Š Benchmark Results (Verified)

PQ5 beats BF16 on 2 of 3 benchmarks with 67% less VRAM!

| Task | BF16 (56.4 GB) | PQ5 (18.7 GB) | Delta |
|---|---|---|---|
| HellaSwag | 64.5% | 67.0% | +2.5% ✅ |
| ARC-Challenge | 61.0% | 60.0% | -1.0% ≈ |
| Winogrande | 72.5% | 73.0% | +0.5% ✅ |
| HumanEval | 97.56% (model card) | — | — |
| VRAM | 56.4 GB | 18.7 GB | -66.8% 🔥 |

Evaluated on 200 samples per task, 0-shot. PQ5 uses Hadamard rotation + Lloyd-Max Q5 centroids + torchao INT4. The +2.5% on HellaSwag is consistent with quantization noise acting as a mild implicit regularizer, similar in spirit to dropout; with only 200 samples per task, though, deltas of this size are within sampling noise and should not be over-interpreted.

Hardware Compatibility

| GPU | BF16 | PQ5 (INT4) |
|---|---|---|
| RTX 4090 (24 GB) | ❌ | ✅ |
| RTX 4080 (16 GB) | ❌ | ⚠️ tight |
| RTX PRO 6000 (96 GB) | ✅ | ✅ |
| A100 (40 GB) | ❌ | ✅ |
| A100 (80 GB) | ✅ | ✅ |

A 27B model on an RTX 4090: only possible with PolarQuant.

πŸ“Š Charts

(Chart panels: compression, KV cache, speed, context length.)

πŸ† GPU Support

| GPU | VRAM | Fits? |
|---|---|---|
| RTX 3060 Ti | 16 GB | ⚠️ tight |
| RTX 4090 | 24 GB | ✅ (7 GB headroom) |
| L4 | 24 GB | ✅ |
| A100 | 40–80 GB | ✅ |

πŸ”¬ KV Cache Compression

| Method | tok/s | Compression |
|---|---|---|
| FP16 (baseline) | 21.7 | 1.0x |
| PolarQuant Q3 | 21.9 | 5.3x |
| PolarQuant Q2 | 21.8 | 8.0x |

Token match (Q3 vs FP16): 25.3% exact match on a spot-check. We have not run a rigorous BLEU / LLM-as-judge eval comparing KV-Q3 outputs against FP16 β€” the exact-match number alone is not a quality claim. Use Q3 KV cache with caution until we publish a full eval.
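The ratios in the table follow from bits per cached element: 16/3 ≈ 5.3x for Q3 and 16/2 = 8x for Q2, which also implies the codes must be genuinely bit-packed rather than stored one per byte. A minimal numpy sketch of 3-bit packing (the layout here is an illustration, not the shipped format):

```python
import numpy as np

def pack3(codes: np.ndarray) -> np.ndarray:
    """Pack 3-bit codes (uint8 values in [0, 8)) into a dense bitstream:
    8 codes -> 3 bytes, giving the 16/3 = 5.3x ratio vs FP16."""
    bits = np.unpackbits(codes[:, None], axis=1)[:, -3:]  # keep the 3 LSBs
    return np.packbits(bits.ravel())

def unpack3(packed: np.ndarray, n: int) -> np.ndarray:
    """Inverse of pack3: recover n 3-bit codes from the bitstream."""
    bits = np.unpackbits(packed)[: n * 3].reshape(n, 3)
    pad = np.zeros((n, 5), dtype=np.uint8)                # restore 5 zero MSBs
    return np.packbits(np.hstack([pad, bits]), axis=1).ravel()

codes = np.random.default_rng(0).integers(0, 8, 4096).astype(np.uint8)
packed = pack3(codes)
assert np.array_equal(unpack3(packed, codes.size), codes)  # lossless round trip
print(f"{codes.nbytes * 2 / packed.nbytes:.1f}x vs FP16")  # FP16 = 2 B/element
```

The packing itself is lossless; all the quality cost comes from the Lloyd-Max codebook assignment that produces the codes.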

πŸš€ Quick Start

```shell
pip install "polarquant[all]"
polarquant chat Jackrong/Qwopus3.5-27B-v3
```

πŸ”§ Technical Details

  • Architecture: Qwen3.5-27B β€” 64 layers (hybrid attention+linear), 4 KV heads, head_dim=128
  • Weight quantization: Hadamard rotation (128x128) + Lloyd-Max Q5 + torchao INT4
  • KV cache: Hadamard rotation (128x128) + Lloyd-Max Q3 + real bit-packing
  • Streaming loader: Per-module INT4 via nn.Sequential wrapper — fits 24 GB GPUs
  • Hybrid cache: _HybridCacheLayer for Qwen3.5's linear attention layers
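The weight-quantization path in the bullets above can be sketched end to end in plain numpy. This is a toy illustration under assumed details (Sylvester-constructed Hadamard, quantile initialization, fixed iteration count), not the shipped implementation:

```python
import numpy as np

def walsh_hadamard(n: int) -> np.ndarray:
    # Sylvester construction; orthonormal after scaling, so H @ H.T = I.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_max(x: np.ndarray, nbits: int, iters: int = 25):
    # 1-D k-means: alternate nearest-centroid assignment and centroid update.
    k = 2 ** nbits
    c = np.quantile(x, (np.arange(k) + 0.5) / k)  # quantile initialization
    for _ in range(iters):
        idx = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        c = np.array([x[idx == j].mean() if np.any(idx == j) else c[j]
                      for j in range(k)])
    return idx, c

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 128))       # toy weight block

H = walsh_hadamard(128)                   # deterministic 128x128 rotation
W_rot = H @ W                             # rotate to spread outliers

idx, c = lloyd_max(W_rot.ravel(), nbits=5)
W_hat = H.T @ c[idx].reshape(128, 128)    # dequantize, then undo the rotation

cos = (W * W_hat).sum() / (np.linalg.norm(W) * np.linalg.norm(W_hat))
print(f"cos_sim: {cos:.4f}")
```

Because the rotation is orthonormal, quantization error introduced in the rotated basis maps back to the original weights with the same energy; the rotation's job is to make the coordinates more Gaussian so the scalar codebook fits better.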

πŸ“– Citation

```bibtex
@article{polarquant2025,
  title={PolarQuant: Hadamard-Rotated Lloyd-Max Quantization for LLM Compression},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2025}
}
```

πŸ“„ Paper Β· πŸ’» GitHub Β· πŸ“¦ PyPI


πŸš€ Quick Start

Install

```shell
pip install git+https://github.com/caiovicentino/polarengine-vllm.git
```

Load & Generate (1 line!)

```python
from polarengine_vllm import PolarQuantModel

model = PolarQuantModel.from_pretrained("caiovicentino1/Qwopus3.5-27B-v3-PolarQuant-Q5")
print(model.generate("Hello, how are you?", max_new_tokens=100))
```

With KV Cache Compression (5.3x more context)

```python
model = PolarQuantModel.from_pretrained("caiovicentino1/Qwopus3.5-27B-v3-PolarQuant-Q5", kv_cache_nbits=3)
# KV cache now uses 5.3x less memory — fit longer conversations!
print(model.generate("Explain quantum computing in detail.", max_new_tokens=500))
```

Benchmark

```shell
polarquant bench caiovicentino1/Qwopus3.5-27B-v3-PolarQuant-Q5 --ppl --chart
```

Gradio Demo

```shell
polarquant demo caiovicentino1/Qwopus3.5-27B-v3-PolarQuant-Q5 --share
```

πŸ“¦ Method: PolarQuant

Hadamard Rotation + Lloyd-Max Optimal Centroids

Unlike GGUF's uniform grids, PolarQuant places its quantization levels where weight density is highest: Lloyd-Max centroids minimize mean-squared error for the empirical weight distribution, which is approximately Gaussian after the Hadamard rotation.

At the same file size, PolarQuant Q5 reaches cos_sim > 0.996, versus roughly 0.99 for GGUF Q5_K_M.
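The density argument can be illustrated on synthetic Gaussian weights. This sketch compares a uniform absmax-style grid against Lloyd-Max centroids at 5 bits; it is a simplified stand-in (real GGUF Q5_K_M uses per-block scales and mins, so this only demonstrates the principle):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(50_000)   # Gaussian stand-in for NN weights
k = 32                            # 5-bit codebook

# Uniform levels spread evenly over the full data range.
lo, hi = w.min(), w.max()
uniform = lo + (np.arange(k) + 0.5) * (hi - lo) / k

# Lloyd-Max levels: 1-D k-means, concentrating levels where density is high.
lloyd = np.quantile(w, (np.arange(k) + 0.5) / k)
for _ in range(30):
    idx = np.abs(w[:, None] - lloyd[None, :]).argmin(axis=1)
    lloyd = np.array([w[idx == j].mean() if np.any(idx == j) else lloyd[j]
                      for j in range(k)])

def mse(levels: np.ndarray) -> float:
    q = levels[np.abs(w[:, None] - levels[None, :]).argmin(axis=1)]
    return float(np.mean((w - q) ** 2))

print(f"uniform MSE:   {mse(uniform):.5f}")
print(f"Lloyd-Max MSE: {mse(lloyd):.5f}")  # noticeably lower
```

The uniform grid wastes levels in the sparsely populated tails, while Lloyd-Max spends them where most weights actually live, which is where the cos_sim gap comes from.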

πŸ”— Links
