Naming notice (2026-04-10). The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). The change is only the name; the algorithm and the weights in this repository are unchanged.
The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ addresses weight quantization with a deterministic Walsh-Hadamard rotation and Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a random polar rotation. The two methods are technically distinct.
Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.
Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).
# Qwopus3.5-27B-v3-PolarQuant-Q5
A 27B Claude Opus distill that runs on consumer GPUs with PolarQuant.
Download: 16.2 GB (vs. 54 GB BF16, 3.3x compression)
| Metric | Value |
|---|---|
| VRAM | 16.9 GB |
| Speed | 21.7 tok/s |
| Download | 16.2 GB |
| KV cache (Q3) | 5.3x compression, no throughput loss |
| Dequant time | 32 s |
| Layers | 497 quantized |
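The headline figures above can be sanity-checked with simple arithmetic (assumptions: exactly 27e9 parameters, all weights at the nominal bit width, no codebook or scale overhead; the in-memory ratio is 16/5 = 3.2x, while the 3.3x headline compares against the 16.2 GB download):

```python
# Back-of-envelope check of the table's figures.
PARAMS = 27e9  # assumed parameter count

def weight_bytes(bits_per_weight: float) -> float:
    """Raw weight storage, ignoring codebook/scale overhead."""
    return PARAMS * bits_per_weight / 8

bf16_gb = weight_bytes(16) / 1e9   # 54.0 GB, matching the BF16 figure
pq5_gb = weight_bytes(5) / 1e9     # ~16.9 GB, matching the VRAM figure
compression = bf16_gb / pq5_gb     # 16/5 = 3.2x in memory
```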
## Benchmark Results (Verified)
PQ5 beats BF16 on 2 of 3 benchmarks with 67% less VRAM.
| Task | BF16 (56.4 GB) | PQ5 (18.7 GB) | Delta |
|---|---|---|---|
| HellaSwag | 64.5% | 67.0% | +2.5% ✅ |
| ARC-Challenge | 61.0% | 60.0% | -1.0% ❌ |
| Winogrande | 72.5% | 73.0% | +0.5% ✅ |
| HumanEval | 97.56% | — | (from the upstream model card; not re-run for PQ5) |
| VRAM | 56.4 GB | 18.7 GB | -66.8% |
Evaluated on 200 samples per task, 0-shot. PQ5 uses Hadamard rotation + Lloyd-Max Q5 centroids + torchao INT4. The HellaSwag gain (+2.5%) is consistent with quantization noise acting as a mild implicit regularizer, similar to dropout, but with only 200 samples per task these deltas are within sampling noise and should not be over-interpreted.
## Hardware Compatibility
| GPU | BF16 | PQ5 (INT4) |
|---|---|---|
| RTX 4090 (24 GB) | ❌ | ✅ |
| RTX 4080 (16 GB) | ❌ | ⚠️ tight |
| RTX PRO 6000 (96 GB) | ✅ | ✅ |
| A100 (40 GB) | ❌ | ✅ |
| A100 (80 GB) | ✅ | ✅ |
A 27B model on an RTX 4090: only possible with PolarQuant.
## Charts
### GPU Support
| GPU | VRAM | Fits? |
|---|---|---|
| RTX 3060 Ti | 16 GB | ⚠️ Tight |
| RTX 4090 | 24 GB | ✅ (7 GB headroom) |
| L4 | 24 GB | ✅ |
| A100 | 40-80 GB | ✅ |
### KV Cache Compression
| Method | tok/s | Compression |
|---|---|---|
| FP16 (baseline) | 21.7 | 1.0x |
| PolarQuant Q3 | 21.9 | 5.3x |
| PolarQuant Q2 | 21.8 | 8.0x |
Token match (Q3 vs FP16): 25.3% exact match on a spot-check. We have not run a rigorous BLEU / LLM-as-judge eval comparing KV-Q3 outputs against FP16 β the exact-match number alone is not a quality claim. Use Q3 KV cache with caution until we publish a full eval.
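For reference, the Q3 ratio in the table is essentially the bit-width ratio 16/3 ≈ 5.33. A sketch under the stated config (64 layers, 4 KV heads, head_dim = 128), ignoring scale metadata and the possibility that Qwen3.5's linear-attention layers keep no KV cache at all:

```python
# KV cache bytes per token for the stated architecture.
LAYERS, KV_HEADS, HEAD_DIM = 64, 4, 128

def kv_bytes_per_token(bits: float) -> float:
    # factor 2 covers the K and V tensors in each layer
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bits / 8

fp16 = kv_bytes_per_token(16)  # 131072 B = 128 KiB per token
q3 = kv_bytes_per_token(3)     # 24576 B per token
ratio = fp16 / q3              # 16/3 ≈ 5.33, matching the table's 5.3x
```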
## Quick Start
```bash
pip install polarquant[all]
polarquant chat Jackrong/Qwopus3.5-27B-v3
```
## Technical Details
- Architecture: Qwen3.5-27B, 64 layers (hybrid attention + linear), 4 KV heads, head_dim = 128
- Weight quantization: Hadamard rotation (128x128) + Lloyd-Max Q5 + torchao INT4
- KV cache: Hadamard rotation (128x128) + Lloyd-Max Q3 + real bit-packing
- Streaming loader: per-module INT4 via an nn.Sequential wrapper, so the model fits on 24 GB GPUs
- Hybrid cache: _HybridCacheLayer handles Qwen3.5's linear-attention layers
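To make the Hadamard-rotation step above concrete, here is a minimal sketch (ours, not the library's code) of a deterministic 128x128 Walsh-Hadamard rotation built by the Sylvester construction. Because the normalized matrix is orthogonal, the rotation is exactly invertible; only the scalar quantizer that follows it loses information:

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Normalized Walsh-Hadamard matrix (Sylvester construction).

    n must be a power of two; the result is orthogonal: H @ H.T == I.
    """
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

# Rotate a (128, 128) weight block before scalar quantization.
# The rotation spreads outliers across the block, so a scalar codebook
# wastes fewer levels on heavy tails.
H = hadamard(128)
rng = np.random.default_rng(0)
W = rng.standard_normal((128, 128))
W_rot = H @ W        # deterministic rotation applied before quantization
W_back = H.T @ W_rot # exact inverse, since H is orthogonal
```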
## Citation
```bibtex
@article{polarquant2025,
  title={PolarQuant: Hadamard-Rotated Lloyd-Max Quantization for LLM Compression},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2025}
}
```
Paper · GitHub · PyPI
## Quick Start (Python API)
### Install
```bash
pip install git+https://github.com/caiovicentino/polarengine-vllm.git
```
### Load & Generate (one line)
```python
from polarengine_vllm import PolarQuantModel

model = PolarQuantModel.from_pretrained("caiovicentino1/Qwopus3.5-27B-v3-PolarQuant-Q5")
print(model.generate("Hello, how are you?", max_new_tokens=100))
```
### With KV Cache Compression (5.3x more context)
```python
model = PolarQuantModel.from_pretrained(
    "caiovicentino1/Qwopus3.5-27B-v3-PolarQuant-Q5", kv_cache_nbits=3
)
# KV cache now uses 5.3x less memory, so longer conversations fit
print(model.generate("Explain quantum computing in detail.", max_new_tokens=500))
```
### Benchmark
```bash
polarquant bench caiovicentino1/Qwopus3.5-27B-v3-PolarQuant-Q5 --ppl --chart
```
### Gradio Demo
```bash
polarquant demo caiovicentino1/Qwopus3.5-27B-v3-PolarQuant-Q5 --share
```
## Method: PolarQuant
### Hadamard Rotation + Lloyd-Max Optimal Centroids
Unlike GGUF's uniform quantization grids, PolarQuant places quantization levels where weight density is highest: the Lloyd-Max conditions give the minimum-MSE scalar quantizer for a given source distribution, and post-rotation neural-network weights are approximately Gaussian.
PolarQuant Q5 (cos_sim > 0.996) beats GGUF Q5_K_M (~0.99) at the same file size.
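A toy illustration of that claim (our sketch, not PolarQuant's implementation, and 3-bit rather than Q5 to keep it small): fitting Lloyd-Max centroids by alternating nearest-level assignment and centroid update, which in 1-D is exactly k-means, yields lower mean-squared error on Gaussian-like data than a uniform grid of the same size:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(100_000)  # stand-in for Gaussian-like weights

def quantize(x: np.ndarray, levels: np.ndarray) -> np.ndarray:
    """Map each value to its nearest quantization level."""
    idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
    return levels[idx]

# Uniform 8-level (3-bit) grid over the data range, as in GGUF-style schemes.
uniform = np.linspace(w.min(), w.max(), 8)

# Lloyd-Max iteration: assign to nearest level, then move each level to the
# mean of its cluster. Levels drift toward high-density regions near zero.
lloyd = uniform.copy()
for _ in range(50):
    idx = np.abs(w[:, None] - lloyd[None, :]).argmin(axis=1)
    for k in range(8):
        if np.any(idx == k):
            lloyd[k] = w[idx == k].mean()

mse_uniform = np.mean((w - quantize(w, uniform)) ** 2)
mse_lloyd = np.mean((w - quantize(w, lloyd)) ** 2)
```

With the same 8 levels, the Lloyd-Max codebook spends fewer levels on the sparse tails and achieves a strictly lower MSE than the uniform grid.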