Naming notice (2026-04-10). The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). The change is in name only; the algorithm and the weights in this repository are unchanged.

The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ addresses weight quantization with a deterministic Walsh-Hadamard rotation and a Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a random polar rotation. The two methods are technically distinct.
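For intuition, the rotate-then-quantize weight path can be sketched in a few lines of NumPy. This is a toy illustration only, not the repository's implementation: the matrix size, the 32-level (5-bit) codebook, and the plain k-means-style Lloyd-Max fit are all assumptions.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an orthonormal n x n Walsh-Hadamard matrix (n a power of 2)."""
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_max(x, levels, iters=20):
    """1-D Lloyd-Max quantizer: alternate nearest-level assignment and centroid update."""
    codebook = np.quantile(x, np.linspace(0.0, 1.0, levels))
    for _ in range(iters):
        idx = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)
        for k in range(levels):
            mask = idx == k
            if mask.any():
                codebook[k] = x[mask].mean()
    return codebook

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))
W *= 1 + 5.0 * (rng.random(W.shape) < 0.01)   # sprinkle a few outlier weights

H = hadamard(256)
Wr = H @ W                                     # rotation spreads outliers across the column
cb = lloyd_max(Wr.ravel(), levels=32)          # 5-bit codebook, fit to rotated weights
idx = np.abs(Wr.ravel()[:, None] - cb[None, :]).argmin(axis=1)
W_hat = H.T @ cb[idx].reshape(W.shape)         # decode, then undo the rotation
mse = float(np.mean((W - W_hat) ** 2))
print(f"round-trip MSE: {mse:.4e}")
```

Because H is orthonormal, the reconstruction error equals the scalar quantization error in the rotated domain, which is why a rotation that Gaussianizes the weight distribution helps a distribution-fitted codebook like Lloyd-Max.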

Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.

Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).

🍎 HLWQ MLX 4-bit - Qwopus3.5-9B-v3

HLWQ Q5 dequant → MLX 4-bit for Apple Silicon inference.

PPL 6.44: better than CUDA torchao INT4 (6.48), only +0.07 over the FP16 baseline (6.37).

🎯 Key Results

| Metric | Value |
|---|---|
| Perplexity | 6.44 (FP16: 6.37, CUDA INT4: 6.48, torchao absmax: 6.68) |
| Speed | 20.7 tok/s (Mac mini M4, 16 GB) |
| Memory | 5.1 GB peak |
| Format | MLX 4-bit (4.5 bpw, group_size=64) |
| Size | 4.7 GB |

πŸ“Š Benchmark Comparison

| Platform | Method | PPL ↓ | tok/s | Memory |
|---|---|---|---|---|
| RTX PRO 6000 Blackwell | FP16 baseline | 6.37 | 45.7 | 17.9 GB |
| Mac mini M4 16GB | HLWQ MLX 4-bit | 6.44 | 20.7 | 5.1 GB |
| RTX PRO 6000 Blackwell | HLWQ Q5 + torchao INT4 | 6.48 | 43.0 | 7.1 GB |
| RTX PRO 6000 Blackwell | torchao INT4 (absmax) | 6.68 | 43.3 | 6.3 GB |

MLX 4-bit beats CUDA torchao on PPL (6.44 vs 6.48) while using less memory (5.1 vs 7.1 GB).

(Charts omitted: PPL comparison, cross-platform performance, quality vs. size.)

πŸš€ Quick Start

```shell
pip install mlx-lm
```

```python
from mlx_lm import load, generate

model, tokenizer = load("caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit")
response = generate(
    model, tokenizer,
    prompt="What is the sum of the first 10 prime numbers? Think step by step.",
    max_tokens=500,
)
print(response)
```

Or from CLI:

```shell
mlx_lm generate \
    --model caiovicentino1/Qwopus3.5-9B-v3-HLWQ-MLX-4bit \
    --prompt "Explain quantum computing" \
    --max-tokens 300
```

πŸ”§ How It Was Made

```
Base model (BF16) → HLWQ Q5 dequant (Hadamard + Lloyd-Max)
                  → save improved BF16 weights
                  → mlx_lm convert --quantize --q-bits 4 --q-group-size 64
```

The HLWQ Q5 quantize-dequantize round trip yields BF16 weights that are easier to quantize than the originals: the Hadamard rotation and Lloyd-Max codebook have already absorbed most of the outlier structure. When MLX re-quantizes these weights to 4-bit, it starts from a better-conditioned baseline, so the final model lands closer to FP16 quality.
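Assuming the stock mlx-lm conversion CLI was used for the final step, the command looks roughly like this. The input and output paths are hypothetical; the HLWQ Q5 dequant step itself comes from the separate eoq-quantization tooling and is assumed to have saved plain BF16 safetensors.

```shell
# Hypothetical paths: point --hf-path at the HLWQ-dequantized BF16 checkpoint.
mlx_lm.convert \
    --hf-path ./qwopus-hlwq-dequant-bf16 \
    --mlx-path ./Qwopus3.5-9B-v3-HLWQ-MLX-4bit \
    --quantize --q-bits 4 --q-group-size 64
```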

πŸ”¬ Why MLX Beats CUDA on PPL

MLX 4-bit with group_size=64 has finer granularity than torchao INT4 with group_size=128. Combined with HLWQ's improved starting weights, this gives the best PPL of any 4-bit method tested.
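The group-size effect is easy to demonstrate in isolation. Below is a minimal sketch of symmetric absmax INT4 round-trip quantization (an assumed stand-in for the backends' actual kernels, not either library's code), showing that halving the group size lowers reconstruction error on Gaussian-like weights:

```python
import numpy as np

def absmax_int4_roundtrip(w, group_size):
    """Symmetric absmax INT4 quantize + dequantize with one scale per group."""
    g = w.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0   # map the group |max| to int4 level 7
    scale[scale == 0] = 1.0                              # guard all-zero groups
    q = np.clip(np.round(g / scale), -7, 7)
    return (q * scale).reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=4096 * 512).astype(np.float32)

err64 = float(np.mean((w - absmax_int4_roundtrip(w, 64)) ** 2))
err128 = float(np.mean((w - absmax_int4_roundtrip(w, 128)) ** 2))
print(f"MSE group=64: {err64:.3e}, group=128: {err128:.3e}")
```

Smaller groups mean each scale only has to cover the dynamic range of 64 weights instead of 128, so the rounding grid is finer wherever the local maxima are small.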

πŸ”— Resources

πŸ“– Citation

```bibtex
@misc{polarquant2025,
    title={HLWQ: Hadamard Rotation + Lloyd-Max Optimal Quantization for LLMs},
    author={Caio Vicentino},
    year={2025},
    url={https://github.com/caiovicentino/eoq-quantization}
}
```

πŸ™ Acknowledgements

Base model: Qwen/Qwen3.5-9B (finetuned, then quantized into this model).