Update 2025-05-06: Replaced chat_template in tokenizer_config.json with the fixed version from froggeric/Qwen-Fixed-Chat-Templates.

Huihui-Qwen3.6-35B-A3B-abliterated-FP8

Vision-capable FP8-quantized abliterated Qwen3.6-35B-A3B (MoE, hybrid mamba/attention) for Nvidia DGX Spark and other FP8-capable hardware (~80 GB VRAM for full 262k context).

I've tested many abliterated models from HF, and only Huihui's are really good. If you like, check out the "Claude" version: batsclamp/Huihui-Qwen3.6-35B-A3B-Claude-4.6-Opus-abliterated-FP8

This one gives about 50 t/s at full context when used with Eugr's vLLM (DGX Spark).

Model Lineage

Why FP8

Qwen3.6-35B-A3B in BF16 is ~72 GB on disk. FP8 cuts that to ~37 GB while preserving vision layers and precision-sensitive modules in BF16. The expected throughput uplift on DGX Spark is on par with what we saw for Qwen3.5 (31 → 51 t/s, ~65%).
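As a quick check of the figures above, the sketch below recomputes the size reduction and the throughput uplift from the quoted numbers. The variable names are illustrative only, and the inputs are the rounded values stated in this card, so the results are approximate.

```python
# Rounded figures quoted in this card.
bf16_size_gb = 72.0   # BF16 checkpoint size on disk
fp8_size_gb = 37.0    # FP8 checkpoint size on disk
bf16_tps = 31.0       # Qwen3.5 throughput before FP8 on DGX Spark (t/s)
fp8_tps = 51.0        # Qwen3.5 throughput with FP8 on DGX Spark (t/s)

size_ratio = fp8_size_gb / bf16_size_gb
uplift = (fp8_tps - bf16_tps) / bf16_tps

print(f"FP8 size is {size_ratio:.0%} of BF16")
print(f"Throughput uplift: {uplift:.1%}")
```

The FP8 checkpoint comes out slightly above half the BF16 size because the modules kept in BF16 and the per-block scales sit on top of the one-byte weights.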

Quantization Details

Scheme: native FP8 blockwise, identical on-disk format to the official Qwen/Qwen3.6-35B-A3B-FP8.

| Field | Value |
| --- | --- |
| quant_method | fp8 |
| activation_scheme | dynamic (per-token, at inference) |
| fmt | e4m3 |
| weight_block_size | [128, 128] |
| Scale dtype / key | bf16, *.weight_scale_inv |
| Scale shape | (ceil(out/128), ceil(in/128)) |
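To make the scale-shape rule concrete, here is a minimal sketch that evaluates (ceil(out/128), ceil(in/128)) for a few weight shapes. The helper name scale_shape is hypothetical. The per-expert shapes are the ones implied by the fused source tensors described under Notable Implementation Notes.

```python
import math

BLOCK = (128, 128)  # weight_block_size from the table above

def scale_shape(out_features: int, in_features: int, block=BLOCK) -> tuple:
    """Shape of the weight_scale_inv tensor for an [out, in] Linear weight."""
    return (math.ceil(out_features / block[0]), math.ceil(in_features / block[1]))

# gate_up_proj[256, 1024, 2048] -> per-expert gate_proj and up_proj are [512, 2048]
# down_proj[256, 2048, 512]     -> per-expert down_proj is [2048, 512]
print(scale_shape(512, 2048))   # (4, 16)
print(scale_shape(2048, 512))   # (16, 4)

# Dimensions that are not multiples of 128 still get a scale for the partial block.
print(scale_shape(300, 200))    # (3, 2)
```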

Quantized (weights → FP8 e4m3, per-block [128, 128] scales):

  • All 2D Linear *.weight in language layers that aren't in the exclusion list, including:
    • self_attn.{q,k,v,o}_proj (full-attention layers)
    • linear_attn.{in_proj_qkv, in_proj_z, out_proj} (linear-attention / mamba layers)
    • mlp.shared_expert.{gate,up,down}_proj
    • All 256 experts per MoE layer, un-fused to match Qwen's official per-expert layout:
      • mlp.experts.{0..255}.{gate_proj, up_proj, down_proj}.weight
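The un-fusing step can be sketched with plain nested lists. This is an illustration, not the conversion script used for this repo: the function name unfuse_experts is hypothetical, and it assumes the first half of the fused gate_up rows is gate_proj and the second half is up_proj.

```python
def unfuse_experts(gate_up, down, prefix="mlp.experts"):
    """Split fused expert tensors into per-expert gate/up/down weights.

    gate_up: nested list shaped [num_experts, 2 * inter, hidden]
    down:    nested list shaped [num_experts, hidden, inter]
    ASSUMPTION: rows [0, inter) of gate_up are gate_proj and rows
    [inter, 2 * inter) are up_proj.
    """
    out = {}
    for e, (gu, d) in enumerate(zip(gate_up, down)):
        inter = len(gu) // 2
        out[f"{prefix}.{e}.gate_proj.weight"] = gu[:inter]
        out[f"{prefix}.{e}.up_proj.weight"] = gu[inter:]
        out[f"{prefix}.{e}.down_proj.weight"] = d
    return out

# Toy sizes: 2 experts, inter=2, hidden=3 (the real tensors use 256, 512, 2048).
gate_up = [[[e * 100 + r * 10 + c for c in range(3)] for r in range(4)] for e in range(2)]
down = [[[e * 100 + r * 10 + c for c in range(2)] for r in range(3)] for e in range(2)]

tensors = unfuse_experts(gate_up, down)
for name in sorted(tensors):
    print(name, len(tensors[name]), "x", len(tensors[name][0]))
```

Each fused pair turns into three tensors per expert, which is why the tensor count in the index grows so sharply after conversion.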

Kept in BF16 (matches Qwen's modules_to_not_convert):

| Module | Reason |
| --- | --- |
| lm_head | Output head (precision-sensitive) |
| model.language_model.embed_tokens | Embedding layer |
| *.input_layernorm, *.post_attention_layernorm | LayerNorms |
| *.self_attn.{q_norm, k_norm} | QK norms |
| *.linear_attn.{A_log, conv1d, dt_bias, in_proj_a, in_proj_b, in_proj_ba, norm} | Mamba state-space params (small, sensitive) |
| *.mlp.gate, *.mlp.shared_expert_gate | MoE router gates; routing precision matters |
| model.visual.* | Entire visual encoder (patch_embed, 27 ViT blocks, deepstack mergers, merger) |
| mtp.* | Multi-token prediction module |
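One way to apply an exclusion list like this is a name-based filter. The sketch below uses glob patterns that paraphrase the table; the actual modules_to_not_convert entries in config.json may be spelled differently, and the example tensor names (layer prefixes, visual block names) are assumptions for illustration.

```python
from fnmatch import fnmatchcase

# Glob patterns paraphrasing the "Kept in BF16" table (not copied from config.json).
KEEP_BF16 = [
    "lm_head*",
    "model.language_model.embed_tokens*",
    "*.input_layernorm*", "*.post_attention_layernorm*",
    "*.self_attn.q_norm*", "*.self_attn.k_norm*",
    "*.linear_attn.A_log", "*.linear_attn.conv1d*", "*.linear_attn.dt_bias",
    "*.linear_attn.in_proj_a*",
    "*.linear_attn.in_proj_b*",   # also covers in_proj_ba
    "*.linear_attn.norm*",
    "*.mlp.gate.*", "*.mlp.shared_expert_gate*",
    "model.visual.*",
    "mtp.*",
]

def should_quantize(name: str) -> bool:
    """True if a tensor name is a weight outside the BF16 keep-list.

    A real converter would also check that the tensor is 2D.
    """
    if not name.endswith(".weight"):
        return False
    return not any(fnmatchcase(name, pat) for pat in KEEP_BF16)

# Hypothetical tensor names for illustration.
layer = "model.language_model.layers.0"
for name in [
    f"{layer}.mlp.experts.17.gate_proj.weight",
    f"{layer}.mlp.shared_expert.up_proj.weight",
    f"{layer}.mlp.gate.weight",
    f"{layer}.linear_attn.in_proj_qkv.weight",
    f"{layer}.linear_attn.in_proj_ba.weight",
    "model.visual.blocks.0.attn.qkv.weight",
]:
    print(f"{name}: {'FP8' if should_quantize(name) else 'BF16'}")
```

Note that the router pattern is written as *.mlp.gate.* so that it does not accidentally match the experts' gate_proj weights.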

Notable Implementation Notes

  • Source experts were fused 3D (mlp.experts.gate_up_proj[256, 1024, 2048], mlp.experts.down_proj[256, 2048, 512]) — we un-fuse them to the per-expert layout the official Qwen FP8 uses (mlp.experts.{E}.{gate, up, down}_proj.weight). This is what vLLM's Fp8 MoE loader expects.
  • Streaming quantization: source shards are processed one at a time on the GPU, so peak host memory stays around 6 GB. This avoids the llmcompressor pitfall where peak virtual memory grew to 168 GB during the Compressing phase and the process was OOM-killed on the DGX Spark's 128 GB unified-memory budget.
  • Sanity check: round-trip dequantization median relative error ~2.2% per tensor (as expected for E4M3 blockwise).
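The round-trip check can be reproduced in spirit with a pure-Python simulation. This is a sketch under stated assumptions, not the script used for this repo: E4M3 rounding is emulated in ordinary floats, the scales stay in full precision rather than bf16, the test weights are random Gaussian values, and weight_scale_inv is assumed to be the multiplier used on dequantization (dequantized = fp8_value * weight_scale_inv).

```python
import math
import random
import statistics

E4M3_MAX = 448.0  # largest finite value in float8 e4m3fn

def round_to_e4m3(x: float) -> float:
    """Round x to the nearest e4m3 value (saturating), emulated in floats."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    exp = max(math.frexp(a)[1] - 1, -6)  # floor(log2(a)), clamped at the subnormal range
    step = 2.0 ** (exp - 3)              # 3 mantissa bits
    return sign * round(a / step) * step

def quantize_blockwise(w, block=128):
    """Quantize a 2D list of floats with one scale per [block, block] tile."""
    rows, cols = len(w), len(w[0])
    nbr, nbc = math.ceil(rows / block), math.ceil(cols / block)
    scale_inv = [[1.0] * nbc for _ in range(nbr)]
    q = [[0.0] * cols for _ in range(rows)]
    for bi in range(nbr):
        for bj in range(nbc):
            r_rng = range(bi * block, min((bi + 1) * block, rows))
            c_rng = range(bj * block, min((bj + 1) * block, cols))
            amax = max(abs(w[r][c]) for r in r_rng for c in c_rng)
            s = amax / E4M3_MAX if amax > 0 else 1.0  # maps the tile's amax to 448
            scale_inv[bi][bj] = s
            for r in r_rng:
                for c in c_rng:
                    q[r][c] = round_to_e4m3(w[r][c] / s)
    return q, scale_inv

def dequantize_blockwise(q, scale_inv, block=128):
    """Multiply each value by the scale of the tile it belongs to."""
    return [[v * scale_inv[r // block][c // block] for c, v in enumerate(row)]
            for r, row in enumerate(q)]

random.seed(0)
w = [[random.gauss(0.0, 0.02) for _ in range(256)] for _ in range(256)]
q, s = quantize_blockwise(w)
w_hat = dequantize_blockwise(q, s)
rel = [abs(a - b) / abs(a) for ra, rb in zip(w, w_hat) for a, b in zip(ra, rb) if a != 0.0]
print(f"median relative error: {statistics.median(rel):.2%}")
```

With 3 mantissa bits, the relative rounding error of a normal-range value is bounded by about 6%, so a median of a few percent is the expected order of magnitude.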

Numbers

| | BF16 source | This FP8 |
| --- | --- | --- |
| Size on disk | ~72 GB | ~37 GB |
| Tensors in index | 1045 (fused experts) | 64189 (un-fused) |
| FP8 weight tensors | | 31738 |
| BF16 weight tensors | | 32451 (incl. 31738 weight_scale_inv) |
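The tensor counts above are internally consistent, which the following small check illustrates. It only uses the numbers from this card; the variable names are illustrative.

```python
total_tensors = 64189   # tensors in the FP8 index
fp8_weights = 31738     # FP8 weight tensors
bf16_tensors = 32451    # BF16 tensors, including one weight_scale_inv per FP8 weight

# FP8 weights plus BF16 tensors account for every entry in the index.
accounted = fp8_weights + bf16_tensors
print(accounted == total_tensors)

# BF16 tensors that are not scales (norms, embeddings, visual encoder, and so on).
non_scale_bf16 = bf16_tensors - fp8_weights
print(non_scale_bf16)   # 713
```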

Loading

from transformers import AutoModelForImageTextToText, AutoProcessor
model = AutoModelForImageTextToText.from_pretrained(
    "batsclamp/Huihui-Qwen3.6-35B-A3B-abliterated-FP8",
    dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("batsclamp/Huihui-Qwen3.6-35B-A3B-abliterated-FP8")

For vLLM, point at the repo — the quantization_config is already correctly set (quant_method: fp8, weight-block [128, 128], dynamic activations).
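If you want to confirm those settings on a downloaded copy before serving, a small check along these lines may help. It is a sketch that assumes quantization_config sits at the top level of config.json and uses the field names from the Quantization Details table; the helper check_quant_config and the inline sample are illustrative, not part of any library.

```python
import json

# Expected values, taken from the Quantization Details table.
EXPECTED = {
    "quant_method": "fp8",
    "activation_scheme": "dynamic",
    "fmt": "e4m3",
    "weight_block_size": [128, 128],
}

def check_quant_config(config: dict) -> list:
    """Return human-readable mismatches between the config and EXPECTED."""
    qc = config.get("quantization_config") or {}
    return [
        f"{key}: expected {want!r}, found {qc.get(key)!r}"
        for key, want in EXPECTED.items()
        if qc.get(key) != want
    ]

# Inline stand-in for the contents of config.json from a local copy of the repo.
sample = json.loads("""
{"quantization_config": {"quant_method": "fp8", "activation_scheme": "dynamic",
 "fmt": "e4m3", "weight_block_size": [128, 128],
 "modules_to_not_convert": ["lm_head"]}}
""")

problems = check_quant_config(sample)
print(problems or "quantization_config matches")
```

To run it against the real file, replace the inline sample with the parsed contents of config.json.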

Disclaimer

Abliterated model. Not recommended if you expect a polite corporate assistant.
