Huihui-Qwopus3.5-27B-v3-abliterated-NVFP4


Quantized on 2026-04-13 with lm_head / linear_attn (GDN) / mtp / visual / embed_tokens preserved in BF16.

For NVIDIA DGX Spark (GB10 SM121) users: Driver 590.48+ / CUDA 13.1+ required.

As of April 2026, NVFP4 software support on SM121 is still incomplete. The native W4A4 compute path is not yet functional on this hardware — the runtime silently falls back to W4A16 (BF16 activations), negating the theoretical throughput advantage of FP4.

If accuracy and inference speed are your priorities, we recommend the INT4 AutoRound version instead: 👉 YuYu1015/Huihui-Qwopus3.5-27B-v3-abliterated-int4-AutoRound

INT4 AutoRound leverages the mature W4A16 Marlin kernel path on DGX Spark, offering more thorough calibration (~99.5% quality retention) and significantly more stable performance. The full potential of NVFP4 will only be unlocked once NVIDIA delivers complete W4A4 kernel support for SM121.

NVFP4 quantization of huihui-ai/Huihui-Qwopus3.5-27B-v3-abliterated, optimized for NVIDIA DGX Spark (GB10 SM121).

Model Details

| Item | Value |
| --- | --- |
| Architecture | Dense 27B + GDN (Mamba) + Attention hybrid |
| Base model | Qwen/Qwen3.5-27B |
| Fine-tuned by | huihui-ai (Qwopus v3 distillation + abliteration) |
| Quantized by | YuYu1015 |
| Model size | ~25 GB (NVFP4, vs ~51 GB BF16 original) |
| Context length | Up to 262,144 tokens |
| Thinking mode | Supported (`enable_thinking: true/false`) |
| Tool calling | Supported (`qwen3_coder` parser) |
| MTP | Built-in Multi-Token Prediction weights included (preserved in BF16) |

Quantization Details

| Item | Value |
| --- | --- |
| Method | llm-compressor (main branch, PR #2608) |
| Scheme | NVFP4 (E2M1 + FP8 per-group scaling, group size 16) |
| Format | compressed-tensors (main branch) |
| Calibration dataset | HuggingFaceH4/ultrachat_200k (`train_sft` split) |
| Calibration samples | 512 |
| Calibration sequence length | 2048 |
| Hardware | NVIDIA DGX Spark (GB10, 128 GB unified memory) |
| Environment | transformers>=5.0 + llm-compressor main (the Qwen3.5 `qwen3_5` model_type requires Transformers 5) |
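
The NVFP4 scheme above (E2M1 weight values with one scale per group of 16) can be sketched in plain Python. This is a minimal, illustrative fake-quantizer only: the `amax / 6` scale rule and the plain-float scale are simplifying assumptions, real NVFP4 stores the scale in FP8 (E4M3), and the function name `quantize_dequantize` is hypothetical.

```python
# Magnitudes representable by FP4 E2M1 (sign handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_dequantize(values, group_size=16):
    """Fake-quantize a flat list of weights, one scale per group."""
    out = []
    for i in range(0, len(values), group_size):
        group = values[i:i + group_size]
        amax = max(abs(v) for v in group) or 1.0  # avoid dividing by zero
        scale = amax / 6.0  # map the group onto the E2M1 range [-6, 6]
        for v in group:
            # Snap the scaled magnitude to the nearest representable value.
            mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
            out.append((mag if v >= 0 else -mag) * scale)
    return out
```

Because each group carries its own scale, one outlier weight only degrades the 15 values that share its group, which is why the small group size of 16 matters for quality.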

Layers Preserved in BF16

The following layers are not quantized to preserve model quality:

| Layer | Reason |
| --- | --- |
| `lm_head` | Output head, sensitive to quantization noise |
| `re:.*linear_attn.*` | GDN/DeltaNet (Mamba) layers, which may output zeros if quantized |
| `re:.*mtp\..*` | Multi-Token Prediction weights |
| `re:.*visual\..*` | Vision encoder |
| `re:.*embed_tokens$` | Input embeddings |
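
In the compressed-tensors convention, entries prefixed with `re:` are regular expressions over module names, while plain entries match a name exactly. A small sketch of how such an ignore list selects layers (the helper name and the example module names are hypothetical):

```python
import re

# Ignore list as given to the quantizer; "re:"-prefixed entries are regexes.
IGNORE = ["lm_head", "re:.*linear_attn.*", r"re:.*mtp\..*",
          r"re:.*visual\..*", "re:.*embed_tokens$"]

def is_preserved(module_name):
    """True if the module is kept in BF16 rather than quantized."""
    for pattern in IGNORE:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False
```

Note that `re:.*mtp\..*` requires the literal substring `mtp.`, so an MLP projection like `model.layers.3.mlp.gate_proj` is still quantized.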

Serving with vLLM

```bash
vllm serve /path/to/model \
    --quantization compressed-tensors \
    --served-model-name qwen3.5-27b \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --kv-cache-dtype auto \
    --gpu-memory-utilization 0.90 \
    --max-model-len 65536 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --trust-remote-code \
    --language-model-only
```
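
Once the server is up, requests go to its OpenAI-compatible endpoint. A minimal request-body sketch follows; passing `enable_thinking` through `chat_template_kwargs` is an assumption based on vLLM's Qwen3-series support, so verify it against your vLLM version.

```python
import json

# Request body for POST http://localhost:8000/v1/chat/completions.
payload = {
    "model": "qwen3.5-27b",  # must match --served-model-name
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 256,
    # Toggle thinking mode per request (assumption: vLLM forwards
    # chat_template_kwargs to the chat template, as it does for Qwen3).
    "chat_template_kwargs": {"enable_thinking": False},
}
body = json.dumps(payload)
```

Send `body` with any HTTP client; with thinking enabled, the `--reasoning-parser qwen3` flag above splits the reasoning trace into a separate `reasoning_content` field.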

DGX Spark (SM121) Compatibility Notes

- NVFP4 on SM121 falls back to W4A16 (the native W4A4 path is not yet supported; the `cvt.e2m1x2` instruction is missing)
- FP8 KV cache is not compatible with the GDN non-causal attention layers; use `--kv-cache-dtype auto`
- `--language-model-only` skips vision-encoder profiling for text-only inference
- On UMA, clear the page cache before starting: `sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'`

Safety Warning

This model has safety filtering removed (abliterated) and may generate inappropriate content. Users are solely responsible for all consequences arising from its use.

Credits


