Huihui-Qwopus3.5-27B-v3-abliterated-NVFP4
Quantized on 2026-04-13, with `lm_head` / `linear_attn` (GDN) / `mtp` / `visual` / `embed_tokens` preserved in BF16.
NVIDIA DGX Spark (GB10 SM121) — Driver 590.48+ / CUDA 13.1+
As of April 2026, NVFP4 software support on SM121 is still incomplete. The native W4A4 compute path is not yet functional on this hardware — the runtime silently falls back to W4A16 (BF16 activations), negating the theoretical throughput advantage of FP4.
If accuracy and inference speed are your priorities, we recommend the INT4 AutoRound version instead: 👉 YuYu1015/Huihui-Qwopus3.5-27B-v3-abliterated-int4-AutoRound
INT4 AutoRound leverages the mature W4A16 Marlin kernel path on DGX Spark, offering more thorough calibration (~99.5% quality retention) and significantly more stable performance. The full potential of NVFP4 will only be unlocked once NVIDIA delivers complete W4A4 kernel support for SM121.
NVFP4 quantization of huihui-ai/Huihui-Qwopus3.5-27B-v3-abliterated, optimized for NVIDIA DGX Spark (GB10 SM121).
Model Details
| Item | Value |
|---|---|
| Architecture | Dense 27B + GDN (Mamba) + Attention hybrid |
| Base model | Qwen/Qwen3.5-27B |
| Fine-tuned by | huihui-ai (Qwopus v3 distillation + abliteration) |
| Quantized by | YuYu1015 |
| Model size | ~25 GB (NVFP4, vs ~51 GB BF16 original) |
| Context length | Up to 262,144 tokens |
| Thinking mode | Supported (enable_thinking: true/false) |
| Tool calling | Supported (qwen3_coder parser) |
| MTP | Built-in MTP weights included (preserved in BF16) |
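The ~25 GB footprint follows from the NVFP4 layout: each quantized weight costs 4 bits (E2M1) plus one FP8 scale shared per group of 16 elements. A back-of-envelope sketch (parameter counts are illustrative round numbers, not read from the checkpoint):

```python
# Hedged size estimate for the NVFP4 checkpoint. The 27e9 figure is the
# nominal parameter count; per-layer splits are not modeled here.
params_total = 27e9
bits_weight = 4            # E2M1 packed weight
bits_scale = 8 / 16        # one FP8 scale per group of 16 elements
bits_per_param = bits_weight + bits_scale   # 4.5 bits per quantized param

quantized_gb = params_total * bits_per_param / 8 / 1e9
print(f"~{quantized_gb:.1f} GB for the quantized layers alone")
```

The gap between this lower bound and the reported ~25 GB is accounted for by the layers preserved in BF16 (`lm_head`, embeddings, GDN, MTP, visual) plus packing and metadata overhead.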
Quantization Details
| Item | Value |
|---|---|
| Method | llm-compressor (main branch, PR #2608) |
| Scheme | NVFP4 (E2M1 + FP8 per-group scaling, group size 16) |
| Format | compressed-tensors (main branch) |
| Calibration dataset | HuggingFaceH4/ultrachat_200k (train_sft split) |
| Calibration samples | 512 |
| Calibration sequence length | 2048 |
| Hardware | NVIDIA DGX Spark (GB10, 128GB unified memory) |
| Environment | transformers>=5.0 + llm-compressor main (the Qwen3.5 `qwen3_5` model_type requires transformers 5) |
Layers Preserved in BF16
The following layers are not quantized to preserve model quality:
| Layer | Reason |
|---|---|
| `lm_head` | Output head, sensitive to quantization noise |
| `re:.*linear_attn.*` | GDN/DeltaNet (Mamba) layers; may output zeros if quantized |
| `re:.*mtp\..*` | Multi-Token Prediction weights |
| `re:.*visual\..*` | Vision encoder |
| `re:.*embed_tokens$` | Input embeddings |
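In the ignore list above, the `re:` prefix marks a regex pattern while bare names match exactly. A minimal matcher sketch (the module names below are illustrative, not taken from the actual checkpoint):

```python
import re

# Patterns from the table above: "re:"-prefixed entries are regexes,
# everything else is an exact module-name match.
IGNORE = ["lm_head", "re:.*linear_attn.*", "re:.*mtp\\..*",
          "re:.*visual\\..*", "re:.*embed_tokens$"]

def is_preserved(module_name: str) -> bool:
    """Return True if the module would be kept in BF16 (i.e. not quantized)."""
    for pat in IGNORE:
        if pat.startswith("re:"):
            if re.match(pat[3:], module_name):
                return True
        elif module_name == pat:
            return True
    return False

print(is_preserved("model.layers.3.linear_attn.in_proj"))  # True
print(is_preserved("model.layers.3.mlp.gate_proj"))        # False
```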
Serving with vLLM
```bash
vllm serve /path/to/model \
  --quantization compressed-tensors \
  --served-model-name qwen3.5-27b \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --kv-cache-dtype auto \
  --gpu-memory-utilization 0.90 \
  --max-model-len 65536 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --trust-remote-code \
  --language-model-only
```
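Once serving, the model is reachable through vLLM's OpenAI-compatible endpoint. A sketch of a request body (the endpoint URL and prompt are assumptions; the model name matches `--served-model-name`, and `chat_template_kwargs.enable_thinking` toggles thinking mode as described above):

```python
import json

# Illustrative request body for POST http://localhost:8000/v1/chat/completions.
payload = {
    "model": "qwen3.5-27b",  # must match --served-model-name
    "messages": [
        {"role": "user", "content": "Explain NVFP4 in one sentence."}
    ],
    # Passed through to the chat template; disables the thinking block.
    "chat_template_kwargs": {"enable_thinking": False},
    "max_tokens": 256,
}
body = json.dumps(payload)
```

Any OpenAI-compatible client (`curl`, the `openai` Python SDK, etc.) can send this body unchanged.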
DGX Spark (SM121) Compatibility Notes
- NVFP4 on SM121 falls back to W4A16 (the native W4A4 path is not yet supported; the `cvt.e2m1x2` instruction is missing)
- FP8 KV cache is not compatible with GDN non-causal attention layers; use `--kv-cache-dtype auto`
- `--language-model-only` skips vision-encoder profiling for text-only inference
- Clear the page cache before starting on UMA: `sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'`
Safety Warning
This model has safety filtering removed (abliterated) and may generate inappropriate content. Users are solely responsible for all consequences arising from its use.
Credits
- Original Model: Qwen/Qwen3.5-27B by Alibaba Qwen Team
- Fine-tuning & Abliteration: huihui-ai
- NVFP4 Quantization: YuYu1015 on NVIDIA DGX Spark (GB10)
- Quantization Tool: llm-compressor by vLLM Project