Huihui-Qwopus3.5-27B-v3-abliterated-NVFP4


Quantized on 2026-04-13 with lm_head / linear_attn (GDN) / mtp / visual / embed_tokens preserved in BF16.

For NVIDIA DGX Spark (GB10 SM121) users: Driver 590.48+ / CUDA 13.1+ required.

As of April 2026, NVFP4 software support on SM121 is still incomplete. The native W4A4 compute path is not yet functional on this hardware — the runtime silently falls back to W4A16 (BF16 activations), negating the theoretical throughput advantage of FP4.

If accuracy and inference speed are your priorities, we recommend the INT4 AutoRound version instead: 👉 YuYu1015/Huihui-Qwopus3.5-27B-v3-abliterated-int4-AutoRound

INT4 AutoRound leverages the mature W4A16 Marlin kernel path on DGX Spark, offering more thorough calibration (~99.5% quality retention) and significantly more stable performance. The full potential of NVFP4 will only be unlocked once NVIDIA delivers complete W4A4 kernel support for SM121.

NVFP4 quantization of huihui-ai/Huihui-Qwopus3.5-27B-v3-abliterated, optimized for NVIDIA DGX Spark (GB10 SM121).

Model Details

| Item | Value |
| --- | --- |
| Architecture | Dense 27B + GDN (Mamba) + Attention hybrid |
| Base model | Qwen/Qwen3.5-27B |
| Fine-tuned by | huihui-ai (Qwopus v3 distillation + abliteration) |
| Quantized by | YuYu1015 |
| Model size | ~25 GB (NVFP4, vs ~51 GB BF16 original) |
| Context length | Up to 262,144 tokens |
| Thinking mode | Supported (`enable_thinking: true/false`) |
| Tool calling | Supported (`qwen3_coder` parser) |
| MTP | Built-in Multi-Token Prediction weights included (preserved in BF16) |

Quantization Details

| Item | Value |
| --- | --- |
| Method | llm-compressor (main branch, PR #2608) |
| Scheme | NVFP4 (E2M1 + FP8 per-group scaling, group size 16) |
| Format | compressed-tensors (main branch) |
| Calibration dataset | HuggingFaceH4/ultrachat_200k (`train_sft` split) |
| Calibration samples | 512 |
| Calibration sequence length | 2048 |
| Hardware | NVIDIA DGX Spark (GB10, 128 GB unified memory) |
| Environment | transformers>=5.0 + llm-compressor main (the Qwen3.5 `qwen3_5` model_type requires Transformers 5) |
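
The NVFP4 scheme above (E2M1 weight values with one scale per group of 16) can be sketched in plain Python. This is a minimal, illustrative fake-quantizer only: the `amax / 6` scale rule and the plain-float scale are simplifying assumptions, real NVFP4 stores the scale in FP8 (E4M3), and the function name `quantize_dequantize` is hypothetical.

```python
# Magnitudes representable by FP4 E2M1 (sign handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_dequantize(values, group_size=16):
    """Fake-quantize a flat list of weights, one scale per group."""
    out = []
    for i in range(0, len(values), group_size):
        group = values[i:i + group_size]
        amax = max(abs(v) for v in group) or 1.0  # avoid dividing by zero
        scale = amax / 6.0  # map the group onto the E2M1 range [-6, 6]
        for v in group:
            # Snap the scaled magnitude to the nearest representable value.
            mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
            out.append((mag if v >= 0 else -mag) * scale)
    return out
```

Because each group carries its own scale, one outlier weight only degrades the 15 values that share its group, which is why the small group size of 16 matters for quality.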

Layers Preserved in BF16

The following layers are not quantized to preserve model quality:

| Layer | Reason |
| --- | --- |
| `lm_head` | Output head, sensitive to quantization noise |
| `re:.*linear_attn.*` | GDN/DeltaNet (Mamba) layers, which may output zeros if quantized |
| `re:.*mtp\..*` | Multi-Token Prediction weights |
| `re:.*visual\..*` | Vision encoder |
| `re:.*embed_tokens$` | Input embeddings |
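
In the compressed-tensors convention, entries prefixed with `re:` are regular expressions over module names, while plain entries match a name exactly. A small sketch of how such an ignore list selects layers (the helper name and the example module names are hypothetical):

```python
import re

# Ignore list as given to the quantizer; "re:"-prefixed entries are regexes.
IGNORE = ["lm_head", "re:.*linear_attn.*", r"re:.*mtp\..*",
          r"re:.*visual\..*", "re:.*embed_tokens$"]

def is_preserved(module_name):
    """True if the module is kept in BF16 rather than quantized."""
    for pattern in IGNORE:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False
```

Note that `re:.*mtp\..*` requires the literal substring `mtp.`, so an MLP projection like `model.layers.3.mlp.gate_proj` is still quantized.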

Serving with vLLM

```bash
vllm serve /path/to/model \
    --quantization compressed-tensors \
    --served-model-name qwen3.5-27b \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --kv-cache-dtype auto \
    --gpu-memory-utilization 0.90 \
    --max-model-len 65536 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --trust-remote-code \
    --language-model-only
```
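
Once the server is up, requests go to its OpenAI-compatible endpoint. A minimal request-body sketch follows; passing `enable_thinking` through `chat_template_kwargs` is an assumption based on vLLM's Qwen3-series support, so verify it against your vLLM version.

```python
import json

# Request body for POST http://localhost:8000/v1/chat/completions.
payload = {
    "model": "qwen3.5-27b",  # must match --served-model-name
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 256,
    # Toggle thinking mode per request (assumption: vLLM forwards
    # chat_template_kwargs to the chat template, as it does for Qwen3).
    "chat_template_kwargs": {"enable_thinking": False},
}
body = json.dumps(payload)
```

Send `body` with any HTTP client; with thinking enabled, the `--reasoning-parser qwen3` flag above splits the reasoning trace into a separate `reasoning_content` field.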

DGX Spark (SM121) Compatibility Notes

- NVFP4 on SM121 falls back to W4A16 (the native W4A4 path is not yet supported; the `cvt.e2m1x2` instruction is missing)
- FP8 KV cache is not compatible with the GDN non-causal attention layers; use `--kv-cache-dtype auto`
- `--language-model-only` skips vision-encoder profiling for text-only inference
- On UMA, clear the page cache before starting: `sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'`

Safety Warning

This model has safety filtering removed (abliterated) and may generate inappropriate content. Users are solely responsible for all consequences arising from its use.

Credits


