Update 2025-05-06: Replaced `chat_template` in `tokenizer_config.json` with the fixed version from froggeric/Qwen-Fixed-Chat-Templates.
Huihui-Qwen3.6-35B-A3B-abliterated-FP8
Vision-capable FP8-quantized abliterated Qwen3.6-35B-A3B (MoE, hybrid mamba/attention) for Nvidia DGX Spark and other FP8-capable hardware (~80 GB VRAM for full 262k context).
I've tested many abliterated models from HF, and only Huihui makes really good ones. If you like, check out the "Claude" version: batsclamp/Huihui-Qwen3.6-35B-A3B-Claude-4.6-Opus-abliterated-FP8
This one gives you ~50 t/s at full context when used with Eugr's vLLM (DGX Spark).
Model Lineage
- Base: Qwen/Qwen3.6-35B-A3B (BF16, hybrid linear-attention + full-attention MoE with 40 layers, 256 experts, vision encoder)
- Abliterated (refusals removed) by Huihui: huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated
- This repo: FP8 quantization matching the format of Qwen/Qwen3.6-35B-A3B-FP8
Why FP8
Qwen3.6-35B-A3B in BF16 is ~72 GB on disk. FP8 cuts that to ~37 GB while preserving vision layers and precision-sensitive modules in BF16. The expected throughput uplift on DGX Spark is on par with what we saw for Qwen3.5 (31 → 51 t/s, ~65%).
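The two ratios quoted above can be re-derived with a few lines of arithmetic. This is only a check on the rounded figures from the paragraph (72 GB, 37 GB, 31 t/s, 51 t/s); `percent_change` is a helper name made up for this sketch.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Rounded figures taken from the paragraph above.
size_change = percent_change(72, 37)        # BF16 -> FP8 size on disk, in GB
throughput_change = percent_change(31, 51)  # Qwen3.5 on DGX Spark, in t/s

print(f"disk size:  {size_change:+.1f}%")
print(f"throughput: {throughput_change:+.1f}%")
```

With these inputs the size drops by a little under half, and the throughput gain lands just under the quoted ~65%.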
Quantization Details
Scheme: native FP8 blockwise, identical on-disk format to the official Qwen/Qwen3.6-35B-A3B-FP8.
| Field | Value |
|---|---|
| `quant_method` | `fp8` |
| `activation_scheme` | `dynamic` (per-token, at inference) |
| `fmt` | `e4m3` |
| `weight_block_size` | `[128, 128]` |
| Scale dtype / key | bf16, `*.weight_scale_inv` |
| Scale shape | `(ceil(out/128), ceil(in/128))` |
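As a rough illustration of the table above, the sketch below computes the scale-tensor shape and one dequantization scale per `[128, 128]` block. It assumes `weight_scale_inv` holds `amax / 448` for each block (448 being the largest finite E4M3 value), so that a weight is recovered as `fp8_value * weight_scale_inv`. That is how this format is commonly described, but it is not confirmed by this card. The function names are hypothetical, and the actual FP8 cast is left out.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value representable in E4M3
BLOCK = 128           # weight_block_size is [128, 128]


def scale_shape(out_features: int, in_features: int, block: int = BLOCK):
    """Shape of the per-block scale tensor for an [out, in] weight."""
    return (math.ceil(out_features / block), math.ceil(in_features / block))


def block_scales(weight, block: int = BLOCK):
    """Per-block dequantization scales for a 2D weight given as a list of rows.

    Assumption: scale = amax(block) / 448; all-zero blocks get scale 1.0.
    """
    rows, cols = len(weight), len(weight[0])
    n_r, n_c = scale_shape(rows, cols, block)
    amax = [[0.0] * n_c for _ in range(n_r)]
    for r in range(rows):
        for c in range(cols):
            br, bc = r // block, c // block
            amax[br][bc] = max(amax[br][bc], abs(weight[r][c]))
    return [[a / FP8_E4M3_MAX if a > 0 else 1.0 for a in row] for row in amax]


# A [2048, 512] weight needs a (16, 4) scale tensor.
print(scale_shape(2048, 512))
```

Partial blocks at the edges still get their own scale, which is why the shape uses `ceil` rather than integer division.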
Quantized (weights → FP8 e4m3, per-block [128, 128] scales):
- All 2D Linear `*.weight` in language layers that aren't in the exclusion list, including:
  - `self_attn.{q,k,v,o}_proj` (full-attention layers)
  - `linear_attn.{in_proj_qkv, in_proj_z, out_proj}` (linear-attention / mamba layers)
  - `mlp.shared_expert.{gate,up,down}_proj`
- All 256 experts per MoE layer, un-fused to match Qwen's official per-expert layout: `mlp.experts.{0..255}.{gate_proj, up_proj, down_proj}.weight`
Kept in BF16 (matches Qwen's modules_to_not_convert):
| Module | Reason |
|---|---|
| `lm_head` | Output head, precision-sensitive |
| `model.language_model.embed_tokens` | Embedding layer |
| `*.input_layernorm`, `*.post_attention_layernorm` | LayerNorms |
| `*.self_attn.{q_norm, k_norm}` | QK norms |
| `*.linear_attn.{A_log, conv1d, dt_bias, in_proj_a, in_proj_b, in_proj_ba, norm}` | Mamba state-space params (small, sensitive) |
| `*.mlp.gate`, `*.mlp.shared_expert_gate` | MoE router gates; routing precision matters |
| `model.visual.*` | Entire visual encoder (patch_embed, 27 ViT blocks, deepstack mergers, merger) |
| `mtp.*` | Multi-token prediction module |
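The quantize / keep split above can be expressed as a name filter. The following is a minimal sketch, not the script used for this repo: the glob patterns are one possible transcription of the table, `should_quantize` is a hypothetical helper, and the example tensor names assume a `model.language_model.layers.N.` prefix.

```python
from fnmatch import fnmatchcase

# Glob patterns transcribed from the "Kept in BF16" table (an assumption about
# how the exclusion list maps onto full tensor names).
KEEP_BF16_PATTERNS = [
    "lm_head*",
    "model.language_model.embed_tokens*",
    "*.input_layernorm*",
    "*.post_attention_layernorm*",
    "*.self_attn.q_norm*",
    "*.self_attn.k_norm*",
    "*.linear_attn.A_log",
    "*.linear_attn.conv1d*",
    "*.linear_attn.dt_bias",
    "*.linear_attn.in_proj_a.*",
    "*.linear_attn.in_proj_b.*",
    "*.linear_attn.in_proj_ba.*",
    "*.linear_attn.norm*",
    "*.mlp.gate.*",
    "*.mlp.shared_expert_gate*",
    "model.visual.*",
    "mtp.*",
]


def should_quantize(name: str, ndim: int) -> bool:
    """True if a tensor would go to FP8: a 2D `*.weight` not on the keep list."""
    if ndim != 2 or not name.endswith(".weight"):
        return False
    return not any(fnmatchcase(name, pat) for pat in KEEP_BF16_PATTERNS)


# Hypothetical tensor names for illustration.
prefix = "model.language_model.layers.0."
for name, ndim in [
    (prefix + "self_attn.q_proj.weight", 2),
    (prefix + "mlp.experts.7.gate_proj.weight", 2),
    (prefix + "mlp.gate.weight", 2),
    ("model.visual.blocks.0.attn.qkv.weight", 2),
]:
    print(name, "->", "FP8" if should_quantize(name, ndim) else "BF16")
```

`fnmatchcase` is used instead of `fnmatch` so that matching stays case-sensitive on every platform, which matters for names such as `A_log`.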
Notable Implementation Notes
- Source experts were fused 3D (`mlp.experts.gate_up_proj` [256, 1024, 2048], `mlp.experts.down_proj` [256, 2048, 512]); we un-fuse them to the per-expert layout the official Qwen FP8 uses (`mlp.experts.{E}.{gate, up, down}_proj.weight`). This is what vLLM's Fp8 MoE loader expects.
- Streaming quantization: processed one source shard at a time on the GPU; peak host memory ~6 GB. Avoids the llmcompressor pitfall where peak VM grew to 168 GB during the Compressing phase and got OOM-killed on the 128 GB DGX Spark unified-memory budget.
- Sanity check: round-trip dequantization median relative error ~2.2% per tensor (as expected for E4M3 blockwise).
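The un-fusing step in the first note can be sketched as follows. This is an illustration, not the conversion script itself. It assumes the fused `gate_up_proj` stores the gate rows first and the up rows second along its middle dimension, and `unfuse_experts` is a hypothetical name. Plain nested lists are used so the example runs without a GPU; the same indexing works on tensors.

```python
def unfuse_experts(gate_up, down, prefix="mlp.experts"):
    """Split fused [E, 2*I, H] and [E, H, I] expert weights into per-expert entries.

    Assumption: gate occupies the first I rows of each expert's gate_up block,
    up occupies the last I rows.
    """
    out = {}
    for e, (gu, d) in enumerate(zip(gate_up, down)):
        half = len(gu) // 2
        out[f"{prefix}.{e}.gate_proj.weight"] = gu[:half]
        out[f"{prefix}.{e}.up_proj.weight"] = gu[half:]
        out[f"{prefix}.{e}.down_proj.weight"] = d
    return out


# Toy shapes: 2 experts, intermediate size 2, hidden size 3.
gate_up = [
    [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]],
    [[5, 5, 5], [6, 6, 6], [7, 7, 7], [8, 8, 8]],
]
down = [
    [[9, 9], [9, 9], [9, 9]],
    [[10, 10], [10, 10], [10, 10]],
]
per_expert = unfuse_experts(gate_up, down)
print(sorted(per_expert))
```

With the shapes quoted above, each of the 256 experts would yield a `gate_proj` and `up_proj` of [512, 2048] and a `down_proj` of [2048, 512].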
Numbers
| | BF16 source | This FP8 |
|---|---|---|
| Size on disk | ~72 GB | ~37 GB |
| Tensors in index | 1045 (fused experts) | 64189 (un-fused) |
| FP8 weight tensors | — | 31738 |
| BF16 weight tensors | — | 32451 (incl. 31738 weight_scale_inv) |
Loading
from transformers import AutoModelForImageTextToText, AutoProcessor
model = AutoModelForImageTextToText.from_pretrained(
    "batsclamp/Huihui-Qwen3.6-35B-A3B-abliterated-FP8",
    dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("batsclamp/Huihui-Qwen3.6-35B-A3B-abliterated-FP8")
For vLLM, point at the repo — the quantization_config is already correctly set (quant_method: fp8, weight-block [128, 128], dynamic activations).
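If you want to confirm those settings before serving, a small check against the parsed `config.json` is enough. The sketch below assumes `quantization_config` sits at the top level of the config and uses the field names from the Quantization Details table; `quant_config_mismatches` is a hypothetical helper, and the inline dict stands in for the real file.

```python
# Expected values, taken from the Quantization Details table.
EXPECTED = {
    "quant_method": "fp8",
    "activation_scheme": "dynamic",
    "fmt": "e4m3",
    "weight_block_size": [128, 128],
}


def quant_config_mismatches(config: dict, expected: dict = EXPECTED) -> dict:
    """Return {field: (expected, found)} for every field that differs."""
    qcfg = config.get("quantization_config") or {}
    return {
        key: (want, qcfg.get(key))
        for key, want in expected.items()
        if qcfg.get(key) != want
    }


# Stand-in for json.load(open("config.json")); extra keys are ignored.
example_config = {
    "quantization_config": {
        "quant_method": "fp8",
        "activation_scheme": "dynamic",
        "fmt": "e4m3",
        "weight_block_size": [128, 128],
        "modules_to_not_convert": ["lm_head"],
    }
}
print(quant_config_mismatches(example_config) or "quantization_config matches")
```

An empty result means all four fields match; anything else lists the expected and found values side by side.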
Disclaimer
Abliterated model. Not recommended if you expect a polite corporate assistant.