Qwen3.6-35B-A3B-abliterated-FP8


Libraries

How to use gsting/Qwen3.6-35B-A3B-abliterated-FP8 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="gsting/Qwen3.6-35B-A3B-abliterated-FP8")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("gsting/Qwen3.6-35B-A3B-abliterated-FP8")
model = AutoModelForImageTextToText.from_pretrained("gsting/Qwen3.6-35B-A3B-abliterated-FP8")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use gsting/Qwen3.6-35B-A3B-abliterated-FP8 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "gsting/Qwen3.6-35B-A3B-abliterated-FP8"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "gsting/Qwen3.6-35B-A3B-abliterated-FP8",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
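
The same request can also be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_payload` and `chat` are made up for illustration, and it assumes the server started above is reachable at `http://localhost:8000` and returns the usual OpenAI-style `choices` list.

```python
import json
import urllib.request


def build_chat_payload(model, prompt, image_url):
    """Build the same JSON body as the curl example (one text part, one image part)."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def chat(base_url, payload, timeout=120):
    """POST the payload to an OpenAI-compatible /v1/chat/completions endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    # Assumes the OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]


# Example call (needs a running server, so it is left commented out):
# payload = build_chat_payload(
#     "gsting/Qwen3.6-35B-A3B-abliterated-FP8",
#     "Describe this image in one sentence.",
#     "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
# )
# print(chat("http://localhost:8000", payload))
```

For the SGLang server described further down, the same helpers should work with `http://localhost:30000` as the base URL.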

Use Docker

docker model run hf.co/gsting/Qwen3.6-35B-A3B-abliterated-FP8

SGLang

How to use gsting/Qwen3.6-35B-A3B-abliterated-FP8 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "gsting/Qwen3.6-35B-A3B-abliterated-FP8" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "gsting/Qwen3.6-35B-A3B-abliterated-FP8",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "gsting/Qwen3.6-35B-A3B-abliterated-FP8" \
        --host 0.0.0.0 \
        --port 30000
# Then call the server with the same curl request as in the pip example above (same port, 30000).

Docker Model Runner
How to use gsting/Qwen3.6-35B-A3B-abliterated-FP8 with Docker Model Runner:
```
docker model run hf.co/gsting/Qwen3.6-35B-A3B-abliterated-FP8
```

Update 2025-05-06: Replaced chat_template in tokenizer_config.json with the fixed version from froggeric/Qwen-Fixed-Chat-Templates.

Huihui-Qwen3.6-35B-A3B-abliterated-FP8

Vision-capable, FP8-quantized, abliterated Qwen3.6-35B-A3B (MoE, hybrid Mamba/attention) for NVIDIA DGX Spark and other FP8-capable hardware (~80 GB VRAM for the full 262k context).

I've tested many abliterated models from HF, and only Huihui's are really good. If you want the "Claude" version, see batsclamp/Huihui-Qwen3.6-35B-A3B-Claude-4.6-Opus-abliterated-FP8.

This one gives you roughly 50 tokens/s at full context when served with Eugr's vLLM (DGX Spark).

Model Lineage

- Base: Qwen/Qwen3.6-35B-A3B (BF16, hybrid linear-attention + full-attention MoE with 40 layers, 256 experts, vision encoder)
- Abliterated (refusals removed) by Huihui: huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated
- This repo: FP8 quantization matching the format of Qwen/Qwen3.6-35B-A3B-FP8

Why FP8

Qwen3.6-35B-A3B in BF16 is ~72 GB on disk. FP8 cuts that to ~37 GB while preserving vision layers and precision-sensitive modules in BF16. The expected throughput uplift on DGX Spark is on par with what we saw for Qwen3.5 (31 → 51 t/s, ~65%).
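
As a quick check of the figures quoted above, the size ratio and the throughput uplift can be recomputed directly. The numbers are the approximate ones from the paragraph, not new measurements.

```python
# Approximate figures quoted in the text above.
bf16_gb, fp8_gb = 72, 37          # size on disk
before_tps, after_tps = 31, 51    # Qwen3.5 throughput on DGX Spark, BF16 vs FP8

size_ratio = fp8_gb / bf16_gb
uplift = (after_tps - before_tps) / before_tps

print(f"FP8 size relative to BF16: {size_ratio:.0%}")   # 51%
print(f"Throughput uplift: {uplift:.1%}")               # 64.5%
```

The FP8 checkpoint is slightly more than half the BF16 size because the vision layers, other excluded modules, and the block scales stay in BF16.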

Quantization Details

Scheme: native FP8 blockwise, identical on-disk format to the official Qwen/Qwen3.6-35B-A3B-FP8.

| Field | Value |
|---|---|
| `quant_method` | `fp8` |
| `activation_scheme` | `dynamic` (per-token, at inference) |
| `fmt` | `e4m3` |
| `weight_block_size` | `[128, 128]` |
| Scale dtype / key | `bf16`, `*.weight_scale_inv` |
| Scale shape | `(ceil(out/128), ceil(in/128))` |
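
To make the scale layout concrete, here is a small pure-Python sketch of how per-block scales could be derived for one 2D weight. It is not the script used for this repo. It assumes that `weight_scale_inv` holds the dequantization multiplier (block absolute maximum divided by 448, the largest finite e4m3 value), and it only computes scales without casting anything to FP8.

```python
import math

FP8_E4M3_MAX = 448.0   # largest finite value of the e4m3 (fn) format
BLOCK = (128, 128)     # weight_block_size from the table above


def scale_shape(out_features, in_features, block=BLOCK):
    """Shape of the scale tensor for a weight of shape (out_features, in_features)."""
    return (math.ceil(out_features / block[0]), math.ceil(in_features / block[1]))


def block_scales(weight, block=BLOCK):
    """Per-block scales (block amax / 448) for a weight given as a list of rows."""
    rows, cols = len(weight), len(weight[0])
    n_r, n_c = scale_shape(rows, cols, block)
    amax = [[0.0] * n_c for _ in range(n_r)]
    for r in range(rows):
        for c in range(cols):
            br, bc = r // block[0], c // block[1]
            amax[br][bc] = max(amax[br][bc], abs(weight[r][c]))
    return [[a / FP8_E4M3_MAX for a in row] for row in amax]


print(scale_shape(2048, 512))   # (16, 4)
```

Edge blocks that are smaller than 128 x 128 still get their own scale entry, which is why the shape uses `ceil`.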

Quantized (weights → FP8 e4m3, per-block [128, 128] scales):

All 2D Linear *.weight in language layers that aren't in the exclusion list, including:
- self_attn.{q,k,v,o}_proj (full-attention layers)
- linear_attn.{in_proj_qkv, in_proj_z, out_proj} (linear-attention / mamba layers)
- mlp.shared_expert.{gate,up,down}_proj
- All 256 experts per MoE layer, un-fused to match Qwen's official per-expert layout:
  - mlp.experts.{0..255}.{gate_proj, up_proj, down_proj}.weight
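
The un-fusing step can be sketched as follows, assuming the source checkpoint stores each MoE layer's experts as two fused 3D tensors (`gate_up_proj` of shape [E, 2*I, H] and `down_proj` of shape [E, H, I]). The function name is hypothetical, nested lists stand in for tensors, and the ordering of gate rows before up rows inside the fused tensor is an assumption, not something the source states.

```python
def unfuse_experts(gate_up_proj, down_proj, prefix="mlp.experts"):
    """Split fused per-layer expert tensors into per-expert 2D weights.

    gate_up_proj: [E][2*I][H] nested lists; down_proj: [E][H][I] nested lists.
    Assumes the first I rows of each gate_up slice are gate_proj (unverified).
    """
    out = {}
    for e, (gate_up, down) in enumerate(zip(gate_up_proj, down_proj)):
        inter = len(gate_up) // 2
        out[f"{prefix}.{e}.gate_proj.weight"] = gate_up[:inter]
        out[f"{prefix}.{e}.up_proj.weight"] = gate_up[inter:]
        out[f"{prefix}.{e}.down_proj.weight"] = down
    return out


# Tiny example: E=2 experts, I=1, H=2.
fused_gate_up = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
fused_down = [[[9], [10]], [[11], [12]]]
per_expert = unfuse_experts(fused_gate_up, fused_down)
print(sorted(per_expert))
```

With 256 experts this yields 768 separate 2D weights per MoE layer, each of which can then be quantized like any other Linear weight.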

Kept in BF16 (matches Qwen's modules_to_not_convert):

| Module | Reason |
|---|---|
| `lm_head` | Output head (precision-sensitive) |
| `model.language_model.embed_tokens` | Embedding layer |
| `.input_layernorm`, `.post_attention_layernorm` | LayerNorms |
| `*.self_attn.{q_norm, k_norm}` | QK norms |
| `*.linear_attn.{A_log, conv1d, dt_bias, in_proj_a, in_proj_b, in_proj_ba, norm}` | Mamba state-space params (small, sensitive) |
| `.mlp.gate`, `.mlp.shared_expert_gate` | MoE router gates (routing precision matters) |
| `model.visual.*` | Entire visual encoder (patch_embed, 27 ViT blocks, deepstack mergers, merger) |
| `mtp.*` | Multi-token prediction module |
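
One way to express this exclusion list in code is a set of glob patterns checked against each tensor name. This is only an illustration: the patterns are transcribed from the table, the helper `keep_in_bf16` is hypothetical, and the full tensor names in the example are assumed rather than taken from the checkpoint.

```python
from fnmatch import fnmatchcase

# Glob patterns transcribed from the table above (illustrative, not the real script).
BF16_PATTERNS = [
    "lm_head*",
    "model.language_model.embed_tokens*",
    "*.input_layernorm.*",
    "*.post_attention_layernorm.*",
    "*.self_attn.q_norm.*",
    "*.self_attn.k_norm.*",
    "*.linear_attn.A_log",
    "*.linear_attn.dt_bias",
    "*.linear_attn.conv1d.*",
    "*.linear_attn.in_proj_a.*",
    "*.linear_attn.in_proj_b.*",
    "*.linear_attn.in_proj_ba.*",
    "*.linear_attn.norm.*",
    "*.mlp.gate.*",                # router gate, not experts' gate_proj
    "*.mlp.shared_expert_gate.*",
    "model.visual.*",
    "mtp.*",
]


def keep_in_bf16(tensor_name):
    """Return True if the tensor name matches any exclusion pattern."""
    return any(fnmatchcase(tensor_name, pat) for pat in BF16_PATTERNS)


# Assumed example names:
print(keep_in_bf16("model.language_model.layers.0.mlp.gate.weight"))                 # True
print(keep_in_bf16("model.language_model.layers.0.mlp.experts.3.gate_proj.weight"))  # False
```

The case-sensitive `fnmatchcase` is used so that names such as `A_log` match the same way on every platform.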

Notable Implementation Notes

- Un-fused experts: the source stores experts as fused 3D tensors (`mlp.experts.gate_up_proj[256, 1024, 2048]`, `mlp.experts.down_proj[256, 2048, 512]`). We un-fuse them into the per-expert layout the official Qwen FP8 checkpoint uses (`mlp.experts.{E}.{gate, up, down}_proj.weight`), which is what vLLM's FP8 MoE loader expects.
- Streaming quantization: source shards are processed one at a time on the GPU, with peak host memory of ~6 GB. This avoids the llmcompressor pitfall where peak virtual memory grew to 168 GB during the Compressing phase and the process was OOM-killed on the DGX Spark's 128 GB unified-memory budget.
- Sanity check: the round-trip dequantization median relative error is ~2.2% per tensor, as expected for blockwise E4M3.
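
A round-trip check of this kind can be imitated in pure Python on synthetic data. The sketch below emulates e4m3 rounding by hand (3 mantissa bits, minimum normal exponent -6, saturation at 448) and uses one shared scale per block. The Gaussian test block is a stand-in, so the printed value is not expected to reproduce the figure reported for the real weights.

```python
import math
import random
import statistics

E4M3_MAX = 448.0  # largest finite e4m3 (fn) value


def round_e4m3(x):
    """Round to the nearest e4m3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)
    _, exp = math.frexp(a)      # a == m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, -6)        # exponents below -6 fall into the subnormal range
    step = 2.0 ** (e - 3)       # 3 mantissa bits give 8 steps per binade
    return math.copysign(min(round(a / step) * step, E4M3_MAX), x)


def roundtrip_median_rel_error(block):
    """Quantize a block with one shared scale, dequantize, return median relative error."""
    scale = max(abs(w) for w in block) / E4M3_MAX   # plays the role of weight_scale_inv
    errors = [abs(round_e4m3(w / scale) * scale - w) / abs(w) for w in block if w != 0.0]
    return statistics.median(errors)


random.seed(0)
block = [random.gauss(0.0, 0.02) for _ in range(128 * 128)]  # synthetic 128x128 block
print(f"median relative error: {roundtrip_median_rel_error(block):.2%}")
```

With 3 mantissa bits the worst-case relative error for normal values is 2^-4 (6.25%), so a median of a few percent is the expected order of magnitude.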

Numbers

| | BF16 source | This FP8 |
|---|---|---|
| Size on disk | ~72 GB | ~37 GB |
| Tensors in index | 1045 (fused experts) | 64189 (un-fused) |
| FP8 weight tensors | n/a | 31738 |
| BF16 weight tensors | n/a | 32451 (incl. 31738 `weight_scale_inv`) |
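
The tensor counts in the table are internally consistent, which a few lines of arithmetic confirm. The variable names are only for illustration.

```python
# Counts taken from the table above.
total_tensors = 64189
fp8_weights = 31738
bf16_tensors = 32451
scale_tensors = 31738   # one weight_scale_inv per FP8 weight

# FP8 weights plus BF16 tensors account for every entry in the index.
assert fp8_weights + bf16_tensors == total_tensors

non_scale_bf16 = bf16_tensors - scale_tensors
print("BF16 tensors that are not scales:", non_scale_bf16)   # 713
```

The remaining 713 BF16 tensors are the modules deliberately left unquantized.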

Loading

from transformers import AutoModelForImageTextToText, AutoProcessor
model = AutoModelForImageTextToText.from_pretrained(
    "batsclamp/Huihui-Qwen3.6-35B-A3B-abliterated-FP8",
    dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("batsclamp/Huihui-Qwen3.6-35B-A3B-abliterated-FP8")

For vLLM, point at the repo — the quantization_config is already correctly set (quant_method: fp8, weight-block [128, 128], dynamic activations).
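
If you want to verify those settings yourself, a small check against a downloaded `config.json` might look like this. It assumes the settings live under a top-level `quantization_config` key with the field names from the Quantization Details table; the helper name and the file path in the usage comment are hypothetical.

```python
# Expected values, taken from the Quantization Details table.
EXPECTED = {
    "quant_method": "fp8",
    "activation_scheme": "dynamic",
    "fmt": "e4m3",
    "weight_block_size": [128, 128],
}


def quant_config_mismatches(config):
    """Return {field: (expected, found)} for every field that differs."""
    qc = config.get("quantization_config", {})
    return {k: (v, qc.get(k)) for k, v in EXPECTED.items() if qc.get(k) != v}


# Usage with a local copy of the repo (path is a placeholder):
# import json
# with open("config.json") as f:
#     print(quant_config_mismatches(json.load(f)) or "quantization_config looks as expected")
```

An empty result means all four fields match.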

Disclaimer

Abliterated model. Not recommended if you expect a polite corporate assistant.


Safetensors: 36B params; tensor types BF16 and F8_E4M3.
