# Qwopus3.5-27B-v3-AWQ-4bit
AutoAWQ-format 4-bit quantized version of Jackrong/Qwopus3.5-27B-v3.
This checkpoint preserves the hybrid Qwen3.5 DeltaNet + softmax architecture and MTP (Multi-Token Prediction) head from the BF16 source, exporting an AutoAWQ-compatible W4A16 checkpoint for broad runtime compatibility.
## Verified Inference
Local export was completed on 2026-04-07 on a single NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB) with:
```
auto-round==0.10.2
transformers==5.3.0
vllm==0.17.1
```
What was verified:
- AutoAWQ export completed successfully
- `quantization_config.json` written with `quant_method=awq`
- Output uses `bits=4`, `group_size=128`, `sym=false`, `zero_point=true`
- MTP weights are included in the main model shards
## Quantization Strategy
AutoRound AutoAWQ export using W4A16 asymmetric group-wise quantization:
| Precision | Layers |
|---|---|
| INT4 weights + BF16 activations | most quantized linear layers |
| BF16 | lm_head, embed_tokens, self_attn.o_proj, DeltaNet linear_attn.out_proj, DeltaNet in_proj_a/in_proj_b, visual encoder, MTP sidecar |
AWQ details:
- Weights: INT4
- Activations: BF16/FP16 at inference time
- Group size: `128`
- Asymmetric quantization: `sym=false`, `zero_point=true`
- Format: AutoAWQ `gemm`
- Calibration: 512 samples, 4096 sequence length
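The asymmetric group-wise scheme above can be sketched in a few lines. This is an illustrative reference implementation of the math (`bits=4`, `zero_point=true`, one scale and zero point per group), not the actual AutoRound or AutoAWQ kernels:

```python
# Illustrative W4 asymmetric group-wise quantization sketch,
# not the actual AutoRound/AutoAWQ kernels.

def quantize_group(weights, bits=4):
    """Quantize one group of float weights to unsigned INT4 codes
    with a per-group scale and zero point."""
    qmax = (1 << bits) - 1                     # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax or 1.0      # avoid zero scale
    zero_point = round(-w_min / scale)
    q = [max(0, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize_group(q, scale, zero_point):
    """Reconstruct approximate float weights from the INT4 codes."""
    return [(v - zero_point) * scale for v in q]

# One group of 128 weights, as with group_size=128.
group = [0.02 * i - 0.5 for i in range(128)]
q, s, zp = quantize_group(group)
recon = dequantize_group(q, s, zp)
max_err = max(abs(a - b) for a, b in zip(group, recon))
assert max_err <= s   # error bounded by one quantization step
```

The asymmetric zero point lets the INT4 range cover skewed weight distributions (min and max mapped to 0 and 15) rather than forcing symmetry around zero.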
Architecture match with the BF16 source:
- `model_type=qwen3_5`
- 64 text layers (hybrid DeltaNet + softmax, `full_attention_interval=4`)
- `mtp_num_hidden_layers=1`
- `max_position_embeddings=262144`
- `hidden_size=5120`, `intermediate_size=17408`
- `vocab_size=248320`
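As a rough sanity check on storage cost: with `group_size=128`, each group carries one scale and one zero point on top of the 4-bit codes. The overhead figures below (one FP16 scale and one packed INT4 zero point per group) are assumptions based on common AutoAWQ layouts, not values read from this checkpoint:

```python
# Back-of-the-envelope storage cost per weight for W4 group-wise
# quantization with group_size=128. Scale/zero-point widths are
# assumptions (FP16 scale, packed INT4 zero point per group).
GROUP_SIZE = 128
WEIGHT_BITS = 4
SCALE_BITS = 16
ZERO_BITS = 4

bits_per_weight = WEIGHT_BITS + (SCALE_BITS + ZERO_BITS) / GROUP_SIZE
print(f"{bits_per_weight:.5f} bits/weight")   # 4.15625

# Applied to a single 5120 x 17408 projection
# (hidden_size x intermediate_size from the config above):
params = 5120 * 17408
mib = params * bits_per_weight / 8 / 2**20
print(f"{mib:.1f} MiB for one such projection")
```

So the group-wise metadata adds only about 4% on top of the raw 4-bit weights, which is why `group_size=128` is a common default.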
## Usage

### vLLM
```shell
pip install -U "vllm>=0.17.0" "transformers>=5.3.0"
```

(Quote the requirement specifiers so the shell does not treat `>=` as a redirection.)
Standard serving:
```shell
vllm serve mconcat/Qwopus3.5-27B-v3-AWQ-4bit \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.85 \
    --max-num-seqs 1 \
    --skip-mm-profiling \
    --reasoning-parser qwen3
```
With MTP speculative decoding:
```shell
vllm serve mconcat/Qwopus3.5-27B-v3-AWQ-4bit \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.85 \
    --max-num-seqs 1 \
    --skip-mm-profiling \
    --reasoning-parser qwen3 \
    --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
```
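As background on the speculative step: with `num_speculative_tokens=1`, the MTP head drafts one token ahead and the base model verifies it in the same forward pass. The toy sketch below shows greedy-style verification only; it is a simplification for illustration (vLLM's sampler uses a rejection-sampling scheme), not vLLM code:

```python
# Toy illustration of one-token speculative decoding with an MTP
# draft head, using greedy verification for simplicity. Not vLLM code.
def verify_draft(draft_token, base_logits):
    """Accept the draft token iff it matches the base model's argmax;
    otherwise emit the base model's own pick. Output quality is
    therefore unchanged either way."""
    base_token = max(range(len(base_logits)), key=base_logits.__getitem__)
    accepted = draft_token == base_token
    return base_token, accepted

# Draft agrees with the base model -> accepted, one decode step saved.
tok, ok = verify_draft(7, [0.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9])
assert tok == 7 and ok

# Draft disagrees -> discarded; the base model's token is emitted.
tok, ok = verify_draft(3, [0.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9])
assert tok == 7 and not ok
```

Because rejected drafts fall back to the base model's token, speculative decoding trades extra draft compute for latency without changing what the served model would have produced.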
### Transformers
This export is not intended for plain transformers inference. Use a runtime that understands AutoAWQ-format checkpoints, such as vLLM with AWQ support.
## Compatibility
| Framework | Supported | Notes |
|---|---|---|
| vLLM >= 0.17.0 | Yes | Intended serving path for this AutoAWQ export |
| transformers >= 5.3.0 | No | Plain transformers is not the intended inference path for AutoAWQ checkpoints |
| AutoAWQ-compatible runtimes | Expected | Export format is AutoAWQ-style quant_method=awq, version=gemm |
| SGLang | Unknown | Not verified |
## Notes
- This is an AutoAWQ-format export, not a compressed-tensors AWQ format.
- The output keeps `self_attn.o_proj` and DeltaNet `linear_attn.out_proj` in BF16 rather than 4-bit.
- MTP weights are included in the model shards (no separate `model.mtp.safetensors`).
- The model includes a vision encoder (loaded but unused for text-only inference). Use `--skip-mm-profiling` with vLLM to skip vision encoder profiling.
- Blackwell (SM120) note: if you encounter TMA-related crashes, apply the one-line vLLM patch to disable TMA on Blackwell: change the compute-capability check from `>= 9` to `9 <= x < 12` in `vllm/model_executor/layers/fla/ops/utils.py`.
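The intent of that patch is to gate the TMA fast path by compute-capability major version so that SM120 consumer Blackwell parts are excluded. A minimal sketch of the gating logic (illustrative; the function name and surrounding shape are assumptions, not the actual vLLM source):

```python
# Illustrative compute-capability gate for a TMA fast path, mirroring
# the intent of the ">= 9" -> "9 <= x < 12" patch described above.
# Not the actual vLLM source.
def tma_supported(cc_major: int) -> bool:
    """Enable TMA only for compute capability majors 9-11,
    excluding SM120 (major 12)."""
    return 9 <= cc_major < 12   # the unpatched check was `cc_major >= 9`

assert tma_supported(9)        # Hopper (SM90): TMA path enabled
assert not tma_supported(12)   # RTX PRO 6000 Blackwell (SM120): disabled
assert not tma_supported(8)    # Ampere: unchanged, no TMA either way
```

The unpatched `>= 9` check also matches major 12, which is what triggers the TMA crashes this workaround avoids.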