Qwopus3.5-27B-v3-AWQ-4bit

AutoAWQ-format 4-bit quantized version of Jackrong/Qwopus3.5-27B-v3.

This checkpoint preserves the hybrid Qwen3.5 DeltaNet + softmax attention architecture and the MTP (Multi-Token Prediction) head of the BF16 source, exported as an AutoAWQ-compatible W4A16 checkpoint for broad runtime compatibility.

Verified Inference

Local export was completed on 2026-04-07 on a single NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB) with:

  • auto-round==0.10.2
  • transformers==5.3.0
  • vllm==0.17.1

What was verified:

  • AutoAWQ export completed successfully
  • quantization_config.json written with quant_method=awq
  • Output uses bits=4, group_size=128, sym=false, zero_point=true
  • MTP weights are included in the main model shards
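
The config-level checks above can be expressed as a small script. A minimal sketch, assuming the checkpoint's config.json has been parsed into a dict; the field names follow the AutoAWQ convention this card describes, and the example dict is illustrative:

```python
def check_awq_config(config: dict) -> None:
    """Assert the quantization fields this card claims for the export."""
    q = config["quantization_config"]
    assert q["quant_method"] == "awq"
    assert q["bits"] == 4
    assert q["group_size"] == 128
    assert q["zero_point"] is True  # asymmetric scheme: sym=false

# Illustrative values matching the fields written by this export.
example = {
    "quantization_config": {
        "quant_method": "awq",
        "bits": 4,
        "group_size": 128,
        "zero_point": True,
        "version": "gemm",
    }
}
check_awq_config(example)
print("quantization_config OK")
```

Running the same check against the real config.json in the downloaded checkpoint directory is left to the reader.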

Quantization Strategy

AutoRound AutoAWQ export using W4A16 asymmetric group-wise quantization:

Precision                          Layers
INT4 weights + BF16 activations    Most quantized linear layers
BF16                               lm_head, embed_tokens, self_attn.o_proj, DeltaNet linear_attn.out_proj, DeltaNet in_proj_a/in_proj_b, visual encoder, MTP sidecar

AWQ details:

  • weights: INT4
  • activations: BF16/FP16 at inference time
  • group size: 128
  • asymmetric quantization: sym=false, zero_point=true
  • format: AutoAWQ gemm
  • calibration: 512 samples, 4096 sequence length
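
The group-wise asymmetric scheme above can be illustrated with a round trip on one weight group. A toy pure-Python sketch (not the AutoRound kernels): each group of 128 weights gets its own scale and integer zero point, weights are stored as unsigned 4-bit values, and dequantization is (q - zero_point) * scale:

```python
import random

def quantize_group(w, bits=4):
    """Asymmetric quantization of one group: returns (q, scale, zero_point)."""
    qmax = (1 << bits) - 1                       # 15 for 4-bit
    lo, hi = min(w), max(w)
    scale = (hi - lo) / qmax or 1.0              # guard against a flat group
    zp = round(-lo / scale)                      # integer zero point
    q = [max(0, min(qmax, round(x / scale) + zp)) for x in w]
    return q, scale, zp

def dequantize_group(q, scale, zp):
    return [(x - zp) * scale for x in q]

# One 128-element group, matching this export's group_size=128.
random.seed(0)
row = [random.uniform(-0.1, 0.1) for _ in range(128)]
q, s, zp = quantize_group(row)
rec = dequantize_group(q, s, zp)
err = max(abs(a - b) for a, b in zip(row, rec))
assert err <= 0.5 * s + 1e-9  # error bounded by half a quantization step
```

The per-group zero point is what sym=false buys: groups whose weights are skewed away from zero keep full use of the 16 levels.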

Architecture match with the BF16 source:

  • model_type=qwen3_5
  • 64 text layers (hybrid DeltaNet + softmax, full_attention_interval=4)
  • mtp_num_hidden_layers=1
  • max_position_embeddings=262144
  • hidden_size=5120, intermediate_size=17408
  • vocab_size=248320
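
As a sanity check, the listed dimensions roughly account for the parameter count. A back-of-the-envelope sketch that ignores DeltaNet-specific projections, norms, and the MTP and vision weights; the attention term assumes plain multi-head q/k/v/o projections, which only approximates the hybrid layers:

```python
hidden, inter, layers, vocab = 5120, 17408, 64, 248320

mlp   = layers * 3 * hidden * inter    # gate/up/down projections
attn  = layers * 4 * hidden * hidden   # q/k/v/o, MHA-sized approximation
embed = 2 * vocab * hidden             # embed_tokens + untied lm_head

total = mlp + attn + embed
print(f"~{total / 1e9:.1f}B parameters")  # → ~26.4B parameters
```

That lands close to the nominal 27B, with the remainder plausibly in the terms the sketch omits.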

Usage

vLLM

pip install -U "vllm>=0.17.0" "transformers>=5.3.0"

Standard serving:

vllm serve mconcat/Qwopus3.5-27B-v3-AWQ-4bit \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3
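
Once serving, vLLM exposes the standard OpenAI-compatible API. A minimal request-body sketch; the default http://localhost:8000 endpoint is an assumption based on vLLM's defaults, and the prompt is illustrative:

```python
import json

# Body for POST http://localhost:8000/v1/chat/completions
body = {
    "model": "mconcat/Qwopus3.5-27B-v3-AWQ-4bit",
    "messages": [
        {"role": "user", "content": "Briefly explain AWQ quantization."}
    ],
    "max_tokens": 256,
}
print(json.dumps(body, indent=2))
```

Send it with curl or any OpenAI-compatible client; with --reasoning-parser qwen3 the server separates reasoning content from the final answer in the response.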

With MTP speculative decoding:

vllm serve mconcat/Qwopus3.5-27B-v3-AWQ-4bit \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
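
With num_speculative_tokens=1, the MTP head drafts one extra token per step and the main model verifies it, so a correct draft saves a forward pass and a wrong one costs nothing over plain decoding. A toy draft-and-verify sketch for greedy decoding (illustrative only, not vLLM's implementation):

```python
def greedy_verify(draft_token, target_logits):
    """Accept the drafted token iff it matches the target model's argmax."""
    target_token = max(range(len(target_logits)), key=target_logits.__getitem__)
    return draft_token == target_token, target_token

# Toy vocab of 4 tokens: the draft guesses token 2 and the target agrees.
accepted, tok = greedy_verify(2, [0.1, 0.2, 0.6, 0.1])
assert accepted and tok == 2
# When the draft is wrong, the target's own token is emitted instead.
accepted, tok = greedy_verify(0, [0.1, 0.2, 0.6, 0.1])
assert not accepted and tok == 2
```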

Transformers

This export is not intended for plain transformers inference. Use a runtime that understands AutoAWQ-format checkpoints, such as vLLM with AWQ support.

Compatibility

Framework                      Supported   Notes
vLLM >= 0.17.0                 Yes         Intended serving path for this AutoAWQ export
transformers >= 5.3.0          No          Plain transformers is not the intended inference path for AutoAWQ checkpoints
AutoAWQ-compatible runtimes    Expected    Export format is AutoAWQ-style quant_method=awq, version=gemm
SGLang                         Unknown     Not verified

Notes

  • This is an AutoAWQ-format export, not a compressed-tensors AWQ format.
  • The output keeps self_attn.o_proj and DeltaNet linear_attn.out_proj in BF16 rather than 4-bit.
  • MTP weights are included in the model shards (no separate model.mtp.safetensors).
  • The model includes a vision encoder (loaded but unused for text-only inference). Use --skip-mm-profiling with vLLM to skip vision encoder profiling.
  • Blackwell (SM120) note: If you encounter TMA-related crashes, apply the one-line vLLM patch to disable TMA on Blackwell: change >= 9 to 9 <= x < 12 in vllm/model_executor/layers/fla/ops/utils.py.