# Qwopus3.5-27B-v3-AWQ-4bit
AutoAWQ-format 4-bit quantized version of Jackrong/Qwopus3.5-27B-v3.
This checkpoint preserves the hybrid Qwen3.5 DeltaNet + softmax architecture and MTP (Multi-Token Prediction) head from the BF16 source, exporting an AutoAWQ-compatible W4A16 checkpoint for broad runtime compatibility.
## Verified Inference
Local export was completed on 2026-04-07 on a single NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB) with:
```
auto-round==0.10.2
transformers==5.3.0
vllm==0.17.1
```
What was verified:
- AutoAWQ export completed successfully
- `quantization_config.json` written with `quant_method=awq`
- Output uses `bits=4`, `group_size=128`, `sym=false`, `zero_point=true`
- MTP weights are included in the main model shards
## Quantization Strategy
AutoRound AutoAWQ export using W4A16 asymmetric group-wise quantization:
| Precision | Layers |
|---|---|
| INT4 weights + BF16 activations | most quantized linear layers |
| BF16 | lm_head, embed_tokens, self_attn.o_proj, DeltaNet linear_attn.out_proj, DeltaNet in_proj_a/in_proj_b, visual encoder, MTP sidecar |
AWQ details:
- Weights: INT4
- Activations: BF16/FP16 at inference time
- Group size: `128`
- Asymmetric quantization: `sym=false`, `zero_point=true`
- Format: AutoAWQ `gemm`
- Calibration: 512 samples, 4096 sequence length
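The asymmetric group-wise scheme above can be sketched in a few lines. This is an illustrative reference implementation of the math (`bits=4`, `zero_point=true`, one scale and zero point per group), not the actual AutoRound or AutoAWQ kernels:

```python
# Illustrative W4 asymmetric group-wise quantization sketch,
# not the actual AutoRound/AutoAWQ kernels.

def quantize_group(weights, bits=4):
    """Quantize one group of float weights to unsigned INT4 codes
    with a per-group scale and zero point."""
    qmax = (1 << bits) - 1                     # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax or 1.0      # avoid zero scale
    zero_point = round(-w_min / scale)
    q = [max(0, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize_group(q, scale, zero_point):
    """Reconstruct approximate float weights from the INT4 codes."""
    return [(v - zero_point) * scale for v in q]

# One group of 128 weights, as with group_size=128.
group = [0.02 * i - 0.5 for i in range(128)]
q, s, zp = quantize_group(group)
recon = dequantize_group(q, s, zp)
max_err = max(abs(a - b) for a, b in zip(group, recon))
assert max_err <= s   # error bounded by one quantization step
```

The asymmetric zero point lets the INT4 range cover skewed weight distributions (min and max mapped to 0 and 15) rather than forcing symmetry around zero.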
Architecture match with the BF16 source:
- `model_type=qwen3_5`
- 64 text layers (hybrid DeltaNet + softmax, `full_attention_interval=4`)
- `mtp_num_hidden_layers=1`
- `max_position_embeddings=262144`
- `hidden_size=5120`, `intermediate_size=17408`
- `vocab_size=248320`
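As a rough sanity check on storage cost: with `group_size=128`, each group carries one scale and one zero point on top of the 4-bit codes. The overhead figures below (one FP16 scale and one packed INT4 zero point per group) are assumptions based on common AutoAWQ layouts, not values read from this checkpoint:

```python
# Back-of-the-envelope storage cost per weight for W4 group-wise
# quantization with group_size=128. Scale/zero-point widths are
# assumptions (FP16 scale, packed INT4 zero point per group).
GROUP_SIZE = 128
WEIGHT_BITS = 4
SCALE_BITS = 16
ZERO_BITS = 4

bits_per_weight = WEIGHT_BITS + (SCALE_BITS + ZERO_BITS) / GROUP_SIZE
print(f"{bits_per_weight:.5f} bits/weight")   # 4.15625

# Applied to a single 5120 x 17408 projection
# (hidden_size x intermediate_size from the config above):
params = 5120 * 17408
mib = params * bits_per_weight / 8 / 2**20
print(f"{mib:.1f} MiB for one such projection")
```

So the group-wise metadata adds only about 4% on top of the raw 4-bit weights, which is why `group_size=128` is a common default.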
## Usage

### vLLM
```shell
pip install -U "vllm>=0.17.0" "transformers>=5.3.0"
```

(Quote the requirement specifiers so the shell does not treat `>=` as a redirection.)
Standard serving:
```shell
vllm serve mconcat/Qwopus3.5-27B-v3-AWQ-4bit \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.85 \
    --max-num-seqs 1 \
    --skip-mm-profiling \
    --reasoning-parser qwen3
```
With MTP speculative decoding:
```shell
vllm serve mconcat/Qwopus3.5-27B-v3-AWQ-4bit \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.85 \
    --max-num-seqs 1 \
    --skip-mm-profiling \
    --reasoning-parser qwen3 \
    --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
```
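As background on the speculative step: with `num_speculative_tokens=1`, the MTP head drafts one token ahead and the base model verifies it in the same forward pass. The toy sketch below shows greedy-style verification only; it is a simplification for illustration (vLLM's sampler uses a rejection-sampling scheme), not vLLM code:

```python
# Toy illustration of one-token speculative decoding with an MTP
# draft head, using greedy verification for simplicity. Not vLLM code.
def verify_draft(draft_token, base_logits):
    """Accept the draft token iff it matches the base model's argmax;
    otherwise emit the base model's own pick. Output quality is
    therefore unchanged either way."""
    base_token = max(range(len(base_logits)), key=base_logits.__getitem__)
    accepted = draft_token == base_token
    return base_token, accepted

# Draft agrees with the base model -> accepted, one decode step saved.
tok, ok = verify_draft(7, [0.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9])
assert tok == 7 and ok

# Draft disagrees -> discarded; the base model's token is emitted.
tok, ok = verify_draft(3, [0.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9])
assert tok == 7 and not ok
```

Because rejected drafts fall back to the base model's token, speculative decoding trades extra draft compute for latency without changing what the served model would have produced.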
### Transformers
This export is not intended for plain transformers inference. Use a runtime that understands AutoAWQ-format checkpoints, such as vLLM with AWQ support.
## Compatibility
| Framework | Supported | Notes |
|---|---|---|
| vLLM >= 0.17.0 | Yes | Intended serving path for this AutoAWQ export |
| transformers >= 5.3.0 | No | Plain transformers is not the intended inference path for AutoAWQ checkpoints |
| AutoAWQ-compatible runtimes | Expected | Export format is AutoAWQ-style quant_method=awq, version=gemm |
| SGLang | Unknown | Not verified |
## Notes
- This is an AutoAWQ-format export, not a compressed-tensors AWQ format.
- The output keeps `self_attn.o_proj` and DeltaNet `linear_attn.out_proj` in BF16 rather than 4-bit.
- MTP weights are included in the model shards (no separate `model.mtp.safetensors`).
- The model includes a vision encoder (loaded but unused for text-only inference). Use `--skip-mm-profiling` with vLLM to skip vision encoder profiling.
- Blackwell (SM120) note: if you encounter TMA-related crashes, apply the one-line vLLM patch to disable TMA on Blackwell: change the compute-capability check from `>= 9` to `9 <= x < 12` in `vllm/model_executor/layers/fla/ops/utils.py`.
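The intent of that patch is to gate the TMA fast path by compute-capability major version so that SM120 consumer Blackwell parts are excluded. A minimal sketch of the gating logic (illustrative; the function name and surrounding shape are assumptions, not the actual vLLM source):

```python
# Illustrative compute-capability gate for a TMA fast path, mirroring
# the intent of the ">= 9" -> "9 <= x < 12" patch described above.
# Not the actual vLLM source.
def tma_supported(cc_major: int) -> bool:
    """Enable TMA only for compute capability majors 9-11,
    excluding SM120 (major 12)."""
    return 9 <= cc_major < 12   # the unpatched check was `cc_major >= 9`

assert tma_supported(9)        # Hopper (SM90): TMA path enabled
assert not tma_supported(12)   # RTX PRO 6000 Blackwell (SM120): disabled
assert not tma_supported(8)    # Ampere: unchanged, no TMA either way
```

The unpatched `>= 9` check also matches major 12, which is what triggers the TMA crashes this workaround avoids.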