
GLM-4.7-Flash-REAP-23B-A3B β€” NVFP4 Quantization

NVFP4 (4-bit float, group_size=16) quantization of GLM-4.7-Flash-REAP-23B-A3B using NVIDIA Model Optimizer 0.41.0.

Calibrated on 512 samples (code, instruction, agentic, structured data). Model size: 13 GB (down from 46 GB BF16).


⚠️ Hardware requirement: SM12.0 (Blackwell) only

This model uses the Marlin NVFP4 GEMM kernel, which requires a Blackwell GPU:

  • Consumer: RTX 5080, RTX 5090, RTX 5070 Ti Super, ...
  • Data center: B100, B200, GB200

It will not run on Ampere (SM80), Ada (SM89), or Hopper (SM90).
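If you want to fail fast before loading the model, the Blackwell requirement can be gated on the compute capability pair (e.g. as returned by torch.cuda.get_device_capability()). A minimal sketch of that check, using the SM versions listed above:

```python
def is_supported_arch(major: int, minor: int) -> bool:
    """Return True only for SM12.0, the arch required by the
    Marlin NVFP4 GEMM kernel used by this quantization."""
    return (major, minor) == (12, 0)

# feed in torch.cuda.get_device_capability() at runtime
assert is_supported_arch(12, 0)       # RTX 5080 / 5090
assert not is_supported_arch(9, 0)    # Hopper (SM90)
assert not is_supported_arch(8, 9)    # Ada (SM89)
```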


Setup

1. Python environment

python3 -m venv venv
source venv/bin/activate

# PyTorch with CUDA 12.8 (required for SM12.0)
pip install torch --index-url https://download.pytorch.org/whl/cu128

# Build dependencies
pip install numpy setuptools_scm packaging ninja
pip install cmake          # adds cmake to venv/bin

2. Build vLLM from source (patched)

A patched vLLM is required. Three patches are needed on top of commit 628302114 (v0.16.1rc1.dev34):

  1. mla_attention.py: unconditional .weight access crashes on quantized layers that store weights under a different attribute, e.g. weight_packed (used by both NVFP4 Marlin and INT4 compressed-tensors). Guard added.
  2. glm4_moe_lite.py: loaded_params.add(name) called when name is None during shared-expert weight loading. Guard added.
  3. triton_mla.py: FP8 KV cache support for the TRITON_MLA backend. Upstream raises NotImplementedError for FP8 KV with Triton MLA. This patch adds pre-dequantization of the FP8 cache and FP8-quantized query back to BF16 before the Triton decode kernel, enabling FP8 KV cache on SM12.0 where TRITON_MLA is the only available backend. This doubles the effective KV cache capacity (FP8 = 1 byte vs BF16 = 2 bytes per element).
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout 628302114          # tested commit

# Apply the three patches (included in this repo as vllm_patches.diff)
git apply /path/to/vllm_patches.diff

# Build (set SM version for your GPU β€” 12.0 = RTX 5080/5090)
export PATH="$(pwd)/../venv/bin:$PATH"
export TORCH_CUDA_ARCH_LIST="12.0"
pip install -e . --no-build-isolation
cd ..

Build takes ~15 minutes. After building, verify: python -c "import vllm; print(vllm.__version__)".

3. Serve

Two configurations are provided: Standard (BF16 KV cache, no triton_mla patch needed) and Extended Context (FP8 KV cache, requires the triton_mla.py patch).

Option A: Standard β€” BF16 KV cache (4,928 tokens)

Only requires patches 1 and 2. This is simpler and allows multiple concurrent requests.

export TORCH_CUDA_ARCH_LIST="12.0"
export FLASHINFER_CUDA_ARCH_LIST="12.0"
export VLLM_TEST_FORCE_FP8_MARLIN=1       # enables Marlin NVFP4 GEMM
export VLLM_NVFP4_GEMM_BACKEND=marlin
export PYTORCH_ALLOC_CONF=expandable_segments:True

vllm serve ./GLM-4.7-Flash-REAP-23B-A3B-NVFP4 \
    --trust-remote-code \
    --dtype bfloat16 \
    --quantization modelopt \
    --kv-cache-dtype auto \
    --max-model-len 4928 \
    --no-enable-prefix-caching \
    --max-num-seqs 8 \
    --gpu-memory-utilization 0.965 \
    --enable-chunked-prefill \
    --max-num-batched-tokens 512 \
    --enforce-eager \
    --reasoning-parser deepseek_r1 \
    --override-generation-config '{"temperature": 0.0, "max_tokens": 3000}'

Option B: Extended Context β€” FP8 KV cache (11,728 tokens)

Requires all three patches. Trades concurrency (single sequence) for 2.38Γ— more context. The FP8 KV cache halves the per-token KV memory, and restricting to a single sequence dedicates the entire KV budget to one request.

export TORCH_CUDA_ARCH_LIST="12.0"
export FLASHINFER_CUDA_ARCH_LIST="12.0"
export VLLM_TEST_FORCE_FP8_MARLIN=1       # enables Marlin NVFP4 GEMM
export VLLM_NVFP4_GEMM_BACKEND=marlin
export PYTORCH_ALLOC_CONF=expandable_segments:True

vllm serve ./GLM-4.7-Flash-REAP-23B-A3B-NVFP4 \
    --trust-remote-code \
    --dtype bfloat16 \
    --quantization modelopt \
    --kv-cache-dtype fp8_e4m3 \
    --max-model-len 11728 \
    --no-enable-prefix-caching \
    --max-num-seqs 1 \
    --gpu-memory-utilization 0.97 \
    --enable-chunked-prefill \
    --max-num-batched-tokens 256 \
    --num-gpu-blocks-override 733 \
    --enforce-eager \
    --reasoning-parser deepseek_r1 \
    --override-generation-config '{"temperature": 0.0, "max_tokens": 11728}'

Why these flags (Option B):

| Flag | Reason |
|------|--------|
| --kv-cache-dtype fp8_e4m3 | FP8 KV cache: halves per-token KV memory, 2Γ— more tokens |
| --num-gpu-blocks-override 733 | Maximum blocks that fit without OOM on an RTX 5080 (16 GB). Binary-searched: 733 works, 734 OOMs during warmup. Your GPU may differ slightly; reduce if you hit OOM |
| --max-model-len 11728 | 733 blocks Γ— 16 tokens/block = 11,728 tokens |
| --max-num-seqs 1 | A single sequence gets the full KV budget |
| --max-num-batched-tokens 256 | Must be ≀ 256 to avoid OOM during vLLM's profiling pass. With 512, the profiling forward needs ~1.09 GiB, which exceeds free VRAM after model loading |
| --gpu-memory-utilization 0.97 | Maximizes VRAM available for the KV cache |
| --quantization modelopt | Loads the hf_quant_config.json format |
| --enforce-eager | Avoids CUDA graph compilation overhead |
| --reasoning-parser deepseek_r1 | The model emits <think>…</think> tokens (IDs 154841/154842); this separates reasoning from the response |
| temperature=0 | Mandatory. Any randomness causes the thinking to spiral into garbage on deterministic tasks |
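The relationship between --num-gpu-blocks-override and --max-model-len is plain block arithmetic (a KV-cache block size of 16 tokens is assumed here, matching the table above):

```python
BLOCK_SIZE_TOKENS = 16   # KV-cache block size assumed by the table above
num_blocks = 733         # from --num-gpu-blocks-override

max_model_len = num_blocks * BLOCK_SIZE_TOKENS
print(max_model_len)     # 11728 -> the value passed to --max-model-len
```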

4. Inference

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="./GLM-4.7-Flash-REAP-23B-A3B-NVFP4",
    messages=[{"role": "user", "content": "Your prompt here"}],
    temperature=0,
    max_tokens=3000,
)
print(response.choices[0].message.content)

Memory breakdown (RTX 5080, 16 GB)

Option A: BF16 KV cache

| Component | Size |
|-----------|------|
| Model weights (NVFP4) | 13.46 GiB |
| Inference activation peak | ~0.56 GiB |
| CUDA/PyTorch overhead | ~0.49 GiB |
| KV cache (BF16) | ~0.49 GiB β†’ 4,928 tokens |

Max concurrency: ~8 simultaneous requests at reduced context each.

Option B: FP8 KV cache

| Component | Size |
|-----------|------|
| Model weights (NVFP4) | 13.46 GiB |
| Inference activation peak | ~1.09 GiB |
| CUDA/PyTorch overhead | ~0.49 GiB |
| KV cache (FP8) | ~0.34 GiB β†’ 11,728 tokens |

Max concurrency: 1 (single sequence mode).
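As a rough sanity check, the Option B components should sum to just under the usable VRAM (treating the card as 16 GiB for this estimate, times --gpu-memory-utilization 0.97):

```python
# Figures from the Option B table above, in GiB
weights, activations, overhead, kv = 13.46, 1.09, 0.49, 0.34

total = weights + activations + overhead + kv   # ~15.38 GiB
usable = 16 * 0.97                              # ~15.52 GiB budgeted by vLLM

assert total <= usable   # fits, with little headroom to spare
```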


Attention backend

This model uses MLA (Multi-head Latent Attention) with qk_nope_head_dim=192. On SM12.0, only the TRITON_MLA backend is compatible:

  • FLASHINFER_MLA requires qk_nope_head_dim=128
  • CUTLASS_MLA requires SM10.x

vLLM selects TRITON_MLA automatically. Generation speed is ~33 tokens/sec on RTX 5080.
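The backend constraints above can be condensed into a small selection function. This is a simplification for illustration, not vLLM's actual dispatch code:

```python
def pick_mla_backend(sm_major: int, qk_nope_head_dim: int) -> str:
    # FLASHINFER_MLA only handles qk_nope_head_dim == 128
    if qk_nope_head_dim == 128:
        return "FLASHINFER_MLA"
    # CUTLASS_MLA requires SM10.x
    if sm_major == 10:
        return "CUTLASS_MLA"
    # everything else falls back to the Triton implementation
    return "TRITON_MLA"

print(pick_mla_backend(12, 192))   # TRITON_MLA -> what this model gets on SM12.0
```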


Thinking mode

The model generates internal reasoning inside <think>…</think> tokens before its final answer. With --reasoning-parser deepseek_r1, vLLM routes the reasoning to the reasoning_content field of the API response, and clients such as Open WebUI hide it by default.

Typical breakdown per request: ~1,800 thinking tokens + ~200 answer tokens.
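If you hit the server without the reasoning parser (or post-process raw logged completions), the thinking block can be split off client-side. A minimal sketch; this is not what vLLM's deepseek_r1 parser does internally:

```python
def split_thinking(text: str) -> tuple[str, str]:
    """Split '<think>reasoning</think>answer' into (reasoning, answer)."""
    start, end = "<think>", "</think>"
    if start in text and end in text:
        reasoning = text.split(start, 1)[1].split(end, 1)[0]
        answer = text.split(end, 1)[1]
        return reasoning.strip(), answer.strip()
    # no thinking block: everything is the answer
    return "", text.strip()

r, a = split_thinking("<think>2 + 2 = 4</think>The answer is 4.")
print(a)   # The answer is 4.
```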


Patches explained

The file vllm_patches.diff contains three patches against vLLM commit 628302114. Apply with git apply vllm_patches.diff.

Patch 1: mla_attention.py β€” guard .weight access

Problem: self.kv_b_proj.weight.dtype crashes when the layer uses quantized storage (NVFP4 stores weight_packed as int32, not weight as float).

Fix: Use getattr(self.kv_b_proj, "weight", None) and only cast when the weight is stored as a float dtype (BF16/FP16/FP8).
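The guard amounts to roughly the following. A sketch with stand-in layer objects, not the actual vLLM diff:

```python
class MarlinLinear:
    """Stand-in for an NVFP4 Marlin layer: weights live in
    weight_packed (int32 data), there is no .weight at all."""
    weight_packed = object()

class PlainLinear:
    """Stand-in for an unquantized layer with a float .weight."""
    class _W:
        dtype = "bfloat16"
    weight = _W()

def kv_b_proj_weight_dtype(layer):
    # Patched logic: only touch .weight when it actually exists,
    # instead of unconditionally reading layer.weight.dtype
    w = getattr(layer, "weight", None)
    return None if w is None else w.dtype

assert kv_b_proj_weight_dtype(MarlinLinear()) is None        # no crash
assert kv_b_proj_weight_dtype(PlainLinear()) == "bfloat16"   # cast path
```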

Patch 2: glm4_moe_lite.py β€” guard name is None

Problem: loaded_params.add(name) is called when name is None during shared-expert weight loading, causing a crash.

Fix: Add and name is not None guard.

Patch 3: triton_mla.py β€” FP8 KV cache for TRITON_MLA

Problem: The TRITON_MLA backend raises NotImplementedError when FP8 KV cache is requested. On SM12.0, TRITON_MLA is the only available MLA backend, so FP8 KV cache is completely blocked.

Fix: Three changes:

  1. Add "fp8" and "fp8_e4m3" to supported_kv_cache_dtypes
  2. Remove the NotImplementedError guards in __init__ and forward_mqa
  3. Add pre-dequantization before the Triton decode kernel:
    • KV cache: FP8 β†’ float32 (Γ— k_scale) β†’ BF16
    • Query: FP8 β†’ float32 (Γ— q_scale) β†’ BF16 (the caller may have quantized the query to FP8 via _decode_concat_quant_fp8_op)

The float32 intermediary is required because PyTorch 2.10.0 does not support direct .to() conversion or arithmetic on Float8_e4m3fn tensors.
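The widen-then-narrow order (FP8 β†’ float32 Γ— scale β†’ BF16) can be illustrated without torch: BF16 is the top 16 bits of a float32, so the narrowing step is just a bit-mask. A numeric sketch of the dequantization order, not the actual kernel-side code:

```python
import struct

def f32_to_bf16(x: float) -> float:
    """Truncate a float32 value to BF16 precision (keep top 16 bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def dequantize(fp8_value: float, scale: float) -> float:
    # FP8 -> float32 (x scale) -> BF16, mirroring the patch's order
    widened = float(fp8_value) * scale   # the float32 intermediary
    return f32_to_bf16(widened)

print(dequantize(0.5, 2.0))   # 1.0 (exactly representable in BF16)
```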


Quantization details

  • Tool: nvidia-modelopt 0.41.0
  • Format: NVFP4, group_size=16, AWQ-lite calibration
  • Calibration: 512 samples (code, instruction, agentic SWE trajectories, structured data)
  • Excluded: lm_head (kept BF16)
  • TensorQuantizers: 21,144
  • MoE handling: Glm4MoeLiteNaiveMoe experts (packed as 3D tensors) were temporarily replaced with ExpandedNaiveMoe during calibration to expose individual nn.Linear layers to the quantizer, then saved in standard format.

Tested versions

| Package | Version |
|---------|---------|
| vllm | 0.16.1rc1.dev34+g628302114 (patched) |
| torch | 2.10.0+cu128 |
| transformers | 5.2.0 |
| flashinfer | 0.6.4 |
| nvidia-modelopt | 0.41.0 |
| CUDA | 12.8 |
| Driver | 590.48.01 |