# GLM-4.7-Flash-REAP-23B-A3B – NVFP4 Quantization
NVFP4 (4-bit float, group_size=16) quantization of GLM-4.7-Flash-REAP-23B-A3B using NVIDIA Model Optimizer 0.41.0.
Calibrated on 512 samples (code, instruction, agentic, structured data). Model size: 13 GB (down from 46 GB BF16).
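As a rough sanity check on the size reduction: NVFP4 with group_size=16 stores 4 bits per weight plus one shared FP8 scale per 16-weight group, so the ideal compression vs BF16 is about 3.6×. This back-of-the-envelope sketch ignores the BF16-kept layers (e.g. `lm_head`) and metadata, which is why the observed ratio is slightly lower:

```python
# Back-of-the-envelope NVFP4 vs BF16 size ratio (ignores unquantized layers)
bits_bf16 = 16
bits_nvfp4 = 4 + 8 / 16           # 4-bit value + one FP8 scale per 16-weight group
ideal_ratio = bits_bf16 / bits_nvfp4
observed_ratio = 46 / 13          # 46 GB BF16 -> 13 GB NVFP4 for this repo
print(round(ideal_ratio, 2), round(observed_ratio, 2))  # 3.56 3.54
```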
## ⚠️ Hardware requirement: SM12.0 (Blackwell) only
This model uses the Marlin NVFP4 GEMM kernel, which requires a Blackwell GPU:
- Consumer: RTX 5080, RTX 5090, RTX 5070 Ti Super, ...
- Data center: B100, B200, GB200
It will not run on Ampere (SM80), Ada (SM89), or Hopper (SM90).
## Setup
### 1. Python environment
```shell
python3 -m venv venv
source venv/bin/activate

# PyTorch with CUDA 12.8 (required for SM12.0)
pip install torch --index-url https://download.pytorch.org/whl/cu128

# Build dependencies
pip install numpy setuptools_scm packaging ninja
pip install cmake  # adds cmake to venv/bin
```
### 2. Build vLLM from source (patched)
A patched vLLM is required. Three patches are needed on top of commit `628302114` (v0.16.1rc1.dev34):

- `mla_attention.py`: an unconditional `.weight` access crashes on quantized layers that store their weights as `weight_packed` (NVFP4 Marlin or INT4 compressed-tensors). Guard added.
- `glm4_moe_lite.py`: `loaded_params.add(name)` is called when `name is None` during shared-expert weight loading. Guard added.
- `triton_mla.py`: FP8 KV cache support for the TRITON_MLA backend. Upstream raises `NotImplementedError` for FP8 KV with Triton MLA. This patch adds pre-dequantization of the FP8 cache and the FP8-quantized query back to BF16 before the Triton decode kernel, enabling FP8 KV cache on SM12.0, where TRITON_MLA is the only available backend. This doubles the effective KV cache capacity (FP8 = 1 byte vs BF16 = 2 bytes per element).
```shell
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout 628302114  # tested commit

# Apply the three patches (included in this repo as vllm_patches.diff)
git apply /path/to/vllm_patches.diff

# Build (set the SM version for your GPU; 12.0 = RTX 5080/5090)
export PATH="$(pwd)/../venv/bin:$PATH"
export TORCH_CUDA_ARCH_LIST="12.0"
pip install -e . --no-build-isolation
cd ..
```
The build takes ~15 minutes. Afterwards, verify: `python -c "import vllm; print(vllm.__version__)"`.
### 3. Serve
Two configurations are provided: Standard (BF16 KV cache, no `triton_mla.py` patch needed) and Extended Context (FP8 KV cache, requires the `triton_mla.py` patch).
#### Option A: Standard – BF16 KV cache (4,928 tokens)
Only requires patches 1 and 2. This is simpler and allows multiple concurrent requests.
```shell
export TORCH_CUDA_ARCH_LIST="12.0"
export FLASHINFER_CUDA_ARCH_LIST="12.0"
export VLLM_TEST_FORCE_FP8_MARLIN=1  # enables Marlin NVFP4 GEMM
export VLLM_NVFP4_GEMM_BACKEND=marlin
export PYTORCH_ALLOC_CONF=expandable_segments:True

vllm serve ./GLM-4.7-Flash-REAP-23B-A3B-NVFP4 \
  --trust-remote-code \
  --dtype bfloat16 \
  --quantization modelopt \
  --kv-cache-dtype auto \
  --max-model-len 4928 \
  --no-enable-prefix-caching \
  --max-num-seqs 8 \
  --gpu-memory-utilization 0.965 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 512 \
  --enforce-eager \
  --reasoning-parser deepseek_r1 \
  --override-generation-config '{"temperature": 0.0, "max_tokens": 3000}'
```
#### Option B: Extended Context – FP8 KV cache (11,728 tokens)
Requires all three patches. Trades concurrency (single sequence) for 2.38× more context. The FP8 KV cache halves the per-token KV memory, and restricting to a single sequence dedicates the entire KV budget to one request.
```shell
export TORCH_CUDA_ARCH_LIST="12.0"
export FLASHINFER_CUDA_ARCH_LIST="12.0"
export VLLM_TEST_FORCE_FP8_MARLIN=1  # enables Marlin NVFP4 GEMM
export VLLM_NVFP4_GEMM_BACKEND=marlin
export PYTORCH_ALLOC_CONF=expandable_segments:True

vllm serve ./GLM-4.7-Flash-REAP-23B-A3B-NVFP4 \
  --trust-remote-code \
  --dtype bfloat16 \
  --quantization modelopt \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 11728 \
  --no-enable-prefix-caching \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.97 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 256 \
  --num-gpu-blocks-override 733 \
  --enforce-eager \
  --reasoning-parser deepseek_r1 \
  --override-generation-config '{"temperature": 0.0, "max_tokens": 11728}'
```
**Why these flags (Option B):**

| Flag | Reason |
|---|---|
| `--kv-cache-dtype fp8_e4m3` | FP8 KV cache halves per-token KV memory, giving 2× more tokens |
| `--num-gpu-blocks-override 733` | Maximum blocks that fit without OOM on an RTX 5080 (16 GB). Binary-searched: 733 works, 734 OOMs during warmup. Your GPU may differ slightly; reduce if you hit OOM |
| `--max-model-len 11728` | 733 blocks × 16 tokens/block = 11,728 tokens |
| `--max-num-seqs 1` | A single sequence gets the full KV budget |
| `--max-num-batched-tokens 256` | Must be ≤ 256 to avoid OOM during vLLM's profiling pass. With 512, the profiling forward requires ~1.09 GiB, which exceeds free VRAM after model loading |
| `--gpu-memory-utilization 0.97` | Maximizes VRAM available for the KV cache |
| `--quantization modelopt` | Loads the `hf_quant_config.json` format |
| `--enforce-eager` | Avoids CUDA graph compilation overhead |
| `--reasoning-parser deepseek_r1` | The model uses `<think>…</think>` tokens (IDs 154841/154842); separates reasoning from the response |
| `temperature=0` | Mandatory. Any randomness causes thinking to spiral into garbage on deterministic tasks |
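The relation between the block override and the context length can be checked directly (16 tokens per block is vLLM's default paged-attention block size):

```python
# KV cache capacity implied by the paged-attention block count
num_blocks = 733   # --num-gpu-blocks-override
block_size = 16    # vLLM default tokens per block
max_model_len = num_blocks * block_size
print(max_model_len)  # 11728, matching --max-model-len 11728
```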
### 4. Inference
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="./GLM-4.7-Flash-REAP-23B-A3B-NVFP4",
    messages=[{"role": "user", "content": "Your prompt here"}],
    temperature=0,
    max_tokens=3000,
)
print(response.choices[0].message.content)
```
## Memory breakdown (RTX 5080, 16 GB)
### Option A: BF16 KV cache
| Component | Size |
|---|---|
| Model weights (NVFP4) | 13.46 GiB |
| Inference activation peak | ~0.56 GiB |
| CUDA/PyTorch overhead | ~0.49 GiB |
| KV cache (BF16) | ~0.49 GiB → 4,928 tokens |
Max concurrency: ~8 simultaneous requests at reduced context each.
### Option B: FP8 KV cache
| Component | Size |
|---|---|
| Model weights (NVFP4) | 13.46 GiB |
| Inference activation peak | ~1.09 GiB |
| CUDA/PyTorch overhead | ~0.49 GiB |
| KV cache (FP8) | ~0.34 GiB → 11,728 tokens |
Max concurrency: 1 (single sequence mode).
## Attention backend
This model uses MLA (Multi-head Latent Attention) with `qk_nope_head_dim=192`.
On SM12.0, only the TRITON_MLA backend is compatible:

- FLASHINFER_MLA requires `qk_nope_head_dim=128`
- CUTLASS_MLA requires SM10.x

vLLM selects TRITON_MLA automatically. Generation speed is ~33 tokens/sec on an RTX 5080.
## Thinking mode
The model generates internal reasoning inside `<think>…</think>` tokens before
its final answer. With `--reasoning-parser deepseek_r1`, vLLM routes this to
`reasoning_content` in the API response, and it is hidden in Open WebUI.
Typical breakdown per request: ~1,800 thinking tokens + ~200 answer tokens.
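For illustration, the split that the reasoning parser performs server-side can be mimicked client-side on raw text. This is a simplified sketch; the real parser operates on the token IDs above, not on strings:

```python
import re

def split_reasoning(raw: str) -> tuple[str, str]:
    """Split raw model output into (reasoning, answer), mimicking what the
    deepseek_r1 parser exposes as reasoning_content and content."""
    m = re.match(r"\s*<think>(.*?)</think>(.*)", raw, re.DOTALL)
    if m:
        return m.group(1).strip(), m.group(2).strip()
    return "", raw.strip()

reasoning, answer = split_reasoning("<think>2+2 is 4</think>The answer is 4.")
print(reasoning)  # 2+2 is 4
print(answer)     # The answer is 4.
```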
## Patches explained
The file `vllm_patches.diff` contains three patches against vLLM commit `628302114`.
Apply with `git apply vllm_patches.diff`.
### Patch 1: `mla_attention.py` – guard `.weight` access

**Problem:** `self.kv_b_proj.weight.dtype` crashes when the layer uses quantized
storage (NVFP4 stores `weight_packed` as int32, not `weight` as float).

**Fix:** Use `getattr(self.kv_b_proj, "weight", None)` and only cast when the
weight is stored as a float dtype (BF16/FP16/FP8).
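The guard pattern can be illustrated with dummy layer classes (the class names and string dtypes here are illustrative, not the actual vLLM code):

```python
class MarlinNVFP4Linear:
    """Quantized layer: packed int32 weights, no float .weight attribute."""
    def __init__(self):
        self.weight_packed = [0x0F0F0F0F]

class BF16Linear:
    """Unquantized layer with a normal .weight."""
    class _Weight:
        dtype = "bfloat16"
    weight = _Weight()

def weight_dtype_or_none(layer):
    # Patched behavior: probe with getattr instead of layer.weight.dtype,
    # which raises AttributeError on quantized layers.
    w = getattr(layer, "weight", None)
    return None if w is None else w.dtype

print(weight_dtype_or_none(MarlinNVFP4Linear()))  # None -> skip the cast
print(weight_dtype_or_none(BF16Linear()))         # bfloat16 -> safe to cast
```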
### Patch 2: `glm4_moe_lite.py` – guard `name is None`

**Problem:** `loaded_params.add(name)` is called when `name` is `None` during
shared-expert weight loading, causing a crash.

**Fix:** Add an `and name is not None` guard.
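A minimal sketch of the patched bookkeeping (the parameter name below is illustrative):

```python
loaded_params: set = set()

def record_loaded(name):
    # Patched: only record real parameter names. Shared-expert loading can
    # pass name=None, which previously broke downstream bookkeeping.
    if name is not None:
        loaded_params.add(name)

record_loaded("model.layers.0.mlp.shared_experts.gate_proj")  # hypothetical name
record_loaded(None)  # silently skipped
print(len(loaded_params))  # 1
```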
### Patch 3: `triton_mla.py` – FP8 KV cache for TRITON_MLA

**Problem:** The TRITON_MLA backend raises `NotImplementedError` when an FP8 KV
cache is requested. On SM12.0, TRITON_MLA is the only available MLA backend,
so the FP8 KV cache is completely blocked.

**Fix:** Three changes:

- Add `"fp8"` and `"fp8_e4m3"` to `supported_kv_cache_dtypes`
- Remove the `NotImplementedError` guards in `__init__` and `forward_mqa`
- Add pre-dequantization before the Triton decode kernel:
  - KV cache: FP8 → float32 (× `k_scale`) → BF16
  - Query: FP8 → float32 (× `q_scale`) → BF16 (the caller may have quantized the query to FP8 via `_decode_concat_quant_fp8_op`)

The float32 intermediary is required because PyTorch 2.10.0 does not support
direct `.to()` conversion or arithmetic on `Float8_e4m3fn` tensors.
## Quantization details

- Tool: nvidia-modelopt 0.41.0
- Format: NVFP4, group_size=16, AWQ-lite calibration
- Calibration: 512 samples (code, instruction, agentic SWE trajectories, structured data)
- Excluded: `lm_head` (kept in BF16)
- TensorQuantizers: 21,144
- MoE handling: `Glm4MoeLiteNaiveMoe` experts (packed as 3D tensors) were temporarily replaced with `ExpandedNaiveMoe` during calibration to expose individual `nn.Linear` layers to the quantizer, then saved in the standard format.
## Tested versions
| Package | Version |
|---|---|
| vllm | 0.16.1rc1.dev34+g628302114 (patched) |
| torch | 2.10.0+cu128 |
| transformers | 5.2.0 |
| flashinfer | 0.6.4 |
| nvidia-modelopt | 0.41.0 |
| CUDA | 12.8 |
| Driver | 590.48.01 |