# Gemma 4 E4B DECKARD HERETIC Uncensored NVFP4

EAGLE speculative decoding drafter for Gemma 4 31B DECKARD HERETIC Uncensored NVFP4.

A 42-layer E4B (EAGLE for Blackwell) model quantized to NVFP4 AWQ using NVIDIA ModelOpt 0.42.0. Designed for EAGLE-based speculative decoding on NVIDIA DGX Spark (GB10, SM 12.1) and other Blackwell GPUs.

## Model Details

| Property | Value |
|---|---|
| Architecture | Gemma 4 (E4B EAGLE drafter) |
| Target Model | `AEON-7/Gemma-4-31B-it-DECKARD-HERETIC-Uncensored-NVFP4` |
| Layers | 42 (35 sliding-window + 7 full-attention) |
| Hidden Size | 2560 |
| Attention Heads | 8 (2 KV heads), `head_dim=256`, `global_head_dim=512` |
| Sliding Window | 512 tokens |
| Max Context | 131,072 tokens |
| Quantization | NVFP4 AWQ (ModelOpt 0.42.0) |
| Model Size | 9.6 GB |
| Vocabulary | 262,144 tokens |

## Performance (DGX Spark)

Benchmarked on NVIDIA DGX Spark (GB10, SM 12.1, 128 GB unified memory) with the 31B DECKARD AWQ_FULL target plus this E4B drafter, using 5 speculative tokens and a 300-token cap per request.

| Concurrency | Aggregate tok/s | Per-Request tok/s | Avg Latency (300 tok) |
|---|---|---|---|
| 1 | 7.6 | 8.9 | 39.4 s |
| 2 | 21.7 | 10.8 | 27.7 s |
| 4 | 42.7 | 10.7 | 28.1 s |

Zero errors across all test runs. Aggregate throughput scales roughly linearly from one to four concurrent requests.
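The benchmark above fixes `num_speculative_tokens` at 5. How that setting trades off against draft quality can be sketched with the standard speculative-decoding expectation (Leviathan et al.): with per-token acceptance rate α and k draft tokens, each target forward pass yields (1 − α^(k+1)) / (1 − α) tokens on average. The acceptance rates below are hypothetical illustrations, not values measured from this setup.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model forward pass when the
    drafter proposes k tokens with per-token acceptance rate alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

# Hypothetical acceptance rates, k=5 as configured above:
for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha}: {expected_tokens_per_step(alpha, 5):.2f} tokens/step")
```

For example, at a hypothetical α = 0.7 the target model would emit about 2.94 tokens per forward pass instead of 1, which is the mechanism behind the per-request gains in the table.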

## Quick Start

### 1. Download both models

```bash
pip install -U huggingface-hub

# Target model (31B)
huggingface-cli download AEON-7/Gemma-4-31B-it-DECKARD-HERETIC-Uncensored-NVFP4 \
  --local-dir ~/models/deckard-31b

# This drafter model (E4B)
huggingface-cli download AEON-7/Gemma-4-E4B-DECKARD-HERETIC-Uncensored-NVFP4 \
  --local-dir ~/models/e4b-drafter
```

### 2. Get the patched vLLM files

Three patches are required for Gemma 4 speculative decoding. Download them from the GitHub repo:

```bash
for f in eagle_patched.py serving_chat_patched.py modelopt_patched.py; do
  curl -LO https://raw.githubusercontent.com/AEON-7/Gemma-4-31B-DECKARD-HERETIC-Uncensored-NVFP4/main/$f
done
```

### 3. Launch with Docker Compose

Save this as `docker-compose.yml` next to the patched files from step 2:

```yaml
services:
  vllm:
    image: ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4-awq:latest
    container_name: vllm-deckard-31b-spec
    restart: unless-stopped
    network_mode: host
    volumes:
      - ~/models/deckard-31b:/models/deckard
      - ~/models/e4b-drafter:/models/e4b-drafter
      - ./modelopt_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/modelopt.py
      - ./serving_chat_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/chat_completion/serving.py
      - ./eagle_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/v1/spec_decode/eagle.py
    environment:
      - VLLM_TEST_FORCE_FP8_MARLIN=1
      - VLLM_MARLIN_USE_ATOMIC_ADD=1
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - TORCH_MATMUL_PRECISION=high
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    command:
      - bash
      - -c
      - |
        exec vllm serve /models/deckard \
          --served-model-name deckard-31b \
          --quantization modelopt \
          --dtype auto \
          --kv-cache-dtype fp8 \
          --tensor-parallel-size 1 \
          --max-model-len 131072 \
          --max-num-seqs 4 \
          --gpu-memory-utilization 0.85 \
          --trust-remote-code \
          --host 0.0.0.0 --port 8000 \
          --enable-chunked-prefill \
          --enable-prefix-caching \
          --enable-auto-tool-choice \
          --tool-call-parser gemma4 \
          --reasoning-parser gemma4 \
          --speculative-config '{"method":"draft_model","model":"/models/e4b-drafter","num_speculative_tokens":5,"quantization":"modelopt"}'
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```
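If you launch `vllm serve` from a script instead of Compose, the `--speculative-config` JSON is easier to keep well-formed when generated from a dict (keys as in the compose file above):

```python
import json

# Same speculative-decoding settings as the compose file's
# --speculative-config argument, built programmatically.
spec_config = {
    "method": "draft_model",
    "model": "/models/e4b-drafter",
    "num_speculative_tokens": 5,
    "quantization": "modelopt",
}

# Pass this string as the value of --speculative-config.
spec_config_arg = json.dumps(spec_config)
print(spec_config_arg)
```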

### 4. Test

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deckard-31b",
    "messages": [{"role": "user", "content": "Explain quantum entanglement."}],
    "max_tokens": 200
  }'
```
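The same request can be made from Python with only the standard library; this is a sketch of a minimal client for the OpenAI-compatible endpoint above (the model name and port come from the compose file):

```python
import json
import urllib.request

def build_chat_request(prompt: str, max_tokens: int = 200) -> dict:
    # Mirrors the curl example; "deckard-31b" is the --served-model-name.
    return {
        "model": "deckard-31b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# With the server running:
# print(chat("Explain quantum entanglement."))
```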

## Required vLLM Patches

Three patches to vLLM 0.19.1 are required for speculative decoding with Gemma 4. All are available in the target model GitHub repo.

| Patch | What it fixes |
|---|---|
| `eagle_patched.py` | Removes the multimodal spec-decode guard, adds a Gemma 4 model whitelist, supports multi-group KV cache (heterogeneous `head_dim` 256/512) |
| `serving_chat_patched.py` | Fixes the non-streaming reasoning parser: `<\|channel>` tokens were stripped by `skip_special_tokens=True` |
| `modelopt_patched.py` | Adds `NVFP4_AWQ` `quant_algo` support, AWQ `pre_quant_scale` handling, FP8 NaN scrubbing |

## Heterogeneous Attention

This E4B drafter mirrors the Gemma 4 heterogeneous attention design:

- **35 sliding-window layers**: `head_dim=256`, 512-token window, default RoPE (`theta=10000`)
- **7 full-attention layers**: `head_dim=512`, global attention, proportional RoPE (`theta=1M`, `partial_rotary_factor=0.25`)

This creates two distinct KV cache groups, handled by the `eagle_patched.py` multi-group KV cache fix.
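A back-of-envelope sketch of why the two groups matter for memory, using the figures above (2 KV heads, fp8 KV cache as configured in the compose file, so 1 byte per element). This ignores vLLM's paged-KV alignment overhead:

```python
NUM_KV_HEADS = 2
FP8_BYTES = 1  # --kv-cache-dtype fp8

def kv_bytes_per_token(num_layers: int, head_dim: int) -> int:
    # K and V (hence * 2) each hold num_kv_heads * head_dim elements per layer.
    return num_layers * NUM_KV_HEADS * head_dim * 2 * FP8_BYTES

sliding = kv_bytes_per_token(35, 256)  # sliding-window group
full = kv_bytes_per_token(7, 512)      # full-attention group

# Sliding-window layers retain at most 512 tokens, so that group's
# cache is bounded; the full-attention group grows with context,
# up to the 131,072-token maximum.
sliding_total = 512 * sliding
full_total = 131_072 * full
print(sliding, full, sliding_total, full_total)
```

Under these assumptions the sliding-window group stays around 18 MB per sequence regardless of context length, while the full-attention group dominates at long contexts, which is why the drafter's heterogeneous layout needs the multi-group KV cache handling.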

## Related Models

| Model | Type | Size | Link |
|---|---|---|---|
| Gemma 4 31B DECKARD AWQ_FULL (target) | Dense NVFP4 | 20.5 GB | HuggingFace \| GitHub |
| Gemma 4 31B DECKARD SVDQuant | Dense NVFP4 | 20.9 GB | HuggingFace |
| SuperGemma4 26B MoE | MoE NVFP4 | 15.3 GB | HuggingFace |
| vLLM AWQ Container | Docker | | GHCR |

## License

This model inherits the Gemma license from Google.
