# Gemma 4 E4B DECKARD HERETIC Uncensored NVFP4

EAGLE speculative decoding drafter for Gemma 4 31B DECKARD HERETIC Uncensored NVFP4.

A 42-layer E4B (EAGLE for Blackwell) model quantized to NVFP4 AWQ with NVIDIA ModelOpt 0.42.0, designed for EAGLE-based speculative decoding on NVIDIA DGX Spark (GB10, SM 12.1) and other Blackwell GPUs.
## Model Details
| Property | Value |
|---|---|
| Architecture | Gemma 4 (E4B EAGLE Drafter) |
| Target Model | AEON-7/Gemma-4-31B-it-DECKARD-HERETIC-Uncensored-NVFP4 |
| Layers | 42 (35 sliding-window + 7 full-attention) |
| Hidden Size | 2560 |
| Attention Heads | 8 (2 KV heads), head_dim=256, global_head_dim=512 |
| Sliding Window | 512 tokens |
| Max Context | 131,072 tokens |
| Quantization | NVFP4 AWQ (ModelOpt 0.42.0) |
| Model Size | 9.6 GB |
| Vocabulary | 262,144 tokens |
## Performance (DGX Spark)

Benchmarked on NVIDIA DGX Spark (GB10, SM 12.1, 128 GB unified memory) with the 31B DECKARD AWQ_FULL target plus this E4B drafter, using 5 speculative tokens and 300 max tokens per request.
| Concurrent | Aggregate tok/s | Per-Request tok/s | Avg Latency (300 tok) |
|---|---|---|---|
| 1 | 7.6 | 8.9 | 39.4s |
| 2 | 21.7 | 10.8 | 27.7s |
| 4 | 42.7 | 10.7 | 28.1s |
Zero errors across all test runs. Aggregate throughput scales roughly linearly with concurrency (21.7 → 42.7 tok/s going from 2 to 4 concurrent requests).
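The table above can be reproduced with a small load-generation script. A minimal stdlib sketch, assuming the server from the Quick Start below is listening on `localhost:8000` with served model name `deckard-31b`; the payload shape and `usage.completion_tokens` field follow the standard OpenAI-compatible API, not code from this repo:

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/chat/completions"

def one_request(prompt: str, max_tokens: int = 300) -> int:
    """Send one chat completion and return the completion tokens generated."""
    body = json.dumps({
        "model": "deckard-31b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["usage"]["completion_tokens"]

def aggregate_tok_s(total_tokens: int, elapsed_s: float) -> float:
    """Aggregate throughput summed across all concurrent requests."""
    return total_tokens / elapsed_s

def benchmark(concurrency: int) -> float:
    """Fire `concurrency` identical requests at once; report aggregate tok/s."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        tokens = sum(pool.map(one_request,
                              ["Explain quantum entanglement."] * concurrency))
    return aggregate_tok_s(tokens, time.monotonic() - start)

if __name__ == "__main__":
    for c in (1, 2, 4):
        print(f"{c} concurrent: {benchmark(c):.1f} tok/s aggregate")
```

Per-request tok/s is this aggregate divided by the concurrency level, which is how the 10.8 and 10.7 figures above relate to the 21.7 and 42.7 aggregates.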
## Quick Start

### 1. Download both models

```bash
pip install -U huggingface-hub

# Target model (31B)
huggingface-cli download AEON-7/Gemma-4-31B-it-DECKARD-HERETIC-Uncensored-NVFP4 \
  --local-dir ~/models/deckard-31b

# This drafter model (E4B)
huggingface-cli download AEON-7/Gemma-4-E4B-DECKARD-HERETIC-Uncensored-NVFP4 \
  --local-dir ~/models/e4b-drafter
```
### 2. Get the patched vLLM files

Three patches are required for Gemma 4 speculative decoding. Download them from the GitHub repo:

```bash
for f in eagle_patched.py serving_chat_patched.py modelopt_patched.py; do
  curl -LO https://raw.githubusercontent.com/AEON-7/Gemma-4-31B-DECKARD-HERETIC-Uncensored-NVFP4/main/$f
done
```
### 3. Launch with Docker Compose

```yaml
services:
  vllm:
    image: ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4-awq:latest
    container_name: vllm-deckard-31b-spec
    restart: unless-stopped
    network_mode: host
    volumes:
      - ~/models/deckard-31b:/models/deckard
      - ~/models/e4b-drafter:/models/e4b-drafter
      - ./modelopt_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/modelopt.py
      - ./serving_chat_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/chat_completion/serving.py
      - ./eagle_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/v1/spec_decode/eagle.py
    environment:
      - VLLM_TEST_FORCE_FP8_MARLIN=1
      - VLLM_MARLIN_USE_ATOMIC_ADD=1
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - TORCH_MATMUL_PRECISION=high
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    command:
      - bash
      - -c
      - |
        exec vllm serve /models/deckard \
          --served-model-name deckard-31b \
          --quantization modelopt \
          --dtype auto \
          --kv-cache-dtype fp8 \
          --tensor-parallel-size 1 \
          --max-model-len 131072 \
          --max-num-seqs 4 \
          --gpu-memory-utilization 0.85 \
          --trust-remote-code \
          --host 0.0.0.0 --port 8000 \
          --enable-chunked-prefill \
          --enable-prefix-caching \
          --enable-auto-tool-choice \
          --tool-call-parser gemma4 \
          --reasoning-parser gemma4 \
          --speculative-config '{"method":"draft_model","model":"/models/e4b-drafter","num_speculative_tokens":5,"quantization":"modelopt"}'
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```
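On the choice of `num_speculative_tokens: 5`: each target forward pass verifies up to 5 drafted tokens and always produces at least one token of its own. Under the simplifying assumption of an independent per-token acceptance probability `a` (an idealization from the speculative decoding literature; real acceptance is position- and content-dependent), the expected tokens emitted per target step is a truncated geometric series:

```python
def expected_tokens_per_step(k: int, a: float) -> float:
    """Expected tokens emitted per target-model forward pass with k drafted
    tokens and i.i.d. per-token acceptance probability a (idealized model).

    A step emits 1 + (number of leading accepted drafts), so the expectation
    is sum_{i=0}^{k} a**i = (1 - a**(k+1)) / (1 - a).
    """
    if a >= 1.0:
        return float(k + 1)  # every draft accepted: k drafts + 1 target token
    return (1.0 - a ** (k + 1)) / (1.0 - a)
```

With `k = 5`, an acceptance rate around 0.6 would yield roughly 2.4 tokens per target pass; perfect acceptance caps the speedup at 6 tokens per pass.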
### 4. Test

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deckard-31b",
    "messages": [{"role": "user", "content": "Explain quantum entanglement."}],
    "max_tokens": 200
  }'
```
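The same request from Python, for scripted use. A minimal stdlib sketch assuming the Compose service above is up; the response is parsed per the standard OpenAI chat completions schema:

```python
import json
import urllib.request

def build_chat_request(prompt: str, max_tokens: int = 200) -> bytes:
    """Build the same JSON body as the curl example above."""
    return json.dumps({
        "model": "deckard-31b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()

def chat(prompt: str,
         url: str = "http://localhost:8000/v1/chat/completions") -> str:
    """POST a chat completion and return the assistant's reply text."""
    req = urllib.request.Request(
        url, data=build_chat_request(prompt),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Explain quantum entanglement."))
```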
## Required vLLM Patches

Three patches to vLLM 0.19.1 are required for speculative decoding with Gemma 4. All are available in the target model's GitHub repo.

| Patch | What it fixes |
|---|---|
| `eagle_patched.py` | Removes the multimodal spec-decode guard, adds a Gemma 4 model whitelist, and supports a multi-group KV cache (heterogeneous head_dim=256/512) |
| `serving_chat_patched.py` | Fixes the non-streaming reasoning parser, whose `<|channel>` tokens were stripped by `skip_special_tokens=True` |
| `modelopt_patched.py` | Adds NVFP4_AWQ quant_algo support, AWQ pre_quant_scale handling, and FP8 NaN scrubbing |
## Heterogeneous Attention

This E4B drafter mirrors the Gemma 4 heterogeneous attention design:

- 35 sliding-window layers: head_dim=256, 512-token window, default RoPE (theta=10000)
- 7 full-attention layers: head_dim=512, global attention, proportional RoPE (theta=1M, partial_rotary_factor=0.25)

This creates two distinct KV cache groups, handled by the multi-group KV cache fix in `eagle_patched.py`.
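The two KV cache groups have very different footprints. A back-of-envelope sketch, assuming an fp8 KV cache (1 byte per element, matching `--kv-cache-dtype fp8` in the Compose file) and that the sliding-window group retains only its 512-token window per sequence; vLLM's paged allocator works in blocks, so actual usage will differ:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 1) -> int:
    """Per-token KV cache bytes for one attention group: K and V tensors for
    every layer, kv_heads heads of head_dim each (1 byte/elem assumes fp8)."""
    return layers * 2 * kv_heads * head_dim * bytes_per_elem

# Group 1: 35 sliding-window layers, head_dim=256, 2 KV heads (from the
# Model Details table); cache capped at the 512-token window.
sliding = kv_bytes_per_token(35, 2, 256) * 512

# Group 2: 7 full-attention layers, head_dim=512; cache grows with the
# full 131,072-token context.
full = kv_bytes_per_token(7, 2, 512) * 131_072

if __name__ == "__main__":
    print(f"sliding-window group: {sliding / 2**20:.1f} MiB per sequence")
    print(f"full-attention group: {full / 2**30:.2f} GiB per sequence")
```

Under these assumptions the sliding-window group stays at about 17.5 MiB per sequence regardless of context length, while the full-attention group dominates at long context, which is why the two groups need separate cache handling.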
## Related Models

| Model | Type | Size | Links |
|---|---|---|---|
| Gemma 4 31B DECKARD AWQ_FULL (target) | Dense NVFP4 | 20.5 GB | HuggingFace, GitHub |
| Gemma 4 31B DECKARD SVDQuant | Dense NVFP4 | 20.9 GB | HuggingFace |
| SuperGemma4 26B MoE | MoE NVFP4 | 15.3 GB | HuggingFace |
| vLLM AWQ Container | Docker | — | GHCR |
## License
This model inherits the Gemma license from Google.