Gemma-4-31B-it NVFP4A16 Quantization (Weight-only FP4)

This is an NVFP4A16 quantization of Google's Gemma-4-31B-it instruction-tuned multimodal model, optimized for deployment on NVIDIA Blackwell GPUs with vLLM.

Model Details

| Property | Value |
|---|---|
| Base Model | google/gemma-4-31B-it |
| Quantization Method | NVFP4A16 (weight-only FP4) |
| Weight Precision | FP4 (4-bit floating point) |
| Activation Precision | BF16 (weight-only quantization) |
| Group Size | 16 |
| Quantization Library | llm-compressor 0.10.0.1 |
| Format | compressed-tensors (nvfp4-pack-quantized) |
| Architecture | Gemma4ForConditionalGeneration |
| Layers | 60 decoder layers |
| Hidden Size | 5376 |
| Context Window | 256K tokens |
| Vision Tower | SigLIP (27 layers, preserved in BF16) |
| Quantized Components | Text decoder + projector (vision tower preserved in BF16) |

Hardware Requirements

  • Verified Deployment: NVIDIA RTX PRO 6000 (96GB VRAM, Blackwell sm120)
  • Actual VRAM Usage: ~35GB with gpu_memory_utilization: 0.4 (full 256K context)
  • CUDA Version: cu130 (CUDA 13.0)
  • vLLM Version: 0.18.2+ (tested on vllm/vllm-openai:gemma4-cu130 Docker image)

Note: NVFP4A16 is a weight-only quantization format that preserves activations in BF16. The vision tower remains in BF16, but the quantized text decoder enables deployment on high-end GPUs with substantial headroom for KV cache.
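As a rough sanity check, the checkpoint size follows directly from the format: two 4-bit weights pack into one byte, plus one 1-byte FP8 scale per group of 16 weights. The sketch below assumes all ~31B parameters are quantized, which slightly overcounts since the vision tower stays in BF16:

```python
# Back-of-envelope size estimate for NVFP4A16 weights.
# Assumptions: ~31e9 quantized params, one 1-byte FP8 scale per group of 16.
PARAMS = 31e9
GROUP = 16

packed_fp4_gb = PARAMS * 0.5 / 1e9      # two 4-bit values per byte
scales_gb = PARAMS / GROUP * 1.0 / 1e9  # one E4M3 scale per 16-weight group
total_gb = packed_fp4_gb + scales_gb

print(f"packed weights: {packed_fp4_gb:.1f} GB")
print(f"group scales:   {scales_gb:.1f} GB")
print(f"total:          {total_gb:.1f} GB")
```

The ~17.4GB estimate is consistent with the ~20GB safetensors file once the unquantized BF16 components (vision tower, embeddings, norms) are added back in.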

Quantization Recipe

This model was quantized using the following configuration:

default_stage:
  default_modifiers:
    GPTQModifier:
      targets: [Linear]
      ignore:
        - lm_head
        - model.vision_tower.*
        - model.embed_vision.*
        - model.multi_modal_projector.*
      scheme: NVFP4A16
      block_size: 128
      dampening_frac: 0.01
      actorder: static
      offload_hessians: false
      sequential_targets: [Gemma4TextDecoderLayer]
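To make the NVFP4A16 scheme concrete: FP4 here is the E2M1 format (sign bit, two exponent bits, one mantissa bit), giving 16 codes, and two codes pack into each byte, with one shared scale per 16-weight group. The following is an illustrative sketch of that codebook and nibble packing, not the exact compressed-tensors "nvfp4-pack-quantized" layout:

```python
# Illustrative FP4 (E2M1) round-trip: quantize -> pack nibbles -> dequantize.
# This mimics the idea of the format, not the real on-disk layout.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
CODEBOOK = E2M1 + [-v for v in E2M1]  # 16 codes: 3-bit magnitude + sign

def quantize(x, scale):
    """Map a scaled value to the nearest of the 16 FP4 codes (0..15)."""
    return min(range(16), key=lambda c: abs(CODEBOOK[c] - x / scale))

def pack(codes):
    """Pack two 4-bit codes per byte, low nibble first."""
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))

def unpack_dequant(packed, scale):
    """Unpack nibbles and rescale back to (approximate) real values."""
    out = []
    for b in packed:
        out.append(CODEBOOK[b & 0xF] * scale)
        out.append(CODEBOOK[b >> 4] * scale)
    return out

group = [0.9, -2.1, 0.0, 5.5]  # one (shortened) weight group
scale = 1.0                    # per-group scale; FP8 E4M3 in the real format
codes = [quantize(v, scale) for v in group]
restored = unpack_dequant(pack(codes), scale)
print(restored)  # values snapped to the FP4 grid
```

In the real format the per-group scale absorbs most of the dynamic range, which is why a group size as small as 16 keeps the quantization error low.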

Calibration Dataset: a mix of three datasets:

  • ise-uiuc/Magicoder-Evol-Instruct-110K (1024 samples)
  • Salesforce/APIGen-MT-5k (512 samples)
  • nvidia/When2Call (512 samples)

Total: 2048 samples at 2048 tokens each (~4M tokens calibration set)
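The totals above follow directly from the per-dataset sample counts:

```python
# Sanity-check the calibration mix (sample counts from the list above).
mix = {
    "ise-uiuc/Magicoder-Evol-Instruct-110K": 1024,
    "Salesforce/APIGen-MT-5k": 512,
    "nvidia/When2Call": 512,
}
seq_len = 2048
total_samples = sum(mix.values())
total_tokens = total_samples * seq_len
print(total_samples, "samples,", total_tokens, "tokens")  # ~4.19M tokens
```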

Usage

Loading with Transformers

from transformers import AutoModelForImageTextToText, AutoTokenizer

model_id = "ebircak/gemma-4-31B-it-4bit-NVFP4A16-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# The model is multimodal; AutoModelForImageTextToText resolves to the
# Gemma4ForConditionalGeneration architecture declared in config.json.
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto"
)

Loading with vLLM

Example vLLM Configuration (adapted from actual deployment):

# vllm_config.yaml
model: ebircak/gemma-4-31B-it-4bit-NVFP4A16-GPTQ
quantization: compressed-tensors  # NVFP4A16 weights
kv_cache_dtype: fp8_e4m3          # FP8 KV cache
gpu_memory_utilization: 0.4       # ~35GB VRAM usage on RTX PRO 6000; can be increased up to 0.95 if needed
max_model_len: 262144             # 256K context
tensor_parallel_size: 1           # Single GPU
enable_prefix_caching: true
enable_chunked_prefill: true

# Gemma4-specific (REQUIRED for tool calling)
enable_auto_tool_choice: true
tool_call_parser: gemma4          
reasoning_parser: gemma4
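For non-Docker deployments, the same settings map onto `vllm serve` command-line flags (flag names taken from vLLM's CLI; verify them against your installed version):

```shell
# Equivalent command-line launch (same settings as vllm_config.yaml)
vllm serve ebircak/gemma-4-31B-it-4bit-NVFP4A16-GPTQ \
  --quantization compressed-tensors \
  --kv-cache-dtype fp8_e4m3 \
  --gpu-memory-utilization 0.4 \
  --max-model-len 262144 \
  --tensor-parallel-size 1 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4
```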

Docker Deployment

docker run --rm -it \
  --runtime=nvidia \
  --gpus '"device=0"' \
  --ipc=host \
  --network=host \
  --privileged \
  --shm-size=16g \
  -v /root/.cache/huggingface/hub:/root/.cache/huggingface/hub \
  -v /path/to/vllm_config.yaml:/vllm_config.yaml \
  -e CUDA_DEVICE_ORDER=PCI_BUS_ID \
  -e CUDA_VISIBLE_DEVICES=0 \
  -e HF_HOME=/root/.cache/huggingface \
  -e HUGGINGFACE_HUB_CACHE=/root/.cache/huggingface/hub \
  -e VLLM_WORKER_MULTIPROC_METHOD=spawn \
  -e SAFETENSORS_FAST_GPU=1 \
  -e VLLM_TARGET_DEVICE=cuda \
  -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  -e TOKENIZERS_PARALLELISM=true \
  vllm/vllm-openai:gemma4-cu130 \
  --config /vllm_config.yaml
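Once the container is up, a quick smoke test against the OpenAI-compatible endpoint (vLLM's default port 8000 assumed; the model name must match the served model):

```shell
# Send a minimal chat completion request to the running server
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ebircak/gemma-4-31B-it-4bit-NVFP4A16-GPTQ",
        "messages": [{"role": "user", "content": "Say hello in one word."}],
        "max_tokens": 16
      }'
```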

Files in This Repository

| File | Description |
|---|---|
| model.safetensors | Quantized model weights (~20GB, single file) |
| config.json | Model configuration with quantization_config |
| tokenizer.json | Tokenizer vocabulary |
| tokenizer_config.json | Tokenizer config with chat template |
| chat_template.jinja | Gemma4 native chat template |
| generation_config.json | Generation parameters |
| processor_config.json | Processor configuration |
| recipe.yaml | Quantization recipe |
| LICENSE | Apache 2.0 License |

License

This quantization is released under the Apache 2.0 License.

The base model google/gemma-4-31B-it is also licensed under Apache 2.0.

See LICENSE for the full text.

Citation

If you use this model in your research, please cite:

@misc{gemma4-31b-nvfp4-quantization,
  title = {Gemma-4-31B-it NVFP4A16 Quantization},
  author = {ebircak},
  year = {2026},
  howpublished = {\url{https://huggingface.co/ebircak/gemma-4-31B-it-4bit-NVFP4A16-GPTQ}},
  note = {Quantized with llm-compressor 0.10.0.1}
}

Disclaimer

This is a community quantization of the Google Gemma-4-31B-it model. While efforts have been made to ensure quality, this model is provided "as is" without warranty of any kind. Users should evaluate the model for their specific use cases.

This quantization would not be possible without the hardware support of Gratex International, a.s. (https://www.gratex.com).

Model Card Version

This model card follows the Model Cards for Model Reporting standard.


Original Model: google/gemma-4-31B-it
Quantization Tool: llm-compressor
Quantization Format: compressed-tensors
