# Gemma-4-31B-it NVFP4A16 Quantization (Weight-only FP4)
This is an NVFP4A16 quantization of Google's Gemma-4-31B-it instruction-tuned multimodal model, optimized for deployment on NVIDIA Blackwell GPUs with vLLM.
## Model Details
| Property | Value |
|---|---|
| Base Model | google/gemma-4-31B-it |
| Quantization Method | NVFP4A16 (weight-only FP4) |
| Weight Precision | FP4 (4-bit floating point) |
| Activation Precision | BF16 (weight-only quantization) |
| Group Size | 16 |
| Quantization Library | llm-compressor 0.10.0.1 |
| Format | compressed-tensors (nvfp4-pack-quantized) |
| Architecture | Gemma4ForConditionalGeneration |
| Layers | 60 decoder layers |
| Hidden Size | 5376 |
| Context Window | 256K tokens |
| Vision Tower | SigLIP (27 layers, preserved in BF16) |
| Quantized Components | Text decoder + projector (vision tower preserved in BF16) |
## Hardware Requirements

- Verified Deployment: NVIDIA RTX PRO 6000 (96GB VRAM, Blackwell sm120)
- Actual VRAM Usage: ~35GB with `gpu_memory_utilization: 0.4` (full 256K context)
- CUDA Version: cu130 (CUDA 13.0)
- vLLM Version: 0.18.2+ (tested on the vllm/vllm-openai:gemma4-cu130 Docker image)
Note: NVFP4A16 is a weight-only quantization format that preserves activations in BF16. The vision tower remains in BF16, but the quantized text decoder enables deployment on high-end GPUs with substantial headroom for KV cache.
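As a rough sanity check on the figures above, here is a back-of-the-envelope estimate of the packed FP4 weight footprint. The ~31B parameter count and the assumption of one 8-bit scale per group of 16 weights are illustrative, not taken from the checkpoint itself:

```python
# Back-of-the-envelope NVFP4 checkpoint size estimate.
# Assumptions (not from the model card): ~31e9 total parameters,
# with the large majority living in the quantized text decoder.
params = 31e9
group_size = 16

fp4_bits = 4                                           # packed 4-bit weight values
scale_bits = 8                                         # one 8-bit scale per group
bits_per_weight = fp4_bits + scale_bits / group_size   # 4.5 effective bits/weight

weight_gb = params * bits_per_weight / 8 / 1e9
print(f"{bits_per_weight} bits/weight -> ~{weight_gb:.0f} GB of packed weights")
```

This lands around 17GB; the ~20GB `model.safetensors` in this repo is larger because the vision tower, projector, and embeddings stay in BF16. The remaining VRAM up to the observed ~35GB goes to activations and the FP8 KV cache.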
## Quantization Recipe

This model was quantized using the following configuration:

```yaml
default_stage:
  default_modifiers:
    GPTQModifier:
      targets: [Linear]
      ignore:
        - lm_head
        - model.vision_tower.*
        - model.embed_vision.*
        - model.multi_modal_projector.*
      scheme: NVFP4A16
      block_size: 128
      dampening_frac: 0.01
      actorder: static
      offload_hessians: false
      sequential_targets: [Gemma4TextDecoderLayer]
```
**Calibration Dataset**: a mix of:

- ise-uiuc/Magicoder-Evol-Instruct-110K (1024 samples)
- Salesforce/APIGen-MT-5k (512 samples)
- nvidia/When2Call (512 samples)

**Total**: 2048 samples at 2048 tokens each (~4M calibration tokens)
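The stated total can be verified with quick arithmetic (sample counts taken from the list above):

```python
# Verify the stated calibration budget: 2048 samples * 2048 tokens ~= 4M tokens.
samples = {
    "Magicoder-Evol-Instruct-110K": 1024,
    "APIGen-MT-5k": 512,
    "When2Call": 512,
}
seq_len = 2048

total_samples = sum(samples.values())
total_tokens = total_samples * seq_len
print(total_samples, total_tokens)  # 2048 samples, 4,194,304 (~4M) tokens
```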
## Usage

### Loading with Transformers

```python
from transformers import AutoModelForMultimodalLM, AutoTokenizer

model_id = "ebircak/gemma-4-31B-it-4bit-NVFP4A16-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMultimodalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)
```
### Loading with vLLM

Example vLLM configuration (adapted from an actual deployment):

```yaml
# vllm_config.yaml
model: ebircak/gemma-4-31B-it-4bit-NVFP4A16-GPTQ
quantization: compressed-tensors   # NVFP4A16 weights
kv_cache_dtype: fp8_e4m3           # FP8 KV cache
gpu_memory_utilization: 0.4        # ~35GB VRAM on RTX PRO 6000; raise up to 0.95 if needed
max_model_len: 262144              # 256K context
tensor_parallel_size: 1            # single GPU
enable_prefix_caching: true
enable_chunked_prefill: true

# Gemma4-specific (REQUIRED for tool calling)
enable_auto_tool_choice: true
tool_call_parser: gemma4
reasoning_parser: gemma4
```
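With `enable_auto_tool_choice` and the `gemma4` tool-call parser active, the server speaks the standard OpenAI-compatible `/v1/chat/completions` tool-calling interface. A minimal request payload might look like the sketch below; the `get_weather` tool, the question, and the port are illustrative assumptions, not part of this repo:

```python
import json

# Illustrative OpenAI-compatible tool-calling payload for the vLLM server.
# The tool definition and server URL are assumptions, not part of this repo.
payload = {
    "model": "ebircak/gemma-4-31B-it-4bit-NVFP4A16-GPTQ",
    "messages": [
        {"role": "user", "content": "What's the weather in Bratislava?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "tool_choice": "auto",
}

body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/chat/completions once the server is up,
# with a Content-Type: application/json header.
```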
### Docker Deployment

```bash
docker run --rm -it \
  --runtime=nvidia \
  --gpus '"device=0"' \
  --ipc=host \
  --network=host \
  --privileged \
  --shm-size=16g \
  -v /root/.cache/huggingface/hub:/root/.cache/huggingface/hub \
  -v /path/to/vllm_config.yaml:/vllm_config.yaml \
  -e CUDA_DEVICE_ORDER=PCI_BUS_ID \
  -e CUDA_VISIBLE_DEVICES=0 \
  -e HF_HOME=/root/.cache/huggingface \
  -e HUGGINGFACE_HUB_CACHE=/root/.cache/huggingface/hub \
  -e VLLM_WORKER_MULTIPROC_METHOD=spawn \
  -e SAFETENSORS_FAST_GPU=1 \
  -e VLLM_TARGET_DEVICE=cuda \
  -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  -e TOKENIZERS_PARALLELISM=true \
  vllm/vllm-openai:gemma4-cu130 \
  --config /vllm_config.yaml
```
## Files in This Repository

| File | Description |
|---|---|
| `model.safetensors` | Quantized model weights (~20GB, single file) |
| `config.json` | Model configuration with `quantization_config` |
| `tokenizer.json` | Tokenizer vocabulary |
| `tokenizer_config.json` | Tokenizer config with chat template |
| `chat_template.jinja` | Gemma4 native chat template |
| `generation_config.json` | Generation parameters |
| `processor_config.json` | Processor configuration |
| `recipe.yaml` | Quantization recipe |
| `LICENSE` | Apache 2.0 License |
## License
This quantization is released under the Apache 2.0 License.
The base model google/gemma-4-31B-it is also licensed under Apache 2.0.
See LICENSE for the full text.
## Citation

If you use this model in your research, please cite:

```bibtex
@misc{gemma4-31b-nvfp4-quantization,
  title        = {Gemma-4-31B-it NVFP4A16 Quantization},
  author       = {ebircak},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/ebircak/gemma-4-31B-it-4bit-NVFP4A16-GPTQ}},
  note         = {Quantized with llm-compressor 0.10.0.1}
}
```
## Disclaimer
This is a community quantization of the Google Gemma-4-31B-it model. While efforts have been made to ensure quality, this model is provided "as is" without warranty of any kind. Users should evaluate the model for their specific use cases.
This quantization was made possible by hardware support from Gratex International, a.s. (https://www.gratex.com).
## Model Card Version

This model card follows the Model Cards for Model Reporting standard.

- Original Model: google/gemma-4-31B-it
- Quantization Tool: llm-compressor
- Quantization Format: compressed-tensors