Gemma-4-31B-it NVFP4A16 Quantization (Weight-only FP4)

This is an NVFP4A16 quantization of Google's Gemma-4-31B-it instruction-tuned multimodal model, optimized for deployment on NVIDIA Blackwell GPUs with vLLM.

Model Details

| Property | Value |
|---|---|
| Base Model | google/gemma-4-31B-it |
| Quantization Method | NVFP4A16 (weight-only FP4) |
| Weight Precision | FP4 (4-bit floating point) |
| Activation Precision | BF16 (weight-only quantization) |
| Group Size | 16 |
| Quantization Library | llm-compressor 0.10.0.1 |
| Format | compressed-tensors (nvfp4-pack-quantized) |
| Architecture | Gemma4ForConditionalGeneration |
| Layers | 60 decoder layers |
| Hidden Size | 5376 |
| Context Window | 256K tokens |
| Vision Tower | SigLIP (27 layers, preserved in BF16) |
| Quantized Components | Text decoder + projector (vision tower preserved in BF16) |

Hardware Requirements

  • Verified Deployment: NVIDIA RTX PRO 6000 (96GB VRAM, Blackwell sm120)
  • Actual VRAM Usage: ~35GB with gpu_memory_utilization: 0.4 (full 256K context)
  • CUDA Version: cu130 (CUDA 13.0)
  • vLLM Version: 0.18.2+ (tested on vllm/vllm-openai:gemma4-cu130 Docker image)

Note: NVFP4A16 is a weight-only quantization format that preserves activations in BF16. The vision tower remains in BF16, but the quantized text decoder enables deployment on high-end GPUs with substantial headroom for KV cache.
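As a rough sanity check, the checkpoint size follows directly from the format: two 4-bit weights pack into one byte, plus one 1-byte FP8 scale per group of 16 weights. The sketch below assumes all ~31B parameters are quantized, which slightly overcounts since the vision tower stays in BF16:

```python
# Back-of-envelope size estimate for NVFP4A16 weights.
# Assumptions: ~31e9 quantized params, one 1-byte FP8 scale per group of 16.
PARAMS = 31e9
GROUP = 16

packed_fp4_gb = PARAMS * 0.5 / 1e9      # two 4-bit values per byte
scales_gb = PARAMS / GROUP * 1.0 / 1e9  # one E4M3 scale per 16-weight group
total_gb = packed_fp4_gb + scales_gb

print(f"packed weights: {packed_fp4_gb:.1f} GB")
print(f"group scales:   {scales_gb:.1f} GB")
print(f"total:          {total_gb:.1f} GB")
```

The ~17.4GB estimate is consistent with the ~20GB safetensors file once the unquantized BF16 components (vision tower, embeddings, norms) are added back in.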

Quantization Recipe

This model was quantized using the following configuration:

default_stage:
  default_modifiers:
    GPTQModifier:
      targets: [Linear]
      ignore:
        - lm_head
        - model.vision_tower.*
        - model.embed_vision.*
        - model.multi_modal_projector.*
      scheme: NVFP4A16
      block_size: 128
      dampening_frac: 0.01
      actorder: static
      offload_hessians: false
      sequential_targets: [Gemma4TextDecoderLayer]
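To make the NVFP4A16 scheme concrete: FP4 here is the E2M1 format (sign bit, two exponent bits, one mantissa bit), giving 16 codes, and two codes pack into each byte, with one shared scale per 16-weight group. The following is an illustrative sketch of that codebook and nibble packing, not the exact compressed-tensors "nvfp4-pack-quantized" layout:

```python
# Illustrative FP4 (E2M1) round-trip: quantize -> pack nibbles -> dequantize.
# This mimics the idea of the format, not the real on-disk layout.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
CODEBOOK = E2M1 + [-v for v in E2M1]  # 16 codes: 3-bit magnitude + sign

def quantize(x, scale):
    """Map a scaled value to the nearest of the 16 FP4 codes (0..15)."""
    return min(range(16), key=lambda c: abs(CODEBOOK[c] - x / scale))

def pack(codes):
    """Pack two 4-bit codes per byte, low nibble first."""
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))

def unpack_dequant(packed, scale):
    """Unpack nibbles and rescale back to (approximate) real values."""
    out = []
    for b in packed:
        out.append(CODEBOOK[b & 0xF] * scale)
        out.append(CODEBOOK[b >> 4] * scale)
    return out

group = [0.9, -2.1, 0.0, 5.5]  # one (shortened) weight group
scale = 1.0                    # per-group scale; FP8 E4M3 in the real format
codes = [quantize(v, scale) for v in group]
restored = unpack_dequant(pack(codes), scale)
print(restored)  # values snapped to the FP4 grid
```

In the real format the per-group scale absorbs most of the dynamic range, which is why a group size as small as 16 keeps the quantization error low.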

Calibration Dataset: a mix of three datasets:

  • ise-uiuc/Magicoder-Evol-Instruct-110K (1024 samples)
  • Salesforce/APIGen-MT-5k (512 samples)
  • nvidia/When2Call (512 samples)

Total: 2048 samples at 2048 tokens each (~4M tokens calibration set)
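The totals above follow directly from the per-dataset sample counts:

```python
# Sanity-check the calibration mix (sample counts from the list above).
mix = {
    "ise-uiuc/Magicoder-Evol-Instruct-110K": 1024,
    "Salesforce/APIGen-MT-5k": 512,
    "nvidia/When2Call": 512,
}
seq_len = 2048
total_samples = sum(mix.values())
total_tokens = total_samples * seq_len
print(total_samples, "samples,", total_tokens, "tokens")  # ~4.19M tokens
```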

Usage

Loading with Transformers

from transformers import AutoModelForImageTextToText, AutoTokenizer

model_id = "ebircak/gemma-4-31B-it-4bit-NVFP4A16-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# The model is multimodal; AutoModelForImageTextToText resolves to the
# Gemma4ForConditionalGeneration architecture declared in config.json.
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto"
)

Loading with vLLM

Example vLLM Configuration (adapted from actual deployment):

# vllm_config.yaml
model: ebircak/gemma-4-31B-it-4bit-NVFP4A16-GPTQ
quantization: compressed-tensors  # NVFP4A16 weights
kv_cache_dtype: fp8_e4m3          # FP8 KV cache
gpu_memory_utilization: 0.4       # ~35GB VRAM usage on RTX PRO 6000; can be increased up to 0.95 if needed
max_model_len: 262144             # 256K context
tensor_parallel_size: 1           # Single GPU
enable_prefix_caching: true
enable_chunked_prefill: true

# Gemma4-specific (REQUIRED for tool calling)
enable_auto_tool_choice: true
tool_call_parser: gemma4          
reasoning_parser: gemma4
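For non-Docker deployments, the same settings map onto `vllm serve` command-line flags (flag names taken from vLLM's CLI; verify them against your installed version):

```shell
# Equivalent command-line launch (same settings as vllm_config.yaml)
vllm serve ebircak/gemma-4-31B-it-4bit-NVFP4A16-GPTQ \
  --quantization compressed-tensors \
  --kv-cache-dtype fp8_e4m3 \
  --gpu-memory-utilization 0.4 \
  --max-model-len 262144 \
  --tensor-parallel-size 1 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4
```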

Docker Deployment

docker run --rm -it \
  --runtime=nvidia \
  --gpus '"device=0"' \
  --ipc=host \
  --network=host \
  --privileged \
  --shm-size=16g \
  -v /root/.cache/huggingface/hub:/root/.cache/huggingface/hub \
  -v /path/to/vllm_config.yaml:/vllm_config.yaml \
  -e CUDA_DEVICE_ORDER=PCI_BUS_ID \
  -e CUDA_VISIBLE_DEVICES=0 \
  -e HF_HOME=/root/.cache/huggingface \
  -e HUGGINGFACE_HUB_CACHE=/root/.cache/huggingface/hub \
  -e VLLM_WORKER_MULTIPROC_METHOD=spawn \
  -e SAFETENSORS_FAST_GPU=1 \
  -e VLLM_TARGET_DEVICE=cuda \
  -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  -e TOKENIZERS_PARALLELISM=true \
  vllm/vllm-openai:gemma4-cu130 \
  --config /vllm_config.yaml
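Once the container is up, a quick smoke test against the OpenAI-compatible endpoint (vLLM's default port 8000 assumed; the model name must match the served model):

```shell
# Send a minimal chat completion request to the running server
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ebircak/gemma-4-31B-it-4bit-NVFP4A16-GPTQ",
        "messages": [{"role": "user", "content": "Say hello in one word."}],
        "max_tokens": 16
      }'
```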

Files in This Repository

| File | Description |
|---|---|
| model.safetensors | Quantized model weights (~20GB, single file) |
| config.json | Model configuration with quantization_config |
| tokenizer.json | Tokenizer vocabulary |
| tokenizer_config.json | Tokenizer config with chat template |
| chat_template.jinja | Gemma4 native chat template |
| generation_config.json | Generation parameters |
| processor_config.json | Processor configuration |
| recipe.yaml | Quantization recipe |
| LICENSE | Apache 2.0 License |

License

This quantization is released under the Apache 2.0 License.

The base model google/gemma-4-31B-it is also licensed under Apache 2.0.

See LICENSE for the full text.

Citation

If you use this model in your research, please cite:

@misc{gemma4-31b-nvfp4-quantization,
  title = {Gemma-4-31B-it NVFP4A16 Quantization},
  author = {ebircak},
  year = {2026},
  howpublished = {\url{https://huggingface.co/ebircak/gemma-4-31B-it-4bit-NVFP4A16-GPTQ}},
  note = {Quantized with llm-compressor 0.10.0.1}
}

Disclaimer

This is a community quantization of the Google Gemma-4-31B-it model. While efforts have been made to ensure quality, this model is provided "as is" without warranty of any kind. Users should evaluate the model for their specific use cases.

This quantization would not be possible without the hardware support of Gratex International, a.s. (https://www.gratex.com).

Model Card Version

This model card follows the Model Cards for Model Reporting standard.


Original Model: google/gemma-4-31B-it
Quantization Tool: llm-compressor
Quantization Format: compressed-tensors
