vLLM / SGLang support?
Is there support for SGLang/vLLM?
vLLM is working on it, per their GitHub.
I think the PR was merged just an hour ago.
I hope the official instructions in the docs here are updated soon.
Here is a custom vLLM image I've built. It works as intended: https://hub.docker.com/r/infantryman77/vllm-gemma4. Tested with Cline and Open WebUI. Not completely production ready, but it works.
services:
  vllm:
    image: infantryman77/vllm-gemma4:nightly-20260402
    container_name: gemma4
    command:
      - /models/gemma-4-31B-it-AWQ-8bit
      - --served-model-name
      - gemma4-31b # model name exposed on the OpenAI-compatible API
      - --max-model-len
      - "131072"
      - --tensor-parallel-size
      - "4" # shard the model across 4 GPUs
      - --gpu-memory-utilization
      - "0.97"
      - --reasoning-parser
      - gemma4
      - --enable-auto-tool-choice # tool calling, needed for clients like Cline
      - --tool-call-parser
      - gemma4
      - --host
      - 0.0.0.0
      - --limit-mm-per-prompt
      - '{"image":4}' # at most 4 images per prompt
      - --max-num-batched-tokens
      - "2096"
      - --max-num-seqs
      - "4"
      - --port
      - "8080"
      - --disable-custom-all-reduce
      - --override-generation-config
      - '{"temperature":1.0,"top_p":0.95,"top_k":64}' # default sampling params
    volumes:
      - /home/infantryman/vllm/models:/models
    ports:
      - "8080:8080"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - PYTORCH_ALLOC_CONF=expandable_segments:True
      - LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64
      - OMP_NUM_THREADS=1
      - PYTHONWARNINGS=ignore::FutureWarning
      - VLLM_WORKER_MULTIPROC_METHOD=spawn
    ipc: host
    restart: unless-stopped
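If you want to sanity-check the container once it's up, a minimal smoke test against the OpenAI-compatible endpoint exposed on port 8080 might look like this. The model name matches --served-model-name above; the api_key value is just a placeholder, since the server was started without --api-key:

# Minimal smoke test for the vLLM server defined in the compose file above.
# Assumes the stack is up and listening on localhost:8080.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # host port mapped in the compose file
    api_key="EMPTY",  # placeholder; no --api-key was configured on the server
)

response = client.chat.completions.create(
    model="gemma4-31b",  # matches --served-model-name
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)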
Official vLLM doc on how to use Gemma 4 models: https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html
Is SGLang support available?
Hi all,
Yes, both vLLM and SGLang officially support Gemma 4. Just make sure you're running the latest versions of these frameworks so the new model architecture and tokenizer are handled correctly. For implementation details and setup, check out these official resources (and see the quick sanity-check sketch after the links):
vLLM official guide: https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html
Gemma 4 Optimized Support & NVFP4 Integration: https://github.com/sgl-project/sglang/issues/22129
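As a quick way to confirm your vLLM install is recent enough and can load the model, a minimal offline-inference sketch like the following should work. The model ID is a placeholder, so substitute your actual Gemma 4 checkpoint (local path or Hugging Face repo); the sampling values simply mirror the --override-generation-config used in the compose file above:

# Print the installed vLLM version, then run a one-prompt sanity check.
# NOTE: the model ID below is a placeholder -- swap in the Gemma 4
# checkpoint you actually use.
import vllm
from vllm import LLM, SamplingParams

print(vllm.__version__)  # confirm this is a release with Gemma 4 support

llm = LLM(model="google/gemma-4-31b-it")  # placeholder model ID
params = SamplingParams(temperature=1.0, top_p=0.95, top_k=64, max_tokens=128)

outputs = llm.generate(["Explain tensor parallelism in one sentence."], params)
print(outputs[0].outputs[0].text)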