vllm / sglang support?

#4
by mtcl - opened

Is there support for SGLang/vLLM?

vLLM is working on it, per their GitHub.

I think the PR was merged just an hour ago.

I hope the official instructions in the docs here are updated soon.

Here is a custom vLLM image I've built. It works as intended: https://hub.docker.com/r/infantryman77/vllm-gemma4. Tested with Cline and Open WebUI. Not completely production-ready, but it works.

```yaml
services:
  vllm:
    image: infantryman77/vllm-gemma4:nightly-20260402
    container_name: gemma4
    command:
      - /models/gemma-4-31B-it-AWQ-8bit
      - --served-model-name
      - gemma4-31b
      - --max-model-len
      - "131072"
      - --tensor-parallel-size
      - "4"
      - --gpu-memory-utilization
      - "0.97"
      - --reasoning-parser
      - gemma4
      - --enable-auto-tool-choice
      - --tool-call-parser
      - gemma4
      - --host
      - 0.0.0.0
      - --limit-mm-per-prompt
      - '{"image":4}'
      - --max-num-batched-tokens
      - "2096"
      - --max-num-seqs
      - "4"
      - --port
      - "8080"
      - --disable-custom-all-reduce
      - --override-generation-config
      - '{"temperature":1.0,"top_p":0.95,"top_k":64}'
    volumes:
      - /home/infantryman/vllm/models:/models
    ports:
      - "8080:8080"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - PYTORCH_ALLOC_CONF=expandable_segments:True
      - LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64
      - OMP_NUM_THREADS=1
      - PYTHONWARNINGS=ignore::FutureWarning
      - VLLM_WORKER_MULTIPROC_METHOD=spawn
    ipc: host
    restart: unless-stopped
```
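Once the container is up, you can sanity-check it against the OpenAI-compatible chat endpoint that vLLM serves. A minimal stdlib-only sketch (the model name `gemma4-31b` and port `8080` are taken from the compose file above; everything else is generic):

```python
import json
import urllib.request

def build_chat_request(prompt: str, model: str = "gemma4-31b") -> dict:
    """Build a chat-completions payload for vLLM's OpenAI-compatible API."""
    return {
        "model": model,  # must match --served-model-name in the compose file
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }

def send(payload: dict, base_url: str = "http://localhost:8080") -> dict:
    """POST the payload to the server and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Requires the compose stack above to be running.
    reply = send(build_chat_request("Say hello in one sentence."))
    print(reply["choices"][0]["message"]["content"])
```

If the request fails, check `docker logs gemma4` first; model load with tensor parallelism across 4 GPUs can take a while before the port starts accepting connections.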

Official vLLM doc on how to use Gemma 4 models: https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html

Is SGLang support available?

Hi all,

Yes, both vLLM and SGLang offer official support for Gemma 4. Just make sure you're running the latest versions of these frameworks to handle the new model architecture and tokenizers correctly. For implementation details and setup, you can check out these official resources:

vLLM official guide: https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html
Gemma 4 Optimized Support & NVFP4 Integration: https://github.com/sgl-project/sglang/issues/22129
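Since support for a new architecture only lands in recent releases, an old install typically fails at model load time with an unrecognized-architecture error. One way to fail fast is a simple version gate before launching; this is a generic sketch, and the `minimum` value is a placeholder, not the actual cutoff (check the linked guides or release notes for the real one):

```python
def parse_version(v: str) -> tuple:
    """'1.2.3' -> (1, 2, 3); pre-release suffixes are ignored for simplicity."""
    return tuple(int(part) for part in v.split(".") if part.isdigit())

def is_new_enough(installed: str, minimum: str) -> bool:
    """True if the installed framework version meets the required minimum."""
    return parse_version(installed) >= parse_version(minimum)

if __name__ == "__main__":
    # "0.0.0" is a placeholder; substitute the release that added Gemma 4
    # support, which this thread does not state.
    import importlib.metadata
    installed = importlib.metadata.version("vllm")
    if not is_new_enough(installed, "0.0.0"):
        raise SystemExit(f"vLLM {installed} is too old for this model")
```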
