Currently in production on private service.

#1
by tcclaviger - opened

Currently using this on a 2x R9700 server as a test, serving a small community of SWEs with agentic workflows and custom tools added via SSE and the REST API. So far it works great.

One note on power draw: Recent changes to the rocm nightly have resulted in RDNA 4 ignoring power limits, my cards routinely spike to 400 watts despite limits set to 250. Keep an eye on temps.
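Until that regression is fixed, a crude watchdog is a reasonable workaround. A minimal sketch (the `check_power` helper and the 250 W cap are my own illustration, not part of the setup above; on a live host you would feed it values parsed from `rocm-smi --showpower`):

```shell
#!/bin/sh
# Hypothetical helper: warn when a reported board power reading exceeds a cap.
check_power() {
    watts=$1 cap=$2
    if [ "$watts" -gt "$cap" ]; then
        echo "OVER ${cap}W: drawing ${watts}W"
    else
        echo "OK: ${watts}W"
    fi
}

# Example poll loop, uncomment on a ROCm host (parsing details may vary
# by rocm-smi version -- treat this as a sketch, not a drop-in script):
# while sleep 5; do
#     rocm-smi --showpower | grep -o '[0-9]\+' | while read -r w; do
#         check_power "$w" 250
#     done
# done

check_power 400 250
```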

Concurrently serves up to 8 users at high prefill rates, capable of over 11,000 tps on prefill. Decode tends to be 30-50 tps for a single user, but multi-user decode doesn't scale as well as the FP8 version does. Using the commands below.

  • --enforce-eager: performance on RDNA 4 is currently better with this flag; CUDA graphs work but are slower in both prefill and decode.

  • max-num-batched-tokens has little effect above 2048, but lowering it reduces prefill throughput.

  • The 12.0.1 override may not be necessary anymore; previously the runtime tried to run the R9700 as gfx1100 without it.
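To check whether the override is still needed, you can look at what ISA the runtime actually reports. A sketch, assuming rocminfo is on the PATH inside the container (an R9700 / Navi 48 should be detected as gfx1201, which is what the 12.0.1 override maps to; the `list_gfx` helper is my own illustration):

```shell
# Extract the unique gfx targets from rocminfo output.
# On a live ROCm host: rocminfo | list_gfx
list_gfx() {
    grep -o 'gfx[0-9a-f]\+' | sort -u
}

# Sample rocminfo excerpt, for illustration only:
sample='  Name:                    gfx1201
  Name:                    gfx1201'
echo "$sample" | list_gfx
```

If this shows gfx1201 without HSA_OVERRIDE_GFX_VERSION set, the override can be dropped.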

    --shm-size=16gb
    -e HSA_OVERRIDE_GFX_VERSION=12.0.1
    -e TORCH_COMPILE_DISABLE=1
    -v /mnt/raid/Models/GPTQ/Qwen3-Coder-42B-A3B-Instruct-GPTQ-Int4-gs128-AMD-COMPATIBLE:/app/models
    rocm/vllm-dev:nightly
    vllm serve /app/models
    --tensor-parallel-size 2
    --quantization compressed-tensors
    --enforce-eager
    --enable-prefix-caching
    --enable-auto-tool-choice
    --tool-call-parser hermes
    --max-num-seqs 8
    --max-model-len 262144
    --enable-chunked-prefill
    --max-num-batched-tokens 2048
    --gpu-memory-utilization 0.92
    --override-generation-config '{"temperature": 0.7, "top_p": 0.8, "top_k": 20, "repetition_penalty": 1.1}'
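Once the server is up, vllm serve exposes the standard OpenAI-compatible API (on port 8000 by default). A minimal smoke test; the host/port is an assumption, and the model name is the /app/models mount path from the command above:

```shell
# Build a chat request for the OpenAI-compatible endpoint vllm serve exposes.
PAYLOAD='{
  "model": "/app/models",
  "messages": [{"role": "user", "content": "Write a hello-world in Python."}],
  "max_tokens": 64
}'

# On the serving host:
# curl -s http://localhost:8000/v1/chat/completions \
#      -H 'Content-Type: application/json' -d "$PAYLOAD"
echo "$PAYLOAD"
```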

tcclaviger changed discussion title from In Prod to Currently in production on private service.
