Currently running this in production on a private service: a 2x R9700 server serving a small community of SWEs with agentic workflows and custom tools added via SSE and a REST API. It seems to work great.
One note on power draw: recent changes to the ROCm nightly have resulted in RDNA 4 ignoring power limits; my cards routinely spike to 400 W despite limits set to 250 W. Keep an eye on temps.
Concurrently serves up to 8 users at high prefill rates, capable of over 11,000 tok/s on prefill. Decode tends to be 30-50 tok/s for a single user, but multi-user decode doesn't scale as well as the FP8 version does. Using the commands below.
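For reference, the prefill/decode figures above come from token counts and request timing reported by the OpenAI-compatible endpoint; a minimal sketch of the arithmetic (the function names and sample numbers are mine, not from vLLM):

```python
# Rough throughput arithmetic for a single request, assuming you have:
#   prompt_tokens / completion_tokens from the response's "usage" field,
#   ttft  = time to first token (seconds),
#   total = total request wall time (seconds).

def prefill_tps(prompt_tokens: int, ttft: float) -> float:
    """Approximate prefill rate: prompt tokens processed before the first token."""
    return prompt_tokens / ttft

def decode_tps(completion_tokens: int, ttft: float, total: float) -> float:
    """Approximate decode rate: generated tokens over the generation phase."""
    return completion_tokens / (total - ttft)

# Hypothetical sample: a 22,000-token prompt prefilled in 2.0 s -> 11,000 tok/s
print(prefill_tps(22_000, 2.0))
# 512 generated tokens over 14.8 s of decode -> roughly 34.6 tok/s
print(decode_tps(512, 2.0, 16.8))
```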
--enforce-eager: performance on RDNA 4 is currently better with this flag; CUDA graphs work but are slower in both prefill and decode.
--max-num-batched-tokens has little effect above 2048, but lowering it reduces prefill throughput.
The HSA_OVERRIDE_GFX_VERSION=12.0.1 override may not be necessary anymore; previously, vLLM tried to run the R9700 as gfx1100 without it.
--shm-size=16gb
-e HSA_OVERRIDE_GFX_VERSION=12.0.1
-e TORCH_COMPILE_DISABLE=1
-v /mnt/raid/Models/GPTQ/Qwen3-Coder-42B-A3B-Instruct-GPTQ-Int4-gs128-AMD-COMPATIBLE:/app/models
rocm/vllm-dev:nightly
vllm serve /app/models
--tensor-parallel-size 2
--quantization compressed-tensors
--enforce-eager
--enable-prefix-caching
--enable-auto-tool-choice
--tool-call-parser hermes
--max-num-seqs 8
--max-model-len 262144
--enable-chunked-prefill
--max-num-batched-tokens 2048
--gpu-memory-utilization 0.92
--override-generation-config '{"temperature": 0.7, "top_p": 0.8, "top_k": 20, "repetition_penalty": 1.1}'
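With --enable-auto-tool-choice and the hermes tool-call parser, the server speaks the standard OpenAI tools schema, so custom tools plug in through an ordinary chat-completions payload. A sketch of what a client sends (the tool definition and endpoint here are made-up placeholders, not part of my setup):

```python
import json

# Hypothetical example tool; real deployments register their own via this same schema.
tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite and return the summary.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

payload = {
    "model": "/app/models",  # matches the path passed to `vllm serve`
    "messages": [{"role": "user", "content": "Run the tests in src/."}],
    "tools": tools,
    "tool_choice": "auto",   # let the model decide when to emit a tool call
}

body = json.dumps(payload)
# POST this body to http://<host>:8000/v1/chat/completions with any OpenAI client.
print(body[:60])
```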