Qwen3.5-4B runs at 50% speed of Qwen3-4B on my machine

#4
by PurelySelfMade - opened

Is anyone experiencing the same behavior?

Specs:

Model Name: MacBook Air
Model Identifier: Mac14,2
Model Number: MC7U4ZE/A
Chip: Apple M2
Total Number of Cores: 8 (4 performance and 4 efficiency)
Memory: 16 GB

Qwen3:4b (non-thinking mode)

llama-server \
  -hf ggml-org/Qwen3-4B-GGUF:Q4_K_M \
  --jinja \
  -c 4096 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0 \
  -ngl 99 \
  --chat-template-kwargs "{\"enable_thinking\": false}"

Result: 28 tk/s

Qwen3.5:4b (non-thinking by default in the current version of the 4B quant)

llama-server \
  -hf unsloth/Qwen3.5-4B-GGUF:Q4_K_M \
  --jinja \
  -c 4096 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0 \
  -ngl 99

Result: 13 tk/s

Is this expected?
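To rule out server and sampling overhead, the two GGUFs could also be compared with llama-bench (the benchmarking tool that ships with llama.cpp). A sketch, assuming the model files have already been downloaded locally (the paths below are placeholders):

```shell
# Raw prompt-processing (-p) and token-generation (-n) throughput,
# all layers offloaded (-ngl 99). Paths are hypothetical examples.
llama-bench -m ~/models/Qwen3-4B-Q4_K_M.gguf   -p 512 -n 128 -ngl 99
llama-bench -m ~/models/Qwen3.5-4B-Q4_K_M.gguf -p 512 -n 128 -ngl 99
```

If the gap shows up here too, it points at the model/GGUF itself rather than llama-server settings.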

Yes, I'm having the same problem.
Right now I'm experimenting with UD-IQ3 GGUFs, because even the Q4_K_M and Q3_K_M are not up to par with Qwen3 Q4_K_M in terms of token generation and prompt processing speeds.
Let's see how it goes.
Edit: Yeah, something is fundamentally wrong with all the GGUFs for Qwen3.5-4B. They all have nearly the same token generation speed, which is really slow compared to the Qwen3-4B GGUFs. I think the conversion to GGUF may have gone wrong? Qwen3-VL is also better and faster than both Qwen3-4B and Qwen3.5-4B at processing and generating tokens.

Same issue with the iq4_nl quant, but only on the GTX 1050 Ti (14 t/s); on the mobile GTX 1650, performance is just as fast as with the previous model (28 t/s).

I tested bartowski's 4B variant and saw the exact same performance drop.

@PurelySelfMade Try setting --parallel 1 with llama-server. That fixed it for me.
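For anyone trying this, the suggestion amounts to adding one flag to the original invocation from the first post, e.g.:

```shell
# Same command as above, with a single explicit server slot.
llama-server \
  -hf unsloth/Qwen3.5-4B-GGUF:Q4_K_M \
  --jinja \
  -c 4096 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0 \
  -ngl 99 \
  --parallel 1
```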

This is unrelated for a basic, single-request test.

@PurelySelfMade It's definitely related. I tested it on two devices, and in both cases it significantly increased token generation, even on the first request.

@PurelySelfMade @John-Sa So can you guys try it? On Vulkan backends it fixes this issue, though I don't know exactly why.

Setting --parallel to 1 also fixed it for me; I'm using CUDA.

I think part of this is that Qwen3.5-4B is clearly not actually 4B the way Qwen3-4B was; it's more like 5B. I don't know why they did this, maybe to boost benchmark scores or something.

@Kazuma0123 Hey! Thanks for the advice. I haven't tried the Vulkan backend, but I did try the "--parallel" flag and it increased token generation by only 2%. I still think the problem lies in the way the model was trained; I'll keep using Qwen3-4B for now.

--parallel 1 doesn't have an effect on Macs, most likely because of unified memory - there is nothing to optimize.

The default value is -1 (auto).
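The current default on a given build can be checked from the CLI help (exact wording varies between llama.cpp versions):

```shell
# Look for the -np / --parallel entry and its listed default.
llama-server --help | grep -i parallel
```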
