Qwen3.5-4B runs at 50% speed of Qwen3-4B on my machine

#4
by PurelySelfMade - opened

Is anyone experiencing the same behavior?

Specs:

Model Name: MacBook Air
Model Identifier: Mac14,2
Model Number: MC7U4ZE/A
Chip: Apple M2
Total Number of Cores: 8 (4 performance and 4 efficiency)
Memory: 16 GB

Qwen3:4b (non-thinking mode)

llama-server \
  -hf ggml-org/Qwen3-4B-GGUF:Q4_K_M \
  --jinja \
  -c 4096 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0 \
  -ngl 99 \
  --chat-template-kwargs "{\"enable_thinking\": false}"

Result: 28 tk/s

Qwen3.5:4b (non-thinking by default in the current version of the 4B quant)

llama-server \
  -hf unsloth/Qwen3.5-4B-GGUF:Q4_K_M \
  --jinja \
  -c 4096 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0 \
  -ngl 99

Result: 13 tk/s

Is this expected?
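To rule out server and sampling overhead, the two GGUFs could also be compared with llama-bench (the benchmarking tool that ships with llama.cpp). A sketch, assuming the model files have already been downloaded locally (the paths below are placeholders):

```shell
# Raw prompt-processing (-p) and token-generation (-n) throughput,
# all layers offloaded (-ngl 99). Paths are hypothetical examples.
llama-bench -m ~/models/Qwen3-4B-Q4_K_M.gguf   -p 512 -n 128 -ngl 99
llama-bench -m ~/models/Qwen3.5-4B-Q4_K_M.gguf -p 512 -n 128 -ngl 99
```

If the gap shows up here too, it points at the model/GGUF itself rather than llama-server settings.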

Yes, I'm having the same problem.
Right now I'm experimenting with UD-IQ3 GGUFs, because even the Q4_K_M and Q3_K_M are not up to par with Qwen3 Q4_K_M in terms of token generation and prompt processing speeds.
Let's see how it goes.
Edit: Yeah, something is fundamentally wrong with all the GGUFs for Qwen3.5-4B. They all have nearly the same token generation speed, which is really slow compared to the Qwen3-4B GGUFs. I think the conversion to GGUF may have gone wrong? Qwen3-VL is also better and faster than both Qwen3-4B and Qwen3.5-4B at processing and generating tokens.

Same issue with the iq4_nl quant, but only on the GTX 1050 Ti (14 t/s); on the mobile GTX 1650, performance is just as fast as with the previous model (28 t/s).

I tested bartowski's 4B variant and saw the exact same performance drop.

@PurelySelfMade Try setting --parallel 1 with llama-server. That fixed it for me.
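For anyone trying this, the suggestion amounts to adding one flag to the original invocation from the first post, e.g.:

```shell
# Same command as above, with a single explicit server slot.
llama-server \
  -hf unsloth/Qwen3.5-4B-GGUF:Q4_K_M \
  --jinja \
  -c 4096 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0 \
  -ngl 99 \
  --parallel 1
```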

This is unrelated for a basic, single-request test.

@PurelySelfMade It's definitely related. I tested it on two devices, and in both cases it significantly increased token generation, even on the first request.

@PurelySelfMade @John-Sa So can you guys try it? On Vulkan backends it fixes this issue, though I don't know exactly why.

Setting --parallel to 1 also fixed it for me; I'm using CUDA.

I think part of this is that Qwen3.5-4B is clearly not actually 4B the way Qwen3-4B was; it's more like 5B. I don't know why they did this, maybe to boost benchmark scores or something.

@Kazuma0123 Hey! Thanks for the advice. I haven't tried the Vulkan backend, but I did try the "--parallel" flag and it increased token generation by only 2%. I still think the problem lies in the way the model was trained; I'll keep using Qwen3-4B for now.

--parallel 1 doesn't have an effect on Macs, most likely because of unified memory - there is nothing to optimize.

The default value is -1 (auto).
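The current default on a given build can be checked from the CLI help (exact wording varies between llama.cpp versions):

```shell
# Look for the -np / --parallel entry and its listed default.
llama-server --help | grep -i parallel
```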
