Quantization

#5
by NukeNotNull - opened

What quantization does the chat.z.ai site use?


It's definitely got to be lower than FP8, or they're using a really low-precision KV cache.
Model performance just completely degrades after 100k tokens on their platform, same with GLM-5, yet other providers don't have the same issues.

z.ai still hasn't addressed this at all.

Stop whining already 😆 I'm glad that it's OSS! And it's a really awesome model (I'm using unsloth's quantization).

Z.ai's dialogue platform does not provide GLM-5.1 services. Are you referring to the API?


Coding plan.
Users have been complaining about its output for a while.
After around 100k tokens GLM completely loses its mind and spits out nonsense.

[Perplexity chart: ppl-GLM-5.1]

ik_llama.cpp offers the best-quality quantizations as well as speed (especially for hybrid CPU+GPU inference). I'd definitely recommend checking out ubergarm/GLM-5.1-GGUF @AImhotep . As you can see, it has lower perplexity (better quality) compared to some unreleased mainline-compatible test quants that I created.
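For anyone new to reading these charts: perplexity is just the exponential of the mean per-token negative log-likelihood over the eval corpus, so lower means the quant's predictions stay closer to the reference text. A toy illustration (the log-probability numbers are made up, not from any real run):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over the corpus."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical per-token log-probabilities from two quants of the same model
full_precision = [-0.9, -1.1, -0.8, -1.0]
heavy_quant    = [-1.3, -1.6, -1.1, -1.4]

print(perplexity(full_precision))  # lower value = closer to the reference text
print(perplexity(heavy_quant))    # heavier quant -> higher perplexity
```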

fwiw I've tested out to ~65k context and it's working quite well with opencode for basic vibe coding. I haven't gone further as it slows down quite a bit since I'm running it CPU-only (no GPUs at all hahah)...

I'm working with unsloth's IQ2_XXS on llama.cpp, which is working amazingly well (the model itself can be a bit verbose though :)
When using stock llama.cpp (newest build) and the MTP config I can get ~300 t/s prompt processing and 34-40 t/s output (continuously) - perfectly usable with Roo Code.

Perplexity alone is not that good a marker. Any KLD graphs?
Also I have 304GB of VRAM so I'm stuck with q2 for now.

@AImhotep

Sure, you can see my quants forming the Pareto front for mean KLD, calculated at 8k context using a special corpus with 16k-chunk blurbs for better alignment:

[KLD chart]

I added the KLD logs and script for your reference: https://huggingface.co/ubergarm/GLM-5.1-GGUF/tree/main/logs/kld
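For anyone unfamiliar, mean KLD compares the quant's full next-token probability distribution against the original model's at every position, which is why it's a stricter quality signal than perplexity alone. A minimal sketch of the per-token computation (the logits here are made up for illustration):

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p_logits, q_logits):
    """KL(P || Q) in nats between two softmax distributions over the vocab."""
    p = softmax(p_logits)
    q = softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits at one position: reference model vs. a quant
ref   = [2.0, 1.0, 0.1]
quant = [1.8, 1.2, 0.0]
print(kl_divergence(ref, quant))  # per-token KLD; averaged over a corpus = "mean KLD"
```

Identical distributions give a KLD of zero, and any divergence is strictly positive, so lower mean KLD means the quant tracks the original model more faithfully.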

What arguments are you using for MTP, I'd like to try that out. Thanks!

Cheers!

@ubergarm

My MTP line:
--spec-type ngram-map-k4v --spec-ngram-size-n 12 --spec-ngram-size-m 16 --spec-ngram-min-hits 1 --draft-max 32
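Conceptually, n-gram speculation like this scans the already-generated context for an earlier occurrence of the most recent k tokens and proposes the tokens that followed it as a cheap draft for the target model to verify. A toy sketch of that lookup (not the actual implementation, and the token IDs are invented):

```python
def ngram_draft(tokens, k=4, draft_max=8):
    """Propose draft tokens by matching the last k tokens earlier in context."""
    if len(tokens) < k:
        return []
    key = tokens[-k:]
    # Search backwards for a previous occurrence of the key n-gram
    for i in range(len(tokens) - k - 1, -1, -1):
        if tokens[i:i + k] == key:
            # Draft the tokens that followed the match last time
            return tokens[i + k : i + k + draft_max]
    return []

ctx = [1, 2, 3, 4, 9, 9, 1, 2, 3, 4]
print(ngram_draft(ctx, k=4))  # -> [9, 9, 1, 2, 3, 4]
```

This is why it helps so much on repetitive output like reasoning traces and code edits: the model keeps re-emitting sequences it has already produced, so the drafts are frequently accepted.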

It's particularly useful with GLM-5.1 as it reasons intensely. Almost no gain during the reasoning phase, but a lot of accepted generated tokens overall:
draft acceptance rate = 0.63726 ( 1741 accepted / 2732 generated)
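For reference, that acceptance rate is simply accepted draft tokens divided by generated draft tokens, and the throughput gain from speculation scales with it (accepted tokens come almost for free relative to a full target-model forward pass). A quick sanity check of the log line:

```python
def acceptance_rate(accepted, generated):
    """Fraction of drafted tokens the target model accepted."""
    return accepted / generated

rate = acceptance_rate(1741, 2732)
print(f"draft acceptance rate = {rate:.5f}")  # prints 0.63726, matching the log
```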

I wonder how the KLD looks in vanilla llama.cpp in comparison.

@AImhotep

My MTP line:

Thanks for the tip! I started using ngram-map-k4v recently and it does seem to speed up TG on some models like the new MiniMax-M2.7 and GLM-5.1. That seems like a very good acceptance rate, I'll try your numbers!

I wonder how the KLD looks in vanilla llama.cpp in comparison.

It should look the same. The OG quantization types are identical, and much of the original llama-perplexity code (used for KLD as well) was written by ik: https://github.com/ggml-org/llama.cpp/blame/master/tools/perplexity/perplexity.cpp

If anything, there might be an overall shift up or down slightly depending on which backend you're running on, but the trends will be the same.

But you can't run the newer SOTA quantization types on mainline unfortunately due to some old beef: https://github.com/ggml-org/llama.cpp/pull/19726#issuecomment-3946484059

@ubergarm

But you can't run the newer SOTA quantization types on mainline unfortunately due to some old beef: https://github.com/ggml-org/llama.cpp/pull/19726#issuecomment-3946484059

Ugh... yeah.
