Q3_K_XL quant

#1
by NIK2703 - opened

Please also create a Q3_K_XL quantized version suitable for use on a PC with 16 GB of RAM.

Hi, I created a quantized Q3_K_XL version; it ended up being 13.8 GB, the same size as Unsloth's. If you have a GPU, running llama.cpp with Q4_K_XL while keeping the backbone on the GPU and moving the experts to CPU/RAM will give you higher accuracy.
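For anyone doing this split in plain llama.cpp rather than LM Studio, a minimal sketch might look like the following. The model path and context size are placeholders for your own setup; the tensor-name regex is a commonly used pattern for matching the expert FFN weights, but the exact tensor names depend on the model architecture.

```shell
# Offload all layers to the GPU with -ngl, then use --override-tensor (-ot)
# to pin the MoE expert FFN tensors (matched by regex) back to CPU RAM,
# so only the dense backbone occupies VRAM.
# "./model-Q4_K_XL.gguf" is a placeholder path.
llama-server \
  -m ./model-Q4_K_XL.gguf \
  -ngl 99 \
  -ot ".ffn_.*_exps.=CPU" \
  -c 8192
```

This trades some token-generation speed for much lower VRAM use, which is what makes Q4_K_XL feasible on small GPUs.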

I have a GPU with 4 GB of VRAM, and I'm using the LM Studio option "Force MoE weights on CPU"; it gives a bigger speedup than offloading some layers, and layer offloading still doesn't let me run Q4_K_XL anyway.


In your setup, the Q3_K_XL version is definitely the better choice. I tested it a bit and it performs quite well for its size. Personally, I use Q4_K_XL with llama.cpp on my daily setup, but it quickly eats up VRAM with large context windows, even with the experts offloaded to RAM.
