Q3_K_XL quant

#1
by NIK2703 - opened

Please also create a Q3_K_XL quantized version suitable for use on a PC with 16 GB of RAM.

Hi, I created a quantized Q3_K_XL version; it ended up being 13.8 GB, the same size as Unsloth's. If you have a GPU, running llama.cpp with Q4_K_XL while keeping the backbone on the GPU and moving the experts to CPU/RAM will give you higher accuracy.
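For anyone doing this split in plain llama.cpp rather than LM Studio, a minimal sketch might look like the following. The model path and context size are placeholders for your own setup; the tensor-name regex is a commonly used pattern for matching the expert FFN weights, but the exact tensor names depend on the model architecture.

```shell
# Offload all layers to the GPU with -ngl, then use --override-tensor (-ot)
# to pin the MoE expert FFN tensors (matched by regex) back to CPU RAM,
# so only the dense backbone occupies VRAM.
# "./model-Q4_K_XL.gguf" is a placeholder path.
llama-server \
  -m ./model-Q4_K_XL.gguf \
  -ngl 99 \
  -ot ".ffn_.*_exps.=CPU" \
  -c 8192
```

This trades some token-generation speed for much lower VRAM use, which is what makes Q4_K_XL feasible on small GPUs.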

I have a GPU with 4 GB of VRAM, and I'm using the LM Studio option "Force MoE weights on CPU"; it gives a bigger speedup than offloading some layers, and layer offloading still doesn't let me run Q4_K_XL anyway.


In your setup, the Q3_K_XL version is definitely the better choice. I tested it a bit and it performs quite well for its size. Personally, I use Q4_K_XL with llama.cpp on my daily setup, but it quickly eats up VRAM with large context windows, even with the experts offloaded to RAM.
