Instructions to use magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF",
	filename="Apriel-1.5-15b-Thinker-IQ4_NL-EQKOUD-IQ4NL-H-MXFP4.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
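
`create_chat_completion` returns an OpenAI-style response dictionary rather than printing anything, so a script has to pull the reply text out itself. A minimal sketch, assuming the usual `choices[0]["message"]["content"]` layout; `extract_reply` is a hypothetical helper name, not part of llama-cpp-python:

```python
def extract_reply(response: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion dict."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    # Each choice carries a message with a role and the generated content.
    return choices[0]["message"]["content"]


# Hand-written stand-in for a real response, used only to show the shape.
example = {"choices": [{"message": {"role": "assistant", "content": "Paris."}}]}
print(extract_reply(example))  # prints: Paris.
```

With the model loaded as above, the same helper can wrap the real call: `print(extract_reply(llm.create_chat_completion(messages=[...])))`.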

Local Apps

llama.cpp

How to use magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF:IQ4_NL
# Run inference directly in the terminal:
llama-cli -hf magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF:IQ4_NL

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF:IQ4_NL
# Run inference directly in the terminal:
llama-cli -hf magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF:IQ4_NL

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF:IQ4_NL
# Run inference directly in the terminal:
./llama-cli -hf magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF:IQ4_NL

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF:IQ4_NL
# Run inference directly in the terminal:
./build/bin/llama-cli -hf magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF:IQ4_NL

Use Docker

docker model run hf.co/magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF:IQ4_NL


vLLM

How to use magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
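
The same request can be sent from Python instead of curl. A minimal sketch using only the standard library, assuming the server exposes the OpenAI-compatible `/v1/chat/completions` route shown above; `build_chat_request` and `chat` are hypothetical helper names:

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the vLLM server running, `chat("http://localhost:8000/v1", "magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF", "What is the capital of France?")` should return the reply text. The generous default timeout is an assumption, chosen because a thinking model may take a while to answer.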

Use Docker

docker model run hf.co/magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF:IQ4_NL

Ollama
How to use magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF with Ollama:
```
ollama run hf.co/magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF:IQ4_NL
```

Unsloth Studio

How to use magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF to start chatting

Pi

How to use magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF:IQ4_NL

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF:IQ4_NL"
        }
      ]
    }
  }
}
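
If you would rather generate that file than edit it by hand, the sketch below builds the same structure in Python. It only mirrors the JSON shown above and has not been checked against Pi's own documentation; `build_pi_provider` and `write_models_json` are hypothetical helper names, and the writer overwrites any existing file rather than merging into it:

```python
import json
from pathlib import Path


def build_pi_provider(model_id: str, base_url: str = "http://localhost:8080/v1") -> dict:
    """Return a models.json structure with a single llama.cpp provider entry."""
    return {
        "providers": {
            "llama-cpp": {
                "baseUrl": base_url,
                "api": "openai-completions",
                "apiKey": "none",
                "models": [{"id": model_id}],
            }
        }
    }


def write_models_json(config: dict, path: Path) -> None:
    """Write the config as pretty-printed JSON, replacing any existing file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
```

For example, `write_models_json(build_pi_provider("magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF:IQ4_NL"), Path.home() / ".pi" / "agent" / "models.json")` would produce the file described above.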

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF:IQ4_NL

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF:IQ4_NL

Run Hermes

hermes

Docker Model Runner
How to use magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF with Docker Model Runner:
```
docker model run hf.co/magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF:IQ4_NL
```

Lemonade

How to use magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull magiccodingman/Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF:IQ4_NL

Run and chat with the model

lemonade run user.Apriel-1.5-15b-Thinker-unsloth-MagicQuant-Hybrid-GGUF-IQ4_NL

List all available models

lemonade list

MagicQuant for Apriel 1.6?

by sebastienbo - opened Jan 25

Discussion

sebastienbo

Jan 25 (edited Jan 25)

Hi,
Since Apriel 1.6 is hitting all the most important benchmarks, I was wondering if you could make a MagicQuant version of it? According to the top 10 of the best small open-source models that can run on laptops, Apriel 1.6 and GLM 4.7 Flash are at the top of the intelligence line right now.
https://artificialanalysis.ai/models/open-source/small
GLM 4.7 Flash (30B) requires double the VRAM of Apriel 1.6 (15B).

It would be great to have a MagicQuant version of both, but especially Apriel 1.6, because that one is really slow in Q6 or Q8, while Q4 hallucinates too much in thinking mode. Apriel 1.6 is also the only one that can run on most consumer hardware without super expensive GPUs. The only problem is speed vs. quality: Q4 is unusably bad for coding, while Q6 is the best for that size but extremely slow at 4 tokens per second ...
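
As a rough illustration of the size trade-off described here, the weights of a model take about (parameters × bits per weight ÷ 8) bytes. The sketch below uses nominal 4, 6, and 8 bits per weight as an assumption; real GGUF quant types store extra scale data, so actual files are somewhat larger, and the KV cache and runtime overhead come on top. `weight_size_gb` is a hypothetical helper name:

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


# Compare a 15B and a 30B model at nominal 4, 6, and 8 bits per weight.
for params in (15, 30):
    for bits in (4, 6, 8):
        size = weight_size_gb(params, bits)
        print(f"{params}B at {bits} bits: ~{size:.2f} GB")
```

Under these assumptions a 15B model at 4 bits needs about 7.5 GB for weights and a 30B model about 15 GB, which matches the "double the VRAM" observation at equal bit width.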

Unsloth also published a new methodology that makes the process much faster with lower VRAM requirements:
https://unsloth.ai/docs/new/3x-faster-training-packing

Here is how you can start:
https://unsloth.ai/docs/basics/quantization-aware-training-qat

I would really appreciate a magic quantz ;-)
And you'll probably benefit from that one too :-)

Delcos

Apr 2

pls do this

floory

Apr 10

seems like he's done

magiccodingman

Owner Apr 13

If it means anything, I am still working on the project. I'm just busy with life and only able to put a few hours in here or there, but I got some good work done on it over the weekend. I'm not using the old pipeline I built anymore. It takes a lot of time and has flaws that are blatant to me, so I'm working very hard on version 2, which is built in a totally new framework and language.

But the original MagicQuant results truly were just my prototype. Plus, when the old pipeline runs, my entire PC is basically frozen until it's done, and that can take days or weeks. So to be honest, I'm trying to make a proper code base that I can trust, that is more performant and achieves better results, and that I'd be comfortable releasing as an open source project. I don't really want to be the bottleneck for why people can't build MagicQuant models.

I am hoping to have a lot of the new Qwen3.5 and Gemma models made into version 2 MagicQuant quantizations by the end of April or May. Then, after hammering out the last of the details, I want to just release the code and let the community do with it as they want.

It makes me feel bad when people ask for models and I can't help. Because only my primary workstation can run the old pipeline and I can't have my PC hang for days or weeks when I have work to do. I work from home, so I need my main PC daily. All my previous MagicQuant models baked when I took leave.

floory

Apr 15

take your time, it's just kinda been radio silent from huggingface but that's the only place i follow you

magiccodingman

Owner Apr 15

Thanks! On my GitHub here:
https://github.com/magiccodingman/MagicQuant-Wiki

I am trying to be a bit more active. Version 2 is taking a completely different direction. I learned a lot from version 1. And with KL Divergence now a benchmark, there's more nuance to how models are chosen. Previous models that were clear winners are not always clear winners anymore. Plus, a whole new philosophy of how to target and find the best quants has resulted in some really weird, but cool things. But I'm still sitting on it, digesting it, etc.

The hardest parts of my new framework are done. I'm now just tuning it and deciding on multiple factors.
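
For readers unfamiliar with the metric: KL divergence measures how far a quantized model's next-token probability distribution drifts from the full-precision model's, with 0 meaning the two are identical. A minimal sketch of the calculation, assuming per-token logits from both models are already available as plain lists; it illustrates the metric only, not the MagicQuant v2 pipeline, and all function names are hypothetical:

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats for two distributions over the same vocabulary."""
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0  # terms with zero reference probability contribute nothing
    )


def mean_kl(ref_logits, quant_logits):
    """Average KL divergence over token positions (reference vs. quantized)."""
    pairs = list(zip(ref_logits, quant_logits))
    total = sum(kl_divergence(softmax(r), softmax(q)) for r, q in pairs)
    return total / len(pairs)
```

Here `ref_logits` and `quant_logits` are lists with one logit vector per token position. Unlike a single accuracy score, the average reflects every position where the quantized model's distribution shifts, even when its top-ranked token stays the same.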

magiccodingman

Owner Apr 25

@sebastienbo @floory Please refer to the upcoming v2.0 launch. All v1.0 MagicQuant models will be deprecated. The original MagicQuant models acted deceptively smart, but had major unmeasured flaws. The new v2.0 not only resolves this, but introduces a new system that's fundamentally different. Keep an eye on the wiki and collection in the repo over the next week or two. I may even silently launch a couple of examples while testing. But v2.0 isn't just better, it's trustworthy, which is way more important. I'm also blending Unsloth Dynamic learned quants into tensor groups now too. So the new release will be much more fun and production grade. Plus, with a system that's significantly more trustworthy and not producing hidden damage, I can easily build additional model architectures as well.

I'm literally in the final stages right now of cleaning up v2.0 and documenting. It's already built on the back end, I'm merely cleaning up the code, output, and so on. After posting some small 4B test models, I'll be starting with the Qwen3.6 series, but will have the ability to work with way more now 👍

magiccodingman

Owner Apr 26

v2.0 isn't fully ready for showcasing, but if anyone is still interested:
https://huggingface.co/magiccodingman/Qwen3-4B-Instruct-2507-Unsloth-MagicQuant-GGUF
That's the first showcase of MagicQuant v2.0

The wiki has been updated:
https://github.com/magiccodingman/MagicQuant-Wiki
