MagicQuant for Apriel 1.6?
Hi,
Since Apriel 1.6 has been hitting all the most important benchmarks, I was wondering if you could make a MagicQuant version of it? According to the top 10 of the best small open-source models that can run on laptops, Apriel 1.6 and GLM 4.7 Flash are at the top of the intelligence line right now.
https://artificialanalysis.ai/models/open-source/small
GLM 4.7 Flash (30B) requires double the VRAM of Apriel 1.6 (15B).
It would be great if we could have a MagicQuant version of both, but especially Apriel 1.6, because that one is really slow in Q6 or Q8, while Q4 hallucinates too much in thinking mode. Apriel 1.6 is also the only one that can run on most consumer hardware without super expensive GPUs. The only problem is speed vs. quality: at Q4 it's unusably bad for coding, while Q6 is the best for that size but extremely slow at 4 tokens per second ...
Unsloth also published a new methodology that makes the process much faster and with lower VRAM requirements:
https://unsloth.ai/docs/new/3x-faster-training-packing
Here is how you can start:
https://unsloth.ai/docs/basics/quantization-aware-training-qat
I would really appreciate a magic quantz ;-)
And you'll probably benefit from that one too :-)
pls do this
seems like he's done
If it means anything, I am still working on the project. Just busy with life and only able to put a few hours in here or there, but I got some good work done on it over the weekend. I'm not using the old pipeline I built anymore. It takes a lot of time, there are flaws that are blatant to me, and I'm working very hard on version 2, which is built in a totally new framework and language.
But the original MagicQuant results truly were just my prototype. Plus, when the old pipeline runs, my entire PC is basically frozen until it's done, and it can take days or weeks. So to be honest, I'm trying to make a proper code base that I can trust, that is more performant and achieves better results, and that I'd be comfortable releasing as an open source project. Because I don't really want to be the bottleneck for why people can't build MagicQuant models.
I am hoping to have a lot of the new Qwen3.5 and Gemma models made into the version 2 MagicQuant quantizations by the end of April or May. Then after hammering out the last of the details, I want to just release the code and let the community do with it as they want.
It makes me feel bad when people ask for models and I can't help. Because only my primary workstation can run the old pipeline and I can't have my PC hang for days or weeks when I have work to do. I work from home, so I need my main PC daily. All my previous MagicQuant models baked when I took leave.
take your time, it's just kinda been radio silent from huggingface but that's the only place i follow you
Thanks! On my GitHub here:
https://github.com/magiccodingman/MagicQuant-Wiki
I am trying to be a bit more active. Version 2 is taking a completely different direction. I learned a lot from version 1. And with KL divergence now a benchmark, there's more nuance to how models are chosen. Previous models that were clear winners are not always clear winners anymore. Plus, a whole new philosophy of how to target and find the best quants has resulted in some really weird, but cool things. But I'm still sitting on it, digesting it, etc.
The hardest parts of my new framework are done. I'm now just tuning it and deciding on multiple factors.
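For readers unfamiliar with KL divergence as a quantization benchmark, here is a minimal sketch of the general idea: compare the next-token distribution of the full-precision model against the quantized one at each position and average the result. This is not MagicQuant's code; the function names `log_softmax`, `kl_divergence`, and `mean_kl` are hypothetical, and the logits are assumed to have been collected from both models on the same text.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]

def kl_divergence(base_logits, quant_logits):
    """KL(P_base || P_quant) for a single token position, in nats."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def mean_kl(base_rows, quant_rows):
    """Average KL divergence over all token positions (hypothetical helper)."""
    kls = [kl_divergence(b, q) for b, q in zip(base_rows, quant_rows)]
    return sum(kls) / len(kls)

if __name__ == "__main__":
    # Toy logits standing in for real model outputs (assumed data).
    base = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
    quant = [[1.9, 1.1, 0.1], [0.4, 0.7, 2.8]]
    print(f"mean KL: {mean_kl(base, quant):.6f} nats")
```

A value near zero means the quantized model's predictions closely track the baseline. Unlike perplexity alone, this also catches cases where a quant scores well on average while drifting from the original distribution.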
@sebastienbo @floory Please refer to the upcoming v2.0 launch. All v1.0 MagicQuant models will be deprecated. The original MagicQuant models act deceptively smart, but have major unmeasured flaws. The new v2.0 not only resolves this, but introduces a new system that's fundamentally different. Keep an eye on the wiki and collection in the repo over the next week or two. I may even silently launch a couple of examples while testing. But v2.0 isn't just better, it's trustworthy, which is way more important. I'm also blending Unsloth Dynamic learned quants into tensor groups now. So the new release will be much more fun and production grade. Plus, with a system that's significantly more trustworthy and not producing hidden damage, I can easily build additional model architectures as well.
I'm literally in the final stages right now of cleaning up v2.0 and documenting. It's already built on the back end; I'm merely cleaning up the code, output, and so on. After posting some small 4B test models, I'll be starting with the Qwen3.6 series, but will have the ability to work with way more now.
v2.0 isn't fully ready for showcasing, but if anyone is still interested:
https://huggingface.co/magiccodingman/Qwen3-4B-Instruct-2507-Unsloth-MagicQuant-GGUF
That's the first showcase of MagicQuant v2.0
The wiki has been updated:
https://github.com/magiccodingman/MagicQuant-Wiki