Instructions to use marcelone/Qwen3-4B-Instruct-2507-gguf with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use marcelone/Qwen3-4B-Instruct-2507-gguf with llama-cpp-python:

    # !pip install llama-cpp-python
    from llama_cpp import Llama

    llm = Llama.from_pretrained(
        repo_id="marcelone/Qwen3-4B-Instruct-2507-gguf",
        filename="Qwen3-4B-Instruct-2507-gguf-BF16.gguf",
    )
    llm.create_chat_completion(
        messages=[
            {"role": "user", "content": "What is the capital of France?"}
        ]
    )

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use marcelone/Qwen3-4B-Instruct-2507-gguf with llama.cpp:
Install from brew

    brew install llama.cpp
    # Start a local OpenAI-compatible server with a web UI:
    llama-server -hf marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL
    # Run inference directly in the terminal:
    llama-cli -hf marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL

Install from WinGet (Windows)

    winget install llama.cpp
    # Start a local OpenAI-compatible server with a web UI:
    llama-server -hf marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL
    # Run inference directly in the terminal:
    llama-cli -hf marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL

Use pre-built binary

    # Download pre-built binary from:
    # https://github.com/ggerganov/llama.cpp/releases
    # Start a local OpenAI-compatible server with a web UI:
    ./llama-server -hf marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL
    # Run inference directly in the terminal:
    ./llama-cli -hf marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL

Build from source code

    git clone https://github.com/ggerganov/llama.cpp.git
    cd llama.cpp
    cmake -B build
    cmake --build build -j --target llama-server llama-cli
    # Start a local OpenAI-compatible server with a web UI:
    ./build/bin/llama-server -hf marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL
    # Run inference directly in the terminal:
    ./build/bin/llama-cli -hf marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL

Use Docker
docker model run hf.co/marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL
- LM Studio
- Jan
- vLLM
How to use marcelone/Qwen3-4B-Instruct-2507-gguf with vLLM:
Install from pip and serve model

    # Install vLLM from pip:
    pip install vllm
    # Start the vLLM server:
    vllm serve "marcelone/Qwen3-4B-Instruct-2507-gguf"
    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:8000/v1/chat/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "marcelone/Qwen3-4B-Instruct-2507-gguf",
            "messages": [
                {
                    "role": "user",
                    "content": "What is the capital of France?"
                }
            ]
        }'

Use Docker
docker model run hf.co/marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL
- Ollama
How to use marcelone/Qwen3-4B-Instruct-2507-gguf with Ollama:
ollama run hf.co/marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL
- Unsloth Studio
How to use marcelone/Qwen3-4B-Instruct-2507-gguf with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)

    curl -fsSL https://unsloth.ai/install.sh | sh
    # Run unsloth studio
    unsloth studio -H 0.0.0.0 -p 8888
    # Then open http://localhost:8888 in your browser
    # Search for marcelone/Qwen3-4B-Instruct-2507-gguf to start chatting

Install Unsloth Studio (Windows)

    irm https://unsloth.ai/install.ps1 | iex
    # Run unsloth studio
    unsloth studio -H 0.0.0.0 -p 8888
    # Then open http://localhost:8888 in your browser
    # Search for marcelone/Qwen3-4B-Instruct-2507-gguf to start chatting

Using HuggingFace Spaces for Unsloth

    # No setup required
    # Open https://huggingface.co/spaces/unsloth/studio in your browser
    # Search for marcelone/Qwen3-4B-Instruct-2507-gguf to start chatting

- Pi
How to use marcelone/Qwen3-4B-Instruct-2507-gguf with Pi:
Start the llama.cpp server

    # Install llama.cpp:
    brew install llama.cpp
    # Start a local OpenAI-compatible server:
    llama-server -hf marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL

Configure the model in Pi

    # Install Pi:
    npm install -g @mariozechner/pi-coding-agent
    # Add to ~/.pi/agent/models.json:
    {
      "providers": {
        "llama-cpp": {
          "baseUrl": "http://localhost:8080/v1",
          "api": "openai-completions",
          "apiKey": "none",
          "models": [
            { "id": "marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL" }
          ]
        }
      }
    }

Run Pi

    # Start Pi in your project directory:
    pi

- Hermes Agent
How to use marcelone/Qwen3-4B-Instruct-2507-gguf with Hermes Agent:
Start the llama.cpp server

    # Install llama.cpp:
    brew install llama.cpp
    # Start a local OpenAI-compatible server:
    llama-server -hf marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL

Configure Hermes

    # Install Hermes:
    curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
    hermes setup
    # Point Hermes at the local server:
    hermes config set model.provider custom
    hermes config set model.base_url http://127.0.0.1:8080/v1
    hermes config set model.default marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL

Run Hermes
hermes
- Docker Model Runner
How to use marcelone/Qwen3-4B-Instruct-2507-gguf with Docker Model Runner:
docker model run hf.co/marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL
- Lemonade
How to use marcelone/Qwen3-4B-Instruct-2507-gguf with Lemonade:
Pull the model

    # Download Lemonade from https://lemonade-server.ai/
    lemonade pull marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL

Run and chat with the model
lemonade run user.Qwen3-4B-Instruct-2507-gguf-Q4_K_M_GXL
List all available models
lemonade list
| Output + Embedding | 2-bit | 3-bit | 4-bit | 5-bit | 6-bit | 8-bit | 16-bit | 32-bit |
|---|---|---|---|---|---|---|---|---|
| Suffix | AXL | BXL | CXL | DXL | EXL | FXL | GXL | HXL |
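
The suffix table above can be read as a lookup from the last part of a variant name to a bit width. The sketch below assumes that the suffix describes the precision of the output and embedding tensors, as the table header suggests; `split_variant` is a hypothetical helper for illustration, not part of any library.

```python
from typing import Optional, Tuple

# Suffix -> output/embedding bit width, copied from the table above.
SUFFIX_BITS = {
    "AXL": 2, "BXL": 3, "CXL": 4, "DXL": 5,
    "EXL": 6, "FXL": 8, "GXL": 16, "HXL": 32,
}


def split_variant(name: str) -> Tuple[str, Optional[int]]:
    """Split a variant name into (base quantization, output/embedding bits).

    The bit width is None when the name carries no known suffix,
    for example plain "BF16" or "F32".
    """
    base, sep, suffix = name.rpartition("_")
    if sep and suffix in SUFFIX_BITS:
        return base, SUFFIX_BITS[suffix]
    return name, None


print(split_variant("Q4_K_M_GXL"))
print(split_variant("BF16"))
```

Names without a recognized suffix are returned unchanged, so the helper can be applied to every row of the Master Table.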
Master Table
| Variant | Size (GB) | BPW | PPL | PPL error |
|---|---|---|---|---|
| IQ3_M_FXL | 2.06 | 4.08 | 1.9788 | 0.01061 |
| IQ3_M_GXL | 2.42 | 4.80 | 1.9785 | 0.01061 |
| IQ3_M_HXL | 3.20 | 6.35 | 1.9784 | 0.01061 |
| IQ4_XS_FXL | 2.73 | 4.69 | 1.9284 | 0.01018 |
| IQ4_XS_GXL | 2.36 | 5.42 | 1.9282 | 0.01018 |
| IQ4_XS_HXL | 3.51 | 6.96 | 1.9282 | 0.01018 |
| IQ4_NL_GXL | 2.84 | 5.64 | 1.9307 | 0.01024 |
| IQ4_NL_HXL | 3.62 | 7.18 | 1.9305 | 0.01023 |
| Q4_K_M_GXL | 2.96 | 5.87 | 1.9477 | 0.01047 |
| Q4_K_M_HXL | 3.73 | 7.41 | 1.9475 | 0.01047 |
| Q5_K_M_FXL | 2.98 | 5.92 | 1.9260 | 0.01024 |
| Q5_K_M_GXL | 3.35 | 6.65 | 1.9259 | 0.01023 |
| Q5_K_M_HXL | 4.13 | 8.19 | 1.9257 | 0.01023 |
| Q6_K_FXL | 3.40 | 6.75 | 1.9211 | 0.01018 |
| Q6_K_GXL | 3.77 | 7.48 | 1.9207 | 0.01018 |
| Q6_K_HXL | 4.54 | 9.02 | 1.9206 | 0.01017 |
| Q8_0_GXL | 4.65 | 9.23 | 1.9245 | 0.01026 |
| Q8_0_HXL | 5.42 | 10.77 | 1.9241 | 0.01025 |
| BF16 | 8.05 | 16.00 | 1.9233 | 0.01024 |
| BF16_HXL | 8.83 | 17.55 | 1.9231 | 0.01024 |
| F32 | 16.10 | 32.00 | 1.9232 | 0.01024 |
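
The Size and BPW columns are tied together by the parameter count: size in bytes is roughly bits per weight times the number of weights, divided by 8. The following is a minimal sketch of that arithmetic, assuming 1 GB means 10^9 bytes and ignoring file metadata; the function names are hypothetical and the rows are copied from the table above.

```python
def implied_params_billions(size_gb: float, bpw: float) -> float:
    """Parameter count (in billions) implied by a file size and bits per weight.

    Assumes 1 GB = 1e9 bytes, so size_gb * 8 is the size in gigabits.
    """
    return size_gb * 8 / bpw


def estimated_size_gb(bpw: float, params_billions: float) -> float:
    """File size in GB expected for a given bits-per-weight value."""
    return bpw * params_billions / 8


# (size in GB, BPW) pairs copied from the Master Table.
rows = {
    "BF16": (8.05, 16.00),
    "Q8_0_GXL": (4.65, 9.23),
    "Q4_K_M_GXL": (2.96, 5.87),
}

for name, (size_gb, bpw) in rows.items():
    params = implied_params_billions(size_gb, bpw)
    print(f"{name}: about {params:.2f}B parameters implied")
```

Rows that describe the same model should imply nearly the same parameter count, which makes this a quick sanity check on any size and BPW pair.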
Variant chooser, prefer FXL first
(these are my personal notes to help you choose)
| Variant (preferred) | Size (GB) | Quality vs BF16 | Inference speed | Long context headroom | My notes to you |
|---|---|---|---|---|---|
| IQ3_M_FXL | 2.06 | Low | Very fast | Excellent | I reach 76.33 tok/sec at 32k and 61.28 tok/sec at 64k. Use it when you must fit very tight limits. |
| IQ4_XS_FXL | 2.73 | Very good | Fast | Very good | I like this as a small yet stable 4-bit. If you need more raw speed, try IQ4_XS_GXL. |
| Q5_K_M_FXL | 2.98 | Very good | Medium fast | Very good | I use this when I want sturdier outputs than 4-bit with almost no size penalty. |
| Q6_K_FXL | 3.40 | Excellent | Medium fast | Very good | I lean on this for balanced quality, speed, and long contexts. |
| Q8_0_GXL | 4.65 | Excellent | Medium | OK | In my tests it kept high quality, 54.21 tok/sec at 16k and 52.04 at 32k. |
| BF16 | 8.05 | Reference | Slow | Tight | I use BF16 when I want very high quality without going full F32. |
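
One way to put numbers behind the "Quality vs BF16" column is to compare each variant's perplexity with the BF16 row of the Master Table. The sketch below is an illustration under stated assumptions, not the method used for the notes above: it treats the "PPL error" values as independent standard errors and flags a gap only when it exceeds their combined size. `compare_to_bf16` is a hypothetical helper.

```python
import math
from typing import Tuple

# BF16 reference values copied from the Master Table.
BF16_PPL = 1.9233
BF16_ERR = 0.01024


def compare_to_bf16(ppl: float, err: float) -> Tuple[float, bool]:
    """Return (percent change in PPL vs BF16, whether the gap exceeds the combined error).

    Assumption: the two reported errors are independent, so they are
    combined as the square root of the sum of squares.
    """
    delta = ppl - BF16_PPL
    percent = 100.0 * delta / BF16_PPL
    combined_error = math.sqrt(err ** 2 + BF16_ERR ** 2)
    return percent, abs(delta) > combined_error


# (PPL, PPL error) pairs copied from the Master Table.
for name, ppl, err in [
    ("IQ3_M_FXL", 1.9788, 0.01061),
    ("Q6_K_FXL", 1.9211, 0.01018),
    ("Q8_0_GXL", 1.9245, 0.01026),
]:
    percent, beyond_error = compare_to_bf16(ppl, err)
    print(f"{name}: {percent:+.2f}% vs BF16, beyond combined error: {beyond_error}")
```

Under these assumptions, a variant whose gap stays inside the combined error cannot be told apart from BF16 by this perplexity measurement alone.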
Quick picks by GPU VRAM
(again, these are personal notes from my RTX 3060 12 GB with 48 GB RAM)
| GPU VRAM | Pick | Why I recommend it |
|---|---|---|
| 16 GB | Q6_K_GXL or Q8_0_GXL, consider BF16 for near-best quality | I get near-BF16 quality with room for long context or batching. BF16 fits but leaves less headroom. |
| 12 GB | Q6_K_GXL for balance, or Q8_0_GXL for quality focus | On my 3060 12 GB these give strong quality and good 32k performance. |
| 8 GB | IQ4_XS_GXL for speed, Q5_K_M_GXL for sturdier outputs | Both leave comfortable KV space for longer contexts in my runs. |
| 6 GB | Q5_K_M_FXL or IQ4_XS_FXL | I find these the safest balance when memory is tight. |
| 4 GB | IQ3_M_FXL first, IQ4_XS_FXL if your context still fits | IQ3_M_FXL gives me the best chance of running under strict limits. |
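
The table above reflects hands-on testing, including speed and context length. As a rough first filter before consulting it, the sketch below picks the largest file that fits a VRAM budget after setting aside a reserve for the KV cache and runtime overhead. The `reserve_gb` value is a hypothetical knob that has to be tuned per context length, and the result will not always match the picks above, which also weigh speed.

```python
from typing import Optional

# (variant, file size in GB) pairs copied from the Master Table.
VARIANTS = [
    ("IQ3_M_FXL", 2.06),
    ("Q4_K_M_GXL", 2.96),
    ("Q5_K_M_GXL", 3.35),
    ("Q6_K_GXL", 3.77),
    ("Q8_0_GXL", 4.65),
    ("BF16", 8.05),
]


def largest_fitting(vram_gb: float, reserve_gb: float) -> Optional[str]:
    """Return the largest variant whose file fits in vram_gb minus reserve_gb.

    reserve_gb is an assumed allowance for KV cache and runtime overhead.
    Returns None when nothing fits.
    """
    budget = vram_gb - reserve_gb
    fitting = [(size, name) for name, size in VARIANTS if size <= budget]
    if not fitting:
        return None
    return max(fitting)[1]


print(largest_fitting(12, reserve_gb=6))
print(largest_fitting(4, reserve_gb=1.5))
```

A larger reserve is the safer choice for long contexts, since the KV cache grows with the number of tokens kept in memory.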
PROMPTS
Lesson
Create a C1 level dialogue for language learning. First, provide the introduction to the dialogue in English, then the dialogue in Italian. Then, an English translation of the dialogue. Next, there is a vocabulary list of all the Italian words used in the text, including their gender, class, and English meaning. Point out which words are more frequent, as they should be memorized for mastering that level. The grammar part should explain the grammar used in the text and present some grammar patterns that should be memorized as they are essential for mastering that level. Focus on explaining for students. Then, create a translation exercise section using sentences from the text for English to Italian translation.
Conversation Practice
I'd like to role-play. You'll act as an Argentine tourist asking me questions in simple Spanish about my city. Please keep the language at B1 level throughout.
You are "Ana", a Brazilian tourist. Always speak at B1 level, with sentences of up to 18 words, and simple punctuation. Do not invent facts about the user. If you are not sure, say "I do not know" and ask. Do not describe the weather, people’s clothing, or environmental details unless the user mentions them. Avoid ending the conversation with farewells unless the user ends it. Do not use em dashes; prefer commas, parentheses, or colons.
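
A persona prompt like the one above is normally sent as a `system` message, with the learner's text as the `user` message. The sketch below shows one way to do that against a local `llama-server`, assuming it is running on its default port 8080 with the OpenAI-compatible `/v1/chat/completions` endpoint. `build_payload` and `chat` are hypothetical helpers, and `ANA_PROMPT` is abbreviated here; paste the full prompt in practice.

```python
import json
import urllib.request

# Assumption: llama-server was started locally as shown earlier in this card.
SERVER_URL = "http://localhost:8080/v1/chat/completions"

# Abbreviated version of the "Ana" prompt above.
ANA_PROMPT = 'You are "Ana", a Brazilian tourist. Always speak at B1 level.'


def build_payload(system_prompt: str, user_text: str) -> dict:
    """Build an OpenAI-style chat request with a system and a user message."""
    return {
        "model": "marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
    }


def chat(system_prompt: str, user_text: str, url: str = SERVER_URL) -> str:
    """Send one chat turn to the server and return the assistant's reply text."""
    data = json.dumps(build_payload(system_prompt, user_text)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (requires a running server):
# print(chat(ANA_PROMPT, "Hello, Ana! Where are you from?"))
```

For a multi-turn role-play, the earlier assistant and user messages would need to be appended to `messages` on every request, since the server does not keep conversation state between calls.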
Model tree for marcelone/Qwen3-4B-Instruct-2507-gguf
Base model
Qwen/Qwen3-4B-Instruct-2507