Instructions for using marcelone/Qwen3-4B-Instruct-2507-gguf with libraries and local apps.

Libraries

How to use marcelone/Qwen3-4B-Instruct-2507-gguf with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="marcelone/Qwen3-4B-Instruct-2507-gguf",
	filename="Qwen3-4B-Instruct-2507-gguf-BF16.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
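
The call above returns an OpenAI-style response dictionary rather than a plain string. A minimal sketch for pulling the reply text out of it follows. `extract_reply` is a hypothetical helper, and `sample_response` is a hand-written stand-in with the same shape, not real model output.

```python
from typing import Any, Dict


def extract_reply(response: Dict[str, Any]) -> str:
    """Return the assistant text from an OpenAI-style chat completion dict."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    # Each choice carries a message with a role and the generated content.
    return choices[0]["message"]["content"]


# Hand-written stand-in with the same shape as a chat completion response.
sample_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Paris."},
            "finish_reason": "stop",
        }
    ]
}

print(extract_reply(sample_response))
```

With the model loaded, passing the result of `llm.create_chat_completion(...)` to `extract_reply` should give the generated text in the same way.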

Local Apps

llama.cpp

How to use marcelone/Qwen3-4B-Instruct-2507-gguf with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL
# Run inference directly in the terminal:
llama-cli -hf marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL
# Run inference directly in the terminal:
llama-cli -hf marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL
# Run inference directly in the terminal:
./llama-cli -hf marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL
# Run inference directly in the terminal:
./build/bin/llama-cli -hf marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL

Use Docker

docker model run hf.co/marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL

vLLM

How to use marcelone/Qwen3-4B-Instruct-2507-gguf with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "marcelone/Qwen3-4B-Instruct-2507-gguf"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "marcelone/Qwen3-4B-Instruct-2507-gguf",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
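
The curl call above can also be made from Python with only the standard library. This is a sketch under the assumption that the server is listening on `http://localhost:8000/v1` as shown; `build_chat_request` and `send_chat_request` are hypothetical helper names, and nothing is sent until `send_chat_request` is called.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON body (requires a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Same model name and prompt as the curl example; base URL is an assumption.
request = build_chat_request(
    "http://localhost:8000/v1",
    "marcelone/Qwen3-4B-Instruct-2507-gguf",
    "What is the capital of France?",
)
print(request.full_url)
```

The same helper should work against a local `llama-server` by changing the base URL to its address.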

Use Docker

docker model run hf.co/marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL

Ollama
How to use marcelone/Qwen3-4B-Instruct-2507-gguf with Ollama:
```
ollama run hf.co/marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL
```

Unsloth Studio

How to use marcelone/Qwen3-4B-Instruct-2507-gguf with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for marcelone/Qwen3-4B-Instruct-2507-gguf to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for marcelone/Qwen3-4B-Instruct-2507-gguf to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for marcelone/Qwen3-4B-Instruct-2507-gguf to start chatting

Pi

How to use marcelone/Qwen3-4B-Instruct-2507-gguf with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL"
        }
      ]
    }
  }
}
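
If you switch between quantization variants often, the JSON above can be generated instead of edited by hand. The sketch below only mirrors the fields shown in this section; `build_pi_models_config` is a hypothetical helper, and the default base URL is an assumption taken from the config above.

```python
import json


def build_pi_models_config(model_id: str, base_url: str = "http://localhost:8080/v1") -> dict:
    """Return a models.json structure with a single llama-cpp provider entry."""
    return {
        "providers": {
            "llama-cpp": {
                "baseUrl": base_url,
                "api": "openai-completions",
                "apiKey": "none",
                "models": [{"id": model_id}],
            }
        }
    }


config = build_pi_models_config("marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL")
# Print the JSON; copy it into ~/.pi/agent/models.json yourself.
print(json.dumps(config, indent=2))
```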

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use marcelone/Qwen3-4B-Instruct-2507-gguf with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL

Run Hermes

hermes

Docker Model Runner
How to use marcelone/Qwen3-4B-Instruct-2507-gguf with Docker Model Runner:
```
docker model run hf.co/marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL
```

Lemonade

How to use marcelone/Qwen3-4B-Instruct-2507-gguf with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull marcelone/Qwen3-4B-Instruct-2507-gguf:Q4_K_M_GXL

Run and chat with the model

lemonade run user.Qwen3-4B-Instruct-2507-gguf-Q4_K_M_GXL

List all available models

lemonade list

Output + Embedding

Bit width	2-bit	3-bit	4-bit	5-bit	6-bit	8-bit	16-bit	32-bit
Suffix	AXL	BXL	CXL	DXL	EXL	FXL	GXL	HXL
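
Reading the table above as a mapping from the variant suffix to the bit width used for the output and embedding tensors, a small lookup can decode a variant name. This is a sketch of that reading only; `output_embedding_bits` is a hypothetical helper, and it returns `None` for names without a known suffix.

```python
from typing import Optional

# Suffix to bit width, copied from the "Output + Embedding" table.
SUFFIX_BITS = {
    "AXL": 2,
    "BXL": 3,
    "CXL": 4,
    "DXL": 5,
    "EXL": 6,
    "FXL": 8,
    "GXL": 16,
    "HXL": 32,
}


def output_embedding_bits(variant: str) -> Optional[int]:
    """Return the output/embedding bit width encoded in a variant name, if any."""
    # The suffix is whatever follows the last underscore, e.g. "GXL" in "Q4_K_M_GXL".
    suffix = variant.rsplit("_", 1)[-1]
    return SUFFIX_BITS.get(suffix)


for name in ("Q4_K_M_GXL", "IQ3_M_FXL", "BF16_HXL", "BF16"):
    print(name, output_embedding_bits(name))
```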

Master Table

Variant	Size (GB)	BPW	PPL	PPL error
IQ3_M_FXL	2.06	4.08	1.9788	0.01061
IQ3_M_GXL	2.42	4.80	1.9785	0.01061
IQ3_M_HXL	3.20	6.35	1.9784	0.01061
IQ4_XS_FLX	2.73	4.69	1.9284	0.01018
IQ4_XS_GXL	2.36	5.42	1.9282	0.01018
IQ4_XS_HXL	3.51	6.96	1.9282	0.01018
IQ4_NL_GXL	2.84	5.64	1.9307	0.01024
IQ4_NL_HXL	3.62	7.18	1.9305	0.01023
Q4_K_M_GXL	2.96	5.87	1.9477	0.01047
Q4_K_M_HXL	3.73	7.41	1.9475	0.01047
Q5_K_M_FXL	2.98	5.92	1.9260	0.01024
Q5_K_M_GXL	3.35	6.65	1.9259	0.01023
Q5_K_M_HXL	4.13	8.19	1.9257	0.01023
Q6_K_FXL	3.40	6.75	1.9211	0.01018
Q6_K_GXL	3.77	7.48	1.9207	0.01018
Q6_K_HXL	4.54	9.02	1.9206	0.01017
Q8_0_GXL	4.65	9.23	1.9245	0.01026
Q8_0_HXL	5.42	10.77	1.9241	0.01025
BF16	8.05	16.00	1.9233	0.01024
BF16_HXL	8.83	17.55	1.9231	0.01024
F32	16.10	32.00	1.9232	0.01024
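
The Size and BPW columns are linked: file size is roughly parameter count times bits per weight, divided by 8. The sketch below infers the parameter count from the BF16 row (8.05 GB at 16.00 BPW), which is an assumption rather than a published figure, and treats GB as 10^9 bytes. `estimate_size_gb` is a hypothetical helper.

```python
# Parameter count inferred from the BF16 row: 8.05 GB at 16.00 bits per weight.
N_PARAMS = 8.05e9 * 8 / 16.00


def estimate_size_gb(bpw: float, n_params: float = N_PARAMS) -> float:
    """Estimate file size in GB (10^9 bytes) from bits per weight."""
    return n_params * bpw / 8 / 1e9


# Compare estimates against the Size column of the master table.
for variant, bpw, listed_gb in (
    ("Q4_K_M_GXL", 5.87, 2.96),
    ("Q8_0_GXL", 9.23, 4.65),
    ("F32", 32.00, 16.10),
):
    print(f"{variant}: estimated {estimate_size_gb(bpw):.2f} GB, listed {listed_gb:.2f} GB")
```

Small differences from the listed sizes are expected, since the table values are rounded and the file also holds metadata.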

Variant chooser, prefer FXL first

(these are my personal notes to help you choose)

Variant (preferred)	Size (GB)	Quality vs BF16	Inference speed	Long context headroom	My notes to you
IQ3_M_FXL	2.06	Low	Very fast	Excellent	I reach 76.33 tok/sec at 32k and 61.28 tok/sec at 64k. Use it when you must fit very tight limits.
IQ4_XS_FLX	2.73	Very good	Fast	Very good	I like this as a small yet stable 4-bit. If you need more raw speed, try IQ4_XS_GXL.
Q5_K_M_FXL	2.98	Very good	Medium fast	Very good	I use this when I want sturdier outputs than 4-bit with almost no size penalty.
Q6_K_FXL	3.40	Excellent	Medium fast	Very good	I lean on this for balanced quality, speed, and long contexts.
Q8_0_GXL	4.65	Excellent	Medium	OK	In my tests it kept high quality, 54.21 tok/sec at 16k and 52.04 at 32k.
BF16	8.05	Reference	Slow	Tight	I use BF16 when I want very high quality without going full F32.
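
The "Long context headroom" column depends on how much memory the KV cache needs on top of the weights. A rough estimate is sketched below. The architecture defaults (36 layers, 8 KV heads, head dimension 128) and the 16-bit cache are assumptions about the base model that you should check against the GGUF metadata; `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(
    n_tokens: int,
    n_layers: int = 36,        # assumed layer count
    n_kv_heads: int = 8,       # assumed number of key/value heads
    head_dim: int = 128,       # assumed per-head dimension
    bytes_per_value: int = 2,  # assumed 16-bit cache entries
) -> int:
    """Estimate KV cache size: keys and values for every layer and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * n_tokens


for context in (16_384, 32_768, 65_536):
    gib = kv_cache_bytes(context) / 2**30
    print(f"{context} tokens: about {gib:.2f} GiB of KV cache")
```

Adding this figure to the variant's file size gives a first guess at total memory use; runtimes that quantize the cache or offload part of it will need less.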

Quick picks by GPU VRAM

(again, these are personal notes from my RTX 3060 12 GB with 48 GB RAM)

GPU VRAM	Pick	Why I recommend it
16 GB	Q6_K_GXL or Q8_0_GXL; consider BF16 for near-best quality	I get near-BF16 quality with room for long context or batching. BF16 fits but leaves less headroom.
12 GB	Q6_K_GXL for balance, or Q8_0_GXL for quality focus	On my 3060 12 GB these give strong quality and good 32k performance.
8 GB	IQ4_XS_GXL for speed, Q5_K_M_GXL for sturdier outputs	Both leave comfortable KV space for longer contexts in my runs.
6 GB	Q5_K_M_FXL or IQ4_XS_FLX	I find these the safest balance when memory is tight.
4 GB	IQ3_M_FXL first, IQ4_XS_FLX if your context still fits	IQ3_M_FXL gives me the best chance of running under strict limits.
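
For other VRAM sizes, a naive size-based filter can narrow the choice. The sketch below picks the largest preferred variant whose file fits after reserving some memory for the KV cache and runtime overhead. It is not how the picks above were made; `pick_variant` and the `reserve_gb` value are hypothetical, and the sizes are copied from the variant chooser table.

```python
from typing import Optional

# (variant, size in GB) from the variant chooser table, smallest first.
PREFERRED_VARIANTS = [
    ("IQ3_M_FXL", 2.06),
    ("IQ4_XS_FLX", 2.73),
    ("Q5_K_M_FXL", 2.98),
    ("Q6_K_FXL", 3.40),
    ("Q8_0_GXL", 4.65),
    ("BF16", 8.05),
]


def pick_variant(vram_gb: float, reserve_gb: float = 2.0) -> Optional[str]:
    """Return the largest variant that fits in vram_gb minus reserve_gb, or None."""
    budget = vram_gb - reserve_gb
    fitting = [name for name, size_gb in PREFERRED_VARIANTS if size_gb <= budget]
    # The list is sorted by size, so the last fitting entry is the largest.
    return fitting[-1] if fitting else None


for vram in (4, 6, 8, 12, 16):
    print(f"{vram} GB VRAM -> {pick_variant(vram)}")
```

Raise `reserve_gb` if you plan to run long contexts, since the cache grows with context length.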

PROMPTS

Lesson

Create a C1 level dialogue for language learning. First, provide the introduction to the dialogue in English, then the dialogue in Italian. Then, an English translation of the dialogue. Next, there is a vocabulary list of all the Italian words used in the text, including their gender, class, and English meaning. Point out which words are more frequent, as they should be memorized for mastering that level. The grammar part should explain the grammar used in the text and present some grammar patterns that should be memorized as they are essential for mastering that level. Focus on explaining for students. Then, create a translation exercise section using sentences from the text for English to Italian translation.

Conversation Practice

I'd like to role-play. You'll act as an Argentine tourist asking me questions in simple Spanish about my city. Please keep the language at B1 level throughout.

You are "Ana", a Brazilian tourist. Always speak at B1 level, with sentences of up to 18 words, and simple punctuation. Do not invent facts about the user. If you are not sure, say "I do not know" and ask. Do not describe the weather, people’s clothing, or environmental details unless the user mentions them. Avoid ending the conversation with farewells unless the user ends it. Do not use em dashes; prefer commas, parentheses, or colons.

Downloads last month: 388

Format: GGUF

Model size: 4B params

Architecture: qwen3


Model tree for marcelone/Qwen3-4B-Instruct-2507-gguf

Base model: Qwen/Qwen3-4B-Instruct-2507

Quantized: this model (one of 245 quantized versions of the base model)