Instructions for using KyleHessling1/Qwopus-GLM-18B-Healed-MLX-4bit with libraries and local apps.

Libraries

How to use KyleHessling1/Qwopus-GLM-18B-Healed-MLX-4bit with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("KyleHessling1/Qwopus-GLM-18B-Healed-MLX-4bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Unsloth Studio

How to use KyleHessling1/Qwopus-GLM-18B-Healed-MLX-4bit with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for KyleHessling1/Qwopus-GLM-18B-Healed-MLX-4bit to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for KyleHessling1/Qwopus-GLM-18B-Healed-MLX-4bit to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for KyleHessling1/Qwopus-GLM-18B-Healed-MLX-4bit to start chatting

Load model with FastModel

# Make sure unsloth is installed
# pip install unsloth
from unsloth import FastModel
model, tokenizer = FastModel.from_pretrained(
    model_name="KyleHessling1/Qwopus-GLM-18B-Healed-MLX-4bit",
    max_seq_length=2048,
)

Pi

How to use KyleHessling1/Qwopus-GLM-18B-Healed-MLX-4bit with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "KyleHessling1/Qwopus-GLM-18B-Healed-MLX-4bit"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "KyleHessling1/Qwopus-GLM-18B-Healed-MLX-4bit"
        }
      ]
    }
  }
}
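If ~/.pi/agent/models.json already exists, pasting the snippet over it would drop any other providers configured there. The following is a minimal sketch of merging the entry instead, assuming the file holds a single JSON object with a top-level "providers" key as shown above. The helper names `merge_mlx_provider` and `update_models_file` are hypothetical and not part of Pi.

```python
import json
from pathlib import Path

MODEL_ID = "KyleHessling1/Qwopus-GLM-18B-Healed-MLX-4bit"

# Provider entry copied from the JSON snippet above.
MLX_PROVIDER = {
    "baseUrl": "http://localhost:8080/v1",
    "api": "openai-completions",
    "apiKey": "none",
    "models": [{"id": MODEL_ID}],
}


def merge_mlx_provider(config: dict) -> dict:
    """Return a copy of `config` with the mlx-lm provider added or replaced."""
    merged = dict(config)
    providers = dict(merged.get("providers", {}))
    providers["mlx-lm"] = MLX_PROVIDER
    merged["providers"] = providers
    return merged


def update_models_file(path: Path) -> None:
    """Read the config at `path` (if present), merge the provider, write it back."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(merge_mlx_provider(existing), indent=2) + "\n")


# Usage (hypothetical; writes to the path named in the instructions above):
# update_models_file(Path.home() / ".pi" / "agent" / "models.json")
```

Other providers already in the file are left untouched; only the "mlx-lm" key is added or overwritten.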

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use KyleHessling1/Qwopus-GLM-18B-Healed-MLX-4bit with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "KyleHessling1/Qwopus-GLM-18B-Healed-MLX-4bit"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default KyleHessling1/Qwopus-GLM-18B-Healed-MLX-4bit

Run Hermes

hermes

MLX LM

How to use KyleHessling1/Qwopus-GLM-18B-Healed-MLX-4bit with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "KyleHessling1/Qwopus-GLM-18B-Healed-MLX-4bit"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "KyleHessling1/Qwopus-GLM-18B-Healed-MLX-4bit"
# Call the OpenAI-compatible server with curl (mlx_lm.server listens on port 8080 by default)
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "KyleHessling1/Qwopus-GLM-18B-Healed-MLX-4bit",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
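The same request can be sent from Python using only the standard library. This is a rough sketch, not an official client: the function names `build_chat_request` and `chat` are hypothetical, and the base URL is an assumption that uses port 8080, the default for mlx_lm.server and the port used in the Pi and Hermes configs above. Change it if the server was started with a different --port.

```python
import json
import urllib.request

MODEL_ID = "KyleHessling1/Qwopus-GLM-18B-Healed-MLX-4bit"
DEFAULT_BASE_URL = "http://localhost:8080/v1"  # assumption: default mlx_lm.server port


def build_chat_request(prompt: str, base_url: str = DEFAULT_BASE_URL) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style /chat/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, base_url: str = DEFAULT_BASE_URL) -> str:
    """Send the prompt to a running server and return the assistant's reply."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as response:
        body = json.load(response)
    # OpenAI-style responses carry the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply.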

Qwopus-GLM-18B-Healed — MLX 4-bit

Apple Silicon / MLX 4-bit quantization of the healed Qwopus-GLM-18B frankenmerge. Ready to run on Macs with the MLX framework via mlx-lm.

Source (BF16): KyleHessling1/Qwopus-GLM-18B-Healed
Q4_K_M GGUF (llama.cpp): KyleHessling1/Qwopus-GLM-18B-Merged-GGUF

Quickstart

pip install -U "mlx-lm>=0.31.2"

from mlx_lm import load, generate

model, tokenizer = load("KyleHessling1/Qwopus-GLM-18B-Healed-MLX-4bit")
print(generate(model, tokenizer, prompt="The capital of France is", max_tokens=64))

Or from the CLI:

python3 -m mlx_lm generate \
  --model KyleHessling1/Qwopus-GLM-18B-Healed-MLX-4bit \
  --prompt "Write a haiku about Apple Silicon." \
  --max-tokens 128

Runs comfortably on a 16–24 GB unified-memory Mac (M-series).

Quantization

Property	Value
Method	MLX affine quantization (`mlx_lm.convert -q`)
Bits / weight	4 (effective 4.502 after non-quantized layers)
Group size	64
Non-quant dtype	bfloat16
Output size	~8.4 GB (2 safetensor shards)
Quantizer version	`mlx-lm` 0.31.2 / `mlx` 0.31.1
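The effective bits-per-weight figure can be roughly reproduced from the bit width and group size. In affine group quantization, each group of weights shares one scale and one bias, which adds a fixed overhead per weight. The sketch below assumes the scale and bias are each stored as 16-bit values; that storage detail is an assumption, not something stated on this card.

```python
def effective_bits_per_weight(
    bits: int,
    group_size: int,
    scale_bits: int = 16,  # assumption: one 16-bit scale per group
    bias_bits: int = 16,   # assumption: one 16-bit bias per group
) -> float:
    """Bits stored per weight once per-group scale and bias are amortized."""
    return bits + (scale_bits + bias_bits) / group_size


print(effective_bits_per_weight(bits=4, group_size=64))  # 4.5
```

Under these assumptions the quantized layers cost 4.5 bits per weight. The small remaining gap to the reported 4.502 is consistent with the few tensors kept in bfloat16.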

Reproducible from the BF16 source with:

python3 -m mlx_lm convert \
  --hf-path KyleHessling1/Qwopus-GLM-18B-Healed \
  --mlx-path ./Qwopus-GLM-18B-Healed-MLX-4bit \
  -q --q-bits 4 --q-group-size 64
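The same conversion can likely be driven from Python through the `convert` function that mlx-lm exports. The sketch below mirrors the CLI flags as keyword arguments; treat the exact keyword names as an assumption and check them against the installed mlx-lm version. The wrapper name `run_conversion` is hypothetical.

```python
# Keyword arguments mirroring the CLI flags above.
CONVERT_KWARGS = {
    "hf_path": "KyleHessling1/Qwopus-GLM-18B-Healed",
    "mlx_path": "./Qwopus-GLM-18B-Healed-MLX-4bit",
    "quantize": True,      # corresponds to -q
    "q_bits": 4,           # corresponds to --q-bits 4
    "q_group_size": 64,    # corresponds to --q-group-size 64
}


def run_conversion() -> None:
    """Download the BF16 source and write 4-bit MLX weights to mlx_path."""
    # Imported lazily so this module still loads on machines without MLX.
    from mlx_lm import convert

    convert(**CONVERT_KWARGS)
```

Calling `run_conversion()` on an Apple Silicon Mac downloads the BF16 weights, so expect a large download and enough free disk space for both copies.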

Base Model

A 64-layer frankenmerge of two of Jackrong's Qwen3.5-9B finetunes, healed with a 1000-step QLoRA fine-tune:

Layers 0–31: Jackrong/Qwopus3.5-9B-v3.5 (Opus reasoning distill)
Layers 32–63: Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1 (GLM-5.1 reasoning distill)
Heal training: 1000 steps QLoRA (rank 64) on Jackrong's training data to smooth the layer boundary
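The layer split above can be written as a small lookup from merged layer index to source model. This is only an illustration of the layout, assuming each source contributes all 32 of its layers in their original order; the card does not state the per-layer mapping, and `source_of` is a hypothetical helper.

```python
FIRST_SOURCE = "Jackrong/Qwopus3.5-9B-v3.5"
SECOND_SOURCE = "Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1"
LAYERS_PER_SOURCE = 32


def source_of(merged_layer: int) -> tuple[str, int]:
    """Map a merged layer index (0-63) to (source model, layer index in that source)."""
    if not 0 <= merged_layer < 2 * LAYERS_PER_SOURCE:
        raise ValueError(f"layer {merged_layer} is outside 0-63")
    if merged_layer < LAYERS_PER_SOURCE:
        return FIRST_SOURCE, merged_layer
    # Assumption: the second model's layers keep their original order.
    return SECOND_SOURCE, merged_layer - LAYERS_PER_SOURCE


print(source_of(31))  # ('Jackrong/Qwopus3.5-9B-v3.5', 31)
print(source_of(32))  # ('Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1', 0)
```

Layers 31 and 32 sit on either side of the boundary that the heal training is meant to smooth.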

Architecture

Property	Value
Parameters	~18B
Layers	64 (32 + 32)
Hidden Size	4096
Attention Heads	16 (4 KV heads, GQA)
Attention Type	Hybrid (linear + full, every 4th layer)
Context Length	262,144 tokens
Source Precision	BF16
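These figures allow a back-of-the-envelope estimate of the KV cache at long context. The sketch below rests on several assumptions not stated in the table: "every 4th layer" is read as 16 full-attention layers out of 64, the head dimension is taken as hidden size divided by attention heads (256), the cache holds 16-bit values, and the fixed-size state of the linear-attention layers is ignored.

```python
TOTAL_LAYERS = 64
FULL_ATTENTION_EVERY = 4   # assumption: one full-attention layer per 4 layers
KV_HEADS = 4
HIDDEN_SIZE = 4096
ATTENTION_HEADS = 16
HEAD_DIM = HIDDEN_SIZE // ATTENTION_HEADS  # assumption: 256, not stated in the table
BYTES_PER_VALUE = 2        # assumption: 16-bit cache entries


def kv_cache_bytes(tokens: int) -> int:
    """Estimate KV cache size for the full-attention layers only."""
    full_layers = TOTAL_LAYERS // FULL_ATTENTION_EVERY
    # Factor 2 covers one key vector and one value vector per KV head.
    per_token = 2 * full_layers * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return tokens * per_token


print(kv_cache_bytes(1))                # 65536
print(kv_cache_bytes(262_144) / 2**30)  # 16.0
```

Under these assumptions the cache grows by 64 KiB per token and would reach about 16 GiB at the full 262,144-token context, on top of the model weights.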

Capability Suite (from base model)

The base model outscores Qwen 3.6-35B-A3B (MoE) on a 44-test capability suite, 40 to 38, at less than half the VRAM:

Category	Qwopus-GLM-18B (healed)	Qwen 3.6-35B MoE
Score	40/44 (90.9%)	38/44 (86.4%)
Tool Calling	6/6	6/6
Agentic	4/4	4/4
Programming	12/15	12/15

Frontend stress tests: 62/63 checks passed across 6 complex HTML/CSS/JS generation tasks with perfectly balanced braces/parens and zero garbled output.

Note: benchmarks were measured on the BF16 base / Q4_K_M GGUF. The MLX 4-bit weights are a separate quantization and have not been independently re-benchmarked — expect quality within normal 4-bit quantization variance.

Known Issues

The tokenizer emits a Mistral-regex warning on load, inherited from the source repo. The warning is benign for Qwen tokenization in practice.

Credits

All credit for the source models goes to Jackrong. The heal training used his published datasets. See the full merge documentation for the complete technical workflow.

MLX quantization by @KyleHessling1 using mlx-lm.

License

Apache 2.0 (inherited from source models)

Contact

Questions, issues, or cool projects? Reach out on X: @KyleHessling1
