gemma-4-19b-a4b-it-REAP-MLX-4bit
PLE-safe MLX 4-bit weights for 0xSero/gemma-4-19b-a4b-it-REAP on Apple Silicon.
REAP (Router-weighted Expert Activation Pruning) removes 30% of the MoE experts (128 down to 90 per layer) while keeping the same active parameters per token: 8 of the remaining 90 experts are selected. Combined with PLE-safe 4-bit quantization, the weights take 12.6 GB, which fits on Macs with 24 GB or more of memory.
| | Original 26B | REAP 19B MoE (30% pruned) | This model |
|---|---|---|---|
| Experts/layer | 128 | 90 | 90 |
| Precision | BF16 | BF16 | 4-bit |
| Disk size | ~52 GB | ~36 GB | 12.6 GB |
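The figures in the comparison table above can be sanity-checked with a little arithmetic. The sketch below only restates numbers from the table; the variable and function names are illustrative and not part of any library.

```python
# Figures taken from the comparison table above.
ORIGINAL_EXPERTS = 128   # experts per layer in the original 26B model
PRUNED_EXPERTS = 90      # experts per layer after REAP
ACTIVE_EXPERTS = 8       # experts selected per token (unchanged by pruning)

BF16_PRUNED_GB = 36.0    # approximate disk size of the pruned BF16 checkpoint
QUANTIZED_GB = 12.6      # disk size of this 4-bit model


def fraction_removed(before: int, after: int) -> float:
    """Share of experts dropped by pruning."""
    return 1.0 - after / before


pruned_share = fraction_removed(ORIGINAL_EXPERTS, PRUNED_EXPERTS)
active_share = ACTIVE_EXPERTS / PRUNED_EXPERTS
compression = BF16_PRUNED_GB / QUANTIZED_GB

print(f"experts removed per layer: {pruned_share:.1%}")   # 29.7%
print(f"experts active per token:  {active_share:.1%}")   # 8.9% of those remaining
print(f"BF16 -> 4-bit size ratio:  {compression:.2f}x")   # 2.86x
```

The size ratio is below the nominal 4x of a 16-bit to 4-bit conversion, which is consistent with some layers staying in bf16 and with the per-group metadata that group-wise quantization stores alongside the weights.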
Also available
- gemma-4-21b-a4b-it-REAP-MLX-4bit — 21B MoE (20% pruned), 13.9 GB
Accuracy Benchmarks
0-shot generative, thinking enabled, 50 samples per task, identical eval harness. Apple M4 Max 36GB.
| Task | 26B-A4B 4-bit (16.4 GB) | This model (12.6 GB) |
|---|---|---|
| Elementary Mathematics | 84% | 44% |
| Philosophy | 66% | 54% |
| World Religions | 66% | 34% |
| College Computer Science | 58% | 34% |
| High School Mathematics | 26% | 22% |
| Abstract Algebra | 44% | 36% |
| College Mathematics | 36% | 16% |
| GSM8K | 64% | 62% |
The accuracy loss from 30% expert pruning compounds with the loss from 4-bit quantization. Consider the 21B variant for better accuracy.
Extraction failures (responses from which no answer could be parsed) are counted as incorrect. For this model they account for 113 of 400 responses (28%), and up to 60% on some tasks, because the model often generates a verbose explanation instead of a single-letter answer. True accuracy may therefore be higher than reported. Full methodology: GitHub.
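To make the scoring rule concrete, here is a minimal sketch of how counting unparseable responses as incorrect affects the reported number. It is not the eval harness used for the table: the regular expression, `extract_choice`, and `score` are hypothetical, and it covers only multiple-choice tasks (GSM8K needs numeric extraction).

```python
import re
from typing import Dict, List, Optional

# Hypothetical pattern: "answer is B", "Answer: (C)", and similar phrasings.
_ANSWER = re.compile(r"[Aa]nswer(?:\s+is|:)?\s*\(?([A-D])\b")


def extract_choice(response: str) -> Optional[str]:
    """Return the chosen letter, or None if no answer can be parsed."""
    text = response.strip()
    if text in ("A", "B", "C", "D"):      # bare single-letter answer
        return text
    matches = _ANSWER.findall(text)
    return matches[-1] if matches else None


def score(responses: List[str], gold: List[str]) -> Dict[str, float]:
    """Score responses, counting extraction failures as incorrect."""
    correct = failures = 0
    for response, answer in zip(responses, gold):
        choice = extract_choice(response)
        if choice is None:
            failures += 1                  # unparseable: counted as wrong
        elif choice == answer:
            correct += 1
    n = len(gold)
    return {
        "reported": correct / n,
        "failure_rate": failures / n,
        # Accuracy if every unparseable response had in fact been correct.
        "upper_bound": (correct + failures) / n,
    }


responses = [
    "The answer is B.",
    "Answer: (C)",
    "It depends on several factors, so let me walk through each option.",
    "D",
]
result = score(responses, gold=["B", "A", "C", "D"])
print(result)
```

In this toy run the reported accuracy is 0.5 with a 0.25 failure rate, so the true accuracy lies somewhere between 0.5 and 0.75. The same reasoning applies to the table: the reported figures are lower bounds.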
Quantization Details
- Bits: 4
- Group size: 64
- Strategy: PLE-safe. Only large `nn.Linear` and `SwitchLinear` (MoE expert) layers are quantized. All PLE, `ScaledLinear`, and vision layers stay in bf16.
| Quantized (4-bit) | Kept in bf16 |
|---|---|
| Attention projections (q/k/v/o_proj) | ScaledEmbedding (embed_tokens) |
| MLP layers (gate/up/down_proj) | ScaledLinear (PLE pathway) |
| MoE expert layers (SwitchLinear) | Per-layer embeddings (per_layer_*) |
| | Vision encoder |
| | All norms and scalars |
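The split in the table above could be expressed as a predicate that decides, per module, whether to quantize. The following is a minimal sketch, not the actual quantization script: the path fragments in `SKIP_PATH_FRAGMENTS` are assumptions about how the modules are named, and the stand-in classes exist only so the example runs without MLX installed.

```python
# Hypothetical path fragments for modules that must stay in bf16.
SKIP_PATH_FRAGMENTS = ("embed_tokens", "per_layer", "vision")


def ple_safe_predicate(path: str, module: object) -> bool:
    """Return True only for modules that should be quantized to 4 bits."""
    # Duck-typed check so that both nn.Linear and SwitchLinear qualify.
    if not hasattr(module, "to_quantized"):
        return False
    # ScaledLinear / ScaledEmbedding stay in bf16 wherever they appear.
    if type(module).__name__.startswith("Scaled"):
        return False
    # Anything on the embedding, PLE, or vision paths stays in bf16.
    return not any(fragment in path for fragment in SKIP_PATH_FRAGMENTS)


# Stand-in classes for demonstration only (not the real MLX modules).
class Linear:
    def to_quantized(self, group_size: int = 64, bits: int = 4):
        return self


class ScaledLinear(Linear):
    pass


class RMSNorm:
    pass


for path, module in [
    ("language_model.layers.0.self_attn.q_proj", Linear()),
    ("language_model.layers.0.per_layer_projection", ScaledLinear()),
    ("vision_tower.encoder.layers.0.fc1", Linear()),
    ("language_model.norm", RMSNorm()),
]:
    print(f"{path}: quantize={ple_safe_predicate(path, module)}")
```

With MLX installed, a predicate of this shape can be passed to `mlx.nn.quantize(model, group_size=64, bits=4, class_predicate=ple_safe_predicate)`, which calls it with each module's path and the module itself.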
Usage
```bash
pip install -U mlx-vlm
```
Vision
```python
from mlx_vlm import load, generate

# Load the quantized model and its processor
model, processor = load("ukint-vs/gemma-4-19b-a4b-it-REAP-MLX-4bit")
tokenizer = processor.tokenizer

# One user turn containing an image and a text instruction
messages = [{"role": "user", "content": [
    {"type": "image", "url": "photo.jpg"},
    {"type": "text", "text": "Describe this image in detail."},
]}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

out = generate(model, processor, prompt, ["photo.jpg"],
               max_tokens=200, repetition_penalty=1.2, temperature=0.7)
print(out.text)
```
Text
```python
# Reuses model, processor, and tokenizer from the vision example above
messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

out = generate(model, processor, prompt, max_tokens=100, temperature=0.0)
print(out.text)
```
Validation
Trimodal validation: 10/10 vision, 3/3 chat (EN/ZH/JA). Full results: GitHub.
Bugs Fixed in mlx-vlm
| # | Bug | Fix |
|---|---|---|
| 1 | `ScaledLinear` inherits `nn.Module` not `nn.Linear` | Change to `ScaledLinear(nn.Linear)` |
| 2 | Standard quantization quantizes PLE layers | PLE-safe `class_predicate` |
| 3 | `processor.save_pretrained()` strips audio config | Copy `processor_config.json` from source |
| 4 | `SwitchLinear` (MoE) not quantized | Check `hasattr(module, 'to_quantized')` |
| 5 | `embed_scale` double-scaling (mlx-vlm 0.4.4+) | Set `Gemma4TextModel.embed_scale = 1.0` |
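Bug 5 is easiest to see with a toy model. The sketch below is not mlx-vlm code: `ScaledEmbedding` and `TextModel` are simplified stand-ins, and both `HIDDEN_SIZE` and the choice of `sqrt(HIDDEN_SIZE)` as the scale are assumptions made for illustration. It shows how an embedding layer that already scales its output gets scaled a second time by the surrounding model, and why setting the outer scale to 1.0 restores single scaling.

```python
import math
from typing import Dict, List

HIDDEN_SIZE = 16                       # hypothetical; chosen so the sqrt is exact
SCALE = math.sqrt(HIDDEN_SIZE)         # 4.0


class ScaledEmbedding:
    """Toy embedding that applies its scale internally."""

    def __init__(self, weights: Dict[int, List[float]], scale: float):
        self.weights = weights
        self.scale = scale

    def __call__(self, token_id: int) -> List[float]:
        return [w * self.scale for w in self.weights[token_id]]


class TextModel:
    """Toy text model that scales the embedding output again."""

    def __init__(self, embed_tokens: ScaledEmbedding, embed_scale: float):
        self.embed_tokens = embed_tokens
        self.embed_scale = embed_scale

    def embed(self, token_id: int) -> List[float]:
        return [x * self.embed_scale for x in self.embed_tokens(token_id)]


embedding = ScaledEmbedding({0: [1.0, 2.0]}, scale=SCALE)

buggy = TextModel(embedding, embed_scale=SCALE)   # scale applied twice
fixed = TextModel(embedding, embed_scale=1.0)     # scale applied once

print(buggy.embed(0))   # [16.0, 32.0]
print(fixed.embed(0))   # [4.0, 8.0]
```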
License
Model weights: Google Gemma License. Quantization scripts: MIT.