Qwen3.5-122B-A10B-Vision-Mixed-3Bit-MLX
TL;DR: Do not miss the critical deployment information in section (4).
Compressed experts; uncompressed router, attention, and vision (Apple Silicon-optimized mixed quantization).
This model was converted to MLX format from https://huggingface.co/Qwen/Qwen3.5-122B-A10B
(1) Introduction
The Objective: Optimize Qwen3.5-122B for deployment on Apple Silicon, preserving logic and vision capabilities.
The Problem: Standard 4-bit quantization degrades the delicate routing logic of MoE models, while 16-bit and 8-bit versions are too memory-heavy.
The Solution: Mixed Quantization. A surgical approach that compresses the massive expert layers to 5, 4, 3, or even 2-bit to fit inside Unified Memory, while preserving the MoE routers, attention mechanisms, and the entire Vision Tower at their original precision. The 2-bit version is deployable on an M1 Max with 64 GB of RAM at the cost of roughly 2-4% logic degradation (see benchmarks below).
Resulting sizes:
- Mixed 5-bit - 92.35 GB
- Mixed 4-bit - 77.8 GB
- Mixed 3-bit - 63.24 GB
- Mixed 2-bit - 48.69 GB
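As a rule of thumb, pick the largest variant that still leaves headroom for the KV cache, activations, and the OS. A minimal sketch of that lookup, using the sizes above (the 12 GB headroom figure is my own assumption, not a measured value):

```python
# Published weight sizes of the mixed-quant variants, in GB, keyed by bit-width.
SIZES_GB = {5: 92.35, 4: 77.8, 3: 63.24, 2: 48.69}

def best_fit(ram_gb: float, headroom_gb: float = 12.0):
    """Return the highest bit-width whose weights fit in `ram_gb` of Unified
    Memory while leaving `headroom_gb` free for KV cache, activations, and OS.
    Returns None if even the 2-bit variant does not fit."""
    fits = [bits for bits, size in SIZES_GB.items() if size + headroom_gb <= ram_gb]
    return max(fits) if fits else None

print(best_fit(64))   # 64 GB machine -> 2-bit, matching the M1 Max claim above
print(best_fit(128))  # 128 GB machine -> 5-bit
```

With this headroom assumption, a 64 GB machine lands on the 2-bit variant, which agrees with the M1 Max deployment claim in the introduction.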
(2) Architecture
Identifying the Weights: Out of 122B parameters, the expert projections (experts.gate_up_proj or w1/w2/w3) make up roughly 85% of the model's mass.
The Glass Cannons: Layers that must be protected at original weights:
moe.gate / router_bias: If you quantize the router, the model forgets how to choose its experts and hallucinates.
self_attn: Preserving attention layers maintains massive context window recall (crucial for long context agentic coding tasks).
The Vision Tower: Kept intact to preserve high-fidelity image comprehension.
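The protection rules above boil down to a single predicate over parameter paths. This standalone sketch mirrors the filter used in the full script in section (3); the sample paths are illustrative, not the model's exact key names:

```python
def should_quantize(path: str) -> bool:
    """True only for expert projection matrices; everything else keeps full precision."""
    # Protect the routers: quantizing them breaks expert selection.
    if path.endswith(".gate") or "shared_expert_gate" in path:
        return False
    # Crush zone: the expert MLP projection matrices themselves.
    if "switch_mlp" in path or "shared_expert" in path:
        return path.endswith(("gate_proj", "up_proj", "down_proj"))
    # Everything else (attention, vision tower, embeddings, wrappers) is protected.
    return False

# Illustrative paths (hypothetical key names):
assert should_quantize("model.layers.0.mlp.switch_mlp.gate_proj")
assert not should_quantize("model.layers.0.mlp.gate")            # router
assert not should_quantize("model.layers.0.self_attn.q_proj")    # attention
assert not should_quantize("vision_tower.blocks.0.mlp.fc1")      # vision
```

Note the ordering: the `shared_expert_gate` check must come first, because that path also contains the substring `shared_expert` and would otherwise fall into the crush zone.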
(3) The Memory-Safe Assembly Line
Instead of attempting to build the full computation graph, the script lazily maps the model, extracts the raw architecture into a flat dictionary, and evaluates (materializes) the targeted experts sequentially, sharding them to the SSD in 5 GB chunks while aggressively clearing the MLX Metal cache.
```python
import mlx.core as mx
import mlx.nn as nn
from mlx_vlm.utils import load
import json
import os
import gc

model_path = "/Model/Path/Here/Qwen3.5-122B-A10B"
output_path = "/Save/Directory/Here/Qwen-122B-Mixed-4bit"

print("1. Loading Qwen 3.5 lazily...")
model, processor = load(model_path, lazy=True)

# ---------------------------------------------------------
# THE PERFECT FILTER (From Script 2)
# ---------------------------------------------------------
def qwen_surgical_filter(path, module):
    # 1. Protect the Router traffic cops
    if path.endswith(".gate") or "shared_expert_gate" in path:
        return False
    # 2. THE CRUSH ZONE
    # We only want the actual matrices, NOT the SwiGLU wrappers holding them!
    if "switch_mlp" in path or "shared_expert" in path:
        if path.endswith("gate_proj") or path.endswith("up_proj") or path.endswith("down_proj"):
            return True
    # 3. Protect everything else (Vision, Attention, IO, and Wrappers)
    return False

print("2. Applying the MoE crush (4-bit)...")
nn.quantize(model, group_size=64, bits=4, class_predicate=qwen_surgical_filter)

os.makedirs(output_path, exist_ok=True)

print("3. Saving config and processor metadata...")
if hasattr(processor, "save_pretrained"):
    processor.save_pretrained(output_path)

with open(os.path.join(model_path, "config.json"), "r") as f:
    raw_config = json.load(f)
raw_config["quantization"] = {"group_size": 64, "bits": 4}
with open(os.path.join(output_path, "config.json"), "w") as f:
    json.dump(raw_config, f, indent=4)

def flatten_parameters(obj, parent_key="", sep="."):
    items = []
    if isinstance(obj, dict):
        for k, v in obj.items():
            new_key = f"{parent_key}{sep}{k}" if parent_key else str(k)
            items.extend(flatten_parameters(v, new_key, sep=sep).items())
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            new_key = f"{parent_key}{sep}{i}" if parent_key else str(i)
            items.extend(flatten_parameters(v, new_key, sep=sep).items())
    else:
        items.append((parent_key, obj))
    return dict(items)

print("4. Flattening architecture map...")
flat_weights = flatten_parameters(model.parameters())

# ---------------------------------------------------------
# THE MEMORY-SAFE ENGINE
# ---------------------------------------------------------
print("5. Burning the ships (Destroying model tree)...")
del model
del processor
gc.collect()

print("6. Sequentially evaluating and sharding to SSD...")
current_shard = {}
current_shard_size = 0
shard_index = 1
MAX_SHARD_SIZE = 5 * 1024 * 1024 * 1024

flat_weights_keys = list(flat_weights.keys())
total_tensors = len(flat_weights_keys)

for i, name in enumerate(flat_weights_keys):
    tensor = flat_weights.pop(name)
    if not isinstance(tensor, mx.array):
        continue
    mx.eval(tensor)
    current_shard[name] = tensor
    current_shard_size += tensor.nbytes

    if current_shard_size >= MAX_SHARD_SIZE:
        shard_name = f"model-{shard_index:05d}.safetensors"
        print(f" -> Saving {shard_name} ({current_shard_size / (1024**3):.2f} GB) ... [{i+1}/{total_tensors}]")
        mx.save_safetensors(os.path.join(output_path, shard_name), current_shard)
        current_shard.clear()
        current_shard_size = 0
        gc.collect()
        # Use the updated non-deprecated cache clear
        if hasattr(mx, "clear_cache"):
            mx.clear_cache()
        else:
            mx.metal.clear_cache()
        shard_index += 1

if current_shard:
    shard_name = f"model-{shard_index:05d}.safetensors"
    print(f" -> Saving {shard_name} ({current_shard_size / (1024**3):.2f} GB) ... [{total_tensors}/{total_tensors}]")
    mx.save_safetensors(os.path.join(output_path, shard_name), current_shard)
    current_shard.clear()
    gc.collect()
    if hasattr(mx, "clear_cache"):
        mx.clear_cache()
    else:
        mx.metal.clear_cache()

print(f"\nSUCCESS! Custom MoE successfully sharded and saved to {output_path}")
```
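One caveat worth noting (my addition, not part of the original script): Hugging Face-style loaders generally expect a model.safetensors.index.json mapping each tensor name to its shard file. A minimal sketch that builds one, assuming you record `current_shard.keys()` per shard before each `mx.save_safetensors` call:

```python
import json
import os

def write_index(output_path: str, shard_to_keys: dict) -> str:
    """Write model.safetensors.index.json from a {shard_filename: [tensor_names]} map."""
    # Invert the mapping: the index format wants tensor_name -> shard_filename.
    weight_map = {key: shard for shard, keys in shard_to_keys.items() for key in keys}
    index = {"metadata": {}, "weight_map": weight_map}
    index_path = os.path.join(output_path, "model.safetensors.index.json")
    with open(index_path, "w") as f:
        json.dump(index, f, indent=2)
    return index_path
```

In the sharding loop above, you would append `list(current_shard.keys())` under each `shard_name` just before saving, then call `write_index(output_path, shard_to_keys)` at the end.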
(4) Bypassing Framework Bugs
Critical deployment information
Bug 1: The mlx_vlm Key Panic
Issue: The mlx_vlm library searches for Hugging Face string keys (like "experts.gate_up_proj") during its sanitize_weights phase. Because this model is natively quantized into MLX, the key structure changes, causing a fallback to LLM-only mode.
Fix: Locate the utils.py file for mlx_vlm in your client's environment and add `return weights` at the top of the sanitize_weights function to bypass the check entirely.
Bug 2: The PyTorch Vision Trap (already fixed in oMLX)
Issue: Recent Hugging Face updates hardcoded a PyTorch dependency into Qwen2VLImageProcessor. Pure MLX environments crash and fall back to LLM-only mode because they lack PyTorch.
Fix: Downgrade the transformers library to < 5.4.0, or manually inject the PyTorch/Torchvision wheels into the client's bundled Python environment.
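A lightweight preflight check (my addition, not part of either fix) lets a client detect the missing dependency up front and warn the user, instead of silently falling back to LLM-only mode:

```python
import importlib.util

def vision_support_available() -> bool:
    """Recent transformers versions require torch inside Qwen2VLImageProcessor.
    Check for it before loading, so the client can warn the user rather than
    silently degrade to LLM-only mode."""
    return importlib.util.find_spec("torch") is not None

if not vision_support_available():
    print("WARNING: torch not found; image input will be unavailable "
          "(pin transformers < 5.4.0 or install torch/torchvision).")
```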
(5) Real-World Performance on Apple Silicon
Hardware Setup: M5 Max, 128GB Unified Memory.
Baseline: https://huggingface.co/inferencerlabs/Qwen3.5-122B-A10B-MLX-6.5bit
Coding benchmarks:

Model: Qwen3.5-122B-A10B-MLX-6.5bit

| Benchmark | Accuracy | Correct | Total | Time (s) |
|---|---|---|---|---|
| HumanEval | 85.4% | 140 | 164 | 358.4 |
| MBPP | 79.2% | 396 | 500 | 775.9 |
| LiveCodeBench | 50.7% | 534 | 1054 | 11658.4 |

Model: Qwen-122B-Mixed-5bit

| Benchmark | Accuracy | Correct | Total | Time (s) |
|---|---|---|---|---|
| HumanEval | 86.6% | 142 | 164 | 436.1 |
| MBPP | 78.8% | 394 | 500 | 866.4 |
| LiveCodeBench | 50.9% | 537 | 1054 | 14491.9 |

Model: Qwen-122B-Mixed-4bit

| Benchmark | Accuracy | Correct | Total | Time (s) |
|---|---|---|---|---|
| HumanEval | 86.0% | 141 | 164 | 376.5 |
| MBPP | 79.8% | 399 | 500 | 796.0 |
| LiveCodeBench | 50.6% | 533 | 1054 | 13160.8 |

Model: Qwen-122B-Mixed-3bit

| Benchmark | Accuracy | Correct | Total | Time (s) |
|---|---|---|---|---|
| HumanEval | 87.8% | 144 | 164 | 311.5 |
| MBPP | 80.4% | 402 | 500 | 837.4 |
| LiveCodeBench | 50.0% | 527 | 1054 | 13961.0 |

Model: Qwen-122B-Mixed-2bit

| Benchmark | Accuracy | Correct | Total | Time (s) |
|---|---|---|---|---|
| HumanEval | 86.0% | 141 | 164 | 422.8 |
| MBPP | 75.8% | 379 | 500 | 1660.0 |
Other benchmarks:

Model: Qwen3.5-122B-A10B-MLX-6.5bit

| Benchmark | Accuracy | Correct | Total | Time (s) |
|---|---|---|---|---|
| MMLU | 90.0% | 180 | 200 | 233.9 |
| HellaSwag | 93.5% | 187 | 200 | 159.3 |
| TruthfulQA | 91.0% | 273 | 300 | 198.1 |
| ARC-Challenge | 97.3% | 292 | 300 | 178.4 |
| WinoGrande | 82.0% | 246 | 300 | 155.9 |
| GSM8K | 94.0% | 47 | 50 | 209.7 |

Model: Qwen-122B-Mixed-5bit

| Benchmark | Accuracy | Correct | Total | Time (s) |
|---|---|---|---|---|
| MMLU | 90.0% | 180 | 200 | 214.2 |
| HellaSwag | 93.5% | 187 | 200 | 140.0 |
| TruthfulQA | 91.7% | 275 | 300 | 173.7 |
| ARC-Challenge | 97.0% | 291 | 300 | 156.4 |
| WinoGrande | 82.0% | 246 | 300 | 139.2 |
| GSM8K | 94.0% | 47 | 50 | 214.5 |

Model: Qwen-122B-Mixed-4bit

| Benchmark | Accuracy | Correct | Total | Time (s) |
|---|---|---|---|---|
| MMLU | 90.0% | 180 | 200 | 194.6 |
| HellaSwag | 93.5% | 187 | 200 | 128.4 |
| TruthfulQA | 92.7% | 278 | 300 | 155.9 |
| ARC-Challenge | 97.7% | 293 | 300 | 140.6 |
| WinoGrande | 81.0% | 243 | 300 | 125.6 |
| GSM8K | 96.0% | 48 | 50 | 195.1 |

Model: Qwen-122B-Mixed-3bit

| Benchmark | Accuracy | Correct | Total | Time (s) |
|---|---|---|---|---|
| MMLU | 89.5% | 179 | 200 | 194.2 |
| HellaSwag | 94.0% | 188 | 200 | 129.8 |
| TruthfulQA | 91.3% | 274 | 300 | 157.8 |
| ARC-Challenge | 96.7% | 290 | 300 | 142.5 |
| WinoGrande | 79.3% | 238 | 300 | 127.0 |
| GSM8K | 94.0% | 47 | 50 | 181.9 |

Model: Qwen-122B-Mixed-2bit

| Benchmark | Accuracy | Correct | Total | Time (s) |
|---|---|---|---|---|
| MMLU | 89.0% | 178 | 200 | 185.8 |
| HellaSwag | 92.5% | 185 | 200 | 122.9 |
| TruthfulQA | 90.7% | 272 | 300 | 153.1 |
| ARC-Challenge | 97.0% | 291 | 300 | 134.4 |
| WinoGrande | 79.0% | 237 | 300 | 120.6 |
| GSM8K | 92.0% | 46 | 50 | 192.2 |
(6) Conclusion
Mixed MLX quantizations of Qwen3.5-122B that leave the router, attention, and vision layers intact show no statistically significant degradation, even down to the 3-bit format. The 2-bit mixed quantization showed 1-2% degradation on general benchmarks and 2-5% on coding tasks.
Notes:
- Fixed-seed sampling was used on all partial benchmarks.
- I did not run the full LiveCodeBench test on the 2-bit version.
- I did not run the full versions of any non-coding benchmark.
- Test times are not fully reliable; the outliers are due to multitasking on the same machine.
- 6-bit is the largest quantization I can fit in memory; if anyone is able to benchmark these against larger quants, please share the results!