Qwen3.5-122B-A10B-Vision-Mixed-3Bit-MLX

TL;DR: Do not miss the critical deployment information in Section 4.

Compressed experts; uncompressed router, attention, and vision (Apple Silicon-optimized mixed quantization).

This model was converted to MLX format from https://huggingface.co/Qwen/Qwen3.5-122B-A10B


(1) Introduction

The Objective: Optimize Qwen3.5-122B for deployment on Apple Silicon, preserving logic and vision capabilities.

The Problem: Standard 4-bit quantization degrades the delicate routing logic of MoE models, while 16-bit and 8-bit versions are too memory-heavy.

The Solution: Mixed quantization. A surgical approach that compresses the massive expert layers to 5, 4, 3, or even 2-bit to fit inside Unified Memory, while preserving the MoE routers, attention mechanisms, and the entire Vision Tower at their original precision. The 2-bit version is deployable on an M1 Max with 64 GB of RAM at the cost of 2-4% logic degradation. See benchmarks below.

Resulting sizes:

  • Mixed 5-bit - 92.35 GB
  • Mixed 4-bit - 77.8 GB
  • Mixed 3-bit - 63.24 GB
  • Mixed 2-bit - 48.69 GB
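
The sizes above can be sanity-checked with a back-of-envelope estimator. The sketch below is illustrative only: the expert fraction, the per-group metadata overhead, and the 16-bit precision assumed for protected layers are my assumptions, not measured properties of the checkpoint.

```python
def estimate_size_gb(total_params, expert_fraction, expert_bits,
                     group_size=64, protected_bits=16):
    """Rough size of a mixed-quantized checkpoint in GB.

    Assumes affine quantization that stores a 16-bit scale and 16-bit
    bias per group of `group_size` weights (~32/group_size extra bits
    per expert weight), while protected layers stay at 16-bit.
    """
    overhead = 32 / group_size
    expert_params = total_params * expert_fraction
    other_params = total_params - expert_params
    total_bits = (expert_params * (expert_bits + overhead)
                  + other_params * protected_bits)
    return total_bits / 8 / 1024**3

# e.g. 122B parameters with ~90% of the mass in experts at 4-bit
print(f"{estimate_size_gb(122e9, 0.90, 4):.1f} GB")
```

Extra buffers and a different expert fraction shift the totals, which is why the measured sizes above are the numbers that matter.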

(2) Architecture

Identifying the Weight: Out of 122B parameters, the expert projections (experts.gate_up_proj, or w1/w2/w3) account for roughly 85% of the model's mass.

The Glass Cannons: Layers that must be protected at original precision:

  • moe.gate / router_bias: If you quantize the router, the model forgets how to choose its experts and hallucinates.

  • self_attn: Preserving attention layers maintains recall across the full context window (crucial for long-context agentic coding tasks).

  • The Vision Tower: Kept intact to preserve high-fidelity image comprehension.
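
The protection rules above amount to a simple predicate over parameter paths. A minimal sketch (the path patterns mirror the conversion script in Section 3; exact module names vary by checkpoint, and the example paths are hypothetical):

```python
def keep_full_precision(path):
    """True if a parameter path should stay at original precision."""
    # Routers: quantizing these breaks expert selection
    if path.endswith(".gate") or "shared_expert_gate" in path:
        return True
    # Expert matrices inside the MoE blocks: these get crushed
    if "switch_mlp" in path or "shared_expert" in path:
        if path.split(".")[-1] in ("gate_proj", "up_proj", "down_proj"):
            return False
    # Attention, vision tower, embeddings, wrappers: all protected
    return True

print(keep_full_precision("model.layers.0.mlp.gate"))                  # → True (router)
print(keep_full_precision("model.layers.0.mlp.switch_mlp.gate_proj"))  # → False (expert)
print(keep_full_precision("vision_tower.blocks.0.attn.qkv"))           # → True (vision)
```

Note the inversion relative to the script's filter: `nn.quantize` expects a predicate that returns True for layers to *quantize*, whereas this helper answers "should it be protected?".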


(3) The Memory-Safe Assembly Line

Instead of attempting to build the full computation graph, the script lazily maps the model and extracts the raw architecture into a flat dictionary. It then evaluates the targeted experts sequentially, sharding them to the SSD in 5 GB chunks while aggressively clearing the MLX Metal cache.

import mlx.core as mx
import mlx.nn as nn
from mlx_vlm.utils import load
import json
import os
import gc

model_path = "/Model/Path/Here/Qwen3.5-122B-A10B"
output_path = "/Save/Directory/Here/Qwen-122B-Mixed-4bit"

print("1. Loading Qwen 3.5 lazily...")
model, processor = load(model_path, lazy=True)

# ---------------------------------------------------------
# THE SURGICAL FILTER
# ---------------------------------------------------------

def qwen_surgical_filter(path, module):
    # 1. Protect the Router traffic cops
    if path.endswith(".gate") or "shared_expert_gate" in path:
        return False
    
    # 2. THE CRUSH ZONE
    # We only want the actual matrices, NOT the SwiGLU wrappers holding them!
    if "switch_mlp" in path or "shared_expert" in path:
        if path.endswith("gate_proj") or path.endswith("up_proj") or path.endswith("down_proj"):
            return True
            
    # 3. Protect everything else (Vision, Attention, IO, and Wrappers)
    return False

print("2. Applying the MoE crush (4-bit)...")
nn.quantize(model, group_size=64, bits=4, class_predicate=qwen_surgical_filter)

os.makedirs(output_path, exist_ok=True)

print("3. Saving config and processor metadata...")
if hasattr(processor, "save_pretrained"):
    processor.save_pretrained(output_path)

with open(os.path.join(model_path, "config.json"), "r") as f:
    raw_config = json.load(f)

raw_config["quantization"] = {"group_size": 64, "bits": 4}

with open(os.path.join(output_path, "config.json"), "w") as f:
    json.dump(raw_config, f, indent=4)


def flatten_parameters(obj, parent_key='', sep='.'):
    items = []
    if isinstance(obj, dict):
        for k, v in obj.items():
            new_key = f"{parent_key}{sep}{k}" if parent_key else str(k)
            items.extend(flatten_parameters(v, new_key, sep=sep).items())
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            new_key = f"{parent_key}{sep}{i}" if parent_key else str(i)
            items.extend(flatten_parameters(v, new_key, sep=sep).items())
    else:
        items.append((parent_key, obj))
    return dict(items)

print("4. Flattening architecture map...")
flat_weights = flatten_parameters(model.parameters())

# ---------------------------------------------------------
# THE MEMORY-SAFE ENGINE
# ---------------------------------------------------------
print("5. Burning the ships (Destroying model tree)...")
del model
del processor
gc.collect()

print("6. Sequentially evaluating and sharding to SSD...")
current_shard = {}
current_shard_size = 0
shard_index = 1
MAX_SHARD_SIZE = 5 * 1024 * 1024 * 1024

flat_weights_keys = list(flat_weights.keys())
total_tensors = len(flat_weights_keys)

for i, name in enumerate(flat_weights_keys):
    tensor = flat_weights.pop(name)
    
    if not isinstance(tensor, mx.array):
        continue

    mx.eval(tensor)
    
    current_shard[name] = tensor
    current_shard_size += tensor.nbytes
    
    if current_shard_size >= MAX_SHARD_SIZE:
        shard_name = f"model-{shard_index:05d}.safetensors"
        print(f"   -> Saving {shard_name} ({current_shard_size / (1024**3):.2f} GB) ... [{i+1}/{total_tensors}]")
        mx.save_safetensors(os.path.join(output_path, shard_name), current_shard)
        
        current_shard.clear()
        current_shard_size = 0
        gc.collect()
        
        # Use the updated non-deprecated cache clear
        if hasattr(mx, "clear_cache"):
            mx.clear_cache()
        else:
            mx.metal.clear_cache()
            
        shard_index += 1

if current_shard:
    shard_name = f"model-{shard_index:05d}.safetensors"
    print(f"   -> Saving {shard_name} ({current_shard_size / (1024**3):.2f} GB) ... [{total_tensors}/{total_tensors}]")
    mx.save_safetensors(os.path.join(output_path, shard_name), current_shard)
    current_shard.clear()
    gc.collect()
    
    if hasattr(mx, "clear_cache"):
        mx.clear_cache()
    else:
        mx.metal.clear_cache()

print(f"\nSUCCESS! Custom MoE successfully sharded and saved to {output_path}")
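
One thing the script above does not write is a model.safetensors.index.json, which some Hugging Face-style loaders use to map tensor names to shard files. A minimal sketch of building that index from an in-memory record of what went into each shard (the function name and dict shape are my own, not part of the script above):

```python
import json

def build_weight_index(shards):
    """Build an HF-style index from {shard_filename: {tensor_name: nbytes}}."""
    weight_map, total_size = {}, 0
    for shard_name, tensors in shards.items():
        for tensor_name, nbytes in tensors.items():
            weight_map[tensor_name] = shard_name
            total_size += nbytes
    return {"metadata": {"total_size": total_size}, "weight_map": weight_map}

index = build_weight_index({
    "model-00001.safetensors": {"model.layers.0.self_attn.q_proj.weight": 8_388_608},
    "model-00002.safetensors": {"model.layers.1.self_attn.q_proj.weight": 8_388_608},
})
# json.dump(index, open("model.safetensors.index.json", "w"), indent=2)
```

MLX itself typically globs the *.safetensors files directly, so treat this as optional compatibility metadata rather than a required fix.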

(4) Bypassing Framework Bugs

Critical deployment information

Bug 1: The mlx_vlm Key Panic

Issue: The mlx_vlm library searches for Hugging Face string keys (like "experts.gate_up_proj") during its sanitize_weights phase. Because this model is natively quantized in MLX, the key structure differs, so the library falls back to LLM-only mode.

Fix: Locate the utils.py file for mlx_vlm in your client's environment and add return weights to the top of the sanitize_weights function to bypass the check entirely.

Bug 2: The PyTorch Vision Trap (already fixed in oMLX)

Issue: Recent Hugging Face updates hardcoded a PyTorch dependency into Qwen2VLImageProcessor. Pure MLX environments crash and fall back to LLM-only mode because they lack PyTorch.

Fix: Downgrade the transformers library to < 5.4.0, or manually inject the PyTorch/Torchvision wheels into the client's bundled Python environment.


(5) Real-World Performance on Apple Silicon

Hardware Setup: M5 Max, 128GB Unified Memory.

Baseline: https://huggingface.co/inferencerlabs/Qwen3.5-122B-A10B-MLX-6.5bit

Coding benchmarks:

Model: Qwen3.5-122B-A10B-MLX-6.5bit
Benchmark         Accuracy   Correct   Total   Time(s)
------------------------------------------------------
HUMANEVAL            85.4%       140     164     358.4
MBPP                 79.2%       396     500     775.9
LIVECODEBENCH        50.7%       534    1054   11658.4

Model: Qwen-122B-Mixed-5bit
Benchmark         Accuracy   Correct   Total   Time(s)
------------------------------------------------------
HUMANEVAL            86.6%       142     164     436.1
MBPP                 78.8%       394     500     866.4
LIVECODEBENCH        50.9%       537    1054   14491.9

Model: Qwen-122B-Mixed-4bit
Benchmark         Accuracy   Correct   Total   Time(s)
------------------------------------------------------
HUMANEVAL            86.0%       141     164     376.5
MBPP                 79.8%       399     500     796.0
LIVECODEBENCH        50.6%       533    1054   13160.8

Model: Qwen-122B-Mixed-3bit
Benchmark         Accuracy   Correct   Total   Time(s)
------------------------------------------------------
HUMANEVAL            87.8%       144     164     311.5
MBPP                 80.4%       402     500     837.4
LIVECODEBENCH        50.0%       527    1054   13961.0

Model: Qwen-122B-Mixed-2bit
Benchmark         Accuracy   Correct   Total   Time(s)
------------------------------------------------------
HUMANEVAL            86.0%       141     164     422.8
MBPP                 75.8%       379     500    1660.0


Other benchmarks:

Model: Qwen3.5-122B-A10B-MLX-6.5bit
Benchmark         Accuracy   Correct   Total   Time(s)
------------------------------------------------------
MMLU                 90.0%       180     200     233.9
HELLASWAG            93.5%       187     200     159.3
TRUTHFULQA           91.0%       273     300     198.1
ARC_CHALLENGE        97.3%       292     300     178.4
WINOGRANDE           82.0%       246     300     155.9
GSM8K                94.0%        47      50     209.7

Model: Qwen-122B-Mixed-5bit
Benchmark         Accuracy   Correct   Total   Time(s)
------------------------------------------------------
MMLU                 90.0%       180     200     214.2
HELLASWAG            93.5%       187     200     140.0
TRUTHFULQA           91.7%       275     300     173.7
ARC_CHALLENGE        97.0%       291     300     156.4
WINOGRANDE           82.0%       246     300     139.2
GSM8K                94.0%        47      50     214.5

Model: Qwen-122B-Mixed-4bit
Benchmark         Accuracy   Correct   Total   Time(s)
------------------------------------------------------
MMLU                 90.0%       180     200     194.6
HELLASWAG            93.5%       187     200     128.4
TRUTHFULQA           92.7%       278     300     155.9
ARC_CHALLENGE        97.7%       293     300     140.6
WINOGRANDE           81.0%       243     300     125.6
GSM8K                96.0%        48      50     195.1

Model: Qwen-122B-Mixed-3bit
Benchmark         Accuracy   Correct   Total   Time(s)
------------------------------------------------------
MMLU                 89.5%       179     200     194.2
HELLASWAG            94.0%       188     200     129.8
TRUTHFULQA           91.3%       274     300     157.8
ARC_CHALLENGE        96.7%       290     300     142.5
WINOGRANDE           79.3%       238     300     127.0
GSM8K                94.0%        47      50     181.9

Model: Qwen-122B-Mixed-2bit
Benchmark         Accuracy   Correct   Total   Time(s)
------------------------------------------------------
MMLU                 89.0%       178     200     185.8
HELLASWAG            92.5%       185     200     122.9
TRUTHFULQA           90.7%       272     300     153.1
ARC_CHALLENGE        97.0%       291     300     134.4
WINOGRANDE           79.0%       237     300     120.6
GSM8K                92.0%        46      50     192.2

(6) Conclusion

Mixed MLX quantizations of Qwen3.5-122B that leave the router, attention, and vision tower intact show no statistically significant degradation, even down to 3-bit. The 2-bit mixed quantization showed 1-2% degradation on general benchmarks and 2-5% on coding tasks.

Notes: Fixed-seed sampling was used for all partial benchmarks.
I did not run the full LiveCodeBench test on the 2-bit version, and I did not run the full versions of any non-coding benchmark. Test times are not fully comparable; the outliers occurred because I was multitasking. 6-bit is the highest quantization I can fit in memory, so if anyone is able to benchmark these against larger quants, please share the results!
