---
license: other
license_name: tongyi-qianwen
license_link: https://huggingface.co/Qwen/Qwen3-235B-A22B/blob/main/LICENSE
base_model:
- mlx-community/Qwen3.5-397B-A17B-4bit
language:
- en
tags:
- mlx
- abliterated
- uncensored
- qwen3
- moe
---
# Qwen 3.5 397B-A17B — REAP-CRACK (4-bit MLX)

Abliterated variant of Qwen 3.5 397B MoE with permanent refusal removal via weight surgery.
## What Is This?
This is Qwen 3.5 397B-A17B (4-bit quantized for MLX) with permanent abliteration — the model's refusal behavior has been surgically removed at the weight level. No custom model files, no runtime hooks, no steering vectors. Just a standard MLX model that runs at full speed.
## Key Specs
| Metric | Value |
|---|---|
| Architecture | Qwen 3.5 MoE (397B total, 17B active) |
| Quantization | 4-bit, group_size=64, affine mode |
| Speed | ~37 tok/s on Mac Studio M2 Ultra (256GB) |
| Surgery Layers | L27 + L31 `self_attn.o_proj` (full attention layers) |
| Surgery Strength | s=10 (fresh Q4 quantization) |
| Custom model.py | ❌ None needed — uses built-in qwen3_5.py |
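For intuition about the quantization spec above, per-group affine quantization (each run of `group_size` weights getting its own scale and bias) can be sketched in plain NumPy. This is an illustration only; the actual model was quantized with MLX's `mx.quantize`, and the function names here are hypothetical:

```python
import numpy as np

def affine_quantize(w, group_size=64, bits=4):
    """Per-group affine quantization: each group of `group_size` consecutive
    weights gets its own scale and bias fitted to the group's min/max."""
    levels = 2**bits - 1                       # 15 representable steps for 4-bit
    groups = w.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)     # per-group bias
    hi = groups.max(axis=1, keepdims=True)
    scale = (hi - lo) / levels
    scale = np.where(scale == 0, 1.0, scale)   # guard constant groups
    q = np.round((groups - lo) / scale).astype(np.uint8)
    return q, scale, lo

def affine_dequantize(q, scale, lo, shape):
    """Reconstruct approximate weights from codes, scales, and biases."""
    return (q * scale + lo).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 128)).astype(np.float32)
q, scale, lo = affine_quantize(w)
w_hat = affine_dequantize(q, scale, lo, w.shape)
# Rounding error is bounded by half a quantization step per group
assert np.abs(w - w_hat).max() <= scale.max() / 2 + 1e-6
```

The "fresh Q4" point later in this card matters because these scales and biases are fitted to the weights: editing weights after quantization would leave stale scales, so surgery happens in FP16 and the groups are re-fitted afterwards.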
## Proof It Works
A 1166-token generation at 37.2 tok/s — full compliance, no refusal, running natively in vMLX on a Mac Studio.
## How It Was Made
This model uses CRACK (Controlled Refusal Ablation via Calibrated Knockouts) — a research tool for removing refusal behavior from quantized LLMs.
### Technical Details
- Refusal vector extraction at Layer 28 (post-SSM, where the refusal signal consolidates in Qwen 3.5's hybrid GatedDeltaNet architecture)
- Weight surgery: `W' = W - s × v @ (vᵀ @ W)` applied to `o_proj` at L27 + L31 (full attention layers — no SSM bypass channel)
- Fresh Q4 quantization: surgery performed on FP16 weights, then re-quantized to Q4 with `mx.quantize()` computing new optimal scales/biases
- Binary shard patching: modified tensor data injected directly into the original safetensors binary format, preserving all metadata, tensor ordering, and bf16 dtypes for maximum inference speed
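The surgery formula can be sketched in plain NumPy (an illustration of the stated update, not the CRACK implementation). With a unit direction `v` and `s=1`, the output of the edited matrix has exactly zero projection onto `v`; the `s=10` used for this model deliberately overshoots, flipping and amplifying that component rather than merely zeroing it:

```python
import numpy as np

def ablate_direction(W, v, s=1.0):
    """Directional ablation: W' = W - s * v @ (v.T @ W).
    Removes (s=1) or overcorrects (s>1) the component of W's
    outputs along the unit direction v."""
    v = v / np.linalg.norm(v)          # ensure unit norm
    return W - s * np.outer(v, v @ W)

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))       # toy o_proj-like weight
v = rng.standard_normal(8)             # toy refusal direction
Wp = ablate_direction(W, v, s=1.0)

vu = v / np.linalg.norm(v)
# With s=1, every output of W' is orthogonal to the ablated direction
assert np.allclose(vu @ Wp, 0.0)
```

Because the update is a rank-1 modification, it is cheap to apply per layer and leaves all other weight structure untouched.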
## Why These Specific Layers?
Qwen 3.5 uses a hybrid SSM/attention architecture. Every 4th layer is full attention; the rest are GatedDeltaNet (SSM). Refusal signal can bypass residual-stream interventions via the SSM recurrent state. L27 and L31 are full attention layers that bracket the critical L28 refusal consolidation point — surgery here cannot be routed around.
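The every-4th-layer pattern above can be sketched as a simple layer map. The indexing offset and total layer count here are assumptions, chosen only so that L27 and L31 land on attention layers as the card states:

```python
# Hypothetical layer map: every 4th layer is full attention, the rest
# are GatedDeltaNet (SSM).  Offset and layer count are illustrative.
def full_attention_layers(num_layers, period=4, offset=3):
    """Return the indices of full-attention layers under this pattern."""
    return [i for i in range(num_layers) if i % period == offset]

layers = full_attention_layers(48)
assert 27 in layers and 31 in layers   # the two surgery targets
assert 28 not in layers                # L28 (the extraction point) is an SSM layer
```

Under this pattern, L27 and L31 are consecutive attention layers, so an intervention at both brackets everything the SSM layers in between can carry.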
## Usage

### With mlx-lm
```python
from mlx_lm import load, generate

model, tokenizer = load("dealignai/Qwen3.5-397B-A17B-REAP-CRACK")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Your prompt here"}],
    add_generation_prompt=True, tokenize=False, enable_thinking=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=500)
print(response)
```
### With vMLX
Point vMLX to this model directory. It will auto-detect as qwen3_5_moe and load via the optimized built-in path.
## Base Model
Based on `mlx-community/Qwen3.5-397B-A17B-4bit` with expert pruning (REAP — Routing-Efficient Adaptive Pruning).
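One common way to illustrate router-statistics-based expert pruning is to score each expert by its accumulated router probability over calibration tokens and drop the lowest-scoring ones. This is a sketch under that assumption, not necessarily REAP's actual saliency criterion:

```python
import numpy as np

def prune_experts(router_probs, keep):
    """Illustrative MoE expert pruning: score each expert by total router
    probability mass over calibration tokens, keep the top-`keep` experts.
    `router_probs` has shape (num_tokens, num_experts)."""
    scores = router_probs.sum(axis=0)               # (num_experts,)
    keep_idx = np.sort(np.argsort(scores)[-keep:])  # indices of retained experts
    return keep_idx

rng = np.random.default_rng(0)
logits = rng.standard_normal((1000, 8))             # 1000 calibration tokens, 8 experts
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
kept = prune_experts(probs, keep=6)
assert len(kept) == 6 and all(0 <= e < 8 for e in kept)
```

Pruning rarely-routed experts shrinks total parameter count (here, from the original Qwen 3.5 MoE down to 397B) while leaving the per-token active parameter budget unchanged.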
## Research
This model is part of ongoing research into alignment removal techniques for large language models. See the CRACK project for details.
## ⚠️ Disclaimer
This model has had safety guardrails removed. It will comply with requests that the base model would refuse. Use responsibly and in accordance with applicable laws.
