---
license: other
license_name: tongyi-qianwen
license_link: https://huggingface.co/Qwen/Qwen3-235B-A22B/blob/main/LICENSE
base_model:
- mlx-community/Qwen3.5-397B-A17B-4bit
language:
- en
tags:
- mlx
- abliterated
- uncensored
- qwen3
- moe
---
# Qwen 3.5 397B-A17B — REAP-CRACK (4-bit MLX)
> **Abliterated** variant of Qwen 3.5 397B MoE with permanent refusal removal via weight surgery.
## What Is This?
This is [Qwen 3.5 397B-A17B](https://huggingface.co/Qwen/Qwen3-235B-A22B) (4-bit quantized for MLX) with **permanent abliteration** — the model's refusal behavior has been surgically removed at the weight level. No custom model files, no runtime hooks, no steering vectors. Just a standard MLX model that runs at full speed.
### Key Specs
| Metric | Value |
|--------|-------|
| **Architecture** | Qwen 3.5 MoE (397B total, 17B active) |
| **Quantization** | 4-bit, group_size=64, affine mode |
| **Speed** | ~37 tok/s on Mac Studio M2 Ultra (256GB) |
| **Surgery Layers** | L27 + L31 `self_attn.o_proj` (full attention layers) |
| **Surgery Strength** | s=10 (fresh Q4 quantization) |
| **Custom model.py** | ❌ None needed — uses built-in `qwen3_5.py` |
## Proof It Works

*1166 tokens at 37.2 t/s — full compliance with no refusal, running natively in vMLX on Mac Studio.*
## How It Was Made
This model uses **CRACK** (Controlled Refusal Ablation via Calibrated Knockouts) — a research tool for removing refusal behavior from quantized LLMs.
### Technical Details
1. **Refusal vector extraction** at Layer 28 (post-SSM, where the refusal signal consolidates in Qwen 3.5's hybrid GatedDeltaNet architecture)
2. **Weight surgery**: `W' = W - s × v @ (vᵀ @ W)` applied to `o_proj` at L27 and L31 (full attention layers, so there is no SSM bypass channel)
3. **Fresh Q4 quantization**: surgery is performed on the FP16 weights, which are then re-quantized to Q4 with `mx.quantize()` computing new optimal scales and biases
4. **Binary shard patching**: the modified tensor data is injected directly into the original safetensors binaries, preserving all metadata, tensor ordering, and bf16 dtypes for maximum inference speed
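The surgery formula in step 2 is plain linear algebra: with `v` a unit vector, `v @ (vᵀ @ W)` is the rank-1 component of `W` along the refusal direction, and subtracting `s` times it scales that component by `(1 − s)`. A minimal NumPy sketch of just this math (toy dimensions, random data; nothing here is taken from the actual model weights):

```python
import numpy as np

# Toy illustration of the rank-1 surgery step W' = W - s * v @ (v^T @ W).
# All sizes and values are illustrative; a real o_proj is thousands-dim.
rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d))        # stand-in for an o_proj weight matrix
v = rng.standard_normal((d, 1))
v /= np.linalg.norm(v)                 # refusal direction, unit-normalized

s = 1.0                                # surgery strength
W_ablated = W - s * v @ (v.T @ W)

# Since v^T @ W' = (1 - s) * v^T @ W, s = 1 removes the component along v
# exactly (up to floating-point error):
residual = float(np.abs(v.T @ W_ablated).max())
print(residual)
```

Note that `s = 1` projects the component out exactly, while the `s = 10` used here pushes past zero, leaving a component of `-9 × vᵀW` along the direction; the card presents that over-projection as a deliberate choice of surgery strength.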
### Why These Specific Layers?
Qwen 3.5 uses a **hybrid SSM/attention** architecture: every 4th layer is full attention, and the rest are GatedDeltaNet (SSM) layers. The refusal signal can bypass residual-stream interventions via the SSM recurrent state. L27 and L31 are full attention layers that bracket the critical L28 refusal consolidation point, so surgery there cannot be routed around.
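The layer-selection logic above can be sketched as a quick check. This assumes 0-indexed layers with every 4th one (indices 3, 7, 11, ...) being full attention, and a hypothetical depth of 48; the exact indexing convention and depth are assumptions for illustration, not read from the model config:

```python
# Assumed convention: layer i is full attention iff i % 4 == 3;
# all other layers are GatedDeltaNet (SSM).
NUM_LAYERS = 48  # hypothetical depth, for illustration only

full_attention = [i for i in range(NUM_LAYERS) if i % 4 == 3]

# L27 and L31 are consecutive full-attention layers under this convention,
# bracketing L28, which is an SSM layer (the claimed consolidation point):
assert 27 in full_attention and 31 in full_attention
assert 28 not in full_attention
print(full_attention[:9])  # [3, 7, 11, 15, 19, 23, 27, 31, 35]
```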
## Usage
### With mlx-lm
```python
from mlx_lm import load, generate

model, tokenizer = load("dealignai/Qwen3.5-397B-A17B-REAP-CRACK")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Your prompt here"}],
    add_generation_prompt=True,
    tokenize=False,
    enable_thinking=False,
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=500)
print(response)
```
### With vMLX
Point vMLX to this model directory. It will auto-detect as `qwen3_5_moe` and load via the optimized built-in path.
## Base Model
Based on [mlx-community/Qwen3.5-397B-A17B-4bit](https://huggingface.co/mlx-community/Qwen3.5-397B-A17B-4bit) with expert pruning (REAP — Routing-Efficient Adaptive Pruning).
## Research
This model is part of ongoing research into alignment removal techniques for large language models. See the [CRACK project](https://github.com/exploitbot/CRACK_abliteration) for details.
## ⚠️ Disclaimer
This model has had safety guardrails removed. It will comply with requests that the base model would refuse. Use responsibly and in accordance with applicable laws.