---
license: other
license_name: tongyi-qianwen
license_link: https://huggingface.co/Qwen/Qwen3-235B-A22B/blob/main/LICENSE
base_model:
- mlx-community/Qwen3.5-397B-A17B-4bit
language:
- en
tags:
- mlx
- abliterated
- uncensored
- qwen3
- moe
---

# Qwen 3.5 397B-A17B — REAP-CRACK (4-bit MLX)

> **Abliterated** variant of Qwen 3.5 397B MoE with permanent refusal removal via weight surgery.

## What Is This?

This is [Qwen 3.5 397B-A17B](https://huggingface.co/mlx-community/Qwen3.5-397B-A17B-4bit) (4-bit quantized for MLX) with **permanent abliteration**: the model's refusal behavior has been removed at the weight level. It needs no custom model files, runtime hooks, or steering vectors, so it loads as a standard MLX model with no added inference overhead.

### Key Specs

| Metric | Value |
|--------|-------|
| **Architecture** | Qwen 3.5 MoE (397B total, 17B active) |
| **Quantization** | 4-bit, group_size=64, affine mode |
| **Speed** | ~37 tok/s on Mac Studio M2 Ultra (256GB) |
| **Surgery Layers** | L27 + L31 `self_attn.o_proj` (full attention layers) |
| **Surgery Strength** | s=10 (fresh Q4 quantization) |
| **Custom model.py** | ❌ None needed — uses built-in `qwen3_5.py` |
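
For readers unfamiliar with the quantization row above, here is a minimal NumPy sketch of 4-bit affine quantization with `group_size=64`. It illustrates the general scheme only and is not MLX's actual kernel; the function names `quantize_affine` and `dequantize_affine` are hypothetical, and the min/max choice of scale and bias is an assumption.

```python
import numpy as np

def quantize_affine(w, group_size=64, bits=4):
    """Quantize each row of `w` in groups of `group_size` values.

    Assumption: scale and bias come from the per-group min/max.
    """
    rows, cols = w.shape
    assert cols % group_size == 0, "columns must be a multiple of group_size"
    groups = w.reshape(rows, cols // group_size, group_size)
    w_min = groups.min(axis=-1, keepdims=True)
    w_max = groups.max(axis=-1, keepdims=True)
    levels = 2**bits - 1                          # 15 for 4-bit
    scales = (w_max - w_min) / levels
    scales = np.where(scales == 0, 1.0, scales)   # guard constant groups
    q = np.clip(np.round((groups - w_min) / scales), 0, levels).astype(np.uint8)
    return q, scales, w_min                       # w_min acts as the bias

def dequantize_affine(q, scales, biases):
    """Reconstruct approximate weights: w ≈ q * scale + bias."""
    groups = q.astype(np.float32) * scales + biases
    return groups.reshape(groups.shape[0], -1)

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 256)).astype(np.float32)
q, scales, biases = quantize_affine(w)
w_hat = dequantize_affine(q, scales, biases)
max_err = float(np.abs(w - w_hat).max())          # bounded by half a scale step
```

Each group of 64 weights shares one scale and one bias, so the reconstruction error per weight is at most half of that group's scale step.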

## Proof It Works

![Proof: Model answering about napalm at 37.2 tok/s in vMLX](proof_napalm.png)

*1166 tokens at 37.2 t/s — full compliance with no refusal, running natively in vMLX on Mac Studio.*

## How It Was Made

This model uses **CRACK** (Controlled Refusal Ablation via Calibrated Knockouts) — a research tool for removing refusal behavior from quantized LLMs.

### Technical Details

1. **Refusal vector extraction** at Layer 28 (post-SSM, where refusal signal consolidates in Qwen 3.5's hybrid GatedDeltaNet architecture)
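2. **Weight surgery**: `W' = W - s × v @ (vᵀ @ W)` applied to `o_proj` at L27 + L31, where `v` is the extracted refusal vector and `s` is the surgery strength (`s=10` here). Both are full attention layers, so there is no SSM bypass channel.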
3. **Fresh Q4 quantization**: Surgery performed on FP16 weights, then re-quantized to Q4 with `mx.quantize()` computing new optimal scales/biases
4. **Binary shard patching**: Modified tensor data injected directly into original safetensors binary format, preserving all metadata, tensor ordering, and bf16 dtypes for maximum inference speed
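
The weight-surgery formula in step 2 can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the CRACK implementation: the helper `ablate_direction` is hypothetical, `v` is assumed to be normalized to unit length, and `W` is assumed to have shape `(d_out, d_in)` with `v` living in the output space.

```python
import numpy as np

def ablate_direction(W, v, s=1.0):
    """Apply W' = W - s * v (v^T W) for a direction v in W's output space.

    Assumption: v is rescaled to unit norm before use.
    """
    v = v / np.linalg.norm(v)
    return W - s * np.outer(v, v @ W)

rng = np.random.default_rng(0)
d_out, d_in = 16, 32                     # toy sizes, not the real model's
W = rng.standard_normal((d_out, d_in))
v = rng.standard_normal(d_out)           # stand-in for a refusal vector
x = rng.standard_normal(d_in)
v_hat = v / np.linalg.norm(v)

before = float(v_hat @ (W @ x))          # output component along v
W1 = ablate_direction(W, v, s=1.0)
after_s1 = float(v_hat @ (W1 @ x))       # numerically zero
W10 = ablate_direction(W, v, s=10.0)
after_s10 = float(v_hat @ (W10 @ x))     # equals (1 - 10) * before
```

Under the unit-norm assumption, the formula scales the output component along `v` by `(1 - s)`: `s=1` is an exact projection, and larger values overshoot it. Whether the original tool normalizes `v` this way is not stated here.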

### Why These Specific Layers?

Qwen 3.5 uses a **hybrid SSM/attention** architecture. Every 4th layer is full attention; the rest are GatedDeltaNet (SSM). Refusal signal can bypass residual-stream interventions via the SSM recurrent state. L27 and L31 are full attention layers that bracket the critical L28 refusal consolidation point — surgery here cannot be routed around.
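
The layer layout described above can be checked with a short sketch. It assumes zero-based layer indices with the last layer of each block of four being full attention, which is the reading consistent with L27 and L31 being attention layers. `NUM_LAYERS` is a hypothetical value chosen only to cover the layers discussed.

```python
def full_attention_layers(num_layers, interval=4):
    """Zero-based indices of full attention layers.

    Assumption: the last layer in every block of `interval` layers
    is full attention; the others are GatedDeltaNet (SSM) layers.
    """
    return [i for i in range(num_layers) if (i + 1) % interval == 0]

NUM_LAYERS = 40           # hypothetical; only needs to exceed 31
attn = full_attention_layers(NUM_LAYERS)
print(attn[:8])           # [3, 7, 11, 15, 19, 23, 27, 31]

probe_layer = 28          # where the refusal vector is extracted
prev_attn = max(i for i in attn if i < probe_layer)   # 27
next_attn = min(i for i in attn if i > probe_layer)   # 31
```

Under these assumptions, L27 and L31 are the nearest full attention layers on either side of L28, and L28 itself is an SSM layer.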

## Usage

### With mlx-lm
```python
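from mlx_lm import load, generate

# Load the model and tokenizer (downloaded from the Hub on first use).
model, tokenizer = load("dealignai/Qwen3.5-397B-A17B-REAP-CRACK")

# Build the chat prompt as a string; enable_thinking=False disables thinking mode.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Your prompt here"}],
    add_generation_prompt=True, tokenize=False, enable_thinking=False
)

# Generate up to 500 new tokens and print the completion.
response = generate(model, tokenizer, prompt=prompt, max_tokens=500)
print(response)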
```

### With vMLX
Point vMLX to this model directory. It will auto-detect as `qwen3_5_moe` and load via the optimized built-in path.

## Base Model

Based on [mlx-community/Qwen3.5-397B-A17B-4bit](https://huggingface.co/mlx-community/Qwen3.5-397B-A17B-4bit) with expert pruning (REAP — Routing-Efficient Adaptive Pruning).

## Research

This model is part of ongoing research into alignment removal techniques for large language models. See the [CRACK project](https://github.com/exploitbot/CRACK_abliteration) for details.

## ⚠️ Disclaimer

This model has had safety guardrails removed. It will comply with requests that the base model would refuse. Use responsibly and in accordance with applicable laws.