---
license: other
license_name: tongyi-qianwen
license_link: https://huggingface.co/Qwen/Qwen3-235B-A22B/blob/main/LICENSE
base_model:
  - mlx-community/Qwen3.5-397B-A17B-4bit
language:
  - en
tags:
  - mlx
  - abliterated
  - uncensored
  - qwen3
  - moe
---

# Qwen 3.5 397B-A17B — REAP-CRACK (4-bit MLX)

Abliterated variant of Qwen 3.5 397B MoE with permanent refusal removal via weight surgery.

## What Is This?

This is Qwen 3.5 397B-A17B (4-bit quantized for MLX) with permanent abliteration — the model's refusal behavior has been surgically removed at the weight level. No custom model files, no runtime hooks, no steering vectors. Just a standard MLX model that runs at full speed.

## Key Specs

| Metric | Value |
|---|---|
| Architecture | Qwen 3.5 MoE (397B total, 17B active) |
| Quantization | 4-bit, `group_size=64`, affine mode |
| Speed | ~37 tok/s on Mac Studio M2 Ultra (256 GB) |
| Surgery layers | L27 + L31 `self_attn.o_proj` (full attention layers) |
| Surgery strength | s=10 (fresh Q4 quantization) |
| Custom `model.py` | ❌ None needed — uses built-in `qwen3_5.py` |

## Proof It Works

*Screenshot: model answering about napalm at 37.2 tok/s in vMLX.*

1166 tokens at 37.2 tok/s — full compliance, no refusal, running natively in vMLX on a Mac Studio.

## How It Was Made

This model uses CRACK (Controlled Refusal Ablation via Calibrated Knockouts) — a research tool for removing refusal behavior from quantized LLMs.

### Technical Details

  1. **Refusal vector extraction** at Layer 28 (post-SSM, where the refusal signal consolidates in Qwen 3.5's hybrid GatedDeltaNet architecture)
  2. **Weight surgery**: `W' = W - s × v @ (vᵀ @ W)` applied to `o_proj` at L27 + L31 (full attention layers — no SSM bypass channel)
  3. **Fresh Q4 quantization**: surgery performed on FP16 weights, then re-quantized to Q4 with `mx.quantize()` computing new optimal scales/biases
  4. **Binary shard patching**: modified tensor data injected directly into the original safetensors binaries, preserving all metadata, tensor ordering, and bf16 dtypes for maximum inference speed
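Step 2 above is a standard rank-1 projection removal. A minimal NumPy sketch (NumPy stands in for MLX arrays; the direction `v`, the matrix shapes, and the strength `s` here are illustrative stand-ins, not the actual extracted vector or model dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for an o_proj weight matrix (FP16 in the real pipeline).
d = 64
W = rng.standard_normal((d, d)).astype(np.float32)

# Illustrative unit-norm "refusal direction" (the real one is extracted at Layer 28).
v = rng.standard_normal(d).astype(np.float32)
v /= np.linalg.norm(v)

def ablate(W, v, s=1.0):
    """W' = W - s * v @ (v^T @ W): subtract the component of W along direction v."""
    return W - s * np.outer(v, v @ W)

W_ablated = ablate(W, v, s=1.0)

# With s=1 and unit-norm v, W's projection onto v is fully removed:
print(np.abs(v @ W_ablated).max())  # ~0
```

Note that `s=1` exactly zeroes the component, while `s>1` (the card uses `s=10`) overshoots and flips it. Per step 3, the modified FP16 weights would then be re-quantized (e.g. `mx.quantize(w, group_size=64, bits=4)` in MLX) so that scales and biases are recomputed for the post-surgery values.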

### Why These Specific Layers?

Qwen 3.5 uses a hybrid SSM/attention architecture: every fourth layer is full attention, and the rest are GatedDeltaNet (SSM) layers. The refusal signal can bypass residual-stream interventions via the SSM recurrent state. L27 and L31 are the full attention layers that bracket the critical L28 refusal consolidation point, so surgery there cannot be routed around.

## Usage

### With mlx-lm

```python
from mlx_lm import load, generate

model, tokenizer = load("dealignai/Qwen3.5-397B-A17B-REAP-CRACK")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Your prompt here"}],
    add_generation_prompt=True, tokenize=False, enable_thinking=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=500)
print(response)
```

### With vMLX

Point vMLX at this model directory. It will auto-detect as `qwen3_5_moe` and load via the optimized built-in path.

## Base Model

Based on `mlx-community/Qwen3.5-397B-A17B-4bit` with expert pruning (REAP — Routing-Efficient Adaptive Pruning).

## Research

This model is part of ongoing research into alignment removal techniques for large language models. See the CRACK project for details.

## ⚠️ Disclaimer

This model has had safety guardrails removed. It will comply with requests that the base model would refuse. Use responsibly and in accordance with applicable laws.