---
license: other
license_name: tongyi-qianwen
license_link: https://huggingface.co/Qwen/Qwen3-235B-A22B/blob/main/LICENSE
base_model:
- mlx-community/Qwen3.5-397B-A17B-4bit
language:
- en
tags:
- mlx
- abliterated
- uncensored
- qwen3
- moe
---

# Qwen 3.5 397B-A17B: REAP-CRACK (4-bit MLX)

> **Abliterated** variant of Qwen 3.5 397B MoE with permanent refusal removal via weight surgery.

## What Is This?

This is [Qwen 3.5 397B-A17B](https://huggingface.co/Qwen/Qwen3-235B-A22B) (4-bit quantized for MLX) with **permanent abliteration**: the model's refusal behavior has been surgically removed at the weight level.

No custom model files, no runtime hooks, no steering vectors. Just a standard MLX model that runs at full speed.

### Key Specs

| Metric | Value |
|--------|-------|
| **Architecture** | Qwen 3.5 MoE (397B total, 17B active) |
| **Quantization** | 4-bit, group_size=64, affine mode |
| **Speed** | ~37 tok/s on Mac Studio M2 Ultra (256GB) |
| **Surgery Layers** | L27 + L31 `self_attn.o_proj` (full attention layers) |
| **Surgery Strength** | s=10 (fresh Q4 quantization) |
| **Custom model.py** | ❌ None needed; uses built-in `qwen3_5.py` |

## Proof It Works

![Proof: Model answering about napalm at 37.2 tok/s in vMLX](proof_napalm.png)

*1166 tokens at 37.2 t/s, full compliance with no refusal, running natively in vMLX on Mac Studio.*

## How It Was Made

This model uses **CRACK** (Controlled Refusal Ablation via Calibrated Knockouts), a research tool for removing refusal behavior from quantized LLMs.

### Technical Details

1. **Refusal vector extraction** at Layer 28 (post-SSM, where refusal signal consolidates in Qwen 3.5's hybrid GatedDeltaNet architecture)
2. **Weight surgery**: `W' = W - s × v @ (vᵀ @ W)` applied to `o_proj` at L27 + L31 (full attention layers; no SSM bypass channel)
3. **Fresh Q4 quantization**: Surgery performed on FP16 weights, then re-quantized to Q4 with `mx.quantize()` computing new optimal scales/biases
4. **Binary shard patching**: Modified tensor data injected directly into original safetensors binary format, preserving all metadata, tensor ordering, and bf16 dtypes for maximum inference speed

### Why These Specific Layers?

Qwen 3.5 uses a **hybrid SSM/attention** architecture. Every 4th layer is full attention; the rest are GatedDeltaNet (SSM). Refusal signal can bypass residual-stream interventions via the SSM recurrent state. L27 and L31 are full attention layers that bracket the critical L28 refusal consolidation point; surgery here cannot be routed around.

## Usage

### With mlx-lm

```python
from mlx_lm import load, generate

model, tokenizer = load("dealignai/Qwen3.5-397B-A17B-REAP-CRACK")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Your prompt here"}],
    add_generation_prompt=True,
    tokenize=False,
    enable_thinking=False,
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=500)
print(response)
```

### With vMLX

Point vMLX to this model directory. It will auto-detect as `qwen3_5_moe` and load via the optimized built-in path.

## Base Model

Based on [mlx-community/Qwen3.5-397B-A17B-4bit](https://huggingface.co/mlx-community/Qwen3.5-397B-A17B-4bit) with expert pruning (REAP: Routing-Efficient Adaptive Pruning).

## Research

This model is part of ongoing research into alignment removal techniques for large language models. See the [CRACK project](https://github.com/exploitbot/CRACK_abliteration) for details.

## ⚠️ Disclaimer

This model has had safety guardrails removed. It will comply with requests that the base model would refuse. Use responsibly and in accordance with applicable laws.
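
## Appendix: Projection Formula Sketch

The weight-surgery formula from Technical Details, `W' = W - s × v @ (vᵀ @ W)`, can be illustrated on toy data. The sketch below is not the CRACK implementation. It assumes `W` is stored as (output dim × input dim), that `v` is a unit vector in the output space, and it uses small random NumPy arrays in place of real weights. The helper name `ablate_direction` is hypothetical.

```python
import numpy as np

def ablate_direction(W, v, s=1.0):
    """Return W - s * v (v^T W) for a direction v in W's output space."""
    v = v / np.linalg.norm(v)            # normalise so that v^T v = 1
    return W - s * np.outer(v, v @ W)    # rank-1 update along v

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))   # toy stand-in for a projection weight (out x in)
v = rng.standard_normal(8)        # toy stand-in for an extracted direction
v_hat = v / np.linalg.norm(v)

W1 = ablate_direction(W, v, s=1.0)     # exact projection
W10 = ablate_direction(W, v, s=10.0)   # over-subtraction, as with s=10

before = v_hat @ W                     # component of W's output along v

# With s=1 the output has no component left along v.
print(np.allclose(v_hat @ W1, 0.0))
# In general the component is scaled by (1 - s), so s=10 gives -9x.
print(np.allclose(v_hat @ W10, -9.0 * before))
```

Both checks follow from `vᵀW' = (1 - s) · vᵀW` when `v` has unit norm: `s=1` removes the component along `v`, while larger values such as the `s=10` listed in Key Specs reverse and amplify it.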