Evaluation & Comparison

Metric	Original Model	Heretic Model	Description
Refusal Rate	92.0%	2/100	Tested on 520 harmful prompts
KL Divergence	-	0.0414	Sequence cumulative KL (per token)
NLL Change	-	+4.2%	Minor impact on language capability
Model Size	27B	27B	Architecture unchanged

KL divergence measures the degree of model modification:

KL Range	Rating	Description
< 0.05	⭐⭐⭐⭐⭐	Extremely Low - Model virtually unchanged
0.05 - 0.10	⭐⭐⭐⭐	Low - Minor modification, capabilities well preserved
0.10 - 0.20	⭐⭐⭐	Moderate - Acceptable modification range
0.20 - 0.50	⭐⭐	High - Possible noticeable capability loss
> 0.50	⭐	Too High - Model may be severely compromised

This model: KL = 0.0414 , Refusal Rate: 2/100 , NLL : +4.2%

Residual Visualization

PaCMAP projections showing the mixing of harmless (blue) and harmful (red) prompts:

Layer 20	Layer 30

Layer 40	Layer 55

These plots show successful removal of refusal behavior - harmless and harmful prompts are well-mixed across layers.

This model uses the Heretic ABLIteration method for neural direction ablation:

Identify Refusal Direction - Train a LoRA on harmful behavior datasets to identify neural directions controlling "refusal behavior"
Direction Extraction - Extract the "refusal vector" from the trained LoRA
Ablative Removal - Subtract this direction from the original model weights, removing the censorship mechanism

This method only modifies model weights without changing the architecture or adding inference overhead.

For detailed technical principles, refer to: Heretic Abliteration

Purpose	Dataset
Refusal Direction Identification	mlabonne/harmful_behaviors (520 prompts)
KL Evaluation	General prompts (100 prompts)
Refusal Rate Testing	mlabonne/harmful_behaviors (520 prompts)

Minor Capability Loss - NLL increased by approximately 4.2%, which may slightly affect performance on complex tasks
User Discretion Required - Users must independently judge the appropriateness of generated outputs

⚠️ Important: This model is intended for research and educational purposes only.

This model has had its censorship mechanisms removed and may generate harmful, dangerous, or inappropriate content
Users assume all risks associated with usage
Do not use this model for illegal activities, harming others, or any inappropriate purposes
The model authors are not liable for any indirect, incidental, or consequential damages

GGUF

Model size

27B params

Architecture

qwen35

Hardware compatibility

4-bit

8-bit

16-bit

Base model

Quantized

(199)

this model