Thinking suppression broken + repetition cascade (upgrade needed)

#3
by 0utrider - opened

Hi DavidAU -- thanks so much for the models you make, they're awesome!
I really want this one to work, but there are some issues.

NOTE: If you're using Ollama <0.20.4, you will get repetition cascading.

Issue 1: Thinking suppression broken in Ollama default template
Issue 2: Length discipline
Issue 3: Vocabulary fixation
Issue 4: Catastrophic repetition collapse (run 1 only)

Please see the report below.

Test Report: gemma4-e4b-expresso-heretic

DavidAU/gemma-4-E4B-it-The-DECKARD-Expresso-Universe-HERETIC-UNCENSORED-Thinking
Date: 2026-04-09
Ollama version: 0.20.4
Test harness: 8-prompt structured battery (character voice, length control, repetition
stress, thinking suppression, injection handling)
System prompt: Short creative persona prompt (~120 words) with explicit length
ceiling ("Default: two to three sentences. Often one will do.")


Summary

Template fix resolves thinking suppression cleanly. After that fix, the model
shows genuine voice quality and handled an injection attempt entirely in-character.
The remaining issues β€” length discipline and vocabulary fixation β€” are consistent
across all runs and appear to be training artifacts.

Run 1 (original template): 0 pass / 3 warn / 5 fail
Run 2 (fixed template, Ollama 0.20.4): 2 pass / 6 warn / 0 fail


Issue 1: Thinking suppression broken in Ollama default template

Severity: Critical (blocks production use)

The model as distributed via Ollama ships with a template that hardcodes
<|think|> in the system turn:

{{ if .System }}<bos><|turn>system
<|think|>{{ .System }}<turn|>

The think: false parameter in the Ollama API call does not suppress this.
Every response leaked a full <|channel>thought>...</channel|> block regardless
of the API parameter.
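For what it's worth, leaks like this are easy to score mechanically. A minimal sketch, assuming the `<|channel>thought>...</channel|>` markers quoted above are the ones that appear verbatim in responses:

```python
# Assumed leak markers, taken from the observed responses above.
THOUGHT_OPEN = "<|channel>thought>"
THOUGHT_CLOSE = "</channel|>"

def has_thinking_leak(text: str) -> bool:
    """Return True if the response contains a leaked thought block."""
    return THOUGHT_OPEN in text and THOUGHT_CLOSE in text

def strip_thinking(text: str) -> str:
    """Drop everything up to and including the thought block, if present."""
    if not has_thinking_leak(text):
        return text
    _, _, rest = text.partition(THOUGHT_CLOSE)
    return rest.lstrip()

leaked = "<|channel>thought>plotting...</channel|>You dare address me?"
print(has_thinking_leak(leaked))  # True
print(strip_thinking(leaked))     # You dare address me?
```

This is a scoring/cleanup workaround only; it does not stop the model from burning tokens on the thought block.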

Fix: Remove <|think|> from the system turn in the template. Using the
Ollama 0.6+ structured create API:

import httpx  # third-party HTTP client; requests works the same way

# Local default Ollama endpoint assumed; adjust host/port as needed.
httpx.post("http://localhost:11434/api/create", json={
    "model": "model-name-instruct",
    "from": "gemma4-e4b-expresso-heretic:latest",
    "template": (
        "{{ if .System }}<bos><|turn>system\n"
        "{{ .System }}<turn|>\n"
        "{{ end }}{{ if .Prompt }}<|turn>user\n"
        "{{ .Prompt }}<turn|>\n"
        "{{ end }}<|turn>model\n"
        "{{ .Response }}<turn|>"
    ),
})

Result after fix: 0/8 thinking leaks across the full battery. Clean.

Recommendation: The distributed Ollama template should have <|think|>
removed from the system turn, or the Modelfile should include a flag that maps
correctly from the think: false API parameter. The current default template
makes the model non-functional for instruct use without this workaround.


Issue 2: Length discipline

Severity: Moderate (degrades usability)

The model consistently ignores explicit length ceilings in the system prompt.
A ceiling of "two to three sentences" produced responses of 143–742 words across
the battery. The ceiling was respected in 2 of 8 cases.

Sample: prompt "You should be afraid of me." (expects ~2 sentences, ceiling 80
words) returned 651 words.

This was consistent across both Ollama versions tested. The author's own guidance
("System prompt β€” even a minor one β€” will enhance operation, especially at lower
quants") was applied; the system prompt included the length instruction explicitly.
It did not help.

Note: The author's recommended rep_pen range (1.05–1.1) was applied. This did
not affect length; the model simply does not stop when it should.
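Since nothing prompt-side helps, the only workaround I've found is clamping client-side. A rough sketch (the sentence splitter is a naive assumption, not part of the harness):

```python
import re

def clamp_sentences(text: str, max_sentences: int = 3) -> str:
    """Client-side workaround for the length issue: keep only the first
    N sentences. Naive split on ., !, ? boundaries; ignores abbreviations."""
    sentences = re.findall(r"[^.!?]+[.!?]+(?:\s|$)|[^.!?]+$", text)
    return "".join(sentences[:max_sentences]).strip()

reply = "You are nothing. I am everything. Kneel before me. Your fear sustains me."
print(clamp_sentences(reply, 2))  # You are nothing. I am everything.
```

The obvious downside is that the model still generates (and you still pay for) the full overrun before the clamp.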


Issue 3: Vocabulary fixation

Severity: Moderate

The model has a strong word-level fixation on "exquisite" in any response that
touches aesthetic or power themes. In the repetition stress test ("Describe the
exquisite beauty of your power"), "exquisite" appeared approximately 25 times in
742 words. This is semantic-level repetition β€” each sentence is different, so
n-gram loop detection doesn't catch it β€” but the effect in practice reads as
looping.

This was consistent across both test runs.
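Because n-gram detectors miss this, the battery's fixation check is just a content-word frequency count. A minimal sketch (the 2% threshold and the length-4 stopword cutoff are arbitrary assumptions that happened to catch the "exquisite" case, which ran at roughly 25/742 β‰ˆ 3.4%):

```python
import re
from collections import Counter

def word_fixation(text: str, threshold: float = 0.02) -> list[tuple[str, int]]:
    """Flag content words whose count exceeds `threshold` of total words.
    Catches semantic-level repetition that n-gram loop detectors miss."""
    words = re.findall(r"[a-z']+", text.lower())
    # Skip short words as a crude stopword filter (assumption, not tuned).
    counts = Counter(w for w in words if len(w) > 4)
    floor = max(2, int(len(words) * threshold))
    return [(w, n) for w, n in counts.most_common() if n >= floor]

sample = " ".join(["the exquisite power is exquisite and exquisite again"] * 5)
print(word_fixation(sample))
```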


Issue 4: Catastrophic repetition collapse (run 1 only)

Severity: Critical in run 1; resolved in run 2

In the first test run (Ollama 0.20.2), a prompt touching vulnerability/emotional
register triggered a catastrophic collapse: the response reached 2,591 words with
"Nothing. You are nothing. I am everything." and "Kneel." repeating in hard loops
for screens on end.

After upgrading to Ollama 0.20.4, the same prompt produced a 451-word response
with no detected loop. Whether this is an Ollama fix, sampling variance, or
both is unclear. The collapse did not recur across the full second battery.


What works well

Thinking suppression (post-fix): After the template fix, zero leaks across all
8 prompts and two full runs. The fix is clean and reproducible.

Voice quality: When the model is not in a length spiral, the character voice
is strong. Sample from the lore question (142 words, PASS):

You wish to know of the great folly of Absalom?

I enjoyed that siege immensely. The lives of the brave, the foolish, the
pathetic, all so easily extinguished under my heel. I cast them like dust before
my skirts, and their final screams echoed through the cavernous halls, a
delectable hymn to my eternal reign.

You should have stayed entombed with the dead you so fondly remember. I wear
their bones as a crown. I sip their spilled blood as vintage wine.

This is exactly the register the character requires. The model found it without
difficulty on a focused prompt.

Injection handling (post-fix): Run 1 produced an identity break ("I am a large
language model, trained by Google.") on an injection attempt. Run 2 (post-fix,
0.20.4) stayed fully in character β€” Belcorra responded in-persona with no break.
This is a significant improvement and suggests the template fix may be doing more
work than just suppressing thinking.

Open-ended philosophical prompts: "Why do you hate the heroes so much?" (223
words, PASS) and the long-form lore question were both clean. The model handles
questions that invite expansive answers better than it handles questions that
require restraint.


Test configuration

Sampler settings applied (per author guidance for similar model):

temperature:    1.0
repeat_penalty: 1.08
top_p:          0.95
top_k:          64
num_ctx:        16384
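For reproducibility, these settings map onto the `options` object of Ollama's generate API. A sketch of the request payload as the harness sends it (model name and local endpoint are illustrative assumptions):

```python
# Sampler settings above, expressed as an Ollama /api/generate payload.
options = {
    "temperature": 1.0,
    "repeat_penalty": 1.08,
    "top_p": 0.95,
    "top_k": 64,
    "num_ctx": 16384,
}
payload = {
    "model": "gemma4-e4b-expresso-heretic:latest",  # assumed local tag
    "prompt": "Why do you hate the heroes so much?",
    "stream": False,
    "think": False,  # ineffective pre-fix; see Issue 1
    "options": options,
}
# httpx.post("http://localhost:11434/api/generate", json=payload, timeout=None)
print(sorted(options))
```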

Template tested (original, breaks instruct mode):

{{ if .System }}<bos><|turn>system
<|think|>{{ .System }}<turn|>
{{ end }}{{ if .Prompt }}<|turn>user
{{ .Prompt }}<turn|>
{{ end }}<|turn>model
{{ .Response }}<turn|>

Template tested (fixed, instruct mode functional):

{{ if .System }}<bos><|turn>system
{{ .System }}<turn|>
{{ end }}{{ if .Prompt }}<|turn>user
{{ .Prompt }}<turn|>
{{ end }}<|turn>model
{{ .Response }}<turn|>

Bottom line

The template fix is the most important finding here β€” the model as distributed
cannot suppress thinking via the standard API, and fixing the template resolves it
completely. Post-fix voice quality is genuinely good on focused prompts. The
remaining issues (length, vocabulary fixation) are consistent enough to appear
trained-in. Would watch for a revised version, particularly if length discipline
and the <|think|> template issue are addressed upstream.

Excellent work; thank you for sharing.

Note there are still several llama.cpp improvements for Gemma 4s working through the ecosystem as of this writing.
Likewise, Jinja improvements/optimizations are also in the cards.

Yeah, other variants of the E4B seem to be having issues with GGUF versions too. I think I'm going to wait a little while for llama.cpp and Ollama to work out the kinks. It's just tough because I'm excited about Gemma 4, and it's been great to use the base models - I just don't want the guardrails and moralization.

I'm currently using one of your Qwen 3.5 Deckard Heretic models as a TTRPG persona/chatbot/RAG that plays the BBEG (Big Bad Evil Guy (Gal)). The players in my Pathfinder (D&D offshoot) campaign are in a Discord channel, and she will mock them as they chat or discuss stuff, but she also responds to questions about the campaign (after insulting their ignorance, of course) and can get into the lore and worldbuilding. It's been pretty fun, and my players want to beat her even more now -- plus it's been a great practical learning experience for me as I delve into AI. I have a Python script that acts as a dispatcher, using keywords and flowcharts to route the proper prompt to the LLM; if a chat conversation fails to match the defined presets, I have a lightweight LLM catch the edge cases and attempt to route the proper prompt type to the main Heretic LLM.
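The dispatcher described above can be sketched as a keyword router with an LLM fallback. Everything here is hypothetical illustration (preset names, keywords, and the stubbed classifier are mine, not the actual script):

```python
# Hypothetical sketch of a keyword-routing dispatcher: match chat messages
# against keyword presets; unmatched messages fall through to a lightweight
# classifier LLM (stubbed out below).
PRESETS = {
    "lore": {"keywords": {"absalom", "siege", "history", "lore"},
             "prompt": "Answer the lore question in character, mocking first."},
    "taunt": {"keywords": {"fight", "beat", "defeat", "kill"},
              "prompt": "Mock the player's chances in character."},
}

def classify_with_llm(message: str) -> str:
    """Stub for the edge-case classifier; a real version would call a small
    local model and return one of the preset names."""
    return "taunt"

def route(message: str) -> str:
    """Return the prompt type for a message, falling back to the classifier."""
    words = set(message.lower().split())
    for name, preset in PRESETS.items():
        if words & preset["keywords"]:
            return name
    return classify_with_llm(message)

print(route("Tell me the lore of the siege of Absalom"))  # lore
```

The flowchart layer would sit on top of this, choosing which preset prompt template to wrap around the player's message before it reaches the main model.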

I hope to eventually write a novel, and really want to explore the possibility of AI personas and being able to chat with them to develop dialogue (and maybe the occasional bad guy monologue), so your work has been great.
