Fixing Gemma 4 Thinking Prompts in llama.cpp, Locally First
On April 29, 2026, I finished a small local fix in my CompleteTech AI Research fork of llama.cpp: Gemma 4 thinking mode needed the generation prompt to open the thought channel, not close an empty one.
I am making the work public because the failure mode is worth understanding for anyone running local models. This is not a claim that the change has landed in upstream llama.cpp. It is a completed personal fix in my fork, published openly so the behavior, tests, and validation trail are visible.
What I saw
The bug was in the generation prompt path for the shipped Gemma 4 templates. When llama.cpp applies a chat template with add_generation_prompt, the template has to leave the next model turn in the right state for generation.
For Gemma 4 with thinking enabled, that means the prompt should open the thought channel and let the model generate reasoning into it. For non-thinking generation, it should simply open the model turn and avoid injecting thought-channel control tokens.
The local template logic had the guard inverted. It emitted a closed, empty thought block when enable_thinking was false, and when thinking was enabled the prompt never opened the thought channel the model needed.
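For orientation, this is roughly how that path is driven from the C++ side. It is a minimal sketch assuming the helpers in llama.cpp's common/chat.h (common_chat_templates_apply and the common_chat_templates_inputs struct); the field names here are from memory and may not match the current header exactly.

```cpp
// Rough sketch of rendering a generation prompt through llama.cpp's common
// chat helpers. Names follow common/chat.h as I remember it; verify against
// the header before reusing.
#include "chat.h"   // llama.cpp common/chat.h

#include <string>
#include <vector>

std::string render_generation_prompt(const common_chat_templates * tmpls, bool thinking) {
    common_chat_msg user_msg;
    user_msg.role    = "user";
    user_msg.content = "Why is the sky blue?";

    common_chat_templates_inputs inputs;
    inputs.messages              = { user_msg };
    inputs.use_jinja             = true;      // the Gemma 4 templates are Jinja files
    inputs.add_generation_prompt = true;      // leave the next model turn open
    inputs.enable_thinking       = thinking;  // the flag whose guard was inverted

    // The tail of this string is exactly what the fix changes.
    return common_chat_templates_apply(tmpls, inputs).prompt;
}
```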
The shape of the fix
The change is intentionally small. I updated both Gemma 4 templates:
- models/templates/google-gemma-4-31B-it.jinja
- models/templates/google-gemma-4-31B-it-interleaved.jinja
The important behavioral change is this: when enable_thinking is true, the generation prompt now opens the thought channel with <|channel|>thought followed by a newline. When enable_thinking is false, the model turn stays open without adding a fake empty thought block.
That distinction matters. A template should not leak reasoning-control tokens into non-thinking prompts, and it should not silently block the visible reasoning path when thinking is explicitly enabled.
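To make that contract concrete, here is a minimal sketch of the two tails. The control-token strings below are stand-ins based on the description above, not literals copied from the shipped templates, and the real logic lives in the Jinja files, not in C++.

```cpp
#include <string>

// Stand-in control tokens: illustrative only, not the literal strings from
// the shipped Gemma 4 templates.
static const std::string MODEL_TURN    = "<start_of_turn>model\n";
static const std::string THOUGHT_OPEN  = "<|channel|>thought\n";
static const std::string THOUGHT_CLOSE = "<|end_channel|>\n";

// The tail the template should append after the conversation when
// add_generation_prompt is set.
std::string generation_tail(bool enable_thinking) {
    if (enable_thinking) {
        // Open the model turn and the thought channel, then stop: the model
        // writes its reasoning into the still-open channel.
        return MODEL_TURN + THOUGHT_OPEN;
    }
    // Non-thinking: open the model turn and nothing else. The inverted guard
    // used to land here and emit MODEL_TURN + THOUGHT_OPEN + THOUGHT_CLOSE,
    // i.e. a fake, already-closed empty thought block.
    return MODEL_TURN;
}
```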
What I tested
I also updated tests/test-chat.cpp so the test suite exercises both sides of the branch. Existing Gemma 4 parser cases now explicitly run with thinking disabled where that is the behavior being tested, and new checks cover both shipped Gemma 4 templates.
The new checks verify two things:
- Thinking mode ends the prompt with an open model turn and an open thought channel.
- Non-thinking mode ends the prompt with the model turn open and does not render the empty thought block.
That is the part I care about most. The fix is only a few template lines, but the tests make the intended contract obvious for the next person who reads the code.
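For readers who want the shape of those checks without opening test-chat.cpp, a simplified standalone version looks something like this. It reuses the same stand-in tokens as the sketch above; the real tests render the shipped templates rather than hard-coding strings.

```cpp
#include <cassert>
#include <string>

// Small suffix helper so this compiles without C++20's std::string::ends_with.
static bool ends_with(const std::string & s, const std::string & suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

void check_generation_prompts(const std::string & prompt_thinking,
                              const std::string & prompt_plain) {
    // Thinking mode: the prompt ends with an open model turn followed by an
    // open thought channel, ready for the model to write reasoning into.
    assert(ends_with(prompt_thinking, "<start_of_turn>model\n<|channel|>thought\n"));

    // Non-thinking mode: the model turn is open and nothing follows it. If the
    // old empty thought block were still rendered, this suffix check would fail.
    assert(ends_with(prompt_plain, "<start_of_turn>model\n"));
}
```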
Validation
I validated the change in an Ubuntu 24.04 Podman container by building test-chat and running the Gemma 4 template test. I then ran the full test-chat suite.
I also built the Vulkan server image using .devops/vulkan.Dockerfile with UBUNTU_VERSION=24.04, tagged it locally as localhost/llama.cpp:server-vulkan-latest-gemma4-test, and smoke-ran llama-server --version. The server loaded the Vulkan backend and reported llama.cpp version: 8981 (d77599234).
Why publish a local fix
Model template bugs are small until they are not. One inverted guard can change whether a local model gets the right generation channel, whether reasoning is visible, and whether control tokens leak into ordinary output.
I am not placing this into the upstream ggml-org/llama.cpp codebase from here. Upstream contribution has its own process, review expectations, and timing. For now, this is a completed local fix in the CompleteTech fork: a narrow patch, a visible diff, and a validation record that others can inspect or adapt.
That is still useful. A lot of applied AI work happens in this middle state: the local fix is complete, the operational need is real, and the responsible move is to publish enough context that someone else can inspect it without pretending it is already upstream.
Sources
Short link to this public write-up
Technical artifact: draft PR #1 in the CompleteTech AI Research llama.cpp fork
Commit: fix Gemma 4 thinking generation prompt
Written by Tim Gregg, founder of CompleteTech LLC – Innovation at Every Integration.
