Gemma 4 E2B vs E4B Benchmark: The Hidden Thinking Mode That Makes the Smaller Model 20× Slower

Gemma 4’s edge variants — E2B and E4B — are marketed as the “phone-and-Pi” tier of the family: tiny, open-weight, fast. The obvious assumption is that the smaller one is faster. On paper, gemma4:e2b ships with 2B effective parameters to E4B’s 4B, so surely it sails ahead?

It does, until it doesn’t. In our tests on an RTX 3070 8GB box, E2B’s raw token generation is about 35–40% faster than E4B. But on short practical tasks — classification, extraction, translation, commit messages — E4B finishes 5 to 10 times faster, and time-to-first-token on short prompts is 20× worse on E2B. That contradiction turned out to be the interesting story, and tracing it all the way down led to a <|think|> token that Ollama’s gemma4 renderer silently injects for E2B, against what the official docs say.

This post lays out the full benchmark, the thinking-mode detective work, and a best-practice cheat sheet. If you are just here for the “which model should I run locally” answer, skip to the best practices section at the end.

Test setup

Everything ran on the same machine:

  • CPU — AMD Ryzen 5 5600X (6C/12T)
  • GPU — NVIDIA RTX 3070, 8GB VRAM
  • RAM — 32GB DDR4-3200
  • OS — Windows 11 Pro
  • Ollama — exposed on LAN; Python client running on a separate machine hitting the REST API

The two models under test:

  • gemma4:e2b — 5.1B total parameters, 2B effective, Q4_K_M
  • gemma4:e4b — 8.0B total, 4B effective, Q4_K_M (same SHA digest as gemma4:latest)

The “E” stands for effective parameters. Ollama displays the full parameter count, but the MatFormer architecture only activates a slice at inference time.

Here is the core Python we used to pull TPS out of the /api/generate endpoint:

import requests

resp = requests.post('http://192.168.51.202:11434/api/generate', json={
    'model': 'gemma4:e4b',
    'prompt': 'Explain what Docker is in 100 words.',
    'stream': False,
})
data = resp.json()

eval_tps = data['eval_count'] / (data['eval_duration'] / 1e9)
print(f'Generation speed: {eval_tps:.1f} t/s')

eval_duration is in nanoseconds, eval_count is generated tokens. Divide and you have generation-phase tokens per second.

Raw TPS benchmark

First, does generation speed degrade as output length grows? Three prompt sizes, gemma4:e4b:

| Test | Generated tokens | Prompt processing | Generation speed | Total time |
|---|---|---|---|---|
| Short | 15 | 263.5 t/s | 34.0 t/s | 5.02s (incl. 4.38s cold) |
| Medium | 183 | 13.5 t/s | 31.1 t/s | 8.19s |
| Long | 2,652 | 662.5 t/s | 29.7 t/s | 91.25s |

A few things jump out:

  • Generation holds at ~30 t/s regardless of output length.
  • Cold start is 4–5 seconds on first load; subsequent calls are instant once the model is resident in VRAM.
  • Prompt processing on a warm cache is fast enough that it is not the bottleneck.
  • 30 t/s is fine for conversation but feels tight inside an agent loop that emits a lot of tokens.

Does context length hurt speed?

A common worry is that cranking num_ctx up to your VRAM limit will tax generation speed. Measured:

| Context length | Generation speed | Prompt processing |
|---|---|---|
| 4,096 | 30.2 t/s | 294.6 t/s |
| 8,192 | 29.6 t/s | 297.4 t/s |
| 16,384 | 30.0 t/s | 297.5 t/s |
| 32,768 | 30.2 t/s | 288.9 t/s |
| 65,536 | 30.0 t/s | 269.1 t/s |

No meaningful impact. 4K to 64K all sit at ~30 t/s. Prompt processing at 64K is marginally slower (269 vs ~295 t/s) but the difference is noise. On an RTX 3070 running Gemma 4 E4B Q4_K_M the bottleneck is compute, not memory bandwidth, and KV cache on short prompts is small either way.

Practical takeaway: just set num_ctx to whatever your VRAM can hold. Free headroom for context-hungry workloads like Claude Code.
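
Concretely, num_ctx is set per request through the options object on /api/generate; a minimal sketch reusing the host and model from the earlier snippet:

```python
# Per-request payload with the context window pinned to 64K. The 65536
# value assumes the 64K KV cache still fits next to the Q4_K_M weights
# in your VRAM; scale it down if the model spills to system RAM.
payload = {
    'model': 'gemma4:e4b',
    'prompt': 'Explain what Docker is in 100 words.',
    'stream': False,
    'options': {'num_ctx': 65536},
}
# Send it exactly like the earlier snippet:
# requests.post('http://192.168.51.202:11434/api/generate', json=payload)
```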

Time to first token

Throughput is not the whole story. For agent-style workflows, TTFT — the time from request to first token — is what you feel. A run that ends at 30 t/s but sits silent for seven seconds before emitting anything is a bad user experience.

Streaming mode, three samples each:

| Test | Average TTFT | Individual runs |
|---|---|---|
| Short | 1,068ms | 2,427ms (cold), 391ms, 386ms |
| Medium | 7,301ms | 9,753ms, 10,222ms, 1,927ms |
| Long | 22,009ms | 23,000ms, 23,921ms, 19,106ms |

Short prompts settle at ~390ms after cold start — snappy. Medium and long sit at 7 to 22 seconds before the first token, which is strange: prompt processing is fast, so what is the model doing during the delay? The answer came later, inside the thinking-mode detour. Keep this number in mind.
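
For reference, TTFT is straightforward to measure yourself with a small streaming helper; this is a sketch against the same LAN endpoint, using the requests client from the earlier snippet:

```python
import json
import time

import requests

def measure_ttft(model: str, prompt: str,
                 host: str = 'http://192.168.51.202:11434') -> float:
    """Seconds from sending the request until the first streamed chunk arrives."""
    start = time.monotonic()
    with requests.post(f'{host}/api/generate',
                       json={'model': model, 'prompt': prompt, 'stream': True},
                       stream=True, timeout=120) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:                      # first non-empty chunk = first token
                json.loads(line)          # each streamed chunk is one JSON object
                return time.monotonic() - start
    raise RuntimeError('stream ended before any token arrived')
```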

E2B vs E4B raw speed

Intuition says E2B should be faster across the board. Measured:

| Metric | gemma4:e2b | gemma4:e4b |
|---|---|---|
| Generation (short) | 46.0 t/s | 33.6 t/s |
| Generation (medium) | 40.0 t/s | 30.5 t/s |
| Generation (long) | 41.9 t/s | 30.2 t/s |
| TTFT (short) | ~5,863ms | ~300ms |
| TTFT (medium) | ~6,338ms | ~7,501ms |
| TTFT (long) | ~14,641ms | ~16,987ms |

E2B is indeed 35–40% faster at raw generation. But on short prompts, its TTFT is nearly 20× slower than E4B (5.8s vs 0.3s). If you were planning to pick E2B for interactive apps, this is a red flag: the headline number says fast, the lived experience says sluggish.

The contradiction resolves once we look at thinking mode. First, quality.

Quality on hard tasks

Five tasks across programming, logic, math, summarization, and creative writing. Scored out of 10 per task:

  1. Code — write a Python function that returns the second-largest unique value in a list, handling edge cases.
  2. Logic — given A>B, B>C, D>A, C>E, sort five people by height.
  3. Math — apples cost $2 for 3; how much for 10, with work shown.
  4. Summarization — compress a paragraph about Git into exactly two sentences.
  5. Creative — write a haiku about debugging at 3am.

| Task | E2B | E4B | Notes |
|---|---|---|---|
| Code | 8 | 8 | Both correct, both handle edge cases |
| Logic | 9 | 8 | Both land on D>A>B>C>E; E2B’s derivation is clearer |
| Math | 7 | 9 | Both hit $6.67; E4B reasons about integer groupings cleanly |
| Summarization | 7 | 9 | E2B drops Linus Torvalds and Bitbucket; E4B keeps them |
| Creative | 7 | 9 | E4B gives concrete imagery (cold screen, coffee); E2B is abstract |
| Total | 38/50 | 43/50 | |

E4B wins on comprehension, detail fidelity, and creative richness. E2B is fine on code and logic but loses nuance on summarization and creative prompts. The trade is roughly 35–40% faster generation against ~10% worse quality — pick your poison by use case.

The plot twist: practical tasks

Those five tasks were arguably unfair to a 5B-class model — they are the natural turf of 70B+ cloud models. So we flipped it and ran tasks that small models should own: short input, tight scope, short output.

  1. Sentiment — classify a product review as positive, negative, or neutral.
  2. Keyword extraction — pull 5 keywords from a Docker blurb.
  3. Structured extraction — turn an email into JSON (sender, date, subject, action_required).
  4. Shell command — write a find command for .log files modified in the last 24 hours under /var/log.
  5. Short translation — translate an English system message into another language.
  6. Commit message — write a Conventional Commits line for a described diff.

Results:

| Task | E2B time / tokens | E4B time / tokens |
|---|---|---|
| Sentiment | 15.4s / 165 | 13.9s / 2* |
| Keyword extraction | 7.4s / 280 | 0.74s / 13 |
| Structured extraction | 7.9s / 299 | 2.28s / 63 |
| Shell command | 7.8s / 301 | 0.91s / 19 |
| Short translation | 14.9s / 588 | 1.10s / 23 |
| Commit message | 0.5s / 11 | 0.76s / 15 |

*includes cold start
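
Dividing the wall-clock columns gives the per-task speedup directly:

```python
# (e2b_seconds, e4b_seconds) taken from the warm-start rows of the table above
times = {
    'keyword extraction':    (7.4, 0.74),
    'structured extraction': (7.9, 2.28),
    'shell command':         (7.8, 0.91),
    'short translation':     (14.9, 1.10),
}
speedups = {task: round(e2b / e4b, 1) for task, (e2b, e4b) in times.items()}
# keyword extraction works out to 10.0x, shell command to 8.6x
```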

On these, E4B is 5 to 10 times faster than E2B — despite being slower in the raw TPS chart. Look at the token column: E2B burns 165 to 588 tokens per task while E4B spends 2 to 63. E2B is doing a lot of invisible generation the user never sees.

Quality is roughly level. For keyword extraction, E2B gave Docker, containers, virtualization, software, packages; E4B gave Docker, virtualization, containers, OS-level, packages — the latter slightly more precise with OS-level. For commit messages, E2B produced feat: add exponential backoff retry to API client and E4B produced feat(api): add retry mechanism with exponential backoff for network failures — the scoped form is more idiomatic.

For tight, well-specified tasks, E4B is the better choice — faster in wall-clock time, a hair better in quality, and it does not wander off to think.

Parameter cheat sheet

Before the thinking-mode autopsy, a quick refresher on the sampling knobs. Ollama’s defaults for gemma4:

temperature 1
top_k       64
top_p       0.95
  • temperature — randomness. 0 always picks the top-probability token (deterministic), 1 samples from the raw distribution, >1 gets more creative/divergent. Use 0–0.3 for code and structured extraction; 0.8–1.2 for creative writing.
  • top_k — sample only from the top k candidates and discard the rest. Lower is more conservative. Gemma 4’s default of 64 is on the permissive side.
  • top_p (nucleus sampling) — include candidates in descending probability until the cumulative mass reaches p. Used together with top_k; the intersection is what gets sampled.
  • think — a boolean Ollama exposes on chat models that toggles “reason-then-answer” mode. Orthogonal to the three samplers above. Gemma 4 specific.

The first three change how a token is picked from the candidates. They do not explain E2B’s slowdown. The culprit is think.
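
To make the interplay concrete, here is a toy sampler over a token-to-logit map (illustrative only, not Ollama's actual implementation):

```python
import math
import random

def sample_token(logits: dict[str, float], temperature: float = 1.0,
                 top_k: int = 64, top_p: float = 0.95) -> str:
    """Toy illustration of how temperature, top_k, and top_p compose."""
    if temperature == 0:
        return max(logits, key=logits.get)            # greedy / deterministic
    # Softmax with temperature scaling (subtract the max for stability).
    m = max(l / temperature for l in logits.values())
    exp = {t: math.exp(l / temperature - m) for t, l in logits.items()}
    z = sum(exp.values())
    probs = {t: e / z for t, e in exp.items()}
    # top_k: keep only the k most probable candidates.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # top_p: walk down the ranked list until cumulative mass reaches p.
    kept, cum = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cum += p
        if cum >= top_p:
            break
    tokens, weights = zip(*kept)
    return random.choices(tokens, weights=weights)[0]
```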

The thinking mode detour

Verify with the think parameter

Ollama’s /api/generate accepts a think boolean:

resp = requests.post('http://192.168.51.202:11434/api/generate', json={
    'model': 'gemma4:e2b',
    'prompt': '...',
    'stream': False,
    'think': False,  # force-disable thinking
})
data = resp.json()
print(data['response'])        # final answer
print(data.get('thinking'))    # reasoning trace (present only when think=True)

Running the short-translation prompt across five configurations:

| Configuration | Time | Tokens | Notes |
|---|---|---|---|
| E2B default (no think arg) | 20.0s | 583 | Thinking content stripped by parser |
| E2B think=False | 0.73s | 18 | Thinking skipped |
| E2B think=True | 13.8s | 541 | Thinking trace in thinking field (1,589 chars) |
| E4B default | 6.35s | 20 | No thinking |
| E4B think=True | 0.84s | 18 | No thinking behavior observed |

Key findings:

  • E2B has thinking enabled by default. It generates ~500 extra internal tokens that Ollama’s gemma4 parser filters out before returning the response. You never see them, but you pay for them.
  • Setting think=False makes E2B 20× faster — right in line with E4B.
  • Explicit think=True puts the final answer in response and the full reasoning trace (1,589 characters here) in a separate thinking field.
  • E4B does not support thinking mode at all. Neither think=True nor think=False changes its behavior.

Does disabling thinking hurt quality?

A 20× speedup is seductive — but at what cost? Same six practical tasks, E2B default vs think=False:

| Task | E2B default | E2B think=False | Difference |
|---|---|---|---|
| Sentiment | NEGATIVE | NEGATIVE | Identical |
| Keyword extraction | Docker, containers, virtualization, software, packages | Docker, virtualization, containers, software, packages | Order only |
| Structured extraction | "Submit your reports by Friday" | "submit your reports by Friday" | Capitalization |
| Shell command | find ... -mtime -1 | find ... -mtime -1 | Identical |
| Short translation | natural phrasing | slightly awkward word order | Minor |
| Commit message | feat: add exponential backoff retry... | feat: add retry mechanism with exponential backoff... | Both good |

For well-specified tasks, disabling thinking costs almost nothing and gains 10–20× speed. It is a free lunch on this kind of workload. For harder tasks — multi-step reasoning, creative writing — you likely want thinking on, which means either E2B with think=True or a bigger model entirely.

Why does E2B think by default? The docs disagree

Here is where it gets interesting. Google’s official Thinking mode in Gemma page says Gemma 4 thinking is opt-in — you have to prepend an <|think|> control token to the system prompt. Ollama’s own gemma4:e2b page echoes this.

Observed behavior says otherwise. E2B on Ollama clearly runs with thinking enabled unless you explicitly turn it off. So who is injecting the token?

raw=true catches the culprit

raw=true on Ollama bypasses the built-in chat template / renderer and ships the prompt to the model verbatim. We hand-rolled two templates per the official format:

# no thinking
<|turn>user
[Prompt]<turn|>
<|turn>model

# with thinking
<|turn>system
<|think|><turn|>
<|turn>user
[Prompt]<turn|>
<|turn>model
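
A small helper (our naming, not part of any API) makes it easy to emit either template before posting with raw=True:

```python
def render_turns(prompt: str, think: bool = False) -> str:
    """Splice a prompt into the hand-rolled Gemma 4 turn markup shown above."""
    system = '<|turn>system\n<|think|><turn|>\n' if think else ''
    return f'{system}<|turn>user\n{prompt}<turn|>\n<|turn>model\n'

# Post it verbatim, bypassing Ollama's renderer:
# requests.post(url, json={'model': 'gemma4:e2b',
#                          'prompt': render_turns('...', think=True),
#                          'raw': True, 'stream': False})
```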

Same short-translation prompt, three configurations per model:

| Test | E2B tokens | E4B tokens |
|---|---|---|
| Ollama default (raw=False) | 577 (thinking) | 20 (no thinking) |
| Manual no-think + raw=True | 18 (no thinking) | 19 (no thinking) |
| Manual with-think + raw=True | 552 (thinking) | 19 (no thinking) |

Chain of evidence:

  1. E2B “Ollama default 577” ≈ E2B “manual with-think 552” → Ollama is injecting <|think|> into E2B’s prompt by default.
  2. E4B ignores <|think|> even when manually supplied → E4B was never trained to respond to the token.
  3. E2B with a manual no-think template drops to 18 tokens, matching the API’s think=False path.

Verdict: Ollama’s gemma4 renderer quietly enables thinking for E2B by default, contradicting both Google’s and Ollama’s own documentation that call it opt-in. Whether this is a deliberate special case or a bug, the practical upshot is the same: if you want E2B to be fast, pass think=False.

A note on running Claude Code locally

A natural next question: can you point Claude Code at a local Ollama and skip the cloud API bill? Yes, with caveats.

The basic wiring is simple. Pull the model, then tell Claude Code where Ollama lives:

ollama pull gemma4:e4b

In Claude Code’s settings.json, set the base URL to your local Ollama endpoint and pick gemma4:e4b as the model. Two concrete things matter:

  • Use E4B, not E2B. Everything above applies here — E2B’s default thinking mode turns every tool call into a multi-second wait, and Claude Code makes a lot of tool calls.
  • Set num_ctx to at least 32K. Claude Code’s system prompt and context accumulation get crowded fast; anything smaller and you will see truncation issues.

Expect the experience to be noticeably slower and less reliable than Claude’s own hosted models. A local 4B model will occasionally overwrite files whole instead of making targeted edits, and its tool-use judgment lags behind frontier models. Treat it as “learn how the stack works” or “automate cheap text-processing jobs” rather than “daily driver for a real codebase.” The best home for gemma4:e4b in your workflow is probably a cron job, a Git hook, or a Slack bot — places where free and offline matters more than raw quality.

Best practices for local edge models

Pulled together from everything above:

  1. Need reasoning? E2B + think=True. Deeper answers at the cost of speed.
  2. Clear, specified tasks? E4B (or E2B + think=False). Classification, extraction, translation, commit messages — 10–20× faster with negligible quality loss.
  3. Agent workflows like Claude Code? E4B. E2B’s default thinking makes the interaction painfully slow.
  4. Context length? Max it out. 4K and 64K run at the same speed; bigger context saves you from overflow pain.
  5. Sampling parameters? Defaults are fine. Lower temperature to 0.1–0.3 for deterministic tasks; raise toward 1.0+ for creative ones.

A Python template that encodes most of this for fast, well-specified jobs:

import requests

def quick_task(prompt: str) -> str:
    """Tight, deterministic tasks — maximum speed."""
    resp = requests.post('http://192.168.51.202:11434/api/generate', json={
        'model': 'gemma4:e4b',
        'prompt': prompt,
        'stream': False,
        'think': False,
        'options': {
            'temperature': 0.2,
            'num_ctx': 8192,
        },
    })
    return resp.json()['response'].strip()

msg = quick_task('Write a conventional commit message for: added retry logic to API client')
print(msg)

Wrap-up

What started as a “how fast is Gemma 4 on an RTX 3070?” benchmark turned into an accidental discovery: Ollama’s gemma4 renderer enables thinking for E2B by default, against what both Google and Ollama document. That single quirk explains the entire paradox — why E2B is faster on paper but slower in practice, why its TTFT is wildly worse on short prompts, and why think=False produces a 20× speedup with near-zero quality cost on practical tasks.

Net-net: on a single 8GB consumer GPU, gemma4:e4b with think=False is the best default for small-model workloads. Save E2B + thinking for genuinely hard problems, and keep your heavy reasoning on a cloud model.