Google Gemma 4: Open-Source, Multimodal, Apache 2.0 — and 1.5GB to Run on a Phone

Open-source AI is getting crowded. Meta has Llama, Alibaba has Qwen, Mistral keeps shipping, DeepSeek is nipping at the heels of frontier labs. On April 2, 2026, Google DeepMind jumped into the mix with Gemma 4 — a family of open-weight models derived from Gemini 3 research, and this time released under the genuinely permissive Apache 2.0 license. No MAU cap, no acceptable-use appendix, no catches.

The pitch is direct: “byte for byte, the most capable open models.” It is not entirely marketing. The 31B Dense variant sits at third place among open models on the Arena AI text leaderboard, and the 26B MoE version reaches sixth while activating just 3.8B parameters per token. The smallest variant, E2B, runs multimodal inference in under 1.5 GB of memory — small enough for a phone, a Raspberry Pi, or a mid-range IoT device.

This post walks through the whole release: model lineup, benchmarks, multimodal capabilities, Agent Skills, and every way to actually run it on hardware you own.

What is Gemma 4?

Gemma is Google DeepMind’s open-weight model family, technically descended from their flagship Gemini line. Gemma 3 shipped in 2025 and was well received. Gemma 4 is the next step, rebuilt on research from Gemini 3 with meaningful architectural and training improvements.

The release ships four sizes that span everything from laptops and workstations down to phones:

31B Dense: the full 31-billion-parameter dense model. Highest quality, best for tasks that need maximum reasoning ability.
26B A4B MoE: 26 billion total parameters, but only 3.8 billion active per token thanks to a Mixture of Experts architecture. Gets close to the 31B’s quality at a fraction of the inference cost.
E4B: “E” stands for effective. 8B total parameters (including embeddings), ~4.5B effective. The mid-range edge model.
E2B: 5.1B total, ~2.3B effective. The lightest variant, and the one that runs in under 1.5 GB. Aimed at smartphones and IoT devices first.

Every size ships in both base and instruction-tuned (IT) flavors, and the weights are available on Hugging Face, Kaggle, and the Ollama library.

Spec sheet

Model	Effective params	Total params	Layers	Context window	Modalities
E2B	2.3B	5.1B	35	128K	Text · Image · Audio
E4B	4.5B	8B	42	128K	Text · Image · Audio
26B A4B MoE	3.8B (active)	26B	—	256K	Text · Image · Video
31B Dense	31B	31B	60	256K	Text · Image · Video
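
To make the gap between total and effective parameters concrete, here is a small sketch that computes the share of each model's weights that counts as effective or active. The numbers are copied from the spec sheet above; the dictionary and function names are my own.

```python
# (effective or active params, total params) in billions, from the spec sheet.
SPEC = {
    "E2B": (2.3, 5.1),
    "E4B": (4.5, 8.0),
    "26B A4B MoE": (3.8, 26.0),
    "31B Dense": (31.0, 31.0),
}


def active_share(model: str) -> float:
    """Fraction of total parameters that are effective/active."""
    active, total = SPEC[model]
    return active / total


for name in SPEC:
    print(f"{name}: {active_share(name):.1%} of total parameters")
```

The MoE model stands out: it touches roughly one parameter in seven per token, which is where its inference-cost advantage over the dense 31B comes from.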

A few architectural ideas worth highlighting:

Per-Layer Embeddings (PLE) — every layer gets its own embedding injection, creating a parallel conditioning path alongside the main residual stream. Helps the model handle multimodal inputs more cleanly.
Shared KV Cache — later layers reuse the Key/Value tensors from earlier ones, slashing memory and compute for long-context inference.
Hybrid attention — alternates between sliding-window local attention and global full-context attention, balancing efficiency and reach.
Variable image token budget — you can spend anywhere from 70 to 1,120 tokens on a single image, trading speed for fidelity.
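
The hybrid attention idea is easier to see with a toy mask. The sketch below builds a causal sliding-window mask and a causal global mask, then labels layers as local or global in a repeating pattern. The window size, sequence length, and the one-global-layer-in-three ratio are illustrative assumptions, not Gemma 4's actual configuration.

```python
from typing import List, Optional


def causal_mask(seq_len: int, window: Optional[int] = None) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    window=None gives global causal attention. An integer window limits each
    query to the most recent `window` positions, itself included.
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            visible = k <= q
            if window is not None:
                visible = visible and (q - k) < window
            row.append(visible)
        mask.append(row)
    return mask


def layer_pattern(num_layers: int, global_every: int) -> List[str]:
    """Label each layer 'local' or 'global', one global layer per group."""
    return [
        "global" if (i + 1) % global_every == 0 else "local"
        for i in range(num_layers)
    ]


SEQ_LEN, WINDOW = 8, 3  # hypothetical toy sizes
local = causal_mask(SEQ_LEN, WINDOW)
full = causal_mask(SEQ_LEN)
print(sum(map(sum, local)), "query/key pairs with a sliding window")
print(sum(map(sum, full)), "query/key pairs with global attention")
print(layer_pattern(6, global_every=3))
```

Local layers grow linearly with sequence length while global layers grow quadratically, so interleaving them keeps most of the cost down while still letting some layers see the whole context.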

Benchmarks

The headline numbers come from the Arena AI text leaderboard. 31B Dense scores 1452 Elo, third among open models. 26B A4B MoE scores 1441 Elo, sixth, despite activating only 3.8B parameters per token.

The tables below give the detailed results. The reasoning table also includes last year’s Gemma 3 27B for comparison.

Reasoning and knowledge

Benchmark	Gemma 4 31B	Gemma 4 26B A4B	Gemma 4 E4B	Gemma 4 E2B	Gemma 3 27B
MMLU Pro	85.2%	82.6%	69.4%	60.0%	67.6%
AIME 2026	89.2%	88.3%	42.5%	37.5%	20.8%
GPQA Diamond	84.3%	82.3%	58.6%	43.4%	42.4%
BigBench Hard	74.4%	64.8%	33.1%	21.9%	19.3%

Gemma 4 31B scores 89.2% on AIME 2026, about 4.3 times Gemma 3 27B’s 20.8%. Even E2B, a model with 2.3B effective parameters, edges past Gemma 3 27B on GPQA Diamond (43.4% vs. 42.4%).
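
For anyone who wants to check the arithmetic, this short sketch recomputes the two comparisons from the table above. Only the scores come from the table; the helper names are mine.

```python
# Scores in percent, copied from the reasoning table.
SCORES = {
    "AIME 2026": {"Gemma 4 31B": 89.2, "Gemma 4 E2B": 37.5, "Gemma 3 27B": 20.8},
    "GPQA Diamond": {"Gemma 4 31B": 84.3, "Gemma 4 E2B": 43.4, "Gemma 3 27B": 42.4},
}
BASELINE = "Gemma 3 27B"


def ratio(benchmark: str, model: str) -> float:
    """How many times the baseline's score the given model reaches."""
    row = SCORES[benchmark]
    return row[model] / row[BASELINE]


def point_gap(benchmark: str, model: str) -> float:
    """Difference from the baseline in percentage points."""
    row = SCORES[benchmark]
    return row[model] - row[BASELINE]


print(f"AIME 2026, 31B vs Gemma 3 27B: {ratio('AIME 2026', 'Gemma 4 31B'):.2f}x")
print(f"GPQA Diamond, E2B vs Gemma 3 27B: {point_gap('GPQA Diamond', 'Gemma 4 E2B'):+.1f} points")
```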

Code

Benchmark	Gemma 4 31B	Gemma 4 26B A4B	Gemma 4 E4B	Gemma 4 E2B
LiveCodeBench v6	80.0%	77.1%	52.0%	44.0%
Codeforces Elo	2150	1718	940	633

Vision

Benchmark	Gemma 4 31B	Gemma 4 26B A4B	Gemma 4 E4B	Gemma 4 E2B
MMMU Pro	76.9%	73.8%	52.6%	44.2%
MATH-Vision	85.6%	82.4%	59.5%	52.4%

Multimodal capabilities

Every Gemma 4 size is multimodal, but the supported modalities vary by tier:

Image understanding (all sizes) — variable aspect ratio and resolution, no forced square cropping. Good for object detection, captioning, OCR, and chart analysis.
Video understanding (31B, 26B A4B) — multi-frame analysis, clips up to 60 seconds.
Audio processing (E2B, E4B) — USM-style Conformer encoder, supports speech-to-text and spoken Q&A, up to 30 seconds of audio per request.
Multilingual — pretraining covers 140+ languages; post-training covers 35+.

An interesting design choice: the larger models (31B, 26B) support video but not audio, while the smaller edge models (E2B, E4B) flip it and support audio but not video. The likely rationale is that the killer use case on-device is voice interaction, while big-model workloads are more likely to involve visual documents and clips.
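
Because the supported inputs differ by tier, it can help to check a request before sending it to a model. The sketch below encodes the modality list and duration limits described above as a lookup table. The function name and error strings are hypothetical, and real runtimes may enforce these limits differently.

```python
from typing import Optional

# Supported inputs per model, as listed above.
MODALITIES = {
    "E2B": {"text", "image", "audio"},
    "E4B": {"text", "image", "audio"},
    "26B A4B MoE": {"text", "image", "video"},
    "31B Dense": {"text", "image", "video"},
}

# Per-request duration limits in seconds, as listed above.
MAX_SECONDS = {"audio": 30, "video": 60}


def check_input(model: str, modality: str, seconds: float = 0.0) -> Optional[str]:
    """Return None if the input is acceptable, otherwise a short reason."""
    if model not in MODALITIES:
        return f"unknown model: {model}"
    if modality not in MODALITIES[model]:
        return f"{model} does not accept {modality} input"
    limit = MAX_SECONDS.get(modality)
    if limit is not None and seconds > limit:
        return f"{modality} is limited to {limit} s, got {seconds:g} s"
    return None


print(check_input("E2B", "audio", seconds=20))
print(check_input("31B Dense", "audio", seconds=20))
print(check_input("26B A4B MoE", "video", seconds=90))
```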

Agent capabilities and function calling

Agentic workflows are a major focus of this release. Gemma 4 natively supports:

Function calling — define tools, let the model decide when to invoke them, and receive properly formatted call requests.
Structured JSON output — no grammar constraint or post-processing needed; the model emits valid JSON directly.
Multi-step reasoning and planning — break complex goals into sequential steps and execute them.
System instructions — first-class support for defining agent behavior and guardrails.
Extended thinking — up to 4,000 tokens of “think longer” mode for harder problems.
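
On the application side, function calling boils down to parsing the call the model emits and routing it to real code. Here is a minimal dispatch sketch. The JSON shape (`name` plus `arguments`) and the `get_weather` tool are assumptions for illustration; the actual call format is defined by Gemma 4's chat template and your runtime, which this post does not show.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool: a stub that returns a canned forecast."""
    return f"Sunny in {city}"


# Registry mapping tool names the model may request to Python callables.
TOOLS = {"get_weather": get_weather}


def dispatch(raw_call: str) -> str:
    """Parse one model-emitted tool call and run the matching function."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError as exc:
        return f"error: invalid JSON ({exc.msg})"
    if not isinstance(call, dict):
        return "error: tool call must be a JSON object"
    name = call.get("name")
    if name not in TOOLS:
        return f"error: unknown tool {name!r}"
    try:
        return TOOLS[name](**call.get("arguments", {}))
    except TypeError as exc:
        return f"error: bad arguments ({exc})"


print(dispatch('{"name": "get_weather", "arguments": {"city": "Taipei"}}'))
print(dispatch('{"name": "delete_files", "arguments": {}}'))
```

Even when the model's JSON is reliably well formed, validating the tool name and arguments is cheap, and the error strings can be fed back to the model as the tool result so it can retry.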

Google is also pitching a new concept called Agent Skills — reusable capability bundles that let developers build fully offline autonomous workflows on the edge: extending a knowledge base, generating interactive content, chaining other models, all on-device without a network.

Actually running it on the edge

The most exciting thing about Gemma 4, for me, is the edge story. With 2-bit / 4-bit quantization, E2B runs in under 1.5 GB of memory. That means most modern smartphones — Android or iPhone — can run it.
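
A back-of-envelope check shows how a figure like 1.5 GB can come about. Weight storage is roughly parameters times bits per weight divided by eight. The sketch below applies that to E2B's effective and total parameter counts. It ignores activations, the KV cache, and runtime overhead, and the release does not say which parameters are held in memory at which precision, so treat the combinations as illustrative only.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes) at a given precision."""
    return params_billion * bits_per_weight / 8


BUDGET_GB = 1.5  # memory figure quoted for E2B

for label, params in [("E2B effective (2.3B)", 2.3), ("E2B total (5.1B)", 5.1)]:
    for bits in (2, 4, 16):
        size = weight_gb(params, bits)
        verdict = "fits" if size < BUDGET_GB else "exceeds"
        print(f"{label} at {bits}-bit: {size:.2f} GB ({verdict} {BUDGET_GB} GB)")
```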

Some numbers from the release:

Raspberry Pi 5 running E2B: ~133 tokens/s prefill, ~7.6 tokens/s decode.
With GPU acceleration, 4,000 input tokens plus two Agent Skills complete in under 3 seconds.
4× faster and 60% more energy-efficient than the previous generation.
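
To get a feel for what the Raspberry Pi 5 numbers mean in practice, here is a rough latency estimate. It assumes prefill and decode run back to back and that throughput stays constant regardless of context length, which is a simplification. The example token counts are arbitrary.

```python
PREFILL_TPS = 133.0  # tokens/s, Raspberry Pi 5 figure quoted above
DECODE_TPS = 7.6     # tokens/s, Raspberry Pi 5 figure quoted above


def estimate_seconds(prompt_tokens: int, output_tokens: int) -> float:
    """Rough end-to-end time: read the prompt, then generate the reply."""
    return prompt_tokens / PREFILL_TPS + output_tokens / DECODE_TPS


for prompt, output in [(200, 50), (1000, 200)]:
    total = estimate_seconds(prompt, output)
    print(f"{prompt} prompt + {output} output tokens: about {total:.0f} s")
```

Decode dominates: at 7.6 tokens/s, every 100 generated tokens cost about 13 seconds, so short answers are where a Pi feels usable.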

Supported platforms: Android, iOS, Windows, Linux, macOS (Metal-accelerated), WebGPU browsers, Raspberry Pi, and Qualcomm QC8 NPU. Hardware-optimized paths are available on Google, MediaTek, and Qualcomm’s latest AI accelerators.

How to run Gemma 4

Plenty of entry points. Pick whichever matches your hardware.

On your phone

The easiest way to get Gemma 4 on a phone is the Google AI Edge Gallery app — Google Play for Android and the App Store for iOS. It downloads E2B or E4B to your device, and everything runs offline. No code required.

[Screenshot: Google AI Edge Gallery, model download screen]

Android developers have a second option: the AICore Developer Preview. That is Android’s system-level on-device AI service — your app calls an API and Gemma 4 runs in a shared runtime. The interesting part: E2B and E4B are the foundation of the future Gemini Nano 4, so code you write against Gemma 4 today will run unchanged on Gemini Nano 4 devices later.

On iPhone, besides AI Edge Gallery, you can also deploy via iOS builds of llama.cpp or Core ML-aware toolchains. Since E2B needs under 1.5 GB, recent iPhones have plenty of headroom.

One curious detail worth knowing about: the same E4B model is 3.61 GB on AI Edge Gallery but 9.6 GB on Ollama, roughly a 2.7× difference. The reason is format and quantization precision. AI Edge Gallery uses Google’s LiteRT (previously TFLite), optimized for mobile GPUs and NPUs with aggressive 4-bit quantization. Ollama uses GGUF, designed for desktop CPUs and Apple Silicon, and defaults to a higher-precision quantization (typically Q4_K_M or better). The mobile version trades a bit of quality for smaller size and lower RAM use; the desktop version keeps more of the model’s detail at the cost of disk space.
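
One way to sanity-check the two download sizes is to convert them into an average number of bits stored per parameter. This sketch divides each file size by E4B's 8B total parameters. It treats 1 GB as 10^9 bytes and assumes the whole file is weights, so any bundled encoders, metadata, or rounding in the published figures will skew the result.

```python
def implied_bits_per_param(file_gb: float, params_billion: float) -> float:
    """Average bits stored per parameter, treating 1 GB as 10^9 bytes."""
    return file_gb * 8 / params_billion


E4B_TOTAL_PARAMS_B = 8.0  # total parameters, from the spec sheet

for source, size_gb in [("AI Edge Gallery (LiteRT)", 3.61), ("Ollama (GGUF)", 9.6)]:
    bits = implied_bits_per_param(size_gb, E4B_TOTAL_PARAMS_B)
    print(f"{source}: {size_gb} GB, about {bits:.1f} bits per parameter")

print(f"size ratio: {9.6 / 3.61:.2f}x")
```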

Ollama

If you have not run a local model before, start with the Ollama tutorial to get the runtime installed. Once that is in place, one command boots any Gemma 4 size:

# E2B — the lightest, ideal for laptops and phones
ollama run gemma4:e2b

# E4B
ollama run gemma4:e4b

# 26B MoE — needs more RAM, but fast inference
ollama run gemma4:26b

# 31B Dense — highest quality, heaviest footprint
ollama run gemma4:31b

More info at ollama.com/library/gemma4.
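
Once a model is pulled, you can also talk to it from code through Ollama's local REST API. The sketch below uses only the Python standard library and the `/api/generate` endpoint on Ollama's default port. It assumes the Ollama server is running locally and that the `gemma4:e2b` tag from the commands above has been downloaded.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(prompt: str, model: str = "gemma4:e2b") -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "gemma4:e2b", timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server up, `print(generate("Summarize Apache 2.0 in one sentence."))` returns the model's reply as a plain string. Setting `stream` to false trades responsiveness for a simpler single-JSON response.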

llama.cpp

For finer control, llama.cpp + a GGUF quantization works well:

# macOS install
brew install llama.cpp

# Launch an OpenAI-compatible API server
llama-server -hf ggml-org/gemma-4-E2B-it-GGUF

# Pick a quantization level (Q4_K_M is the sweet spot)
llama-server -hf ggml-org/gemma-4-26b-a4b-it-GGUF:Q4_K_M
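
Since `llama-server` speaks the OpenAI chat completions format, a client needs nothing beyond an HTTP POST. This is a minimal standard-library sketch. It assumes the server is listening on llama.cpp's default `localhost:8080` and leaves out the `model` field on the assumption that the server answers with whichever model it was launched with; adjust both if your setup differs.

```python
import json
import urllib.request
from typing import Optional

SERVER_URL = "http://localhost:8080/v1/chat/completions"  # assumed default port


def build_chat_payload(user_text: str, system_text: Optional[str] = None,
                       max_tokens: int = 256) -> dict:
    """Assemble an OpenAI-style chat completion request body."""
    messages = []
    if system_text:
        messages.append({"role": "system", "content": system_text})
    messages.append({"role": "user", "content": user_text})
    return {"messages": messages, "max_tokens": max_tokens}


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-style response."""
    return response["choices"][0]["message"]["content"]


def chat(user_text: str, system_text: Optional[str] = None) -> str:
    """Send one chat turn to the local server (needs llama-server running)."""
    request = urllib.request.Request(
        SERVER_URL,
        data=json.dumps(build_chat_payload(user_text, system_text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        return extract_reply(json.loads(resp.read()))
```

Because the wire format matches OpenAI's, existing OpenAI client libraries should also work if you point their base URL at the local server.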

MLX on Apple Silicon

Mac users can lean on Apple’s own MLX framework, which is specifically tuned for M-series chips:

# Install mlx-vlm
pip install -U mlx-vlm

# Multimodal inference — image + text
mlx_vlm.generate \
  --model google/gemma-4-E4B-it \
  --image photo.jpg \
  --prompt "Describe this image"

# TurboQuant KV-cache quantization cuts KV-cache memory roughly 4×
mlx_vlm.generate \
  --model mlx-community/gemma-4-26B-A4B-it \
  --prompt "Explain quantum computing" \
  --kv-bits 3.5 \
  --kv-quant-scheme turboquant
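
The roughly 4× figure follows from the bit widths. A KV cache stores one key and one value tensor per layer, so its size scales linearly with bits per value. The sketch below compares a 16-bit cache with a 3.5-bit one. The layer count, head count, head dimension, and sequence length are hypothetical placeholders, not Gemma 4's real configuration, and the estimate ignores the scales and metadata a quantized cache also has to store.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bits_per_value: float) -> float:
    """Approximate KV cache size in bytes for one sequence."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * bits_per_value / 8


# Hypothetical shape for illustration only.
SHAPE = dict(layers=30, kv_heads=8, head_dim=128, seq_len=32_768)

fp16 = kv_cache_bytes(**SHAPE, bits_per_value=16)
quant = kv_cache_bytes(**SHAPE, bits_per_value=3.5)

print(f"16-bit cache:  {fp16 / 1e9:.2f} GB")
print(f"3.5-bit cache: {quant / 1e9:.2f} GB")
print(f"reduction:     {fp16 / quant:.2f}x")
```

The raw ratio is 16 / 3.5, a little over 4.5×. Quantization overhead would eat into that, which is consistent with the "roughly 4×" quoted above.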

Google AI Studio

If you do not want to run anything locally, the 31B and 26B MoE models are hosted on Google AI Studio — no download, just open the site and start chatting. E2B and E4B can be tried through the Google AI Edge Gallery.

Hugging Face transformers

Python developers can load Gemma 4 directly through transformers:

from transformers import pipeline

# Quick start with pipeline
pipe = pipeline("any-to-any", model="google/gemma-4-e2b-it")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "photo.jpg"},
            {"type": "text", "text": "What is in this image?"},
        ],
    }
]

output = pipe(messages, max_new_tokens=200)
print(output)

Apache 2.0 — real open source

Worth calling out the license change explicitly. Gemma 3 shipped under Google’s custom Gemma Terms of Use — free, but with a handful of restrictions. Gemma 4 switches to Apache 2.0, which means:

No monthly active user (MAU) cap.
No additional acceptable-use-policy strings attached.
Full commercial freedom.
Modify, fork, and redistribute at will.

That puts Gemma 4 on the same footing as Mistral’s Apache 2.0 releases, and arguably makes it more permissive than Meta’s Llama (custom license, 700 million MAU cap). For enterprise teams that have been reluctant to build on Llama because of the MAU clause, this removes a real blocker.

Closing thoughts

Gemma 4 is a serious addition to the open-weight landscape. The 31B and 26B MoE go head-to-head with the best open models for raw reasoning, while E2B and E4B push multimodal AI all the way down to phones and IoT devices. Combine that with Apache 2.0, native agent tooling, and 140+ language coverage, and it becomes a model family worth seriously considering — whether you are building a chatbot, running RAG, doing code generation, or embedding offline AI in an Android app.

If you have not tried running open-source models locally yet, start with the Ollama tutorial. Once Ollama is installed, a single ollama run gemma4:e2b gets you there.

References

Gemma 4: Byte for byte, the most capable open models — Google official announcement
Gemma 4 — Google DeepMind — DeepMind model page
Gemma 4 Model Card — Official specs and benchmarks
Bring state-of-the-art agentic skills to the edge with Gemma 4 — Google Developers Blog, edge deployment deep dive
Gemma 4: The new standard for local agentic intelligence on Android — Android Developers Blog
Announcing Gemma 4 in the AICore Developer Preview — AICore Developer Preview announcement
Welcome Gemma 4: Frontier multimodal intelligence on device — Hugging Face technical walkthrough
Google announces open Gemma 4 model with Apache 2.0 license — 9to5Google coverage