Gemma 4 Was Released Without MTP Data — Here's Why That Matters

AI LLM Gemma Open Source

Google released Gemma 4 on April 3, 2026 — a family of open-weight models with strong benchmark results, multimodal capabilities, and up to 256K context. On paper, it’s an impressive release. But within hours, the community discovered something missing: the Multi-Token Prediction heads had been stripped from the public weights.

The model was trained with MTP. Google’s own LiteRT framework includes the MTP components. But the version everyone can download from HuggingFace? Standard autoregressive generation only. No speed boost. No speculative decoding.

This post explains what MTP is, why it matters, and what this decision means for anyone running Gemma 4 on their own hardware.

What Is Gemma 4?

Gemma 4 is Google DeepMind’s latest open-weight model family, released under the Apache 2.0 license. It comes in four sizes:

| Model | Parameters | Type | Notable Features |
|---|---|---|---|
| Gemma 4 E2B | 2.3B effective | Dense | Vision + Audio |
| Gemma 4 E4B | 4.5B effective | Dense | Vision + Audio |
| Gemma 4 26B-A4B | 26B total / 4B active | Mixture of Experts | Vision |
| Gemma 4 31B | 31B | Dense | Vision |

Key capabilities include native multimodal support, function calling, structured JSON output, and training on 140+ languages. The 31B variant ranks #3 on the LMArena text leaderboard.

Under the hood, Gemma 4 introduces several architectural innovations: alternating local sliding-window and global attention layers, proportional RoPE (p-RoPE), Per-Layer Embeddings (PLE), shared KV cache, and a “Keys equal Values” memory optimization.

By the numbers, this is a strong release. The problem is what isn’t in the public weights.

What Is Multi-Token Prediction?

Standard large language models generate text one token at a time. Each token requires a full forward pass through the model. The next token can’t start until the previous one is complete. This is autoregressive decoding, and it’s inherently sequential.
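The sequential bottleneck can be made concrete with a toy sketch. The "model" below is just a stand-in that returns a fixed continuation; the point is the loop structure: one forward pass per token, and no pass can start until the previous one finishes.

```python
# Toy illustration of autoregressive decoding. FIXED_OUTPUT stands in for
# whatever a real LLM would generate; the loop shape is what matters.
FIXED_OUTPUT = [3, 1, 4, 1, 5, 9]

def forward(context):
    """Stand-in for one full model forward pass: returns the next token."""
    return FIXED_OUTPUT[len(context)]

def generate(num_tokens):
    context, passes = [], 0
    for _ in range(num_tokens):
        context.append(forward(context))  # one sequential pass per token
        passes += 1
    return context, passes

tokens, passes = generate(6)
print(tokens, passes)  # 6 tokens cost 6 sequential forward passes
```

Generating N tokens always costs N sequential passes here, which is exactly the cost structure MTP attacks.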

Diagram comparing standard autoregressive decoding (one token per step) with Multi-Token Prediction (multiple tokens per step)

Multi-Token Prediction (MTP) changes this by adding extra prediction heads to the model. Instead of predicting just the next token, the model predicts tokens N+1, N+2, N+3, and so on — all in a single forward pass.

Here’s how it works:

  1. Training phase: Additional lightweight prediction heads are trained alongside the main model. Each head learns to predict a different future position (1 ahead, 2 ahead, 3 ahead, etc.)
  2. Inference phase: The extra heads generate “draft” tokens in parallel. The main model then verifies all of them in a single forward pass.
  3. Verification: If the draft tokens match what the main model would have generated, they’re all accepted at once — skipping multiple sequential decode steps. If a draft token is wrong, generation falls back to that position.
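The three phases above can be sketched with toy stand-ins. Both "models" below are hypothetical lookup functions, not real networks, and the draft heads are deliberately wrong at one position to show the fallback path; the control flow is the part that mirrors MTP-style draft-and-verify.

```python
# Sketch of MTP-style draft-and-verify decoding (toy stand-in models).
TARGET = [3, 1, 4, 1, 5, 9, 2, 6]   # the sequence the main model "wants"

def main_model_next(context):
    """Stand-in for the main model's verified next-token prediction."""
    return TARGET[len(context)]

def draft_heads(context, k=3):
    """Stand-in for lightweight MTP heads: k draft tokens in one shot.
    Deliberately wrong at position 4 to exercise the fallback path."""
    return [0 if len(context) + i == 4 else TARGET[len(context) + i]
            for i in range(k)]

def generate(n, k=3):
    context, verify_passes = [], 0
    while len(context) < n:
        drafts = draft_heads(context, k)
        verify_passes += 1                 # one pass verifies all k drafts
        for t in drafts:
            if len(context) >= n:
                break
            correct = main_model_next(context)
            if t == correct:
                context.append(t)          # draft accepted
            else:
                context.append(correct)    # mismatch: take the main model's token
                break
    return context, verify_passes

tokens, passes = generate(8)
print(tokens, passes)   # identical output to standard decoding, fewer passes
```

Here 8 tokens are produced in 3 verification passes instead of 8 sequential ones, and the output matches standard decoding exactly, because every accepted token was checked against the main model.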

This is closely related to speculative decoding, but with a key advantage: the draft tokens come from the model itself rather than requiring a separate, smaller “draft model.”

Architecture diagram showing how MTP prediction heads attach to the main transformer model to generate multiple draft tokens simultaneously

How Much Faster Is MTP?

The speedup depends on how often the draft tokens are correct (the “acceptance rate”). DeepSeek V3 demonstrated the real-world impact:

| Metric | Value |
|---|---|
| Average acceptance length | 2.4 tokens per verification step |
| Inference speedup | 1.8x average (up to 2.1x peak) |
| Output quality impact | Zero — all tokens verified by the main model |

An average acceptance length of 2.4 means that each forward pass through the main model produces 2.4 tokens on average instead of 1. The output is mathematically identical to standard decoding — every token is verified. You get the same quality at nearly double the speed.
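The arithmetic behind that claim is straightforward. A quick back-of-envelope check, using the DeepSeek V3 numbers from the table above:

```python
# How an acceptance length of 2.4 translates into fewer main-model passes.
tokens = 1000                     # tokens to generate (arbitrary example)
acceptance_length = 2.4           # DeepSeek V3's reported average

passes_standard = tokens                  # 1 token per main-model pass
passes_mtp = tokens / acceptance_length   # 2.4 tokens per pass

print(passes_standard, round(passes_mtp))  # 1000 vs 417 passes
# The measured 1.8x speedup is below the ideal 2.4x pass reduction because
# the draft heads themselves add a small per-step cost.
```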


What Happened with Gemma 4

A HuggingFace user (@shadowlilac) discovered that Google’s LiteRT package for Gemma 4 contains MTP prediction heads and multi-token prediction functionality. But the publicly released weights on HuggingFace have none of it.

The MTP components were deliberately stripped:

  • No MTP heads in the checkpoint
  • No MTP in the model config
  • No MTP in the forward pass

Diagram showing Gemma 4's training included MTP heads, but the public HuggingFace release has them stripped while Google's LiteRT version retains them

Google’s Explanation

A Google engineer (@srikanta-221) confirmed this was intentional:

The public model exposes only a standard autoregressive interface “for broad compatibility.” MTP heads are excluded from the model config, forward pass, and checkpoint. This ensures compatibility with HuggingFace Transformers APIs and maintains consistent checkpoint and runtime behavior.

Google frames MTP as a “deployment-time optimization” rather than a core model feature. The MTP prediction heads are preserved only in the LiteRT-exported models — Google’s own on-device inference framework.

Why This Is a Problem

The explanation doesn’t hold up under scrutiny:

1. The model was trained with MTP. The capability exists. Stripping it from the release is a choice, not a technical limitation.

2. Third-party engines can’t implement it. vLLM, llama.cpp, SGLang, and other inference frameworks cannot use MTP-based speculative decoding without the prediction heads. These engines serve the vast majority of open-source LLM deployments.

3. Users get the slow version. Without MTP, Gemma 4 runs at standard autoregressive speeds. The performance gap is already visible in practice:

| Model | Hardware | Speed | Notes |
|---|---|---|---|
| Gemma 4 26B-A4B | 5060 Ti 16GB | 11 tok/s | No MTP, standard decoding |
| Qwen 3.5 35B-A3B | 5060 Ti 16GB | 60+ tok/s | Comparable MoE model |
| Gemma 4 E4B | RTX 4090 (vLLM) | ~9 tok/s | FlashAttention fallback issues |

4. It creates ecosystem lock-in. Google’s own LiteRT framework gets the speed advantage. Everyone else gets a slower model. For an “open-weight” Apache 2.0 release, this is a significant asymmetry.

How Speculative Decoding Works (and Why MTP Is Better)

To understand why the missing MTP heads matter, it helps to see where MTP fits in the evolution of inference optimization.

Comparison of speculative decoding approaches: traditional speculative decoding with a separate draft model versus MTP with built-in prediction heads

Approach 1: Traditional Speculative Decoding

A separate, smaller “draft model” proposes tokens. The main model verifies them in parallel. If the drafts are correct, multiple tokens are accepted per step.

  • Pros: Works with any model pair
  • Cons: Requires maintaining and loading a second model; draft model quality limits speedup; extra memory overhead
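The two-model structure can be sketched as follows. Both models here are hypothetical lookup stand-ins, and the draft model is deliberately wrong at one position; the point is that drafting and verification live in two separate models, which is the overhead MTP eliminates.

```python
# Toy contrast with MTP: traditional speculative decoding needs TWO models.
TARGET = [7, 2, 7, 1, 8, 2]

def main_model_next(context):
    """Stand-in for the large main model (ground truth)."""
    return TARGET[len(context)]

def small_draft_model_next(context):
    """Stand-in for a separate, smaller draft model; wrong at position 2."""
    pos = len(context)
    return 9 if pos == 2 else TARGET[pos]

def speculative_step(context, k=3):
    # Phase 1: the draft model proposes k tokens sequentially (it is cheap).
    drafts, ctx = [], list(context)
    for _ in range(k):
        t = small_draft_model_next(ctx)
        drafts.append(t)
        ctx.append(t)
    # Phase 2: one main-model pass verifies; accept the matching prefix,
    # plus the main model's own token at the first mismatch.
    accepted = []
    for t in drafts:
        correct = main_model_next(context + accepted)
        if t == correct:
            accepted.append(t)
        else:
            accepted.append(correct)
            break
    return accepted

print(speculative_step([]))   # [7, 2, 7]: two drafts accepted, one corrected
```

The speedup is capped by how well the small model imitates the big one, which is exactly why built-in MTP heads, trained on the main model's own representations, tend to do better.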

Approach 2: MTP (Built-in Prediction Heads)

The main model has its own lightweight prediction heads that generate draft tokens. No separate model needed.

  • Pros: No extra model needed; tighter integration means higher acceptance rates; lower memory overhead
  • Cons: Only works if the prediction heads are included in the release

Why MTP Wins

MTP prediction heads are trained alongside the main model. They share the same internal representations and learn the model’s own token distribution. This typically produces higher acceptance rates than an external draft model, which means more tokens accepted per verification step and faster generation overall.

The prediction heads are also small — typically adding only 1-3% to the model’s total parameter count. The memory overhead is negligible compared to loading a separate draft model.
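A rough fp16 memory comparison makes the overhead gap concrete. The 2% head overhead is taken from the 1-3% range above; the 1.5B-parameter draft model is a hypothetical but typical size for traditional speculative decoding.

```python
# Back-of-envelope memory comparison at fp16 (2 bytes per parameter).
def gib(params):
    """Convert a parameter count to GiB at 2 bytes/param."""
    return params * 2 / 2**30

main = 26e9                 # Gemma 4 26B-A4B total parameters
mtp_heads = 0.02 * main     # MTP heads: ~2% of the main model (assumed)
draft_model = 1.5e9         # hypothetical separate draft model

print(round(gib(mtp_heads), 2), round(gib(draft_model), 2))
# ~0.97 GiB for built-in heads vs ~2.79 GiB for a separate draft model
```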

The Broader Impact

This isn’t just about Gemma 4. The decision sets a precedent for how “open” open-weight releases actually are.

What users lose:

  • MTP-based speculative decoding on any third-party inference engine
  • The ability to fine-tune or experiment with the MTP heads
  • Performance parity with Google’s own deployment tools

What users still have:

  • The base model weights (which are genuinely good)
  • Traditional speculative decoding using a separate draft model (vLLM issue #38893 tracks Eagle3 support for Gemma 4)
  • Standard quantization and optimization techniques

Community response has been direct. The 24-hour consensus was that Gemma 4’s benchmark results are competitive — it ties with or slightly trails Qwen 3.5 — but the product “isn’t finished.” Speed, stability, and tooling need work. Additional issues include HuggingFace Transformers initially lacking Gemma 4 architecture support, PEFT not handling the new layer types, and Mac users experiencing crashes loading larger models.

What Can You Do?

If you’re evaluating Gemma 4 for deployment, here are practical options:

Use traditional speculative decoding. External draft models can still accelerate Gemma 4 inference. Frameworks like vLLM are adding Eagle3 speculative decoding support specifically for Gemma 4. The speedup won’t match built-in MTP, but it’s better than nothing.

Consider alternatives for speed-critical workloads. Qwen 3.5 delivers significantly better tokens-per-second on equivalent hardware. If inference speed is your primary constraint, Qwen currently offers a better speed-to-quality ratio.

Watch for community workarounds. The LiteRT exports contain the MTP heads. Researchers may find ways to extract and reattach them to the HuggingFace weights, though Google has not officially supported this path.

Provide feedback. Google’s engineers are actively monitoring the HuggingFace discussion threads. Clear, technical requests for MTP head release carry weight.

Conclusion

Gemma 4 is a capable model family with genuine architectural innovations and strong benchmark results. The decision to strip MTP prediction heads from the public release — while retaining them in Google’s own LiteRT framework — undermines the “open” in open-weight.

MTP is not a minor optimization. It can deliver 1.5–2x inference speedups with zero impact on output quality. Withholding it from the public weights while the model was clearly trained with it creates a two-tier system: fast inference for Google’s tools, slow inference for everyone else.

For the open-source AI community, the message is clear: check what’s actually in the weights, not just the benchmarks. An open license doesn’t always mean an open release.


Built with FlowHunt. Stay up to date with the latest developments in open-source AI on our blog.


Viktor Zeman
CEO, AI Engineer

Viktor Zeman is a co-owner of QualityUnit. Even after 20 years of leading the company, he remains primarily a software engineer, specializing in AI, programmatic SEO, and backend development. He has contributed to numerous projects, including LiveAgent, PostAffiliatePro, FlowHunt, UrlsLab, and many others.
