Gemma 4 Was Released Without MTP Data — Here's Why That Matters
Google stripped MTP prediction heads from Gemma 4’s public release while keeping them in its own LiteRT framework. Here’s what that means for inference speed and open-source AI.
AI
LLM
Gemma
Open Source
Inference
Multi-Token Prediction
Google released Gemma 4 on April 3, 2026 — a family of open-weight models with strong benchmark results, multimodal capabilities, and up to 256K context. On paper, it’s an impressive release. But within hours, the community discovered something missing: the Multi-Token Prediction heads had been stripped from the public weights.
The model was trained with MTP. Google’s own LiteRT framework includes the MTP components. But the version everyone can download from HuggingFace? Standard autoregressive generation only. No speed boost. No speculative decoding.
This post explains what MTP is, why it matters, and what this decision means for anyone running Gemma 4 on their own hardware.
What Is Gemma 4?
Gemma 4 is Google DeepMind’s latest open-weight model family, released under the Apache 2.0 license. It comes in four sizes:
| Model | Parameters | Type | Notable Features |
|---|---|---|---|
| Gemma 4 E2B | 2.3B effective | Dense | Vision + Audio |
| Gemma 4 E4B | 4.5B effective | Dense | Vision + Audio |
| Gemma 4 26B-A4B | 26B total / 4B active | Mixture of Experts | Vision |
| Gemma 4 31B | 31B | Dense | Vision |
Key capabilities include native multimodal support, function calling, structured JSON output, and training on 140+ languages. The 31B variant ranks #3 on the LMArena text leaderboard.
Under the hood, Gemma 4 introduces several architectural innovations: alternating local sliding-window and global attention layers, proportional RoPE (p-RoPE), Per-Layer Embeddings (PLE), shared KV cache, and a “Keys equal Values” memory optimization.
By the numbers, this is a strong release. The problem is what isn’t in the public weights.
What Is Multi-Token Prediction?
Standard large language models generate text one token at a time. Each token requires a full forward pass through the model. The next token can’t start until the previous one is complete. This is autoregressive decoding, and it’s inherently sequential.
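The sequential bottleneck is easy to see in a toy sketch. Here `model` is a stand-in for a full transformer forward pass (it just returns the last token plus one), but the loop structure is the same as real greedy decoding: one pass per token, each step waiting on the last.

```python
def greedy_decode(model, prompt, n_new):
    """Standard autoregressive decoding: one forward pass per new token.

    `model(tokens)` stands in for a full transformer forward pass that
    returns the next-token prediction for the sequence.
    """
    tokens = list(prompt)
    for _ in range(n_new):
        next_tok = model(tokens)  # full forward pass
        tokens.append(next_tok)   # the next step cannot start until this finishes
    return tokens

# Toy "model": always predicts the previous token plus one.
toy_model = lambda toks: toks[-1] + 1
print(greedy_decode(toy_model, [0], 4))  # [0, 1, 2, 3, 4]
```

Generating N tokens costs N sequential forward passes, regardless of how wide your hardware is.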
Multi-Token Prediction (MTP) changes this by adding extra prediction heads to the model. Instead of predicting just the next token, the model predicts tokens N+1, N+2, N+3, and so on — all in a single forward pass.
Here’s how it works:
Training phase: Additional lightweight prediction heads are trained alongside the main model. Each head learns to predict a different future position (1 ahead, 2 ahead, 3 ahead, etc.)
Inference phase: The extra heads generate “draft” tokens in parallel. The main model then verifies all of them in a single forward pass.
Verification: If the draft tokens match what the main model would have generated, they’re all accepted at once — skipping multiple sequential decode steps. If a draft token is wrong, generation falls back to that position.
This is closely related to speculative decoding, but with a key advantage: the draft tokens come from the model itself rather than requiring a separate, smaller “draft model.”
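The draft-and-verify loop above can be sketched with the same toy stand-ins as before. This is an illustrative simplification, not Gemma 4's actual head design: each `head` drafts one future position, and the main model checks the drafts (in a real engine the verification happens in one batched forward pass; here each position is checked sequentially for clarity).

```python
def mtp_decode(model, heads, prompt, n_new):
    """Sketch of MTP-style decoding: lightweight heads draft tokens ahead,
    the main model verifies them, and a wrong draft falls back cleanly."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < n_new:
        drafts = [h(tokens) for h in heads]  # all heads draft in parallel
        accepted = 0
        for i, d in enumerate(drafts):
            # Real engines verify all drafts in one batched forward pass;
            # we check position by position for readability.
            if d == model(tokens + drafts[:i]):
                tokens.append(d)
                accepted += 1
            else:
                break  # fall back at the first wrong draft
        if accepted == 0:
            tokens.append(model(tokens))  # standard decoding step as fallback
    return tokens[: len(prompt) + n_new]

# Toy model and heads: head i drafts the token i+1 positions ahead.
toy_model = lambda toks: toks[-1] + 1
toy_heads = [lambda t, i=i: t[-1] + 1 + i for i in range(3)]
print(mtp_decode(toy_model, toy_heads, [0], 6))  # [0, 1, 2, 3, 4, 5, 6]
```

When the heads agree with the main model (as in this toy case), each verification pass yields several tokens; when they miss, the output degrades gracefully to ordinary decoding, which is why verified MTP never changes the output.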
How Much Faster Is MTP?
The speedup depends on how often the draft tokens are correct (the “acceptance rate”). DeepSeek V3 demonstrated the real-world impact:
| Metric | Value |
|---|---|
| Average acceptance length | 2.4 tokens per verification step |
| Inference speedup | 1.8x average (up to 2.1x peak) |
| Output quality impact | Zero — all tokens verified by the main model |
An acceptance length of 2.4 means that on average, each forward pass through the main model produces 2.4 tokens instead of 1. The output is mathematically identical to standard decoding — every token is verified. You get the same quality at nearly double the speed.
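The gap between the 2.4 acceptance length and the 1.8–2.1x observed speedup comes from the cost of running the draft heads themselves. A back-of-envelope model (the 15% per-pass overhead figure is an illustrative assumption, not a measured number):

```python
def mtp_speedup(accept_len, head_overhead=0.15):
    """Rough effective speedup: each verification pass yields `accept_len`
    tokens instead of 1, discounted by a fractional per-pass cost for the
    draft heads. The default overhead is an illustrative assumption."""
    return accept_len / (1 + head_overhead)

print(round(mtp_speedup(2.4), 2))  # ~2.09 under these assumptions
```

With these toy numbers, a 2.4 acceptance length lands right around the 2.1x peak DeepSeek reported; lower acceptance lengths on harder text pull the average down toward 1.8x.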
The Discovery
A HuggingFace user (@shadowlilac) discovered that Google’s LiteRT package for Gemma 4 contains MTP prediction heads and multi-token prediction functionality. But the publicly released weights on HuggingFace have none of it.
The MTP components were deliberately stripped:
No MTP heads in the checkpoint
No MTP in the model config
No MTP in the forward pass
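You can verify this kind of claim against any checkpoint yourself by scanning the tensor names in the released weights. The substrings below are guesses, since Google has not published a naming scheme for the heads; adjust them to whatever the LiteRT export actually uses.

```python
def find_mtp_keys(state_dict_keys):
    """Scan checkpoint tensor names for anything that looks like an MTP
    prediction head. The substrings are assumptions, not documented names."""
    suspects = ("mtp", "multi_token", "draft_head")
    return [k for k in state_dict_keys if any(s in k.lower() for s in suspects)]

# Typical keys from a standard autoregressive checkpoint -- no matches:
keys = ["model.layers.0.self_attn.q_proj.weight", "lm_head.weight"]
print(find_mtp_keys(keys))  # []
```

Against the public Gemma 4 checkpoint this kind of scan comes back empty; a release that retained the heads would surface extra key groups here.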
Google’s Explanation
A Google engineer (@srikanta-221) confirmed this was intentional:
The public model exposes only a standard autoregressive interface “for broad compatibility.” MTP heads are excluded from the model config, forward pass, and checkpoint. This ensures compatibility with HuggingFace Transformers APIs and maintains consistent checkpoint and runtime behavior.
Google frames MTP as a “deployment-time optimization” rather than a core model feature. The MTP prediction heads are preserved only in the LiteRT-exported models — Google’s own on-device inference framework.
Why This Is a Problem
The explanation doesn’t hold up under scrutiny:
1. The model was trained with MTP. The capability exists. Stripping it from the release is a choice, not a technical limitation.
2. Third-party engines can’t implement it. vLLM, llama.cpp, SGLang, and other inference frameworks cannot use MTP-based speculative decoding without the prediction heads. These engines serve the vast majority of open-source LLM deployments.
3. Users get the slow version. Without MTP, Gemma 4 runs at standard autoregressive speeds. The performance gap is already visible in practice:
| Model | Hardware | Speed | Notes |
|---|---|---|---|
| Gemma 4 26B-A4B | 5060 Ti 16GB | 11 tok/s | No MTP, standard decoding |
| Qwen 3.5 35B-A3B | 5060 Ti 16GB | 60+ tok/s | Comparable MoE model |
| Gemma 4 E4B | RTX 4090 (vLLM) | ~9 tok/s | FlashAttention fallback issues |
4. It creates ecosystem lock-in. Google’s own LiteRT framework gets the speed advantage. Everyone else gets a slower model. For an “open-weight” Apache 2.0 release, this is a significant asymmetry.
How Speculative Decoding Works (and Why MTP Is Better)
To understand why the missing MTP heads matter, it helps to see where MTP fits in the evolution of inference optimization.
Approach 1: Traditional Speculative Decoding
A separate, smaller “draft model” proposes tokens. The main model verifies them in parallel. If the drafts are correct, multiple tokens are accepted per step.
Pros: Works with any model pair
Cons: Requires maintaining and loading a second model; draft model quality limits speedup; extra memory overhead
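A minimal sketch of the draft-then-verify loop, with both models as toy callables (in production the draft is a real smaller LLM and verification is one batched forward pass, not a Python loop):

```python
def speculative_decode(main, draft, prompt, n_new, k=3):
    """Classic speculative decoding sketch: a separate small `draft` model
    proposes k tokens, the `main` model verifies them, and generation falls
    back to a single main-model step when every draft is rejected."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < n_new:
        proposal = []
        for _ in range(k):                    # k cheap draft-model passes
            proposal.append(draft(tokens + proposal))
        accepted = []
        for d in proposal:                    # one batched verify pass in practice
            if d == main(tokens + accepted):
                accepted.append(d)
            else:
                break
        tokens += accepted or [main(tokens)]  # guarantee progress on full rejection
    return tokens[: len(prompt) + n_new]

main_model = lambda t: t[-1] + 1  # toy stand-ins, not real LLMs
print(speculative_decode(main_model, main_model, [0], 5))  # [0, 1, 2, 3, 4, 5]
```

The structure is identical to MTP's verify step; the difference is where the drafts come from, which is exactly what the next approach changes.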
Approach 2: MTP (Built-in Prediction Heads)
The main model has its own lightweight prediction heads that generate draft tokens. No separate model needed.
Pros: No extra model needed; tighter integration means higher acceptance rates; lower memory overhead
Cons: Only works if the prediction heads are included in the release
Why MTP Wins
MTP prediction heads are trained alongside the main model. They share the same internal representations and learn the model’s own token distribution. This typically produces higher acceptance rates than an external draft model, which means more tokens accepted per verification step and faster generation overall.
The prediction heads are also small — typically adding only 1-3% to the model’s total parameter count. The memory overhead is negligible compared to loading a separate draft model.
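A quick sanity check on that 1–3% figure, under the assumption (borrowed from designs like DeepSeek's MTP module, not from any published Gemma 4 spec) that each head is roughly one extra transformer block of ~12·d² parameters sharing the main unembedding:

```python
def head_overhead_pct(hidden, n_heads, total_params):
    """Back-of-envelope MTP head cost. Assumes each head is roughly one
    extra transformer block (~12 * hidden^2 params) that shares the main
    model's unembedding. Illustrative assumptions, not Gemma 4's design."""
    per_head = 12 * hidden ** 2
    return 100 * n_heads * per_head / total_params

# Illustrative dimensions, not Gemma 4's real config:
print(round(head_overhead_pct(4096, 3, 31e9), 1))  # ~1.9
```

With three heads on a 31B-parameter model, the toy estimate lands around 2%, consistent with the 1–3% range quoted above.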
The Broader Impact
This isn’t just about Gemma 4. The decision sets a precedent for how “open” open-weight releases actually are.
What users lose:
MTP-based speculative decoding on any third-party inference engine
The ability to fine-tune or experiment with the MTP heads
Performance parity with Google’s own deployment tools
What users still have:
The base model weights (which are genuinely good)
Traditional speculative decoding using a separate draft model (vLLM issue #38893 tracks Eagle3 support for Gemma 4)
Standard quantization and optimization techniques
Community response has been direct. The 24-hour consensus was that Gemma 4’s benchmark results are competitive — it ties with or slightly trails Qwen 3.5 — but the product “isn’t finished.” Speed, stability, and tooling need work. Additional issues include HuggingFace Transformers initially lacking Gemma 4 architecture support, PEFT not handling the new layer types, and Mac users experiencing crashes loading larger models.
What Can You Do?
If you’re evaluating Gemma 4 for deployment, here are practical options:
Use traditional speculative decoding. External draft models can still accelerate Gemma 4 inference. Frameworks like vLLM are adding Eagle3 speculative decoding support specifically for Gemma 4. The speedup won’t match built-in MTP, but it’s better than nothing.
Consider alternatives for speed-critical workloads. Qwen 3.5 delivers significantly better tokens-per-second on equivalent hardware. If inference speed is your primary constraint, Qwen currently offers a better speed-to-quality ratio.
Watch for community workarounds. The LiteRT exports contain the MTP heads. Researchers may find ways to extract and reattach them to the HuggingFace weights, though Google has not officially supported this path.
Provide feedback. Google’s engineers are actively monitoring the HuggingFace discussion threads. Clear, technical requests for MTP head release carry weight.
Conclusion
Gemma 4 is a capable model family with genuine architectural innovations and strong benchmark results. The decision to strip MTP prediction heads from the public release — while retaining them in Google’s own LiteRT framework — undermines the “open” in open-weight.
MTP is not a minor optimization. It can deliver 1.5–2x inference speedups with zero impact on output quality. Withholding it from the public weights while the model was clearly trained with it creates a two-tier system: fast inference for Google’s tools, slow inference for everyone else.
For the open-source AI community, the message is clear: check what’s actually in the weights, not just the benchmarks. An open license doesn’t always mean an open release.
Built with FlowHunt. Stay up to date with the latest developments in open-source AI on our blog.
Frequently asked questions
What is Multi-Token Prediction (MTP)?
Multi-Token Prediction is a technique where an LLM predicts multiple future tokens in a single forward pass instead of one token at a time. Additional prediction heads are trained alongside the main model to draft tokens N+1, N+2, N+3, etc. simultaneously, which can then be verified in parallel by the main model. This enables 1.5–2x inference speedups with no loss in output quality.
Why doesn't Gemma 4's public release include MTP?
Gemma 4 was trained with MTP prediction heads, and they are present in Google's LiteRT (on-device inference) exports. However, the publicly released HuggingFace weights have the MTP heads deliberately stripped out. Google says this was done for 'broad compatibility' with existing inference frameworks.
What does the missing MTP mean for third-party inference engines?
Without MTP heads, third-party inference engines like vLLM, llama.cpp, and SGLang cannot use built-in speculative decoding for Gemma 4. Users are stuck with standard autoregressive generation, which is significantly slower. Benchmarks show Gemma 4 generating only 11 tokens/sec on hardware where comparable models achieve 60+ tokens/sec.
How does speculative decoding relate to MTP?
Speculative decoding is an inference acceleration technique where a fast 'draft' model proposes multiple tokens at once, and the main model verifies them in a single forward pass. If the draft tokens are correct, multiple decode steps are effectively skipped. MTP is a variant where the draft tokens come from the model's own built-in prediction heads rather than a separate model.
Will Google release the MTP heads for the public weights?
As of April 2026, Google has not announced plans to release the MTP prediction heads for the HuggingFace weights. They are currently only available in the LiteRT-exported models, which limits their use to Google's own inference framework. The community continues to request their release.
Viktor Zeman is a co-owner of QualityUnit. Even after 20 years of leading the company, he remains primarily a software engineer, specializing in AI, programmatic SEO, and backend development. He has contributed to numerous projects, including LiveAgent, PostAffiliatePro, FlowHunt, UrlsLab, and many others.
Viktor Zeman
CEO, AI Engineer