Gemma 4 Was Released Without MTP Data — Here's Why That Matters

AI LLM Gemma Open Source

Google released Gemma 4 on April 3, 2026 — a family of open-weight models with strong benchmark results, multimodal capabilities, and up to 256K context. On paper, it’s an impressive release. But within hours, the community discovered something missing: the Multi-Token Prediction heads had been stripped from the public weights.

The model was trained with MTP. Google’s own LiteRT framework includes the MTP components. But the version everyone can download from HuggingFace? Standard autoregressive generation only. No speed boost. No speculative decoding.

This post explains what MTP is, why it matters, and what this decision means for anyone running Gemma 4 on their own hardware.

What Is Gemma 4?

Gemma 4 is Google DeepMind’s latest open-weight model family, released under the Apache 2.0 license. It comes in four sizes:

| Model | Parameters | Type | Notable Features |
|---|---|---|---|
| Gemma 4 E2B | 2.3B effective | Dense | Vision + Audio |
| Gemma 4 E4B | 4.5B effective | Dense | Vision + Audio |
| Gemma 4 26B-A4B | 26B total / 4B active | Mixture of Experts | Vision |
| Gemma 4 31B | 31B | Dense | Vision |

Key capabilities include native multimodal support, function calling, structured JSON output, and training on 140+ languages. The 31B variant ranks #3 on the LMArena text leaderboard.

Under the hood, Gemma 4 introduces several architectural innovations: alternating local sliding-window and global attention layers, proportional RoPE (p-RoPE), Per-Layer Embeddings (PLE), shared KV cache, and a “Keys equal Values” memory optimization.

By the numbers, this is a strong release. The problem is what isn’t in the public weights.

What Is Multi-Token Prediction?

Standard large language models generate text one token at a time. Each token requires a full forward pass through the model. The next token can’t start until the previous one is complete. This is autoregressive decoding, and it’s inherently sequential.
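The sequential bottleneck can be made concrete with a toy sketch. The "model" below is just a stand-in that returns a fixed continuation; the point is the loop structure: one forward pass per token, and no pass can start until the previous one finishes.

```python
# Toy illustration of autoregressive decoding. FIXED_OUTPUT stands in for
# whatever a real LLM would generate; the loop shape is what matters.
FIXED_OUTPUT = [3, 1, 4, 1, 5, 9]

def forward(context):
    """Stand-in for one full model forward pass: returns the next token."""
    return FIXED_OUTPUT[len(context)]

def generate(num_tokens):
    context, passes = [], 0
    for _ in range(num_tokens):
        context.append(forward(context))  # one sequential pass per token
        passes += 1
    return context, passes

tokens, passes = generate(6)
print(tokens, passes)  # 6 tokens cost 6 sequential forward passes
```

Generating N tokens always costs N sequential passes here, which is exactly the cost structure MTP attacks.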

Diagram comparing standard autoregressive decoding (one token per step) with Multi-Token Prediction (multiple tokens per step)

Multi-Token Prediction (MTP) changes this by adding extra prediction heads to the model. Instead of predicting just the next token, the model predicts tokens N+1, N+2, N+3, and so on — all in a single forward pass.

Here’s how it works:

  1. Training phase: Additional lightweight prediction heads are trained alongside the main model. Each head learns to predict a different future position (1 ahead, 2 ahead, 3 ahead, etc.)
  2. Inference phase: The extra heads generate “draft” tokens in parallel. The main model then verifies all of them in a single forward pass.
  3. Verification: If the draft tokens match what the main model would have generated, they’re all accepted at once — skipping multiple sequential decode steps. If a draft token is wrong, generation falls back to that position.
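The three phases above can be sketched with toy stand-ins. Both "models" below are hypothetical lookup functions, not real networks, and the draft heads are deliberately wrong at one position to show the fallback path; the control flow is the part that mirrors MTP-style draft-and-verify.

```python
# Sketch of MTP-style draft-and-verify decoding (toy stand-in models).
TARGET = [3, 1, 4, 1, 5, 9, 2, 6]   # the sequence the main model "wants"

def main_model_next(context):
    """Stand-in for the main model's verified next-token prediction."""
    return TARGET[len(context)]

def draft_heads(context, k=3):
    """Stand-in for lightweight MTP heads: k draft tokens in one shot.
    Deliberately wrong at position 4 to exercise the fallback path."""
    return [0 if len(context) + i == 4 else TARGET[len(context) + i]
            for i in range(k)]

def generate(n, k=3):
    context, verify_passes = [], 0
    while len(context) < n:
        drafts = draft_heads(context, k)
        verify_passes += 1                 # one pass verifies all k drafts
        for t in drafts:
            if len(context) >= n:
                break
            correct = main_model_next(context)
            if t == correct:
                context.append(t)          # draft accepted
            else:
                context.append(correct)    # mismatch: take the main model's token
                break
    return context, verify_passes

tokens, passes = generate(8)
print(tokens, passes)   # identical output to standard decoding, fewer passes
```

Here 8 tokens are produced in 3 verification passes instead of 8 sequential ones, and the output matches standard decoding exactly, because every accepted token was checked against the main model.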

This is closely related to speculative decoding, but with a key advantage: the draft tokens come from the model itself rather than requiring a separate, smaller “draft model.”

Architecture diagram showing how MTP prediction heads attach to the main transformer model to generate multiple draft tokens simultaneously

How Much Faster Is MTP?

The speedup depends on how often the draft tokens are correct (the “acceptance rate”). DeepSeek V3 demonstrated the real-world impact:

| Metric | Value |
|---|---|
| Average acceptance length | 2.4 tokens per verification step |
| Inference speedup | 1.8x average (up to 2.1x peak) |
| Output quality impact | Zero — all tokens verified by the main model |

An average acceptance length of 2.4 means that each forward pass through the main model produces 2.4 tokens on average instead of 1. The output is mathematically identical to standard decoding — every token is verified. You get the same quality at nearly double the speed.
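The arithmetic behind that claim is straightforward. A quick back-of-envelope check, using the DeepSeek V3 numbers from the table above:

```python
# How an acceptance length of 2.4 translates into fewer main-model passes.
tokens = 1000                     # tokens to generate (arbitrary example)
acceptance_length = 2.4           # DeepSeek V3's reported average

passes_standard = tokens                  # 1 token per main-model pass
passes_mtp = tokens / acceptance_length   # 2.4 tokens per pass

print(passes_standard, round(passes_mtp))  # 1000 vs 417 passes
# The measured 1.8x speedup is below the ideal 2.4x pass reduction because
# the draft heads themselves add a small per-step cost.
```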


What Happened with Gemma 4

A HuggingFace user (@shadowlilac) discovered that Google’s LiteRT package for Gemma 4 contains MTP prediction heads and multi-token prediction functionality. But the publicly released weights on HuggingFace have none of it.

The MTP components were deliberately stripped:

  • No MTP heads in the checkpoint
  • No MTP in the model config
  • No MTP in the forward pass

Diagram showing Gemma 4's training included MTP heads, but the public HuggingFace release has them stripped while Google's LiteRT version retains them

Google’s Explanation

A Google engineer (@srikanta-221) confirmed this was intentional:

The public model exposes only a standard autoregressive interface “for broad compatibility.” MTP heads are excluded from the model config, forward pass, and checkpoint. This ensures compatibility with HuggingFace Transformers APIs and maintains consistent checkpoint and runtime behavior.

Google frames MTP as a “deployment-time optimization” rather than a core model feature. The MTP prediction heads are preserved only in the LiteRT-exported models — Google’s own on-device inference framework.

Why This Is a Problem

The explanation doesn’t hold up under scrutiny:

1. The model was trained with MTP. The capability exists. Stripping it from the release is a choice, not a technical limitation.

2. Third-party engines can’t implement it. vLLM, llama.cpp, SGLang, and other inference frameworks cannot use MTP-based speculative decoding without the prediction heads. These engines serve the vast majority of open-source LLM deployments.

3. Users get the slow version. Without MTP, Gemma 4 runs at standard autoregressive speeds. The performance gap is already visible in practice:

| Model | Hardware | Speed | Notes |
|---|---|---|---|
| Gemma 4 26B-A4B | 5060 Ti 16GB | 11 tok/s | No MTP, standard decoding |
| Qwen 3.5 35B-A3B | 5060 Ti 16GB | 60+ tok/s | Comparable MoE model |
| Gemma 4 E4B | RTX 4090 (vLLM) | ~9 tok/s | FlashAttention fallback issues |

4. It creates ecosystem lock-in. Google’s own LiteRT framework gets the speed advantage. Everyone else gets a slower model. For an “open-weight” Apache 2.0 release, this is a significant asymmetry.

How Speculative Decoding Works (and Why MTP Is Better)

To understand why the missing MTP heads matter, it helps to see where MTP fits in the evolution of inference optimization.

Comparison of speculative decoding approaches: traditional speculative decoding with a separate draft model versus MTP with built-in prediction heads

Approach 1: Traditional Speculative Decoding

A separate, smaller “draft model” proposes tokens. The main model verifies them in parallel. If the drafts are correct, multiple tokens are accepted per step.

  • Pros: Works with any model pair
  • Cons: Requires maintaining and loading a second model; draft model quality limits speedup; extra memory overhead
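The two-model structure can be sketched as follows. Both models here are hypothetical lookup stand-ins, and the draft model is deliberately wrong at one position; the point is that drafting and verification live in two separate models, which is the overhead MTP eliminates.

```python
# Toy contrast with MTP: traditional speculative decoding needs TWO models.
TARGET = [7, 2, 7, 1, 8, 2]

def main_model_next(context):
    """Stand-in for the large main model (ground truth)."""
    return TARGET[len(context)]

def small_draft_model_next(context):
    """Stand-in for a separate, smaller draft model; wrong at position 2."""
    pos = len(context)
    return 9 if pos == 2 else TARGET[pos]

def speculative_step(context, k=3):
    # Phase 1: the draft model proposes k tokens sequentially (it is cheap).
    drafts, ctx = [], list(context)
    for _ in range(k):
        t = small_draft_model_next(ctx)
        drafts.append(t)
        ctx.append(t)
    # Phase 2: one main-model pass verifies; accept the matching prefix,
    # plus the main model's own token at the first mismatch.
    accepted = []
    for t in drafts:
        correct = main_model_next(context + accepted)
        if t == correct:
            accepted.append(t)
        else:
            accepted.append(correct)
            break
    return accepted

print(speculative_step([]))   # [7, 2, 7]: two drafts accepted, one corrected
```

The speedup is capped by how well the small model imitates the big one, which is exactly why built-in MTP heads, trained on the main model's own representations, tend to do better.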

Approach 2: MTP (Built-in Prediction Heads)

The main model has its own lightweight prediction heads that generate draft tokens. No separate model needed.

  • Pros: No extra model needed; tighter integration means higher acceptance rates; lower memory overhead
  • Cons: Only works if the prediction heads are included in the release

Why MTP Wins

MTP prediction heads are trained alongside the main model. They share the same internal representations and learn the model’s own token distribution. This typically produces higher acceptance rates than an external draft model, which means more tokens accepted per verification step and faster generation overall.

The prediction heads are also small — typically adding only 1-3% to the model’s total parameter count. The memory overhead is negligible compared to loading a separate draft model.
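A rough fp16 memory comparison makes the overhead gap concrete. The 2% head overhead is taken from the 1-3% range above; the 1.5B-parameter draft model is a hypothetical but typical size for traditional speculative decoding.

```python
# Back-of-envelope memory comparison at fp16 (2 bytes per parameter).
def gib(params):
    """Convert a parameter count to GiB at 2 bytes/param."""
    return params * 2 / 2**30

main = 26e9                 # Gemma 4 26B-A4B total parameters
mtp_heads = 0.02 * main     # MTP heads: ~2% of the main model (assumed)
draft_model = 1.5e9         # hypothetical separate draft model

print(round(gib(mtp_heads), 2), round(gib(draft_model), 2))
# ~0.97 GiB for built-in heads vs ~2.79 GiB for a separate draft model
```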

The Broader Impact

This isn’t just about Gemma 4. The decision sets a precedent for how “open” open-weight releases actually are.

What users lose:

  • MTP-based speculative decoding on any third-party inference engine
  • The ability to fine-tune or experiment with the MTP heads
  • Performance parity with Google’s own deployment tools

What users still have:

  • The base model weights (which are genuinely good)
  • Traditional speculative decoding using a separate draft model (vLLM issue #38893 tracks Eagle3 support for Gemma 4)
  • Standard quantization and optimization techniques

Community response has been direct. The 24-hour consensus was that Gemma 4’s benchmark results are competitive — it ties with or slightly trails Qwen 3.5 — but the product “isn’t finished.” Speed, stability, and tooling need work. Additional issues include HuggingFace Transformers initially lacking Gemma 4 architecture support, PEFT not handling the new layer types, and Mac users experiencing crashes loading larger models.

What Can You Do?

If you’re evaluating Gemma 4 for deployment, here are practical options:

Use traditional speculative decoding. External draft models can still accelerate Gemma 4 inference. Frameworks like vLLM are adding Eagle3 speculative decoding support specifically for Gemma 4. The speedup won’t match built-in MTP, but it’s better than nothing.

Consider alternatives for speed-critical workloads. Qwen 3.5 delivers significantly better tokens-per-second on equivalent hardware. If inference speed is your primary constraint, Qwen currently offers a better speed-to-quality ratio.

Watch for community workarounds. The LiteRT exports contain the MTP heads. Researchers may find ways to extract and reattach them to the HuggingFace weights, though Google has not officially supported this path.

Provide feedback. Google’s engineers are actively monitoring the HuggingFace discussion threads. Clear, technical requests for MTP head release carry weight.

Conclusion

Gemma 4 is a capable model family with genuine architectural innovations and strong benchmark results. The decision to strip MTP prediction heads from the public release — while retaining them in Google’s own LiteRT framework — undermines the “open” in open-weight.

MTP is not a minor optimization. It can deliver 1.5–2x inference speedups with zero impact on output quality. Withholding it from the public weights while the model was clearly trained with it creates a two-tier system: fast inference for Google’s tools, slow inference for everyone else.

For the open-source AI community, the message is clear: check what’s actually in the weights, not just the benchmarks. An open license doesn’t always mean an open release.


Built with FlowHunt. Stay up to date with the latest developments in open-source AI on our blog.


Viktor Zeman
CEO, AI Engineer

Viktor Zeman is a co-owner of QualityUnit. Even after 20 years of leading the company, he remains primarily a software engineer, specializing in AI, programmatic SEO, and backend development. He has contributed to numerous projects, including LiveAgent, PostAffiliatePro, FlowHunt, UrlsLab, and many others.
