Introducing Gemma 4 12B: a unified, encoder-free multimodal model
Originally from blog.google
My notes
Summary
Google released Gemma 4 12B, a 12-billion-parameter open model (Apache 2.0) that processes text, vision, and audio without separate encoders, fitting the whole stack into 16 GB of VRAM or unified memory. It sits between the tiny E4B edge model and the 26B MoE, hitting near-26B benchmark scores at less than half the memory cost.
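A back-of-envelope check on the 16 GB figure, counting weights only (no KV cache, activations, or runtime overhead). The source does not say which precision the 16 GB claim assumes, so the bit widths below are illustrative assumptions, and `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Weights-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 12e9   # 12 billion parameters, from the summary above
BUDGET_GB = 16    # the memory budget quoted in the summary

# Bit widths are assumptions for illustration, not published figures.
for bits in (16, 8, 4):
    gb = weight_memory_gb(N_PARAMS, bits)
    verdict = "fits" if gb < BUDGET_GB else "does not fit"
    print(f"{bits:>2}-bit weights: {gb:.0f} GB -> {verdict} in {BUDGET_GB} GB")
```

At 16 bits per parameter the weights alone come to 24 GB, so under these assumptions a 16 GB fit implies 8-bit or lower quantization, with the remaining headroom going to the KV cache and activations.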
Key Insight
- Encoder-free = faster + lighter. Traditional multimodal models ship a separate vision encoder and audio encoder alongside the LLM, adding memory pressure and pipeline latency. Gemma 4 12B replaces the vision encoder with a single matrix multiplication + positional embedding + normalisation layer, and projects raw audio signals directly into the token embedding space. No separate model weights per modality.
- 16 GB threshold is significant. This is the minimum RAM on a current MacBook Pro (M-series unified memory). It means Gemma 4 12B runs locally on millions of consumer laptops already in the field, not just workstations or servers.
- Multi-Token Prediction (MTP) drafters are baked in. MTP is a form of speculative decoding: a lightweight drafter proposes several tokens ahead and the main model verifies them in parallel, typically delivering a 2-3x throughput boost at no quality cost. Having drafters pre-packaged removes an integration step for production deployments.
- Gemma Skills Repository. Google released a library of agent skills specifically written to let AI agents build with Gemma models. Meta-layer: agents helping agents adopt the model.
- Ecosystem coverage at launch. LM Studio, Ollama, Hugging Face Transformers, llama.cpp, MLX (Apple Silicon), SGLang, vLLM, Unsloth for fine-tuning, and Google Cloud (Cloud Run, GKE, Agent Platform). No friction in getting it running.
- 150 million Gemma 4 downloads across the family so far, so community momentum for third-party tooling is already established.
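The encoder-free vision path described in the first bullet (one matrix multiplication, a positional embedding, a normalisation layer) can be sketched in dependency-free Python. This is a toy illustration, not Gemma's implementation: the patch size, model width, random weights, learned per-patch positional table, and the choice of RMS normalisation are all assumptions.

```python
import math
import random


def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened patch vectors (row-major)."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([
                channel
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for channel in pixel
            ])
    return patches


def rms_norm(vec, eps=1e-6):
    """Scale a vector to roughly unit root-mean-square (no learned gain, for brevity)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def embed_patches(image, weight, pos_emb, patch):
    """Linear projection + positional embedding + normalisation, one token per patch."""
    tokens = []
    for i, vec in enumerate(patchify(image, patch)):
        # The single matrix multiplication: one dot product per output dimension.
        projected = [sum(x * w for x, w in zip(vec, row)) for row in weight]
        with_pos = [p + e for p, e in zip(projected, pos_emb[i])]
        tokens.append(rms_norm(with_pos))
    return tokens


# Hypothetical toy sizes: an 8x8 RGB image, 4x4 patches, 16-dim embeddings.
rng = random.Random(0)
H = W = 8
C, PATCH, D_MODEL = 3, 4, 16
patch_dim = PATCH * PATCH * C
num_patches = (H // PATCH) * (W // PATCH)

image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
weight = [[rng.gauss(0, 0.02) for _ in range(patch_dim)] for _ in range(D_MODEL)]
pos_emb = [[rng.gauss(0, 0.02) for _ in range(D_MODEL)] for _ in range(num_patches)]

tokens = embed_patches(image, weight, pos_emb, PATCH)
print(len(tokens), "image tokens of width", len(tokens[0]))
```

The point of the sketch is that the only image-specific parameters are `weight` and `pos_emb`; the resulting vectors would be fed to the same transformer stack as text token embeddings, rather than passing through a separate encoder network first.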
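The draft-then-verify loop behind the MTP bullet can be illustrated with a generic greedy speculative decoder. This is a sketch of the general technique, not Gemma's MTP drafter: `target_next` and `draft_next` are hypothetical toy stand-ins for the main model and the drafter, and verification is simulated sequentially here, whereas a real system scores all drafted positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Hypothetical stand-in for the large model's greedy next-token choice."""
    return (ctx[-1] * 7 + len(ctx)) % 10


def draft_next(ctx: List[int]) -> int:
    """Hypothetical cheap drafter: agrees with the target except at every 4th position."""
    guess = target_next(ctx)
    return (guess + 1) % 10 if len(ctx) % 4 == 0 else guess


def greedy_decode(next_token: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(next_token(seq))
    return seq


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Return the decoded sequence and the number of verification rounds used."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(seq) < goal:
        # 1. Draft k tokens cheaply.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2. Verify: keep drafted tokens while they match the target's own choice.
        rounds += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first mismatch and stop
                break
            accepted.append(tok)
        else:
            # Every draft matched, so the target contributes one extra token.
            accepted.append(target(seq + accepted))
        seq.extend(accepted)
    return seq[:goal], rounds


baseline = greedy_decode(target_next, [1], 20)
spec, rounds = speculative_decode(target_next, draft_next, [1], 20)
print("identical output:", spec == baseline)
print("verification rounds:", rounds, "vs. baseline target calls:", 20)
```

Because every kept token is one the target would have chosen itself, the output matches plain greedy decoding, which is the sense in which the speed-up comes "at no quality cost". The actual gain depends on how often the drafter agrees with the main model.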