A Visual Guide to Gemma 4

Tags: gemma-4, google-deepmind, mixture-of-experts, vision-transformer, on-device-llm, multimodal, attention-mechanisms, model-architecture
Originally from newsletter.maartengrootendorst.com

My notes

Summary

Google DeepMind released the Gemma 4 family with four model variants (E2B, E4B, 31B dense, 26B-A4B MoE), all multimodal with image support; the smaller models also handle audio. The architecture introduces several efficiency innovations: per-layer embeddings stored in flash memory instead of RAM, a K=V trick that halves the KV-cache of global attention layers, and p-RoPE for better long-context handling. The 26B MoE model activates only 4B of its 26B parameters per token, running nearly as fast as a 4B dense model.

Key Insight

  • Four model sizes spanning from tiny on-device (E2B ~2B effective, E4B ~4B effective) to a full 31B dense model, plus a 26B MoE with only 4B active parameters
  • Per-Layer Embeddings (PLE) are the key innovation for small models: a per-token, per-layer lookup table (262,144 tokens x 256 dims x N layers) stored in flash memory, not VRAM. This “reminds” the model at each layer what the token represents, making fewer parameters more expressive
  • K=V in global attention: Keys are set equal to Values in global attention layers, halving the KV-cache for those layers with minimal performance loss
  • p-RoPE (pruned RoPE): only the first 25% of embedding dimension pairs get positional encoding; low-frequency pairs are zeroed out. This prevents long-context misalignment where small rotations stack up and corrupt semantic information
  • Interleaving pattern changed: Gemma 4 ensures the last layer is always global attention (unlike Gemma 3 where it could end on local). Pattern is 4:1 for E2B, 5:1 for all others
  • MoE specifics: 128 experts, 8 activated per token, plus 1 shared expert that is 3x the size of a regular expert and is always active, holding general knowledge
  • Vision encoder supports variable aspect ratios via 2D RoPE (width/height positional info on separate halves of the embedding) and variable resolution via soft token budgets (70/140/280/560/1120 tokens)
  • Audio encoder (E2B/E4B only) uses a conformer architecture — mel spectrogram to chunks to convolutional downsampling, then transformer+convolution encoding
  • Sliding window sizes: 512 tokens for the small models (down from 1024 in Gemma 3), 1024 for the larger ones
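The PLE bullet above can be sketched as a per-layer, per-token lookup whose result is added back into the hidden state. This is a toy numpy sketch with made-up sizes and a hypothetical 256-to-d_model projection; the real table is 262,144 tokens x 256 dims per layer and is streamed from flash rather than held in memory.

```python
import numpy as np

# Toy sizes (assumptions); the real table is 262,144 x 256 per layer,
# stored in flash storage instead of RAM.
VOCAB, PLE_DIM, D_MODEL, N_LAYERS = 1000, 256, 512, 4

rng = np.random.default_rng(0)
ple_table = rng.standard_normal((N_LAYERS, VOCAB, PLE_DIM)).astype(np.float32)
# Hypothetical projection from the 256-dim PLE rows into the model width.
proj = rng.standard_normal((PLE_DIM, D_MODEL)).astype(np.float32) * 0.02

def add_ple(hidden, token_ids, layer_idx):
    """'Remind' the layer what each token is: look up the per-layer,
    per-token embedding row and add it (projected) to the hidden state."""
    ple = ple_table[layer_idx, token_ids]      # (seq_len, 256) row lookup
    return hidden + ple @ proj

token_ids = np.array([3, 17, 42])
h = np.zeros((len(token_ids), D_MODEL), dtype=np.float32)
for layer in range(N_LAYERS):
    h = add_ple(h, token_ids, layer)           # one fresh lookup per layer
```

Because the lookup depends only on the token id, rows can be fetched on demand from flash as tokens arrive, which is what keeps the table out of the RAM budget.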
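The K=V trick can be illustrated with a single-head attention sketch: keys and values are the same tensor, so a global layer's cache stores one array instead of two. A minimal sketch, not the production implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kv_tied_attention(q, kv):
    """Global attention with keys tied to values (K = V = kv), so the
    KV-cache holds a single tensor per layer instead of two."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ kv

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 64))    # 4 query positions, head dim 64
kv = rng.standard_normal((10, 64))  # cached sequence: the ONLY cached tensor

out = kv_tied_attention(q, kv)

# Standard attention would cache K and V separately; tying them halves it.
standard_cache_floats = 2 * kv.size
tied_cache_floats = kv.size
```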
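p-RoPE can be sketched by zeroing the rotation frequencies of all but the first 25% of dimension pairs, so the pruned (low-frequency) pairs pass through unrotated at any position. This sketch assumes the standard RoPE frequency ordering, where the first pairs are the highest-frequency ones.

```python
import numpy as np

def p_rope(x, pos, keep_frac=0.25, base=10000.0):
    """Rotary embedding with pruned frequencies: only the first keep_frac
    of dimension pairs rotate; the rest get angle 0, so small rotations
    cannot accumulate over long contexts and corrupt their content."""
    d = x.shape[-1]
    n_pairs = d // 2
    freqs = base ** (-2.0 * np.arange(n_pairs) / d)   # high frequency first
    freqs[int(n_pairs * keep_frac):] = 0.0            # prune low-freq pairs
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]                         # pair i = dims (2i, 2i+1)
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

x = np.arange(64, dtype=np.float64)
rot = p_rope(x, pos=1000)
# With 32 pairs, only pairs 0..7 (dims 0..15) rotate; pairs 8..31
# (dims 16..63) are untouched even at this large position.
```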
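The interleaving rule (N local layers per global layer, with the final layer forced to be global) can be generated by counting from the end of the stack. A small sketch; the 30-layer depth is an assumed toy count, not a published figure:

```python
def layer_pattern(n_layers, locals_per_global):
    """local:global interleaving aligned from the END of the stack, so
    the last layer is always global (unlike Gemma 3, which could end
    on a local attention layer)."""
    period = locals_per_global + 1
    return ["global" if (n_layers - 1 - i) % period == 0 else "local"
            for i in range(n_layers)]

e2b = layer_pattern(30, 4)     # 4:1 pattern for E2B
others = layer_pattern(30, 5)  # 5:1 pattern for the other variants
```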
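The MoE layer can be sketched as top-8 routing over 128 expert MLPs plus an always-on shared expert whose hidden width is 3x a regular expert's. The toy dimensions and the gating detail (softmax over the selected 8 logits) are assumptions for illustration.

```python
import numpy as np

N_EXPERTS, TOP_K, D, H = 128, 8, 32, 64   # toy dims; shared hidden = 3 * H

rng = np.random.default_rng(0)
router = rng.standard_normal((D, N_EXPERTS)) * 0.1
w_in = rng.standard_normal((N_EXPERTS, D, H)) * 0.1
w_out = rng.standard_normal((N_EXPERTS, H, D)) * 0.1
shared_in = rng.standard_normal((D, 3 * H)) * 0.1   # shared expert: 3x wider
shared_out = rng.standard_normal((3 * H, D)) * 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_layer(x):
    """Route one token: run only the top-8 of 128 experts, gate their
    outputs, and always add the larger shared expert (general knowledge)."""
    logits = x @ router
    top = np.argsort(logits)[-TOP_K:]              # the 8 active experts
    gates = softmax(logits[top])
    expert_out = sum(g * (np.maximum(x @ w_in[e], 0) @ w_out[e])
                     for g, e in zip(gates, top))
    shared = np.maximum(x @ shared_in, 0) @ shared_out
    return expert_out + shared

x = rng.standard_normal(D)
y = moe_layer(x)
```

Only 8 of the 128 expert MLPs run per token, which is how the 26B-parameter model keeps its active compute close to a 4B dense model.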