ByteDance Lance: 3B unified multimodal model

Tags: multimodal-models, video-generation, image-editing, open-weights, bytedance, small-models
Originally from github.com

My notes

Summary

Lance is ByteDance’s open-weight model with 3B active parameters that handles image and video understanding, generation, and editing in one framework. Its transformer backbone was trained from scratch on a budget of just 128 A100 GPUs, yet it matches or beats much larger 7B-20B unified models on standard benchmarks. Weights are on Hugging Face under a research license.

Key Insight

Punches far above its weight class:

  • Tops VBench for video generation (85.11), ahead of Wan2.1-T2V (14B, 83.69) and Hunyuan Video.
  • Ties the best unified model on GenEval (0.90) at 3B, against 7B competitors like TUNA.
  • Strong GEdit image-editing average (7.30), beating BAGEL (7B) and Ovis-U1.

Training economics are the real story:

  • Transformer backbone trained entirely from scratch (only ViT + VAE encoders reused).
  • 128 A100s is a modest budget for a frontier-competitive multimodal model, a sign that the cost of “good enough” generation and editing is collapsing.

One model, six tasks: t2i, t2v, image_edit, video_edit, plus image and video understanding (x2t). A single CLI covers all of them, with one weight set per modality (Lance_3B for image, Lance_3B_Video for video).
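To make the task-to-weights split concrete, here is a minimal Python sketch of a lookup that picks a weight set for a task. The helper name weights_for and its modality argument are hypothetical and not part of the Lance CLI. It also assumes that x2t understanding uses the image weights for image input and the video weights for video input, which these notes do not confirm.

```python
from typing import Optional

# Weight-set names as given in the notes above.
IMAGE_WEIGHTS = "Lance_3B"
VIDEO_WEIGHTS = "Lance_3B_Video"

# Generation and editing tasks map directly to one modality.
TASK_TO_WEIGHTS = {
    "t2i": IMAGE_WEIGHTS,
    "image_edit": IMAGE_WEIGHTS,
    "t2v": VIDEO_WEIGHTS,
    "video_edit": VIDEO_WEIGHTS,
}


def weights_for(task: str, modality: Optional[str] = None) -> str:
    """Hypothetical helper: return the weight set a task would load.

    For "x2t" (understanding) the input modality must be given, because
    the task name alone does not say whether the input is image or video.
    ASSUMPTION: x2t follows the same image/video split as the other tasks.
    """
    if task == "x2t":
        if modality == "image":
            return IMAGE_WEIGHTS
        if modality == "video":
            return VIDEO_WEIGHTS
        raise ValueError("x2t needs modality='image' or modality='video'")
    if task not in TASK_TO_WEIGHTS:
        raise ValueError(f"unknown task: {task!r}")
    return TASK_TO_WEIGHTS[task]


if __name__ == "__main__":
    for task, modality in [("t2i", None), ("video_edit", None), ("x2t", "video")]:
        print(task, modality, "->", weights_for(task, modality))
```

The practical consequence is that a workflow mixing image and video tasks needs both weight sets downloaded, even though the CLI is shared.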

Caveats: needs 40GB+ of VRAM (a single A100- or A6000-class GPU) and CUDA 12.4+, and the research license does not clearly permit commercial use. Video is capped at 121 frames at 480p.
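The hardware caveats can be turned into a quick preflight check before downloading weights. The sketch below is plain Python with no dependencies. The function name preflight and its arguments are hypothetical, and the thresholds are simply the numbers from these notes (40 GB, CUDA 12.4, 121 frames), not values read from the Lance codebase.

```python
from typing import List, Tuple

# Thresholds copied from the caveats above (assumed, not read from the repo).
MIN_VRAM_GB = 40
MIN_CUDA = (12, 4)
MAX_VIDEO_FRAMES = 121


def parse_version(text: str) -> Tuple[int, int]:
    """Parse "12.4" or "12" into a (major, minor) tuple for comparison."""
    parts = [int(p) for p in text.split(".")[:2]]
    while len(parts) < 2:
        parts.append(0)  # treat a bare "12" as 12.0
    return parts[0], parts[1]


def preflight(vram_gb: float, cuda_version: str, num_frames: int = 1) -> List[str]:
    """Hypothetical check: return a list of problems, empty if none found."""
    problems = []
    if vram_gb < MIN_VRAM_GB:
        problems.append(f"VRAM {vram_gb} GB is below the {MIN_VRAM_GB} GB minimum")
    if parse_version(cuda_version) < MIN_CUDA:
        problems.append(f"CUDA {cuda_version} is older than 12.4")
    if num_frames > MAX_VIDEO_FRAMES:
        problems.append(f"{num_frames} frames exceeds the {MAX_VIDEO_FRAMES}-frame cap")
    return problems


if __name__ == "__main__":
    # A 24 GB card on CUDA 11.8 asking for 200 frames fails all three checks.
    for issue in preflight(vram_gb=24, cuda_version="11.8", num_frames=200):
        print("problem:", issue)
```

On a machine with PyTorch installed, the inputs could come from torch.cuda.get_device_properties(0).total_memory (in bytes) and torch.version.cuda. Comparing versions as integer tuples rather than strings keeps a version like "12.10" from sorting below "12.4".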