ByteDance Lance: 3B unified multimodal model
Originally from github.com
Summary
Lance is ByteDance’s open-weight unified multimodal model with 3B active parameters. It handles image and video understanding, generation, and editing in one framework. It was trained from scratch on a budget of 128 A100 GPUs, yet it matches or beats much larger unified models (7B to 20B) on standard benchmarks. The weights are on Hugging Face under a research license.
Key Insight
Punches far above its weight class:
- Tops VBench for video generation (85.11), ahead of Wan2.1-T2V (14B, 83.69) and Hunyuan Video.
- Ties the best unified model on GenEval (0.90) at 3B, against 7B competitors such as TUNA.
- Strong GEdit image-editing average (7.30), beating BAGEL (7B) and Ovis-U1.
Training economics are the real story:
- The transformer backbone is trained entirely from scratch (only the ViT and VAE encoders are reused).
- 128 A100s is a modest budget for a frontier-competitive multimodal model, a sign that the cost of “good enough” generation and editing is collapsing.
One model, six tasks: t2i, t2v, image_edit, video_edit, plus image and video understanding (both x2t). There is one CLI and one weight set per modality (Lance_3B for image, Lance_3B_Video for video).
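The task-to-weights split can be sketched as a small lookup. The task names and checkpoint names are taken from the notes above; the helper `checkpoint_for` and its `video_input` flag are hypothetical, and routing x2t by input modality is an assumption rather than documented behavior.

```python
# Hypothetical helper: picks the weight set a given Lance task would need.
# Checkpoint and task names come from the notes; everything else is illustrative.
IMAGE_CKPT = "Lance_3B"
VIDEO_CKPT = "Lance_3B_Video"

TASK_TO_CKPT = {
    "t2i": IMAGE_CKPT,
    "image_edit": IMAGE_CKPT,
    "t2v": VIDEO_CKPT,
    "video_edit": VIDEO_CKPT,
}


def checkpoint_for(task: str, video_input: bool = False) -> str:
    """Return the checkpoint name for a task.

    x2t (understanding) covers both modalities, so it is routed by the
    type of the input. That routing is an assumption, not a documented rule.
    """
    if task == "x2t":
        return VIDEO_CKPT if video_input else IMAGE_CKPT
    try:
        return TASK_TO_CKPT[task]
    except KeyError:
        raise ValueError(f"unknown task: {task!r}") from None
```

For example, `checkpoint_for("video_edit")` selects the video weights, while `checkpoint_for("x2t")` falls back to the image weights unless `video_input=True` is passed.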
Caveats: it needs 40 GB or more of VRAM (a single A100- or A6000-class GPU) and CUDA 12.4 or later, and the research license does not clearly permit commercial use. Video is capped at 121 frames at 480p.
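A minimal preflight sketch of these limits, assuming the thresholds quoted in the notes are hard requirements. The function `preflight` and its parameters are hypothetical and not part of the Lance CLI; the caller supplies the hardware figures.

```python
# Hypothetical preflight check built from the limits quoted in the notes.
MIN_VRAM_GB = 40        # "40GB+ VRAM"
MIN_CUDA = (12, 4)      # "CUDA 12.4+"
MAX_VIDEO_FRAMES = 121  # maximum video length, at 480p


def preflight(vram_gb: float, cuda_version: tuple[int, int],
              num_frames: int = 1) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if vram_gb < MIN_VRAM_GB:
        problems.append(f"need >= {MIN_VRAM_GB} GB VRAM, have {vram_gb}")
    if cuda_version < MIN_CUDA:  # tuples compare element by element
        have = ".".join(map(str, cuda_version))
        problems.append(f"need CUDA >= 12.4, have {have}")
    if num_frames > MAX_VIDEO_FRAMES:
        problems.append(
            f"at most {MAX_VIDEO_FRAMES} frames supported, asked for {num_frames}"
        )
    return problems
```

Calling `preflight(24, (12, 1), num_frames=200)` would report all three problems, while an 80 GB card on CUDA 12.4 requesting 121 frames passes cleanly.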