# ByteDance Lance: 3B unified multimodal model

> ByteDance's open-weight 3B-active-parameter model handles image and video understanding, generation, and editing, rivaling far larger models on a modest budget.

Published: 2026-05-22
URL: https://daniliants.com/insights/github-bytedance-lance-3b-active-parameter-native-unified-multimodal/
Tags: multimodal-models, video-generation, image-editing, open-weights, bytedance, small-models

---

## Summary

Lance is ByteDance's open-weight 3B-active-parameter model that handles image and video understanding, generation, and editing in one framework. Its transformer backbone was trained from scratch on a budget of just 128 A100 GPUs, yet it matches or beats much larger 7B-20B unified models on standard benchmarks. Weights are on Hugging Face under a research license.

## Key Insight

**Punches far above its weight class:**

- Tops VBench video generation at 85.11, ahead of Wan2.1-T2V (14B, 83.69) and Hunyuan Video.
- Ties the best unified model on GenEval (0.90) with 3B active parameters, where competitors such as TUNA use 7B.
- Scores a strong 7.30 GEdit image-editing average, beating BAGEL (7B) and Ovis-U1.
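
To put the VBench gap in perspective, the arithmetic can be sketched as below. The scores and sizes are the ones quoted above; comparing Lance's 3B *active* parameters with Wan2.1-T2V's 14B headline size is a rough assumption, not a like-for-like measure.

```python
# Figures quoted in the bullets above.
lance_vbench, wan_vbench = 85.11, 83.69
lance_params_b, wan_params_b = 3, 14  # billions; Lance's figure is active parameters

# Absolute lead on VBench, in benchmark points.
margin = round(lance_vbench - wan_vbench, 2)

# Rough size ratio (assumption: active vs. headline parameter counts are comparable).
size_ratio = round(wan_params_b / lance_params_b, 2)

print(f"VBench lead: {margin} points with a model roughly {size_ratio}x smaller")
```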

**Training economics are the real story:**

- Transformer backbone trained entirely from scratch (only ViT + VAE encoders reused).
- 128 A100s is a modest budget for a frontier-competitive multimodal model, signaling that the cost of "good enough" generation and editing is collapsing.

**One model, six tasks:** t2i, t2v, image_edit, video_edit, plus image/video understanding (x2t). Single CLI, single weight set per modality (Lance_3B for image, Lance_3B_Video for video).
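
The task-to-weights split described above can be sketched as a small lookup. This is a hypothetical helper, not part of the Lance CLI: `weights_for` and `TASK_MODALITY` are illustrative names, and routing `x2t` by input modality is an assumption drawn from the "single weight set per modality" description.

```python
from typing import Optional

# Weight-set names as given in the text above.
IMAGE_WEIGHTS = "Lance_3B"
VIDEO_WEIGHTS = "Lance_3B_Video"

# Tasks whose modality is implied by the task name itself.
TASK_MODALITY = {
    "t2i": "image",
    "image_edit": "image",
    "t2v": "video",
    "video_edit": "video",
}


def weights_for(task: str, modality: Optional[str] = None) -> str:
    """Return the weight set for a task (hypothetical helper, not the real CLI)."""
    if task == "x2t":
        # Understanding covers both modalities, so the caller must say which one.
        if modality not in ("image", "video"):
            raise ValueError("x2t needs modality='image' or modality='video'")
    elif task in TASK_MODALITY:
        modality = TASK_MODALITY[task]
    else:
        raise ValueError(f"unknown task: {task!r}")
    return IMAGE_WEIGHTS if modality == "image" else VIDEO_WEIGHTS


print(weights_for("t2v"))
print(weights_for("x2t", modality="image"))
```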

**Caveats:** needs 40GB+ VRAM (a single A100- or A6000-class GPU) and CUDA 12.4+. The research license does not clearly permit commercial use. Video output is capped at 121 frames at 480p.
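
The limits listed above can be turned into a quick preflight check before attempting a run. This is a minimal sketch: `preflight` is a hypothetical function, and the caller supplies the VRAM and CUDA values however they obtain them (for example from `nvidia-smi`), so no GPU library is assumed here.

```python
from typing import List

# Thresholds taken from the caveats above.
MIN_VRAM_GB = 40
MIN_CUDA = (12, 4)
MAX_FRAMES = 121


def preflight(vram_gb: float, cuda_version: str, num_frames: int = 0) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if vram_gb < MIN_VRAM_GB:
        problems.append(f"VRAM {vram_gb}GB is below the {MIN_VRAM_GB}GB minimum")

    # Compare versions as integer tuples so that "12.10" sorts above "12.4".
    parts = cuda_version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    if (major, minor) < MIN_CUDA:
        problems.append(f"CUDA {cuda_version} is older than 12.4")

    if num_frames > MAX_FRAMES:
        problems.append(f"{num_frames} frames exceeds the {MAX_FRAMES}-frame cap")
    return problems


print(preflight(vram_gb=24, cuda_version="11.8", num_frames=200))
```

Parsing the version into a tuple rather than a float avoids misreading a release such as 12.10 as older than 12.4.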