video-use: edit videos with Claude Code via transcripts
Originally from github.com
Summary
Video-use is an open-source Claude Code skill that turns raw video footage into edited final cuts through conversation. Instead of frame-dumping (which would cost ~45M tokens), it reads video through word-level audio transcripts (~12KB) plus on-demand visual filmstrips, achieving production-quality cuts at a fraction of the token cost.
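The token arithmetic behind that design choice can be made concrete. A minimal sketch, using the figures from the summary above (the variable names and the 4-bytes-per-token rule of thumb are illustrative assumptions, not part of the skill):

```python
# Rough token math: frame-dumping vs. transcript-reading.
# Figures come from the article; the bytes-per-token ratio is a
# common rule of thumb for English text, not a measured value.

FRAME_COUNT = 30_000           # sampled frames for the footage
TOKENS_PER_FRAME = 1_500       # vision tokens per frame
frame_dump_tokens = FRAME_COUNT * TOKENS_PER_FRAME   # 45,000,000

TRANSCRIPT_BYTES = 12 * 1024   # ~12KB word-level transcript
BYTES_PER_TOKEN = 4            # rule-of-thumb for English text
transcript_tokens = TRANSCRIPT_BYTES // BYTES_PER_TOKEN

print(f"frame dump: {frame_dump_tokens:,} tokens")
print(f"transcript: ~{transcript_tokens:,} tokens "
      f"(~{frame_dump_tokens // transcript_tokens:,}x cheaper)")
```

Even allowing generous margins on the per-token estimates, the transcript path is four orders of magnitude cheaper, which is what makes conversational editing of long footage feasible at all.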
Key Insight
- The core architectural insight is treating video the way browser-use treats web pages: give the LLM a structured representation (transcript plus selective filmstrips) instead of raw pixels. This reduces ~30,000 frames × 1,500 tokens (~45M) to ~12KB of text plus a handful of PNGs.
- Uses ElevenLabs Scribe for word-level timestamps and speaker diarization, enabling word-boundary-precise cuts.
- Automatic production polish: filler word removal (umm, uh, false starts), dead space trimming, 30ms audio fades at cuts, auto color grading, and burned-in subtitles.
- Self-evaluation loop runs `timeline_view` on rendered output at every cut boundary, catching visual jumps and audio pops before showing the user. Max 3 fix-and-re-render cycles.
- Generates animation overlays via Manim or Remotion, spawning parallel sub-agents per animation.
- Session memory persists in `project.md`, so editing sessions can span multiple days.
- Installed as a Claude Code skill via symlink to `~/.claude/skills/video-use`.
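The word-level timestamps are what make the filler-removal and dead-space trimming precise. A minimal sketch of turning a diarized transcript into word-boundary keep segments (the transcript shape and function name here are hypothetical, not ElevenLabs Scribe's actual schema or the skill's real code):

```python
# Plan keep-segments from word-level timestamps: drop filler words,
# and start a new segment whenever the silence gap exceeds a threshold.
# Transcript shape is a hypothetical simplification.

FILLERS = {"um", "umm", "uh"}
MAX_GAP = 0.8  # seconds of dead space tolerated inside one segment

def plan_cuts(words, max_gap=MAX_GAP):
    """words: [{'word', 'start', 'end'}] -> list of (start, end) to keep."""
    segments = []
    for w in words:
        if w["word"].lower().strip(",.") in FILLERS:
            continue  # filler word: excluded from every segment
        if segments and w["start"] - segments[-1][1] <= max_gap:
            segments[-1][1] = w["end"]               # extend current segment
        else:
            segments.append([w["start"], w["end"]])  # dead space: new segment
    return [tuple(s) for s in segments]

words = [
    {"word": "So",    "start": 0.0, "end": 0.2},
    {"word": "um,",   "start": 0.3, "end": 0.5},  # filler, dropped
    {"word": "today", "start": 0.6, "end": 1.0},
    {"word": "we",    "start": 3.5, "end": 3.6},  # 2.5s dead space before
    {"word": "ship",  "start": 3.7, "end": 4.0},
]
print(plan_cuts(words))  # → [(0.0, 1.0), (3.5, 4.0)]
```

A real implementation would also handle false starts (repeated phrases) and pad each boundary slightly so the 30ms audio fades have room to land.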
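The bounded self-evaluation loop described above can be sketched as a simple render/inspect/fix cycle (the function names are hypothetical stand-ins for the skill's tools, not its actual API):

```python
# Bounded self-check: render, inspect every cut boundary, and if problems
# are found (visual jumps, audio pops), fix and re-render -- at most 3 times.
# All callables are injected stand-ins; only the loop shape mirrors the text.

MAX_CYCLES = 3

def render_and_verify(timeline, render, inspect_boundary, fix):
    output = None
    for _ in range(MAX_CYCLES):
        output = render(timeline)
        issues = []
        for cut in timeline["cuts"]:
            problem = inspect_boundary(output, cut)  # e.g. timeline_view check
            if problem:
                issues.append(problem)
        if not issues:
            return output          # clean render: show the user
        timeline = fix(timeline, issues)
    return output                  # best effort after MAX_CYCLES attempts
```

Capping the loop at three cycles keeps a stubborn artifact from burning unbounded render time; the last render is returned as-is rather than looping forever.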