A Large-Scale High-Quality Dataset for Instruction-Guided Video Editing
Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models. To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence. Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale. Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples. We trained our model, Editto, on Ditto-1M with a curriculum learning strategy. The results demonstrate superior instruction-following ability and establish a new state-of-the-art in instruction-based video editing.

The quality and diversity of instruction-based image editing datasets are continuously increasing, yet large-scale, high-quality datasets for instruction-based video editing remain scarce. To address this gap, we introduce OpenVE-3M, an open-source, large-scale, and high-quality dataset for instruction-based video editing. It comprises two primary categories: spatially-aligned edits (Global Style, Background Change, Local Change, Local Remove, Local Add, and Subtitles Edit) and non-spatially-aligned edits (Camera Multi-Shot Edit and Creative Edit). All edit types are generated via a meticulously designed data pipeline with rigorous quality filtering, a process consuming more than 10,000 GPU-days. OpenVE-3M surpasses existing open-source datasets in terms of scale, diversity of edit types, instruction length, and overall quality. Furthermore, to address the lack of a unified benchmark in the field, we construct OpenVE-Bench, containing 431 video-edit pairs that cover a diverse range of editing tasks with three key metrics highly aligned with human judgment. We present OpenVE-Edit, a 5B model trained on our dataset that demonstrates remarkable efficiency and effectiveness by setting a new state-of-the-art on OpenVE-Bench, outperforming all prior open-source models including a 14B baseline.
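To make the pipeline concrete, the sketch below traces a single training example through the stages named above: an agent writes the instruction, an image editor edits a keyframe, an in-context video generator propagates the edit, a temporal/denoising enhancer restores quality, and a quality filter decides whether the sample enters the dataset. This is a minimal illustration under assumed interfaces; none of the function or component names (`InstructionAgent`, `image_editor`, etc.) come from the released code.

```python
# Illustrative sketch of the data-generation pipeline; all component
# interfaces are hypothetical placeholders, not the actual implementation.

def build_editing_example(source_video, instruction_agent, image_editor,
                          video_generator, temporal_enhancer, quality_filter):
    """Produce one (source video, instruction, edited video) training triplet."""
    # 1. The agent crafts a diverse editing instruction for this clip.
    instruction = instruction_agent.write_instruction(source_video)

    # 2. A leading image editor edits a keyframe to define the target look.
    keyframe = source_video.frames[0]
    edited_keyframe = image_editor.edit(keyframe, instruction)

    # 3. An in-context video generator (distilled for efficiency) propagates
    #    the edit across the whole clip, conditioned on the edited keyframe.
    raw_edit = video_generator.generate(source_video, edited_keyframe, instruction)

    # 4. A temporal/denoising enhancer restores detail and coherence lost
    #    to quantization and distillation.
    edited_video = temporal_enhancer.enhance(raw_edit)

    # 5. The agent rigorously filters low-quality outputs before they
    #    enter the dataset.
    if not quality_filter.accept(source_video, instruction, edited_video):
        return None
    return {"source": source_video, "instruction": instruction, "edit": edited_video}
```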
A large-scale, high-quality, multi-category, and balanced dataset designed for instruction-based video editing (IVE). It comprises eight categories, divided into six spatially-aligned categories (Global Style, Background Change, Local Change, Local Remove, Local Add, and Subtitles Edit) and two non-spatially-aligned categories (Camera Multi-Shot Edit and Creative Edit).
This category involves transforming the global style of a video while preserving the original motion and details. It includes 18 common styles (e.g., Ghibli, oil painting), four times of day (e.g., Morning, Blue Hour), and three weather conditions (Sunny, Rainy, and Snowy).
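For readers who want the category structure at a glance, the snippet below lays it out as a plain data structure. The category names mirror the lists above; the style, time-of-day, and weather entries are abbreviated to the examples quoted above rather than the full lists.

```python
# Illustrative layout of the edit-type taxonomy (abbreviated; not the
# dataset's actual metadata schema).
EDIT_TAXONOMY = {
    "spatially_aligned": {
        "Global Style": {
            "styles": ["Ghibli", "oil painting"],       # 18 styles in total
            "time_of_day": ["Morning", "Blue Hour"],    # 4 in total
            "weather": ["Sunny", "Rainy", "Snowy"],     # 3 in total
        },
        "Background Change": {},
        "Local Change": {},
        "Local Remove": {},
        "Local Add": {},
        "Subtitles Edit": {},
    },
    "non_spatially_aligned": {
        "Camera Multi-Shot Edit": {},
        "Creative Edit": {},
    },
}
```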
Local editing focuses on specific regions or objects within the video, applying precise modifications to targeted areas while preserving the surrounding content. Such edits enable selective enhancement, object replacement, and regional adjustments that maintain the integrity of the overall composition.
An instruction-based video editing model trained on the Ditto-1M dataset, demonstrating superior performance across diverse editing scenarios and outperforming existing methods.
Showcasing the capabilities of our Editto model across various editing scenarios, from global style transfers to precise local modifications.
We showcase the synthetic-to-real (sim2real) capability enabled by our data by training the model to map the stylized videos in our dataset back to their original, real-world source videos.
Here we demonstrate the effectiveness of the denoising enhancer: the raw edited video is shown on the left and the enhanced one on the right (please zoom in to see the details). The denoising enhancer effectively mitigates, at low cost, the generation quality degradation introduced by quantized and distilled models.
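As a rough illustration of how such an enhancer might be wired in (not the actual released implementation), the sketch below applies a hypothetical residual-refinement module to a raw edited clip a few frames at a time; the `enhancer` module and the residual formulation are assumptions made for illustration only.

```python
import torch

@torch.no_grad()
def enhance_video(raw_edit: torch.Tensor, enhancer: torch.nn.Module,
                  chunk: int = 16) -> torch.Tensor:
    """Refine a raw edited video of shape (T, C, H, W) a few frames at a time."""
    refined = []
    for start in range(0, raw_edit.shape[0], chunk):
        frames = raw_edit[start:start + chunk]
        # Assumed design: the enhancer predicts a residual correction that
        # restores detail lost to quantization/distillation at small cost.
        refined.append(frames + enhancer(frames))
    return torch.cat(refined, dim=0)
```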