A Large-Scale High-Quality Dataset for Instruction-Guided Video Editing
The quality and diversity of instruction-based image editing datasets are continuously increasing, yet large-scale, high-quality datasets for instruction-based video editing remain scarce. To address this gap, we introduce OpenVE-3M, an open-source, large-scale, and high-quality dataset for instruction-based video editing. It comprises two primary categories: spatially-aligned edits (Global Style, Background Change, Local Change, Local Remove, Local Add, and Subtitles Edit) and non-spatially-aligned edits (Camera Multi-Shot Edit and Creative Edit). All edit types are generated via a meticulously designed data pipeline with rigorous quality filtering, a process consuming over 10,000 GPU-days. OpenVE-3M surpasses existing open-source datasets in scale, diversity of edit types, instruction length, and overall quality. Furthermore, to address the lack of a unified benchmark in the field, we construct OpenVE-Bench, containing 431 video-edit pairs that cover a diverse range of editing tasks, together with three key metrics that align closely with human judgment. Finally, we present OpenVE-Edit, a 5B model trained on our dataset that demonstrates remarkable efficiency and effectiveness, setting a new state-of-the-art on OpenVE-Bench and outperforming all prior open-source models, including a 14B baseline.
OpenVE-3M is a large-scale, high-quality, multi-category, and balanced dataset designed for instruction-guided video editing (IVE). It comprises eight categories: six spatially-aligned (Global Style, Background Change, Local Change, Local Remove, Local Add, and Subtitles Edit) and two non-spatially-aligned (Camera Multi-Shot Edit and Creative Edit).
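For reference, this taxonomy can be captured in a small lookup structure. The Python snippet below is purely illustrative (the dictionary is not an artifact shipped with the dataset); it mirrors the eight categories exactly as listed:

```python
# Illustrative taxonomy of OpenVE-3M, mirroring the eight categories listed above.
OPENVE_3M_CATEGORIES = {
    "spatially_aligned": [
        "Global Style",
        "Background Change",
        "Local Change",
        "Local Remove",
        "Local Add",
        "Subtitles Edit",
    ],
    "non_spatially_aligned": [
        "Camera Multi-Shot Edit",
        "Creative Edit",
    ],
}

def is_spatially_aligned(category: str) -> bool:
    """Return True if an edit of this category stays spatially aligned with the source video."""
    return category in OPENVE_3M_CATEGORIES["spatially_aligned"]
```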
This category involves transforming the global style of a video while preserving the original motion and details. It includes 18 common styles (e.g., Ghibli, oil painting), four times of day (e.g., Morning, Blue Hour), and three weather conditions (Sunny, Rainy, and Snowy).
For videos with a clear foreground-background distinction, this task involves replacing the background with a variety of scenes.
This category includes a range of edits such as object transformation, style modification, color alteration, and age progression.
This category (Remove Anything) involves removing any specified object from a video.
This category (Add Anything) involves adding any specified object into a video.
This category includes tasks for adding, removing, and modifying subtitles, featuring nine variations (three operations × three positions: top, middle, and bottom).
This non-spatially-aligned task involves editing a video to switch between close-up, medium, and wide shots of the same subject, comprising six transition types (the ordered pairs among the three shot scales).
This non-spatially-aligned task involves editing a subject to follow a creative instruction, where the subject's actions may change significantly.
OpenVE-Edit consists of three main modules: an MLLM, a MoE-Connector, and a DiT. The input editing instruction and video are jointly fed into the MLLM to capture the semantic relationships between the instruction and the visual content. A task-aware MoE-Connector then processes the hidden features from the MLLM, decoupling them through multiple expert networks. These processed features are concatenated along the token dimension with the instruction features encoded by umT5. Concurrently, the latent features of the original video, derived from a VAE, are concatenated with noise along the channel dimension. This composite latent representation then interacts with the combined semantic editing features through the cross-attention mechanism in the DiT.
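A minimal PyTorch sketch of this data flow is given below. It is illustrative rather than a reference implementation: the mllm, umt5, vae, and dit callables are placeholders for the actual pretrained modules, and the connector's soft token-wise routing over four feedforward experts is an assumption, since the exact expert design is not detailed here.

```python
import torch
import torch.nn as nn

class MoEConnector(nn.Module):
    """Task-aware connector: routes MLLM hidden states through several experts.

    The soft token-wise router over four feedforward experts shown here is an
    assumption for illustration; the actual expert design may differ.
    """

    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(dim, num_experts)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (B, T, D) hidden features from the MLLM.
        weights = self.router(h).softmax(dim=-1)                        # (B, T, E)
        expert_out = torch.stack([e(h) for e in self.experts], dim=-1)  # (B, T, D, E)
        return (expert_out * weights.unsqueeze(2)).sum(dim=-1)          # (B, T, D)

def editing_forward(mllm, connector, umt5, vae, dit, instruction, video, noise):
    """One denoising step of the data flow described above (placeholder modules)."""
    h = connector(mllm(instruction, video))  # joint semantics, decoupled by experts
    t = umt5(instruction)                    # instruction features from umT5
    cond = torch.cat([h, t], dim=1)          # concatenate along the token dimension
    z_src = vae.encode(video)                # latents of the original video
    z_in = torch.cat([z_src, noise], dim=1)  # concatenate along the channel dimension
    return dit(z_in, context=cond)           # cross-attention against cond in the DiT
```

Concatenating the source latents with noise along the channel dimension lets the DiT condition every denoising step on the original video, while the token-wise concatenation of MLLM and umT5 features gives cross-attention access to both joint visual-linguistic semantics and pure text semantics.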
OpenVE-Bench is constructed with two primary categories, Spatially-Aligned and Non-Spatially-Aligned edits, further divided into eight fine-grained subcategories, totaling 431 instruction-guided video-editing pairs, with each subcategory containing more than 50 video clips on average.
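Given this structure, scores are naturally reported per subcategory and then macro-averaged, so that the overall number is not skewed by subcategory size. The sketch below assumes each result entry records a "subcategory" name and a "scores" dict for the three metrics; these field names are hypothetical, not the benchmark's actual schema.

```python
from collections import defaultdict

def aggregate_scores(results):
    """Macro-average metrics over the eight subcategories.

    Assumes each entry looks like:
        {"subcategory": "Global Style", "scores": {"metric_a": 0.81, ...}}
    Field names and metric keys are hypothetical.
    """
    per_sub = defaultdict(lambda: defaultdict(list))
    for entry in results:  # one entry per video-edit pair (431 total)
        for metric, value in entry["scores"].items():
            per_sub[entry["subcategory"]][metric].append(value)

    # Mean per metric within each subcategory.
    sub_means = {
        sub: {m: sum(v) / len(v) for m, v in metrics.items()}
        for sub, metrics in per_sub.items()
    }

    # Macro-average across subcategories so subcategory size does not skew the result.
    overall = defaultdict(list)
    for metrics in sub_means.values():
        for m, v in metrics.items():
            overall[m].append(v)
    return sub_means, {m: sum(v) / len(v) for m, v in overall.items()}
```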
@misc{he2025openve,
title={OpenVE-3M: A Large-Scale High-Quality Dataset for Instruction-Guided Video Editing},
author={Haoyang He and Jie Wang and Jiangning Zhang and Zhucun Xue and Xingyuan Bu and Qiangpeng Yang and Shilei Wen and Lei Xie},
year={2025},
eprint={2512.07826},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.07826},
}