A Large-Scale High-Quality Dataset for Instruction-Guided Video Editing
Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models. To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence. Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale. Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples. We trained our model, Editto, on Ditto-1M with a curriculum learning strategy. The results demonstrate superior instruction-following ability and establish a new state-of-the-art in instruction-based video editing.

The quality and diversity of instruction-based image editing datasets are continuously increasing, yet large-scale, high-quality datasets for instruction-based video editing remain scarce. To address this gap, we introduce OpenVE-3M, an open-source, large-scale, and high-quality dataset for instruction-based video editing. It comprises two primary categories: spatially-aligned edits (Global Style, Background Change, Local Change, Local Remove, Local Add, and Subtitles Edit) and non-spatially-aligned edits (Camera Multi-Shot Edit and Creative Edit). All edit types are generated via a meticulously designed data pipeline with rigorous quality filtering, a process consuming more than 10,000 GPU-days. OpenVE-3M surpasses existing open-source datasets in terms of scale, diversity of edit types, instruction length, and overall quality. Furthermore, to address the lack of a unified benchmark in the field, we construct OpenVE-Bench, containing 431 video-edit pairs that cover a diverse range of editing tasks with three key metrics highly aligned with human judgment. We present OpenVE-Edit, a 5B model trained on our dataset that demonstrates remarkable efficiency and effectiveness by setting a new state-of-the-art on OpenVE-Bench, outperforming all prior open-source models including a 14B baseline.
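To make the pipeline concrete, the sketch below traces a single training example through the stages named above: an agent writes the instruction, an image editor edits a keyframe, an in-context video generator propagates the edit, a temporal/denoising enhancer restores quality, and a quality filter decides whether the sample enters the dataset. This is a minimal illustration under assumed interfaces; none of the function or component names (`InstructionAgent`, `image_editor`, etc.) come from the released code.

```python
# Illustrative sketch of the data-generation pipeline; all component
# interfaces are hypothetical placeholders, not the actual implementation.

def build_editing_example(source_video, instruction_agent, image_editor,
                          video_generator, temporal_enhancer, quality_filter):
    """Produce one (source video, instruction, edited video) training triplet."""
    # 1. The agent crafts a diverse editing instruction for this clip.
    instruction = instruction_agent.write_instruction(source_video)

    # 2. A leading image editor edits a keyframe to define the target look.
    keyframe = source_video.frames[0]
    edited_keyframe = image_editor.edit(keyframe, instruction)

    # 3. An in-context video generator (distilled for efficiency) propagates
    #    the edit across the whole clip, conditioned on the edited keyframe.
    raw_edit = video_generator.generate(source_video, edited_keyframe, instruction)

    # 4. A temporal/denoising enhancer restores detail and coherence lost
    #    to quantization and distillation.
    edited_video = temporal_enhancer.enhance(raw_edit)

    # 5. The agent rigorously filters low-quality outputs before they
    #    enter the dataset.
    if not quality_filter.accept(source_video, instruction, edited_video):
        return None
    return {"source": source_video, "instruction": instruction, "edit": edited_video}
```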
A large-scale, high-quality, multi-category, and balanced dataset designed for instruction-based video editing (IVE). It comprises eight categories, divided into six spatially-aligned categories (Global Style, Background Change, Local Change, Local Remove, Local Add, and Subtitles Edit) and two non-spatially-aligned categories (Camera Multi-Shot Edit and Creative Edit).
This category involves transforming the global style of a video while preserving the original motion and details. It includes 18 common styles (e.g., Ghibli, oil painting), four times of day (e.g., Morning, Blue Hour), and three weather conditions (Sunny, Rainy, and Snowy).
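For readers who want the category structure at a glance, the snippet below lays it out as a plain data structure. The category names mirror the lists above; the style, time-of-day, and weather entries are abbreviated to the examples quoted above rather than the full lists.

```python
# Illustrative layout of the edit-type taxonomy (abbreviated; not the
# dataset's actual metadata schema).
EDIT_TAXONOMY = {
    "spatially_aligned": {
        "Global Style": {
            "styles": ["Ghibli", "oil painting"],       # 18 styles in total
            "time_of_day": ["Morning", "Blue Hour"],    # 4 in total
            "weather": ["Sunny", "Rainy", "Snowy"],     # 3 in total
        },
        "Background Change": {},
        "Local Change": {},
        "Local Remove": {},
        "Local Add": {},
        "Subtitles Edit": {},
    },
    "non_spatially_aligned": {
        "Camera Multi-Shot Edit": {},
        "Creative Edit": {},
    },
}
```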
Local editing focuses on specific regions or objects within the video, applying precise modifications to targeted areas while preserving the surrounding content. Such edits enable selective enhancement, object replacement, and regional adjustments that maintain the integrity of the overall composition.
An instruction-based video editing model trained on the Ditto-1M dataset, demonstrating superior performance across diverse editing scenarios and outperforming existing methods.
Showcasing the capabilities of our Editto model across various editing scenarios, from global style transfers to precise local modifications.
We showcase the synthetic-to-real (sim2real) capability enabled by our data by training the model to map the stylized videos in our dataset back to their original, real-world source videos.
Here we demonstrate the effectiveness of the denoising enhancer: the raw edited video is shown on the left and the enhanced one on the right (please zoom in to see the details). The denoising enhancer effectively mitigates, at low cost, the generation quality degradation introduced by quantized and distilled models.
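As a rough illustration of how such an enhancer might be wired in (not the actual released implementation), the sketch below applies a hypothetical residual-refinement module to a raw edited clip a few frames at a time; the `enhancer` module and the residual formulation are assumptions made for illustration only.

```python
import torch

@torch.no_grad()
def enhance_video(raw_edit: torch.Tensor, enhancer: torch.nn.Module,
                  chunk: int = 16) -> torch.Tensor:
    """Refine a raw edited video of shape (T, C, H, W) a few frames at a time."""
    refined = []
    for start in range(0, raw_edit.shape[0], chunk):
        frames = raw_edit[start:start + chunk]
        # Assumed design: the enhancer predicts a residual correction that
        # restores detail lost to quantization/distillation at small cost.
        refined.append(frames + enhancer(frames))
    return torch.cat(refined, dim=0)
```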