OpenVE-3M

A Large-Scale High-Quality Dataset for Instruction-Guided Video Editing

Abstract

The quality and diversity of instruction-based image editing datasets continue to improve, yet large-scale, high-quality datasets for instruction-based video editing remain scarce. To address this gap, we introduce OpenVE-3M, an open-source, large-scale, and high-quality dataset for instruction-based video editing. It comprises two primary categories: spatially-aligned edits (Global Style, Background Change, Local Change, Local Remove, Local Add, and Subtitles Edit) and non-spatially-aligned edits (Camera Multi-Shot Edit and Creative Edit). All edit types are generated via a meticulously designed data pipeline with rigorous quality filtering, a process consuming more than 10,000 GPU-days. OpenVE-3M surpasses existing open-source datasets in scale, diversity of edit types, instruction length, and overall quality. Furthermore, to address the lack of a unified benchmark in the field, we construct OpenVE-Bench, containing 431 video-edit pairs that cover a diverse range of editing tasks, with three key metrics highly aligned with human judgment. Finally, we present OpenVE-Edit, a 5B model trained on our dataset that demonstrates remarkable efficiency and effectiveness, setting a new state-of-the-art on OpenVE-Bench and outperforming all prior open-source models, including a 14B baseline.

Dataset: OpenVE-3M

A large-scale, high-quality, multi-category, and balanced dataset designed for instruction-guided video editing (IVE). It comprises eight categories: six spatially-aligned (Global Style, Background Change, Local Change, Local Remove, Local Add, and Subtitles Edit) and two non-spatially-aligned (Camera Multi-Shot Edit and Creative Edit).
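For reference, this taxonomy can be summarized as a simple mapping. The following is a minimal Python sketch; the dict layout is illustrative, and only the category names come from the dataset:

OPENVE_3M_CATEGORIES = {
    "spatially_aligned": [
        "Global Style",
        "Background Change",
        "Local Change",
        "Local Remove",
        "Local Add",
        "Subtitles Edit",
    ],
    "non_spatially_aligned": [
        "Camera Multi-Shot Edit",
        "Creative Edit",
    ],
}

# Eight categories in total, as described above.
assert sum(len(v) for v in OPENVE_3M_CATEGORIES.values()) == 8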


Global Style

This category involves transforming the global style of a video while preserving the original motion and details. It covers 18 common styles (e.g., Ghibli, oil painting), four times of day (e.g., Morning, Blue Hour), and three weather conditions (Sunny, Rainy, Snowy).
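As an illustration, an instruction sampler over these attribute pools might look like the sketch below. The pool contents beyond the examples named above are placeholders, and the instruction template is hypothetical:

import random

# Placeholder pools: the dataset uses 18 styles, 4 times of day, and
# 3 weather conditions; only the entries named in the text are shown.
STYLES = ["Ghibli", "oil painting"]        # ... 16 more styles
TIMES_OF_DAY = ["Morning", "Blue Hour"]    # ... 2 more times of day
WEATHER = ["Sunny", "Rainy", "Snowy"]

def sample_global_style_instruction() -> str:
    """Draw one global-style editing instruction from the attribute pools."""
    axis, pool = random.choice([
        ("style", STYLES),
        ("time of day", TIMES_OF_DAY),
        ("weather", WEATHER),
    ])
    return f"Change the {axis} of the video to {random.choice(pool)}."

print(sample_global_style_instruction())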

Background Change

For videos with a clear foreground-background distinction, this task involves changing the background to various scenes.

Local Change

This category includes a range of edits such as object transformation, style modification, color alteration, and age progression.

Local Remove

This category involves removing arbitrary instruction-specified objects from a video ("remove anything").

Local Add

This category involves adding arbitrary instruction-specified objects into a video ("add anything").

Subtitles Edit

This category includes adding, removing, and modifying subtitles at three positions (top, middle, and bottom), yielding nine task variations.

Camera Multi-Shot Edit

This non-spatially-aligned task involves editing a video to switch between close-up, medium, and wide shots of the same subject, comprising six transition types in total.
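With three shot scales, the six transitions correspond exactly to the ordered pairs of distinct scales, which a few lines of Python can enumerate (the shot names follow the text; the enumeration itself is illustrative):

from itertools import permutations

SHOTS = ["close-up", "medium", "wide"]

# Ordered pairs of distinct shot scales: P(3, 2) = 6 transition types.
for src, dst in permutations(SHOTS, 2):
    print(f"{src} -> {dst}")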

Creative Edit

This non-spatially-aligned task involves editing an object to follow a creative instruction, where the subject's actions may change significantly.

Method: OpenVE-Edit

OpenVE-Edit consists of three main modules: an MLLM, an MoE-Connector, and a DiT. The editing instruction and input video are jointly fed into the MLLM to capture the semantic relationships between the instruction and the visual content. A task-aware MoE-Connector then processes the hidden features from the MLLM, decoupling them through multiple expert networks. These processed features are concatenated along the token dimension with the instruction features encoded by umT5. Concurrently, the latent features of the original video, derived from a VAE, are concatenated with noise along the channel dimension. This composite latent representation then interacts with the combined semantic editing features through the cross-attention mechanism in the DiT model.
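A minimal PyTorch-style sketch of this data flow is given below. All module definitions, tensor shapes, and names (e.g., MoEConnector with soft gating) are illustrative assumptions, not the released implementation:

import torch
import torch.nn as nn

class MoEConnector(nn.Module):
    """Task-aware mixture-of-experts connector (illustrative stand-in)."""
    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Soft-route every token through all experts, then mix by gate weights.
        weights = self.gate(h).softmax(dim=-1)                     # (B, T, E)
        outs = torch.stack([e(h) for e in self.experts], dim=-1)  # (B, T, D, E)
        return (outs * weights.unsqueeze(-2)).sum(dim=-1)          # (B, T, D)

# Illustrative dimensions; the real MLLM, umT5, VAE, and DiT are large
# pretrained models, stubbed here with random tensors.
B, T_text, T_vid, D = 2, 32, 256, 64
mllm_hidden = torch.randn(B, T_vid + T_text, D)  # MLLM(instruction, video) hidden states
umt5_text = torch.randn(B, T_text, D)            # umT5 instruction features
vae_latent = torch.randn(B, 16, 8, 8, 8)         # VAE video latents, (B, C, F, H, W)

semantic = MoEConnector(D)(mllm_hidden)          # decoupled editing features

# Concatenate with the umT5 features along the token dimension.
cond = torch.cat([semantic, umt5_text], dim=1)

# Concatenate the source-video latents with noise along the channel dimension.
dit_input = torch.cat([vae_latent, torch.randn_like(vae_latent)], dim=1)  # (B, 2C, F, H, W)

# Inside the DiT, dit_input queries `cond` via cross-attention at each block.
print(cond.shape, dit_input.shape)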


Benchmark: OpenVE-Bench

OpenVE-Bench is constructed with two primary categories: spatially-aligned and non-spatially-aligned edits. These are further divided into eight fine-grained subcategories, totaling 431 instruction-guided video editing pairs, with each subcategory containing over 50 video clips on average (431 / 8 ≈ 54).


BibTeX

@misc{he2025openve,
      title={OpenVE-3M: A Large-Scale High-Quality Dataset for Instruction-Guided Video Editing},
      author={Haoyang He and Jie Wang and Jiangning Zhang and Zhucun Xue and Xingyuan Bu and Qiangpeng Yang and Shilei Wen and Lei Xie},
      year={2025},
      eprint={2512.07826},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.07826},
}