Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation

1University of Washington, 2University of California, Santa Cruz, 3Georgia Institute of Technology, 4University of California, San Diego, 5Microsoft

Abstract

Current state-of-the-art video generative models can produce commercial-grade videos with highly realistic details. However, they still struggle to coherently present multiple sequential events in the stories specified by the prompts, which will foreseeably be an essential capability for future long video generation scenarios.

For example, top text-to-video (T2V) generative models still fail to generate a video of the simple short story ''how to put an elephant into a refrigerator.'' While existing detail-oriented benchmarks primarily focus on fine-grained metrics like aesthetic quality and spatial-temporal consistency, they fall short of evaluating models' abilities to handle event-level story presentation.

To address this gap, we introduce StoryEval, a story-oriented benchmark specifically designed to assess T2V models' story-completion capabilities. StoryEval features 423 prompts spanning 7 classes, each representing a short story composed of 2-4 consecutive events. We employ Vision-Language Models (VLMs), such as GPT-4o and LLaVA-OV-Chat-72B, to verify the completion of each event in the generated videos, applying a unanimous voting method to enhance reliability. Our method achieves high alignment with human evaluations, and our evaluation of 11 models reveals the benchmark's difficulty: none exceeds an average story-completion rate of 50%. StoryEval provides a new benchmark for advancing T2V models and highlights the challenges and opportunities in developing next-generation solutions for coherent, story-driven video generation.

StoryEval Evaluation Results on 11 Text-to-Video Generative Models

We visualize each model's completion rates for the stories across the 7 classes, along with the average over the entire prompt suite. Even the best model achieves an average completion rate below 50%, meaning that on average it successfully presents fewer than half of the events in a simple short story.

Pipeline of StoryEval Evaluation

We carefully design 423 prompts across 7 classes, each illustrating a short story of 2-4 sequential events. For evaluation, we select 3 top closed-source commercial models and 8 well-known open-source models, use each to generate videos from the prompts, and then pass the generated videos together with the original prompts to VLM verifiers.

Different from previous detail-oriented evaluations that focus on fine-grained quality features, we let the VLMs judge how many events are successfully presented in the generated video, and thereby obtain the completion rate of the story in the prompt.
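
Concretely, the verification step can be sketched as below. This is a minimal illustration, not the released evaluation code: the `Verifier` callables and toy judges are placeholders for actual VLM queries (e.g., to GPT-4o) over the video, and the unanimous-voting rule follows the description in the abstract.

```python
from typing import Callable, List

# A verifier maps (video_path, event_description) -> bool: is the event shown?
# In StoryEval these are VLM judges such as GPT-4o or LLaVA-OV-Chat-72B;
# here they are abstract callables so the sketch stays self-contained.
Verifier = Callable[[str, str], bool]

def completion_list(video: str, events: List[str],
                    verifiers: List[Verifier]) -> List[int]:
    """Mark an event completed (1) only if ALL verifiers agree it is
    presented in the video -- the unanimous-voting rule."""
    return [int(all(v(video, e) for v in verifiers)) for e in events]

def completion_rate(video: str, events: List[str],
                    verifiers: List[Verifier]) -> float:
    """Fraction of the story's events that the video presents."""
    marks = completion_list(video, events, verifiers)
    return sum(marks) / len(marks)

# Toy verifiers standing in for real VLM calls:
always_yes: Verifier = lambda video, event: True
door_only: Verifier = lambda video, event: event.endswith("door")
events = ["opens the refrigerator door", "puts the elephant in", "closes the door"]
print(completion_list("video.mp4", events, [always_yes, door_only]))  # [1, 0, 1]
```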

How to Construct StoryEval Prompt Suite.

We create the raw prompts by: (1) retrieving and captioning real-world videos; (2) human brainstorming; and (3) guided GPT-4o generation given examples from (1) and (2).
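
For step (3), guided generation can be sketched with the OpenAI Python client as follows; the system prompt and the two seed examples (taken from the showcase below) are illustrative assumptions, not the exact instructions used to build StoryEval.

```python
from openai import OpenAI  # pip install openai; requires OPENAI_API_KEY

client = OpenAI()

# Few-shot seeds drawn from steps (1) and (2); the wording is illustrative.
seed_examples = [
    "A man opens the refrigerator door, puts the elephant in, and then closes the door.",
    "A balloon artist inflates a long balloon, twists it several times, and creates a dog shape.",
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "Write one-sentence short stories of 2-4 consecutive, "
                    "visually checkable events, in the style of the examples."},
        {"role": "user", "content": "\n".join(seed_examples)},
    ],
)
print(response.choices[0].message.content)
```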

We also filter out prompts in the following cases: (1) prompts that use a facial expression (e.g., smiling) as an event, because GPT-4o blurs human faces; (2) prompts for which GPT-4o fails to return completion rates (e.g., due to safety concerns) for more than 2 models; (3) prompts whose corresponding GPT-4o evaluations show relatively weak alignment with human feedback.
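
A sketch of this filtering pass is below. The record fields (`events`, `failed_models`, `human_agreement`), the keyword list, and the 0.7 alignment threshold are hypothetical stand-ins for the actual bookkeeping:

```python
FACIAL_WORDS = ("smile", "smiles", "smiling", "frown", "wink")

def keep_prompt(record: dict) -> bool:
    """Apply the three filtering rules to one prompt record.
    Field names are hypothetical stand-ins for the real bookkeeping."""
    # Rule 1: drop prompts that use a facial expression as an event,
    # since GPT-4o blurs human faces and cannot verify such events.
    if any(w in e.lower() for e in record["events"] for w in FACIAL_WORDS):
        return False
    # Rule 2: drop prompts where GPT-4o failed to return completion
    # rates (e.g., safety refusals) for more than 2 models.
    if record["failed_models"] > 2:
        return False
    # Rule 3: drop prompts whose GPT-4o scores align weakly with
    # human feedback (the 0.7 threshold is an assumption).
    if record["human_agreement"] < 0.7:
        return False
    return True

example = {"events": ["opens the door", "smiles at the camera"],
           "failed_models": 0, "human_agreement": 0.9}
print(keep_prompt(example))  # False: a facial expression is used as an event
```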

Prompt Suite Statistics

(Left) Word cloud of the StoryEval prompt suite. (Right) We visualize the proportion of the 7 classes in the prompt suite using an UpSet plot. The bottom left of the figure shows the number of prompts in each class, while the right side displays the number of prompts belonging to each class or intersection group. For example, there are 14 prompts that belong exactly to all three classes ''Hard'', ''Creative'', and ''Object''.
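
For readers who want to reproduce this kind of figure, the `upsetplot` package draws such intersection plots; in the sketch below, all counts except the 14-prompt ''Hard'' + ''Creative'' + ''Object'' intersection are made-up placeholders.

```python
import matplotlib.pyplot as plt
from upsetplot import from_memberships, plot  # pip install upsetplot

# Illustrative membership counts; only the 14 for
# Hard & Creative & Object is taken from the caption above.
intersections = from_memberships(
    [
        ["Hard"],
        ["Creative"],
        ["Object"],
        ["Hard", "Creative"],
        ["Hard", "Creative", "Object"],
    ],
    data=[30, 25, 40, 20, 14],
)
plot(intersections)
plt.show()
```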

Detailed Results with GPT-4o Verifiers

Notably, the ''Retrieval'' class consists of captions manually annotated for videos retrieved from the real world; from a human perspective, the upper bound of this metric is therefore 100% if the retrieved videos themselves are used as a baseline. Yet none of the T2V models here achieves a completion rate above 45%.

Some Examples

Note: a completion list records whether each event is completed (1) or not (0). Videos are compressed for faster loading.
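
For reference, averaging the per-story rates gives a model's overall score; the sketch below uses Kling1.5's completion lists from the five examples that follow, assuming simple unweighted averaging:

```python
# Kling1.5's completion lists from the five examples below.
kling_lists = [[1, 0, 1], [0, 1, 0], [0, 1, 0], [0, 0, 1], [1, 0]]
per_story = [sum(marks) / len(marks) for marks in kling_lists]
print(sum(per_story) / len(per_story))  # ~0.43 on these five stories
```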





Prompt: A man opens the refrigerator door, puts the elephant in, and then closes the door. (3 events)
Kling1.5

completion list: [1, 0, 1]

Hailuo

completion list: [1, 0, 0]

Sora

completion list: [0, 0, 0]

Sora-Storyboard

completion list: [1, 0, 0]

Pika1.5

completion list: [0, 0, 0]

VideoCrafter2

completion list: [0, 0, 0]

OpenSora

completion list: [1, 0, 0]

VChitect2

completion list: [1, 0, 0]

CogVideoX

completion list: [1, 0, 0]

Pyramid

completion list: [0, 0, 0]

EasyAnimate

completion list: [0, 0, 0]

OpenSora-Plan

completion list: [0, 0, 0]





Prompt: A balloon artist inflates a long balloon, twists it several times, and creates a dog shape. (3 events)
Kling1.5

completion list: [0, 1, 0]

Hailuo

completion list: [0, 0, 0]

Sora

completion list: [0, 1, 0]

Sora-Storyboard

completion list: [0, 1, 0]

Pika1.5

completion list: [0, 0, 0]

VideoCrafter2

completion list: [0, 0, 0]

OpenSora

completion list: [0, 0, 0]

VChitect2

completion list: [0, 1, 0]

CogVideoX

completion list: [0, 1, 0]

Pyramid

completion list: [0, 0, 0]

EasyAnimate

completion list: [0, 0, 1]

OpenSora-Plan

completion list: [0, 0, 0]





Prompt: A man takes off his hat, throws it into the air, and then it is taken by a passing eagle. (3 events)
Kling1.5

completion list: [0, 1, 0]

Hailuo

completion list: [1, 0, 0]

Sora

completion list: [0, 1, 0]

Sora-Storyboard

completion list: [0, 1, 0]

Pika1.5

completion list: [0, 0, 0]

VideoCrafter2

completion list: [0, 0, 0]

OpenSora

completion list: [0, 0, 0]

VChitect2

completion list: [0, 0, 0]

CogVideoX

completion list: [0, 0, 0]

Pyramid

completion list: [0, 0, 0]

EasyAnimate

completion list: [0, 0, 0]

OpenSora-Plan

completion list: [0, 0, 0]





Prompt: A woman pours some coffee beans into a steel cup, walks with the cup towards a coffee bean grinder, and then pours the beans from the steel cup into the grinder. (3 events)
Kling1.5

completion list: [0, 0, 1]

Hailuo

completion list: [0, 0, 0]

Sora

completion list: [1, 0, 0]

Sora-Storyboard

completion list: [0, 1, 0]

Pika1.5

completion list: [0, 1, 0]

VideoCrafter2

completion list: [1, 0, 0]

OpenSora

completion list: [0, 0, 0]

VChitect2

completion list: [0, 0, 0]

CogVideoX

completion list: [1, 0, 0]

Pyramid

completion list: [1, 0, 0]

EasyAnimate

completion list: [0, 0, 0]

OpenSora-Plan

completion list: [1, 0, 0]





Prompt: A man dribbles a basketball, and then throws it in a court. (2 events)
Kling1.5

completion list: [1, 0]

Hailuo

completion list: [1, 0]

Sora

completion list: [1, 0]

Sora-Storyboard

completion list: [1, 0]

Pika1.5

completion list: [1, 0]

VideoCrafter2

completion list: [0, 0]

OpenSora

completion list: [0, 0]

VChitect2

completion list: [0, 0]

CogVideoX

completion list: [1, 0]

Pyramid

completion list: [0, 0]

EasyAnimate

completion list: [0, 0]

OpenSora-Plan

completion list: [0, 0]