Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation

1University of Washington, 2University of California, Santa Cruz, 3Georgia Institute of Technology, 4University of California, San Diego, 5Microsoft

Abstract

Current state-of-the-art video generative models can produce commercial-grade videos with highly realistic details. However, they still struggle to coherently present multiple sequential events in the stories specified by the prompts, which will foreseeably be an essential capability for future long video generation scenarios.

For example, top text-to-video (T2V) generative models still fail to generate a video of the simple short story ''how to put an elephant into a refrigerator.'' While existing detail-oriented benchmarks primarily focus on fine-grained metrics like aesthetic quality and spatial-temporal consistency, they fall short of evaluating models' abilities to handle event-level story presentation.

To address this gap, we introduce StoryEval, a story-oriented benchmark specifically designed to assess T2V models' story-completion capabilities. StoryEval features 423 prompts spanning 7 classes, each representing a short story composed of 2-4 consecutive events. We employ Vision-Language Models (VLMs), such as GPT-4o and LLaVA-OV-Chat-72B, to verify the completion of each event in the generated videos, applying a unanimous voting method to enhance reliability. Our method achieves high alignment with human evaluations, and our evaluation of 11 models reveals the benchmark's difficulty: none exceeds an average story-completion rate of 50%. StoryEval provides a new benchmark for advancing T2V models and highlights the challenges and opportunities in developing next-generation solutions for coherent, story-driven video generation.

StoryEval Evaluation Results on 11 Text-to-Video Generative Models

We visualize each model's completion rates for the stories across the 7 classes, along with the average over the entire prompt suite. Even the best model achieves an average completion rate below 50%, meaning that on average it successfully presents fewer than half of the events in a simple short story.

Pipeline of StoryEval Evaluation

We carefully design 423 prompts across 7 classes, each illustrating a short story of 2-4 sequential events. For evaluation, we select 3 top closed-source commercial models and 8 well-known open-source models, use each to generate videos from the prompts, and then pass the generated videos together with the original prompts to VLM verifiers.

Different from previous detail-oriented evaluations that focus on fine-grained quality features, we let the VLMs judge how many events are successfully presented in the generated video, and thereby obtain the completion rate of the story in the prompt.
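
Concretely, the verification step can be sketched as below. This is a minimal illustration, not the released evaluation code: the `Verifier` callables and toy judges are placeholders for actual VLM queries (e.g., to GPT-4o) over the video, and the unanimous-voting rule follows the description in the abstract.

```python
from typing import Callable, List

# A verifier maps (video_path, event_description) -> bool: is the event shown?
# In StoryEval these are VLM judges such as GPT-4o or LLaVA-OV-Chat-72B;
# here they are abstract callables so the sketch stays self-contained.
Verifier = Callable[[str, str], bool]

def completion_list(video: str, events: List[str],
                    verifiers: List[Verifier]) -> List[int]:
    """Mark an event completed (1) only if ALL verifiers agree it is
    presented in the video -- the unanimous-voting rule."""
    return [int(all(v(video, e) for v in verifiers)) for e in events]

def completion_rate(video: str, events: List[str],
                    verifiers: List[Verifier]) -> float:
    """Fraction of the story's events that the video presents."""
    marks = completion_list(video, events, verifiers)
    return sum(marks) / len(marks)

# Toy verifiers standing in for real VLM calls:
always_yes: Verifier = lambda video, event: True
door_only: Verifier = lambda video, event: event.endswith("door")
events = ["opens the refrigerator door", "puts the elephant in", "closes the door"]
print(completion_list("video.mp4", events, [always_yes, door_only]))  # [1, 0, 1]
```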

How to Construct StoryEval Prompt Suite.

We create the raw prompts by: (1) retrieving and captioning real-world videos; (2) human brainstorming; and (3) guided GPT-4o generation given examples from (1) and (2).
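
For step (3), guided generation can be sketched with the OpenAI Python client as follows; the system prompt and the two seed examples (taken from the showcase below) are illustrative assumptions, not the exact instructions used to build StoryEval.

```python
from openai import OpenAI  # pip install openai; requires OPENAI_API_KEY

client = OpenAI()

# Few-shot seeds drawn from steps (1) and (2); the wording is illustrative.
seed_examples = [
    "A man opens the refrigerator door, puts the elephant in, and then closes the door.",
    "A balloon artist inflates a long balloon, twists it several times, and creates a dog shape.",
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "Write one-sentence short stories of 2-4 consecutive, "
                    "visually checkable events, in the style of the examples."},
        {"role": "user", "content": "\n".join(seed_examples)},
    ],
)
print(response.choices[0].message.content)
```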

We also filter out prompts in the following cases: (1) prompts that use a facial expression (e.g., smiling) as an event, because GPT-4o blurs human faces; (2) prompts for which GPT-4o fails to return completion rates (e.g., due to safety concerns) for more than 2 models; (3) prompts whose corresponding GPT-4o evaluations show relatively weak alignment with human feedback.
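
A sketch of this filtering pass is below. The record fields (`events`, `failed_models`, `human_agreement`), the keyword list, and the 0.7 alignment threshold are hypothetical stand-ins for the actual bookkeeping:

```python
FACIAL_WORDS = ("smile", "smiles", "smiling", "frown", "wink")

def keep_prompt(record: dict) -> bool:
    """Apply the three filtering rules to one prompt record.
    Field names are hypothetical stand-ins for the real bookkeeping."""
    # Rule 1: drop prompts that use a facial expression as an event,
    # since GPT-4o blurs human faces and cannot verify such events.
    if any(w in e.lower() for e in record["events"] for w in FACIAL_WORDS):
        return False
    # Rule 2: drop prompts where GPT-4o failed to return completion
    # rates (e.g., safety refusals) for more than 2 models.
    if record["failed_models"] > 2:
        return False
    # Rule 3: drop prompts whose GPT-4o scores align weakly with
    # human feedback (the 0.7 threshold is an assumption).
    if record["human_agreement"] < 0.7:
        return False
    return True

example = {"events": ["opens the door", "smiles at the camera"],
           "failed_models": 0, "human_agreement": 0.9}
print(keep_prompt(example))  # False: a facial expression is used as an event
```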

Prompt Suite Statistics

(Left) Word cloud of the StoryEval prompt suite. (Right) We visualize the proportion of the 7 classes in the prompt suite using an UpSet plot. The bottom left of the figure shows the number of prompts in each class, while the right side displays the number of prompts belonging to each class or intersection group. For example, there are 14 prompts that belong exactly to all three classes ''Hard'', ''Creative'', and ''Object''.
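
For readers who want to reproduce this kind of figure, the `upsetplot` package draws such intersection plots; in the sketch below, all counts except the 14-prompt ''Hard'' + ''Creative'' + ''Object'' intersection are made-up placeholders.

```python
import matplotlib.pyplot as plt
from upsetplot import from_memberships, plot  # pip install upsetplot

# Illustrative membership counts; only the 14 for
# Hard & Creative & Object is taken from the caption above.
intersections = from_memberships(
    [
        ["Hard"],
        ["Creative"],
        ["Object"],
        ["Hard", "Creative"],
        ["Hard", "Creative", "Object"],
    ],
    data=[30, 25, 40, 20, 14],
)
plot(intersections)
plt.show()
```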

Detailed Results with GPT-4o Verifiers

Notably, the ''Retrieval'' class consists of captions manually annotated for videos retrieved from the real world; from a human perspective, the upper bound of this metric is therefore 100% if the retrieved videos themselves are used as a baseline. Yet none of the T2V models here achieves a completion rate above 45%.

Some Examples

Note: a completion list records whether each event is completed (1) or not (0). Videos are compressed for faster loading.
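
For reference, averaging the per-story rates gives a model's overall score; the sketch below uses Kling1.5's completion lists from the five examples that follow, assuming simple unweighted averaging:

```python
# Kling1.5's completion lists from the five examples below.
kling_lists = [[1, 0, 1], [0, 1, 0], [0, 1, 0], [0, 0, 1], [1, 0]]
per_story = [sum(marks) / len(marks) for marks in kling_lists]
print(sum(per_story) / len(per_story))  # ~0.43 on these five stories
```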





Prompt: A man opens the refrigerator door, puts the elephant in, and then closes the door. (3 events)
Kling1.5

completion list: [1, 0, 1]

Hailuo

completion list: [1, 0, 0]

Sora

completion list: [0, 0, 0]

Sora-Storyboard

completion list: [1, 0, 0]

Pika1.5

completion list: [0, 0, 0]

VideoCrafter2

completion list: [0, 0, 0]

OpenSora

completion list: [1, 0, 0]

VChitect2

completion list: [1, 0, 0]

CogVideoX

completion list: [1, 0, 0]

Pyramid

completion list: [0, 0, 0]

EasyAnimate

completion list: [0, 0, 0]

OpenSora-Plan

completion list: [0, 0, 0]





Prompt: A balloon artist inflates a long balloon, twists it several times, and creates a dog shape. (3 events)
Kling1.5

completion list: [0, 1, 0]

Hailuo

completion list: [0, 0, 0]

Sora

completion list: [0, 1, 0]

Sora-Storyboard

completion list: [0, 1, 0]

Pika1.5

completion list: [0, 0, 0]

VideoCrafter2

completion list: [0, 0, 0]

OpenSora

completion list: [0, 0, 0]

VChitect2

completion list: [0, 1, 0]

CogVideoX

completion list: [0, 1, 0]

Pyramid

completion list: [0, 0, 0]

EasyAnimate

completion list: [0, 0, 1]

OpenSora-Plan

completion list: [0, 0, 0]





Prompt: A man takes off his hat, throws it into the air, and then it is taken by a passing eagle. (3 events)
Kling1.5

completion list: [0, 1, 0]

Hailuo

completion list: [1, 0, 0]

Sora

completion list: [0, 1, 0]

Sora-Storyboard

completion list: [0, 1, 0]

Pika1.5

completion list: [0, 0, 0]

VideoCrafter2

completion list: [0, 0, 0]

OpenSora

completion list: [0, 0, 0]

VChitect2

completion list: [0, 0, 0]

CogVideoX

completion list: [0, 0, 0]

Pyramid

completion list: [0, 0, 0]

EasyAnimate

completion list: [0, 0, 0]

OpenSora-Plan

completion list: [0, 0, 0]





Prompt: A woman pours some coffee beans into a steel cup, walks with the cup towards a coffee bean grinder, and then pours the beans from the steel cup into the grinder. (3 events)
Kling1.5

completion list: [0, 0, 1]

Hailuo

completion list: [0, 0, 0]

Sora

completion list: [1, 0, 0]

Sora-Storyboard

completion list: [0, 1, 0]

Pika1.5

completion list: [0, 1, 0]

VideoCrafter2

completion list: [1, 0, 0]

OpenSora

completion list: [0, 0, 0]

VChitect2

completion list: [0, 0, 0]

CogVideoX

completion list: [1, 0, 0]

Pyramid

completion list: [1, 0, 0]

EasyAnimate

completion list: [0, 0, 0]

OpenSora-Plan

completion list: [1, 0, 0]





Prompt: A man dribbles a basketball, and then throws it in a court. (2 events)
Kling1.5

completion list: [1, 0]

Hailuo

completion list: [1, 0]

Sora

completion list: [1, 0]

Sora-Storyboard

completion list: [1, 0]

Pika1.5

completion list: [1, 0]

VideoCrafter2

completion list: [0, 0]

OpenSora

completion list: [0, 0]

VChitect2

completion list: [0, 0]

CogVideoX

completion list: [1, 0]

Pyramid

completion list: [0, 0]

EasyAnimate

completion list: [0, 0]

OpenSora-Plan

completion list: [0, 0]