Figure: an example video generated by Kling 1.5 for a three-event story, annotated with its per-event completion list [1, 0, 1].
Current state-of-the-art video generative models can produce commercial-grade videos with highly realistic details. However, they still struggle to coherently present the multiple sequential events in the stories specified by their prompts, a capability that will foreseeably be essential for future long-video generation.
For example, top T2V generative models still fail to generate a video of the simple short story ''how to put an elephant into a refrigerator.'' Existing detail-oriented benchmarks primarily focus on fine-grained metrics such as aesthetic quality and spatial-temporal consistency, and thus fall short of evaluating models' ability to handle event-level story presentation.
To address this gap, we introduce StoryEval, a story-oriented benchmark specifically designed to assess text-to-video (T2V) models' story-completion capabilities. StoryEval features 423 prompts spanning 7 classes, each representing a short story composed of 2–4 consecutive events. We employ Vision-Language Models (VLMs), such as GPT-4o and LLaVA-OV-Chat-72B, to verify the completion of each event in the generated videos, applying a unanimous voting method to enhance reliability. This protocol aligns closely with human evaluations, and our evaluation of 11 models reveals how challenging the benchmark is: none exceeds an average story-completion rate of 50%. StoryEval provides a new benchmark for advancing T2V models and highlights the challenges and opportunities in developing next-generation solutions for coherent, story-driven video generation.
We visualize the 11 evaluated models' completion rates for the stories across the 7 classes, along with the average over the entire suite. Even the best model achieves an average completion rate below 50%, meaning that, on average, it successfully presents fewer than half of the events in a simple short story.
We carefully design 423 prompts across 7 classes, each prompt illustrating a short story containing 2–4 sequential events. For evaluation, we select 3 top closed-source commercial models and 8 well-known open-source models, generate a video from each prompt, and then feed the generated videos together with the original prompts to the VLM verifiers.
Unlike previous detail-oriented evaluations that focus on fine-grained quality features, we let the VLM judge how many of the prompt's events are successfully presented in the generated video, and thus obtain the completion rate of the story in the prompt.
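To make the aggregation step concrete, here is a minimal sketch of the unanimous voting and completion-rate computation, assuming each VLM verifier has already returned a binary per-event completion list for a video (the function names and list format are our illustrative assumptions, not the released implementation):

```python
from typing import List

def unanimous_vote(verifier_lists: List[List[int]]) -> List[int]:
    """Merge per-event binary judgments from several VLM verifiers.

    An event counts as completed only if every verifier agrees it is
    presented in the video (unanimous voting).
    """
    n_events = len(verifier_lists[0])
    assert all(len(v) == n_events for v in verifier_lists)
    return [int(all(v[i] == 1 for v in verifier_lists)) for i in range(n_events)]

def completion_rate(events: List[int]) -> float:
    """Fraction of the story's events that the video presents."""
    return sum(events) / len(events)

# Example: two verifiers (e.g., GPT-4o and LLaVA-OV-Chat-72B) judge a
# three-event story and agree only on the first event.
votes = [[1, 0, 1], [1, 0, 0]]
print(completion_rate(unanimous_vote(votes)))  # 0.333...
```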
We create the raw prompts in three ways: (1) retrieving and captioning real-world videos; (2) human brainstorming; (3) guided GPT-4o generation, given examples from (1) and (2).
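As an illustration of step (3), guided generation could be a simple few-shot call to GPT-4o; the instruction wording and the seed examples below are placeholders, not the benchmark's actual prompts:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Seed examples standing in for outputs of steps (1) and (2); the
# wording here is illustrative, not the benchmark's actual prompts.
seed_examples = [
    "A man opens a refrigerator, puts an elephant inside, and closes the door.",
    "A cat jumps onto a table, knocks over a cup, then licks the spilled milk.",
]

instruction = (
    "Write one short story prompt for a text-to-video model. "
    "It must describe 2-4 consecutive, clearly separable events, "
    "in the same style as these examples:\n- " + "\n- ".join(seed_examples)
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": instruction}],
)
print(response.choices[0].message.content)
```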
We also filter out prompts in the following cases: (1) prompts that use a facial expression (e.g., smiling) as an event, because GPT-4o blurs human faces; (2) prompts for which GPT-4o fails to return completion rates (e.g., due to safety concerns) for more than 2 models; (3) prompts whose GPT-4o evaluations align relatively weakly with human feedback.
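A minimal sketch of this filtering step is given below; the field names, facial-expression keyword list, and alignment threshold are hypothetical stand-ins for the actual criteria:

```python
# Hypothetical per-prompt record; the keyword list and the alignment
# threshold below are illustrative assumptions, not the paper's values.
FACIAL_KEYWORDS = ("smile", "smiling", "frown", "wink", "laugh")

def keep_prompt(events, n_failed_models, human_alignment, min_alignment=0.8):
    """Return True if a prompt survives the three filtering rules."""
    # Rule 1: no facial-expression events (GPT-4o blurs human faces).
    if any(k in event.lower() for event in events for k in FACIAL_KEYWORDS):
        return False
    # Rule 2: GPT-4o must return completion rates for all but at most 2 models.
    if n_failed_models > 2:
        return False
    # Rule 3: GPT-4o's judgments must align well enough with human feedback.
    return human_alignment >= min_alignment

print(keep_prompt(["A dog fetches a ball", "then smiles at the camera"], 0, 0.9))  # False
```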
(Left) Word cloud of the StoryEval prompt suite. (Right) Proportions of the 7 classes in the prompt suite, visualized with an UpSet plot. The bottom left of the figure shows the number of prompts in each class, while the right side displays the number of prompts belonging to each class or intersection group. For example, 14 prompts belong to exactly the three classes ''Hard'', ''Creative'', and ''Object''.
Notably, the ''Retrieval'' class consists of captions manually annotated for videos retrieved from the real world; from a human perspective, the upper bound of this metric is therefore 100% if the retrieved videos themselves are used as a baseline. Yet none of the T2V models evaluated here achieves a completion rate higher than 45%.
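An UpSet plot of this kind can be produced with the open-source upsetplot package; in the sketch below, only the 14-prompt ''Hard''/''Creative''/''Object'' intersection comes from the caption above, and every other membership count is a placeholder:

```python
import matplotlib.pyplot as plt
from upsetplot import from_memberships, plot

# Only the 14-prompt "Hard & Creative & Object" intersection is taken
# from the caption above; all other counts are placeholders.
prompt_counts = from_memberships(
    [
        ["Hard"],
        ["Creative"],
        ["Object"],
        ["Hard", "Creative"],
        ["Hard", "Creative", "Object"],
    ],
    data=[40, 55, 30, 20, 14],
)

plot(prompt_counts)
plt.show()
```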
Figure: qualitative examples, each annotated with its per-event completion list (e.g., [1, 0, 0]); most generated videos present at most one of the story's events.