Paper ID

3b87e795f1f501843f7f99e83e38f125f6af8600


Title

Coherent Multi-Frame Story Visualization via Sequential Conditional GANs with Scene Graph Intermediaries


Introduction

Problem Statement

Current story-to-image generation models struggle to maintain coherent visual elements and spatial relationships across multiple frames, often resulting in inconsistent character appearances and scene layouts in multi-sentence narratives. This inconsistency significantly hampers the quality and usefulness of generated visual stories.

Motivation

Existing approaches typically use sequential GANs or transformer-based models to generate images frame-by-frame, but lack explicit mechanisms to ensure global consistency. By introducing an intermediate semantic scene graph representation, we can explicitly model and enforce consistency in object relationships and attributes across the narrative sequence. This approach bridges the gap between text understanding and image generation, allowing for more coherent and faithful visual storytelling.


Proposed Method

We propose a two-stage approach: (1) a text-to-scene-graph transformer that converts a multi-sentence narrative into a sequence of semantic scene graphs capturing objects, attributes, and relationships; and (2) a graph-to-image GAN that synthesizes images from these scene graphs while maintaining global consistency. The scene-graph sequence acts as a 'storyboard' that can be inspected and iteratively refined before any pixels are generated. We introduce a novel graph attention mechanism in the GAN that focuses on the subgraph relevant to each frame, and a consistency loss that penalizes deviations in object appearance across frames.
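
The consistency loss is the component that most directly targets cross-frame drift. Below is a minimal PyTorch sketch, assuming each generated frame yields pooled feature embeddings for its object regions and that the scene-graph sequence supplies persistent object identities; the function name and tensor layouts are illustrative, not a fixed interface.

    import torch
    import torch.nn.functional as F

    def consistency_loss(object_feats, object_ids):
        """Penalize appearance drift of the same object across frames.

        object_feats: list of (num_objects_t, d) tensors, one per frame,
                      holding pooled features of each object region.
        object_ids:   list of (num_objects_t,) long tensors with persistent
                      identities taken from the scene-graph sequence.
        """
        loss, pairs = torch.tensor(0.0), 0
        for t in range(len(object_feats) - 1):
            ids_now, ids_next = object_ids[t], object_ids[t + 1]
            for i, oid in enumerate(ids_now.tolist()):
                # Find the same object (if it persists) in the next frame.
                j = (ids_next == oid).nonzero(as_tuple=True)[0]
                if j.numel() == 0:
                    continue
                f_now = F.normalize(object_feats[t][i], dim=-1)
                f_next = F.normalize(object_feats[t + 1][j[0]], dim=-1)
                loss = loss + (1.0 - (f_now * f_next).sum())  # cosine distance
                pairs += 1
        return loss / max(pairs, 1)

Using cosine distance on normalized features makes the penalty insensitive to feature magnitude, so the loss targets appearance changes rather than activation scale.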


Experiments Plan

Step-by-Step Experiment Plan

Step 1: Data Preparation

Preprocess the PororoSV and FlintstonesSV datasets: extract the per-frame text descriptions and ground-truth images for each story sequence. Since neither dataset ships with scene-graph annotations, derive them automatically, for example by running an off-the-shelf scene-graph parser over the captions, and align each graph with its frame.
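
A hypothetical record layout for one preprocessed story is sketched below; the field names and paths are placeholders, not the datasets' native schema.

    # Hypothetical layout for one preprocessed story; field names and
    # paths are placeholders, not the datasets' native schema.
    story_record = {
        "story_id": "pororo_00042",
        "sentences": [
            "Pororo and Crong are playing in the snow.",
            "Pororo builds a snowman while Crong makes snow angels.",
            "They decide to have a snowball fight.",
        ],
        # One parsed scene graph and one ground-truth frame per sentence.
        "scene_graph_paths": ["graphs/00042_0.json", "graphs/00042_1.json",
                              "graphs/00042_2.json"],
        "image_paths": ["frames/00042_0.png", "frames/00042_1.png",
                        "frames/00042_2.png"],
    }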

Step 2: Implement Text-to-Scene-Graph Transformer

Develop a transformer model that takes multi-sentence narratives as input and outputs a sequence of scene graphs. Use the extracted scene graphs from Step 1 as training data.
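
One standard way to train such a model is to linearize each target graph into a token sequence and treat the task as sequence-to-sequence generation. The sketch below shows one possible linearization; the graph dict layout and the <obj>/<rel> markers are assumptions, not the only option.

    def linearize_scene_graph(graph):
        """Flatten a scene graph into a token sequence for seq2seq training.

        graph: dict with 'objects' (id -> {'label': str, 'attributes': [str]})
               and 'relations' (list of (subj_id, predicate, obj_id) triples).
        This dict layout and the <obj>/<rel> markers are illustrative.
        """
        tokens = []
        for oid, obj in graph["objects"].items():
            tokens += ["<obj>", obj["label"], *obj["attributes"]]
        for s, pred, o in graph["relations"]:
            tokens += ["<rel>", graph["objects"][s]["label"], pred,
                       graph["objects"][o]["label"]]
        return tokens

Decoding inverts this mapping, and the cross-entropy loss in Step 4 is computed over these token sequences.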

Step 3: Implement Graph-to-Image GAN

Develop a GAN model that takes scene graphs as input and generates corresponding images. Implement the novel graph attention mechanism and consistency loss.
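
One way to realize the graph attention mechanism is a GAT-style layer over the scene-graph node embeddings, so the generator can weight the subgraph relevant to the current frame. The sketch below is a single-head layer under that assumption, not the full conditioning pathway.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GraphAttentionLayer(nn.Module):
        """Single-head GAT-style layer; a sketch of the subgraph-attention idea."""

        def __init__(self, dim):
            super().__init__()
            self.proj = nn.Linear(dim, dim, bias=False)
            self.attn = nn.Linear(2 * dim, 1, bias=False)

        def forward(self, nodes, adj):
            # nodes: (N, d) node embeddings; adj: (N, N) 0/1 adjacency mask.
            h = self.proj(nodes)
            n = h.size(0)
            # Self-loops keep isolated nodes from producing NaN attention.
            adj = adj + torch.eye(n, device=adj.device)
            pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                               h.unsqueeze(0).expand(n, n, -1)], dim=-1)
            scores = F.leaky_relu(self.attn(pairs).squeeze(-1), 0.2)
            scores = scores.masked_fill(adj == 0, float("-inf"))
            return F.elu(F.softmax(scores, dim=-1) @ h)

Per-frame conditioning could then be obtained by masking adj down to the subgraph mentioned in the current sentence.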

Step 4: Training

Train the two models separately. For the transformer, use cross-entropy loss over the predicted scene-graph token sequences (see the linearization sketch in Step 2). For the GAN, use a combination of adversarial loss, reconstruction loss, and the proposed consistency loss; a sketch of the combined objective follows.
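
A minimal sketch of the combined generator objective, reusing consistency_loss from the sketch in the Proposed Method section; the loss weights are placeholder hyperparameters, not tuned values.

    import torch
    import torch.nn.functional as F

    LAMBDA_REC, LAMBDA_CONS = 10.0, 5.0  # placeholder weights, to be tuned

    def generator_loss(d_fake_logits, fake_imgs, real_imgs,
                       object_feats, object_ids):
        # Non-saturating adversarial term: fool the discriminator.
        adv = F.binary_cross_entropy_with_logits(
            d_fake_logits, torch.ones_like(d_fake_logits))
        # Pixel-level reconstruction against the ground-truth frames.
        rec = F.l1_loss(fake_imgs, real_imgs)
        # Cross-frame appearance term from the earlier sketch.
        cons = consistency_loss(object_feats, object_ids)
        return adv + LAMBDA_REC * rec + LAMBDA_CONS * cons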

Step 5: End-to-End Pipeline

Integrate the two models into a single pipeline that takes a multi-sentence narrative as input and outputs a sequence of coherent images.
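
At inference time the pipeline reduces to two calls; predict and generate are hypothetical interface names for the trained Stage-1 and Stage-2 models.

    def generate_story(sentences, t2sg_model, g2i_model):
        """Narrative -> scene-graph sequence -> image sequence."""
        graphs = t2sg_model.predict(sentences)   # one graph per sentence
        # The graph sequence is the 'storyboard': it can be inspected or
        # edited here, before any pixels are generated.
        frames = g2i_model.generate(graphs)      # one image per graph
        return graphs, frames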

Step 6: Baseline Implementation

Implement state-of-the-art baselines for comparison, such as a sequential story-visualization GAN (e.g., StoryGAN) and a pretrained-transformer approach (e.g., StoryDALL-E).

Step 7: Evaluation

Evaluate the proposed method and baselines on test sets from PororoSV and FlintstonesSV. Use FID for image quality, CLIP score for text-image alignment, and implement a new metric for cross-frame consistency based on object tracking and attribute preservation.
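
A minimal sketch of the proposed cross-frame consistency metric, assuming an off-the-shelf detector/tracker supplies per-frame object features and persistent identities for the generated frames; higher is better.

    import torch
    import torch.nn.functional as F

    def cross_frame_consistency(frame_feats, frame_ids):
        """Mean cosine similarity of matched objects across consecutive frames.

        frame_feats: list of (num_objects_t, d) tensors from a pretrained
                     encoder run on the generated frames.
        frame_ids:   list of (num_objects_t,) identity tensors from tracking.
        """
        sims = []
        for t in range(len(frame_feats) - 1):
            for i, oid in enumerate(frame_ids[t].tolist()):
                j = (frame_ids[t + 1] == oid).nonzero(as_tuple=True)[0]
                if j.numel() == 0:
                    continue  # object not tracked into the next frame
                sims.append(F.cosine_similarity(
                    frame_feats[t][i], frame_feats[t + 1][j[0]], dim=0))
        return torch.stack(sims).mean().item() if sims else float("nan")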

Step 8: Analysis

Perform ablation studies to understand the contribution of each component (scene graph intermediary, graph attention, consistency loss) to the overall performance.
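
The ablations can be expressed as simple configuration toggles; the flag names below are hypothetical.

    # Each run disables exactly one component (flag names are hypothetical).
    ABLATIONS = [
        {"use_scene_graph": False, "use_graph_attention": True,  "use_consistency_loss": True},
        {"use_scene_graph": True,  "use_graph_attention": False, "use_consistency_loss": True},
        {"use_scene_graph": True,  "use_graph_attention": True,  "use_consistency_loss": False},
    ]

Disabling the scene graph reduces the system to direct text-to-image generation, which doubles as an internal baseline.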

Test Case Examples

Baseline Prompt Input

Pororo and Crong are playing in the snow. Pororo builds a snowman while Crong makes snow angels. They decide to have a snowball fight.

Baseline Prompt Expected Output

A sequence of three images showing: 1) Pororo and Crong in a snowy setting, 2) Pororo next to a snowman and Crong lying in the snow, 3) Pororo and Crong throwing snowballs. However, the characters' appearances and the background details may be inconsistent across frames.

Proposed Prompt Input

Pororo and Crong are playing in the snow. Pororo builds a snowman while Crong makes snow angels. They decide to have a snowball fight.

Proposed Prompt Expected Output

A sequence of three images showing: 1) Pororo and Crong in a consistent snowy setting, 2) Pororo next to a snowman and Crong lying in the snow, with consistent character appearances and background, 3) Pororo and Crong throwing snowballs, maintaining visual coherence with the previous frames.

Explanation

The proposed method maintains consistent character appearances, background details, and spatial relationships across all frames, resulting in a more coherent visual story compared to the baseline method.

Fallback Plan

If the proposed method fails to significantly outperform the baselines, we can pivot to an analysis paper on the challenges of maintaining visual consistency in multi-frame story generation. We would conduct a thorough error analysis to identify common failure modes, such as lapses in object persistence, attribute consistency, or spatial-relationship preservation. We could also probe the effectiveness of the scene graph as an intermediate representation by visualizing and analyzing the generated graphs. Additionally, we might explore alternative architectures for the graph-to-image GAN, such as incorporating memory mechanisms or different attention designs. Finally, we could develop finer-grained evaluation metrics that capture specific aspects of visual consistency, which would provide insights for future work in this area.

