Paper ID

3b87e795f1f501843f7f99e83e38f125f6af8600


Title

Integrating sketches and narrative-preserving prompts to enhance story visualization.


Introduction

Problem Statement

Incorporating sketches and narrative-preserving prompt generation as control variables in an interactive story visualization framework will improve both semantic alignment and image quality compared to using static story visualization models.

Motivation

Existing story visualization methods often struggle with maintaining semantic consistency and high image quality when incorporating user inputs like sketches and text prompts. While some models focus on either semantic alignment or image quality, they rarely address both simultaneously, especially in dynamic story visualization contexts. No prior work has extensively explored the combined use of sketches and narrative-preserving prompt generation to enhance both semantic alignment and image quality in interactive story visualization frameworks. This gap is critical as it limits the potential for creating visually coherent and semantically rich story sequences that align closely with user intentions.


Proposed Method

This research explores the integration of sketches and narrative-preserving prompt generation within an interactive story visualization framework to enhance semantic alignment and image quality. Sketches provide dense control conditions, allowing users to define detailed scene layouts and key elements, which are then translated into high-quality images by text-to-image models. Narrative-preserving prompt generation ensures that the visualizations maintain the narrative structure, preserving key scenes and character attributes. By combining these two control variables, the system can produce images that are both visually consistent and semantically aligned with the narrative context. This approach addresses the limitations of existing models that focus on either semantic alignment or image quality but not both. The expected outcome is a significant improvement in semantic alignment scores and in image quality (i.e., lower FID), demonstrating the effectiveness of this integrated approach. This research is particularly relevant for applications requiring high-quality, semantically rich visual storytelling, such as interactive media and educational tools.

Background

Sketches: Sketches act as dense control conditions in the story visualization framework, allowing users to define detailed scene layouts and key elements. These sketches are translated into final images by text-to-image models, ensuring that the images produced are high in quality and consistent with the user's intentions. This approach allows for intuitive user interactions, as users can directly influence the visual outcome through their sketches. The use of sketches is expected to enhance image quality by providing precise control over scene composition.

Narrative-Preserving Prompt Generation: This approach involves generating prompts that preserve the narrative structure of the story, ensuring that visualizations effectively convey the story's intentions. The framework uses large language models to analyze narrative structures from plain text input and refine them into layered prompts for both background and foreground elements. This process, known as story distillation, ensures that key scenes and character attributes are preserved in the generated images, enhancing semantic consistency. This variable is expected to improve semantic alignment by maintaining narrative coherence throughout the visualization process.
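
A minimal sketch of how the story-distillation step could be implemented is given below, assuming an OpenAI-style chat-completions API; the model name, instruction wording, and the "scenes" JSON schema are illustrative assumptions rather than details prescribed by this proposal.

```python
# Minimal story-distillation sketch. The model name, instruction wording, and
# the {"scenes": [...]} JSON schema are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

DISTILL_INSTRUCTIONS = (
    "Read the story and return JSON of the form "
    '{"scenes": [{"background": "...", "foreground": "..."}, ...]}, '
    "where background describes the setting, time, and atmosphere, and "
    "foreground describes characters, their attributes, and actions, "
    "preserving the narrative order."
)

def distill_story(narrative: str) -> list[dict]:
    """Turn plain narrative text into layered background/foreground prompts."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder choice of LLM
        messages=[
            {"role": "system", "content": DISTILL_INSTRUCTIONS},
            {"role": "user", "content": narrative},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["scenes"]
```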

Implementation

The proposed method involves integrating sketches and narrative-preserving prompt generation within an interactive story visualization framework. Initially, users provide sketches as dense control conditions to define scene layouts and key elements. These sketches are processed by a text-to-image model to generate high-quality images. Simultaneously, narrative-preserving prompt generation is employed to analyze the narrative structure of the story, refining it into layered prompts for both background and foreground elements. This ensures that the generated images maintain semantic consistency with the original narrative. The integration occurs at the input stage, where sketches and narrative prompts are combined to guide the image generation process. The system leverages large language models to interpret both inputs, ensuring that the visualizations align with user intentions and narrative context. The expected outcome is an improvement in semantic alignment scores and in image quality (lower FID), demonstrating the effectiveness of this integrated approach in producing visually consistent and semantically rich story sequences.
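
The proposal does not pin the integration step to a particular backbone. One plausible realization, sketched below, conditions a Stable Diffusion model on the sketch through a ControlNet scribble adapter while the layered prompts drive the text conditioning; the checkpoint names and the simple prompt-concatenation rule are assumptions.

```python
# One plausible realization of combined sketch + narrative conditioning using
# diffusers; the checkpoints and the prompt-merging rule are assumptions.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

def generate_scene(scene_prompts: dict, sketch: Image.Image) -> Image.Image:
    """Condition generation on the sketch while fusing foreground and background prompts."""
    prompt = f"{scene_prompts['foreground']}, {scene_prompts['background']}"
    out = pipe(prompt=prompt, image=sketch, num_inference_steps=30, guidance_scale=7.5)
    return out.images[0]
```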


Experiments Plan

Operationalization Information

Please implement an experiment to evaluate the hypothesis that incorporating sketches and narrative-preserving prompt generation as control variables in an interactive story visualization framework will improve both semantic alignment and image quality compared to using static story visualization models.

Experiment Overview

This experiment will test a novel approach to story visualization that combines sketch-based control with narrative-preserving prompt generation. The system should take as input: (1) a narrative text and (2) a sketch representing the desired scene layout. It will then generate high-quality images that align with both the narrative context and the sketch guidance.

Implementation Requirements

  1. Create a global variable PILOT_MODE with three possible settings: MINI_PILOT, PILOT, or FULL_EXPERIMENT. The experiment should start in MINI_PILOT mode (a configuration scaffold is sketched after this list).

  2. Implement three distinct story visualization systems:
     - Baseline 1 (Text-Only): A system that generates images using only text prompts derived from the narrative.
     - Baseline 2 (Sketch-Only): A system that generates images using only sketch inputs without narrative context.
     - Experimental System (Combined): A system that integrates both sketches and narrative-preserving prompts.

  3. For the narrative-preserving prompt generation component:
     - Use a large language model to analyze narrative structures from plain text input.
     - Implement a "story distillation" process that extracts key scenes and character attributes.
     - Generate layered prompts for both background and foreground elements.

  4. For the sketch processing component:
     - Implement a method to process user-provided sketches as control conditions.
     - Integrate these sketches with the text-to-image model.

  5. For the combined system:
     - Develop an integration method that combines the sketch input with the narrative-preserving prompts.
     - Ensure the system maintains both visual consistency and semantic alignment.
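
The scaffold below shows one way to organize the PILOT_MODE flag and the three required systems; the function names and dispatch table are illustrative, with the generation logic left to the components described above.

```python
# Experiment-level scaffold: PILOT_MODE flag plus one entry point per system.
# Function names and the dispatch table are illustrative.
PILOT_MODE = "MINI_PILOT"  # one of "MINI_PILOT", "PILOT", "FULL_EXPERIMENT"

def generate_text_only(narrative, sketch):
    """Baseline 1: ignore the sketch; prompt a text-to-image model from the narrative."""
    ...

def generate_sketch_only(narrative, sketch):
    """Baseline 2: ignore the narrative; condition generation on the sketch alone."""
    ...

def generate_combined(narrative, sketch):
    """Experimental system: distill the narrative into layered prompts and
    condition generation on both the prompts and the sketch."""
    ...

SYSTEMS = {
    "text_only": generate_text_only,
    "sketch_only": generate_sketch_only,
    "combined": generate_combined,
}
```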

Dataset

Use the FlintstonesSV dataset, which contains story narratives and corresponding images. This dataset is well-suited for testing story visualization methods due to its rich narrative content and diverse character interactions.
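
A hypothetical loading sketch follows. FlintstonesSV ships in its own format; the code assumes the annotations have been exported during preprocessing to a JSON list of {story_id, narrative, image_path} records, which is an assumption about preprocessing rather than the dataset's native layout.

```python
# Hypothetical loader: assumes FlintstonesSV annotations were exported to a JSON
# list of {"story_id", "narrative", "image_path"} records during preprocessing.
import json
from pathlib import Path

def load_stories(annotation_file: str, limit: int | None = None) -> list[dict]:
    """Load story segments, optionally truncating for the pilot modes."""
    records = json.loads(Path(annotation_file).read_text())
    return records[:limit] if limit is not None else records
```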

Evaluation Metrics

Implement the following evaluation metrics (a metric-computation sketch follows this list):

  1. Semantic Alignment:
     - BLEU score (comparing generated image captions to ground truth)
     - CIDEr score (measuring consensus in image descriptions)

  2. Image Quality:
     - FID (Fréchet Inception Distance) score
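
The sketch below assumes generated images are captioned by an upstream captioning model (not shown); pycocoevalcap and torchmetrics are one possible choice of libraries, not prescribed by this plan.

```python
# Metric sketch: BLEU/CIDEr over captions of generated images, FID over pixels.
# pycocoevalcap and torchmetrics are one possible library choice.
import torch
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from torchmetrics.image.fid import FrechetInceptionDistance

def caption_scores(gt_captions: dict, gen_captions: dict) -> dict:
    """Both arguments map segment_id -> list of caption strings."""
    bleu, _ = Bleu(4).compute_score(gt_captions, gen_captions)
    cider, _ = Cider().compute_score(gt_captions, gen_captions)
    return {"BLEU-4": bleu[3], "CIDEr": cider}

def fid_score(real_images: torch.Tensor, fake_images: torch.Tensor) -> float:
    """Both tensors: uint8, shape (N, 3, H, W)."""
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real_images, real=True)
    fid.update(fake_images, real=False)
    return float(fid.compute())
```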

Pilot Modes

MINI_PILOT Mode

Run the complete pipeline on a very small subset of stories as a smoke test, verifying that data loading, all three generation systems, and every metric execute end-to-end without errors.

PILOT Mode

Run on a moderate subset of the dataset, large enough to obtain preliminary metric estimates and to check that the comparison between the three systems behaves sensibly.

FULL_EXPERIMENT Mode

Run on the full evaluation split of FlintstonesSV to produce the final reported results (a hypothetical per-mode configuration sketch follows).
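
The subset sizes below are placeholders, since this plan does not fix exact numbers; they would be set by the experimenter.

```python
# Hypothetical per-mode settings; the subset sizes are placeholders.
MODE_CONFIG = {
    "MINI_PILOT": {"n_stories": 5, "inference_steps": 20},
    "PILOT": {"n_stories": 50, "inference_steps": 30},
    "FULL_EXPERIMENT": {"n_stories": None, "inference_steps": 30},  # None = full split
}
```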

Implementation Steps

  1. Data Preparation:
     - Load and preprocess the FlintstonesSV dataset.
     - Extract story narratives and corresponding ground truth images.
     - For the MINI_PILOT and PILOT modes, select the appropriate subset of data.

  2. Model Implementation:
     - Implement the text-only baseline using a standard text-to-image model.
     - Implement the sketch-only baseline using a sketch-to-image model.
     - Implement the experimental system that combines both approaches.

  3. Evaluation Pipeline:
     - Implement the BLEU and CIDEr score calculations.
     - Implement the FID score calculation.
     - Create a results table that compares all three systems.

  4. Visualization:
     - Create side-by-side visualizations of the generated images from all three systems (see the plotting sketch after this list).
     - Include the original narrative text and sketch input for reference.
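
A plotting sketch for the side-by-side comparison in step 4; the layout and figure sizes are illustrative.

```python
# Side-by-side comparison of the input sketch and the three systems' outputs.
import matplotlib.pyplot as plt

def plot_comparison(sketch, outputs: dict, narrative: str, path: str) -> None:
    """outputs maps system name -> PIL image, e.g. {"text_only": ..., "combined": ...}."""
    panels = [("Input sketch", sketch)] + list(outputs.items())
    fig, axes = plt.subplots(1, len(panels), figsize=(4 * len(panels), 4))
    for ax, (title, img) in zip(axes, panels):
        ax.imshow(img)
        ax.set_title(title)
        ax.axis("off")
    fig.suptitle(narrative, fontsize=9, wrap=True)
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)
```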

Output Requirements

  1. Results File:
     - A CSV file containing all evaluation metrics for each system (see the results-writing sketch after this list).
     - Statistical significance tests comparing the systems.
     - Summary statistics for each metric.

  2. Visualization File:
     - A PDF or HTML file showing sample outputs from each system.
     - Side-by-side comparisons of the three systems for the same story segments.

  3. Log File:
     - Detailed logs of the experiment process.
     - Any errors or warnings encountered.
     - Timing information for each component.
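
A sketch of the results file and significance testing. A paired t-test over per-segment caption metrics is one reasonable choice (FID is corpus-level, so it is reported without a per-segment test), and the column layout is illustrative.

```python
# Results CSV plus a paired significance test on per-segment caption metrics.
# FID is corpus-level and is reported without a per-segment test.
import pandas as pd
from scipy import stats

def write_results(scores: dict, path: str = "results.csv") -> None:
    """scores: {system: {metric: [per-segment values]}}."""
    rows = [
        {"system": system, "metric": metric,
         "mean": float(pd.Series(values).mean()),
         "std": float(pd.Series(values).std())}
        for system, metrics in scores.items()
        for metric, values in metrics.items()
    ]
    pd.DataFrame(rows).to_csv(path, index=False)

def paired_test(scores: dict, metric: str, baseline: str, system: str = "combined"):
    """Paired t-test between a baseline and the combined system on one metric."""
    return stats.ttest_rel(scores[baseline][metric], scores[system][metric])
```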

Please run the experiment in MINI_PILOT mode first, then if everything looks good, proceed to PILOT mode. After the PILOT mode completes successfully, stop and do not run the FULL_EXPERIMENT mode (a human will manually verify the results and make the change to FULL_EXPERIMENT if needed).

End Note:

The source paper is Paper 0: StoryGAN: A Sequential Conditional GAN for Story Visualization (241 citations, 2018). This idea draws upon a trajectory of prior work, as seen in the following sequence: Paper 1 --> Paper 2 --> Paper 3 --> Paper 4. The analysis reveals a progression from story visualization to automatic storyboard and comic generation, with applications in education and storytelling. However, there is a gap in exploring the integration of dynamic elements in story visualization, such as real-time interaction or adaptation to user inputs. Building upon the existing work, a novel research idea could focus on enhancing story visualization by incorporating interactive elements that allow users to influence the narrative flow and visual outcomes, thus addressing the limitations of static story visualization models.
The initial trend observed from the progression of related work highlights a consistent research focus. However, the final hypothesis proposed here is not merely a continuation of that trend — it is the result of a deeper analysis of the hypothesis space. By identifying underlying gaps and reasoning through the connections between works, the idea builds on, but meaningfully diverges from, prior directions to address a more specific challenge.


References

  1. StoryGAN: A Sequential Conditional GAN for Story Visualization (2018)
  2. Neural Storyboard Artist: Visualizing Stories with Coherent Image Sequences (2019)
  3. Automatic Comic Generation with Stylistic Multi-page Layouts and Emotion-driven Text Balloon Generation (2021)
  4. CodeToon: Story Ideation, Auto Comic Generation, and Structure Mapping for Code-Driven Storytelling (2022)
  5. Reference Guide for Teaching Programming with Comics (2022)
  6. Interactive Story Visualization with Multiple Characters (2023)
  7. AutoStory: Generating Diverse Storytelling Images with Minimal Human Effort (2023)
  8. Learning to Model Multimodal Semantic Alignment for Story Visualization (2022)
  9. VisAgent: Narrative-Preserving Story Visualization Framework (2023)
  10. Visual Story Generation Based on Emotion and Keywords (2023)
  11. DiaryPlay: AI-Assisted Authoring of Interactive Vignettes for Everyday Storytelling (2025)
  12. DataNarrative: Automated Data-Driven Storytelling with Visualizations and Texts (2024)