Visual Story Generation

Course Project for 11777 Multimodal Machine Learning

[Full Writeup]

Partners: Nevan Giuliani, Alex Lyons, Hyunwoo (Shawn) Park

Idea

Given an initial text prompt and image, can we generate a story with new story prompts and images? Ideally, the story prompts are not just descriptions of each image but form a connected narrative, and the images should all relate to one another.

Datasets

We use the Visual Storytelling dataset for training/evaluation.

Implementation Details

Previous work like IntelligentGrimm takes in a full set of story prompts and generates a set of images using a diffusion model, where each image is conditioned on the previous images. We improve on this in two ways:

  • we predict captions along with each image, so that only an initial caption is required
  • we separate the prediction of story captions from the descriptive captions fed to the diffusion model, so that the story captions do not have to describe each subsequent image in detail
Model diagram: We finetune a multimodal LLaVA model to predict a descriptive text caption and a story text caption separately, conditioned on the previous images and captions. We finetune a diffusion model to predict the next frame in the story, conditioned on the previous image and captions.
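
To make the inference flow concrete, here is a minimal sketch of the generation loop. It assumes hypothetical wrappers `caption_model` (the finetuned LLaVA model) and `frame_model` (the finetuned diffusion model) with illustrative `predict_next` and `generate` methods; the names and signatures are assumptions for exposition, not the project's actual code.

```python
from dataclasses import dataclass, field

@dataclass
class StoryState:
    images: list = field(default_factory=list)          # generated frames so far
    story_captions: list = field(default_factory=list)  # narrative text shown to the reader
    desc_captions: list = field(default_factory=list)   # literal descriptions fed to the diffusion model

def generate_story(caption_model, frame_model, first_caption, first_image, num_frames=5):
    """Roll the pipeline forward one frame at a time from a single seed caption and image."""
    state = StoryState(
        images=[first_image],
        story_captions=[first_caption],
        desc_captions=[first_caption],
    )
    for _ in range(num_frames - 1):
        # The finetuned LLaVA model predicts the next story caption and the next
        # descriptive caption, conditioned on all previous images and captions.
        story_cap, desc_cap = caption_model.predict_next(
            images=state.images,
            story_captions=state.story_captions,
            desc_captions=state.desc_captions,
        )
        # The finetuned diffusion model renders the next frame from the descriptive
        # caption, conditioned on the previous image for visual consistency.
        next_image = frame_model.generate(prompt=desc_cap, previous_image=state.images[-1])
        state.story_captions.append(story_cap)
        state.desc_captions.append(desc_cap)
        state.images.append(next_image)
    return state
```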

We finetune on images, descriptive captions, and story captions from the Visual Storytelling dataset.
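
For concreteness, each story in the dataset can be unrolled into next-step prediction examples, one per frame after the first. The sketch below assumes the story is already loaded as an ordered list of dicts with hypothetical `image_path`, `desc_caption`, and `story_caption` fields; this is an illustrative layout, not the dataset's actual schema.

```python
def make_training_examples(story):
    """Turn one ordered story into next-step prediction examples."""
    examples = []
    for t in range(1, len(story)):
        context = story[:t]  # frames 0..t-1 condition the prediction
        target = story[t]    # frame t is what the models learn to produce
        examples.append({
            "context_images": [f["image_path"] for f in context],
            "context_story_captions": [f["story_caption"] for f in context],
            "context_desc_captions": [f["desc_caption"] for f in context],
            "target_story_caption": target["story_caption"],
            "target_desc_caption": target["desc_caption"],
            "target_image": target["image_path"],
        })
    return examples
```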

Results

Baselines + our method. Notice how baselines that do not condition on previous images lack image-to-image consistency, and the IntelligentGrimm model fails at realism (as it was trained on more cartoonish data).
Some final stories and descriptive text generated by our pipeline.