Visual Story Generation
Course Project for 11777 Multimodal Machine Learning
[Full Writeup]
Partners: Nevan Giuliani, Alex Lyons, Hyunwoo (Shawn) Park
Idea
Given an initial text prompt and image, can we generate a story consisting of new story prompts and images? Ideally, the story prompts are not just descriptions of each image but form a connected narrative, and the images should be visually consistent with one another.
Datasets
We use the Visual Storytelling dataset for training/evaluation.
Implementation Details
Previous work like IntelligentGrimm takes a full set of story prompts as input and generates a set of images with a diffusion model, where each image is conditioned on the previous images. We improve on this in two ways:
- we predict captions along with each image, so that only an initial caption is required
- we separate the task of predicting story captions from the task of predicting the descriptive captions fed into the diffusion model, so that the story captions do not have to describe each image in detail
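The two improvements above can be sketched as an iterative generation loop. This is a minimal illustration, not the project's actual code: all function names below are hypothetical stand-ins for the finetuned captioning models and the diffusion model.

```python
# Hypothetical sketch of the generation loop described above.
# Each helper is a placeholder for a learned model, not a real API.

def predict_story_caption(history):
    # Stand-in for the story-caption model: given the (caption, image)
    # pairs so far, produce the next narrative sentence.
    return f"story sentence {len(history) + 1}"

def predict_descriptive_caption(story_caption):
    # Stand-in for the model that converts a narrative sentence into a
    # literal scene description suitable as a diffusion prompt.
    return f"a scene depicting: {story_caption}"

def generate_image(descriptive_caption, previous_images):
    # Stand-in for the diffusion model, conditioned on prior frames
    # so the images relate to each other.
    return f"image for '{descriptive_caption}'"

def generate_story(initial_caption, initial_image, num_frames=5):
    """Starting from one caption and image, roll out a full story."""
    story_captions = [initial_caption]
    images = [initial_image]
    for _ in range(num_frames - 1):
        story = predict_story_caption(list(zip(story_captions, images)))
        desc = predict_descriptive_caption(story)
        images.append(generate_image(desc, images))
        story_captions.append(story)
    return story_captions, images
```

The key point of the sketch is that only the initial caption and image are supplied by the user; every later story caption is predicted, and the diffusion model sees a separate descriptive caption rather than the narrative text.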
We finetune on images, descriptive captions, and story captions from the Visual Storytelling dataset.
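To make the distinction between the two caption types concrete, one finetuning example might look like the following. The field names and values are illustrative assumptions, not the Visual Storytelling dataset's actual schema.

```python
# Hypothetical shape of one finetuning example (illustrative only):
# the descriptive caption states what is literally in the frame,
# while the story caption carries the narrative.
example = {
    "image": "frame_2.jpg",
    "descriptive_caption": "a man holding a red balloon in a park",
    "story_caption": "He finally found the gift he had been searching for.",
}
```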