Abstract

We address the task of generating temporally consistent and physically plausible images of actions and object state transformations. Given an input image and a text prompt describing the targeted transformation, our generated images preserve the environment and transform objects in the initial image. Our contributions are threefold. First, we leverage a large body of instructional videos and automatically mine a dataset of triplets of consecutive frames corresponding to initial object states, actions, and resulting object transformations. Second, equipped with this data, we develop and train a conditioned diffusion model dubbed GenHowTo. Third, we evaluate GenHowTo on a variety of objects and actions and show superior performance compared to existing methods. In particular, we introduce a quantitative evaluation where GenHowTo achieves 88% and 74% on seen and unseen interaction categories, respectively, outperforming prior work by a large margin.
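As a rough illustration of the data described above, the following Python sketch shows one way a mined triplet could be represented and turned into image-conditioned examples of the form (input image, text prompt, target image). It is not the authors' implementation: the names `FrameTriplet` and `to_conditioning_examples`, the file paths, and the presence of a separate text prompt for the action and for the final state are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FrameTriplet:
    """Hypothetical container for one triplet of frames mined from a video."""
    initial_frame: str  # path to the frame showing the initial object state
    action_frame: str   # path to the frame showing the action
    final_frame: str    # path to the frame showing the resulting object state
    action_prompt: str  # text describing the action (assumed to be available)
    final_prompt: str   # text describing the final state (assumed to be available)


def to_conditioning_examples(triplet: FrameTriplet) -> List[Tuple[str, str, str]]:
    """Build (input image, prompt, target image) examples from one triplet.

    Both examples use the initial-state frame as the input image, so the
    target shows the same environment with the object transformed.
    """
    return [
        (triplet.initial_frame, triplet.action_prompt, triplet.action_frame),
        (triplet.initial_frame, triplet.final_prompt, triplet.final_frame),
    ]


# Illustrative values only; these paths and prompts are made up.
example = FrameTriplet(
    initial_frame="video_0001/initial.jpg",
    action_frame="video_0001/action.jpg",
    final_frame="video_0001/final.jpg",
    action_prompt="a person slicing an avocado",
    final_prompt="a sliced avocado on a cutting board",
)
for input_image, prompt, target_image in to_conditioning_examples(example):
    print(input_image, "|", prompt, "->", target_image)
```

Keeping the initial frame as the shared input in both examples mirrors the stated goal of preserving the environment while changing only the object.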

CVPR 2024 Presentation

Check out our CVPR 2024 poster here!

Method overview

[Figure: GenHowTo example outputs]

Given an image of an initial scene (red) and text prompts (bold), GenHowTo generates images corresponding to the action (blue) and to the final state once the action is completed (yellow). GenHowTo is learned from instructional videos and can generate new images of both seen and previously unseen object transformations. Importantly, GenHowTo learns to preserve the rest of the scene, so that the action is shown in the same environment, in the spirit of HowTo examples, while introducing important objects (e.g., the hand and knife in the first example) and transforming the object according to the prompt.

Citation

@inproceedings{soucek2024genhowto,
    title={GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos},
    author={Sou\v{c}ek, Tom\'{a}\v{s} and Damen, Dima and Wray, Michael and Laptev, Ivan and Sivic, Josef},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month={June},
    year={2024}
}

Acknowledgements

This work was partly supported by the EU Horizon Europe Programme under the project EXA4MIND (No. 101092944) and the Ministry of Education, Youth and Sports of the Czech Republic through the e-INFRA CZ (ID:90140). Part of this work was done within the University of Bristol’s Machine Learning and Computer Vision (MaVi) Summer Research Program 2023. Research at the University of Bristol is supported by EPSRC UMPIRE (EP/T004991/1) and EPSRC PG Visual AI (EP/T028572/1).