TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models

Ni H, Egger B, Lohit S, Cherian A, Wang Y, Koike-Akino T, Huang SX, Marks TK (2024)


Publication Type: Conference contribution

Publication year: 2024

Publisher: IEEE Computer Society

Pages Range: 9015-9025

Conference Proceedings Title: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition

Event location: Seattle, WA, USA

DOI: 10.1109/CVPR52733.2024.00861

Abstract

Text-conditioned image-to-video generation (TI2V) aims to synthesize a realistic video starting from a given image (e.g., a woman's photo) and a text description (e.g., 'a woman is drinking water.'). Existing TI2V frameworks often require costly training on video-text datasets and specific model designs for text and image conditioning. In this paper, we propose TI2V-Zero, a zero-shot, tuning-free method that empowers a pretrained text-to-video (T2V) diffusion model to be conditioned on a provided image, enabling TI2V generation without any optimization, fine-tuning, or introducing external modules. Our approach leverages a pretrained T2V diffusion foundation model as the generative prior. To guide video generation with the additional image input, we propose a 'repeat-and-slide' strategy that modulates the reverse denoising process, allowing the frozen diffusion model to synthesize a video frame-by-frame starting from the provided image. To ensure temporal continuity, we employ a DDPM inversion strategy to initialize Gaussian noise for each newly synthesized frame and a resampling technique to help preserve visual details. We conduct comprehensive experiments on both domain-specific and open-domain datasets, where TI2V-Zero consistently outperforms a recent open-domain TI2V model. Furthermore, we show that TI2V-Zero can seamlessly extend to other tasks such as video infilling and prediction when provided with more images. Its autoregressive design also supports long video generation.
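The abstract describes the repeat-and-slide procedure only at a high level. The following minimal Python sketch illustrates how such an autoregressive conditioning loop could look against a toy NumPy denoiser; the window size, noise schedule, frame counts, and helper names (toy_denoiser, forward_noise, generate_next_frame) are assumptions made for illustration and do not reflect the authors' implementation. Text-prompt conditioning and the paper's resampling step are omitted for brevity.

# Illustrative sketch of a "repeat-and-slide" style autoregressive loop.
# The denoiser below is a toy placeholder, NOT the pretrained T2V diffusion
# model used in the paper; all shapes and constants are assumptions.
import numpy as np

T = 50                      # number of diffusion timesteps (assumed)
WINDOW = 8                  # frames seen by the frozen model at once (assumed)
FRAME_SHAPE = (3, 64, 64)   # toy frame resolution (assumed)

betas = np.linspace(1e-4, 0.02, T)
alphas_cum = np.cumprod(1.0 - betas)

def toy_denoiser(noisy_window, t):
    """Placeholder for the frozen text-to-video denoising network.
    It should return a less-noisy estimate of the whole frame window."""
    return noisy_window * 0.99  # stand-in: shrink slightly toward zero

def forward_noise(x0, t):
    """DDPM forward process q(x_t | x_0): used both to re-noise known frames
    and to initialize the latent of the new frame (the DDPM inversion idea)."""
    eps = np.random.randn(*x0.shape)
    return np.sqrt(alphas_cum[t]) * x0 + np.sqrt(1.0 - alphas_cum[t]) * eps

def generate_next_frame(window_frames):
    """Denoise a window whose first WINDOW-1 frames are known and whose last
    frame is new; known frames are re-injected (inpainting-style) each step."""
    known = np.stack(window_frames)                  # (WINDOW-1, C, H, W)
    # Initialize the new frame from a noised copy of the most recent frame
    # rather than from pure Gaussian noise.
    new_latent = forward_noise(known[-1], T - 1)
    x = np.concatenate([forward_noise(known, T - 1), new_latent[None]], axis=0)
    for t in reversed(range(T)):
        x = toy_denoiser(x, t)
        if t > 0:
            # Clamp the known frames to their correctly-noised versions.
            x[:-1] = forward_noise(known, t - 1)
    return x[-1]

# Autoregressive generation: repeat the input image to fill the window,
# then slide the window forward one synthesized frame at a time.
input_image = np.random.randn(*FRAME_SHAPE)  # stands in for the provided photo
window = [input_image] * (WINDOW - 1)
video = [input_image]
for _ in range(16):                          # number of frames to synthesize
    frame = generate_next_frame(window)
    video.append(frame)
    window = window[1:] + [frame]            # slide: drop oldest, append newest
print(len(video), video[0].shape)

The key design point the sketch tries to capture is that the generative model itself is never modified: conditioning on the image is achieved purely by controlling what the reverse denoising process sees in its frame window at each step.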

How to cite

APA:

Ni, H., Egger, B., Lohit, S., Cherian, A., Wang, Y., Koike-Akino, T., ... Marks, T. K. (2024). TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 9015-9025). Seattle, WA, USA: IEEE Computer Society.

MLA:

Ni, Haomiao, et al. "TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models." Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, IEEE Computer Society, 2024, pp. 9015-9025.
