TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models

Ni H, Egger B, Lohit S, Cherian A, Wang Y, Koike-Akino T, Huang SX, Marks TK (2024)


Publication Type: Conference contribution

Publication year: 2024

Publisher: IEEE Computer Society

Pages Range: 9015-9025

Conference Proceedings Title: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition

Event location: Seattle, WA, USA

DOI: 10.1109/CVPR52733.2024.00861

Abstract

Text-conditioned image-to-video generation (TI2V) aims to synthesize a realistic video starting from a given image (e.g., a woman's photo) and a text description (e.g., 'a woman is drinking water.'). Existing TI2V frameworks often require costly training on video-text datasets and specific model designs for text and image conditioning. In this paper, we propose TI2V-Zero, a zero-shot, tuning-free method that empowers a pretrained text-to-video (T2V) diffusion model to be conditioned on a provided image, enabling TI2V generation without any optimization, fine-tuning, or introducing external modules. Our approach leverages a pretrained T2V diffusion foundation model as the generative prior. To guide video generation with the additional image input, we propose a 'repeat-and-slide' strategy that modulates the reverse denoising process, allowing the frozen diffusion model to synthesize a video frame-by-frame starting from the provided image. To ensure temporal continuity, we employ a DDPM inversion strategy to initialize Gaussian noise for each newly synthesized frame and a resampling technique to help preserve visual details. We conduct comprehensive experiments on both domain-specific and open-domain datasets, where TI2V-Zero consistently outperforms a recent open-domain TI2V model. Furthermore, we show that TI2V-Zero can seamlessly extend to other tasks such as video infilling and prediction when provided with more images. Its autoregressive design also supports long video generation.
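The abstract describes the repeat-and-slide procedure only at a high level. The following minimal Python sketch illustrates how such an autoregressive conditioning loop could look against a toy NumPy denoiser; the window size, noise schedule, frame counts, and helper names (toy_denoiser, forward_noise, generate_next_frame) are assumptions made for illustration and do not reflect the authors' implementation. Text-prompt conditioning and the paper's resampling step are omitted for brevity.

# Illustrative sketch of a "repeat-and-slide" style autoregressive loop.
# The denoiser below is a toy placeholder, NOT the pretrained T2V diffusion
# model used in the paper; all shapes and constants are assumptions.
import numpy as np

T = 50                      # number of diffusion timesteps (assumed)
WINDOW = 8                  # frames seen by the frozen model at once (assumed)
FRAME_SHAPE = (3, 64, 64)   # toy frame resolution (assumed)

betas = np.linspace(1e-4, 0.02, T)
alphas_cum = np.cumprod(1.0 - betas)

def toy_denoiser(noisy_window, t):
    """Placeholder for the frozen text-to-video denoising network.
    It should return a less-noisy estimate of the whole frame window."""
    return noisy_window * 0.99  # stand-in: shrink slightly toward zero

def forward_noise(x0, t):
    """DDPM forward process q(x_t | x_0): used both to re-noise known frames
    and to initialize the latent of the new frame (the DDPM inversion idea)."""
    eps = np.random.randn(*x0.shape)
    return np.sqrt(alphas_cum[t]) * x0 + np.sqrt(1.0 - alphas_cum[t]) * eps

def generate_next_frame(window_frames):
    """Denoise a window whose first WINDOW-1 frames are known and whose last
    frame is new; known frames are re-injected (inpainting-style) each step."""
    known = np.stack(window_frames)                  # (WINDOW-1, C, H, W)
    # Initialize the new frame from a noised copy of the most recent frame
    # rather than from pure Gaussian noise.
    new_latent = forward_noise(known[-1], T - 1)
    x = np.concatenate([forward_noise(known, T - 1), new_latent[None]], axis=0)
    for t in reversed(range(T)):
        x = toy_denoiser(x, t)
        if t > 0:
            # Clamp the known frames to their correctly-noised versions.
            x[:-1] = forward_noise(known, t - 1)
    return x[-1]

# Autoregressive generation: repeat the input image to fill the window,
# then slide the window forward one synthesized frame at a time.
input_image = np.random.randn(*FRAME_SHAPE)  # stands in for the provided photo
window = [input_image] * (WINDOW - 1)
video = [input_image]
for _ in range(16):                          # number of frames to synthesize
    frame = generate_next_frame(window)
    video.append(frame)
    window = window[1:] + [frame]            # slide: drop oldest, append newest
print(len(video), video[0].shape)

The key design point the sketch tries to capture is that the generative model itself is never modified: conditioning on the image is achieved purely by controlling what the reverse denoising process sees in its frame window at each step.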

How to cite

APA:

Ni, H., Egger, B., Lohit, S., Cherian, A., Wang, Y., Koike-Akino, T., ... Marks, T. K. (2024). TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 9015-9025). Seattle, WA, USA: IEEE Computer Society.

MLA:

Ni, Haomiao, et al. "TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models." Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, IEEE Computer Society, 2024, pp. 9015-9025.
