Phenaki: an AI model for generating videos from text prompts, capable of multi-minute videos with changing prompts. Read the paper now.
Video synthesis is a challenging task in artificial intelligence due to the complex nature of videos and the limited amount of high-quality data available. Phenaki is a new model developed by researchers at Google Research that is capable of generating videos from textual prompts. This article will explore the capabilities of Phenaki, the challenges it overcomes, and how it works.
Phenaki uses a new causal model for learning video representations that compresses each video into a small set of discrete tokens. The tokenizer uses causal attention in time, which allows it to work with variable-length videos. As a result, the model can generate videos of arbitrary length, conditioned on a sequence of open-domain prompts. This approach is a significant departure from existing video generation methods that operate on a per-frame basis.
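To make "causal attention in time" concrete, the sketch below builds an attention mask in which tokens belonging to frame t may attend to all tokens in frames 0 through t, but never to later frames. This is an illustrative helper, not Phenaki's actual implementation; the function name and layout are assumptions.

```python
import numpy as np

def causal_frame_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Attention mask that is causal across frames but full within a frame.

    mask[i, j] is True when token i may attend to token j. Because frame t
    never looks at frames after t, the same tokenizer handles videos of any
    length (illustrative sketch, not Phenaki's real code).
    """
    # Assign each token the index of the frame it belongs to.
    frame_ids = np.repeat(np.arange(num_frames), tokens_per_frame)
    # Token i may attend to token j iff j's frame is not in the future.
    return frame_ids[:, None] >= frame_ids[None, :]

mask = causal_frame_mask(num_frames=3, tokens_per_frame=2)
```

Because the mask only ever looks backward in time, the tokenizer can encode the first frames of a video without knowing how many frames will follow, which is what makes variable-length and continually extended videos possible.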
One of the major challenges in video synthesis is the limited amount of high-quality data available. To overcome this challenge, Phenaki uses joint training on a large corpus of image-text pairs as well as a smaller number of video-text examples. This approach allows the model to generalize beyond what is available in the video datasets.
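One common way to realize this kind of joint training is to treat a still image as a one-frame video, so image-text and video-text examples flow through the same tokenizer and training loop. The sketch below illustrates that idea; the function names and batch format are assumptions for illustration, not Phenaki's actual training code.

```python
import numpy as np

def as_video(sample: np.ndarray) -> np.ndarray:
    """Treat a still image (H, W, C) as a single-frame video (1, H, W, C),
    so images and videos can share one model and one training loop."""
    if sample.ndim == 3:          # a still image
        return sample[None, ...]  # add a time axis of length 1
    return sample                 # already (T, H, W, C)

def mixed_batch(images, videos):
    """Combine image-text and video-text examples into one training batch."""
    return [as_video(x) for x in images] + [as_video(v) for v in videos]

batch = mixed_batch(
    images=[np.zeros((64, 64, 3))],        # abundant image data
    videos=[np.zeros((16, 64, 64, 3))],    # scarcer video data
)
```

The payoff is that the huge supply of captioned images teaches the model visual concepts (objects, styles, compositions) that the much smaller video datasets never cover, while the video examples teach motion.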
Phenaki can generate realistic videos that are conditioned on a sequence of prompts. The prompts used in the "water is magical" video are "A photorealistic teddy bear is swimming in the ocean at San Francisco," "The teddy bear goes underwater," "The teddy bear keeps swimming under the water with colorful fishes," and "A panda bear is swimming underwater." The resulting video is a magical and whimsical realization of the prompts.
Another example of Phenaki's capabilities is the "chilling on the beach" video. The prompts used in this video are "A teddy bear diving in the ocean," "A teddy bear emerges from the water," "A teddy bear walks on the beach," and "Camera zooms out to the teddy bear in the campfire by the beach." The resulting video is a beautiful depiction of a day at the beach, from the crashing waves to the crackling campfire.
Phenaki can also handle more complex, multi-part prompts, as demonstrated by the "fireworks on the spacewalk" video. The prompts used in this video are "Side view of an astronaut is walking through a puddle on Mars," "The astronaut is dancing on Mars," "The astronaut walks his dog on Mars," and "The astronaut and his dog watch fireworks." The resulting video is an incredible depiction of an astronaut's life on Mars, complete with fireworks and a pet dog.
Phenaki is a groundbreaking model that enables the generation of realistic videos from textual prompts. This is a highly challenging task due to the computational complexity involved, the limited availability of high-quality text-video data, and the variable length of videos. To address these issues, Phenaki uses a new causal model that compresses videos into a small representation of discrete tokens, which enables it to work with variable-length videos. The model uses a bidirectional masked transformer conditioned on pre-computed text tokens to generate video tokens from text, which are then de-tokenized to create the actual video.
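The decoding step above follows the general pattern of masked-token generation: start with every video token masked, let the transformer propose tokens everywhere, and keep only the most confident predictions each round until all positions are filled. The toy sketch below illustrates that loop; `toy_predict` stands in for the real text-conditioned transformer, and the linear schedule and names are illustrative assumptions, not Phenaki's exact procedure.

```python
import numpy as np

MASK = -1  # sentinel id for a still-masked video-token position

def iterative_unmask(predict, num_tokens, steps):
    """Masked-decoding sketch: repeatedly let the model propose tokens and
    fix only the most confident ones, until every position is filled."""
    tokens = np.full(num_tokens, MASK, dtype=int)
    for step in range(steps):
        ids, conf = predict(tokens)
        # Only still-masked positions compete in this round.
        conf = np.where(tokens == MASK, conf, -np.inf)
        # Linear schedule: how many tokens are fixed after this step.
        target = num_tokens * (step + 1) // steps
        n_new = target - int(np.sum(tokens != MASK))
        for pos in np.argsort(-conf)[:n_new]:
            tokens[pos] = ids[pos]
    return tokens

def toy_predict(tokens):
    """Stand-in for the masked transformer: returns token ids + confidences."""
    ids = np.arange(len(tokens)) % 512          # pretend codebook of 512 entries
    conf = np.linspace(1.0, 0.0, len(tokens))   # earlier positions more confident
    return ids, conf

video_tokens = iterative_unmask(toy_predict, num_tokens=8, steps=4)
```

Filling many tokens in parallel per round is what makes this approach much faster than generating one token at a time; the completed token array would then be handed to the de-tokenizer to render the actual frames.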
Phenaki has several advantages over previous video generation methods. It can generate videos of arbitrary length conditioned on a sequence of prompts, allowing for the creation of time-variable stories. Additionally, joint training on a large corpus of image-text pairs and a smaller number of video-text examples enables the model to generalize beyond the available video datasets. The proposed video encoder-decoder outperforms all per-frame baselines used in the literature in terms of spatio-temporal quality and number of tokens per video.
Overall, Phenaki represents a significant advance in the field of video generation from text prompts. It has the potential to transform many industries, including entertainment, advertising, and education. The ability to generate realistic videos from text opens up a range of possibilities for content creation that were previously out of reach.