
Machine Heart Report. Authors: Chen Ping, Zhang Qian

How flexible can "PS" be for video? A recent study from Microsoft provides the answer.

In this study, you only need to give the AI a single photo, and it can generate a video of the person in it, with the person's expressions and movements controllable through text. For example, if the command is "open your mouth," the character in the video actually opens their mouth.

If the command is "sad," the character makes sad expressions and head movements.

When the command is "surprise," the avatar's forehead wrinkles.

You can also provide an audio clip so that the avatar's mouth shapes and movements match the speech, or provide a video of a real person for the avatar to imitate.

If you need finer-grained editing of the avatar's movements, such as making it nod, turn, or tilt its head, the technology supports that as well.

The research is called GAIA (Generative AI for Avatar), and its demos have begun to spread on social media. Many people admire the results and hope to use it to digitally "resurrect" the deceased.

But others worry that as these technologies keep evolving, online videos will become harder to tell genuine from fake, or that they will be exploited by criminals for fraud. It seems anti-fraud measures will have to keep being upgraded.

What innovations does GAIA have?

Talking avatar generation aims to synthesize natural videos from speech, with the generated mouth shapes, expressions, and head poses consistent with the speech content. Previous studies achieved high-quality results either through avatar-specific training (i.e., training or fine-tuning a model for each avatar) or by using template videos during inference. More recently, efforts have focused on designing and improving methods for zero-shot talking avatar generation (i.e., using only a single portrait image of the target avatar as an appearance reference). However, these methods reduce the difficulty of the task by adopting domain priors such as warping-based motion representations and 3D Morphable Models (3DMM). Such heuristics, while effective, hinder learning directly from the data distribution and can lead to unnatural results and limited diversity.

In this work, researchers from Microsoft propose GAIA (Generative AI for Avatar), which synthesizes natural talking avatar videos from speech and a single portrait image, eliminating domain priors from the generation process.


Project address: https://microsoft.github.io/GAIA/

Paper address: https://arxiv.org/pdf/2311.15230.pdf

GAIA reveals two key insights:

Speech drives only the avatar's motion, while the avatar's background and appearance remain unchanged throughout the video. Inspired by this, the paper separates the motion and appearance of each frame, where the appearance is shared across frames while the motion is unique to each frame. To predict motion from speech, the paper encodes motion sequences into motion latent sequences and uses a diffusion model conditioned on the input speech to predict the latent sequence;

When a person speaks given content, there is huge diversity in expressions and head poses, which requires a large-scale and diverse dataset. The study therefore collected a high-quality talking avatar dataset consisting of 16K unique speakers of different ages, genders, skin types, and speaking styles, making the generated results natural and diverse.

Based on these two insights, the paper proposes the GAIA framework, which consists of a variational autoencoder (VAE) (the orange module) and a diffusion model (the blue and green modules).


The VAE is mainly used to disentangle motion and appearance. It contains two encoders (a motion encoder and an appearance encoder) and a decoder. During training, the input to the motion encoder is the facial landmarks of the current frame, while the input to the appearance encoder is a randomly sampled frame from the current video clip.

The decoder is then optimized to reconstruct the current frame from the outputs of these two encoders. After the VAE is trained, the motion latents (i.e., the outputs of the motion encoder) are extracted for all training data.
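As a rough illustration of this setup, here is a minimal PyTorch sketch of a disentangling autoencoder with a motion encoder (fed landmarks), an appearance encoder (fed a random frame from the same clip), and a decoder that reconstructs the current frame. The layer sizes, landmark count, and image resolution are placeholder assumptions, not the paper's actual architecture.

```python
# Minimal sketch (not the authors' code): an autoencoder that separates
# per-frame motion (from facial landmarks) and clip-level appearance
# (from a randomly sampled frame), as described above.
import torch
import torch.nn as nn

class MotionEncoder(nn.Module):
    """Encodes the current frame's facial landmarks into a motion latent."""
    def __init__(self, n_landmarks=68, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_landmarks * 2, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, landmarks):              # (B, 68, 2)
        return self.net(landmarks.flatten(1))  # (B, latent_dim)

class AppearanceEncoder(nn.Module):
    """Encodes a randomly sampled frame of the same clip into an appearance latent."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, frame):  # (B, 3, H, W)
        return self.net(frame)

class Decoder(nn.Module):
    """Reconstructs the current frame from the motion and appearance latents."""
    def __init__(self, motion_dim=128, app_dim=256, out_size=64):
        super().__init__()
        self.out_size = out_size
        self.net = nn.Sequential(
            nn.Linear(motion_dim + app_dim, 3 * out_size * out_size), nn.Sigmoid(),
        )

    def forward(self, z_motion, z_app):
        x = self.net(torch.cat([z_motion, z_app], dim=-1))
        return x.view(-1, 3, self.out_size, self.out_size)

# One training step: reconstruct the current frame from its own landmarks plus a
# reference frame sampled elsewhere in the clip (appearance shared, motion per-frame).
motion_enc, app_enc, dec = MotionEncoder(), AppearanceEncoder(), Decoder()
landmarks = torch.randn(4, 68, 2)     # current-frame landmarks (dummy data)
ref_frame = torch.rand(4, 3, 64, 64)  # random frame from the same clip
cur_frame = torch.rand(4, 3, 64, 64)  # frame to reconstruct
recon = dec(motion_enc(landmarks), app_enc(ref_frame))
loss = nn.functional.mse_loss(recon, cur_frame)
loss.backward()
```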

Next, a diffusion model is trained to predict motion latent sequences conditioned on the speech and on a randomly sampled frame from the video clip, which provides appearance information to the generation process.
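The sketch below illustrates one training step of such a conditional diffusion model using a standard DDPM-style noise-prediction objective. The denoiser is a stand-in MLP and all feature dimensions are assumed for illustration; they are not taken from the paper.

```python
# Minimal sketch (assumed shapes, a toy MLP denoiser, and a standard
# DDPM noise-prediction loss; not the paper's architecture).
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class MotionDenoiser(nn.Module):
    """Predicts the noise added to a motion-latent sequence, conditioned on
    speech features and an appearance latent from a sampled reference frame."""
    def __init__(self, motion_dim=128, speech_dim=80, app_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + speech_dim + app_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, motion_dim),
        )

    def forward(self, noisy_motion, speech, z_app, t):
        # noisy_motion: (B, L, motion_dim), speech: (B, L, speech_dim), z_app: (B, app_dim)
        B, L, _ = noisy_motion.shape
        cond = torch.cat([
            noisy_motion,
            speech,
            z_app.unsqueeze(1).expand(B, L, -1),
            t.float().view(B, 1, 1).expand(B, L, 1) / T,  # normalized timestep
        ], dim=-1)
        return self.net(cond)

model = MotionDenoiser()
motion = torch.randn(4, 32, 128)  # clean motion latents from the trained VAE
speech = torch.randn(4, 32, 80)   # aligned speech features (e.g. mel frames)
z_app  = torch.randn(4, 256)      # appearance latent of a randomly sampled frame

t = torch.randint(0, T, (4,))
a = alphas_cumprod[t].view(4, 1, 1)
noise = torch.randn_like(motion)
noisy = a.sqrt() * motion + (1 - a).sqrt() * noise  # forward diffusion
loss = nn.functional.mse_loss(model(noisy, speech, z_app, t), noise)
loss.backward()
```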

At inference time, given a reference portrait image of the target avatar, the diffusion model takes the image and the input speech sequence as conditions and generates a motion latent sequence that matches the speech content. The generated motion latent sequence and the reference portrait image are then passed through the VAE decoder to synthesize the talking video output.
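A minimal sketch of this inference flow, assuming stand-in modules for the trained components and a plain DDPM sampling loop (the paper's actual sampler and architectures may differ):

```python
# Minimal inference sketch: sample a motion-latent sequence from noise,
# conditioned on speech and the appearance of a single portrait, then
# decode it frame by frame. All modules below are simple stand-ins.
import torch
import torch.nn as nn

T = 50  # number of reverse-diffusion steps (illustrative)
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

# Stand-ins for the trained appearance encoder, motion denoiser, and VAE decoder.
app_enc  = nn.Linear(3 * 64 * 64, 256)
denoiser = nn.Linear(128 + 80 + 256, 128)
decoder  = nn.Linear(128 + 256, 3 * 64 * 64)

portrait = torch.rand(1, 3, 64, 64)      # single reference portrait of the target avatar
speech   = torch.randn(1, 32, 80)        # speech features for 32 frames
z_app    = app_enc(portrait.flatten(1))  # shared appearance latent

# Reverse diffusion: start from noise and iteratively denoise the motion-latent sequence.
motion = torch.randn(1, 32, 128)
for t in reversed(range(T)):
    cond = torch.cat([motion, speech, z_app.unsqueeze(1).expand(1, 32, -1)], dim=-1)
    eps = denoiser(cond)  # predicted noise
    a_t, ac_t = alphas[t], alphas_cumprod[t]
    motion = (motion - (1 - a_t) / (1 - ac_t).sqrt() * eps) / a_t.sqrt()
    if t > 0:
        motion = motion + betas[t].sqrt() * torch.randn_like(motion)

# Decode each motion latent together with the shared appearance latent into a frame.
frames = decoder(torch.cat([motion, z_app.unsqueeze(1).expand(1, 32, -1)], dim=-1))
video = frames.view(1, 32, 3, 64, 64)  # (batch, frames, C, H, W)
```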

On the data side, the study assembled datasets from different sources, including the High-Definition Talking Face Dataset (HDTF) and the Casual Conversations datasets v1 and v2 (CC v1 & v2). In addition to these three datasets, the study also collected a large-scale internal talking avatar dataset containing 7K hours of video and 8K speaker IDs. An overview of the dataset statistics is shown in Table 1.


To learn the required information from the data, the paper also proposes several automatic filtering strategies to ensure the quality of the training data (a code sketch follows the list):

To keep lip movements visible, the avatar should be facing the camera;

To ensure stability, facial movements in the video should be smooth and should not shake rapidly;

To filter out extreme cases where lip movements are inconsistent with the speech, frames in which the avatar is wearing a mask or remains silent should be removed.
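The following is a minimal sketch of how such per-frame filters could be applied. The thresholds and per-frame attributes (head yaw, landmark velocity, mask/silence flags) are hypothetical illustrations, not values from the paper.

```python
# Hypothetical sketch of the three filtering heuristics; thresholds and
# per-frame annotations are assumptions for illustration only.
from dataclasses import dataclass
from typing import List

@dataclass
class Frame:
    head_yaw_deg: float       # |yaw| near 0 means the avatar faces the camera
    landmark_velocity: float  # mean landmark displacement vs. previous frame (px)
    wearing_mask: bool
    is_silent: bool

def keep_frame(f: Frame,
               max_yaw_deg: float = 30.0,
               max_velocity_px: float = 15.0) -> bool:
    """Apply the three filters: frontal face, smooth motion, lips visible and speaking."""
    facing_camera = abs(f.head_yaw_deg) <= max_yaw_deg      # filter 1
    stable_motion = f.landmark_velocity <= max_velocity_px  # filter 2
    lips_usable   = not (f.wearing_mask or f.is_silent)     # filter 3
    return facing_camera and stable_motion and lips_usable

def filter_clip(frames: List[Frame]) -> List[Frame]:
    return [f for f in frames if keep_frame(f)]

# Toy usage: only the first frame passes all three filters.
clip = [Frame(5.0, 2.0, False, False), Frame(70.0, 1.0, False, False),
        Frame(3.0, 40.0, False, False), Frame(2.0, 1.5, True, False)]
print(len(filter_clip(clip)))  # -> 1
```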

The VAE and diffusion models are trained on the filtered data. From the experimental results, the paper draws three key conclusions:

GAIA can generate zero-shot talking avatars with superior naturalness, diversity, lip-sync quality, and visual quality. In the researchers' subjective evaluation, GAIA significantly surpasses all baseline methods;

Trained model sizes range from 150M to 2B parameters, and the results show that GAIA is scalable, with larger models yielding better results;

GAIA is a general and flexible framework that enables different applications, including controllable talking avatar generation and text-instructed avatar generation.

How well does GAIA work?

In the experiments, the study compared GAIA against three strong baselines: FOMM, HeadGAN, and Face-vid2vid. The results in Table 2 show that the VAE in GAIA achieves consistent improvements over previous video-driven baselines, demonstrating that GAIA successfully disentangles appearance and motion representations.

Speech-driven results. Speech-driven talking avatar generation is achieved by predicting motion from speech. Table 3 and Figure 2 provide quantitative and qualitative comparisons of GAIA with the MakeItTalk, Audio2Head, and SadTalker methods.

As can be seen, GAIA significantly surpasses all baselines in the subjective evaluation. More specifically, as shown in Figure 2, the baseline methods' outputs often depend heavily on the reference image, for example when it shows closed eyes or an unusual head pose. In contrast, GAIA is robust to various reference images and produces results with greater naturalness, higher lip-sync quality, better visual quality, and richer motion diversity.


As shown in Table 3, the best MSI scores indicate that videos generated by GAIA have excellent motion stability. The Sync-D score of 8.528 is close to that of real video (8.548), indicating excellent lip synchronization in the generated videos. GAIA achieves FID scores comparable to the baselines; FID may be affected by differing head poses, since the study found that a model trained without diffusion achieved better FID scores, as shown in Table 6.
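For reference, FID scores like those in Table 3 are typically computed by comparing Inception features of generated and real frames. Below is a hedged sketch using the off-the-shelf torchmetrics implementation (which requires the torch-fidelity backend); this is not necessarily the evaluation code used in the study.

```python
# Sketch: computing an FID score over real vs. generated frames with
# torchmetrics. The random tensors below are dummy stand-ins for frames.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)

# uint8 images in (N, 3, H, W); in practice these would be frames sampled
# from real videos and from the generated videos.
real_frames = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)
fake_frames = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)

fid.update(real_frames, real=True)
fid.update(fake_frames, real=False)
print(float(fid.compute()))  # lower is better; sensitive to pose/appearance mismatch
```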
