TL;DR#
Real-time interactive video-chat portraits have gained traction. Existing methods primarily focus on generating head movements, often struggling with synchronized body motions and fine-grained control over facial expressions. To address these issues, this paper presents a framework for stylized real-time portrait video generation, enabling flexible video chat that extends to upper-body interactions.
The approach involves efficient hierarchical motion diffusion models that account for both explicit and implicit motion representations based on audio inputs, generating diverse facial expressions and synchronized head and body movements. The system supports efficient and continuous generation of upper-body portrait video at up to 30 fps on a 4090 GPU, enabling real-time interactive video chat.
Key Takeaways#
Why does it matter?#
This paper introduces a real-time portrait video generation framework called ChatAnyone, enabling natural and expressive upper-body movements and facial expressions. It is significant for creating immersive digital interactions and sets the stage for future work in virtual avatars and human-computer interfaces.
Visual Insights#
🔼 This figure demonstrates the real-time portrait video generation capabilities of the ChatAnyone model. The input consists of a single portrait image and an audio sequence. The output is a high-fidelity video of a full head and upper body avatar, exhibiting realistic and diverse facial expressions. The model allows for control over the style of the generated video.
Figure 1: Illustration of real-time portrait video generation. Given a portrait image and audio sequence as input, our model can generate high-fidelity animation results from full head to upper-body interaction with diverse facial expressions and style control.
Method | PSNR | SSIM | LPIPS | FID | FVD | CSIM | HKC | FPS (Resolution) |
---|---|---|---|---|---|---|---|---|
FOMM [24] | 18.92 | 0.677 | 0.269 | 42.690 | 569.893 | 0.525 | 0.494 | 87 (256*256) |
MRAA [25] | 19.12 | 0.696 | 0.253 | 35.546 | 419.293 | 0.536 | 0.534 | 77 (384*384) |
LIA [31] | 18.96 | 0.681 | 0.258 | 44.747 | 387.924 | 0.590 | 0.548 | 30 (256*256) |
TPSMM [43] | 19.64 | 0.707 | 0.237 | 34.509 | 384.663 | 0.597 | 0.567 | 48 (384*384) |
w/o hand injection | 24.59 | 0.829 | 0.132 | 6.825 | 38.401 | 0.605 | 0.607 | 34 (512*768) |
w/o face refine | 24.87 | 0.829 | 0.126 | 5.799 | 34.124 | 0.613 | 0.652 | 37 (512*768) |
Ours | 24.88 | 0.831 | 0.126 | 5.505 | 33.349 | 0.654 | 0.652 | 33 (512*768) |
w/o facial hybrid control* | 22.85 | 0.799 | 0.170 | 6.355 | 64.249 | 0.627 | - | 40 (512*512) |
Ours* | 23.09 | 0.807 | 0.166 | 6.297 | 47.914 | 0.632 | - | 41 (512*512) |
🔼 This table presents a quantitative comparison of several methods for generating upper-body videos from a single image using a self-driven reenactment approach. The comparison includes metrics such as PSNR, SSIM, LPIPS, FID, FVD, CSIM, and HKC, assessing the visual quality and accuracy of the generated videos. The table also shows the inference speed in frames per second (FPS) and the resolution of the generated videos for each method. Note that results marked with an asterisk (*) used talking head video reenactment to specifically evaluate the effectiveness of the implicit facial keypoint offset technique.
Table 1: Quantitative comparisons of upper-body video generation under self-driven reenactment mode. The inference speed, measured in frames per second (FPS), and the resolution of the generated output are also presented in the table. The * symbol denotes that the evaluations are conducted on talking head video reenactment to verify the effectiveness of implicit facial keypoint offset.
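For readers unfamiliar with the per-frame reconstruction metrics above, the snippet below is a minimal sketch of how PSNR and SSIM can be computed for one generated/ground-truth frame pair, assuming `uint8` RGB arrays and the `scikit-image` implementations; this is illustrative tooling, not the paper's evaluation code, and perceptual metrics such as LPIPS, FID, and FVD require dedicated pretrained networks.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_metrics(gt: np.ndarray, gen: np.ndarray):
    """PSNR and SSIM for one uint8 RGB frame pair of shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(gt, gen, data_range=255)
    ssim = structural_similarity(gt, gen, data_range=255, channel_axis=-1)
    return psnr, ssim

# Video-level scores are typically the mean over all frames:
# scores = [frame_metrics(g, p) for g, p in zip(gt_frames, gen_frames)]
```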
In-depth insights#
Hierarchical Gen#
Hierarchical generation could refer to a multi-stage or layered approach in generative models, where outputs are refined progressively. In video or image generation, this might involve creating a low-resolution base and then adding details in subsequent steps. The hierarchy could also relate to control, with high-level parameters setting the overall style or content and lower-level parameters controlling specific details. Such a structure allows for efficient generation and editing, as changes at a high level propagate to all subsequent levels, while changes at a low level only affect local details. This aligns well with how humans create complex outputs, starting with broad strokes and adding refinements, which yields both control and coherence.
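As a concrete (and deliberately generic) illustration of coarse-to-fine generation, the sketch below builds a low-resolution base from a latent code and then adds high-resolution detail in a second stage; the architecture and dimensions are assumptions for illustration, not the paper's model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseToFineGenerator(nn.Module):
    """Two-stage generator: a coarse low-resolution pass sets global layout,
    a refinement pass adds local detail. Illustrative only."""
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        # Stage 1: latent code -> coarse 64x64 image (global structure/style).
        self.coarse = nn.Sequential(
            nn.Linear(latent_dim, 64 * 16 * 16),
            nn.Unflatten(1, (64, 16, 16)),
            nn.Upsample(scale_factor=4, mode="bilinear"),
            nn.Conv2d(64, 3, 3, padding=1),
        )
        # Stage 2: refine to 256x256, conditioned on the coarse output.
        self.refine = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="bilinear"),
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        coarse = self.coarse(z)                                   # (B, 3, 64, 64)
        up = F.interpolate(coarse, scale_factor=4, mode="bilinear")
        return up + self.refine(coarse)                           # base + residual detail
```

Changes to `z` affect the whole output, while the refinement stage only perturbs local detail, mirroring the coarse-to-fine behavior described above.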
Hybrid Control#
The concept of “Hybrid Control” in the context of portrait video generation likely refers to the combined use of explicit and implicit control mechanisms to achieve more nuanced and realistic facial expressions and body movements. Explicit control might involve using parameters like 3DMM coefficients or facial landmarks to directly manipulate specific features, offering precise control but potentially lacking fine-grained detail. Implicit control, on the other hand, could involve using latent variables or learned representations to capture subtle variations and styles, providing richer expressiveness but with less direct control. A hybrid approach aims to leverage the strengths of both, using explicit controls for overall structure and implicit controls for detail and style, ultimately leading to more controllable and expressive portrait videos. This fusion allows for detailed manipulation of features while keeping the capacity to represent subtle expressions.
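A minimal sketch of such a fusion, assuming explicit 3DMM-style coefficients and an implicit latent of fixed size (both dimensions are hypothetical), might simply project and concatenate the two control signals into one code consumed by the generator:

```python
import torch
import torch.nn as nn

class HybridControlEncoder(nn.Module):
    """Fuses explicit controls (e.g. 3DMM-style coefficients) with an implicit
    latent capturing fine detail/style. A sketch, not the paper's module."""
    def __init__(self, n_explicit: int = 64, n_implicit: int = 32, d_model: int = 256):
        super().__init__()
        self.explicit_proj = nn.Linear(n_explicit, d_model)       # coarse, interpretable control
        self.implicit_enc = nn.Sequential(                        # fine, learned residual style
            nn.Linear(n_implicit, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, explicit_params: torch.Tensor, implicit_latent: torch.Tensor) -> torch.Tensor:
        e = self.explicit_proj(explicit_params)
        i = self.implicit_enc(implicit_latent)
        return self.fuse(torch.cat([e, i], dim=-1))               # joint control code
```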
Real-time GAN#
Real-time GANs represent a significant area of research, focusing on achieving fast and efficient image and video generation. Traditional GANs often suffer from high computational costs, making them unsuitable for real-time applications. Research in this area aims to optimize GAN architectures and training methods to reduce inference time while maintaining high-quality output. Techniques include model compression, efficient network designs, and optimized training strategies. This is crucial for applications like real-time video editing, interactive gaming, and live streaming, where low latency is essential. The challenge lies in balancing speed and quality, as aggressive optimization can sometimes lead to a reduction in the visual fidelity of the generated content. Success in real-time GANs would enable more interactive and responsive AI-driven experiences.
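When the claim is real-time synthesis, throughput is the number to verify; the helper below is a simple, generic way to estimate frames per second for any PyTorch frame generator (a generic utility, not part of the paper):

```python
import time
import torch

@torch.no_grad()
def measure_fps(model: torch.nn.Module, example_input: torch.Tensor,
                warmup: int = 10, iters: int = 100) -> float:
    """Rough throughput estimate (frames per second) for a frame generator."""
    model.eval()
    for _ in range(warmup):                    # warm up caches / kernel selection
        model(example_input)
    if example_input.is_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(example_input)
    if example_input.is_cuda:
        torch.cuda.synchronize()               # wait for queued GPU work before stopping the clock
    return iters / (time.perf_counter() - start)

# A 30 fps target leaves roughly 33 ms per frame for the entire pipeline.
```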
Style Transfer#
While not explicitly a standalone section, the concept of style transfer is woven into the core of the research. The paper leverages it primarily within the facial motion prediction stage. By conditioning the audio-driven motion diffusion model on a reference video, the framework gains the ability to imbue the generated portrait video with the stylistic nuances of the reference. This means that aspects like expressive intensity, subtle emotional cues, or even idiosyncratic head movements seen in the reference can be transferred to the generated avatar. This is achieved through Adaptive Layer Normalization (AdaLN), allowing for a more personalized and controllable output that goes beyond simple audio-to-motion mapping. The results show that injecting expression information from a reference video improves similarity to the ground truth, and the ablation study further validates the effect of explicitly controlling the magnitude of expressions.
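A minimal sketch of AdaLN-style conditioning, assuming motion tokens of width `d_model` and a pooled style embedding of width `d_style` (both hypothetical dimensions), looks like this: the normalized activations are rescaled and shifted by parameters regressed from the style embedding.

```python
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    """Adaptive LayerNorm conditioned on a style embedding, e.g. pooled from a
    reference motion sequence. A sketch, not the paper's implementation."""
    def __init__(self, d_model: int, d_style: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(d_style, 2 * d_model)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (B, T, d_model) motion tokens; style: (B, d_style) reference embedding
        scale, shift = self.to_scale_shift(style).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```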
Upper-Body Focus#
Focusing on the upper body in video generation tasks is crucial for creating realistic and engaging digital interactions. Unlike traditional methods that primarily address head movements and facial expressions, a dedicated approach to the upper body allows for the incorporation of natural body language, hand gestures, and subtle postural adjustments. This enhancement significantly contributes to the overall expressiveness and authenticity of the generated video. Accurate upper-body motion is critical for synchronizing with speech and conveying emotional nuances. Moreover, an upper-body focus enables a broader range of applications beyond simple talking head scenarios, such as virtual avatars, live streaming, and augmented reality, thereby enhancing user engagement and immersion. Challenges include capturing the complex interplay between facial expressions, head movements, and body language, as well as ensuring realistic hand gestures and seamless integration of the upper body with the overall scene. Effective solutions often involve advanced techniques for motion capture, body pose estimation, and realistic rendering of clothing and skin textures. The goal is to generate upper-body movements that are not only visually appealing but also contextually relevant and emotionally expressive.
More visual insights#
More on figures
🔼 This figure illustrates the pipeline for generating upper-body videos. It starts with a source image which undergoes feature extraction. Simultaneously, facial keypoints (explicit control) and body keypoints (implicit control) are extracted from the source image. These keypoints, along with rendered hand images (providing additional detailed hand control), are then used to guide a warping module. This warping module distorts the appearance features based on the motion information extracted from the keypoints. The warped features are fed into a generator which produces the final upper-body video output, leveraging both explicit and implicit control signals for refined control over facial expressions and body movements.
Figure 2: Pipeline of upper-body video generation with hybrid control fusion, which takes both explicit facial keypoints and implicit body keypoints to conduct feature warping, while the rendered hand image is further injected into the generator to improve the quality of hand generation.
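To make the data flow concrete, here is a toy version of warping-based generation with hybrid control: source features are warped by a flow predicted from keypoint heatmaps, and a rendered hand image is concatenated into the decoder. Channel counts, the single-layer modules, and the `n_kp` parameter are assumptions for illustration; the actual network is more elaborate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WarpThenGenerate(nn.Module):
    """Toy hybrid-control generator: warp source features with a keypoint-driven
    flow, then decode together with a rendered hand image. Illustrative only."""
    def __init__(self, n_kp: int = 10):
        super().__init__()
        self.appearance = nn.Conv2d(3, 64, 7, padding=3)            # source feature extractor
        self.flow_head = nn.Conv2d(64 + n_kp, 2, 3, padding=1)      # keypoint heatmaps -> flow offsets
        self.decoder = nn.Conv2d(64 + 3, 3, 3, padding=1)           # warped features + hand RGB

    def forward(self, source, kp_heatmaps, hand_render):
        b, _, h, w = source.shape
        feat = self.appearance(source)                              # (B, 64, H, W)
        flow = self.flow_head(torch.cat([feat, kp_heatmaps], 1))    # (B, 2, H, W) offsets
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                                torch.linspace(-1, 1, w), indexing="ij")
        identity = torch.stack([xs, ys], dim=-1).to(source)         # (H, W, 2) identity sampling grid
        grid = identity.unsqueeze(0) + flow.permute(0, 2, 3, 1)     # add predicted motion offsets
        warped = F.grid_sample(feat, grid, align_corners=False)     # motion-driven feature warping
        return self.decoder(torch.cat([warped, hand_render], 1))    # hand image injected late
```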
🔼 Figure 3 illustrates the hierarchical audio2motion diffusion model. The bottom part shows facial motion prediction, which includes style control mechanisms such as reference sequence injection for style transfer. The top section shows upper-body motion prediction, driven by the output of the facial motion prediction module and audio input. This upper-body module also incorporates hand motion generation using hand coefficients from a MANO template. The model uses a combination of cross-attention mechanisms and adaptive layer normalization to effectively combine audio features and motion information at each level.
Figure 3: Illustration of the hierarchical audio2motion diffusion model, including facial motion prediction with style control at the bottom, and upper-body motion prediction with hands at the top.
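Read as a data flow, the hierarchy is two chained conditional samplers: facial motion is generated from audio (optionally with a style reference), and upper-body plus hand motion is then generated conditioned on the audio and the predicted facial motion. The sketch below captures only that flow, with placeholder callables standing in for the two diffusion models; names and shapes are assumptions.

```python
import torch

def hierarchical_audio2motion(audio_feat, face_sampler, body_sampler, style_ref=None):
    """Two-level motion generation: face first, then upper body + hands.

    face_sampler(audio, style)       -> (B, T, n_face) facial coefficients
    body_sampler(audio, face_motion) -> (B, T, n_body + n_hand) pose/hand params
    Both callables stand in for conditional diffusion samplers.
    """
    face_motion = face_sampler(audio_feat, style_ref)    # level 1: audio (+ style) -> expressions
    body_motion = body_sampler(audio_feat, face_motion)  # level 2: audio + face -> body and hands
    return face_motion, body_motion

# Shape-only example with dummy samplers:
B, T = 1, 100
audio = torch.randn(B, T, 768)
face = lambda a, s: torch.randn(B, T, 64)        # stand-in facial motion model
body = lambda a, f: torch.randn(B, T, 63 + 45)   # stand-in body + hand motion model
face_seq, body_seq = hierarchical_audio2motion(audio, face, body)
```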
🔼 Figure 4 illustrates the architecture of a face refinement network used to enhance the realism of generated facial expressions. The network takes as input both explicit and implicit facial keypoints. Explicit keypoints provide coarse control, while implicit offsets refine the position of the keypoints, leading to more precise and natural facial expressions. The right side of the figure visually demonstrates how the addition of implicit offsets leads to a more accurate representation of facial keypoint locations compared to using only explicit keypoints.
Figure 4: Illustration of the face refine network. The left of the figure shows the architecture, while the right demonstrates that more precise facial keypoints are located by adding the implicit offset.
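The refinement idea can be summarized as keypoints = explicit keypoints + small learned offsets. The sketch below is a hypothetical offset head with assumed dimensions (68 keypoints, a pooled image feature), not the paper's network:

```python
import torch
import torch.nn as nn

class KeypointRefiner(nn.Module):
    """Adds small implicit offsets to explicit facial keypoints. Illustrative only."""
    def __init__(self, n_kp: int = 68, d_feat: int = 256):
        super().__init__()
        self.offset_head = nn.Sequential(
            nn.Linear(d_feat + n_kp * 2, 256), nn.ReLU(),
            nn.Linear(256, n_kp * 2), nn.Tanh(),   # bounded residual in [-1, 1]
        )
        self.offset_scale = 0.05                   # keep offsets small in normalized coords

    def forward(self, explicit_kp: torch.Tensor, img_feat: torch.Tensor) -> torch.Tensor:
        # explicit_kp: (B, n_kp, 2) normalized coordinates; img_feat: (B, d_feat)
        b, n, _ = explicit_kp.shape
        inp = torch.cat([img_feat, explicit_kp.reshape(b, -1)], dim=-1)
        offsets = self.offset_scale * self.offset_head(inp).reshape(b, n, 2)
        return explicit_kp + offsets               # refined = explicit + implicit offset
```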
More on tables
🔼 This table quantitatively compares the hand generation quality of several video generation methods based on diffusion models. The comparison is performed by reenacting demo videos from existing methods and then assessing the resulting hand gestures using specific metrics. This provides insights into the relative performance of different diffusion-based approaches in generating realistic and detailed hand movements within generated videos.
Table 2: Quantitative comparisons of video generation with diffusion-based methods. We reenact the demo videos provided by the compared methods and evaluate the results of hand generation.
Method | FID | CSIM | Sync | Diversity |
---|---|---|---|---|
SadTalker [41] | 52.32 | 0.595 | 4.120 | 0.112 |
AniTalker [16] | 19.74 | 0.578 | 4.066 | 0.099 |
Ours | 9.49 | 0.668 | 5.668 | 0.137 |
🔼 This table presents a quantitative comparison of the results obtained from generating head-only talking animations driven by audio input. It compares various metrics to assess the quality and realism of different approaches, focusing specifically on the accuracy of lip synchronization and the diversity of facial expressions generated.
Table 3: Quantitative comparisons of audio-driven results under head-only talking animation generation.
Method | MAE | SSIM |
---|---|---|
w/o style transfer | 0.074 | 0.373 |
Ours | 0.049 | 0.709 |
🔼 This table presents a comparison of the effectiveness of style transfer using a reference video. It shows the Mean Absolute Error (MAE) and Structural Similarity Index (SSIM) values calculated for facial coefficients predicted by the audio-to-motion model. Lower MAE indicates better accuracy and higher SSIM represents improved visual similarity between the generated and reference video’s facial expressions. The comparison allows for assessing the impact of style transfer on the model’s ability to generate visually consistent and stylistically coherent facial animations.
Table 4: Comparison on reference style transfer, calculated on the face coefficients predicted from audio2motion model.
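For reference, MAE here is presumably the mean absolute error between predicted and ground-truth coefficient sequences; a minimal computation is shown below (how SSIM is adapted to coefficient sequences is not specified in this summary, so it is omitted):

```python
import numpy as np

def coeff_mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error over (T, n_coeff) facial coefficient sequences."""
    return float(np.mean(np.abs(pred - gt)))
```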