Motion Anything: Any to Motion Generation

2503.06955

Zeyu Zhang et al.

🤗 2025-03-13

TL;DR

Conditional motion generation still struggles to prioritize dynamic frames according to the given condition and to integrate multiple conditioning modalities effectively. Existing masking models lack a condition-aware way of doing this. To address it, the paper introduces an ‘Attention-based Mask Modeling’ method that gives spatial and temporal control over key frames and actions, so the model focuses on key actions during motion generation. The model also adaptively encodes text and music to improve controllability.

To support this line of research, the paper presents ‘Text-Music-Dance (TMD),’ a dataset that pairs dance motion with music and text. Experimental results show that the framework surpasses current state-of-the-art methods on multiple benchmarks, with an improvement of about 15% in FID on HumanML3D and consistent gains on the AIST++ and TMD datasets.
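
As a rough check on the "about 15%" figure, assume it is measured against the runner-up HumanML3D FID reported in Table 2 (MoGenTS at 0.033 versus 0.028 for Motion Anything). The choice of baseline is an assumption, since this summary does not state it. Under that assumption the relative reduction is:

$$\frac{0.033 - 0.028}{0.033} = \frac{0.005}{0.033} \approx 0.152$$

That is roughly a 15% lower FID than the runner-up.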

Key Takeaways

Why does it matter?

This work enables controllable multimodal motion generation and contributes a new dataset, allowing more versatile and precise motion synthesis. Future research can explore and refine this approach for virtual characters, human-computer interaction, and robotics.

Visual Insights

🔼 This figure contrasts the random masking used in previous motion generation models [21] with the attention-based masking introduced in this paper. The top panel shows random masking, where parts of the motion sequence are masked regardless of their importance or relevance to the input condition. The bottom panel shows attention-based masking: the model assigns attention weights to different parts of the motion based on the input condition, and the masking concentrates on the more significant and dynamic frames and body parts (colored) that correspond to that condition.
Figure 1: Masking strategy comparison. This figure demonstrates the key differences between the previous random masking strategy [21] (top) and our attention-based masking (bottom). Our masking strategy focuses on the more significant and dynamic parts of the motion (colored) corresponding to the condition.

Models	Text-to-Motion	Music-to-Dance	Text and Music to Dance
TM2D [17]	✓	✓	✗
UDE [83]	✓	✓	✗
UDE-2 [84]	✓	✓	✗
MoFusion [12]	✓	✓	✗
MCM [39]	✓	✓	✓
LMM [73]	✓	✓	✗
MotionCraft [5]	✓	✓	✗
MagicPose4D [67]	✗	✗	✗
STAR [8]	✓	✗	✗
TC4D [3]	✓	✗	✗
Motion Avatar [77]	✓	✗	✗
Motion Anything (Ours)	✓	✓	✓

🔼 This table compares different methods for motion generation, highlighting their ability to handle single versus multiple conditioning modalities. Most existing methods, whether single-task or multi-task, can only process one type of condition at a time (e.g., text or music). This limits their control over the generated motion. In contrast, the proposed method, ‘Motion Anything,’ uniquely handles multiple modalities simultaneously and adaptively, leading to more controllable motion generation.
Table 1: Methods comparison. Either single-task or multi-task models can handle only one condition at a time, overlooking the importance of integrating multiple modalities for more controllable generation. Our Motion Anything introduces an innovative approach that encodes different modalities simultaneously and adaptively for more controllable generation.

In-depth insights

Attention Masking

Attention masking selectively focuses the model on the important parts of the input while ignoring irrelevant information. The technique can be used with various modalities, including text, audio, and video. It helps the model prioritize key features and may reduce computational cost. It applies in both the temporal and spatial dimensions: temporally, it selects key frames or time steps; spatially, it selects important regions or body parts. Because the selection depends on the current context, the model can learn more robust representations from the most relevant information. This is especially valuable for multimodal data, where modalities may vary in importance.
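
The temporal-then-spatial selection described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `attention_mask`, the input layout `scores[t][j]` (one attention weight per frame and body part), and the choice to mask the highest-scoring tokens are all assumptions made for the example.

```python
import math


def attention_mask(scores, temporal_ratio=0.3, spatial_ratio=0.3):
    """Return a boolean mask with True on the tokens selected for masking.

    scores[t][j] is a hypothetical attention weight for frame t and body
    part j. Frames are ranked by their summed attention, then body parts
    are ranked inside each selected frame.
    """
    num_frames = len(scores)
    num_parts = len(scores[0])
    n_frames = max(1, math.ceil(temporal_ratio * num_frames))
    n_parts = max(1, math.ceil(spatial_ratio * num_parts))

    # Temporal step: keep the frames with the largest total attention.
    frame_totals = [sum(row) for row in scores]
    key_frames = sorted(
        range(num_frames), key=lambda t: frame_totals[t], reverse=True
    )[:n_frames]

    mask = [[False] * num_parts for _ in range(num_frames)]
    for t in key_frames:
        # Spatial step: inside a key frame, pick the most attended parts.
        key_parts = sorted(
            range(num_parts), key=lambda j: scores[t][j], reverse=True
        )[:n_parts]
        for j in key_parts:
            mask[t][j] = True
    return mask


example_scores = [
    [0.1, 0.1, 0.1],
    [0.9, 0.2, 0.1],
    [0.1, 0.1, 0.2],
    [0.1, 0.1, 0.1],
]
example_mask = attention_mask(example_scores, temporal_ratio=0.25, spatial_ratio=0.3)
```

Random masking would instead sample frames and body parts uniformly, ignoring `scores` altogether.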

Multi-Modal TMD

The idea of a ‘Multi-Modal TMD’ (Text-Music-Dance) approach is compelling, suggesting a deeper integration of diverse data streams for motion generation. This goes beyond simple concatenation, implying a synergistic model where text provides semantic grounding, music dictates rhythm and style, and the dance output reflects a coherent blend. This is crucial because current models often treat modalities separately, limiting control and expressiveness. A true multi-modal system would leverage attention mechanisms to prioritize key elements from each input, ensuring dynamic frames and body parts align with the combined context. Furthermore, a robust dataset with paired text, music, and dance is essential for training, filling a current gap in the research landscape and facilitating exploration of complex correlations between modalities, which may advance future motion generation research.

Adaptive Control

While ‘Adaptive Control’ isn’t explicitly named in the paper, its principles run throughout the methodology. The core idea is to make the model more responsive and flexible to varied inputs. Motion Anything adapts by using attention mechanisms that prioritize key frames and body parts depending on the condition, so the model focuses on the most important parts of the motion. In addition, a Temporal Adaptive Transformer (TAT) aligns temporal tokens with the condition in any modality. The ability to handle multimodal inputs further demonstrates this adaptivity: the model integrates information from text and music for better control and coherence.

4D Avatars

Interest in ‘4D Avatars’, dynamic 3D models that evolve over time, has surged. Existing methods often struggle with limited control over motion and with inconsistencies in mesh appearance. A feedforward approach aims to resolve these issues by generating avatars from a single prompt, streamlining the process. Combining advances in motion generation with 3D avatar creation can yield more realistic and expressive results, with more precise movements and consistent visual quality. Automated rigging is a particular focus for improving the realism of avatar movements.

Key-Frame Focus

The concept of “Key-Frame Focus” in motion generation likely refers to a methodology that prioritizes the accurate and detailed generation of key frames within a motion sequence. This approach contrasts with methods that treat all frames equally, instead allocating more computational resources and attention to frames deemed more important for conveying the overall motion and its nuances. Key frames often represent points of significant change or emphasis in the movement, such as the peak of a jump or the moment of impact in a collision. By focusing on these critical junctures, the system can achieve higher fidelity in the most visually salient parts of the motion, potentially allowing for a more efficient use of resources as less critical frames can be interpolated or generated with less detail. The identification of key frames could rely on various criteria, including detecting points of high acceleration, changes in direction, or semantic importance based on the input conditions (text, music, etc.). Furthermore, effective methods for key frame focus would likely involve techniques to ensure smooth transitions between key frames and maintain overall coherence in the generated motion sequence.
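
One of the criteria mentioned above, detecting points of high acceleration, can be illustrated with a finite-difference sketch. It is only an example of that generic criterion and does not describe how the paper selects key frames; the function name `key_frames_by_acceleration` and the input layout (one flat list of joint coordinates per frame) are assumptions.

```python
import math


def key_frames_by_acceleration(frames, k):
    """Return the indices of the k frames with the largest acceleration.

    frames[t] is a flat list of joint coordinates at frame t (hypothetical
    layout). Acceleration is approximated by the second-order central
    difference x[t+1] - 2*x[t] + x[t-1], so the first and last frames are
    never selected.
    """
    magnitudes = []
    for t in range(1, len(frames) - 1):
        second_diff = [
            nxt - 2 * cur + prev
            for prev, cur, nxt in zip(frames[t - 1], frames[t], frames[t + 1])
        ]
        magnitudes.append((math.sqrt(sum(d * d for d in second_diff)), t))

    # Keep the k largest magnitudes, then report frames in temporal order.
    top = sorted(magnitudes, key=lambda pair: pair[0], reverse=True)[:k]
    return sorted(t for _, t in top)


# A one-dimensional trajectory that moves steadily and jumps between frames 3 and 4.
trajectory = [[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
selected = key_frames_by_acceleration(trajectory, k=2)
```

Here the two frames around the jump are selected, while the steadily moving frames have zero second difference and are ignored.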

More visual insights

More on tables

Datasets	Method	R Precision $\uparrow$			FID $\downarrow$	MultiModal Dist $\downarrow$	Diversity $\rightarrow$	MultiModality $\uparrow$
Datasets	Method	Top 1	Top 2	Top 3	FID $\downarrow$	MultiModal Dist $\downarrow$	Diversity $\rightarrow$	MultiModality $\uparrow$
HumanML3D [19]	Ground Truth	$0.511^{\pm.003}$	$0.703^{\pm.003}$	$0.797^{\pm.002}$	$0.002^{\pm.000}$	$2.974^{\pm.008}$	$9.503^{\pm.065}$	-
	TM2D [17]	$0.319^{\pm.000}$	-	-	$1.021^{\pm.000}$	$4.098^{\pm.000}$	$\mathbf{9.513}^{\pm.000}$	$\mathbf{4.139}^{\pm.000}$
	MotionCraft [5]	${0.501}^{\pm.003}$	${0.697}^{\pm.003}$	${0.796}^{\pm.002}$	${0.173}^{\pm.002}$	${3.025}^{\pm.008}$	${9.543}^{\pm.098}$	-
	ReMoDiffuse [70]	${0.510}^{\pm.005}$	${0.698}^{\pm.006}$	${0.795}^{\pm.004}$	${0.103}^{\pm.004}$	${2.974}^{\pm.016}$	${9.018}^{\pm.075}$	$1.795^{\pm.043}$
	MMM [45]	${0.504}^{\pm.003}$	${0.696}^{\pm.003}$	${0.794}^{\pm.002}$	${0.080}^{\pm.003}$	${2.998}^{\pm.007}$	${9.411}^{\pm.058}$	$1.164^{\pm.041}$
	DiverseMotion [41]	${0.515}^{\pm.003}$	${0.706}^{\pm.002}$	${0.802}^{\pm.002}$	${0.072}^{\pm.004}$	${2.941}^{\pm.007}$	${9.683}^{\pm.102}$	$1.869^{\pm.089}$
	BAD [22]	${0.517}^{\pm.002}$	${0.713}^{\pm.003}$	${0.808}^{\pm.003}$	${0.065}^{\pm.003}$	${2.901}^{\pm.008}$	${9.694}^{\pm.068}$	$1.194^{\pm.044}$
	BAMM [44]	${0.525}^{\pm.002}$	${0.720}^{\pm.003}$	${0.814}^{\pm.003}$	${0.055}^{\pm.002}$	${2.919}^{\pm.008}$	${9.717}^{\pm.089}$	$1.687^{\pm.051}$
	MCM [39]	${0.502}^{\pm.002}$	${0.692}^{\pm.004}$	${0.788}^{\pm.006}$	${0.053}^{\pm.007}$	${3.037}^{\pm.003}$	${9.585}^{\pm.082}$	$0.810^{\pm.023}$
	MoMask [21]	${0.521}^{\pm.002}$	${0.713}^{\pm.002}$	${0.807}^{\pm.002}$	${0.045}^{\pm.002}$	${2.958}^{\pm.008}$	-	$1.241^{\pm.040}$
	LMM [73]	${0.525}^{\pm.002}$	$\underline{0.719}^{\pm.002}$	${0.811}^{\pm.002}$	${0.040}^{\pm.002}$	${2.943}^{\pm.012}$	$9.814^{\pm.076}$	$2.683^{\pm.054}$
	MoGenTS [64]	$\underline{0.529}^{\pm.003}$	$\underline{0.719}^{\pm.002}$	$\underline{0.812}^{\pm.002}$	$\underline{0.033}^{\pm.001}$	$\underline{2.867}^{\pm.006}$	$9.570^{\pm.077}$	-
	Motion Anything (Ours)	$\mathbf{0.546}^{\pm.003}$	$\mathbf{0.735}^{\pm.002}$	$\mathbf{0.829}^{\pm.002}$	$\mathbf{0.028}^{\pm.005}$	$\mathbf{2.859}^{\pm.010}$	$\underline{9.521}^{\pm.083}$	$\underline{2.705}^{\pm.068}$
KIT-ML [46]	Ground Truth	$0.424^{\pm.005}$	$0.649^{\pm.006}$	$0.779^{\pm.006}$	$0.031^{\pm.004}$	$2.788^{\pm.012}$	$11.08^{\pm.097}$	-
	ReMoDiffuse [70]	${0.427}^{\pm.014}$	${0.641}^{\pm.004}$	${0.765}^{\pm.055}$	${0.155}^{\pm.006}$	${2.814}^{\pm.012}$	${10.80}^{\pm.105}$	$1.239^{\pm.028}$
	MMM [45]	${0.404}^{\pm.005}$	${0.621}^{\pm.005}$	${0.744}^{\pm.004}$	${0.316}^{\pm.028}$	${2.977}^{\pm.019}$	${10.91}^{\pm.101}$	$1.232^{\pm.039}$
	DiverseMotion [41]	${0.416}^{\pm.005}$	${0.637}^{\pm.008}$	${0.760}^{\pm.011}$	${0.468}^{\pm.098}$	${2.892}^{\pm.041}$	${10.87}^{\pm.101}$	$\mathbf{2.062}^{\pm.079}$
	BAD [22]	${0.417}^{\pm.006}$	${0.631}^{\pm.006}$	${0.750}^{\pm.006}$	${0.221}^{\pm.012}$	${2.941}^{\pm.025}$	$\underline{11.00}^{\pm.100}$	$1.170^{\pm.047}$
	BAMM [44]	${0.438}^{\pm.009}$	${0.661}^{\pm.009}$	${0.788}^{\pm.005}$	${0.183}^{\pm.013}$	${2.723}^{\pm.026}$	$\mathbf{11.01}^{\pm.094}$	$1.609^{\pm.065}$
	MoMask [21]	${0.433}^{\pm.007}$	${0.656}^{\pm.005}$	${0.781}^{\pm.005}$	${0.204}^{\pm.011}$	${2.779}^{\pm.022}$	-	$1.131^{\pm.043}$
	LMM [73]	${0.430}^{\pm.015}$	${0.653}^{\pm.017}$	${0.779}^{\pm.014}$	$\underline{0.137}^{\pm.023}$	${2.791}^{\pm.018}$	$11.24^{\pm.103}$	$\underline{1.885}^{\pm.127}$
	MoGenTS [64]	$\underline{0.445}^{\pm.006}$	$\underline{0.671}^{\pm.006}$	$\underline{0.797}^{\pm.005}$	${0.143}^{\pm.004}$	${2.711}^{\pm.024}$	$10.92^{\pm.090}$	-
	Motion Anything (Ours)	$\mathbf{0.449}^{\pm.007}$	$\mathbf{0.678}^{\pm.004}$	$\mathbf{0.802}^{\pm.006}$	$\mathbf{0.131}^{\pm.003}$	$\mathbf{2.705}^{\pm.024}$	${10.94}^{\pm.098}$	${1.374}^{\pm.069}$

🔼 This table presents a quantitative comparison of text-to-motion generation methods on the HumanML3D and KIT-ML datasets. Metrics include R-Precision (retrieval accuracy), FID (Fréchet Inception Distance, indicating the quality and realism of the generated motion), MultiModal Distance (alignment between the generated motion and the text description), Diversity (variety across generated motions), and MultiModality (diversity among motions generated from the same text prompt). Higher R-Precision and MultiModality are better, lower FID and MultiModal Distance are better, and for Diversity (marked with a right arrow) values closer to the ground truth are better. The best value for each metric is in bold and the runner-up is underlined. Methods capable of handling multiple conditioning modalities (such as text and music) are highlighted in blue.
Table 2: Quantitative comparison on HumanML3D [19] and KIT-ML [46]. The best and runner-up values are bold and underlined. The right arrow $\rightarrow$ indicates that closer values to ground truth are better. Multimodal motion generation methods are highlighted in blue.
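
To make the R-Precision columns concrete, here is a minimal sketch of a top-k retrieval accuracy. It assumes a precomputed distance matrix in which `dist[i][j]` is the distance between the embedding of motion `i` and the embedding of text `j`, with the matching text on the diagonal. The function name, the matrix layout, and the toy numbers are illustrative assumptions; the feature extractor and candidate pool size used for the reported scores are not given in this summary.

```python
def r_precision(dist, k):
    """Fraction of motions whose matching text ranks among the k nearest.

    dist[i][j]: distance between motion i and text j (hypothetical input);
    the ground-truth text for motion i is assumed to be text i.
    """
    hits = 0
    for i, row in enumerate(dist):
        # Rank candidate texts from nearest to farthest.
        ranked = sorted(range(len(row)), key=lambda j: row[j])
        if i in ranked[:k]:
            hits += 1
    return hits / len(dist)


toy_dist = [
    [0.1, 0.5, 0.9],  # motion 0: its own text is nearest
    [0.4, 0.3, 0.2],  # motion 1: its own text is second nearest
    [0.7, 0.1, 0.6],  # motion 2: its own text is second nearest
]
top1 = r_precision(toy_dist, k=1)
top2 = r_precision(toy_dist, k=2)
```

In the toy matrix only one of three motions retrieves its text first, while all three retrieve it within the top two, which is why Top 1 ≤ Top 2 ≤ Top 3 in every row of the table.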

	Motion Quality		Motion Diversity
Method	FID ${}_{k}\downarrow$	FID ${}_{g}\downarrow$	Div ${}_{k}\uparrow$	Div ${}_{g}\uparrow$	BAS $\uparrow$
Ground Truth	17.10	10.60	8.19	7.45	0.2374
TSMT [33]	86.43	43.46	6.85	3.32	0.1607
Dance Revolution [24]	73.42	25.92	3.52	4.87	0.1950
DanceNet [86]	69.18	25.49	2.86	2.85	0.1430
MoFusion [12]	50.31	-	9.09	-	0.2530
EDGE [54]	42.16	22.12	3.96	4.61	0.2334
Lodge [36]	37.09	18.79	5.58	4.85	0.2423
FACT [34]	35.35	22.11	5.94	6.18	0.2209
Bailando [51]	28.16	9.62	7.83	6.34	0.2332
TM2D [17]	23.94	9.53	7.69	4.53	0.2127
BADM [66]	-	-	8.29	6.76	0.2366
LMM [73]	22.08	21.97	9.85	6.72	0.2249
Bailando++ [52]	17.59	10.10	8.64	6.50	0.2720
UDE [83]	17.25	8.69	7.78	5.81	0.2310
MCM [39]	$\mathbf{15.57}$	25.85	6.50	5.74	0.2750
Motion Anything (Ours)	$\underline{17.22}$	$\mathbf{8.56}$	$\mathbf{9.91}$	$\mathbf{6.79}$	$\mathbf{0.2757}$

🔼 This table presents a quantitative comparison of different methods for music-to-dance generation on the AIST++ benchmark dataset. It compares the performance of various methods across multiple metrics, including FID (Fréchet Inception Distance) for quality assessment, metrics for motion diversity, and a beat alignment score (BAS). The best and second-best results for each metric are highlighted. Methods capable of handling multimodal conditioning (music and other modalities) are visually distinguished.
Table 3: Quantitative comparison on AIST++ [34]. The best and runner-up values are bold and underlined. Multimodal motion generation methods are highlighted in blue.
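
For readers unfamiliar with the BAS column, the sketch below shows one common Gaussian-kernel formulation of a beat alignment score: each music beat is matched to its nearest motion beat and scored by how close they are. The exact variant behind the reported numbers is not given in this summary, so the averaging direction, the default `sigma`, and the function name are assumptions.

```python
import math


def beat_alignment_score(motion_beats, music_beats, sigma=3.0):
    """Average closeness of each music beat to its nearest motion beat.

    Beats are given as frame indices (hypothetical inputs). A perfectly
    aligned pair contributes 1.0, and the contribution decays as a
    Gaussian of the frame offset with width sigma.
    """
    if not motion_beats or not music_beats:
        return 0.0
    total = 0.0
    for music_t in music_beats:
        nearest = min(abs(music_t - motion_t) for motion_t in motion_beats)
        total += math.exp(-(nearest ** 2) / (2 * sigma ** 2))
    return total / len(music_beats)


aligned = beat_alignment_score([10, 20, 30], [10, 20, 30])
offset = beat_alignment_score([10], [13])
```

With identical beat positions the score is 1.0; a three-frame offset with `sigma=3.0` gives exp(-0.5), about 0.607.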

	Motion Quality		Motion Diversity
Method	FID ${}_{k}\downarrow$	FID ${}_{g}\downarrow$	Div ${}_{k}\uparrow$	Div ${}_{g}\uparrow$	BAS $\uparrow$	MMDist $\downarrow$	MModality $\uparrow$
Ground Truth	20.72	11.37	7.42	6.94	0.2105	5.07	-
TM2D [17]	26.78	12.04	6.25	4.41	0.2001	6.13	2.232
MotionCraft [5]	24.21	26.39	7.02	5.79	0.2036	5.82	$\mathbf{2.481}$
Motion Anything	$\mathbf{21.46}$	$\mathbf{11.44}$	$\mathbf{7.04}$	$\mathbf{6.15}$	$\mathbf{0.2094}$	$\mathbf{5.34}$	2.424

🔼 Table 4 presents a quantitative comparison of different methods on the Text-Music-Dance (TMD) dataset. It reports FID (Fréchet Inception Distance), which measures the quality and realism of the generated motion; the diversity metrics Div$_k$ and Div$_g$, which assess the variety of generated motions; BAS (Beat Alignment Score), indicating how well the generated dance aligns with the music; MMDist (Multimodal Distance), measuring the alignment between the text and the motion; and MModality (Multimodality), evaluating the diversity among motions generated from the same text description. The best value for each metric is shown in bold.
Table 4: Quantitative comparison on TMD. The best and runner-up values are bold and underlined.

Method	R Precision $\uparrow$			FID $\downarrow$	MM Dist $\downarrow$	Diversity $\rightarrow$	MModality $\uparrow$
Method	Top 1	Top 2	Top 3	FID $\downarrow$	MM Dist $\downarrow$	Diversity $\rightarrow$	MModality $\uparrow$
Ground Truth	$0.511^{\pm.003}$	$0.703^{\pm.003}$	$0.797^{\pm.002}$	$0.002^{\pm.000}$	$2.974^{\pm.008}$	$9.503^{\pm.065}$	-
Random Masking [21]	${0.522}^{\pm.004}$	${0.714}^{\pm.003}$	${0.818}^{\pm.006}$	${0.049}^{\pm.023}$	${2.945}^{\pm.027}$	${9.633}^{\pm.218}$	${2.538}^{\pm.035}$
KMeans [40]	${0.528}^{\pm.003}$	${0.709}^{\pm.004}$	${0.823}^{\pm.006}$	${0.042}^{\pm.032}$	$\underline{2.871}^{\pm.035}$	${9.549}^{\pm.173}$	${2.548}^{\pm.023}$
GMM [49]	${0.531}^{\pm.002}$	${0.721}^{\pm.004}$	$\underline{0.826}^{\pm.008}$	${0.039}^{\pm.021}$	${2.887}^{\pm.024}$	${9.602}^{\pm.138}$	${2.488}^{\pm.031}$
Confidence-based Masking [45]	${0.524}^{\pm.007}$	${0.731}^{\pm.001}$	${0.818}^{\pm.004}$	${0.047}^{\pm.023}$	${2.928}^{\pm.009}$	${9.530}^{\pm.095}$	${2.574}^{\pm.039}$
Density-based Masking [75]	$\underline{0.538}^{\pm.005}$	$\underline{0.733}^{\pm.002}$	${0.819}^{\pm.006}$	$\underline{0.031}^{\pm.035}$	${2.913}^{\pm.021}$	$\mathbf{9.518}^{\pm.138}$	$\underline{2.608}^{\pm.043}$
Attention-based Masking	$\mathbf{0.546}^{\pm.003}$	$\mathbf{0.735}^{\pm.002}$	$\mathbf{0.829}^{\pm.002}$	$\mathbf{0.028}^{\pm.005}$	$\mathbf{2.859}^{\pm.010}$	$\underline{9.521}^{\pm.083}$	$\mathbf{2.705}^{\pm.068}$

🔼 This ablation study analyzes the effectiveness of different masking strategies on the HumanML3D dataset for text-to-motion generation. It compares the proposed attention-based masking against several alternatives: random masking, KMeans, Gaussian Mixture Model (GMM), confidence-based masking, and density-based masking. Attention-based masking achieves the best or runner-up value on every reported metric, including R-Precision, FID (Fréchet Inception Distance), MultiModal Distance, Diversity, and MultiModality. Higher R-Precision and MultiModality are better, lower FID and MultiModal Distance are better, and for Diversity (marked with a right arrow) values closer to the ground truth are better.
Table 5: Ablation study of the masking strategy on HumanML3D [19]. The best and runner-up values are bold and underlined. The right arrow $\rightarrow$ indicates that closer values to ground truth are better.
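
The "closer to ground truth is better" rule for Diversity is easy to misread as "higher is better", so here is a small sketch that ranks the Table 5 Diversity values by their absolute gap to the ground-truth value of 9.503. The helper name `rank_by_closeness` is hypothetical; the numbers are copied from the table above.

```python
def rank_by_closeness(values, ground_truth):
    """Order method names by how close their value is to the ground truth."""
    return sorted(values, key=lambda name: abs(values[name] - ground_truth))


# Diversity column of Table 5 (HumanML3D); ground truth is 9.503.
table5_diversity = {
    "Random Masking": 9.633,
    "KMeans": 9.549,
    "GMM": 9.602,
    "Confidence-based Masking": 9.530,
    "Density-based Masking": 9.518,
    "Attention-based Masking": 9.521,
}
diversity_ranking = rank_by_closeness(table5_diversity, ground_truth=9.503)
```

The first two entries of `diversity_ranking` are density-based and attention-based masking, matching the bold and underlined cells in that column.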

Method	R Precision $\uparrow$			FID $\downarrow$	MM Dist $\downarrow$	Diversity $\rightarrow$	MModality $\uparrow$
Method	Top 1	Top 2	Top 3	FID $\downarrow$	MM Dist $\downarrow$	Diversity $\rightarrow$	MModality $\uparrow$
Ground Truth	$0.511^{\pm.003}$	$0.703^{\pm.003}$	$0.797^{\pm.002}$	$0.002^{\pm.000}$	$2.974^{\pm.008}$	$9.503^{\pm.065}$	-
T:15% S:15%	${0.523}^{\pm.005}$	${0.716}^{\pm.002}$	${0.818}^{\pm.005}$	${0.047}^{\pm.034}$	${2.920}^{\pm.026}$	${9.625}^{\pm.145}$	${2.580}^{\pm.064}$
T:15% S:30%	${0.529}^{\pm.002}$	${0.718}^{\pm.005}$	${0.822}^{\pm.003}$	${0.044}^{\pm.046}$	${2.914}^{\pm.023}$	${9.573}^{\pm.163}$	${2.631}^{\pm.024}$
T:15% S:50%	${0.530}^{\pm.002}$	${0.715}^{\pm.007}$	${0.820}^{\pm.007}$	${0.045}^{\pm.035}$	${2.918}^{\pm.019}$	${9.632}^{\pm.217}$	${2.611}^{\pm.026}$
T:30% S:15%	${0.535}^{\pm.007}$	$\underline{0.728}^{\pm.001}$	$\underline{0.823}^{\pm.004}$	${0.036}^{\pm.027}$	$\underline{2.873}^{\pm.037}$	$\mathbf{{9.527}}^{\pm.116}$	$\underline{2.709}^{\pm.027}$
T:30% S:30%	$\mathbf{0.546}^{\pm.003}$	$\mathbf{0.735}^{\pm.002}$	$\mathbf{0.829}^{\pm.002}$	$\mathbf{0.028}^{\pm.005}$	$\mathbf{2.859}^{\pm.010}$	$\underline{9.521}^{\pm.083}$	${2.705}^{\pm.068}$
T:30% S:50%	$\underline{0.541}^{\pm.004}$	${0.726}^{\pm.003}$	${0.821}^{\pm.005}$	$\underline{0.033}^{\pm.035}$	${2.926}^{\pm.054}$	${9.519}^{\pm.196}$	$\mathbf{{2.710}}^{\pm.037}$
T:50% S:15%	${0.525}^{\pm.005}$	${0.720}^{\pm.003}$	${0.820}^{\pm.009}$	${0.043}^{\pm.028}$	${2.940}^{\pm.044}$	${9.620}^{\pm.134}$	${2.584}^{\pm.063}$
T:50% S:30%	${0.525}^{\pm.007}$	${0.723}^{\pm.004}$	${0.819}^{\pm.007}$	${0.040}^{\pm.042}$	${2.937}^{\pm.063}$	${9.617}^{\pm.115}$	${2.701}^{\pm.031}$
T:50% S:50%	${0.524}^{\pm.006}$	${0.712}^{\pm.003}$	${0.822}^{\pm.006}$	${0.048}^{\pm.025}$	${2.943}^{\pm.037}$	${9.623}^{\pm.153}$	${2.620}^{\pm.025}$

🔼 This ablation study investigates the impact of the masking ratio on the attention-based masking method in the Motion Anything model, using HumanML3D [19] as the benchmark. Results are reported for different combinations of temporal (T) and spatial (S) masking ratios. Metrics evaluated include R-Precision, FID (Fréchet Inception Distance), MultiModal Distance, Diversity, and MultiModality. Higher R-Precision and lower FID and MultiModal Distance indicate better performance, while Diversity values closer to the ground truth are preferred. The T:30% S:30% setting gives the best R-Precision, FID, and MultiModal Distance.
Table 6: Ablation study of masking ratio on HumanML3D [19]. The best and runner-up values are bold and underlined. The right arrow $\rightarrow$ indicates that closer values to ground truth are better.

Method	R Precision $\uparrow$			FID $\downarrow$	MM Dist $\downarrow$	Diversity $\rightarrow$	MModality $\uparrow$
Method	Top 1	Top 2	Top 3	FID $\downarrow$	MM Dist $\downarrow$	Diversity $\rightarrow$	MModality $\uparrow$
Ground Truth	$0.511^{\pm.003}$	$0.703^{\pm.003}$	$0.797^{\pm.002}$	$0.002^{\pm.000}$	$2.974^{\pm.008}$	$9.503^{\pm.065}$	-
Cross-modal Attention	${0.347}^{\pm.006}$	${0.587}^{\pm.007}$	${0.726}^{\pm.005}$	${0.583}^{\pm.024}$	${3.356}^{\pm.022}$	${9.032}^{\pm.153}$	${2.153}^{\pm.056}$
Motion Anything	$\mathbf{0.546}^{\pm.003}$	$\mathbf{0.735}^{\pm.002}$	$\mathbf{0.829}^{\pm.002}$	$\mathbf{0.028}^{\pm.005}$	$\mathbf{2.859}^{\pm.010}$	$\mathbf{9.521}^{\pm.083}$	$\mathbf{2.705}^{\pm.068}$

🔼 This ablation study investigates the impact of using a Temporal Adaptive Transformer (TAT) in the Motion Anything model for text-to-motion generation on the HumanML3D benchmark. It compares the model’s performance (measured by R Precision, FID, MultiModal Distance, Diversity, and MultiModality) when using the TAT against a baseline where a cross-modal attention layer is used instead. The results help determine if the proposed TAT architecture is crucial for optimal performance in this specific text-to-motion task.
Table 7: Ablation study of the TAT on HumanML3D [19]. The best values are bold. The right arrow $\rightarrow$ indicates that closer values to ground truth are better.

	Motion Quality		Motion Diversity
Method	FID ${}_{k}\downarrow$	FID ${}_{g}\downarrow$	Div ${}_{k}\uparrow$	Div ${}_{g}\uparrow$	BAS $\uparrow$	MMDist $\downarrow$	MModality $\uparrow$
Ground Truth	20.72	11.37	7.42	6.94	0.2105	5.07	-
Motion Anything w/o text	25.07	14.23	6.95	6.01	0.2077	6.24	2.398
Motion Anything	$\mathbf{21.46}$	$\mathbf{11.44}$	$\mathbf{7.04}$	$\mathbf{6.15}$	$\mathbf{0.2094}$	$\mathbf{5.34}$	$\mathbf{2.424}$

🔼 This table presents a comparison of motion generation results using single-modal (music only) versus multi-modal (music and text) conditioning on the TMD dataset. It shows quantitative metrics, such as FID, to evaluate the quality, diversity, and alignment of generated dance movements with music and text. This comparison highlights the impact of incorporating multiple modalities for improved motion generation.
Table 8: Single-modal vs. multimodal generation on TMD dataset.

Method	R Precision $\uparrow$			FID $\downarrow$	MM Dist $\downarrow$	Diversity $\rightarrow$	MModality $\uparrow$
Method	Top 1	Top 2	Top 3	FID $\downarrow$	MM Dist $\downarrow$	Diversity $\rightarrow$	MModality $\uparrow$
Ground Truth	$0.511^{\pm.003}$	$0.703^{\pm.003}$	$0.797^{\pm.002}$	$0.002^{\pm.000}$	$2.974^{\pm.008}$	$9.503^{\pm.065}$	-
$N$ = 2	${0.521}^{\pm.006}$	${0.725}^{\pm.008}$	${0.819}^{\pm.005}$	${0.079}^{\pm.019}$	${2.916}^{\pm.033}$	${9.598}^{\pm.117}$	${2.503}^{\pm.024}$
$N$ = 4	$\mathbf{0.546}^{\pm.003}$	$\mathbf{0.735}^{\pm.002}$	$\mathbf{0.829}^{\pm.002}$	$\mathbf{0.028}^{\pm.005}$	${2.859}^{\pm.010}$	${9.521}^{\pm.083}$	${2.705}^{\pm.068}$
$N$ = 6	${0.541}^{\pm.007}$	${0.733}^{\pm.002}$	${0.826}^{\pm.010}$	${0.029}^{\pm.004}$	${2.861}^{\pm.010}$	$\mathbf{9.517}^{\pm.094}$	${2.673}^{\pm.019}$
$N$ = 8	${0.544}^{\pm.009}$	${0.734}^{\pm.003}$	${0.826}^{\pm.007}$	$\mathbf{0.028}^{\pm.014}$	$\mathbf{2.851}^{\pm.011}$	${9.519}^{\pm.057}$	$\mathbf{2.711}^{\pm.032}$

🔼 This ablation study investigates the impact of varying the number of layers (N) within the masked transformers of the Motion Anything model. The results are evaluated on the HumanML3D [19] benchmark, assessing the influence of different layer configurations on the model’s performance in text-to-motion generation. Metrics such as R-Precision (Top 1, Top 2, Top 3), FID, MultiModal Distance, Diversity, and MultiModality are used to comprehensively evaluate the model’s robustness and efficacy across various layer depths.
Table 9: Ablation study of number of layers on HumanML3D [19].

Datasets	Method	R Precision $\uparrow$			FID $\downarrow$	MultiModal Dist $\downarrow$	Diversity $\rightarrow$	MultiModality $\uparrow$
Datasets	Method	Top 1	Top 2	Top 3	FID $\downarrow$	MultiModal Dist $\downarrow$	Diversity $\rightarrow$	MultiModality $\uparrow$
HumanML3D [19]	Ground Truth	$0.511^{\pm.003}$	$0.703^{\pm.003}$	$0.797^{\pm.002}$	$0.002^{\pm.000}$	$2.974^{\pm.008}$	$9.503^{\pm.065}$	-
	TEMOS [43]	$0.424^{\pm.002}$	$0.612^{\pm.002}$	$0.722^{\pm.002}$	$3.734^{\pm.028}$	$3.703^{\pm.008}$	$8.973^{\pm.071}$	$0.368^{\pm.018}$
	TM2T [20]	$0.424^{\pm.003}$	$0.618^{\pm.003}$	$0.729^{\pm.002}$	$1.501^{\pm.017}$	$3.467^{\pm.011}$	$8.589^{\pm.076}$	$2.424^{\pm.093}$
	T2M [19]	$0.457^{\pm.002}$	$0.639^{\pm.003}$	$0.740^{\pm.003}$	$1.067^{\pm.002}$	$3.340^{\pm.008}$	$9.188^{\pm.002}$	$2.090^{\pm.083}$
	TM2D [17]	$0.319^{\pm.000}$	-	-	$1.021^{\pm.000}$	$4.098^{\pm.000}$	$\mathbf{9.513}^{\pm.000}$	$\mathbf{4.139}^{\pm.000}$
	MotionGPT (Zhang et al.) [74]	$0.364^{\pm.005}$	$0.533^{\pm.003}$	$0.629^{\pm.004}$	$0.805^{\pm.002}$	$3.914^{\pm.013}$	$9.972^{\pm.026}$	$2.473^{\pm.041}$
	MotionDiffuse [72]	${0.491}^{\pm.001}$	${0.681}^{\pm.001}$	${0.782}^{\pm.001}$	$0.630^{\pm.001}$	${3.113}^{\pm.001}$	${9.410}^{\pm.049}$	$1.553^{\pm.042}$
	MDM [53]	$0.320^{\pm.005}$	$0.498^{\pm.004}$	$0.611^{\pm.007}$	${0.544}^{\pm.044}$	$5.566^{\pm.027}$	${9.559}^{\pm.086}$	$2.799^{\pm.072}$
	MotionLLM [61]	$0.482^{\pm.004}$	$0.672^{\pm.003}$	$0.770^{\pm.002}$	$0.491^{\pm.019}$	$3.138^{\pm.010}$	$9.838^{\pm.244}$	-
	MLD [10]	${0.481}^{\pm.003}$	${0.673}^{\pm.003}$	${0.772}^{\pm.002}$	${0.473}^{\pm.013}$	${3.196}^{\pm.010}$	$9.724^{\pm.082}$	${2.413}^{\pm.079}$
	M2DM [30]	${0.497}^{\pm.003}$	${0.682}^{\pm.002}$	${0.763}^{\pm.003}$	${0.352}^{\pm.005}$	${3.134}^{\pm.010}$	$9.926^{\pm.073}$	$\underline{3.587}^{\pm.072}$
	MotionLCM [13]	${0.502}^{\pm.003}$	${0.698}^{\pm.002}$	${0.798}^{\pm.002}$	${0.304}^{\pm.012}$	${3.012}^{\pm.007}$	${9.607}^{\pm.066}$	$2.259^{\pm.092}$
	Motion Mamba [78]	${0.502}^{\pm.003}$	${0.693}^{\pm.002}$	${0.792}^{\pm.002}$	${0.281}^{\pm.009}$	${3.060}^{\pm.058}$	${9.871}^{\pm.084}$	$2.294^{\pm.058}$
	Fg-T2M [59]	${0.492}^{\pm.002}$	${0.683}^{\pm.003}$	${0.783}^{\pm.002}$	${0.243}^{\pm.019}$	${3.109}^{\pm.007}$	${9.278}^{\pm.072}$	$1.614^{\pm.049}$
	MotionGPT (Jiang et al.) [27]	${0.492}^{\pm.003}$	${0.681}^{\pm.003}$	${0.778}^{\pm.002}$	${0.232}^{\pm.008}$	${3.096}^{\pm.008}$	${9.528}^{\pm.071}$	$2.008^{\pm.084}$
	MotionGPT-2 [60]	$0.496^{\pm.002}$	$0.691^{\pm.003}$	$0.782^{\pm.004}$	$0.191^{\pm.004}$	$3.080^{\pm.013}$	$9.860^{\pm.026}$	$2.137^{\pm.022}$
	MotionCraft [5]	${0.501}^{\pm.003}$	${0.697}^{\pm.003}$	${0.796}^{\pm.002}$	${0.173}^{\pm.002}$	${3.025}^{\pm.008}$	${9.543}^{\pm.098}$	-
	FineMoGen [71]	${0.504}^{\pm.002}$	${0.690}^{\pm.002}$	${0.784}^{\pm.002}$	${0.151}^{\pm.008}$	${2.998}^{\pm.008}$	${9.263}^{\pm.094}$	${2.696}^{\pm.079}$
	T2M-GPT [69]	${0.492}^{\pm.003}$	${0.679}^{\pm.002}$	${0.775}^{\pm.002}$	${0.141}^{\pm.005}$	${3.121}^{\pm.009}$	${9.722}^{\pm.082}$	$1.831^{\pm.048}$
	GraphMotion [29]	${0.504}^{\pm.003}$	${0.699}^{\pm.002}$	${0.785}^{\pm.002}$	${0.116}^{\pm.007}$	${3.070}^{\pm.008}$	${9.692}^{\pm.067}$	$2.766^{\pm.096}$
	EMDM [82]	${0.498}^{\pm.007}$	${0.684}^{\pm.006}$	${0.786}^{\pm.006}$	${0.112}^{\pm.019}$	${3.110}^{\pm.027}$	${9.551}^{\pm.078}$	$1.641^{\pm.078}$
	AttT2M [81]	${0.499}^{\pm.003}$	${0.690}^{\pm.002}$	${0.786}^{\pm.002}$	${0.112}^{\pm.006}$	${3.038}^{\pm.007}$	${9.700}^{\pm.090}$	$2.452^{\pm.051}$
	GUESS [16]	${0.503}^{\pm.003}$	${0.688}^{\pm.002}$	${0.787}^{\pm.002}$	${0.109}^{\pm.007}$	${3.006}^{\pm.007}$	${9.826}^{\pm.104}$	$2.430^{\pm.100}$
	ParCo [87]	${0.515}^{\pm.003}$	${0.706}^{\pm.003}$	${0.801}^{\pm.002}$	${0.109}^{\pm.005}$	${2.927}^{\pm.008}$	${9.576}^{\pm.088}$	$1.382^{\pm.060}$
	ReMoDiffuse [70]	${0.510}^{\pm.005}$	${0.698}^{\pm.006}$	${0.795}^{\pm.004}$	${0.103}^{\pm.004}$	${2.974}^{\pm.016}$	${9.018}^{\pm.075}$	$1.795^{\pm.043}$
	MotionCLR [9]	${0.542}^{\pm.001}$	${0.733}^{\pm.002}$	${0.827}^{\pm.003}$	${0.099}^{\pm.003}$	${2.981}^{\pm.011}$	-	$2.145^{\pm.043}$
	StableMoFusion [25]	$\mathbf{0.553}^{\pm.003}$	$\mathbf{0.748}^{\pm.002}$	$\mathbf{0.841}^{\pm.002}$	${0.098}^{\pm.003}$	-	${9.748}^{\pm.092}$	$1.774^{\pm.051}$
	MMM [45]	${0.504}^{\pm.003}$	${0.696}^{\pm.003}$	${0.794}^{\pm.002}$	${0.080}^{\pm.003}$	${2.998}^{\pm.007}$	${9.411}^{\pm.058}$	$1.164^{\pm.041}$
	DiverseMotion [41]	${0.515}^{\pm.003}$	${0.706}^{\pm.002}$	${0.802}^{\pm.002}$	${0.072}^{\pm.004}$	${2.941}^{\pm.007}$	${9.683}^{\pm.102}$	$1.869^{\pm.089}$
	BAD [22]	${0.517}^{\pm.002}$	${0.713}^{\pm.003}$	${0.808}^{\pm.003}$	${0.065}^{\pm.003}$	${2.901}^{\pm.008}$	${9.694}^{\pm.068}$	$1.194^{\pm.044}$
	BAMM [44]	${0.525}^{\pm.002}$	${0.720}^{\pm.003}$	${0.814}^{\pm.003}$	${0.055}^{\pm.002}$	${2.919}^{\pm.008}$	${9.717}^{\pm.089}$	$1.687^{\pm.051}$
	MCM [39]	${0.502}^{\pm.002}$	${0.692}^{\pm.004}$	${0.788}^{\pm.006}$	${0.053}^{\pm.007}$	${3.037}^{\pm.003}$	${9.585}^{\pm.082}$	$0.810^{\pm.023}$
	MoMask [21]	${0.521}^{\pm.002}$	${0.713}^{\pm.002}$	${0.807}^{\pm.002}$	${0.045}^{\pm.002}$	${2.958}^{\pm.008}$	-	$1.241^{\pm.040}$
	LMM [73]	${0.525}^{\pm.002}$	${0.719}^{\pm.002}$	${0.811}^{\pm.002}$	${0.040}^{\pm.002}$	${2.943}^{\pm.012}$	$9.814^{\pm.076}$	$2.683^{\pm.054}$
	MoGenTS [64]	${0.529}^{\pm.003}$	${0.719}^{\pm.002}$	${0.812}^{\pm.002}$	$\underline{0.033}^{\pm.001}$	$\underline{2.867}^{\pm.006}$	$9.570^{\pm.077}$	-
	Motion Anything (Ours)	$\underline{0.546}^{\pm.003}$	$\underline{0.735}^{\pm.002}$	$\underline{0.829}^{\pm.002}$	$\mathbf{0.028}^{\pm.005}$	$\mathbf{2.859}^{\pm.010}$	$\underline{9.521}^{\pm.083}$	$2.705^{\pm.068}$
KIT-ML [46]	Ground Truth	$0.424^{\pm.005}$	$0.649^{\pm.006}$	$0.779^{\pm.006}$	$0.031^{\pm.004}$	$2.788^{\pm.012}$	$11.08^{\pm.097}$	-
	TEMOS [43]	$0.353^{\pm.006}$	$0.561^{\pm.007}$	$0.687^{\pm.005}$	$3.717^{\pm.051}$	$3.417^{\pm.019}$	$10.84^{\pm.100}$	$0.532^{\pm.034}$
	TM2T [20]	$0.280^{\pm.005}$	$0.463^{\pm.006}$	$0.587^{\pm.005}$	$3.599^{\pm.153}$	$4.591^{\pm.026}$	$9.473^{\pm.117}$	$3.292^{\pm.081}$
	T2M [19]	$0.370^{\pm.005}$	$0.569^{\pm.007}$	$0.693^{\pm.007}$	$2.770^{\pm.109}$	$3.401^{\pm.008}$	$10.91^{\pm.119}$	$1.482^{\pm.065}$
	MotionDiffuse [72]	${0.417}^{\pm.004}$	${0.621}^{\pm.004}$	${0.739}^{\pm.004}$	$1.954^{\pm.062}$	${2.958}^{\pm.005}$	${11.10}^{\pm.143}$	$0.753^{\pm.013}$
	MDM [53]	$0.164^{\pm.004}$	$0.291^{\pm.004}$	$0.396^{\pm.004}$	${0.497}^{\pm.021}$	$9.190^{\pm.022}$	${10.85}^{\pm.109}$	$1.907^{\pm.214}$
	MLD [10]	${0.390}^{\pm.003}$	${0.609}^{\pm.003}$	${0.734}^{\pm.002}$	${0.404}^{\pm.013}$	${3.204}^{\pm.010}$	$10.80^{\pm.082}$	${2.192}^{\pm.079}$
	M2DM [30]	${0.405}^{\pm.003}$	${0.629}^{\pm.005}$	${0.739}^{\pm.004}$	${0.502}^{\pm.049}$	${3.012}^{\pm.015}$	$11.38^{\pm.079}$	$\underline{3.273}^{\pm.045}$
	Motion Mamba [78]	${0.419}^{\pm.006}$	${0.645}^{\pm.005}$	${0.765}^{\pm.006}$	${0.307}^{\pm.041}$	${3.021}^{\pm.025}$	${11.02}^{\pm.098}$	$1.678^{\pm.064}$
	Fg-T2M [59]	${0.418}^{\pm.005}$	${0.626}^{\pm.004}$	${0.745}^{\pm.004}$	${0.571}^{\pm.047}$	${3.114}^{\pm.015}$	${10.93}^{\pm.083}$	$1.019^{\pm.029}$
	MotionGPT (Zhang et al.) [74]	$0.340^{\pm.002}$	$0.570^{\pm.003}$	$0.660^{\pm.004}$	$0.868^{\pm.032}$	$3.721^{\pm.018}$	$9.972^{\pm.026}$	$2.296^{\pm.022}$
	MotionGPT (Jiang et al.) [27]	${0.366}^{\pm.005}$	${0.558}^{\pm.004}$	${0.680}^{\pm.005}$	${0.510}^{\pm.016}$	${3.527}^{\pm.021}$	${10.35}^{\pm.084}$	$2.328^{\pm.117}$
	MotionGPT-2 [60]	$0.427^{\pm.003}$	$0.627^{\pm.002}$	$0.764^{\pm.003}$	$0.614^{\pm.005}$	$3.164^{\pm.013}$	$11.26^{\pm.026}$	$2.357^{\pm.022}$
	FineMoGen [71]	${0.432}^{\pm.006}$	${0.649}^{\pm.005}$	${0.772}^{\pm.006}$	${0.178}^{\pm.007}$	${2.869}^{\pm.014}$	${10.85}^{\pm.115}$	${1.877}^{\pm.093}$
	T2M-GPT [69]	${0.416}^{\pm.006}$	${0.627}^{\pm.006}$	${0.745}^{\pm.006}$	${0.514}^{\pm.029}$	${3.007}^{\pm.023}$	${10.92}^{\pm.108}$	$1.570^{\pm.039}$
	GraphMotion [29]	${0.417}^{\pm.008}$	${0.635}^{\pm.006}$	${0.755}^{\pm.004}$	${0.262}^{\pm.021}$	${3.085}^{\pm.031}$	${11.21}^{\pm.106}$	$\mathbf{3.568}^{\pm.132}$
	EMDM [82]	${0.443}^{\pm.006}$	${0.660}^{\pm.006}$	${0.780}^{\pm.005}$	${0.261}^{\pm.014}$	${2.874}^{\pm.015}$	${10.96}^{\pm.093}$	$1.343^{\pm.089}$
	AttT2M [81]	${0.413}^{\pm.006}$	${0.632}^{\pm.006}$	${0.751}^{\pm.006}$	${0.870}^{\pm.039}$	${3.039}^{\pm.021}$	${10.96}^{\pm.123}$	$2.281^{\pm.047}$
	GUESS [16]	${0.425}^{\pm.005}$	${0.632}^{\pm.007}$	${0.751}^{\pm.005}$	${0.371}^{\pm.020}$	${2.421}^{\pm.022}$	${10.93}^{\pm.110}$	$2.732^{\pm.084}$
	ParCo [87]	${0.430}^{\pm.004}$	${0.649}^{\pm.007}$	${0.772}^{\pm.006}$	${0.453}^{\pm.027}$	${2.820}^{\pm.028}$	${10.95}^{\pm.094}$	$1.245^{\pm.022}$
	ReMoDiffuse [70]	${0.427}^{\pm.014}$	${0.641}^{\pm.004}$	${0.765}^{\pm.055}$	${0.155}^{\pm.006}$	${2.814}^{\pm.012}$	${10.80}^{\pm.105}$	$1.239^{\pm.028}$
	StableMoFusion [25]	$\underline{0.445}^{\pm.006}$	${0.660}^{\pm.005}$	${0.782}^{\pm.004}$	${0.258}^{\pm.029}$	-	${10.94}^{\pm.077}$	$1.362^{\pm.062}$
	MMM [45]	${0.404}^{\pm.005}$	${0.621}^{\pm.005}$	${0.744}^{\pm.004}$	${0.316}^{\pm.028}$	${2.977}^{\pm.019}$	${10.91}^{\pm.101}$	$1.232^{\pm.039}$
	DiverseMotion [41]	${0.416}^{\pm.006}$	${0.637}^{\pm.008}$	${0.760}^{\pm.011}$	${0.468}^{\pm.098}$	${2.892}^{\pm.041}$	${10.87}^{\pm.101}$	$2.062^{\pm.079}$
	BAD [22]	${0.417}^{\pm.006}$	${0.631}^{\pm.006}$	${0.750}^{\pm.006}$	${0.221}^{\pm.012}$	${2.941}^{\pm.025}$	$\mathbf{11.00}^{\pm.100}$	$1.170^{\pm.047}$
	BAMM [44]	${0.438}^{\pm.009}$	${0.661}^{\pm.009}$	${0.788}^{\pm.005}$	${0.183}^{\pm.013}$	${2.723}^{\pm.026}$	$\underline{11.01}^{\pm.094}$	$1.609^{\pm.065}$
	MoMask [21]	${0.433}^{\pm.007}$	${0.656}^{\pm.005}$	${0.781}^{\pm.005}$	${0.204}^{\pm.011}$	${2.779}^{\pm.022}$	-	$1.131^{\pm.043}$
	LMM [73]	${0.430}^{\pm.015}$	${0.653}^{\pm.017}$	${0.779}^{\pm.014}$	$\underline{0.137}^{\pm.023}$	${2.791}^{\pm.018}$	$11.24^{\pm.103}$	${1.885}^{\pm.127}$
	MoGenTS [64]	$\underline{0.445}^{\pm.006}$	$\underline{0.671}^{\pm.006}$	$\underline{0.797}^{\pm.005}$	${0.143}^{\pm.004}$	$\underline{2.711}^{\pm.024}$	$10.92^{\pm.090}$	-
	Motion Anything (Ours)	$\mathbf{0.449}^{\pm.007}$	$\mathbf{0.678}^{\pm.004}$	$\mathbf{0.802}^{\pm.006}$	$\mathbf{0.131}^{\pm.003}$	$\mathbf{2.705}^{\pm.024}$	${10.94}^{\pm.098}$	${1.374}^{\pm.069}$

🔼 Table 1 compares the proposed Motion Anything model against state-of-the-art text-to-motion methods on the HumanML3D and KIT-ML benchmarks. The metrics are R-Precision at Top-1, Top-2, and Top-3 (higher is better), Fréchet Inception Distance (FID; lower is better), MultiModal Distance (lower is better), Diversity (closer to the ground-truth value is better), and MultiModality (higher is better). Motion Anything's FID and MultiModal Distance are marked best on both benchmarks, and its R-Precision is marked best on KIT-ML. Multimodal motion generation methods are highlighted in blue. Bold and underlined values mark the best and runner-up results for each metric.
Table 1: Comprehensive comparison on HumanML3D [19] and KIT-ML [46]. The best and runner-up values are bold and underlined. The right arrow $\rightarrow$ indicates that closer values to ground truth are better. Multimodal motion generation methods are highlighted in blue.
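
As a rough illustration of how the R-Precision columns are typically computed in text-to-motion evaluation, the sketch below ranks candidate motions by Euclidean distance to each text embedding and counts how often the paired motion lands in the top k. The function name `r_precision`, the toy embeddings, and the use of the whole batch as the candidate pool are assumptions for illustration; the benchmark's actual evaluator relies on learned text and motion encoders and a fixed pool size, which this sketch does not reproduce.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def r_precision(text_embs, motion_embs, top_k=3):
    """Return [Top-1, ..., Top-k] retrieval accuracy.

    text_embs[i] is assumed to be paired with motion_embs[i]; every
    motion in the batch serves as a retrieval candidate (an assumption).
    """
    n = len(text_embs)
    hits = [0] * top_k
    for i, text in enumerate(text_embs):
        dists = [euclidean(text, motion) for motion in motion_embs]
        # Candidate indices sorted from nearest to farthest.
        ranked = sorted(range(n), key=lambda j: dists[j])
        rank = ranked.index(i)  # 0 means the paired motion is nearest
        for k in range(top_k):
            if rank <= k:
                hits[k] += 1
    return [h / n for h in hits]


# Toy embeddings (hypothetical values, not taken from any dataset).
texts = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
motions = [[0.1, 0.0], [0.9, 0.1], [0.6, 0.1]]
scores = r_precision(texts, motions)
print(scores)
```

In this toy batch the third text retrieves its paired motion only at rank 2, so Top-1 is 2/3 while Top-2 and Top-3 are 1.0.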

Method	$S$ $\rightarrow 1$	AIT(s) $\downarrow$
MagicPose4D	$1.93$	$0.138$
SRM ( $k=1$ )	$1.78$	$0.094$
SRM ( $k=3$ )	$1.36$	$0.105$
SRM ( $k=5$ )	$1.06$	$0.117$

🔼 Table 2 presents a quantitative evaluation of the Selective Rigging Mechanism (SRM) used in 4D avatar generation. It compares SRM with different numbers of candidate avatars (k=1, 3, and 5) against MagicPose4D on two measures: Average Inference Time (AIT, in seconds) and a rigging quality score $S$ based on the average joint weight sum. Lower AIT means faster processing, while values of $S$ closer to the ideal value of 1 imply better rigging quality and stability.
Table 2: SRM evaluation.
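
One possible reading of the $S$ column can be sketched as follows, assuming that $S$ is the mean, over mesh vertices, of the summed skinning weights assigned to all joints. That definition, the function name `average_weight_sum`, and the toy weights are assumptions for illustration and are not taken from the paper.

```python
def average_weight_sum(skin_weights):
    """Mean over vertices of the per-vertex sum of joint weights.

    skin_weights[v][j] is the influence of joint j on vertex v
    (a hypothetical layout chosen for this sketch).
    """
    row_sums = [sum(row) for row in skin_weights]
    return sum(row_sums) / len(row_sums)


# Three vertices, two joints; the last two rows are deliberately
# over-weighted so that the score drifts above the ideal value of 1.
weights = [
    [0.5, 0.5],
    [0.7, 0.5],
    [1.0, 0.2],
]
s = average_weight_sum(weights)
deviation = abs(s - 1.0)  # distance from the ideal value
print(s, deviation)
```

Under this reading, a perfectly normalized rig gives a score of exactly 1 and a deviation of 0.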
