Learning Uniformly Distributed Embedding Clusters of Stylistic Skills for Physically Simulated Characters



Main Video: Our controller not only generates high-quality, diverse motions covering the entire dataset but also achieves superior controllability, serving as a cornerstone for diverse applications.

Abstract

Learning natural and diverse behaviors from human motion datasets remains a significant challenge in physics-based character control. Existing conditional adversarial models often suffer from tight and biased embedding distributions, where embeddings from the same motion are closely grouped in a small area and shorter motions occupy even less space. Our empirical observations indicate that this limits the representational capacity and diversity under each skill. An ideal latent space should be maximally packed by all motions' embedding clusters. Although methods that employ a separate embedding space for each motion mitigate this limitation to some extent, introducing a hybrid discrete-continuous embedding space imposes a huge exploration burden on the high-level policy. To address the above limitations, we propose a versatile skill-conditioned controller that learns diverse skills with expressive variations. Our approach leverages the Neural Collapse phenomenon, a natural outcome of the classification-based encoder, to obtain uniformly distributed cluster centers. We additionally propose a novel Embedding Expansion technique to form stylistic embedding clusters for diverse skills that are uniformly distributed on a hypersphere, maximizing the representational area occupied by each skill and minimizing unmapped regions. This maximally packed and uniformly distributed embedding space ensures that embeddings within the same cluster generate behaviors conforming to the characteristics of the corresponding motion clips, while still exhibiting noticeable variations within each cluster. Compared to existing methods, experimental results demonstrate that our controller not only generates high-quality, diverse motions covering the entire dataset but also achieves superior controllability, motion coverage, and diversity under each skill.
Both qualitative and quantitative results confirm these traits, enabling our controller to be applied to a wide range of downstream tasks and serving as a cornerstone for diverse applications.

Pipeline


Our method uses a unit hypersphere as the embedding space, featuring uniformly distributed embedding clusters for each skill. We first employ a classification-based encoder to distribute motion features uniformly on a high-dimensional sphere, then apply conditional imitation learning with the Embedding Expansion technique to form a stylistic embedding cluster for each skill, achieving a maximally packed and uniformly distributed space.
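Under Neural Collapse, the class means produced by a classification-based encoder converge to a simplex Equiangular Tight Frame (ETF): unit-norm vectors that are maximally and uniformly spread on the hypersphere. As a minimal illustration of that target geometry (not the paper's actual training code), the sketch below constructs such a set of cluster centers directly:

```python
import numpy as np

def simplex_etf_centers(K, d, seed=0):
    """Construct K unit-norm cluster centers forming a simplex ETF,
    the geometry Neural Collapse drives class means toward: every
    pair of centers has cosine similarity exactly -1/(K-1)."""
    assert d >= K - 1, "need at least K-1 ambient dimensions"
    rng = np.random.default_rng(seed)
    # Random orthonormal basis P (d x K) via reduced QR decomposition.
    P, _ = np.linalg.qr(rng.standard_normal((d, K)))
    # Center the simplex (subtract the column mean), then normalize columns.
    M = P @ (np.eye(K) - np.ones((K, K)) / K)
    M /= np.linalg.norm(M, axis=0, keepdims=True)
    return M  # columns are the K cluster centers

centers = simplex_etf_centers(K=5, d=64)
cos = centers.T @ centers  # off-diagonal entries all equal -1/(K-1) = -0.25
```

Because the pairwise angles are all equal and as large as possible, no skill's center crowds another, which is what lets each cluster later expand to occupy a maximal patch of the sphere.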

1. Learning Uniformly Distributed Cluster Centers with Neural Collapse

2. Conditional Adversarial Imitation Learning with Embedding Expansion
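The page does not spell out how Embedding Expansion samples embeddings around each center, so the following is a hypothetical sketch of the general idea: grow each skill's single center into a cluster by perturbing it in the tangent space of the unit hypersphere and re-projecting, so the skill occupies a patch of the sphere rather than a point. The function name and the `radius` parameter are illustrative assumptions, not the paper's API.

```python
import numpy as np

def expand_cluster(center, n_samples, radius=0.3, seed=0):
    """Hypothetical Embedding Expansion sketch: sample skill embeddings
    around a unit-norm cluster center by adding tangent-space noise and
    re-normalizing back onto the unit hypersphere."""
    rng = np.random.default_rng(seed)
    d = center.shape[0]
    noise = rng.standard_normal((n_samples, d)) * radius
    # Remove the radial component so the perturbation is tangent to the sphere.
    noise -= np.outer(noise @ center, center)
    samples = center + noise
    return samples / np.linalg.norm(samples, axis=1, keepdims=True)

center = np.zeros(8)
center[0] = 1.0
z = expand_cluster(center, n_samples=4)
# every sample is unit-norm and stays on the center's side of the sphere
```

Conditioning the low-level policy on samples drawn this way is what would let behaviors within one cluster match the corresponding motion clip while still varying noticeably from each other.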