UniMo AI Unifies 2D Videos and 3D Motions

In the ever-evolving landscape of artificial intelligence and machine learning, a new breakthrough is set to redefine how we understand and generate human motion. Researchers have introduced UniMo, an innovative autoregressive model that unifies the modeling of 2D human videos and 3D human motions within a single framework. This advancement marks a significant departure from current methods, which typically focus on generating one modality based on another or integrating these modalities with text and audio.

The challenge of unifying 2D videos and 3D motions lies in their substantial structural and distributional differences. However, inspired by the ability of large language models (LLMs) to unify different modalities, the researchers behind UniMo have devised a method that treats videos and 3D motions as a unified sequence of tokens. By using separate embedding layers, they effectively mitigate the distribution gaps between these modalities.
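The idea of separate embedding layers feeding one token sequence can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the codebook sizes, model width, and variable names below are assumptions, and real video/motion tokens would come from learned tokenizers rather than hand-picked indices.

```python
import numpy as np

# Hypothetical sketch of a unified token sequence: video tokens and 3D-motion
# tokens come from different codebooks, so each modality gets its own embedding
# table, but both map into one shared model dimension and are concatenated into
# a single autoregressive sequence. All sizes here are illustrative.

rng = np.random.default_rng(0)
D_MODEL = 16          # shared transformer width (assumed)
VIDEO_VOCAB = 1024    # video token codebook size (assumed)
MOTION_VOCAB = 512    # motion token codebook size (assumed)

video_embed = rng.normal(size=(VIDEO_VOCAB, D_MODEL))    # video embedding table
motion_embed = rng.normal(size=(MOTION_VOCAB, D_MODEL))  # motion embedding table

video_tokens = np.array([3, 17, 101])  # toy quantized video tokens
motion_tokens = np.array([5, 44])      # toy quantized motion tokens

# Separate lookups mitigate the distribution gap; the embedded vectors
# then live in one shared space and form a single sequence.
seq = np.concatenate([video_embed[video_tokens], motion_embed[motion_tokens]])
print(seq.shape)  # one unified sequence: (5, 16)
```

The key design point mirrored here is that only the embedding lookup is modality-specific; everything downstream operates on one homogeneous sequence.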

A standout feature of UniMo is its sequence modeling strategy, which integrates two distinct tasks within a single framework and demonstrates the effectiveness of unified modeling. To improve alignment with visual tokens and preserve 3D spatial information, the researchers also designed a novel 3D motion tokenizer. It applies a temporal expansion strategy and uses a single Vector Quantized Variational Autoencoder (VQ-VAE) to produce quantized motion tokens. The VQ-VAE features multiple expert decoders, each handling a different aspect of 3D motion (body shape, translation, global orientation, and body pose), which ensures reliable 3D motion reconstruction.
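The tokenizer's two core pieces, vector quantization and per-component expert decoders, can be illustrated with a minimal sketch. This is not the paper's implementation: the latent dimension, codebook size, and the output widths of each expert head are assumptions chosen for illustration (e.g. a 10-dimensional shape vector in the style of SMPL body-shape parameters).

```python
import numpy as np

# Illustrative sketch of a VQ-style motion tokenizer: a continuous motion
# latent is snapped to its nearest codebook entry (vector quantization), and
# the quantized vector is decoded by separate "expert" heads, one per motion
# component. Dimensions and head names are assumptions, not from the paper.

rng = np.random.default_rng(1)
CODEBOOK_SIZE, LATENT_DIM = 64, 8
codebook = rng.normal(size=(CODEBOOK_SIZE, LATENT_DIM))

def quantize(z):
    """Vector quantization: return the index of the nearest codebook vector."""
    dists = np.linalg.norm(codebook - z, axis=1)
    return int(np.argmin(dists))

# Expert decoders modeled as simple linear heads with component-specific
# output sizes (hypothetical widths).
experts = {
    "shape": rng.normal(size=(LATENT_DIM, 10)),       # body shape parameters
    "translation": rng.normal(size=(LATENT_DIM, 3)),  # root translation
    "orientation": rng.normal(size=(LATENT_DIM, 3)),  # global orientation
    "pose": rng.normal(size=(LATENT_DIM, 63)),        # body joint rotations
}

z = rng.normal(size=LATENT_DIM)  # pretend encoder output for one motion step
token = quantize(z)              # the discrete motion token fed to the LLM
decoded = {name: codebook[token] @ W for name, W in experts.items()}
for name, out in decoded.items():
    print(name, out.shape)
```

Splitting the decoder into experts lets each motion component be reconstructed by a head specialized for its scale and structure, rather than forcing one decoder to regress shape, translation, orientation, and pose jointly.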

Extensive experiments have demonstrated that UniMo can simultaneously generate corresponding videos and motions while performing accurate motion capture. This work not only taps into the capacity of LLMs to fuse diverse data types but also paves the way for integrating human-centric information into existing models. The potential applications are vast, potentially enabling multimodal, controllable joint modeling of humans, objects, and scenes.

The introduction of UniMo represents a significant leap forward in the field of human motion modeling. By unifying 2D videos and 3D motions, this innovative framework opens up new possibilities for generating and understanding human movement. As researchers continue to explore and refine these capabilities, we can expect to see even more sophisticated applications that bridge the gap between different data modalities, ultimately enhancing our ability to interact with and understand the world around us.
