HK Researchers Revolutionize Talking Face Tech with Short Clips

Imagine creating a realistic, animated talking portrait of someone using just a few seconds of video instead of several minutes. That’s the promise of a new approach developed by researchers at the University of Hong Kong. Their work challenges the conventional wisdom in Talking Face Generation (TFG), a technology with applications ranging from digital education to film production.

Traditionally, TFG methods rely on Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS) to generate lifelike talking videos. These methods require processing and fitting several minutes of reference video to capture sufficient 3D information and learn the lip-audio mapping. This process is not only time-consuming but also computationally intensive, limiting the practical use of these methods.

The researchers’ exploratory case studies reveal that using short, informative video segments of just a few seconds can achieve performance comparable to or even better than full-length reference videos. This suggests that the quality of the video segments is more crucial than their length. Inspired by this finding, they developed ISExplore, a simple yet effective segment selection strategy. ISExplore automatically identifies the most informative 5-second reference video segment based on three key data quality dimensions: audio feature diversity, lip movement amplitude, and the number of camera views.
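To make the idea of scoring candidate segments concrete, here is a minimal Python sketch of a sliding-window selector. It is not the researchers' implementation: the article does not say how each quality dimension is measured or how the three are combined. The variance-based audio score, the lip-opening range, the yaw-bin count used as a stand-in for camera views, the min-max normalization, the equal weighting, and every function and parameter name below are assumptions made for illustration.

```python
from statistics import pvariance


def audio_diversity(feats):
    """Mean per-dimension variance of the audio feature vectors (assumed proxy)."""
    dims = list(zip(*feats))
    return sum(pvariance(d) for d in dims) / len(dims)


def lip_amplitude(openings):
    """Range of the per-frame lip-opening signal (assumed proxy)."""
    return max(openings) - min(openings)


def view_count(yaws, bin_deg=10.0):
    """Distinct head-yaw bins in degrees, a stand-in for distinct camera views."""
    return len({int(y // bin_deg) for y in yaws})


def min_max(values):
    """Rescale scores to [0, 1] so the three dimensions are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def select_segment(audio_feats, lip_openings, yaws, fps=25, seconds=5, stride=None):
    """Return (start, end) frame indices of the highest-scoring window.

    Inputs are per-frame sequences of equal length. The equal-weight sum of
    normalized scores is an illustrative choice, not taken from the paper.
    """
    n = len(lip_openings)
    if not (len(audio_feats) == n == len(yaws)):
        raise ValueError("per-frame inputs must have the same length")
    win = fps * seconds
    if n < win:
        raise ValueError("video is shorter than the requested segment")
    stride = stride or fps  # default: slide the window one second at a time
    starts = list(range(0, n - win + 1, stride))

    audio = min_max([audio_diversity(audio_feats[s:s + win]) for s in starts])
    lips = min_max([lip_amplitude(lip_openings[s:s + win]) for s in starts])
    views = min_max([view_count(yaws[s:s + win]) for s in starts])

    scores = [a + l + v for a, l, v in zip(audio, lips, views)]
    best = max(range(len(starts)), key=scores.__getitem__)
    return starts[best], starts[best] + win
```

In practice the per-frame inputs would come from an audio encoder, a facial landmark detector, and a head-pose estimator; the sketch only covers the step of ranking candidate windows once those signals exist.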

Extensive experiments demonstrate that ISExplore speeds up data processing and training for NeRF and 3DGS methods by more than five times while maintaining high-fidelity output. This result could make TFG more accessible and efficient. The project resources are available for further exploration and implementation, inviting the broader community to build on this work.

In essence, the researchers have shown that less can be more when it comes to generating realistic talking faces. By focusing on the quality of video segments rather than their length, they have paved the way for faster, more efficient, and equally effective TFG methods. This development could have far-reaching implications for various industries, from education to entertainment, making the technology more practical and widely applicable.
