ATL-Diff: Audio-Driven Talking Head Generation using Early Landmark Guide Noise Diffusion
date
Dec 5, 2024
slug
pub-atl-diff
status
Published
tags
Publication
Deep Learning
summary
Audio-driven talking head generation presents significant challenges in creating realistic facial animations that accurately synchronize with audio signals. This paper introduces ATL-Diff, a novel approach that addresses key limitations in existing methods through an innovative three-component framework.
type
Post
[Src code]
[Slide]
[Paper]
[Proceedings]
Abstract:
Audio-driven talking head generation presents significant challenges in creating realistic facial animations that accurately synchronize with audio signals. This paper introduces ATL-Diff, a novel approach that addresses key limitations in existing methods through an innovative three-component framework. The Landmark Generation Module constructs a sequence of facial landmarks from the audio. The Landmarks Guide Noise approach adds movement information by distributing the noise according to those landmarks, which isolates the audio signal from the diffusion model. A 3D Identity Diffusion network preserves the identity characteristics. Experimental validation on the MEAD and CREMA-D datasets demonstrates the method's superior performance: ATL-Diff significantly outperforms state-of-the-art techniques across all critical metrics. The approach achieves near real-time processing, generating high-quality facial animations with exceptional computational efficiency and strong preservation of individual facial nuances. By bridging audio signals and facial movements with high precision, this research advances talking head generation technologies, with promising applications in virtual assistants, education, medical communication, and emerging digital platforms.
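To make the landmarks-guide-noise idea concrete, here is a minimal conceptual sketch of how diffusion noise could be biased toward landmark positions so that motion information enters through the noise rather than through the audio branch. All function and parameter names here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def landmarks_guide_noise(landmarks, frame_shape, sigma=0.1, seed=0):
    """Conceptual sketch (not the authors' API): bias the initial
    diffusion noise with Gaussian bumps centered on facial landmarks,
    so the noise itself carries the movement information."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(frame_shape)  # base Gaussian noise
    h, w = frame_shape
    ys, xs = np.mgrid[0:h, 0:w]
    guide = np.zeros(frame_shape)
    for lx, ly in landmarks:
        # Gaussian bump at each normalized (x, y) landmark position
        d2 = (xs / w - lx) ** 2 + (ys / h - ly) ** 2
        guide += np.exp(-d2 / (2 * sigma ** 2))
    guide /= guide.max() + 1e-8  # normalize the guidance map
    return noise + guide  # landmark-distributed noise

# Toy usage: three landmarks (two eyes, mouth) on a 64x64 frame.
lm = np.array([[0.3, 0.4], [0.7, 0.4], [0.5, 0.7]])
eps = landmarks_guide_noise(lm, (64, 64))
print(eps.shape)  # (64, 64)
```

The point of the sketch is the separation of concerns: the diffusion backbone only ever sees landmark-conditioned noise, never raw audio features.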
Overview Architecture:

Qualitative results:

Qualitative comparisons with the state of the art. We show two examples with source audio from the CREMA-D dataset; the identity samples used are 1101 and 1083, respectively. The red box highlights a weakness of the previous method, which struggles to reconstruct fine details in the eye area; the blue box shows that our method reconstructs the mouth movements in detail.
