ATL-Diff: Audio-Driven Talking Head Generation using Early Landmark Guide Noise Diffusion
date
Dec 5, 2024
slug
pub-atl-diff
status
Published
tags
Publication
Deep Learning
summary
Audio-driven talking head generation presents significant challenges in creating realistic facial animations that accurately synchronize with audio signals. This paper introduces ATL-Diff, a novel approach that addresses key limitations in existing methods through an innovative three-component framework.
type
Post
[Src code]
[Slide]
[Paper]
[Proceedings]
Abstract:
Audio-driven talking head generation presents significant challenges in creating realistic facial animations that accurately synchronize with audio signals. This paper introduces ATL-Diff, a novel approach that addresses key limitations in existing methods through an innovative three-component framework. The Landmark Generation Module constructs a sequence of facial landmarks from the audio. The Landmarks Guide Noise approach adds movement information by distributing the noise according to those landmarks, which isolates the audio signal from the diffusion model. A 3D Identity Diffusion network preserves the identity characteristics. Experimental validation on the MEAD and CREMA-D datasets demonstrates the method's superior performance: ATL-Diff significantly outperforms state-of-the-art techniques across all critical metrics. The approach achieves near real-time processing, generating high-quality facial animations with exceptional computational efficiency and strong preservation of individual facial nuances. By bridging audio signals and facial movements with high precision, this research advances talking head generation technologies, with promising applications in virtual assistants, education, medical communication, and emerging digital platforms.
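To make the landmarks-guide-noise idea concrete, here is a minimal conceptual sketch of how diffusion noise could be biased toward landmark positions so that motion information enters through the noise rather than through the audio branch. All function and parameter names here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def landmarks_guide_noise(landmarks, frame_shape, sigma=0.1, seed=0):
    """Conceptual sketch (not the authors' API): bias the initial
    diffusion noise with Gaussian bumps centered on facial landmarks,
    so the noise itself carries the movement information."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(frame_shape)  # base Gaussian noise
    h, w = frame_shape
    ys, xs = np.mgrid[0:h, 0:w]
    guide = np.zeros(frame_shape)
    for lx, ly in landmarks:
        # Gaussian bump at each normalized (x, y) landmark position
        d2 = (xs / w - lx) ** 2 + (ys / h - ly) ** 2
        guide += np.exp(-d2 / (2 * sigma ** 2))
    guide /= guide.max() + 1e-8  # normalize the guidance map
    return noise + guide  # landmark-distributed noise

# Toy usage: three landmarks (two eyes, mouth) on a 64x64 frame.
lm = np.array([[0.3, 0.4], [0.7, 0.4], [0.5, 0.7]])
eps = landmarks_guide_noise(lm, (64, 64))
print(eps.shape)  # (64, 64)
```

The point of the sketch is the separation of concerns: the diffusion backbone only ever sees landmark-conditioned noise, never raw audio features.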
Overview Architecture:

Qualitative results:

Qualitative comparisons with the state of the art. We show two examples with source audio from the CREMA-D dataset; the identity samples used are 1101 and 1083, respectively. The red box highlights a weakness of the previous method, which struggles to reconstruct fine details in the eye area; the blue box shows that our method reconstructs the mouth movements in detail.
