Speech-Driven Facial Animation with Spectral Gathering and Temporal Attention
1   State Key Lab of CAD&CG, Zhejiang University, China
2   FaceUnity Technology Inc., China
Published: 27 Sep, 2021
Abstract
In this paper, we present an efficient algorithm that generates lip-synchronized facial animation from a given vocal audio clip. By combining a spectral-dimensional bidirectional long short-term memory (LSTM) network with a temporal attention mechanism, we design a lightweight speech encoder that learns useful and robust vocal features from the input audio without resorting to pre-trained speech recognition modules or large training datasets. To learn subject-independent facial motion, we use deformation gradients as the internal representation, which allows nuanced local motions to be synthesized better than with vertex offsets. Compared with state-of-the-art methods based on automatic speech recognition, our model is much smaller yet achieves comparable robustness and quality in most cases, and noticeably better results in certain challenging ones.
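To give a feel for the temporal attention mechanism mentioned above, the sketch below pools a sequence of per-frame vocal features into a single context vector via learned attention weights. This is a minimal NumPy illustration, not the paper's actual encoder: the function and variable names (`temporal_attention`, `H`, `w`) are our own, and the real model combines this with a spectral-dimensional bidirectional LSTM.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def temporal_attention(H, w):
    """Pool per-frame features H (T x d) into one d-dim context vector.

    Each frame gets a scalar relevance score from a learned weight
    vector w (d,); the scores are normalized into an attention
    distribution, and the frames are averaged with those weights.
    Illustrative sketch only, not the authors' implementation.
    """
    scores = H @ w              # one scalar score per frame, shape (T,)
    alpha = softmax(scores)     # attention weights over time, sum to 1
    context = alpha @ H         # attention-weighted sum of frames, (d,)
    return context, alpha

# toy example: 5 audio frames of 8-dimensional spectral features
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))
w = rng.normal(size=8)
context, alpha = temporal_attention(H, w)
```

The attention weights let the encoder emphasize the frames most informative for the current mouth shape rather than weighting all frames equally.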
Supplementary Video
Citation
@article{chai2022speech,
title={Speech-driven facial animation with spectral gathering and temporal attention},
author={Chai, Yujin and Weng, Yanlin and Wang, Lvdi and Zhou, Kun},
journal={Frontiers of Computer Science},
volume={16},
number={3},
pages={1--10},
year={2022},
publisher={Springer}
}