With the increasing use of avatars powered by Large Language Models and conversational AI, the future of human-like interaction is just around the corner. However, natural conversation also requires natural avatar lip movements, and with most avatar providers unable to offer any measure of lip sync quality, do you know how good your avatar really is?
In this blog post we discuss why evaluating lip sync quality matters and how Emotech can help you measure the naturalness of your avatar’s lip sync, so you can bring the best content quality to market.
What is lip sync?
Lip synchronization, or lip sync for short, is the process of generating lip movements in an artificial avatar that match the lip movements a human would make when saying the same words. Correct lip sync is extremely important here because humans are highly attuned to lip movements when interacting with one another. Slight but perceptible errors in lip sync can leave an otherwise carefully designed avatar at best disliked and at worst rejected outright, thanks to the uncanny valley effect.
Two main characteristics are required for natural, human-like lip sync. First, the avatar’s lip shapes have to replicate the shape of human lips when making individual sounds; think, for instance, of how your lips narrow and round when you make the /oo/ sound in "boot". Second, the avatar’s lips have to follow a smooth trajectory from sound to sound in the same way human lips do, just as your mouth closes and then opens for the /b/ sound in "boot". Both conditions are necessary for correct and natural lip sync, but achieving both with high quality can be extremely difficult.
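To make these two conditions concrete, here is a toy Python sketch (not Emotech’s implementation): a hypothetical phoneme-to-viseme table with made-up openness and rounding parameters stands in for the per-sound lip shapes, and linear interpolation between consecutive visemes stands in for the smooth trajectory. Production systems use far richer parameterizations (blendshapes, full landmark sets) and coarticulation models.

```python
import numpy as np

# Hypothetical viseme parameters: (mouth_openness, lip_rounding), both in [0, 1].
# The values below are illustrative, not measured.
VISEMES = {
    "b": np.array([0.0, 0.3]),   # lips fully closed for the /b/ in "boot"
    "oo": np.array([0.4, 0.9]),  # narrow, rounded lips for the /oo/ in "boot"
    "t": np.array([0.3, 0.2]),   # lips apart; the tongue does the work for /t/
}

def lip_trajectory(phonemes, frames_per_phoneme=10):
    """Interpolate lip parameters between consecutive visemes so the mouth
    moves smoothly from sound to sound instead of jumping between shapes."""
    keyframes = [VISEMES[p] for p in phonemes]
    trajectory = []
    for start, end in zip(keyframes, keyframes[1:]):
        for t in np.linspace(0.0, 1.0, frames_per_phoneme, endpoint=False):
            trajectory.append((1 - t) * start + t * end)
    trajectory.append(keyframes[-1])  # land exactly on the final viseme
    return np.array(trajectory)

frames = lip_trajectory(["b", "oo", "t"])  # "boot" as a phoneme sequence
print(frames.shape)  # (21, 2): 2 transitions x 10 frames + the final keyframe
```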
How do you measure good lip sync?
Although humans are very good at detecting a lack of synchronization between lip movements and speech, they cannot give a precise, objective measure of how much better or worse different lip sync techniques are. People can describe an avatar interaction as pleasant or off-putting, but they cannot describe an avatar's naturalness as 86.5% that of a real human.
Other AI technologies have objective, standardized measures of quality: you can measure whether your computer vision algorithm detects 97.6% of dogs, or whether your speech-to-text system correctly transcribes 94% of words. To close this gap, Emotech has developed an objective measure of lip sync quality that puts your avatar's quality into numbers and lets you compare different providers.
A measure of lip sync quality
The measure requires comparing pairs of videos: in each pair, one video shows a real human uttering a sentence and the other shows the avatar pronouncing the same sentence. From these videos, the location and shape of the lips of both human and avatar are extracted using computer vision for every single frame of each pair.
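As an illustration of what this extraction step can look like, here is a minimal Python sketch built on the open-source MediaPipe Face Mesh model. The file names and the choice of library are assumptions made for the example; Emotech's actual pipeline is not described here.

```python
import cv2
import mediapipe as mp
import numpy as np

# Landmark indices belonging to the lips, derived from MediaPipe's lip edges.
LIP_IDS = sorted({i for edge in mp.solutions.face_mesh.FACEMESH_LIPS for i in edge})

def extract_lip_shapes(video_path):
    """Return one (num_lip_points, 2) array of lip coordinates per frame,
    or None for frames where no face was detected."""
    shapes = []
    cap = cv2.VideoCapture(video_path)
    with mp.solutions.face_mesh.FaceMesh(static_image_mode=False) as mesh:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            result = mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if not result.multi_face_landmarks:
                shapes.append(None)  # no face found in this frame
                continue
            lm = result.multi_face_landmarks[0].landmark
            shapes.append(np.array([[lm[i].x, lm[i].y] for i in LIP_IDS]))
    cap.release()
    return shapes

# Hypothetical file names for one video pair.
human_shapes = extract_lip_shapes("human.mp4")
avatar_shapes = extract_lip_shapes("avatar.mp4")
```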
Emotech’s measure compares the shape of the lips in all frames that contain the same sound. This yields a score between 0 and 100 of how well the lip shapes match, with 100 indicating that the avatar's lip shapes perfectly replicate the human's. After this value is calculated for each avatar video frame, an overall score for the whole sentence is computed, again with a maximum of 100 meaning a perfect match across the entire sentence.
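Here is a hedged sketch of how such per-frame lip shapes could be turned into a 0 to 100 score. The normalization, the distance-to-score mapping, and the simple frame-by-frame alignment are all illustrative assumptions: in the actual measure frames are grouped by sound, and Emotech's exact formula is not public.

```python
import numpy as np

def normalize(shape):
    """Remove translation and scale so only the lip *shape* is compared."""
    centred = shape - shape.mean(axis=0)
    return centred / np.linalg.norm(centred)

def frame_score(human_shape, avatar_shape):
    """Score one aligned frame pair: 100 means identical normalized shapes."""
    diff = np.linalg.norm(normalize(human_shape) - normalize(avatar_shape))
    return 100.0 * max(0.0, 1.0 - diff)

def sentence_score(human_shapes, avatar_shapes):
    """Average the per-frame scores over frames where both faces were found."""
    scores = [
        frame_score(h, a)
        for h, a in zip(human_shapes, avatar_shapes)
        if h is not None and a is not None
    ]
    return float(np.mean(scores))

# Toy usage with synthetic shapes; in practice, feed the per-frame lip arrays
# produced by the extraction sketch above.
rng = np.random.default_rng(0)
human = [rng.random((40, 2)) for _ in range(50)]
avatar = [h + rng.normal(scale=0.01, size=h.shape) for h in human]
print(f"Lip sync score: {sentence_score(human, avatar):.1f}/100")
```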
Achieve human-like naturalness today!
With these metrics removing the guesswork from assessing the quality of your lip-synced videos, content creators no longer need to settle for suboptimal technology. With Emotech’s lip sync achieving 90% similarity to humans, why not try Emotech's state-of-the-art technology for yourself today?