Tech Watch: High Quality Animation Automated With Machine Learning from Speech Graphics

Machine Learning Improves Quality and Shortens the Time Needed for Large-Scale Production Such as Animation

The recent headlines surrounding Mass Effect Andromeda’s quality of facial animation have brought to light some significant challenges in modern video game production. Developers are in a constant arms race for higher-fidelity graphics, while players demand deeper and longer content. These trends are reflected in game development budgets, where art production often represents over 60% of total costs. This is compounded by the fact that iteration is a huge part of the development process: many art assets and animations are created, recreated, discarded, and then recreated again. While iteration is an important and necessary part of the creative process, art creation is still extremely labour-intensive, so every iteration increases time and costs significantly.

Facial animation, and lip-synch in particular, is a very difficult art asset to create, as it requires interdisciplinary knowledge and input. Speech articulation is the most complex muscle movement humans can perform. In the past, and still often today, dialogue is animated by hand from a recorded soundtrack, with each sound carefully posed by an animator. For a high-end production, a skilled animator can produce about three to five seconds of facial animation a day.

Facial motion capture is a semi-automated approach in which a camera tracks the movements of the face. While much faster than traditional hand animation and capable of great results for gross facial movement, it is error-prone for lip-synch because the inner contour of the lips is often occluded. These occlusions then have to be corrected manually, or the lip-synch looks off, with the lips never coming together for sounds like ‘m’ or ‘p’. Both techniques can produce very compelling results if enough man-hours are spent on animation and clean-up, but time is often in short supply.

Games such as Mass Effect, Fallout, or Skyrim have over a hundred thousand lines of dialogue, all of which need to be animated. In addition, the story told through all that dialogue will most likely change over the course of the project, requiring rewrites and re-recordings of voice actors. Then, once the development team is happy with the results in the development language, which is often English, all that recorded content will need to be localised, which requires recording voice actors in ten or more languages. That localised dialogue will either have to be dubbed, meaning the voice actor has to try to match the timing of the original animation, or the faces will have to be reanimated to make the lip-synch match the localised dialogue.

This example alone shows that faces in particular create many challenges in art production and the creative process, which makes the case for a fully automated solution extremely strong. To overcome the challenges posed by dialogue animation in modern game production, developers need a solution that produces accurate and compelling facial animation automatically from speech input in any language.

Recent advances in machine learning such as deep learning, the ever-increasing computing power described by Moore’s law, and the advent of GPU-driven computing make it possible to significantly reduce labour-intensive processes in art content production. An earlier approach, procedural content production, comes to mind: it can create simple content, such as bananas in different shapes and shades of yellow, from a single banana example by varying colour and curvature. Deep-learning-based approaches, on the other hand, make for a much more compelling solution. Multi-layered deep neural networks learn levels of abstraction and representation that make sense of data such as art assets.
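The procedural idea described above can be sketched in a few lines of Python; the parameter names and ranges here are purely illustrative assumptions, not any real asset format:

```python
import random

# A minimal sketch of procedural content variation: starting from a single
# "banana" example, generate new ones by perturbing colour and curvature.
# All parameters and ranges below are illustrative assumptions.
BASE_BANANA = {"hue": 0.15, "saturation": 0.9, "curvature": 0.6, "length": 18.0}

def procedural_variant(base, rng):
    """Return a new asset by randomly varying colour and shape parameters."""
    return {
        "hue": base["hue"] + rng.uniform(-0.02, 0.02),           # shade of yellow
        "saturation": base["saturation"] * rng.uniform(0.8, 1.0),
        "curvature": base["curvature"] * rng.uniform(0.7, 1.3),  # bend of the fruit
        "length": base["length"] * rng.uniform(0.9, 1.1),
    }

rng = random.Random(42)  # fixed seed for reproducible variants
variants = [procedural_variant(BASE_BANANA, rng) for _ in range(5)]
```

The limits of the approach are visible even in this toy: every variant is a hand-coded perturbation of the original, so the technique can never discover structure the author did not explicitly parameterise.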

A deep learning system, once it has seen enough training examples, can in theory produce an unlimited amount of complex content by automatically learning the relationships between thousands of input parameters. Gas stations in shooters are an interesting example: almost every game has one, and each was probably designed from scratch by an artist. By feeding labelled examples of gas stations into a deep learner, the system will automatically determine their distinguishing features, such as mesh structure, intersections, and height-to-length-to-width ratio, and then let a user create an unlimited number of new gas stations automatically from a constrained set of input parameters such as polygon count and footprint.
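As a toy illustration of this shape of system (not any vendor's actual method), a generator can be viewed as a small network mapping the constrained inputs to asset features. The weights below are random stand-ins for parameters a deep learner would actually fit on labelled examples, and the dimensions and normalisation constants are assumptions:

```python
import numpy as np

# Toy sketch: a tiny feed-forward "generator" maps constrained inputs
# (polygon budget, footprint) to distinguishing asset features, here
# three positive shape ratios. Random weights stand in for trained ones.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 16)), np.zeros(16)  # hidden layer
W2, b2 = rng.normal(size=(16, 3)), np.zeros(3)   # output: 3 shape ratios

def generate_asset(polygon_count, footprint_m2):
    """Map constrained user inputs to generated shape features."""
    x = np.array([polygon_count / 10_000, footprint_m2 / 500.0])  # crude normalisation
    h = np.tanh(x @ W1 + b1)             # learned layer of abstraction
    ratios = 1.0 + np.abs(h @ W2 + b2)   # keep ratios positive
    return ratios

ratios = generate_asset(polygon_count=8000, footprint_m2=300)
```

In a real system the weights would be trained on the labelled gas station meshes, and the outputs would parameterise actual geometry rather than a three-number summary.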

The key is to have the system learn from the right data. As with any machine learning approach, the principle of garbage in, garbage out holds for deep learning as well. Without seeing any good art, the system will not be able to produce good art. Therefore, a lot of skill and knowledge goes into preparing the training data to make sure it accurately represents the space that needs to be modelled, be it bananas, gas stations, or facial animation.

The trick is that many development studios already have a lot of good examples of art assets in their back catalogue. In principle this would allow them to use their own previous assets, including vast amounts of recorded dialogue and animation, as training data for a deep learning system. However, facial animation in particular requires learning a mapping from one complex space, namely speech, to another abstract space, facial animation. This is further complicated by the fact that, in contrast to gas stations, speech and animation are not stationary but change rapidly and widely over time.

The good news is that deep learning excels at complex mapping tasks and can create extremely powerful models given enough training data. There is a vast body of research in speech recognition and speech synthesis that provides a good foundation for modelling speech, which, combined with expert knowledge in computer graphics, enables the training of systems that produce facial animation automatically from speech input alone. Furthermore, the same principle of constraining machine learning with expert knowledge can be applied to any type of asset, giving developers the ability to create vast amounts of content quicker and cheaper while saving resources in the process.
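To make the speech-to-animation mapping concrete, here is a minimal sketch of one common framing from the speech literature (not necessarily Speech Graphics' approach): per-frame acoustic features, together with a sliding window of neighbouring frames for context, are regressed onto facial rig controls. The feature dimensions, frame rate, and the untrained linear map are all illustrative assumptions:

```python
import numpy as np

# Sketch of the mapping: audio is assumed pre-converted to per-frame
# acoustic features (e.g. 13 MFCC-like coefficients at 100 frames/sec);
# each animation frame is a vector of facial rig controls. A context
# window over neighbouring audio frames captures coarticulation.
rng = np.random.default_rng(1)
N_AUDIO, N_RIG, CONTEXT = 13, 40, 5  # feature dims and window radius

# Untrained linear map standing in for a deep regression network.
W = rng.normal(scale=0.1, size=(N_AUDIO * (2 * CONTEXT + 1), N_RIG))

def animate(audio_features):
    """Map a (T, N_AUDIO) feature sequence to (T, N_RIG) rig controls."""
    T = audio_features.shape[0]
    padded = np.pad(audio_features, ((CONTEXT, CONTEXT), (0, 0)))
    frames = []
    for t in range(T):
        window = padded[t : t + 2 * CONTEXT + 1].ravel()  # audio context
        frames.append(window @ W)                          # per-frame pose
    return np.stack(frames)

poses = animate(rng.normal(size=(200, N_AUDIO)))  # ~2 seconds of audio
```

The non-stationarity mentioned above is exactly why the context window matters: the correct mouth shape for a given sound depends on the sounds around it.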

Game studios can now implement machine learning and deep learning in their art production pipelines to meet the challenges posed by continually increasing graphics requirements and consumers’ demand for immersive, high-quality content. This is especially true for time-intensive production such as facial animation, which plays a significant part in player engagement through storytelling. Third-party vendors such as Speech Graphics are bringing their expertise in deep learning, not generally inherent in the game industry, to elevate those thousands of lines of dialogue animation to the standard that modern immersive game productions demand and that players expect.

Dr. Gregor Hofer is CEO and co-founder of Speech Graphics. He holds a PhD in speech technology from the University of Edinburgh, where he focused on using machine learning to drive character animation from speech. He has published in the fields of both computer graphics and speech technology. Speech Graphics’ production software SGX was officially released at the end of 2016 and was used in Microsoft’s Gears of War 4 to successfully animate thousands of lines of gameplay dialogue.