Active Appearance Models work by creating a deformable model of appearance, built by combining a point distribution model with a texture model using principal component analysis (PCA). In practice that means you take a bunch of faces, locate landmark points (the corners of the eyes, the apex of the chin, etc.), compute an average face, and use PCA to statistically model the variation in shape. Next you warp each face from its original landmark points onto the average shape and run a PCA on the pixel values. This creates a pixel-wise model of 'texture' that also models variation. With these two parts you have something that (given a good sampling of faces) models most faces and expressions with roughly 80 to 120 numbers, plus statistical ranges for those numbers.
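The shape half of that (the point distribution model) is easy to sketch in a few lines of NumPy. This is a toy version with random numbers standing in for real annotated, pre-aligned landmarks; the 95% variance cutoff and the mode count are illustrative choices, not anything from the original implementation:

```python
import numpy as np

# Hypothetical data: N training faces, each with K landmark points (x, y),
# already aligned (translation/scale/rotation removed), flattened to 2K vectors.
rng = np.random.default_rng(0)
N, K = 100, 68
shapes = rng.normal(size=(N, 2 * K))  # stand-in for real annotated landmarks

# Point distribution model: mean shape + principal modes of variation.
mean_shape = shapes.mean(axis=0)
centered = shapes - mean_shape

# PCA via SVD; the rows of Vt are the eigen-shapes (modes of variation).
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
eigenvalues = S**2 / (N - 1)

# Keep enough modes to explain, say, 95% of the variance.
explained = np.cumsum(eigenvalues) / eigenvalues.sum()
n_modes = int(np.searchsorted(explained, 0.95)) + 1
P = Vt[:n_modes]  # (n_modes, 2K) orthonormal basis

# Any face shape is then approximated as mean + P.T @ b, where b are the
# "80-120 numbers"; their statistical range is roughly +/- 3*sqrt(eigenvalue).
b = P @ (shapes[0] - mean_shape)
reconstruction = mean_shape + P.T @ b
```

The texture model is built the same way, just on the warped pixel vectors instead of the landmark coordinates.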
So how do you use this to track faces? You use gradient descent to fit the model's appearance to the image, adjusting those 80-120 values plus x, y, scale, and rotation until the pixel difference is close to zero. The trick is that the gradients are approximated by precomputed derivative images, but this only works if the model is initialized roughly on top of the face. You can see in the video that he used Viola-Jones (the green squares) to locate the face and then dropped the AAM on top of it. He's only showing the landmark points, not the texture model.
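The search loop itself is small once the precomputation is done. Here is a minimal sketch of the classic Cootes-style fitting iteration, where a precomputed matrix R (learned offline by perturbing known fits and regressing residual against parameter displacement) replaces true gradient computation; the names and the toy linear demo at the bottom are my own illustrative assumptions:

```python
import numpy as np

def aam_search(image, params, R, sample_texture, model_texture,
               n_iters=30, tol=1e-8):
    """Iteratively refine model parameters until the image residual is small.

    sample_texture(image, params): pixels sampled from the image, warped
        into the model's reference frame under the current parameters.
    model_texture(params): the pixels the model itself predicts.
    R: precomputed linear map from residual to parameter update, standing
        in for recomputing gradients at every step.
    """
    prev_err = np.inf
    for _ in range(n_iters):
        residual = sample_texture(image, params) - model_texture(params)
        err = float(residual @ residual)
        if err < tol or err >= prev_err:  # converged, or update made it worse
            break
        params = params - R @ residual    # predicted parameter update
        prev_err = err
    return params

# Toy stand-in: pretend appearance is exactly linear in the parameters,
# so the ideal update matrix is (minus) the pseudo-inverse.
rng = np.random.default_rng(1)
A = rng.normal(size=(50, 5))   # "texture model": pixels = A @ params
true_p = rng.normal(size=5)
target = A @ true_p            # the "image" we are fitting
R = -np.linalg.pinv(A)
fitted = aam_search(None, np.zeros(5), R,
                    lambda img, p: target, lambda p: A @ p)
```

In a real AAM the relationship is only locally linear, which is exactly why the model has to start near the true face position, i.e. why the Viola-Jones detection step matters.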
I did my dissertation on this almost a decade ago, tracking MR images of hearts, and even back then it was pretty fast. What's interesting is that the model doesn't just locate features: you can reconstruct synthetic face images from it, and the model parameters can be used to identify a person, classify an emotion, or create a synthetic face that swaps in another person's identity while keeping the same expression parameters. My own implementation reliably detected anomalies in beating hearts.
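The expression-swap idea can be sketched as pure parameter bookkeeping. One big assumption here (not from the original work): that certain modes carry mostly identity and others mostly expression; raw PCA modes mix both unless the model is deliberately built to separate them. All names and numbers below are illustrative:

```python
import numpy as np

def reconstruct(mean, basis, params):
    """Rebuild a face vector (landmarks or pixels) from model parameters."""
    return mean + basis.T @ params

def swap_expression(params_identity, params_expression, expr_idx):
    """Keep one person's identity modes, graft in another's expression modes."""
    out = params_identity.copy()
    out[expr_idx] = params_expression[expr_idx]
    return out

# Toy numbers: a 10-mode model where (by assumption) modes 6-9 encode expression.
rng = np.random.default_rng(2)
mean = rng.normal(size=40)
basis = rng.normal(size=(10, 40))
person_a = rng.normal(size=10)   # fitted parameters for person A
person_b = rng.normal(size=10)   # fitted parameters for person B
expr_idx = np.arange(6, 10)
hybrid = swap_expression(person_a, person_b, expr_idx)
face = reconstruct(mean, basis, hybrid)  # A's identity with B's expression
```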
I really wanted to build a business around it back then, but that put me in conflict with my advisor and university at the time.