No, it isn't. In that clip they use two different sound clips as they switch faces. It's not changing the 'voice' that speaks a phrase on the fly. They're two separate pre-recorded clips.
Literally from the article:
"It does not clone or simulate voices (like other Microsoft research) but relies on an existing audio input that could be specially recorded or spoken for a particular purpose."