Mona Lisa rapping: Microsoft’s new AI spins up talking heads

Microsoft has showcased an AI model called VASA-1 that can generate lifelike talking faces of virtual (or real) characters from just a single static image and a short audio clip – but said it is not releasing it publicly yet.

VASA-1 can generate realistic lip-audio synchronisation, as well as capturing what Microsoft’s researchers described as a “large spectrum of emotions and expressive facial nuances and natural head motions.”

The model can generate realistic outputs even when the inputs fall outside its training distribution. For example, Microsoft said, “it can handle artistic photos, singing audios, and non-English speech.”

The Mona Lisa + Anne Hathaway: Paparazzi. Credit: Microsoft.

VASA-1 generates video frames of 512x512 at 45fps in its offline batch processing mode, and 40fps in its online streaming mode with latency of 170ms – as evaluated on a desktop PC with one NVIDIA RTX 4090 GPU.
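To put those figures in perspective, the short Python sketch below converts them into a per-frame time budget and expresses the streaming latency as a number of frames. It is only back-of-the-envelope arithmetic on the numbers quoted above; the function names are illustrative, and it assumes the 170ms latency is a simple delay, since how Microsoft measured it is not specified here.

```python
# Back-of-the-envelope arithmetic on the reported VASA-1 figures.
# Function names are illustrative; nothing here reflects Microsoft's code.

def frame_budget_ms(fps: float) -> float:
    """Average time available to produce one frame at a given frame rate."""
    return 1000.0 / fps


def latency_in_frames(latency_ms: float, fps: float) -> float:
    """How many frames' worth of time a given delay represents."""
    return latency_ms * fps / 1000.0


def pixels_per_second(width: int, height: int, fps: float) -> float:
    """Raw pixel throughput implied by a resolution and frame rate."""
    return width * height * fps


if __name__ == "__main__":
    # Offline batch mode: 45fps. Online streaming mode: 40fps.
    print(f"offline budget: {frame_budget_ms(45):.1f} ms per frame")
    print(f"online budget:  {frame_budget_ms(40):.1f} ms per frame")
    # Assumption: the 170ms latency is treated as a plain delay.
    print(f"170 ms at 40fps: {latency_in_frames(170, 40):.1f} frames")
    print(f"throughput at 40fps: {pixels_per_second(512, 512, 40):,.0f} pixels/s")
```

On these assumptions, the model has roughly 22ms to 25ms to produce each 512x512 frame, and the streaming delay is a little under seven frames.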

Microsoft joins Meta in declining to publicly release a model that could be easily abused. In 2023 Meta showcased a generative AI model for speech, Voicebox, which can take an audio sample of just two seconds and generate an authentic-sounding voice. Meta opted not to release the model or code until synthetic speech can be more easily detected.

Meta said it is working on techniques like embedding artificial fingerprints that can be trivially detected without hurting the speech quality.

See also: Meta says its new AI is too dangerous to release - but fingerprinting synthetic speech may support safe deployment

Another example of VASA-1's AI-generated output. Credit: Microsoft.

Microsoft said that VASA-1 “paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviors.”

But in a research post on April 15, its researchers said Redmond has “no plans to release an online demo, API, product, additional implementation details, or any related offerings until we are certain that the technology will be used responsibly and in accordance with proper regulations.”

Numerous companies meanwhile are using AI-generated avatars for enterprise use-cases, including AI interviewers to pre-screen applicants and conduct tests that they say will reduce the burden on HR staff.

Both China and India have put AI news anchors on air to read bulletins, and India’s has even “interviewed” a head of state. AI-generated “influencers” and models, meanwhile, are proliferating on social platforms, where they attract growing followings and even brand endorsements.

Easy access to genuinely convincing AI-generated video and speech models that need only minimal inputs, like VASA-1, has vast potential to be abused for political purposes or, more simply, for social engineering and cybercrime.

Microsoft said this week: “While acknowledging the possibility of misuse, it's imperative to recognize the substantial positive potential of our technique. The benefits – ranging from enhancing educational equity, improving accessibility for individuals with communication challenges, and offering companionship or therapeutic support to those in need – underscore the importance of our research and other related explorations. We are dedicated to developing AI responsibly…”
