The AI dubbing industry is booming, with tools promising to translate and replicate actor performances across languages in an instant. But how good are they really? Until now, judging the quality of these systems has been a subjective mess. Amsterdam-based AI data firm Toloka aims to fix that with VOX-DUB, the first open, human-evaluated AI dubbing benchmark, designed to bring some much-needed accountability to the sector.
VOX-DUB moves beyond the simple metrics used for text-to-speech, a field that has nearly reached human parity. Dubbing isn’t just about clear pronunciation; it’s about performance. The benchmark uses a pairwise A/B testing methodology, in which native speakers listen to clips and rate them across five crucial dimensions: pronunciation, naturalness, audio quality, emotional accuracy, and voice similarity to the original actor.
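To make the pairwise A/B methodology concrete, here is a minimal sketch of how such judgments could be aggregated into per-dimension win rates. This is an illustration only: the system names, the `win_rates` function, and the tie-handling convention (a tie counts as half a win for each side) are assumptions, not Toloka's actual evaluation pipeline.

```python
from collections import defaultdict

# The five dimensions the article describes.
DIMENSIONS = ["pronunciation", "naturalness", "audio_quality",
              "emotional_accuracy", "voice_similarity"]

def win_rates(judgments):
    """Aggregate pairwise A/B judgments into per-system win rates.

    Each judgment is a tuple (system_a, system_b, dimension, winner),
    where winner is "a", "b", or "tie".
    Returns {dimension: {system: win_rate}}.
    """
    wins = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for sys_a, sys_b, dim, winner in judgments:
        counts[dim][sys_a] += 1
        counts[dim][sys_b] += 1
        if winner == "a":
            wins[dim][sys_a] += 1.0
        elif winner == "b":
            wins[dim][sys_b] += 1.0
        else:  # assumed convention: a tie is half a win for each side
            wins[dim][sys_a] += 0.5
            wins[dim][sys_b] += 0.5
    return {dim: {s: wins[dim][s] / n for s, n in counts[dim].items()}
            for dim in counts}

# Hypothetical ratings for two dubbing systems from native-speaker raters.
judgments = [
    ("dubber_x", "dubber_y", "naturalness", "a"),
    ("dubber_x", "dubber_y", "naturalness", "tie"),
    ("dubber_x", "dubber_y", "voice_similarity", "b"),
]
print(win_rates(judgments))
```

Reporting a separate win rate per dimension, rather than one blended score, is what lets a benchmark like this show that a system can excel at pronunciation while still failing at emotional accuracy or voice similarity.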
