NVIDIA has announced a significant push into multilingual AI, releasing a new open dataset called Granary and two accompanying models, Canary-1b-v2 and Parakeet-tdt-0.6b-v3. This initiative aims to address the critical lack of AI support for the vast majority of the world's languages, specifically targeting high-quality speech recognition and translation across 25 European languages, including those with historically limited data like Croatian, Estonian, and Maltese. The tools are designed to help developers build scalable AI applications for global users, supporting use cases such as multilingual chatbots, customer service voice agents, and near-real-time translation.
In an announcement on its blog, NVIDIA detailed Granary as a massive, open-source corpus comprising approximately one million hours of audio. This includes nearly 650,000 hours dedicated to speech recognition and over 350,000 hours for speech translation. The dataset, along with the new Canary and Parakeet models, is now publicly available on Hugging Face, with a research paper on Granary slated for presentation at the Interspeech conference in August.
A core challenge in developing robust speech AI has been data scarcity, particularly for less-resourced languages. NVIDIA, in collaboration with researchers from Carnegie Mellon University and Fondazione Bruno Kessler, tackled this by developing an innovative processing pipeline. This pipeline, powered by NVIDIA's NeMo Speech Data Processor toolkit, transforms unlabeled audio into structured, high-quality data suitable for AI training, circumventing the need for expensive and time-consuming human annotation. This methodology is also open-sourced on GitHub, allowing broader adoption.
Democratizing Speech AI for Diverse Languages
Granary's clean, ready-to-use data provides a substantial head start for developers building models for transcription and translation tasks across nearly all of the European Union’s 24 official languages, plus Russian and Ukrainian. For European languages that are typically underrepresented in human-annotated datasets, Granary offers a vital resource, enabling the creation of more inclusive speech technologies that better reflect the continent's linguistic diversity. Crucially, the accompanying research shows that models can reach comparable accuracy on automatic speech recognition (ASR) and automatic speech translation (AST) with roughly half as much Granary training data as they would need from other popular datasets.
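For developers who want to explore the corpus directly, the sketch below shows one way to stream a slice of Granary with the Hugging Face `datasets` library. The repository id, configuration name, and split used here are assumptions for illustration only; the exact identifiers and fields are documented on the dataset's Hugging Face page.

```python
from datasets import load_dataset

# Hypothetical repository id, per-language config, and split name; check the
# Granary page on Hugging Face for the exact identifiers before running this.
granary = load_dataset(
    "nvidia/Granary",   # assumed dataset id
    "hr",               # assumed config: Croatian subset
    split="train",      # assumed split name
    streaming=True,     # stream records rather than downloading the full corpus
)

# Inspect the first record to see which fields (audio, transcript, translation, ...)
# this subset actually exposes.
first = next(iter(granary))
print(first.keys())
```

Streaming mode is the practical choice here, since materializing hundreds of thousands of hours of audio locally is rarely necessary for initial experimentation.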
The new Canary and Parakeet models serve as practical examples of what developers can achieve with Granary. Canary-1b-v2, a billion-parameter model, is optimized for accuracy in complex tasks, offering transcription for European languages and translation between English and two dozen supported languages. It boasts quality comparable to models three times its size while performing inference up to ten times faster. Available under a permissive license, it significantly expands the Canary family's language support.
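As a rough illustration of how such a model might be called from NVIDIA's NeMo toolkit, the snippet below loads a Canary-style multitask checkpoint and requests an English-to-German translation. It assumes Canary-1b-v2 keeps the `EncDecMultiTaskModel` interface and the task/language arguments of earlier Canary releases, which is not guaranteed; the model card on Hugging Face documents the exact API and argument names.

```python
# Requires the NeMo toolkit, e.g. pip install "nemo_toolkit[asr]"
from nemo.collections.asr.models import EncDecMultiTaskModel

# Assumption: canary-1b-v2 loads as a multitask model like earlier Canary checkpoints.
model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b-v2")

# Assumption: transcribe() accepts task and language hints as earlier Canary models do.
# task="ast" asks for speech translation; task="asr" would return a plain transcript.
outputs = model.transcribe(
    ["meeting_clip.wav"],   # placeholder for any local audio file
    batch_size=1,
    task="ast",
    source_lang="en",
    target_lang="de",
)
print(outputs[0])
```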
Complementing Canary is Parakeet-tdt-0.6b-v3, a more streamlined 600-million-parameter model. This model is engineered for high-speed, low-latency tasks, capable of transcribing 24-minute audio segments in a single inference pass. It automatically detects the input audio language, simplifying the transcription process. Both Canary and Parakeet models provide accurate punctuation, capitalization, and word-level timestamps in their outputs, enhancing the utility of their transcriptions. By sharing both the dataset and the underlying methodology, NVIDIA aims to accelerate innovation across the global speech AI developer community.
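A comparable sketch for the smaller Parakeet model is shown below, again via NeMo. It assumes parakeet-tdt-0.6b-v3 exposes the generic `ASRModel` loading path and timestamped `transcribe()` output that its v2 predecessor documents; the audio file name is a placeholder.

```python
# Requires the NeMo toolkit, e.g. pip install "nemo_toolkit[asr]"
import nemo.collections.asr as nemo_asr

# Assumption: the v3 checkpoint loads through the generic ASRModel entry point,
# as parakeet-tdt-0.6b-v2 does.
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-tdt-0.6b-v3"
)

# timestamps=True asks NeMo to return word- and segment-level timing alongside
# the text; "podcast_episode.wav" stands in for any local audio file.
results = asr_model.transcribe(["podcast_episode.wav"], timestamps=True)

print(results[0].text)                      # punctuated, capitalized transcript
for word in results[0].timestamp["word"]:   # word-level start/end offsets
    print(word)
```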

