Cohere Transcribe Sets New Speech Recognition Bar

Cohere launches Transcribe, an open-source speech recognition model setting new accuracy benchmarks and designed for enterprise-grade performance.

Screenshot of Cohere Transcribe performance data on the Hugging Face Open ASR Leaderboard.
Image credit: StartupHub.ai

Cohere has entered the speech recognition arena with the launch of Cohere Transcribe, an open-source model that aims to raise the bar for automatic speech recognition (ASR) accuracy. The launch signals a significant push into enterprise speech intelligence, positioning the model as a foundational component for applications ranging from meeting transcription to real-time customer support. The model is available for download today.

The core objective behind Cohere Transcribe was to minimize word error rate (WER) under practical conditions, moving beyond research artifacts to deliver a production-ready system. According to Cohere, its inference footprint is small enough to run on a local GPU, and its serving efficiency is best in class. For those seeking a managed solution, Transcribe is also accessible via Cohere’s Model Vault platform.

According to Cohere, Transcribe currently holds the top spot on Hugging Face’s Open ASR Leaderboard. The ranking reflects its performance across diverse real-world scenarios, including multi-speaker environments and various accents. The company emphasizes that these gains are validated through both benchmark datasets and human evaluations, so that performance in controlled tests carries over to practical enterprise settings.

Model Performance and Architecture

Cohere Transcribe is a 2-billion-parameter model built on a Conformer-based encoder-decoder architecture. It converts audio waveforms into log-Mel spectrograms, from which it generates the transcribed text. The model was trained from scratch on 14 languages, including major European, APAC, and MENA languages.
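To make the log-Mel front end concrete, here is a minimal sketch of how a waveform can be turned into log-Mel frames using only the Python standard library. It is not Cohere's preprocessing code: the frame size, hop, number of mel bands, Hann window, and the HTK-style mel formula are all illustrative assumptions, and a real pipeline would use an FFT library rather than the naive DFT shown here.

```python
import cmath
import math

def hz_to_mel(hz: float) -> float:
    # HTK-style mel scale (an assumption; other variants exist)
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel: float) -> float:
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels: int, n_fft: int, sample_rate: int) -> list:
    """Triangular filters spaced evenly on the mel scale, one row per band."""
    top = hz_to_mel(sample_rate / 2)
    # n_mels + 2 edge frequencies: each filter spans three consecutive edges
    edges = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)
            falling = (hi - f) / (hi - mid)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank

def power_spectrum(frame: list) -> list:
    """Naive O(n^2) DFT power spectrum of one windowed frame."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum

def log_mel_spectrogram(samples, sample_rate, n_fft=64, hop=32, n_mels=8):
    """Return one list of n_mels log energies per analysis frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / n_fft)
              for t in range(n_fft)]  # periodic Hann window
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(samples) - n_fft + 1, hop):
        frame = [samples[start + t] * window[t] for t in range(n_fft)]
        power = power_spectrum(frame)
        mel_energy = [sum(w * p for w, p in zip(row, power)) for row in bank]
        # small offset avoids log(0) on silent bands
        frames.append([math.log(e + 1e-10) for e in mel_energy])
    return frames
```

Each output frame is a short vector of band energies; a sequence of such frames is the kind of input a Conformer-style encoder consumes in place of the raw waveform.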

In English speech recognition, Cohere Transcribe leads with an average WER of 5.42% (lower is better), outperforming prominent competitors such as Whisper Large v3 and ElevenLabs Scribe v2. Accuracy at this level is crucial for applications demanding high fidelity, as discussed in articles on fine-tuning speech-to-text.
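For readers unfamiliar with the metric, WER is the number of word substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the number of words in the reference. The sketch below computes it with a standard edit-distance recurrence. The function name is illustrative, and the only text normalization is lowercasing and whitespace splitting, which is an assumption; leaderboard evaluations typically apply more thorough normalization before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words gives a WER of 20%.
print(word_error_rate("the quick brown fox jumps", "the quick brown box jumps"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.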

Beyond accuracy, Transcribe delivers high throughput. It extends the Pareto frontier by achieving state-of-the-art accuracy while maintaining a high inverse real-time factor (RTFx), the ratio of audio duration to processing time, within its model size cohort. This balance is critical for production environments where latency and efficiency directly impact user experience and operational costs.
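RTFx is simple to compute: divide the duration of the audio by the time taken to transcribe it, so a higher number means faster processing. The sketch below shows the arithmetic and a small timing helper. Both function names are hypothetical, and the `transcribe` argument stands in for whatever inference call is being measured; it is not a Cohere API.

```python
import time
from typing import Any, Callable

def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    if audio_seconds <= 0 or processing_seconds <= 0:
        raise ValueError("durations must be positive")
    return audio_seconds / processing_seconds

def measure_rtfx(transcribe: Callable[[Any], str], audio: Any,
                 audio_seconds: float) -> float:
    """Time a single call to a (hypothetical) transcribe function."""
    start = time.perf_counter()
    transcribe(audio)
    elapsed = time.perf_counter() - start
    return rtfx(audio_seconds, elapsed)


# Ten minutes of audio transcribed in three seconds corresponds to an RTFx of 200.
print(rtfx(600.0, 3.0))
```

A single timed call is noisy; in practice, throughput figures are averaged over many files and reported alongside the hardware and batch size used.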

Radical Ventures, an early partner, expressed strong satisfaction with Transcribe's speed and reliability, noting its potential to unlock new real-time product possibilities.

Cohere plans deeper integration with its North AI agent orchestration platform, evolving Transcribe into a more comprehensive enterprise speech intelligence solution.

The model is available for download on Hugging Face and via Cohere's API. Production deployments can leverage Model Vault for private cloud inference without infrastructure management.

Open-source ASR is evolving rapidly, with projects like Meta’s Omnilingual ASR and OLMoASR also pushing the boundaries of what is possible.