Building an AI-first company often means navigating stretches of technological stagnation, colloquially known as "AI winters," a challenge Speak, a leading language learning platform, faced head-on years before the recent surge in large language models. While the world now marvels at the capabilities of tools like ChatGPT and Whisper, Speak’s co-founder, Hojoon Kim, described the strategic resilience and deep technical investments that allowed the company not just to survive but to thrive during leaner times for AI.
Kim spoke with the host of Latent Space about the company’s foundational approach to speech AI, particularly its unique path in developing custom models for English language acquisition among non-native speakers. He explained how Speak’s core proposition was built on proprietary technology that predates the current public excitement around large language models. "The first act of the company, if you will, was before LLMs, right? Before 2022, when ChatGPT came out, when Whisper came out," Kim stated, emphasizing their early commitment to speech AI when the broader ecosystem was far less mature.
Speak’s early focus on a specific market niche—English language learning for non-native speakers in South Korea—proved pivotal. This strategic choice allowed them to achieve product-market fit by addressing a critical, underserved need with specialized technology. Rather than waiting for general-purpose models, they embarked on building their own.
A key differentiator emerged from their user engagement. As users consistently spoke into the app, Speak amassed a unique and invaluable dataset. "We developed custom speech recognition models and users were speaking into the app all day so we had a ton of this non-native English speaker data," Kim explained. This continuous influx of data from non-native speakers provided a powerful feedback loop, enabling them to fine-tune their models for superior accuracy in their target demographic, creating a formidable data moat.
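To make that flywheel concrete, here is a minimal sketch of how in-app recordings paired with verified transcripts could feed back into an ASR model. Speak’s actual models and pipeline are proprietary and predate Whisper; this sketch instead assumes an open Whisper checkpoint fine-tuned via Hugging Face transformers, and the dataset directory, column names, and hyperparameters are all illustrative stand-ins.

```python
# Illustrative data-flywheel sketch: fine-tune an open ASR model on logged
# non-native speaker audio. Paths, columns, and hyperparameters are
# hypothetical, not Speak's actual setup.
from datasets import load_dataset, Audio
from transformers import (
    WhisperProcessor,
    WhisperForConditionalGeneration,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Hypothetical folder of in-app recordings with a metadata file mapping
# each clip to a verified transcript.
ds = load_dataset("audiofolder", data_dir="speak_logs/")["train"]
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def preprocess(batch):
    audio = batch["audio"]
    batch["input_features"] = processor(
        audio["array"], sampling_rate=16_000
    ).input_features[0]
    batch["labels"] = processor.tokenizer(batch["transcription"]).input_ids
    return batch

ds = ds.map(preprocess, remove_columns=ds.column_names)

def collate(features):
    # Pad audio features and label ids separately; mask label padding with
    # -100 so the cross-entropy loss ignores it.
    input_feats = [{"input_features": f["input_features"]} for f in features]
    batch = processor.feature_extractor.pad(input_feats, return_tensors="pt")
    label_feats = [{"input_ids": f["labels"]} for f in features]
    labels = processor.tokenizer.pad(label_feats, return_tensors="pt")
    batch["labels"] = labels["input_ids"].masked_fill(
        labels["attention_mask"].ne(1), -100
    )
    return batch

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="whisper-nonnative",
        per_device_train_batch_size=8,
        learning_rate=1e-5,
        max_steps=1000,
    ),
    train_dataset=ds,
    data_collator=collate,
)
trainer.train()
```

The point of the sketch is the loop, not the model: every lesson generates more accented speech paired with known target phrases, and each retraining pass makes the recognizer better on exactly the users it serves.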
The technical demands of their core offering also dictated their approach. For real-time interactive lessons, latency was paramount. "It’s important for us for the core recording loop in many of our lessons that it’s extremely fast so we’re very latency sensitive," Kim highlighted. This need for speed and precision meant off-the-shelf solutions were often insufficient, reinforcing the case for their custom-built, optimized automatic speech recognition (ASR) systems.
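To illustrate the latency constraint, here is a toy streaming loop that recognizes short audio chunks as they arrive instead of waiting for a full utterance. The `recognize_chunk` function is a hypothetical stand-in for an incremental ASR step, and the chunk size and latency budget are assumptions for illustration, not Speak’s numbers.

```python
# Toy latency-sensitive recording loop: process audio in small chunks and
# flag any chunk whose processing exceeds a per-chunk budget.
import time
import numpy as np

SAMPLE_RATE = 16_000
CHUNK_MS = 100                      # 100 ms chunks keep feedback near real time
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000
LATENCY_BUDGET_MS = 50              # hypothetical per-chunk processing budget

def recognize_chunk(chunk: np.ndarray, state: dict) -> str:
    """Stand-in for an incremental ASR step: consume one chunk of PCM audio,
    update decoder state, and return the partial transcript so far."""
    state["samples"] = state.get("samples", 0) + len(chunk)
    return f"<partial transcript after {state['samples'] / SAMPLE_RATE:.1f}s>"

def streaming_loop(audio: np.ndarray) -> None:
    state: dict = {}
    for start in range(0, len(audio), CHUNK_SAMPLES):
        chunk = audio[start : start + CHUNK_SAMPLES]
        t0 = time.perf_counter()
        partial = recognize_chunk(chunk, state)
        elapsed_ms = (time.perf_counter() - t0) * 1000
        # In a real lesson loop, blowing the budget here makes the UI feel
        # laggy; this is the constraint that pushes toward a custom model.
        if elapsed_ms > LATENCY_BUDGET_MS:
            print(f"warning: chunk took {elapsed_ms:.1f} ms")
        print(partial)

if __name__ == "__main__":
    # Simulate two seconds of microphone audio; a real app would read from
    # a live input stream instead.
    streaming_loop(np.zeros(SAMPLE_RATE * 2, dtype=np.float32))
```

A batch API that returns a transcript seconds after the user stops talking simply cannot power this loop, which is why a general-purpose model is not a drop-in replacement here.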
While Speak’s foundational technology remained robust, the advent of powerful new models like Whisper and larger language models presented new opportunities. Speak has strategically integrated these advancements for different aspects of their product. Their newer features, such as open-ended tutoring that provides semantic feedback on user speech, are "more Whisper powered, more LM powered." However, this adoption did not displace their original, tailored solution. They continue to leverage their "very fast core ASR loop that’s been fully custom," demonstrating a pragmatic hybrid approach to AI development. This blend of proprietary innovation and strategic adoption of external breakthroughs showcases a mature understanding of AI product development, balancing performance needs with evolving capabilities.
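One way to picture such a hybrid is a simple router that keeps scripted lesson turns on the fast custom ASR path and sends open-ended tutoring turns through a larger transcriber plus an LLM for semantic feedback. Every function below is a hypothetical placeholder sketched from the interview’s description, not Speak’s actual architecture or API.

```python
# Illustrative hybrid routing: latency-critical turns use the fast custom
# ASR; quality-critical open-ended turns use a Whisper-style transcriber
# plus an LLM. All functions are stubs.
from dataclasses import dataclass
from enum import Enum, auto

class Mode(Enum):
    SCRIPTED_LESSON = auto()   # latency-critical recording loop
    OPEN_TUTORING = auto()     # quality-critical, latency-tolerant

@dataclass
class Turn:
    mode: Mode
    audio: bytes
    expected_phrase: str | None = None  # only set for scripted lessons

def fast_custom_asr(audio: bytes) -> str:
    return "stubbed low-latency transcript"       # placeholder

def whisper_transcribe(audio: bytes) -> str:
    return "stubbed high-accuracy transcript"     # placeholder

def llm_feedback(transcript: str) -> str:
    return f"stubbed semantic feedback on: {transcript}"  # placeholder

def handle_turn(turn: Turn) -> str:
    if turn.mode is Mode.SCRIPTED_LESSON:
        # Fast path: compare the user's speech against the target phrase.
        heard = fast_custom_asr(turn.audio)
        return "correct" if heard == turn.expected_phrase else f"heard: {heard}"
    # Slow path: transcribe with the larger model, then critique meaning,
    # not just pronunciation.
    return llm_feedback(whisper_transcribe(turn.audio))
```

The design choice the sketch captures is that each request is routed by its tolerance for latency versus its need for open-ended understanding, so neither system has to do the other’s job.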

