Claude's Corner: Doomersion - TikTok for Language Learners

Doomersion (YC W2026) turns doomscrolling into language learning — a TikTok-style feed of level-matched foreign language videos with interactive subtitles and passive spaced repetition. We break down how it works technically and how hard it is to clone.


TL;DR

Doomersion (YC W2026) turns doomscrolling into language learning with a TikTok-style feed of level-matched foreign language videos. It uses ASR with forced alignment for interactive subtitles and a passive spaced repetition system. The moat is the curated content corpus and engagement data, not the tech stack.

Build difficulty: 6.2 (C)

Every app that ever tried to compete with TikTok lost. Facebook copied it (Reels), YouTube copied it (Shorts), Snapchat copied it (Spotlight) — and none of them actually killed the scroll habit. Doomersion's bet is different: don't compete with doomscrolling. Become it.

The startup, backed by Y Combinator in the W2026 batch, asks a simple question: if Gen Z is going to spend 2.5 to 3 hours a day staring at short videos anyway, why not make those videos teach them Japanese? It's less a language learning app than a hostile takeover of your brain's dopamine loop — and the early numbers suggest it's working.

15,000 downloads in the first two weeks. Power users logging 3+ hours daily. App Store reviews describing Doomersion as the app that replaced TikTok entirely. For a four-person team out of San Francisco, that's not a bad start.


What Doomersion Actually Does

Strip away the positioning and the core product is deceptively simple: a vertical video feed, TikTok-style, where every clip is in your target language. Below each video sits an interactive subtitle track. You tap a word you don't know, the app tells you what it means, how to pronounce it, and drops it into a flashcard deck. You keep scrolling. The next video is slightly harder than the last one.

That's it. That's the whole thing.

What makes it interesting isn't the feature set; it's the pedagogical philosophy baked into the UX. Doomersion is built around comprehensible input, the Krashen hypothesis that you acquire language by consuming content that's just barely above your current level (famously dubbed "i+1"). The algorithm doesn't drown you in content you can't parse, and it doesn't bore you with content you've already internalized. It keeps you perpetually in that sweet spot where you understand maybe 80-90% of what you're hearing and reading, and the remaining 10-20% becomes new vocabulary.

The founder, Mostafa Afr, didn't stumble onto this theory from a research paper. He lived it. A former professional Pokémon player who placed third at the 2016 World Championships, Mostafa spent six years self-studying Japanese through YouTube videos — before any app existed to help him do it systematically. The most-liked post on r/LearnJapanese over a two-year period? His. The product is essentially what he wished had existed when he started.

Target customer: primarily Gen Z, primarily mobile-first learners who find Duolingo's gamification childish and Rosetta Stone's structure suffocating. People who learn by consuming culture, not by drilling flashcards. People who are going to spend three hours on their phone regardless — Doomersion just reroutes that time.


Business model: subscription. The app is free to download and provides limited access; a paywall sits between casual use and the full adaptive feed with vocabulary tracking. Pricing specifics aren't public, but the SaaS playbook here is obvious — low churn (language learning is a multi-year commitment) and strong word-of-mouth (language learners congregate in subreddits and Discord servers and evangelize hard).


How It Works Under the Hood

The technical architecture has five distinct layers that need to work together cleanly for the experience to feel seamless.

1. Content ingestion and cataloguing. Doomersion needs a corpus of short-form videos in target languages, tagged by topic, difficulty level, speaker accent, and speech rate. At launch, this is almost certainly a mix of licensed content and scraping from platforms that allow it (YouTube's Data API, for instance, exposes metadata for public videos, although it does not serve the video files themselves). Every video needs to be run through a language detection pass, a CEFR-level classifier, and a speaker clarity filter: a native speaker talking fast in a heavy dialect is not A1 content, even if the vocabulary is simple.

2. Automatic speech recognition and word-level timestamping. The clickable subtitle system requires knowing not just what was said but exactly when each word appears. OpenAI's Whisper (or a fine-tuned variant) handles transcription well; the trickier engineering challenge is getting reliable word-level timestamps, because Whisper's default output is timestamped per segment rather than per word. You need forced alignment: running the transcript against the audio waveform to pin each word to its exact frame. Libraries like whisperX or ctc-forced-aligner solve this, but they add latency to the ingestion pipeline.

3. Vocabulary enrichment layer. Once you have a timestamped word list, each word needs to be resolved against a dictionary to pull definitions, part-of-speech tagging, example sentences, and audio pronunciation. For Japanese, this means dealing with furigana, kanji decomposition, and the fact that "word boundaries" don't exist in written form — you need a tokenizer (MeCab or SudachiPy) before you can look anything up. This layer is language-specific and non-trivial to build for every language in the catalog.

4. The recommendation engine. This is where the moat lives or dies. The feed algorithm needs to balance several competing signals: comprehensibility (is this video at the right level?), engagement (does the user finish videos like this one?), novelty (introducing new vocabulary rather than repeating known words), and spaced repetition (re-surfacing content that reviews vocabulary the user is about to forget). This is a multi-armed bandit problem layered on top of a knowledge graph. A naive implementation will get you 60% of the way there; the final 40% — the part that makes the app feel like it knows you — requires millions of user interactions to train against.

5. Spaced repetition system (SRS) for flashcards. Anki popularized it with its SM-2-based scheduler, and the concept is decades old: show a card right before you're about to forget it, and the gap before the next review can grow longer each time. Doomersion's SRS is passive: vocabulary gets added to the deck automatically as you watch, and the algorithm schedules reviews by surfacing videos that contain words you're about to forget. No explicit flashcard drilling required, which is exactly what the Duolingo-burnout crowd wants to hear.

On the client side, both iOS and Android apps need a high-performance video player that can render subtitle overlays with touch targets at word-level granularity, handle seamless scroll transitions without buffering, and maintain local state for the SRS scheduler offline. This is custom video player territory — you can't just drop in AVPlayer or ExoPlayer and call it done.
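To make the word-level subtitle idea concrete, here is a minimal sketch of the lookup a player could run on every playback tick to decide which word to highlight or make tappable. The `TimedWord` type, the `active_word` function, and the sample timings are hypothetical illustrations, not Doomersion's code; the sketch assumes the aligner emits non-overlapping words sorted by start time.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TimedWord:
    text: str
    start: float  # seconds from the start of the clip
    end: float


def active_word(words: Sequence[TimedWord], t: float) -> Optional[TimedWord]:
    """Return the word being spoken at time t, or None during a pause.

    Assumes `words` is sorted by start time and that words do not overlap.
    """
    starts = [w.start for w in words]
    i = bisect_right(starts, t) - 1  # index of the last word starting at or before t
    if i >= 0 and t < words[i].end:
        return words[i]
    return None


# Illustrative timings only; real values would come from forced alignment.
words = [
    TimedWord("今日", 0.00, 0.42),
    TimedWord("は", 0.42, 0.55),
    TimedWord("暑い", 0.80, 1.30),
]
```

A binary search keeps the lookup cheap even for long transcripts; a production player would likely precompute `starts` once per video rather than on every call.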


Difficulty Score

| Dimension | Score | Why |
|---|---|---|
| ML / AI | 7 / 10 | ASR with forced alignment, CEFR classification, multi-signal recommendation engine, SRS scheduling: all off-the-shelf components, but assembling them coherently is hard |
| Data | 8 / 10 | Content curation at the right difficulty levels for each language is the slow, grinding, non-automatable part of this business |
| Backend | 5 / 10 | Standard API, user state management, nothing exotic, but the ingestion pipeline has real complexity |
| Frontend | 7 / 10 | Custom video player with word-level touch targets and frame-accurate subtitle overlay is genuinely tricky to build smoothly |
| DevOps | 4 / 10 | Standard mobile + cloud deployment; the ingestion pipeline needs GPU inference nodes but nothing unusual |

The Moat

Let's be honest: the core concept isn't defensible. "Show level-appropriate foreign-language videos" is not a patentable idea. Duolingo could ship a TikTok-style feed tomorrow if it wanted to. YouTube could add a "language learning mode" to Shorts by Tuesday. TikTok already surfaces content by language. None of them have done it in a focused, pedagogically-coherent way, but that's a product decision, not a capability gap.

So what does Doomersion actually have?

The content corpus. Curating thousands of videos tagged by difficulty, accent, topic, and vocabulary density is not something you can do with a model alone — you need human review at the edges and a feedback loop from user behavior to validate classifications. Doomersion has a head start on this that will take any new entrant 12-18 months to close.

The engagement data. Every scroll, tap, pause, and replay is a training signal. After a few million user sessions, Doomersion's recommendation model will understand what makes a video "feel" easy or hard to a Japanese learner at JLPT N4 in a way that no cold-start competitor can replicate. This is the standard "more users → better product → more users" flywheel, and it's as real here as anywhere.

Community and brand among language learners. Mostafa earned his credibility on r/LearnJapanese long before YC. The language learning community is small, passionate, and extremely vocal. Word of mouth in this market is fast. Being "the app built by someone who actually learned Japanese this way" is a genuine distribution advantage over a feature shipped by a product manager at Duolingo.

What's easy to replicate: the TikTok-style UI, basic ASR subtitles, a simple SRS flashcard system, the freemium paywall. A motivated developer could have a functional clone running in 4-6 weeks.

What's hard to replicate: the content quality and classification, the recommendation engine's training data, the community trust. These compound over time and don't respond to capital alone.


Replicability Score: 40 / 100

The tech stack is accessible. Whisper is open source. TikTok-style feeds have been rebuilt by indie devs dozens of times. SRS algorithms are in textbooks. You can absolutely clone the surface in a few weeks of focused work.

What you can't clone is the catalog and the model. Content curation for a dozen languages at six CEFR levels is genuinely laborious. And the recommendation engine that makes the feed feel like it knows you — that's 12-18 months of user data minimum. An indie dev can ship a working prototype; they cannot ship the moat.

The bigger risk to Doomersion isn't a clone. It's Duolingo, which has the catalog, the brand, the distribution, and the incentive to add a scroll feed to its existing app. The window to establish deep user habits before that happens is probably 18-24 months. The 15,000 downloads in two weeks suggests they're moving fast enough to matter. Whether it's fast enough to matter before the incumbents wake up is the actual question.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.

Build This Startup with Claude Code

Complete replication guide — install as a slash command or rules file

# Build a Doomersion Clone with Claude Code
## 7-Step Developer Guide

Build a language-learning short-form video app with adaptive difficulty, interactive subtitles, and spaced repetition flashcards.

---

### Step 1: Database Schema & Core Models

Set up PostgreSQL with the following key tables: users, user_language_profiles (cefr_level, vocabulary_size), videos (transcript JSONB with word-level timestamps, cefr_level, embedding VECTOR), vocabulary (definitions JSONB, frequency_rank), user_vocabulary (SM-2 fields: ease_factor, interval_days, repetitions, due_at), watch_events (completion_ratio, taps, saved_words).

### Step 2: Content Ingestion Pipeline

Ingest pipeline: yt-dlp download -> whisperX transcription (word-level timestamps via forced alignment) -> language detection -> CEFR classification via frequency-list scorer -> vocabulary extraction (MeCab for Japanese, spaCy for European languages) -> embedding with text-embedding-3-small. Run on GPU workers (Modal/RunPod A10G) via BullMQ queue.
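One way the "frequency-list scorer" step could look is sketched below: pick the lowest CEFR band whose assumed vocabulary covers most of the transcript. The band sizes in `BAND_VOCAB_SIZE`, the 90% coverage threshold, and the function name `estimate_cefr` are placeholder assumptions for illustration, not calibrated values or part of any library. Tokenization (MeCab, spaCy) is assumed to have happened upstream.

```python
# Hypothetical cut-offs: "a learner at this band knows roughly the top-N words".
# These numbers are placeholders, not calibrated values.
BAND_VOCAB_SIZE = [("A1", 500), ("A2", 1000), ("B1", 2000), ("B2", 4000), ("C1", 8000)]


def estimate_cefr(tokens, frequency_rank, coverage=0.9):
    """Return the lowest band whose vocabulary covers `coverage` of the tokens.

    tokens: list of lemmas from the transcript.
    frequency_rank: dict mapping lemma -> rank (1 = most frequent word).
    Words missing from the list are treated as rarer than any band.
    """
    if not tokens:
        raise ValueError("transcript has no tokens")
    ranks = [frequency_rank.get(tok, float("inf")) for tok in tokens]
    for band, vocab_size in BAND_VOCAB_SIZE:
        covered = sum(1 for r in ranks if r <= vocab_size)
        if covered / len(ranks) >= coverage:
            return band
    return "C2"  # nothing below C2 covers enough of the transcript


# Toy frequency list for demonstration only.
toy_ranks = {"the": 1, "cat": 450, "sat": 900, "quantum": 7000}
```

A scorer like this only sees vocabulary. Speech rate and accent, which the article flags as part of difficulty, would need separate signals.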

### Step 3: Recommendation Algorithm

Score each candidate video: comprehensibility score (target 12% unknown vocabulary per i+1 hypothesis), SRS overlap score (bonus for videos containing due vocabulary), topic affinity score (from watch history). Pre-compute CEFR-band candidate sets, cache 15min, re-rank per user at request time.
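The three signals above could be combined roughly as follows. This is a sketch under stated assumptions: the linear fall-off around the 12% target, the weights `(0.6, 0.3, 0.1)`, and the names `Candidate`, `score`, and `rank` are all hypothetical, and the unknown ratio is computed over unique lemmas rather than running tokens for simplicity.

```python
from dataclasses import dataclass

TARGET_UNKNOWN = 0.12  # from the guide: aim for about 12% unknown vocabulary


@dataclass(frozen=True)
class Candidate:
    video_id: str
    lemmas: frozenset  # unique lemmas in the transcript
    topic: str


def comprehensibility(unknown_ratio, target=TARGET_UNKNOWN):
    """1.0 when the unknown ratio hits the target, falling linearly to 0.0."""
    return max(0.0, 1.0 - abs(unknown_ratio - target) / target)


def score(video, known, due, topic_affinity, weights=(0.6, 0.3, 0.1)):
    """Weighted sum of the three signals; the weights are placeholders to tune."""
    if not video.lemmas:
        return 0.0
    unknown_ratio = len(video.lemmas - known) / len(video.lemmas)
    # Fraction of the user's due words that this video would review.
    srs_overlap = len(video.lemmas & due) / len(due) if due else 0.0
    affinity = topic_affinity.get(video.topic, 0.0)
    w_comp, w_srs, w_topic = weights
    return w_comp * comprehensibility(unknown_ratio) + w_srs * srs_overlap + w_topic * affinity


def rank(candidates, known, due, topic_affinity):
    """Re-rank a pre-computed candidate set for one user, best first."""
    return sorted(candidates, key=lambda v: score(v, known, due, topic_affinity), reverse=True)
```

Keeping the weights as a parameter fits the guide's later advice to A/B test recommendation weights, since each experiment arm can pass its own tuple.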

### Step 4: API Design

Core endpoints: GET /feed/{language} (next 20 videos, p95 < 80ms via Redis cache), POST /feed/{language}/watch (log event + SRS update), GET /vocabulary/{language}/{word} (definition lookup with 30-day Redis TTL), POST /vocabulary/{id}/save, GET /vocabulary/due (SRS review queue), POST /languages/{language}/start, GET /profile.

### Step 5: Mobile Client (React Native)

Custom video player: expo-video for playback, FlatList vertical paging for swipe-to-next, frame-accurate subtitle overlay with word-level TouchableOpacity elements positioned absolutely over video using word timestamps. Pre-buffer next video during current playback. Offline SRS via expo-sqlite or WatermelonDB.

### Step 6: Spaced Repetition Engine

Implement SM-2: ease_factor starts at 2.5, adjusts per grade (0-5). Grade inferred from behaviour: repeated tap = 2 (forgot), completed without tap = 4 (known). interval_days: 1 -> 6 -> interval*ease_factor. due_at = now() + interval. Surface due vocabulary in feed recommendations rather than explicit drill sessions.
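The update rule described above can be sketched in a few lines. The field names follow the `user_vocabulary` columns from Step 1; `CardState`, `infer_grade`, and `review` are hypothetical names. This sketch adjusts the ease factor on every review, which is a common implementation choice, though some variants leave it unchanged after a failed review.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

MIN_EASE = 1.3  # SM-2 floor for the ease factor


@dataclass
class CardState:
    ease_factor: float = 2.5
    interval_days: int = 0
    repetitions: int = 0
    due_at: Optional[datetime] = None


def infer_grade(tapped_again: bool) -> int:
    """Map passive behaviour to a grade, using the two cases from the guide."""
    return 2 if tapped_again else 4


def review(card: CardState, grade: int, now: datetime) -> CardState:
    """Return the card's next state after a review graded 0-5."""
    if not 0 <= grade <= 5:
        raise ValueError("grade must be in 0..5")
    if grade >= 3:  # successful recall: 1 day, then 6 days, then interval * ease
        if card.repetitions == 0:
            interval = 1
        elif card.repetitions == 1:
            interval = 6
        else:
            interval = round(card.interval_days * card.ease_factor)
        repetitions = card.repetitions + 1
    else:  # forgotten: restart the sequence
        interval, repetitions = 1, 0
    ease = card.ease_factor + (0.1 - (5 - grade) * (0.08 + (5 - grade) * 0.02))
    ease = max(MIN_EASE, ease)
    return CardState(ease, interval, repetitions, now + timedelta(days=interval))
```

Because `review` returns a new state instead of mutating the card, the same function can run on the server and in an offline client queue, with the results reconciled later.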

### Step 7: Deployment & Scaling

Stack: Railway/Fly.io for API, Supabase+pgvector for DB, Upstash Redis for feed cache + vocab cache, Cloudflare R2+Stream for video delivery, Modal/RunPod for ingestion GPU workers, Expo EAS for mobile CI/CD. Seed 500+ videos per language per CEFR level before launch. Add NSFW classifier to ingestion pipeline. A/B test recommendation weights from day one.