Beyond "Just Call an LLM": Vimeo's 7-Layer Architecture for Production Video Subtitles and Dubbing
17th December 2025
👋 Hello there, we’re Dave, Josh, and Anu. Welcome to AI in Production Field Notes, where we provide real-world lessons from teams deploying enterprise-grade AI systems.
Each week, we unpack a real-world case study, focusing on the tools, frameworks, and implementation patterns used in production. No fluff.
🔥 The Problem
You’ve got transcription working, complete with timestamps. You press publish on the Spanish version. Then you realize the marketing team’s 2-minute promo is now 3 minutes in German because the translation needs far more syllables to carry the same meaning. Your dubbed audio is sped up to cartoon territory, and your subtitles are half a second out of sync with what viewers hear.
Your stakeholders ask: “Can’t we just use an LLM?” Yes. But production systems at scale require more. Let’s explore.
Why This Matters ( & Where it Breaks in Production)
The moment you scale from 10 videos to 10,000, every edge case becomes operational debt.
The real cost isn’t just redoing the work. It’s stakeholder trust. At Vimeo, one in three feature requests is for subtitles. But demand is for high-quality, correctly synchronized subtitles that don’t distract from the message and don’t require any overhead.
This week, we break down how Tanushree Nori and her team at Vimeo built a production-grade system that generates reliable subtitles and dubbed audio at scale, earning the trust of customers and stakeholders alike. Enjoy!
📊 By The Numbers
200+ e-learning videos translated to Hindi and Arabic in weeks, not months
5 deployment configurations (chunking strategies, model selection, validation rules) tested per language pair
240% length overruns observed when translating English to German without phonetic constraints
7 discrete components in Vimeo’s subtitle orchestration pipeline (each solving a specific failure mode)
3 evaluation metrics built using LLM-as-judge with structured reasoning for quality gates
Source: Tanushree Nori, Principal Data Scientist, Vimeo | Toronto Machine Learning Summit 2025
🔥 So What Breaks? “The AI Iceberg Problem”
LLMs can translate. Everyone knows this. Prompt “Translate this to Spanish” and Spanish text appears. Done. But at scale, the reality looks nothing like the demo.
What’s hidden under the surface: chunking that preserves semantic context, phonetic length overruns, output-format failures, entity consistency, timing synchronization, and the evaluation framework needed to prove any of it works.
Keeping this in mind, let’s explore the framework used.
🧠 Solution: Vimeo’s subtitle translation pipeline
Vimeo’s approach treats translation not as a single LLM task, but as a sequence of well-defined sub-problems, each with its own solution. Each layer addresses a specific failure mode discovered in production. Remember, at scale it gets complicated, but if you can figure it out you’ll have earned trust and production credibility.
The system runs 5-7 separate validation and retry loops, with parallel execution throughout. Let’s explore each.
1️⃣ INTELLIGENT CHUNKING
What it solves:
The transcript arrives as a blob. You can’t dump hours of content into an LLM—context window quality degrades with volume. Naive chunking (hard splits at 10 seconds, 30 seconds, or by token count) destroys translation quality by fragmenting semantic context.
Intelligent chunking solves this by splitting at natural boundaries while preserving conversational coherence.
Techniques:
Sentence-aware boundaries — Never cut mid-sentence. This preserves idiomatic translations instead of forcing literal word-for-word output.
Speaker-aware cuts — In multi-speaker transcripts, keep one speaker’s complete turn within a single chunk. If Speaker A starts a thought in Chunk 1 and continues in Chunk 2, the LLM loses dialogue register and produces inconsistent pronouns, tone, and word choice.
Language homogeneity — Rare but real: some transcripts have code-switching (multiple languages). Keep each language isolated within chunks. Don’t split a sentence between English and Spanish in two separate chunks.
Real example from the talk:
Vimeo tested naive time-based splits on e-learning videos. When a sentence was split midway (“the server” in one chunk, “timed out” in the next), German translations appeared garbled and disconnected. Switching to speaker- and sentence-aware boundaries on the same content improved quality. The key insight: treat chunking as a first-class citizen, not a commodity.
Why naive chunking breaks:
❌ Hard token limits → Sentences chopped mid-thought, causing fragmented context for the LLM
❌ Time-based splits (every 10 seconds) → Speakers get cut off mid-sentence
❌ Lack of speaker awareness → A speaker’s dialogue divided into segments, losing the natural conversational tone
What works:
✅ Speaker-aware chunking → Keep full speaker turns intact (preserves tone, pronoun consistency)
✅ Language-aware chunking → Maintain homogeneous language segments if the transcript contains multiple languages
✅ Punctuation-aware splits → End chunks at natural sentence boundaries and pauses
✅ Semantic coherence → Complete thoughts stay together
Key insight: Treat chunking as a core feature, not a library you import. Optimize it iteratively based on translation quality metrics.
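The chunking rules above can be sketched in a few lines. This is a minimal illustration, not Vimeo’s implementation: it cuts only when a chunk is over budget, preferring completed speaker turns and falling back to sentence boundaries. The `Segment` type and the `max_chars` budget are assumptions for the sketch.

```python
import re
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str
    text: str

def chunk_segments(segments, max_chars=1500):
    """Group transcript segments into chunks that respect speaker turns
    and sentence boundaries instead of hard token/time limits."""
    chunks, current, current_len = [], [], 0
    for seg in segments:
        over_budget = bool(current) and current_len + len(seg.text) > max_chars
        # Did the previous segment end a sentence?
        ends_sentence = bool(current) and bool(
            re.search(r'[.!?]["\')]?\s*$', current[-1].text))
        # Is the speaker changing, i.e. is a full turn complete?
        turn_boundary = bool(current) and seg.speaker != current[-1].speaker
        # Prefer cutting at a completed speaker turn; fall back to a
        # sentence boundary if a single turn exceeds the budget.
        if over_budget and (turn_boundary or ends_sentence):
            chunks.append(current)
            current, current_len = [], 0
        current.append(seg)
        current_len += len(seg.text)
    if current:
        chunks.append(current)
    return chunks
```

The key design choice: the budget alone never triggers a cut; a semantic boundary must also be present, which is exactly what keeps turns and sentences intact.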
2️⃣ PHONETIC-GUIDED REWRITING
What it solves:
Italian and Spanish are syllable-timed languages; Japanese is mora-timed; English is stress-timed. Different rhythm systems encode the same thought in different physical durations. When you translate “Keep it up” to German, the result is often 240% longer phonetically. Your dubbing pipeline then either truncates (losing meaning) or speeds up playback (unwatchable).
This layer detects length overruns and reruns translation with numerical constraints.
Techniques:
Phoneme comparison — Extract phonetic representations of source and target text. Measure the ratio: target phonemes/source phonemes. Flag anything >120% as a rewrite candidate.
Numerical length constraints — Don’t ask the LLM “Please make it shorter.” Say “reduce by 25% relative to the source.” Numerical targets outperform vague instructions.
Retry loop with backup models — If phonetic constraints fail on a cheaper model, escalate to a more expensive one (e.g., Claude → GPT-4o).
Real example from the talk:
Vimeo’s system flagged a German translation of a short English segment as 240% longer phonetically. The prompt was reissued: “Reduce this translation by 30% while preserving meaning.” The second attempt fit the time window. Without this loop, the dubbed video would play at 1.25x or 1.5x speed, creating a broken user experience.
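A hedged sketch of the overrun check described above. Real phoneme extraction would use a grapheme-to-phoneme tool; here a crude vowel-group syllable count stands in, while the 120% threshold and the percentage-based rewrite prompt follow the pattern from the talk.

```python
import re

def syllable_count(text):
    """Crude stand-in for phoneme extraction: count vowel groups.
    A production system would use a real grapheme-to-phoneme tool."""
    return max(1, len(re.findall(r"[aeiouyäöü]+", text.lower())))

def needs_rewrite(source, target, threshold=1.2):
    """Flag translations whose phonetic length exceeds the source by
    more than the threshold (120%, per the pipeline above)."""
    return syllable_count(target) / syllable_count(source) > threshold

def rewrite_prompt(target, ratio):
    """Numerical constraint: ask for a concrete percentage reduction,
    never a vague 'make it shorter'."""
    cut = round((1 - 1 / ratio) * 100)
    return f"Reduce this translation by {cut}% while preserving meaning:\n{target}"
```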
3️⃣ VALIDATION LOOP WITH SCHEMA ENFORCEMENT
What it solves:
LLMs are stochastic. The same prompt produces different outputs. Early Vimeo experiments used JSON prompts (“please return valid JSON”). Subtle errors crept in: extra commas, mismatched quotes, stray characters. Production broke silently.
This layer uses structured outputs and formal validation to eliminate surprises in output formats.
Techniques:
Function calling over JSON prompts: Use the LLM’s native structured output API (e.g., Claude’s tool_use, GPT-4o’s response_format). Define the schema as formal function parameters, not as strings in the prompt. This is non-negotiable for production.
Validation function passed to execution: The validation logic lives inside the orchestration function (not in separate monitoring code). If validation fails, an immediate retry is triggered before the chunk moves downstream.
Two-tier configuration: First attempt uses a cheap, fast, aggressive model. If validation fails, retry with an expensive, conservative model. The system guarantees production-ready output.
Real example from the talk:
Early experiments with JSON prompts failed 2-3% of the time on large batches. Stray backslashes, extra quotes, missing fields. Switching to function calling brought failure rate to near zero. Over thousands of daily videos, this 2-3% becomes hundreds of broken subtitles.
Eliminate Ambiguity with Strict Schemas
❌ Wrong: “Please give us JSON output.”
✅ Right: Define JSON schema explicitly. Use function calling. Pass the schema as a formal parameter to LLM’s response object.
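A vendor-neutral sketch of the validate-and-retry loop. `call_model` is a hypothetical wrapper around an LLM structured-output API; the schema, field names, and cheap-first config order are illustrative, not Vimeo’s actual code.

```python
import json

# Illustrative schema: field name -> required Python type.
SUBTITLE_SCHEMA = {"index": int, "start": float, "end": float, "text": str}

def validate_subtitle(obj):
    """Return a list of schema problems; an empty list means valid."""
    problems = []
    for key, typ in SUBTITLE_SCHEMA.items():
        if key not in obj:
            problems.append(f"missing field: {key}")
        elif not isinstance(obj[key], typ):
            problems.append(f"wrong type for {key}: {type(obj[key]).__name__}")
    return problems

def translate_with_retry(call_model, chunk, configs):
    """Try each config (cheap model first, expensive on retry) until the
    output parses and passes validation."""
    errors = []
    for config in configs:
        raw = call_model(chunk, config)
        try:
            obj = json.loads(raw) if isinstance(raw, str) else raw
        except json.JSONDecodeError as exc:
            errors.append(str(exc))
            continue
        problems = validate_subtitle(obj)
        if not problems:
            return obj
        errors.append("; ".join(problems))
    raise ValueError(f"all configs failed: {errors}")
```

With native structured outputs the parse step rarely fails, but keeping the validation loop anyway is what turns a 2-3% silent failure rate into an explicit retry.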
4️⃣ PROMPT ENGINEERING WITH HARD TASK DECOMPOSITION
What it solves:
A monolithic prompt like “Please translate this to Spanish, consider tone, accuracy, and timing” forces the LLM to juggle competing goals in one pass, and the results are inconsistent. Instead, structure the prompt as a series of discrete steps the LLM executes sequentially.
Techniques:
Chain-of-thought as pseudo-code: Break the task into named steps: (1) Translate to German, (2) Extract phonetic representation, (3) Compare phoneme counts, (4) If count exceeds 120%, shorten; else approve. This reads like pseudocode, not rambling instructions.
Structured reasoning within a single prompt — For some tasks, it’s better to keep everything in one LLM call (cheaper, faster). Use structured reasoning: ask for a list of problems, score each problem, and provide a final score. Even though you don’t use the intermediate outputs, the reasoning improves final output quality.
Prompt class versioning: Treat prompts as versioned software. Instead of editing prompt strings scattered across notebooks, use a Prompt class with attributes (register="formal", language_count=3, speaker_aware=True). All instances inherit those attributes. Changes propagate automatically.
Real example from the talk:
1. Translate to German
2. Generate a phoneme representation of the German
3. Generate a phoneme representation of the English source
4. Compare how far apart they are
5. If far apart → shorten
6. If not → proceed
7. Split into lines based on initial transcript timestamps
❌ Wrong: “Think step by step” / “Think deeply” / “Pause and reflect” (vague urging)
✅ Right: Write it like pseudocode. Clear execution plan.
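One way to sketch prompt-class versioning together with the pseudocode-style step plan. The attribute names (register, speaker_aware) and step wording are illustrative assumptions, not Vimeo’s actual class:

```python
from dataclasses import dataclass

@dataclass
class TranslationPrompt:
    """Prompts as versioned software: shared attributes on a class,
    not strings scattered across notebooks."""
    version: str = "2.1"
    register: str = "formal"
    speaker_aware: bool = True
    target_language: str = "German"
    steps: tuple = (
        "Translate the text to {target_language}.",
        "Generate a phoneme representation of the translation.",
        "Generate a phoneme representation of the source.",
        "If the translation's phoneme count exceeds the source's by more than 20%, shorten it.",
        "Split the result into lines based on the source timestamps.",
    )

    def render(self, text):
        # Pseudocode-style plan: numbered steps, not vague urging.
        plan = "\n".join(
            f"{i}. {step.format(target_language=self.target_language)}"
            for i, step in enumerate(self.steps, 1))
        return (f"Register: {self.register}\n"
                f"Execute these steps in order:\n{plan}\n\nText:\n{text}")
```

Changing `register` or a step on the class propagates to every prompt instance, which is the whole point of treating prompts as code.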
5️⃣ CUSTOM VOCABULARY (CONTEXT INJECTION)
What it solves:
Your source transcript has “Boeing 747” three times. The speech-to-text model transcribes it as “Boeing seven forty seven” once and “Boeing 747” twice. Consistency is shattered before translation even begins. Then the LLM translates these variants differently.
This layer provides a custom vocabulary dictionary for both transcription and translation steps.
Techniques:
Entity mapping at transcription: Pass a list of entity variations (Boeing 747, Boeing seven forty seven, B747) to the speech-to-text model. The model corrects inconsistencies post-transcription.
Same mapping at translation: The same entity list is appended to the translation prompt context. The LLM treats all three as a single entity and translates consistently.
Client-provided glossaries: For B2B use cases, allow clients to upload brand-specific terms, product names, and proper nouns that should never be translated.
Don’t assume LLMs understand niche terminology. Provide glossaries directly in the context window.
Real example from the talk:
Vimeo client had a company name “Aeron” that the speech-to-text model sometimes rendered as “AERON,” “Aeron,” and “A-R-O-N.” Without intervention, the Spanish subtitles had three different variations. Adding Aeron to the custom vocabulary fixed it at transcription and reinforced it at translation.
Why it works: Prevents variability from how words are pronounced. Fixes issues before they flow downstream.
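The entity-mapping idea can be sketched as two small helpers: one normalizes variants after transcription, the other renders the same mapping as prompt context for the translation step. The vocabulary format (canonical form mapped to observed variants) is an assumption for the sketch.

```python
def apply_vocabulary(text, vocabulary):
    """Normalize entity variants to one canonical form after
    transcription, before translation. vocabulary: canonical -> variants."""
    for canonical, variants in vocabulary.items():
        for variant in variants:
            text = text.replace(variant, canonical)
    return text

def glossary_context(vocabulary):
    """Render the same mapping as prompt context for the translation step,
    so the LLM treats every variant as a single entity."""
    lines = [
        f"- '{canonical}' (also seen as: {', '.join(variants)}) is a proper noun; "
        f"always render it as '{canonical}'."
        for canonical, variants in vocabulary.items()
    ]
    return "Glossary:\n" + "\n".join(lines)
```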
6️⃣ PARALLEL EXECUTION WITH ASYNC ORCHESTRATION
What it solves:
Chunking produces N chunks. Each chunk requires translation, validation, phonetic checking, and, if needed, a retry. Running these sequentially is impossibly slow. This layer executes chunks in parallel while preserving error handling and backoff logic.
Techniques:
Async over multi-threading: Use async/await, not OS-level threads. Async has better timeout handling and integrates cleanly with external LLM APIs.
Execution wrapped in a single function: Instead of having validation occur elsewhere, orchestration takes place within a single function. Validation fails → immediate retry with a new config. All retries stay within the same logical unit.
Conversational retries: Don’t just re-run the same prompt. Turn it into a multi-turn conversation: “I got [output]. The problem is [issue]. Here’s the target: [goal]. Try again.”
Return exceptions list, don’t throw: In production, crashing on one chunk breaks the whole job. Return a list of exceptions for each chunk. Downstream can decide: retry failed chunks, flag for review, or use a fallback.
Real example from the talk:
Vimeo processes 10,000+ videos daily. Sequential chunk processing (10,000 videos × 50 chunks × 2-5 seconds per LLM call) is untenable. Parallel async: ~5 minutes for the same batch. Conversational retries (rather than blind reruns) improve quality by 8-12%.
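A minimal asyncio sketch of the return-exceptions pattern: `asyncio.gather(..., return_exceptions=True)` lets one failed chunk surface as an exception object in the results list instead of crashing the batch. `translate` here is a hypothetical async LLM call.

```python
import asyncio

async def process_all(chunks, translate):
    """Translate all chunks concurrently. return_exceptions=True means a
    failed chunk yields an exception object in the results list instead
    of aborting the whole job; downstream decides whether to retry,
    flag for review, or fall back."""
    results = await asyncio.gather(
        *(translate(chunk) for chunk in chunks), return_exceptions=True)
    ok = [r for r in results if not isinstance(r, Exception)]
    failed = [(c, r) for c, r in zip(chunks, results) if isinstance(r, Exception)]
    return ok, failed
```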
7️⃣ EVALUATION FRAMEWORK WITH LLM-AS-JUDGE
What it solves:
You change the chunking algorithm. Does output quality improve or degrade? You introduce phonetic constraints. Do timestamps improve? You switch models. How does accuracy compare? Manual review of 100+ clips is untenable.
This layer uses LLM-as-judge with structured reasoning to automatically evaluate hundreds of videos.
Techniques:
Structured reasoning in evaluation: Don’t ask the LLM to score 1-10. Ask it to: list all problems, score each problem, then provide a final score. The intermediate reasoning improves accuracy.
Accuracy metric: Core measure. LLM checks: Does the translation preserve meaning from the source? Any information loss?
Correspondence metric: Do timestamps align? Does the translated text fit the audio window?
Phoneme count metric: Is the target language overrun? Structured reasoning checks: source phoneme count, target phoneme count, ratio, within threshold?
Back-translation validation: Translate source → target → source again. Compare the original and the back-translated text. Any divergence indicates problems in the forward translation.
Bulk evaluation with HTML and CSV export: Run experiments across 50 languages, 3 models, 100 clips. Export results as both human-readable HTML (visual inspection) and CSV (metrics and aggregation).
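Back-translation validation from the list above can be approximated cheaply. This sketch uses word-overlap (Jaccard similarity) as a stand-in divergence measure; a production system would use an LLM judge or embedding similarity instead.

```python
def back_translation_score(original, round_tripped):
    """Word-overlap (Jaccard) between the original text and its
    source -> target -> source round trip; low overlap suggests the
    forward translation lost or distorted meaning."""
    a, b = set(original.lower().split()), set(round_tripped.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0

def flag_divergent(original, round_tripped, threshold=0.6):
    """Flag clips whose round trip diverges past the (assumed) threshold."""
    return back_translation_score(original, round_tripped) < threshold
```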
Real example from the talk:
Vimeo noticed that some language pairs (English-to-Japanese, English-to-Korean) mysteriously merged separate English sentences into single target sentences. Running bulk evaluation with LLM-as-judge across 500 clips revealed the pattern. Structured reasoning showed: “Source has two sentences separated by period. Target has one sentence joined by a conjunction word. Information is preserved but structure is altered.” This insight drove fixes to speaker-aware chunking.
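A sketch of the structured-reasoning judge pattern: the prompt demands a problem list with severities before the final score, and the parser rejects output that skips the intermediate reasoning. The JSON field names and the gate threshold are assumptions for illustration.

```python
import json

JUDGE_PROMPT = """Evaluate this translation against the source.
1. List every problem you find.
2. Score each problem's severity from 1 (minor) to 5 (critical).
3. Only then give a final quality score from 1 to 10.
Return JSON: {"problems": [{"description": ..., "severity": ...}], "final_score": ...}"""

def parse_judge_output(raw):
    """Accept the judge's output only if the intermediate reasoning
    (the problem list) is present alongside the final score."""
    obj = json.loads(raw) if isinstance(raw, str) else raw
    if not isinstance(obj.get("problems"), list):
        raise ValueError("judge skipped the problem list")
    if not 1 <= obj.get("final_score", 0) <= 10:
        raise ValueError("final_score out of range")
    return obj

def passes_gate(judgement, threshold=8):
    """Quality gate: ship only if the judge's final score clears the bar."""
    return judgement["final_score"] >= threshold
```

Even though only `final_score` gates the pipeline, requiring the problem list is what makes the score trustworthy, per the structured-reasoning finding above.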
🎯 The Decision Framework (Use This)
Step 1: Start With a Direct LLM Call
Test your use case with a simple prompt. No orchestration. Just:
“Translate this to [target language]: [your text]”
If quality is acceptable and timing doesn’t matter: Stop. Ship it.
Step 2: Add Schema Validation
If the output format is inconsistent (sometimes JSON, sometimes broken), add:
Strict schema definition
Function calling
Validation loop with retry
If this fixes reliability: Stop here.
Step 3: Add Chunking Intelligence
If long-form content produces poor translations:
Implement speaker-aware chunking
Add language-aware divisions
Ensure semantic coherence
Test: Does output quality improve?
Step 4: Add Timing Synchronization
If you’re creating subtitles/dubbing:
Implement phoneme-guided rewrites
Add numerical length constraints
Build coalesce + timing check step
Test: Are subtitles in sync with audio?
Step 5: Build Evaluation Framework
If you’re iterating on quality:
Bulk testing across clips/languages
Reverse translation validation
LLM-as-judge with structured reasoning
Test: Can you quantify improvements?
Step 6: Implement Language-Specific Fixes
If specific languages consistently encounter issues:
Japanese/Korean sentence merging
German length expansion
Custom vocabulary for entities
Test: Do language-specific adjustments improve these languages without causing regressions in others?
✅ When This Approach Actually Works
This architecture makes sense when:
1. Quality must be production-grade
50% accuracy doesn’t cut it. Customers notice broken subtitles immediately. You need a 90%+ success rate before shipping.
2. Scale matters
Thousands of videos daily. Dozens of languages. You can’t manually QA everything. Automation + evaluation framework required.
3. Timing is non-negotiable
If your use case doesn’t care about synchronization (e.g., document translation), you can skip phoneme-guided rewrites. But for subtitles/dubbing, timing sync is mandatory.
4. Multi-language support is critical
If you’re only supporting English→Spanish, you can hardcode fixes. But supporting 20+ languages? You need generalizable patterns (reverse translation validation, structured reasoning, LLM-based phoneme counting).
5. Cost vs quality tradeoff exists
Cheap model first try (fast, affordable, works 70% of the time)
Expensive model on retry (slower, costly, works 95% of the time)
This two-tier approach optimizes for cost while maintaining quality.
❌ When This Is Overkill
Don’t build this if:
1. Single-language, low-volume
If you’re translating 10 documents/day into a single language, use Claude/GPT directly. Don’t build orchestration.
2. Timing doesn’t matter
Translating blog posts? Marketing copy? No audio sync needed? Skip phoneme-guided rewrites. Just validate translation quality.
3. You can’t build an evaluation infrastructure
Without evaluation, you’re flying blind. You can’t iterate. You don’t know if changes improve quality. If you can’t invest in a robust eval framework, stick with simpler approaches.
4. Manual QA is acceptable
If your volume is low enough that humans can review every translation, you don’t need LLM-as-judge. Manual review is fine.
📚 Key Takeaways
1. A Single LLM call never works at scale
Demo: works. Production: breaks. Orchestration accounts for 95% of the work.
2. Chunking is a core feature, not a commodity
Optimize it iteratively. Speaker-aware, language-aware, and semantically coherent chunks dramatically improve output.
3. Numerical constraints > vague instructions
“Reduce by 20%” works. “Please make it shorter” doesn’t.
4. The evaluation framework is non-negotiable
Without it, you can’t iterate. You don’t know if changes improve quality.
5. Structured reasoning improves LLM-as-judge accuracy
“List problems → Score problems → Final number” outperforms direct scoring.
6. Async > multi-threading for parallel execution
Improved timeout handling. Critical for production reliability.
🎯 Who benefits?
Marketing Teams
Example: Marketing teams can upload one video and run multi-region campaigns
The effect: eliminates the need for pricey localization budgets.
E-Learning Providers
Real example from Vimeo: Client converting 200+ e-learning courses to Hindi and Arabic
The effect: unlocked two new markets, added accessibility features (subtitles for hearing-impaired learners), all automated.
Executives and Leaders
Real example: Executives at a global company need to provide updates to employees in their native languages.
The effect: communications are clearly understood at scale across the organization.
The common thread: These users demand speed, accuracy, and scale without the costs and bottlenecks of manual translation teams.
🚀 Vimeo’s Roadmap
Translation Fluency Metrics
Current focus: Preserving information. Next: enhancing fluency quality.
Even when translations are accurate, they can sound awkward. Building metrics to detect and fix this.
Heavy Use of Conversational Retries
Expand chat-based retry patterns to cover more potential failure modes.
Text-to-Text Translation Benchmarks
Now that they support all languages, begin applying standard benchmarks for those languages.
Solve Sentence Contiguity Issues
Address Japanese/Korean sentence-merging issues. Improve with language-specific chunking.
Name Translation Strategy
Some names have established localized forms in certain target languages (Voldemort is a famous example), while others should never be translated. Handling this well requires a systematic strategy, not ad-hoc rules.
📚 Resources & Tools
LLM APIs Used:
→ Claude and GPT-4o with structured outputs and function calling
Languages Supported:
→ Over 50 language pairs (specific list not provided in talk)
Evaluation Techniques:
→ Back-translation validation — Translate target → source, compare with original
→ Phonetic/phoneme extraction — Language-agnostic approach using LLM reasoning
Development Patterns:
→ Async orchestration — Better timeout handling than multi-threading for LLM calls
→ Prompt versioning as code — Treat prompts as software with class-based attributes
→ Iterative repair loops — Conversational retries (not blind re-runs) improve quality 8-12%
No external frameworks (LangChain, LangGraph, DSPy) used for this pipeline. The team built custom orchestration because production requirements (multi-language support, custom vocabulary, phonetic constraints) aren’t well-served by general-purpose libraries.
📹 Watch The Full Talk
👉 Watch: Toronto Machine Learning Summit 2025 – Subtitle and Dubbing Engine
Runtime: 32 minutes | Level: Intermediate/Advanced
Best for: ML engineers building translation systems, teams dealing with multi-language production pipelines
Key sections:
- 00:00-08:00: Problem context and customer use cases (e-learning, executive comms, global marketing)
- 08:00-18:00: Why naive chunking fails and speaker-aware boundaries matter
- 18:00-28:00: Phonetic-guided rewrites for syllable-timed languages
- 28:00-40:00: Prompt engineering techniques (task decomposition, function calling, structured reasoning)
- 40:00-50:00: Evaluation architecture with LLM-as-judge
- 50:00-60:00: Q&A on entity consistency, custom vocabulary, and real-time challenges
✉️ Next Week
Self-healing ML systems in production at Discover Financial Services are the focus as Kamal Singh Bisht, Principal Application Engineer, explains how autonomous MLOps pipelines detect drift, recover from failures, and maintain model health without constant human intervention. The session covers practical (proven) patterns for observability, automated rollback, and retraining.
📤 Worth Sharing?
Know somebody building translation features? Know someone debugging why their LLM translations work in notebooks but break in production?
Forward this → they’ll avoid months of trial and error.
📹 Additional: Things that caught our attention this week:
Measuring Agents in Production (arXiv paper)
A study by UC Berkeley, Intesa Sanpaolo, UIUC, Stanford University, and IBM Research, surveying 306 practitioners and conducting 20 in-depth case studies via interviews across 26 domains.
Thanks, and we’ll see you next week!
Written by Anupama Garani, David Scharbach, Josh Goldstein
AI in Production Field Notes | TMLS
Model Credit: This document was prepared with assistance from Claude, an AI language model developed by Anthropic.



