RAG Isn’t Dead, It’s Just Rarely Built for Production
Production-ready RAG architectures, context orchestration, and decision frameworks with Rajiv (ex-Hugging Face). Issue #4 | Jan 16 2026
Your RAG demo works perfectly. You built it in an afternoon. Simple vector database, basic chunking, semantic search. You tested it with a few queries off the top of your head. Accuracy looks great.
Then you deploy to production.
Suddenly, accuracy drops to 60%. Latency balloons to 45 seconds. Scaling from 10,000 documents to 100,000 breaks everything. Cost explodes. And, oh crap - you forgot about document permissions and compliance.
Your users ask questions that are completely different from the ones you tested. Your CEO types their name, expecting the system to recognize they’re the new CEO (it doesn’t; the corpus only contains historical documents). The “simple” demo that took an afternoon becomes a nightmare of embedding models, re-rankers, query re-writing, agentic loops, and metadata management.
Rajiv Shah, who spent years at Hugging Face working directly with the researchers who wrote the original RAG papers, ran a deep-dive workshop at MLOps World 2025. He’s now at Contextual AI building production RAG systems. The insights? Most teams are overcomplicating retrieval while ignoring the fundamentals that actually matter.
📊 By The Numbers
100+ RAG variations - Research papers describe hundreds of different RAG flavors
BM25 still wins - Simple keyword search outperforms many neural models in production benchmarks
93% vs 76% accuracy - Agentic RAG with multiple reasoning loops dramatically improves complex queries
50 seconds vs 5 seconds - The agentic approach’s accuracy gain comes with a massive latency cost
25 attempts — Why it’s called BM25 (Best Match 25th version—took 25 tries to get it right)
Source: Rajiv Shah, MLOps Engineer, Contextual AI (formerly Hugging Face) | @MLOps World 2025
🔥 The Demo-to-Production Gap
The classic RAG demo is dead simple:
Grab documents
Chunk them into pieces
Use embedding model for semantic similarity
Search and retrieve
Generate answer
Takes a few hours to build. Works great in your notebook.
Then reality hits.
Here’s how it breaks (and what you can anticipate)
1. Accuracy Collapse
You tested with a few hand-crafted queries. Real users ask completely different questions. Edge cases you never thought about. Phrasing variations you didn’t anticipate.
What you thought was 90% accurate drops to 60% when actual customers use it.
2. Latency Explosion
Your production pipeline has tons of extra calls. Embedding model → retrieval → reranker → generation. Maybe some query reformulation. Maybe multi-hop reasoning.
Suddenly queries take 45 seconds when users want 3-4 seconds maximum.
3. Scaling Infrastructure
Demo: 100 documents
Production: 10,000 documents
Scaling the ML infrastructure, the vector databases, the compute - all harder than you thought.
4. Cost Reality
How many LLM calls are you making per query? How expensive is your embedding model? Your reranker? Your generation model?
Monthly bills surprise you.
5. Compliance and Permissions (The Thing Nobody Talks About)
Not every document in a company can be read by everybody.
Your RAG needs entitlements. Permissioning. Document-level access control. User-level filtering.
Most tutorials skip this entirely. But it’s non-negotiable for enterprise deployment.
“These are the complexities that we often don’t hit when we build our own demo. But when you move it into production scale inside an enterprise, there are a lot of reasons why that same RAG demo doesn’t scale up very well.”
🧠 RAG Is a System, Not a Model
A common thought is: “I need a good embedding model.”
Wrong framing.
Effective RAG requires a complete system:
Document Processing:
How will you chunk documents? (Not as simple as fixed token counts)
How will you handle tables, images, and graphs?
How will you extract and inject metadata?
Querying:
Single-shot or multi-hop?
Query reformulation needed?
Agentic reasoning required?
Retrieval:
BM25 (keyword search)?
Semantic search (embedding models)?
Hybrid approach?
Reranking?
Generation:
Which LLM? (Cost vs quality tradeoff)
Context window size? (Don’t just stuff everything in)
Hallucination controls?
Before you jump to “which boxes should I have in my RAG system,” step back.
🎯 Understanding Your Use Case First
Every system has tradeoffs: latency, cost, and problem complexity.
Plus the often-ignored factor: cost of a mistake.
In healthcare, RAG where doctors look up treatment info → a hallucination could kill someone (high cost of mistake)
In human resources, RAG for checking vacation time → a wrong answer is annoying, but not deadly (low cost of mistake)
These require completely different architectures and safety mechanisms.
The Complexity Spectrum
Simple (Low Complexity):
Factual Q&A over a single document
Extraction: “What is Tesla’s total revenue?”
Fixed schema, predictable queries
Medium (Moderate Complexity):
Semantic variation: “How much bank did Tesla make?” (same question, different phrasing)
Multi-hop: Solve part 1, then use that info to solve part 2
Multi-document: Pull information from multiple sources
Complex (High Complexity):
Hidden assumptions: User expects RAG to know information NOT in the documents (laws, domain knowledge, context)
Agentic workflows: Multi-step planning, evaluation, analyst-level reasoning
Dynamic adaptation: Learn from one query, apply to next
“Understanding the types of queries people will use - simple extraction vs multi-hop vs agentic analyst replacement - is very important in designing and thinking about what type of RAG system you want to build.”
Key insight: Talk to your actual end users. Understand what they expect. They often have domain knowledge in their heads that they assume your RAG will magically know.
Spoiler: It won’t. You need to add that to your corpus.
🔍 Retrieval Deep Dive: Three Approaches That Matter
Most of the confusion in RAG centers on retrieval. Let’s break down what actually works.
Approach 1: BM25 (Best Match 25)
What it is: Classic keyword search using inverted index.
How it works:
Take all words in your documents
Create inverted index (word → document mapping)
Query comes in → look up words → return matching documents
Example:
Search for “butterfly” → Index shows: Document 1
Fast. Simple. No neural networks needed.
Speed comparison (benchmark):
Linear search (Ctrl+F through every document):
100 docs: 0.5 seconds
1,000 docs: 5 seconds
10,000 docs: 50 seconds
BM25 (with index):
100 docs: 0.01 seconds
1,000 docs: 0.02 seconds
10,000 docs: 0.05 seconds
Massive speedup once the index is built.
The tradeoff:
❌ Exact word matching required
Query: “physician”
Documents use: “doctor”
→ No match found
Query: “International Business Machines”
Documents use: “IBM”
→ No match found
When it works:
✅ Strong baseline for any document search
✅ Users know exact keywords
✅ Technical documentation
✅ Legal documents
✅ Domain experts using system
Implementation: Python has the rank_bm25 library (easy to try on your documents).
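A minimal sketch with the rank_bm25 package (the whitespace tokenization is purely illustrative; production systems typically lowercase, strip punctuation, and stem):

```python
# pip install rank-bm25
from rank_bm25 import BM25Okapi

corpus = [
    "The physician reviewed the treatment plan.",
    "IBM reported quarterly revenue growth.",
    "Butterflies migrate south every winter.",
]
# BM25 works over tokenized documents; naive whitespace split for illustration
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "quarterly revenue".split()
print(bm25.get_scores(query))              # relevance score per document
print(bm25.get_top_n(query, corpus, n=1))  # best-matching document
```

Note the tradeoff in action: a query for “doctor” would score zero here, because only “physician” appears in the corpus.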
“It’s a strong baseline pretty much for any type of document search where you have real humans that understand their material. If you have people that know the exact keyword they want, it works really well. This is why it’s the most widely implemented way we search.”
Critical insight: In many production RAG papers, BM25 beats neural models. Don’t skip this baseline.
Approach 2: Language Models (Semantic Search)
What it is: Neural networks that encode text into numerical representations (embeddings).
How it works:
Text → Embedding model → Vector (numbers)
Similar concepts cluster together in semantic space:
“doctor” and “physician” → close together
“IBM” and “International Business Machines” → close together
Why this matters: Solves the synonym/variation problem BM25 can’t handle.
The challenge: Which embedding model should you use?
Choosing Embedding Models: The MTEB Leaderboard
MTEB = Massive Text Embedding Benchmark
300+ embedding models tested across multiple tasks.
Key tradeoffs (visualized on leaderboard):
Y-axis: NDCG (retrieval quality metric)
X-axis: Sentences per second (speed)
What you see:
Static embedding models (right side):
Super fast (faster than BM25)
Lower accuracy
One word = one embedding (uncontextualized)
Example: Word2Vec
Contextualized models (middle-left):
Slower
Much higher accuracy
Understand context: “The model was shooting in Paris” (fashion vs violence)
Why do different models of the same size perform differently?
Same number of parameters, very different performance.
Reason: Better training data + improved architectures over time.
Qwen 3 embedding (latest, top performer) vs NV-Embed (older, same size) → Qwen 3 wins because:
Higher quality training data
Better architecture choices
Lessons learned from previous generations
Recent development: RTEB (Retrieval Embedding Benchmark)
New in the last 2 weeks
Has a private holdout set
Model builders don’t have access to test data
Prevents overfitting to the leaderboard
Practical considerations when picking models:
Model size:
Large models need GPUs
Small models can run on CPUs (faster deployment)
Training data:
Multilingual support
Domain-specific (healthcare uses terms differently than finance)
Consider fine-tuning for your domain
Context window:
How many tokens can model handle?
Longer context = more flexibility (but slower)
Matryoshka embeddings (recent development):
Variable dimension at inference time.
Normal embedding: Fixed 768 dimensions (always)
Matryoshka embedding: Choose 128, 256, 512, or 768 dimensions (at query time)
Use cases:
Full quality needed: Use 768
Want faster compute / save storage: Use 256
Model trained to maintain quality as you shrink
OpenAI now supports this in their API.
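The mechanics are easy to sketch: keep the leading dimensions, then re-normalize so cosine similarity still behaves. (The NumPy sketch below is illustrative; the commented call shows the equivalent `dimensions` parameter on OpenAI’s text-embedding-3 models.)

```python
import numpy as np

def shrink_matryoshka(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` dimensions of a Matryoshka embedding
    and re-normalize so cosine similarity remains meaningful."""
    truncated = vec[:dim]
    return truncated / np.linalg.norm(truncated)

full = np.random.randn(768)           # stand-in for a real 768-dim embedding
small = shrink_matryoshka(full, 256)  # 3x less storage and compute

# Equivalent idea via OpenAI's API (text-embedding-3 models):
# client.embeddings.create(model="text-embedding-3-small",
#                          input="...", dimensions=256)
```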
Sentence Transformers (Critical for RAG)
Most RAG systems use sentence-level embeddings.
Why: Documents are written in sentences. Models optimized for sentences perform better for retrieval.
Pretty much all production RAG embedding models use sentence transformers.
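A minimal sketch with the sentence-transformers library (the model name is a popular small default, not a recommendation from the talk):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small model, runs on CPU
docs = [
    "The physician prescribed antibiotics.",
    "IBM reported quarterly revenue growth.",
]
query = "What did the doctor recommend?"

doc_embeddings = model.encode(docs)
query_embedding = model.encode(query)
# Unlike BM25, "doctor" now matches "physician" via semantic similarity
print(util.cos_sim(query_embedding, doc_embeddings))
```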
Cross-Encoders (Rerankers)
What they do: Don’t just encode text—they look at query + document together and score the match.
Standard embedding:
Text → embedding (separate from query)
Query → embedding (separate from text)
Compare embeddings
Cross-encoder / Reranker:
Query + Text → score (how well do they match?)
Larger model (more compute)
Higher quality matching
How it’s used in production:
Stage 1: Fast embedding model gets top 100 candidates (cheap, broad net)
Stage 2: Reranker scores those 100 and picks best 5-10 (expensive, precise)
Why this pattern works:
Embedding models: Fast but less precise
Rerankers: Slow but very precise
Two-stage approach: Best of both worlds
The tradeoff: Extra latency. Reranking adds time.
But: Often worth it for 2-5% accuracy improvement.
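Here’s what the two-stage pattern looks like with sentence-transformers’ CrossEncoder (the model name is illustrative, and get_candidates stands in for whatever fast first-stage retrieval you use):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve(query: str, get_candidates, top_k: int = 5):
    # Stage 1: fast, broad net (BM25 or embeddings) -> ~100 candidates
    candidates = get_candidates(query, n=100)
    # Stage 2: score each (query, candidate) pair jointly -> precise top-k
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(scores, candidates), reverse=True)
    return [doc for _, doc in ranked[:top_k]]
```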
Recent development: Instructable re-rankers
Can prompt the reranker with instructions.
Example: “Prioritize most recent documents”
Gives another way to control reranking behavior.
Hybrid Search (The Practical Baseline)
Combine multiple approaches:
BM25 (lexical / keyword search) + Semantic embedding search
Reciprocal Rank Fusion (RRF): Merge the two result lists intelligently (see the sketch below)
Then optionally → Reranker for final polish
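RRF itself is only a few lines: each document’s fused score is the sum of 1/(k + rank) across the result lists, with k conventionally set to 60. A minimal sketch:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists of doc IDs; k=60 is the conventional constant."""
    fused = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]      # lexical ranking
semantic_hits = ["doc1", "doc9", "doc3"]  # embedding ranking
print(reciprocal_rank_fusion([bm25_hits, semantic_hits]))
# doc1 and doc3 rise to the top: both rankers agree on them
```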
Why this is recommended:
✅ Easy baseline
✅ Well-supported by most vector databases
✅ Gets benefits of both keyword matching and semantic understanding
✅ Strong starting point before optimization
“An easy baseline that’s pretty widely recognized is just start with hybrid search—lexical + semantic—pass it into a reranker if you need that little bit of accuracy. It’s an easy thing to set up and it’s a good strong baseline.”
📊 The Surprising Truth: BM25 + Smart LLM Beats Everything
This shocked the community.
Bright Research Paper finding:
Tested multiple retrieval approaches on complex reasoning tasks.
Results:
Qwen embedding model + GPT-4: Good performance
E5 sentence transformer + GPT-4: Worse performance
BM25 + GPT-4: BETTER than both neural models
Translation: Old-school keyword search + reasoning model beats expensive semantic embeddings.
Why this works:
Instead of spending compute upfront creating semantic embeddings, just let GPT-4 handle semantic relationships on the fly.
Rajiv tested this himself on two datasets:
Wix QA (tech support):
Semantic embeddings + reranker: Good performance
BM25 + GPT-4: Nearly identical performance (slightly lower, but close)
Financial 10-K documents:
Semantic embeddings + reranker: Good performance
BM25 + GPT-4: On par
Latency cost: BM25 approach was ~10% slower (extra reasoning calls)
But: No embedding model needed. No reranker needed. Much simpler architecture.
Where this trend is emerging: Code search
Code search never worked well with language-based embedding models.
Now: Claude Code, Cursor, etc. → all using BM25 + reasoning model
Simple keyword search + smart LLM that iterates until it finds the right code.
The caveat:
If you need fast responses (3 seconds), semantic embeddings + reranker still wins.
If you can tolerate 10-15 seconds, BM25 + reasoning is simpler and often better.
“A whole new trend is coming about of rethinking instead of spending all that time upfront creating this embedding model, why not just let my language model which understands semantic relationships handle this on the fly?”
🤖 Agentic RAG: When Simple Retrieval Isn’t Enough
Traditional RAG: One-shot retrieval. Query → retrieve → generate → done.
Agentic RAG: Multi-turn reasoning. Query → retrieve → evaluate → requery → retrieve → evaluate → generate.
Why it works now: LLMs got much better at tool use and reasoning in the last year.
The Pattern
Step 1: Query comes in
Step 2: System retrieves documents
Step 3: LLM evaluates: “Did I properly answer the question?”
Step 4: If no → reformulate query, try different approach
Step 5: Retrieve again
Step 6: Repeat until confident or max iterations
What it looks like in practice:
Different reasoning steps by LLM → Different tool calls → Different queries → All used to write final response.
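In code, the loop is conceptually this simple (retrieve and llm are hypothetical callables standing in for your retrieval stack and model API; the prompts are illustrative):

```python
def agentic_rag(question, retrieve, llm, max_iters=3):
    """Multi-turn retrieve -> evaluate -> reformulate loop (sketch)."""
    query, passages = question, []
    for _ in range(max_iters):
        passages = retrieve(query)
        verdict = llm(
            f"Question: {question}\nPassages: {passages}\n"
            "Can the question be fully answered from these passages? "
            "Reply YES, or suggest a better search query."
        )
        if verdict.strip().upper().startswith("YES"):
            break
        query = verdict  # use the model's reformulation as the next query
    return llm(f"Answer using only these passages: {passages}\n\nQ: {question}")
```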
Real Example: Deep Research
Agentic research systems
Architecture:
Main agent breaks task into multiple parts → Sub-agents work on those parts → Results come back to main agent → Final synthesis
Benchmark: 100 PhD-level tasks
Cost: Easily $1+ per query (lots of token usage across multiple reasoning steps)
Who’s using this: Thomson Reuters’ Westlaw, enterprise internal tools, lots of companies building their own versions.
The Bright Benchmark Results
Bright = Benchmark for complex reasoning tasks
Tests: Simple extraction, keyword matching, multi-hop reasoning, complex queries
One-shot RAG (baseline):
Chunk documents → Retriever → Reranker → Generate
Performance: 76% accuracy (Wix QA dataset)
Latency: ~5 seconds
Agentic RAG (multi-turn reasoning):
Same retrieval stack, but now:
Query → Retrieve → Reflect → Reformulate query → Retrieve again → Evaluate → Generate
Performance: 93% accuracy (Wix QA dataset)
Latency: 50 seconds
17 percentage-point improvement from letting the LLM reason across multiple retrieval attempts.
Why it works:
LLM looks at first results: “I didn’t find what I need. Let me rephrase the question.”
Sometimes the problem with the initial query isn’t semantics; it’s realizing you need different information entirely.
Example:
Initial query: General support question
After first retrieval: “Oh, I need Technical Support Rule 35 specifically.”
Second query: Target that exact rule
One-shot RAG couldn’t do this. Agentic RAG can.
The Tradeoff
Accuracy: 76% → 93% (huge win)
Latency: 5 seconds → 50 seconds (massive cost)
Use cases:
✅ Research tasks (users expect to wait)
✅ Complex analysis (accuracy matters more than speed)
✅ Offline processing (latency doesn’t matter)
❌ Real-time chat (users want 3 seconds max)
❌ High-volume queries (cost explodes)
❌ Simple factual lookup (overkill)
“This opens up a baseline of what’s possible in your system. Why could even the agentic system with multiple queries not find some information? Is the information deeply hidden? The difference between 76% and 93% - the agentic system that did multiple queries found the information. Why didn’t my single-shot find it? Go back and analyze.”
Key insight: Even if you don’t deploy agentic RAG, use it to debug your system.
Run agentic version. See where it improves. Figure out why. Use those insights to improve your single-shot RAG.
🎯 The Solution: LLM Routing
Don’t pick one approach for everything.
Route queries based on complexity:
Simple query (factual extraction, keyword lookup) → One-shot RAG (fast, cheap)
Complex query (multi-hop, reasoning, analysis) → Agentic RAG (slow, expensive, accurate)
How to implement:
Lightweight classifier at the beginning:
Query comes in
Classifier decides: simple or complex?
Routes to appropriate system
You see this already:
ChatGPT has a “thinking mode” toggle. Same concept.
Simple query: Standard mode
Complex query: Thinking mode activates
Apply the same pattern to RAG systems.
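A minimal routing sketch (the classifier prompt and labels are assumptions; in practice a small fine-tuned classifier is cheaper and faster than an extra LLM call):

```python
def route_query(query, llm, one_shot_rag, agentic_rag):
    """Send easy queries to one-shot RAG, hard ones to the agentic loop."""
    label = llm(
        "Classify this query as SIMPLE (factual lookup) or COMPLEX "
        f"(multi-hop / analysis). Reply with one word.\n\nQuery: {query}"
    )
    if "COMPLEX" in label.upper():
        return agentic_rag(query)  # slow, expensive, accurate
    return one_shot_rag(query)     # fast, cheap
```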
📦 Operational Considerations
Vector Databases vs In-Memory Storage
You don’t always need a vector database.
If your project has a limited number of vectors, store embeddings in memory (super fast).
As you scale to millions of embeddings, move to a vector database.
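For small corpora, “in memory” can be as simple as a NumPy matrix and a dot product; a sketch:

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 5):
    """Cosine-similarity search over an in-memory (n_docs, dim) matrix."""
    sims = doc_matrix @ query_vec
    sims = sims / (np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec))
    return np.argsort(-sims)[:k]  # indices of the k most similar documents
```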
The Metadata Problem
As document count grows, retrieval performance degrades.
Solution: Metadata filtering.
Before retrieval:
Filter documents by date, department, document type, author
Reduces search space from 1 million docs to 10,000 docs
Then search the 10,000
Much faster. Much more accurate.
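The shape of the pre-filter is trivial; the real work is extracting good metadata in the first place. A sketch (the metadata keys and the doc.metadata field are illustrative assumptions):

```python
def filtered_search(query, docs, search_fn, **filters):
    """Shrink the search space with metadata before running retrieval."""
    candidates = [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in filters.items())
    ]
    return search_fn(query, candidates)

# e.g. filtered_search(query, docs, hybrid_search,
#                      department="finance", doc_type="10-K")
```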
“I usually recommend that people spend time on metadata as soon as they get to large data volumes. Having metadata is very important so you’re not searching a million documents with every query.”
Entitlements and Permissions (The Thing Nobody Talks About)
Problem: Not every document can be read by every user.
Solution: Two filtering points:
Option 1: Pre-retrieval filtering
Before searching, filter out documents user doesn’t have access to
Only search permitted documents
Option 2: Post-retrieval filtering
Retrieve all documents
Before generation, filter out documents user can’t see
Only use permitted documents in final answer
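Both options reduce to an access-control check per document; a sketch (the doc.acl field is an assumption about your document schema):

```python
def permitted(user_id: str, doc) -> bool:
    """True if the user may see this document (doc.acl is illustrative)."""
    return user_id in doc.acl

def search_with_entitlements(query, user_id, docs, search_fn, pre=True):
    if pre:  # Option 1: filter BEFORE retrieval
        return search_fn(query, [d for d in docs if permitted(user_id, d)])
    results = search_fn(query, docs)  # Option 2: filter AFTER retrieval
    return [d for d in results if permitted(user_id, d)]
```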
Why this matters: Compliance. Legal requirements. Enterprise necessity.
Most tutorials skip this. But it’s non-negotiable for production deployment.
🧩 The Chunking Problem (What Everyone Struggles With)
Chunking seems simple: Break documents into pieces.
Reality: Chunking strategy massively impacts retrieval quality.
Why Naive Chunking Fails
Naive approach 1: Fixed token limit
Split every 512 tokens regardless of content.
Problem: Sentences chopped mid-thought. Context lost.
Naive approach 2: Fixed time intervals
Split every 30 seconds of content.
Problem: Speakers cut off mid-sentence. Dialogue fragmented.
Naive approach 3: No awareness
Don’t consider speakers, languages, topics.
Problem: Lose semantic coherence.
What Actually Works
Speaker-aware chunking:
Keep speaker turns together
Don’t split mid-dialogue
Maintains conversational context
Language-aware chunking:
If multiple languages in document, keep homogeneous chunks
Don’t mix languages mid-chunk
Semantic coherence:
Complete thoughts stay together
End chunks at natural sentence boundaries
Preserve context
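Even a simple sentence-boundary chunker beats fixed token splits; a minimal sketch (the regex split and word-count token estimate are crude stand-ins for a proper sentence tokenizer):

```python
import re

def chunk_by_sentences(text: str, max_tokens: int = 512):
    """Pack whole sentences into chunks; never split mid-thought."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_tokens = len(sentence.split())  # crude token estimate
        if current and count + n_tokens > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks
```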
“Rather than thinking of chunking as a commodity you would import, think of different parts of the process as a core product or feature by themselves. Have an iterative feedback loop of how your chunking mechanism benefits your outputs.”
The principle: Chunking is a core feature, not an afterthought.
Optimize it iteratively based on retrieval quality metrics.
Contextual Chunking (Advanced)
Don’t just chunk text. Add context to each chunk.
What to add:
Metadata:
Document title
Section heading
Chapter number
Page number
Author
Date
Document hierarchy:
“This is from Chapter 3, Section 2.1, under heading ‘Financial Results’”
“This is the third subsection under ‘Quarterly Revenue’”
Why this helps:
When LLM generates response, it knows:
Where this information came from
What section it’s in
What other related sections exist
Enables better reasoning:
“Oh, this chunk is from Q3 2023 subsection. Let me also retrieve Q3 2024 subsection for comparison.”
How much context to add?
Varies by use case:
Some teams add full document summary to every chunk
Some add just section headings
Some add hierarchical breadcrumbs
General recommendation: Document hierarchy approach.
Embed heading structure into chunks. Let users decide how much detail they want.
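Prepending breadcrumbs is a one-liner per chunk; a sketch:

```python
def add_breadcrumbs(chunk_text: str, path: list[str]) -> str:
    """Prefix a chunk with its place in the document hierarchy."""
    return f"[{' > '.join(path)}]\n{chunk_text}"

# add_breadcrumbs(chunk, ["Chapter 3", "Section 2.1", "Financial Results"])
```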
Works really well for:
Technical documentation
Long reports
Books
Research papers
Less critical for:
Emails (just chunk at message level)
Slides (one chunk per slide works fine)
Short documents
Special Cases: Tables and Images
Tables are brutal.
Small tables: Fit in one chunk (easy)
Large tables: Span multiple chunks (nightmare)
Problem: How do you split without losing column headers?
Approaches that help:
Use HTML representation (preserves structure better than markdown)
Include consistent headers across split chunks
Work on generation side to ensure proper header context
Reality check: Every vendor struggles with this. No perfect solution yet.
Images and graphs:
Use vision language models for image captioning.
Convert visual information to text:
“Graph shows Tesla revenue: 2020 = $31B, 2021 = $53B, 2022 = $81B”
Why: Need numerical text representation for retrieval.
Multimodal embeddings exist, but text conversion generally works better for RAG.
✅ When This Architecture Makes Sense
Build complex RAG systems when:
1. Scale Justifies It
Hundreds of documents → Simple approach fine
Thousands of documents → Need robust architecture
Enterprise-wide deployment → Must have governance, permissioning, metadata
2. Cost of Mistakes Is High
Healthcare, legal, financial domains → Need high accuracy, safety mechanisms, audit trails
3. Complex Query Patterns
Users need multi-hop reasoning, document synthesis, analyst-level responses
4. You Have Clear Metrics
Can measure success objectively. Have evaluation framework. Can iterate systematically.
5. Latency Flexibility
Users willing to wait 10-30 seconds for complex queries (research, analysis)
❌ When Simple RAG Is Better
Don’t overengineer when:
1. Small Document Set
Hundreds of docs, infrequent updates → In-memory storage, simple BM25 is fine
2. Simple Queries
Factual extraction, keyword lookup → One-shot RAG with hybrid search sufficient
3. Low Cost of Mistakes
Internal tools, non-critical lookups → Don’t need expensive rerankers and validation
4. Fast Response Required
Users need 3-second responses → Agentic approach too slow
5. Limited Resources
Small team, tight budget → Start with baseline, iterate only when necessary
🎯 The Decision Framework (Use This Monday)
Step 1: Define Your Use Case
Questions to answer:
Who are the users?
What types of queries will they ask?
How complex are the reasoning requirements?
What’s the cost of a mistake?
What latency do users expect?
Output: Understanding of whether you need simple factual Q&A, multi-hop reasoning, or full agentic analyst replacement.
Step 2: Start With BM25 Baseline
Implement:
Simple keyword search with inverted index
Test on sample queries
Validate: Does this handle exact keyword matches well?
If sufficient for your use case: Stop. Ship it.
If not: Proceed to Step 3.
Step 3: Add Semantic Search (Hybrid Approach)
Implement:
Choose embedding model (check MTEB leaderboard)
BM25 + semantic embeddings
Reciprocal rank fusion to merge results
Validate: Does semantic understanding help with synonyms, paraphrasing?
If sufficient: Stop. Ship it.
If not: Proceed to Step 4.
Step 4: Add Reranker
Implement:
Cross-encoder reranker (Cohere, Contextual AI, or open source)
Two-stage retrieval: Fast embeddings → Precise reranking
Validate: Does reranking improve top results? Is latency acceptable?
If sufficient: Stop. Ship it.
If not: Proceed to Step 5.
Step 5: Optimize Chunking
Implement:
Speaker-aware chunking
Contextual information (metadata, hierarchy)
Handle tables and images properly
Validate: Does better chunking improve retrieval quality?
If sufficient: Stop. Ship it.
If not: Proceed to Step 6.
Step 6: Consider Agentic Approach
Implement:
Multi-turn reasoning loops
Query reformulation
Reflection and evaluation steps
Max iteration limits
Validate:
Accuracy improvement vs baseline?
Latency acceptable?
Cost justified?
If complex queries only: Use LLM routing (simple → one-shot, complex → agentic)
Step 7: Add Production Necessities
Implement:
Metadata filtering
Entitlements and permissions
Monitoring and logging
Evaluation framework
Cost tracking
Validate: Production-ready. Compliant. Scalable.
🚀 Key Takeaways
1. RAG is a system, not a model
Stop thinking “I need a good embedding model.” Start thinking about the complete architecture: chunking, retrieval, generation, metadata, permissions.
2. Start with strong baselines
BM25 beats many neural models in production. Hybrid search (BM25 + semantic) is a solid default. Start simple.
3. Understand your use case first
Simple factual Q&A vs complex agentic reasoning require completely different architectures. Cost of mistakes matters. Latency requirements matter.
4. Chunking is a core feature
Not an afterthought. Optimize iteratively. Add contextual information. Handle tables and images properly.
5. Agentic RAG works but costs
76% → 93% accuracy gain. But 5 seconds → 50 seconds latency. Use LLM routing to get best of both worlds.
6. BM25 + smart LLM is underrated
Latest finding: Simple keyword search + GPT-4 matches expensive semantic embeddings for complex reasoning tasks.
7. Metadata and permissions are non-negotiable
Enterprise deployment requires filtering, entitlements, governance. Most tutorials skip this. Don’t.
8. Even if you don’t deploy agentic RAG, use it to debug
Run agentic version to see where your system fails. Analyze gaps. Improve your single-shot RAG based on those insights.
📚 Resources and Tools
Embedding Model Selection:
MTEB Leaderboard (300+ models tested)
Benchmarks:
Bright: Complex reasoning tasks
Wix QA: Tech support queries (public dataset)
NanoBEIR: Small subset of BEIR for quick testing
Vector Databases:
DB-Engines Ranking of Vector DBMS: Consider governance, speed, cost, metadata filtering
PageIndex: Document Index for Vectorless, Reasoning-based RAG
Libraries:
FAISS: Optimized for embedding similarity computation
LangChain - Open Deep Research Agent: Simple, configurable, open source deep research agent
Sentence Transformers: Python framework for computing text and image embeddings
Contextual AI: Instructable reranker
Agno: Open source reasoning agent framework
📹 Watch The Full Workshop
Rajiv covers the complete RAG architecture from first principles
👉 MLOps World 2025 – “RAG Workshop: From Demo to Production”
Level: Intermediate/Advanced | Best for: ML engineers building RAG systems
📤 Worth Sharing?
Know a team whose RAG demo died in production?
Forward this → they’ll stop overengineering and start with baselines that actually work.
Written by Anupama Garani, David Scharbach, Josh Goldstein
AI in Production Field Notes | TMLS
Model Credit: This document was prepared with assistance from Claude, an AI language model developed by Anthropic.