
Thread Transfer

Smart model routing: Cut your AI costs by 70%

One team dropped their bill from $42k to $29k monthly by routing 70% of requests to smaller models. Here's the playbook.

Jorgo Bardho

Founder, Thread Transfer

March 9, 2025 · 9 min read
model routing · cost optimization · LLM orchestration
[Figure: Decision tree for model routing]

Not every query needs GPT-4. Simple tasks like intent classification, entity extraction, and FAQ matching work perfectly on smaller, faster, cheaper models. Smart model routing classifies incoming requests by complexity and sends them to the appropriate model tier. Teams that nail this cut costs by 60-70% while maintaining—or even improving—quality. Here's the playbook.

What is model routing?

Model routing is a decision layer that sits between the user request and your LLM fleet. It analyzes the incoming query, estimates complexity, and dispatches it to the right model: a budget-tier option like Gemini Flash for simple tasks, a mid-tier workhorse like GPT-4o for standard reasoning, or a frontier model like Claude 4 Opus for high-stakes analysis.

The goal: maximize cost efficiency without sacrificing accuracy. Done well, you're spending $0.10 per request instead of $1.00, and users can't tell the difference.

How it works: The classification step

Before sending a query to an expensive model, run it through a fast classifier. Options include:

  • Heuristic rules. Simple keyword matching, token count thresholds, or regex patterns. Works for well-defined domains (e.g., "if query contains 'pricing', route to FAQ model").
  • Embedding similarity. Compare the query embedding to a set of reference embeddings for "simple" vs. "complex" questions. Fast, cheap, and surprisingly effective (see the sketch after this list).
  • Small classifier model. Train a lightweight model (BERT, DistilBERT) to predict complexity on a 1-5 scale. Latency under 50ms, cost under $0.001 per request.
  • LLM self-routing. Use a cheap model (GPT-3.5, Gemini Flash) to classify whether the query needs escalation. Adds 200-500ms latency but handles nuanced cases.
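
Here's a minimal sketch of the embedding-similarity option, assuming the OpenAI Python SDK for embeddings and NumPy; the reference queries are illustrative placeholders, and you'd swap in your own provider and labeled examples.

```python
# Minimal embedding-similarity router sketch. Assumes the OpenAI embeddings
# API (text-embedding-3-small); swap in your own provider and reference set.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

# Hand-picked reference queries for each bucket (illustrative).
SIMPLE_REFS = [embed(q) for q in [
    "What are your pricing tiers?",
    "How do I reset my password?",
]]
COMPLEX_REFS = [embed(q) for q in [
    "Compare the indemnification clauses across these two contracts.",
    "Why does my deployment fail intermittently under load?",
]]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(query: str) -> str:
    """Return 'simple' or 'complex' based on the nearest reference set."""
    q = embed(query)
    simple_score = max(cosine(q, r) for r in SIMPLE_REFS)
    complex_score = max(cosine(q, r) for r in COMPLEX_REFS)
    return "simple" if simple_score >= complex_score else "complex"
```

In practice you'd use far more than two reference queries per bucket and pick the decision boundary from a small labeled evaluation set rather than a straight max comparison.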

Decision criteria: When to route where

Here's a decision matrix teams use in production:

  • Tier 1 (Budget models: Gemini Flash, GPT-3.5). FAQ lookup, intent classification, simple entity extraction, sentiment analysis, short summaries under 200 tokens.
  • Tier 2 (Mid-tier: GPT-4o, Claude 3.5 Sonnet). Multi-step reasoning, code generation, document summarization 200-2000 tokens, customer support responses, data transformation.
  • Tier 3 (Frontier: Claude 4 Opus, GPT-4 Turbo). Legal analysis, complex medical reasoning, large-scale research synthesis, adversarial red-teaming, creative writing with strict constraints.

Rule of thumb: if the task can be validated with a simple check (keyword match, format verification), route it to Tier 1. If it requires multi-hop reasoning or domain expertise, escalate to Tier 2 or 3.
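
That matrix translates directly into a static lookup. A minimal sketch, where the task labels and model identifiers are illustrative stand-ins you'd replace with your own catalog:

```python
# Decision matrix as code: task type -> tier -> model. Names are illustrative.
TIER_MODELS = {
    1: "gemini-flash",   # budget
    2: "gpt-4o",         # mid-tier
    3: "claude-opus",    # frontier
}

TASK_TIERS = {
    "faq_lookup": 1,
    "intent_classification": 1,
    "entity_extraction": 1,
    "sentiment": 1,
    "code_generation": 2,
    "doc_summarization": 2,
    "support_response": 2,
    "legal_analysis": 3,
    "research_synthesis": 3,
}

def pick_model(task_type: str) -> str:
    """Unknown task types default to the mid-tier model."""
    return TIER_MODELS[TASK_TIERS.get(task_type, 2)]
```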

Implementation patterns

Pattern 1: Static routing. Hard-code rules based on request type. Fast to build, zero ML overhead. Example: all "summarize" endpoints hit GPT-4o; all "extract entities" hit Gemini Flash.

Pattern 2: Confidence-based escalation. Start with a cheap model. If the model returns low confidence (via logprobs or a self-critique prompt), retry with a stronger model. Works well for 80/20 splits.
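A minimal sketch of this pattern, assuming the OpenAI chat completions API with logprobs enabled; the model names and the confidence threshold are illustrative values you'd tune against your own evals.

```python
# Confidence-based escalation sketch: answer with a cheap model, and retry
# with a stronger one if the average token logprob falls below a threshold.
from openai import OpenAI

client = OpenAI()
CHEAP_MODEL, STRONG_MODEL = "gpt-3.5-turbo", "gpt-4o"
MIN_AVG_LOGPROB = -0.5  # illustrative threshold, not a tuned value

def answer(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=CHEAP_MODEL,
        messages=[{"role": "user", "content": prompt}],
        logprobs=True,
    )
    choice = resp.choices[0]
    tokens = choice.logprobs.content or []
    avg_logprob = sum(t.logprob for t in tokens) / max(len(tokens), 1)

    if avg_logprob >= MIN_AVG_LOGPROB:
        return choice.message.content

    # Low confidence: escalate once to the stronger model.
    retry = client.chat.completions.create(
        model=STRONG_MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return retry.choices[0].message.content
```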

Pattern 3: Multi-stage pipeline. Use a fast model to generate candidates, then a strong model to rank or refine them. Common in search and recommendation systems.
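A minimal sketch of the generate-then-rank shape, again assuming the OpenAI chat completions API; the judge prompt and model choices are illustrative.

```python
# Multi-stage pipeline sketch: sample cheap candidates, rank with a strong model.
from openai import OpenAI

client = OpenAI()

def generate_candidates(prompt: str, n: int = 3) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        n=n,                # several cheap samples
        temperature=0.8,
    )
    return [c.message.content for c in resp.choices]

def pick_best(prompt: str, candidates: list[str]) -> str:
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    judge = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Question: {prompt}\n\nCandidate answers:\n{numbered}\n\n"
                       "Reply with only the number of the best answer.",
        }],
    )
    try:
        return candidates[int(judge.choices[0].message.content.strip().strip("[]"))]
    except (ValueError, IndexError):
        return candidates[0]  # fall back to the first candidate if parsing fails
```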

Pattern 4: User-controlled routing. Let power users choose model tier explicitly. Pair this with usage quotas to prevent abuse.

Real savings examples

Case 1: Customer support platform. Routed 70% of queries to Gemini Flash for intent classification and FAQ matching. Only escalated complex or ambiguous requests to GPT-4o. Result: monthly bill dropped from $42k to $29k (31% reduction) with identical CSAT scores.

Case 2: Legal document analysis. Used embeddings to classify contracts as "standard" or "non-standard." Standard contracts went to GPT-4o for clause extraction. Non-standard went to Claude 4 Opus for deep analysis. Saved 58% on costs while improving accuracy on edge cases.

Case 3: E-commerce chatbot. Heuristic routing: product questions hit a RAG pipeline with Gemini Flash. Refund or complaint queries escalated to GPT-4o with full conversation history. Average cost per conversation fell from $0.18 to $0.07 (61% drop).

Common pitfalls

  • Over-classification. If your classifier sends 80% of traffic to the expensive model, you're just adding latency and complexity. Aim for 60-70% hitting budget models.
  • Ignoring latency. Adding a classification step increases total latency by 100-500ms. For real-time use cases, cache routing decisions or use async patterns (see the caching sketch after this list).
  • No fallback strategy. If the budget model fails or returns garbage, retry with a stronger model automatically. Don't force users into dead ends.
  • Stale routing logic. As models improve, routing rules go stale. Re-evaluate quarterly: today's Tier 2 model might handle yesterday's Tier 3 tasks.
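
To address the latency pitfall, here's a small caching sketch: routing decisions are memoized on a normalized query string so repeat queries skip the classifier entirely. It assumes the classify() function from the embedding sketch earlier; the normalization shown is deliberately crude.

```python
# Cache routing decisions so the classifier stays off the hot path
# for repeated queries.
from functools import lru_cache

def normalize(query: str) -> str:
    return " ".join(query.lower().split())

@lru_cache(maxsize=10_000)
def cached_classification(normalized_query: str) -> str:
    return classify(normalized_query)   # 'simple' or 'complex' from earlier sketch

def route(query: str) -> str:
    return cached_classification(normalize(query))
```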

Monitoring and tuning

Track these metrics per model tier: cost per request, latency p50/p95, accuracy/quality score (via human eval or model grading), escalation rate. Set alerts for anomalies: sudden cost spikes, escalation rate jumps, or quality drops.
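
A minimal sketch of that bookkeeping, using an in-memory dict as a stand-in for a real metrics backend (Prometheus, Datadog, etc.); the 30% alert threshold is an illustrative assumption, not a recommendation.

```python
# Per-tier bookkeeping sketch: request counts, cost, and escalation rate.
from collections import defaultdict

stats = defaultdict(lambda: {"requests": 0, "cost_usd": 0.0, "escalations": 0})

def record(tier: str, cost_usd: float, escalated: bool) -> None:
    s = stats[tier]
    s["requests"] += 1
    s["cost_usd"] += cost_usd
    s["escalations"] += int(escalated)

def escalation_rate(tier: str) -> float:
    s = stats[tier]
    return s["escalations"] / s["requests"] if s["requests"] else 0.0

# Illustrative alert check: flag if over 30% of Tier 1 traffic escalates.
if escalation_rate("tier1") > 0.30:
    print("ALERT: Tier 1 escalation rate above 30%")
```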

Run A/B tests on routing thresholds. Example: does routing "medium complexity" queries to Tier 1 vs. Tier 2 change user satisfaction? Optimize for cost-per-successful-interaction, not just raw cost.

Closing thoughts

Smart routing is the fastest way to cut AI costs without touching your application logic. Start simple: heuristic rules for obvious splits. Add complexity only when the data justifies it. Pair routing with Thread-Transfer bundles to compress context on top of model selection—compound savings hit 70-80%.

Want routing decision trees and monitoring dashboards? Get in touch.