Token Budget Management for Production RAG
Your RAG application works in development. In production, it fails 5% of the time with:
```
InvalidRequestError: This model's maximum context length is 8192 tokens.
```

The 5% failure rate comes from edge cases: long documents, verbose chat history, users who paste entire codebases into the chat.
Here's how to handle token budgets properly.
The Real Problem
Token limits aren't just about truncation. They're about:
- Predictable behavior — never fail due to context length
- Quality control — include the most important content
- Cost management — tokens = money
- Latency optimization — fewer tokens = faster responses
Token Counting Basics
Characters != tokens. Words != tokens.
| Text | Characters | Words | Tokens (GPT-4) |
|---|---|---|---|
| "hello" | 5 | 1 | 1 |
| "indistinguishable" | 18 | 1 | 1 |
| "Hello, world!" | 13 | 2 | 4 |
| JSON blob (1KB) | 1000 | ~150 | ~250 |
The only accurate way to count tokens is with the model's tokenizer.
Using tiktoken
```python
import tiktoken

# Get the right encoding for your model
enc = tiktoken.encoding_for_model("gpt-4")

text = "Your text here"
tokens = enc.encode(text)
print(len(tokens))  # Actual token count
```

Fabra wraps this:
```python
from fabra.utils.tokens import OpenAITokenCounter

counter = OpenAITokenCounter(model="gpt-4")
count = counter.count("Your text here")
```

Priority-Based Truncation
The naive approach:
```python
def build_context(docs, max_chars=8000):
    context = "\n".join(docs)
    return context[:max_chars]  # Terrible idea
```

Problems:
- Cuts mid-sentence
- No priority ordering
- Character count != token count
The Right Approach
```python
from fabra.context import context, ContextItem

@context(store, max_tokens=4000)
async def build_prompt(query: str):
    docs = await search_docs(query)
    return [
        # Priority 0 = most important (never dropped if required)
        ContextItem(
            content="You are a helpful assistant.",
            priority=0,
            required=True
        ),
        # Priority 1 = important
        ContextItem(
            content=f"Primary document:\n{docs[0]}",
            priority=1
        ),
        # Priority 2-N = progressively less important
        ContextItem(
            content=f"Supporting docs:\n{docs[1]}",
            priority=2
        ),
        ContextItem(
            content=f"Additional context:\n{docs[2:]}",
            priority=3
        ),
    ]
```

Fabra's algorithm:
- Sort items by priority (lowest number = most important)
- Add items until budget exhausted
- Skip items that would exceed budget
- Always include `required=True` items (or raise an error)
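In plain Python, that selection loop looks roughly like the sketch below. This is illustrative only, not Fabra's implementation; `Item` and `count_tokens` stand in for `ContextItem` and a tokenizer-backed counter.

```python
from dataclasses import dataclass

@dataclass
class Item:
    content: str
    priority: int
    required: bool = False

def select_items(items: list[Item], max_tokens: int, count_tokens) -> list[Item]:
    """Greedy, priority-ordered selection under a token budget (illustrative only)."""
    budget = max_tokens
    selected = []
    for item in sorted(items, key=lambda i: i.priority):
        cost = count_tokens(item.content)
        if cost <= budget:
            selected.append(item)
            budget -= cost
        elif item.required:
            # A required item that no longer fits is a hard failure
            raise ValueError("required item exceeds the remaining token budget")
        # Otherwise skip the item and keep trying lower-priority ones
    return selected
```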
Handling Dynamic Content
Document length varies. Chat history grows. User queries range from 3 words to 3 paragraphs.
Defensive Budgeting
Reserve space for dynamic content:
```python
@context(store, max_tokens=4000)
async def build_prompt(user_id: str, query: str):
    # Reserve ~500 tokens for user query
    # Reserve ~1000 tokens for response
    # Use ~2500 for context
    docs = await search_docs(query)
    history = await get_history(user_id)
    return [
        ContextItem(content=SYSTEM_PROMPT, priority=0, required=True),
        ContextItem(content=f"Query: {query}", priority=1, required=True),
        ContextItem(content=str(docs[:2]), priority=2),      # Top 2 docs
        ContextItem(content=str(docs[2:]), priority=3),      # Rest
        ContextItem(content=str(history[-5:]), priority=4),  # Last 5 msgs
    ]
```

Per-Item Limits
For very long documents, truncate individually:
```python
def truncate_doc(doc: str, max_tokens: int = 500) -> str:
    counter = OpenAITokenCounter()
    tokens = counter.count(doc)
    if tokens <= max_tokens:
        return doc
    # Binary search for the longest prefix that fits the budget
    # (or use a chunking strategy instead)
    lo, hi = 0, len(doc)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if counter.count(doc[:mid]) <= max_tokens:
            lo = mid
        else:
            hi = mid - 1
    return doc[:lo]
```

Cost Monitoring
Tokens cost money:
| Model | Input Cost | Output Cost |
|---|---|---|
| GPT-4 Turbo | $0.01/1K | $0.03/1K |
| GPT-3.5 Turbo | $0.0005/1K | $0.0015/1K |
| Claude 3.5 Sonnet | $0.003/1K | $0.015/1K |
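Estimated input cost is just tokens × rate. A small helper makes the arithmetic explicit; the prices are hard-coded from the table above, so treat them as a snapshot rather than current pricing.

```python
# Input price per 1K tokens, copied from the table above (prices change; verify)
INPUT_PRICE_PER_1K = {
    "gpt-4-turbo": 0.01,
    "gpt-3.5-turbo": 0.0005,
    "claude-3-5-sonnet": 0.003,
}

def estimate_input_cost(tokens: int, model: str = "gpt-4-turbo") -> float:
    return tokens / 1000 * INPUT_PRICE_PER_1K[model]

print(estimate_input_cost(3847))  # 0.03847 -- roughly the 0.0384 shown below
```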
Fabra tracks costs:
ctx = await build_prompt("user_123", "query")
print(ctx.meta["token_usage"]) # 3847
print(ctx.meta["cost_usd"]) # 0.0384 (estimated input cost)Cost Alerts
```python
@context(store, max_tokens=4000)
async def build_prompt(query: str):
    items = [...]
    return items

# After assembly
ctx = await build_prompt("expensive query")
if ctx.meta["cost_usd"] > 0.10:
    logger.warning(f"High cost context: ${ctx.meta['cost_usd']:.4f}")
```

Latency Optimization
Fewer tokens = faster responses. LLM latency scales roughly linearly with output tokens and sub-linearly with input tokens.
Strategies
- Tighter budgets for simple queries
```python
@context(store, max_tokens=2000)  # Small budget
async def simple_query(query: str):
    ...

@context(store, max_tokens=8000)  # Large budget
async def complex_query(query: str):
    ...
```

- Streaming for long responses
Token budget applies to input. For output, use streaming to improve perceived latency.
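For example, with the OpenAI Python client (the model name and prompt are placeholders; this is independent of Fabra):

```python
from openai import OpenAI

client = OpenAI()
prompt = "Your assembled context and question"  # placeholder

stream = client.chat.completions.create(
    model="gpt-4-turbo",  # placeholder; use your model
    messages=[{"role": "user", "content": prompt}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # emit tokens to the user as they arrive
```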
- Caching
Fabra caches context assembly:
@context(store, max_tokens=4000, cache_ttl="5m")
async def build_prompt(query: str):
...Same query within 5 minutes? Return cached context.
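A quick, illustrative way to observe the effect, assuming the decorated `build_prompt` above and an async context:

```python
import time

t0 = time.perf_counter()
await build_prompt("same query")  # cold call: context is assembled
t1 = time.perf_counter()
await build_prompt("same query")  # warm call: served from the 5-minute cache
t2 = time.perf_counter()
print(f"cold={t1 - t0:.3f}s warm={t2 - t1:.3f}s")
```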
Production Patterns
Fallback on Budget Error
```python
from fabra.context import ContextBudgetError

try:
    ctx = await build_prompt(query)
except ContextBudgetError:
    # Required items exceeded budget
    ctx = await build_minimal_prompt(query)
```
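`build_minimal_prompt` is yours to define. One plausible sketch keeps only the required items under a much smaller budget, reusing the same decorator and `ContextItem` API (`SYSTEM_PROMPT` as before):

```python
@context(store, max_tokens=1000)
async def build_minimal_prompt(query: str):
    # Bare minimum: system prompt plus the user query, nothing optional
    return [
        ContextItem(content=SYSTEM_PROMPT, priority=0, required=True),
        ContextItem(content=f"Query: {query}", priority=1, required=True),
    ]
```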
Monitoring Dashboard

Track these metrics:
- Token usage distribution (p50, p95, p99)
- Budget exhaustion rate (how often we drop content)
- Cost per query
- Items dropped per query
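If you want to record these from application code yourself, a minimal sketch with `prometheus_client` could look like this; the metric names are made up, and the `ctx.meta` keys are the ones shown earlier:

```python
from prometheus_client import Counter, Histogram

# Hypothetical application-side metrics
context_tokens = Histogram("rag_context_tokens", "Tokens per assembled context")
context_cost = Counter("rag_context_cost_usd_total", "Cumulative estimated context cost (USD)")
items_dropped = Counter("rag_context_items_dropped_total", "Context items dropped by the budgeter")

async def observed_build_prompt(query: str):
    ctx = await build_prompt(query)
    context_tokens.observe(ctx.meta["token_usage"])
    context_cost.inc(ctx.meta["cost_usd"])
    items_dropped.inc(ctx.meta["items_dropped"])
    return ctx
```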
Fabra exposes Prometheus metrics:
```
fabra_context_tokens_total
fabra_context_items_dropped_total
fabra_context_budget_exceeded_total
```

A/B Testing Budgets
Different budgets affect quality:
```python
import random

# Note: decorator arguments are evaluated once at import time, so this
# assigns one budget per process/deployment, not per request.
@context(store, max_tokens=4000 if random.random() > 0.5 else 2000)
async def build_prompt(query: str):
    ...
```

Measure: response quality, user satisfaction, cost.
Common Mistakes
1. Hardcoding Character Limits
```python
# Wrong
context = text[:8000]

# Right
from fabra.utils.tokens import OpenAITokenCounter

counter = OpenAITokenCounter()
# Use counter.count() and proper truncation
```

2. Ignoring Required Content
```python
# Wrong: might drop system prompt
items = [
    ContextItem(content=system_prompt, priority=0),  # Not required!
    ContextItem(content=docs, priority=1),
]

# Right
items = [
    ContextItem(content=system_prompt, priority=0, required=True),
    ContextItem(content=docs, priority=1),
]
```

3. Not Monitoring Drops
```python
ctx = await build_prompt(query)

# Always log this in production
if ctx.meta["items_dropped"] > 0:
    logger.info(f"Dropped {ctx.meta['items_dropped']} items for query: {query[:50]}")
```

Try It
pip install "fabra-ai[ui]"from fabra.core import FeatureStore
from fabra.context import context, ContextItem
store = FeatureStore()
@context(store, max_tokens=1000)
async def demo(query: str):
return [
ContextItem(content="System prompt", priority=0, required=True),
ContextItem(content="Important", priority=1),
ContextItem(content="Nice to have " * 50, priority=2),
ContextItem(content="Filler " * 100, priority=3),
]
ctx = await demo("test")
print(f"Tokens: {ctx.meta['token_usage']}")
print(f"Dropped: {ctx.meta['items_dropped']}")
print(f"Cost: ${ctx.meta['cost_usd']:.6f}")