Context Assembly: Fitting LLM Prompts in Token Budgets
Every RAG application hits this error eventually:
```text
openai.error.InvalidRequestError:
This model's maximum context length is 8192 tokens.
However, your messages resulted in 12847 tokens.
```

The naive fix is truncation. The right fix is priority-based context assembly.
The Problem
Your LLM context includes:
- System prompt (required)
- User query (required)
- Retrieved documents (important but variable)
- User history (nice to have)
- Feature values (depends on use case)
You don't know the final size until runtime. Document retrieval might return 2 results or 20. User history might be empty or extensive.
The Naive Solution (Don't Do This)
```python
def build_context(docs, history, features):
    context = f"""
    System: You are helpful.
    Docs: {docs}
    History: {history}
    Features: {features}
    """
    # Truncate if too long
    if len(context) > 8000:
        context = context[:8000]
    return context
```

Problems:
- Character count != token count — "hello" is 1 token, but so is " indistinguishable" (see the demo after this list)
- Arbitrary truncation — you might cut off mid-sentence or mid-document
- No priority — you lose important content while keeping fluff
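The first problem is easy to verify with tiktoken directly (the sample strings below are just illustrations):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-3.5 tokenizer

# Character count and token count diverge wildly across strings.
for text in ["hello", " indistinguishable", "aVeryLongCamelCaseIdentifier"]:
    print(f"{len(text):>3} chars -> {len(enc.encode(text)):>2} tokens  {text!r}")
```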
The Fabra Solution
Fabra's `@context` decorator handles this properly:
```python
from fabra.context import context, ContextItem

@context(store, max_tokens=4000)
async def build_prompt(user_id: str, query: str):
    docs = await search_docs(query)
    history = await get_chat_history(user_id)
    tier = await store.get_feature("user_tier", user_id)
    return [
        # Lower priority number = more important; 0 is the most important.
        ContextItem(
            content="You are a helpful assistant for our SaaS product.",
            priority=0,
            required=True,  # Never drop this
        ),
        ContextItem(
            content=f"User tier: {tier}",
            priority=1,
            required=True,  # Never drop this either
        ),
        ContextItem(
            content=f"Relevant documentation:\n{docs[0]}",
            priority=2,
        ),
        ContextItem(
            content=f"Additional docs:\n{docs[1:]}",
            priority=3,
        ),
        ContextItem(
            content=f"Chat history:\n{history}",
            priority=4,  # Dropped first if over budget
        ),
    ]
```

How It Works
- Token counting — uses `tiktoken` for accurate OpenAI token counts
- Priority sorting — items are sorted by priority (0 = most important)
- Greedy assembly — adds items until the budget is exhausted
- Required enforcement — `required=True` items are always included
- Budget error — raises `ContextBudgetError` if required items exceed the budget
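The assembly loop is small enough to sketch. The following is a minimal illustration of the algorithm, not Fabra's actual implementation: the `Item` dataclass and `assemble` function are hypothetical stand-ins, while `ContextBudgetError` and the field names mirror the API above.

```python
from dataclasses import dataclass

import tiktoken


@dataclass
class Item:
    content: str
    priority: int
    required: bool = False


class ContextBudgetError(Exception):
    pass


def assemble(items: list[Item], max_tokens: int) -> str:
    enc = tiktoken.get_encoding("cl100k_base")
    costs = [len(enc.encode(item.content)) for item in items]

    # Required items are always included; if they alone exceed the
    # budget, no valid assembly exists, so fail loudly.
    used = sum(c for item, c in zip(items, costs) if item.required)
    if used > max_tokens:
        raise ContextBudgetError(f"required items need {used} > {max_tokens} tokens")
    chosen = {i for i, item in enumerate(items) if item.required}

    # Greedy pass over optional items, lowest priority number first.
    for i in sorted(
        (i for i, item in enumerate(items) if not item.required),
        key=lambda i: items[i].priority,
    ):
        if used + costs[i] <= max_tokens:
            chosen.add(i)
            used += costs[i]

    # Emit in the caller's original order (an assumption here; the
    # real ordering policy is up to the implementation).
    return "\n\n".join(items[i].content for i in sorted(chosen))
```

Note that dropping is all-or-nothing per item, which is why the example above splits `docs[0]` from `docs[1:]` into separate items: it gives the assembler a finer-grained knob.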
The Result
```python
ctx = await build_prompt("user_123", "how do I reset my password?")

print(ctx.content)
# Assembled context string, guaranteed <= 4000 tokens

print(ctx.meta)
# {
#     "token_usage": 3847,
#     "max_tokens": 4000,
#     "items_included": 4,
#     "items_dropped": 1,
#     "freshness_status": "guaranteed"
# }
```

You know exactly what was included and what was dropped.
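That makes degradation observable rather than silent. For example, a hypothetical monitoring hook built on the meta fields shown above:

```python
import logging

logger = logging.getLogger("context")

# Alert when optional context (docs, history) is being squeezed out.
if ctx.meta["items_dropped"] > 0:
    logger.warning(
        "Context at %d/%d tokens; dropped %d items",
        ctx.meta["token_usage"],
        ctx.meta["max_tokens"],
        ctx.meta["items_dropped"],
    )
```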
Token Counting That Works
Fabra uses `tiktoken` for accurate token counting:

```python
from fabra.utils.tokens import OpenAITokenCounter

counter = OpenAITokenCounter(model="gpt-4")
tokens = counter.count("Your text here")
```

This matches OpenAI's actual tokenization. No more guessing with `len(text) / 4`.
Different models use different tokenizers:
- GPT-4 / GPT-3.5: `cl100k_base`
- Claude: different tokenizer (approximate)
- Local models: varies
Fabra defaults to OpenAI's tokenizer, but the counter is configurable.
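For OpenAI models, you can also let tiktoken resolve the encoding from the model name (plain tiktoken API, independent of Fabra):

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")  # resolves to cl100k_base
print(enc.name)                             # cl100k_base
print(len(enc.encode("Your text here")))
```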
Cost Estimation
Context assembly includes cost estimation:
```python
ctx = await build_prompt("user_123", "query")

print(ctx.meta["cost_usd"])
# 0.000423 (estimated input cost)
```

Useful for monitoring and budgeting API costs.
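The arithmetic behind such an estimate is just token count times a per-token price. A minimal sketch, with an assumed example rate (check your provider's current pricing):

```python
# Hypothetical input price; real rates vary by model and change over time.
PRICE_PER_1K_INPUT_TOKENS = 0.0005  # USD

def estimate_input_cost(token_count: int) -> float:
    return token_count / 1000 * PRICE_PER_1K_INPUT_TOKENS

print(estimate_input_cost(3847))  # ~0.0019 USD at the assumed rate
```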
Rich Display in Notebooks
The `Context` object has a rich HTML display for Jupyter notebooks:
```python
ctx = await build_prompt("user_123", "query")
ctx  # Displays formatted HTML with token usage bar
```

Shows:
- Freshness status (fresh vs degraded)
- Token usage bar with color coding
- Content preview
- Metadata
Common Patterns
Fixed System Prompt + Variable Retrieval
```python
@context(store, max_tokens=4000)
async def build(query: str):
    docs = await search(query)  # Variable size
    items = [
        ContextItem(content=SYSTEM_PROMPT, priority=0, required=True),
    ]
    # Add docs with increasing priority number (lower = more important)
    for i, doc in enumerate(docs):
        items.append(ContextItem(content=doc, priority=i + 1))
    return items
```

User Personalization + Documents
```python
@context(store, max_tokens=4000)
async def build(user_id: str, query: str):
    tier = await store.get_feature("user_tier", user_id)
    prefs = await store.get_feature("preferences", user_id)
    docs = await search(query)
    return [
        ContextItem(content=SYSTEM_PROMPT, priority=0, required=True),
        ContextItem(content=f"User: {tier}, prefs: {prefs}", priority=1),
        ContextItem(content=str(docs), priority=2),
    ]
```
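Note the ordering: the personalization line is small but high-value, so it outranks the bulky document dump. Under a tight budget, documents get trimmed before user preferences do.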
Graceful Degradation

```python
@context(store, max_tokens=4000)
async def build(query: str):
    try:
        detailed_docs = await detailed_search(query)
    except TimeoutError:
        detailed_docs = []
    return [
        ContextItem(content=SYSTEM_PROMPT, priority=0, required=True),
        ContextItem(content=str(detailed_docs), priority=1),
        ContextItem(content="If docs are missing, ask user to clarify.", priority=2),
    ]
```
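Pairing the try/except fallback with a low-priority instruction means a retrieval timeout degrades the prompt instead of failing the request: the model is still told how to behave when the docs come back empty.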
Try It

```bash
pip install "fabra-ai[ui]"
```

```python
from fabra.core import FeatureStore
from fabra.context import context, ContextItem

store = FeatureStore()

@context(store, max_tokens=1000)
async def demo(query: str):
    return [
        ContextItem(content="System prompt here", priority=0, required=True),
        ContextItem(content="Important info", priority=1),
        ContextItem(content="Nice to have " * 100, priority=2),  # Will be dropped
    ]

ctx = await demo("test")
print(f"Used {ctx.meta['token_usage']} tokens")
print(f"Dropped {ctx.meta['items_dropped']} items")
```