# Context Assembly
TL;DR: Use `@context` to compose context from multiple sources. Set token budgets, assign priorities, and let Fabra handle truncation automatically.
## At a Glance
| Aspect | Details |
|---|---|
| Decorator | `@context(store, max_tokens=4000)` |
| Token Counting | tiktoken (GPT-4, Claude-3 supported) |
| Priority | 0 = highest (kept first), 3+ = lowest (dropped first) |
| Required Flag | `required=True` is never dropped (no exception in MVP; overflow is flagged) |
| Debug | `fabra context show` or `GET /v1/context/{id}/explain` |
| Freshness | `freshness_sla="5m"` enforces a maximum data age (v1.5+) |
## What is Context Assembly?
LLM prompts have token limits. You need to fit:
- System prompt
- Retrieved documents
- User history
- Entity features
Context Assembly combines these sources intelligently, truncating lower-priority items when the budget is exceeded.
## Basic Usage
```python
from fabra.context import context, Context, ContextItem

@context(store, max_tokens=4000)
async def chat_context(user_id: str, query: str) -> Context:
    docs = await search_docs(query)
    history = await get_history(user_id)
    return [
        ContextItem(content="You are a helpful assistant.", priority=0, required=True),
        ContextItem(content=str(docs), priority=1, required=True),
        ContextItem(content=history, priority=2),  # Truncated first
    ]
```

## ContextItem
Each piece of context is wrapped in a ContextItem:
```python
ContextItem(
    content="The actual text content",
    priority=1,             # Lower = higher priority (kept first)
    required=False,          # If True, this item is never dropped
    source_id="docs:kb:v1"   # Optional identifier for caching/lineage
)
```

## Priority System
| Priority | Description | Example |
|---|---|---|
| 0 | Critical, never truncate | System prompt |
| 1 | High priority | Retrieved documents |
| 2 | Medium priority | User preferences |
| 3+ | Low priority | Suggestions, history |
Items are sorted by priority. When over budget, highest-numbered (lowest priority) items are truncated first.
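The effect is a greedy keep-by-priority pass. The snippet below is an illustrative sketch of that behavior, not Fabra's internal implementation; the dict shape and the pre-counted `tokens` field are assumptions made to keep the example self-contained.

```python
# Illustrative sketch of priority-based dropping (not Fabra's actual code).
def assemble(items: list[dict], max_tokens: int) -> tuple[list[dict], bool]:
    kept, used = [], 0
    for item in sorted(items, key=lambda i: i["priority"]):
        # Required items are always kept; optional items only if they fit.
        if item["required"] or used + item["tokens"] <= max_tokens:
            kept.append(item)
            used += item["tokens"]
    return kept, used > max_tokens  # second value ~ "budget_exceeded"

items = [
    {"priority": 0, "tokens": 100, "required": True},   # system prompt
    {"priority": 1, "tokens": 400, "required": True},   # retrieved docs
    {"priority": 2, "tokens": 800, "required": False},  # history
]
kept, exceeded = assemble(items, max_tokens=1000)
# history is dropped; required items are kept even if they alone exceed the budget
```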
## Required Flag
```python
ContextItem(content=docs, priority=1, required=True)
```

- `required=True`: Item is never dropped.
- `required=False` (default): Item is eligible to be dropped if over budget.
## Token Counting
Fabra uses tiktoken for accurate token counting:
```python
@context(store, max_tokens=4000, model="gpt-4")
async def chat_context(...) -> Context:
    pass
```

Supported models:

- `gpt-4`, `gpt-4-turbo` (cl100k_base)
- `gpt-3.5-turbo` (cl100k_base)
- `claude-3` (approximation)
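For reference, counting tokens with tiktoken directly looks like the sketch below. This is plain tiktoken usage, independent of Fabra, so treat it as a way to sanity-check budgets rather than a description of Fabra's internals.

```python
import tiktoken

# gpt-4 and gpt-3.5-turbo both resolve to the cl100k_base encoding.
enc = tiktoken.encoding_for_model("gpt-4")
n_tokens = len(enc.encode("You are a helpful assistant."))
print(n_tokens)  # short prompts are only a handful of tokens
```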
## Truncation Strategies
### Default: Drop Items
Lower-priority items are dropped entirely:
```python
@context(store, max_tokens=1000)
async def simple_context(query: str) -> Context:
    return [
        ContextItem(content=short_text, priority=0),   # 100 tokens - kept
        ContextItem(content=medium_text, priority=1),  # 400 tokens - kept
        ContextItem(content=long_text, priority=2),    # 800 tokens - DROPPED
    ]

# Result: 500 tokens (short + medium)
```

### Partial Truncation
Truncate content within an item:
```python
ContextItem(
    content=long_text,
    priority=2,
    # truncate_strategy="end"  # Future: Truncate from end
)
```

Strategies:

- `"end"`: Remove text from the end (default for docs)
- `"start"`: Remove text from the start (for history)
- `"middle"`: Keep start and end, remove the middle
Note: Partial truncation strategies are not implemented in the current MVP; items are dropped as whole units.
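Until then, you can pre-truncate long content yourself before building the item. A minimal sketch, assuming tiktoken and the three strategies described above; this helper is hypothetical and not part of Fabra:

```python
import tiktoken

def truncate(text: str, max_tokens: int, strategy: str = "end") -> str:
    """Hypothetical helper: trims text to max_tokens tokens using one of
    the strategies described above ("end", "start", "middle")."""
    enc = tiktoken.get_encoding("cl100k_base")
    toks = enc.encode(text)
    if len(toks) <= max_tokens:
        return text
    if strategy == "end":
        toks = toks[:max_tokens]          # drop the tail
    elif strategy == "start":
        toks = toks[-max_tokens:]         # drop the head (useful for history)
    else:  # "middle": keep head and tail, drop the middle
        half = max_tokens // 2
        toks = toks[:half] + toks[-(max_tokens - half):]
    return enc.decode(toks)

# history = truncate(history, 300, strategy="start")
# ContextItem(content=history, priority=2)
```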
## Explainability
Debug context assembly with the explain API:
```bash
curl http://localhost:8000/v1/context/ctx_abc123/explain
```

## Combining with Features
Mix features and retrievers in context:
```python
@context(store, max_tokens=4000)
async def rich_context(user_id: str, query: str) -> Context:
    # Retriever results
    docs = await search_docs(query)

    # Feature values
    prefs = await store.get_feature("user_preferences", user_id)
    tier = await store.get_feature("user_tier", user_id)

    return [
        ContextItem(content=SYSTEM_PROMPT, priority=0, required=True),
        ContextItem(content=str(docs), priority=1, required=True),
        ContextItem(content=f"User tier: {tier}", priority=2),
        ContextItem(content=f"Preferences: {prefs}", priority=3),
    ]
```

## Dynamic Budgets
Dynamic per-request budgets (adjusting `max_tokens` based on the request) are not supported in the current MVP. If you need tier-based budgets, define separate context functions, or pass different `max_tokens` values via the decorator in different deployments.
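For example, a tier-based split might look like the sketch below. The function names, the 8000-token figure, and the shared `build_items` helper are illustrative assumptions, not part of Fabra's API.

```python
# Hypothetical tier-based split: two decorated functions, two budgets.
@context(store, max_tokens=4000)
async def standard_context(user_id: str, query: str) -> Context:
    return await build_items(user_id, query)  # shared helper (assumed)

@context(store, max_tokens=8000)
async def premium_context(user_id: str, query: str) -> Context:
    return await build_items(user_id, query)

async def get_context(user_id: str, query: str) -> Context:
    tier = await store.get_feature("user_tier", user_id)
    fn = premium_context if tier == "premium" else standard_context
    return await fn(user_id, query)
```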
## Error Handling
### Budget overflow
Fabra drops optional items first. If required items still exceed the budget, the context is returned and `meta["budget_exceeded"]` is set to `True`:
```python
ctx = await chat_context(user_id, query)

if ctx.meta.get("budget_exceeded"):
    # e.g. shorten the system prompt, reduce retrieval top_k, etc.
    pass
```

## Best Practices
- Always set priority 0 for system prompt - never truncate instructions.
- Mark retrieved docs as required - they're the core of RAG.
- Use lower priority for nice-to-have content - history, suggestions.
- Test with edge cases - very long docs, empty retrievals (see the sketch after this list).
- Monitor with explain API - understand truncation patterns.
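As a small example of the edge-case testing point above, a pytest-style check for an empty retrieval might look like this sketch. It assumes pytest-asyncio is installed, reuses the `chat_context` function from Basic Usage, and uses a placeholder query your retriever returns nothing for.

```python
import pytest

# Requires the pytest-asyncio plugin; chat_context is defined in Basic Usage.
@pytest.mark.asyncio
async def test_context_handles_empty_retrieval():
    # A query the retriever returns nothing for (arbitrary placeholder).
    ctx = await chat_context("user_123", "zzz-no-matching-docs-zzz")
    # Assembly should still succeed and stay within budget.
    assert not ctx.meta.get("budget_exceeded")
```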
## Performance
Context assembly is fast:
- Token counting: ~1ms per 1000 tokens
- Priority sorting: O(n log n)
- Truncation: O(n)
For very large contexts (>50 items), consider pre-filtering.
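One way to pre-filter, assuming your retriever returns scored results; the `score` field and the `top_k` cut-off are assumptions for illustration, not a Fabra API.

```python
# Keep only the best-scoring documents before building ContextItems,
# so assembly never has to weigh dozens of low-value items.
def prefilter(docs: list[dict], top_k: int = 10) -> list[dict]:
    return sorted(docs, key=lambda d: d.get("score", 0.0), reverse=True)[:top_k]

@context(store, max_tokens=4000)
async def filtered_context(user_id: str, query: str) -> Context:
    docs = prefilter(await search_docs(query), top_k=10)
    return [
        ContextItem(content="You are a helpful assistant.", priority=0, required=True),
        *[ContextItem(content=str(d), priority=1) for d in docs],
    ]
```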
## Freshness SLAs
Ensure your context uses fresh data with freshness guarantees (v1.5+):
```python
@context(store, max_tokens=4000, freshness_sla="5m")
async def time_sensitive_context(user_id: str, query: str) -> Context:
    tier = await store.get_feature("user_tier", user_id)  # Must be <5m old
    balance = await store.get_feature("account_balance", user_id)
    return [
        ContextItem(content=f"User tier: {tier}", priority=0),
        ContextItem(content=f"Balance: ${balance}", priority=1),
    ]
```

### Checking Freshness
```python
ctx = await time_sensitive_context("user_123", "query")

# Check overall status
print(ctx.is_fresh)                   # True if all features within SLA
print(ctx.meta["freshness_status"])   # "guaranteed" or "degraded"

# See violations
for v in ctx.meta["freshness_violations"]:
    print(f"{v['feature']} is {v['age_ms']}ms old (limit: {v['sla_ms']}ms)")
```

### Strict Mode
For critical contexts, fail on stale data:
```python
from fabra.exceptions import FreshnessSLAError

@context(store, freshness_sla="30s", freshness_strict=True)
async def critical_context(...):
    pass  # Raises FreshnessSLAError if any feature exceeds SLA
```

See Freshness SLAs for the full guide.
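If you prefer to degrade gracefully rather than fail the request, you can catch the exception at the call site. The fallback shown here (reusing the non-strict `time_sensitive_context` from above, and passing `user_id`/`query` to `critical_context`) is an assumption for illustration:

```python
from fabra.exceptions import FreshnessSLAError

async def answer(user_id: str, query: str) -> Context:
    try:
        return await critical_context(user_id, query)
    except FreshnessSLAError:
        # Fall back to a non-strict context rather than failing the request.
        return await time_sensitive_context(user_id, query)
```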
## FAQ
Q: How do I set a token budget for LLM context?
A: Use the @context decorator with max_tokens parameter: @context(store, max_tokens=4000). Fabra automatically truncates lower-priority items when the budget is exceeded.
Q: What happens when context exceeds token limit?
A: Optional items (required=False) are dropped first (highest priority numbers first). Required items are never dropped. If required items still exceed the budget, the context is returned and meta["budget_exceeded"]=true is set.
Q: How do I prioritize content in LLM context?
A: Set priority on ContextItem: priority=0 (critical, kept first), priority=1 (high), priority=2+ (lower, dropped first). System prompts should always be priority 0.
Q: Does Fabra support token counting for Claude and GPT-4?
A: Yes. Fabra uses tiktoken for accurate counting. Specify model: @context(store, max_tokens=4000, model="gpt-4"). Claude-3 uses approximation.
Q: How do I debug context assembly?
A: Use GET /v1/context/{id}/explain (metadata-only trace) and fabra context show (full CRS-001 record). For a visual view, use GET /v1/context/{id}/visualize.
Q: Can I dynamically change the token budget?
A: Not in the current MVP. Use separate context functions (or deployments) with different max_tokens.
## Next Steps
- Freshness SLAs: Data freshness guarantees
- Retrievers: Define semantic search
- Event-Driven Features: Fresh context
- Use Case: RAG Chatbot: Full example