
Context Assembly

TL;DR: Use @context to compose context from multiple sources. Set token budgets, assign priorities, and let Fabra handle truncation automatically.

At a Glance

  • Decorator: @context(store, max_tokens=4000)
  • Token counting: tiktoken (GPT-4, GPT-3.5; Claude-3 via approximation)
  • Priority: 0 = highest (kept first), 3+ = lowest (dropped first)
  • Required flag: required=True items are never dropped (no exception in MVP; overflow is flagged)
  • Debugging: fabra context show or GET /v1/context/{id}/explain
  • Freshness: freshness_sla="5m" enforces a maximum data age (v1.5+)

What is Context Assembly?

LLM prompts have token limits. You need to fit:

  • System prompt
  • Retrieved documents
  • User history
  • Entity features

Context Assembly combines these sources intelligently, truncating lower-priority items when the budget is exceeded.

Basic Usage

from fabra.context import context, Context, ContextItem

@context(store, max_tokens=4000)
async def chat_context(user_id: str, query: str) -> Context:
    docs = await search_docs(query)
    history = await get_history(user_id)

    return [
        ContextItem(content="You are a helpful assistant.", priority=0, required=True),
        ContextItem(content=str(docs), priority=1, required=True),
        ContextItem(content=history, priority=2),  # Truncated first
    ]

ContextItem

Each piece of context is wrapped in a ContextItem:

ContextItem(
    content="The actual text content",
    priority=1,          # Lower = higher priority (kept first)
    required=False,      # If True, this item is never dropped
    source_id="docs:kb:v1"  # Optional identifier for caching/lineage
)

Priority System

  • Priority 0: Critical, never truncate (e.g., system prompt)
  • Priority 1: High priority (e.g., retrieved documents)
  • Priority 2: Medium priority (e.g., user preferences)
  • Priority 3+: Low priority (e.g., suggestions, history)

Items are sorted by priority. When over budget, highest-numbered (lowest priority) items are truncated first.

Required Flag

ContextItem(content=docs, priority=1, required=True)
  • required=True: Item is never dropped.
  • required=False (default): Item is eligible to be dropped if over budget.

Token Counting

Fabra uses tiktoken for accurate token counting:

@context(store, max_tokens=4000, model="gpt-4")
async def chat_context(...) -> Context:
    pass

Supported models:

  • gpt-4, gpt-4-turbo (cl100k_base)
  • gpt-3.5-turbo (cl100k_base)
  • claude-3 (approximation)
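
To sanity-check a budget outside of @context, you can count tokens with tiktoken directly. This is a minimal sketch using the cl100k_base encoding listed above; count_tokens is an illustrative helper, not a Fabra API.

import tiktoken

# cl100k_base is the encoding used by gpt-4 and gpt-3.5-turbo.
enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    # Number of tokens this text will consume against max_tokens.
    return len(enc.encode(text))

print(count_tokens("You are a helpful assistant."))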

Truncation Strategies

Default: Drop Items

Lower-priority items are dropped entirely:

@context(store, max_tokens=1000)
async def simple_context(query: str) -> Context:
    return [
        ContextItem(content=short_text, priority=0),     # 100 tokens - kept
        ContextItem(content=medium_text, priority=1),    # 400 tokens - kept
        ContextItem(content=long_text, priority=2),      # 800 tokens - DROPPED
    ]
# Result: 500 tokens (short + medium)

Partial Truncation

Truncate content within an item:

ContextItem(
    content=long_text,
    priority=2,
    # truncate_strategy="end"  # Future: Truncate from end
)

Planned strategies:

  • "end": Remove text from the end (default for docs)
  • "start": Remove text from the start (for history)
  • "middle": Keep start and end, remove the middle

Note: Partial truncation strategies are not implemented in the current MVP; items are dropped as whole units.

Explainability

Debug context assembly with the explain API:

curl http://localhost:8000/v1/context/ctx_abc123/explain
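
For programmatic debugging, you can call the same endpoint from Python. A minimal sketch using httpx; the context ID ctx_abc123 is a placeholder, and the exact shape of the response is not shown here, so inspect the JSON rather than relying on specific keys.

import httpx

# Fetch the assembly trace for a context ID from a previous request.
resp = httpx.get("http://localhost:8000/v1/context/ctx_abc123/explain")
resp.raise_for_status()
print(resp.json())  # shows which items were kept, dropped, or truncated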

Combining with Features

Mix features and retrievers in context:

@context(store, max_tokens=4000)
async def rich_context(user_id: str, query: str) -> Context:
    # Retriever results
    docs = await search_docs(query)

    # Feature values
    prefs = await store.get_feature("user_preferences", user_id)
    tier = await store.get_feature("user_tier", user_id)

    return [
        ContextItem(content=SYSTEM_PROMPT, priority=0, required=True),
        ContextItem(content=str(docs), priority=1, required=True),
        ContextItem(content=f"User tier: {tier}", priority=2),
        ContextItem(content=f"Preferences: {prefs}", priority=3),
    ]

Dynamic Budgets

Dynamic, per-request token budgets are not supported in the current MVP. If you need tier-based budgets, define separate context functions with different max_tokens values (or deploy the same function with different decorator settings), as in the sketch below.
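
For example, one hedged way to get tier-based budgets today is to define one decorated function per tier and dispatch between them in application code. The function names and the dispatch helper below are illustrative, not part of Fabra.

@context(store, max_tokens=2000)
async def free_chat_context(user_id: str, query: str) -> Context:
    docs = await search_docs(query)
    return [
        ContextItem(content="You are a helpful assistant.", priority=0, required=True),
        ContextItem(content=str(docs), priority=1),
    ]

@context(store, max_tokens=8000)
async def pro_chat_context(user_id: str, query: str) -> Context:
    docs = await search_docs(query)
    history = await get_history(user_id)
    return [
        ContextItem(content="You are a helpful assistant.", priority=0, required=True),
        ContextItem(content=str(docs), priority=1),
        ContextItem(content=history, priority=2),
    ]

# Dispatch in application code based on the user's tier.
async def build_context(user_id: str, query: str, tier: str) -> Context:
    fn = pro_chat_context if tier == "pro" else free_chat_context
    return await fn(user_id, query)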

Error Handling

Budget Overflow

Fabra drops optional items first. If required items still exceed the budget, the context is returned and meta["budget_exceeded"] is set to True:

ctx = await chat_context(user_id, query)
if ctx.meta.get("budget_exceeded"):
    # e.g. shorten the system prompt, reduce retrieval top_k, etc.
    pass

Best Practices

  1. Always set priority 0 for system prompt - never truncate instructions.
  2. Mark retrieved docs as required - they're the core of RAG.
  3. Use lower priority for nice-to-have content - history, suggestions.
  4. Test with edge cases - very long docs, empty retrievals (see the sketch after this list).
  5. Monitor with explain API - understand truncation patterns.
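
As a hedged illustration of point 4, the snippet below drives the chat_context function from Basic Usage through two edge cases and inspects the budget_exceeded flag described under Error Handling; the specific queries are placeholders.

import asyncio

async def check_edge_cases() -> None:
    # Edge case 1: a query that retrieves nothing should still assemble.
    ctx = await chat_context("user_123", "query-with-no-matching-docs")
    print(ctx.meta.get("budget_exceeded"))

    # Edge case 2: an oversized request should flag overflow, not raise.
    ctx = await chat_context("user_123", "very " * 2000 + "long query")
    print(ctx.meta.get("budget_exceeded"))

asyncio.run(check_edge_cases())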

Performance

Context assembly is fast:

  • Token counting: ~1ms per 1000 tokens
  • Priority sorting: O(n log n)
  • Truncation: O(n)

For very large contexts (>50 items), consider pre-filtering.
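
A hedged sketch of pre-filtering: score and cap the candidate list yourself before wrapping items, so the assembler only sorts and truncates a bounded set. The dict keys ("score", "text") and the cap of 50 are illustrative, not Fabra defaults.

@context(store, max_tokens=4000)
async def filtered_context(query: str) -> Context:
    candidates = await search_docs(query)  # may return hundreds of documents

    # Keep only the 50 highest-scoring candidates before building ContextItems.
    # Assumes each candidate is a dict with "score" and "text" keys.
    top = sorted(candidates, key=lambda d: d["score"], reverse=True)[:50]

    return [
        ContextItem(content="You are a helpful assistant.", priority=0, required=True),
        *[ContextItem(content=d["text"], priority=1) for d in top],
    ]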

Freshness SLAs

Ensure your context uses fresh data with freshness guarantees (v1.5+):

@context(store, max_tokens=4000, freshness_sla="5m")
async def time_sensitive_context(user_id: str, query: str) -> Context:
    tier = await store.get_feature("user_tier", user_id)  # Must be <5m old
    balance = await store.get_feature("account_balance", user_id)
    return [
        ContextItem(content=f"User tier: {tier}", priority=0),
        ContextItem(content=f"Balance: ${balance}", priority=1),
    ]

Checking Freshness

ctx = await time_sensitive_context("user_123", "query")

# Check overall status
print(ctx.is_fresh)  # True if all features within SLA
print(ctx.meta["freshness_status"])  # "guaranteed" or "degraded"

# See violations
for v in ctx.meta["freshness_violations"]:
    print(f"{v['feature']} is {v['age_ms']}ms old (limit: {v['sla_ms']}ms)")

Strict Mode

For critical contexts, fail on stale data:

from fabra.exceptions import FreshnessSLAError

@context(store, freshness_sla="30s", freshness_strict=True)
async def critical_context(...):
    pass  # Raises FreshnessSLAError if any feature exceeds SLA
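
In application code you would typically catch the error and degrade gracefully rather than crash. A minimal sketch, assuming critical_context takes a user ID and query like the earlier examples; the fallback behavior shown is illustrative.

import logging

logger = logging.getLogger(__name__)

try:
    ctx = await critical_context("user_123", "query")
except FreshnessSLAError as exc:
    # Feature data exceeded the 30s SLA; log and fall back instead of
    # serving stale context to the model.
    logger.warning("stale context, falling back: %s", exc)
    ctx = None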

See Freshness SLAs for the full guide.

FAQ

Q: How do I set a token budget for LLM context? A: Use the @context decorator with max_tokens parameter: @context(store, max_tokens=4000). Fabra automatically truncates lower-priority items when the budget is exceeded.

Q: What happens when context exceeds token limit? A: Optional items (required=False) are dropped first (highest priority numbers first). Required items are never dropped. If required items still exceed the budget, the context is returned and meta["budget_exceeded"]=true is set.

Q: How do I prioritize content in LLM context? A: Set priority on ContextItem: priority=0 (critical, kept first), priority=1 (high), priority=2+ (lower, dropped first). System prompts should always be priority 0.

Q: Does Fabra support token counting for Claude and GPT-4? A: Yes. Fabra uses tiktoken for accurate counting. Specify model: @context(store, max_tokens=4000, model="gpt-4"). Claude-3 uses approximation.

Q: How do I debug context assembly? A: Use GET /v1/context/{id}/explain (metadata-only trace) and fabra context show (full CRS-001 record). For a visual view, use GET /v1/context/{id}/visualize.

Q: Can I dynamically change the token budget? A: Not in the current MVP. Use separate context functions (or deployments) with different max_tokens.


Next Steps