
Context Assembly

TL;DR: Use @context to compose context from multiple sources. Set token budgets, assign priorities, and let Fabra handle truncation automatically.

At a Glance

  • Decorator: @context(store, max_tokens=4000)
  • Token counting: tiktoken (GPT-4, GPT-3.5; Claude-3 via approximation)
  • Priority: 0 = highest (kept first), 3+ = lowest (dropped first)
  • Required flag: required=True items are never dropped (no exception in MVP; overflow is flagged)
  • Debugging: fabra context show or GET /v1/context/{id}/explain
  • Freshness: freshness_sla="5m" enforces a maximum data age (v1.5+)

What is Context Assembly?

LLM prompts have token limits. You need to fit:

  • System prompt
  • Retrieved documents
  • User history
  • Entity features

Context Assembly combines these sources intelligently, truncating lower-priority items when the budget is exceeded.

Basic Usage

from fabra.context import context, Context, ContextItem

@context(store, max_tokens=4000)
async def chat_context(user_id: str, query: str) -> Context:
    docs = await search_docs(query)
    history = await get_history(user_id)

    return [
        ContextItem(content="You are a helpful assistant.", priority=0, required=True),
        ContextItem(content=str(docs), priority=1, required=True),
        ContextItem(content=history, priority=2),  # Truncated first
    ]

ContextItem

Each piece of context is wrapped in a ContextItem:

ContextItem(
    content="The actual text content",
    priority=1,          # Lower = higher priority (kept first)
    required=False,      # If True, this item is never dropped
    source_id="docs:kb:v1"  # Optional identifier for caching/lineage
)

Priority System

  • Priority 0: Critical, never truncate (e.g., system prompt)
  • Priority 1: High priority (e.g., retrieved documents)
  • Priority 2: Medium priority (e.g., user preferences)
  • Priority 3+: Low priority (e.g., suggestions, history)

Items are sorted by priority. When over budget, highest-numbered (lowest priority) items are truncated first.

Required Flag

ContextItem(content=docs, priority=1, required=True)
  • required=True: Item is never dropped.
  • required=False (default): Item is eligible to be dropped if over budget.

Token Counting

Fabra uses tiktoken for accurate token counting:

@context(store, max_tokens=4000, model="gpt-4")
async def chat_context(...) -> Context:
    pass

Supported models:

  • gpt-4, gpt-4-turbo (cl100k_base)
  • gpt-3.5-turbo (cl100k_base)
  • claude-3 (approximation)
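
To sanity-check a budget outside of @context, you can count tokens with tiktoken directly. This is a minimal sketch using the cl100k_base encoding listed above; count_tokens is an illustrative helper, not a Fabra API.

import tiktoken

# cl100k_base is the encoding used by gpt-4 and gpt-3.5-turbo.
enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    # Number of tokens this text will consume against max_tokens.
    return len(enc.encode(text))

print(count_tokens("You are a helpful assistant."))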

Truncation Strategies

Default: Drop Items

Lower-priority items are dropped entirely:

@context(store, max_tokens=1000)
async def simple_context(query: str) -> Context:
    return [
        ContextItem(content=short_text, priority=0),     # 100 tokens - kept
        ContextItem(content=medium_text, priority=1),    # 400 tokens - kept
        ContextItem(content=long_text, priority=2),      # 800 tokens - DROPPED
    ]
# Result: 500 tokens (short + medium)

Partial Truncation

Truncate content within an item:

ContextItem(
    content=long_text,
    priority=2,
    # truncate_strategy="end"  # Future: Truncate from end
)

Planned strategies:

  • "end": Remove text from the end (default for docs)
  • "start": Remove text from the start (for history)
  • "middle": Keep start and end, remove the middle

Note: Partial truncation strategies are not implemented in the current MVP; items are dropped as whole units.

Explainability

Debug context assembly with the explain API:

curl http://localhost:8000/v1/context/ctx_abc123/explain
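
For programmatic debugging, you can call the same endpoint from Python. A minimal sketch using httpx; the context ID ctx_abc123 is a placeholder, and the exact shape of the response is not shown here, so inspect the JSON rather than relying on specific keys.

import httpx

# Fetch the assembly trace for a context ID from a previous request.
resp = httpx.get("http://localhost:8000/v1/context/ctx_abc123/explain")
resp.raise_for_status()
print(resp.json())  # shows which items were kept, dropped, or truncated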

Combining with Features

Mix features and retrievers in context:

@context(store, max_tokens=4000)
async def rich_context(user_id: str, query: str) -> Context:
    # Retriever results
    docs = await search_docs(query)

    # Feature values
    prefs = await store.get_feature("user_preferences", user_id)
    tier = await store.get_feature("user_tier", user_id)

    return [
        ContextItem(content=SYSTEM_PROMPT, priority=0, required=True),
        ContextItem(content=str(docs), priority=1, required=True),
        ContextItem(content=f"User tier: {tier}", priority=2),
        ContextItem(content=f"Preferences: {prefs}", priority=3),
    ]

Dynamic Budgets

Dynamic, per-request token budgets are not supported in the current MVP. If you need tier-based budgets, define separate context functions with different max_tokens values (or deploy the same function with different decorator settings), as in the sketch below.
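
For example, one hedged way to get tier-based budgets today is to define one decorated function per tier and dispatch between them in application code. The function names and the dispatch helper below are illustrative, not part of Fabra.

@context(store, max_tokens=2000)
async def free_chat_context(user_id: str, query: str) -> Context:
    docs = await search_docs(query)
    return [
        ContextItem(content="You are a helpful assistant.", priority=0, required=True),
        ContextItem(content=str(docs), priority=1),
    ]

@context(store, max_tokens=8000)
async def pro_chat_context(user_id: str, query: str) -> Context:
    docs = await search_docs(query)
    history = await get_history(user_id)
    return [
        ContextItem(content="You are a helpful assistant.", priority=0, required=True),
        ContextItem(content=str(docs), priority=1),
        ContextItem(content=history, priority=2),
    ]

# Dispatch in application code based on the user's tier.
async def build_context(user_id: str, query: str, tier: str) -> Context:
    fn = pro_chat_context if tier == "pro" else free_chat_context
    return await fn(user_id, query)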

Error Handling

Budget Overflow

Fabra drops optional items first. If required items still exceed the budget, the context is returned and meta["budget_exceeded"] is set to True:

ctx = await chat_context(user_id, query)
if ctx.meta.get("budget_exceeded"):
    # e.g. shorten the system prompt, reduce retrieval top_k, etc.
    pass

Best Practices

  1. Always set priority 0 for system prompt - never truncate instructions.
  2. Mark retrieved docs as required - they're the core of RAG.
  3. Use lower priority for nice-to-have content - history, suggestions.
  4. Test with edge cases - very long docs, empty retrievals (see the sketch after this list).
  5. Monitor with explain API - understand truncation patterns.
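
As a hedged illustration of point 4, the snippet below drives the chat_context function from Basic Usage through two edge cases and inspects the budget_exceeded flag described under Error Handling; the specific queries are placeholders.

import asyncio

async def check_edge_cases() -> None:
    # Edge case 1: a query that retrieves nothing should still assemble.
    ctx = await chat_context("user_123", "query-with-no-matching-docs")
    print(ctx.meta.get("budget_exceeded"))

    # Edge case 2: an oversized request should flag overflow, not raise.
    ctx = await chat_context("user_123", "very " * 2000 + "long query")
    print(ctx.meta.get("budget_exceeded"))

asyncio.run(check_edge_cases())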

Performance

Context assembly is fast:

  • Token counting: ~1ms per 1000 tokens
  • Priority sorting: O(n log n)
  • Truncation: O(n)

For very large contexts (>50 items), consider pre-filtering.
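
A hedged sketch of pre-filtering: score and cap the candidate list yourself before wrapping items, so the assembler only sorts and truncates a bounded set. The dict keys ("score", "text") and the cap of 50 are illustrative, not Fabra defaults.

@context(store, max_tokens=4000)
async def filtered_context(query: str) -> Context:
    candidates = await search_docs(query)  # may return hundreds of documents

    # Keep only the 50 highest-scoring candidates before building ContextItems.
    # Assumes each candidate is a dict with "score" and "text" keys.
    top = sorted(candidates, key=lambda d: d["score"], reverse=True)[:50]

    return [
        ContextItem(content="You are a helpful assistant.", priority=0, required=True),
        *[ContextItem(content=d["text"], priority=1) for d in top],
    ]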

Freshness SLAs

Ensure your context uses fresh data with freshness guarantees (v1.5+):

@context(store, max_tokens=4000, freshness_sla="5m")
async def time_sensitive_context(user_id: str, query: str) -> Context:
    tier = await store.get_feature("user_tier", user_id)  # Must be <5m old
    balance = await store.get_feature("account_balance", user_id)
    return [
        ContextItem(content=f"User tier: {tier}", priority=0),
        ContextItem(content=f"Balance: ${balance}", priority=1),
    ]

Checking Freshness

ctx = await time_sensitive_context("user_123", "query")

# Check overall status
print(ctx.is_fresh)  # True if all features within SLA
print(ctx.meta["freshness_status"])  # "guaranteed" or "degraded"

# See violations
for v in ctx.meta["freshness_violations"]:
    print(f"{v['feature']} is {v['age_ms']}ms old (limit: {v['sla_ms']}ms)")

Strict Mode

For critical contexts, fail on stale data:

from fabra.exceptions import FreshnessSLAError

@context(store, freshness_sla="30s", freshness_strict=True)
async def critical_context(...):
    pass  # Raises FreshnessSLAError if any feature exceeds SLA
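
In application code you would typically catch the error and degrade gracefully rather than crash. A minimal sketch, assuming critical_context takes a user ID and query like the earlier examples; the fallback behavior shown is illustrative.

import logging

logger = logging.getLogger(__name__)

try:
    ctx = await critical_context("user_123", "query")
except FreshnessSLAError as exc:
    # Feature data exceeded the 30s SLA; log and fall back instead of
    # serving stale context to the model.
    logger.warning("stale context, falling back: %s", exc)
    ctx = None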

See Freshness SLAs for the full guide.

FAQ

Q: How do I set a token budget for LLM context? A: Use the @context decorator with max_tokens parameter: @context(store, max_tokens=4000). Fabra automatically truncates lower-priority items when the budget is exceeded.

Q: What happens when context exceeds token limit? A: Optional items (required=False) are dropped first (highest priority numbers first). Required items are never dropped. If required items still exceed the budget, the context is returned and meta["budget_exceeded"]=true is set.

Q: How do I prioritize content in LLM context? A: Set priority on ContextItem: priority=0 (critical, kept first), priority=1 (high), priority=2+ (lower, dropped first). System prompts should always be priority 0.

Q: Does Fabra support token counting for Claude and GPT-4? A: Yes. Fabra uses tiktoken for accurate counting. Specify model: @context(store, max_tokens=4000, model="gpt-4"). Claude-3 uses approximation.

Q: How do I debug context assembly? A: Use GET /v1/context/{id}/explain (metadata-only trace) and fabra context show (full CRS-001 record). For a visual view, use GET /v1/context/{id}/visualize.

Q: Can I dynamically change the token budget? A: Not in the current MVP. Use separate context functions (or deployments) with different max_tokens.


Next Steps