# Point-in-Time Features: Preventing ML Data Leakage
Your fraud model has 99% accuracy in training. In production, it's 60%.
The culprit? Data leakage. Your training data used feature values that didn't exist at prediction time.
## The Data Leakage Problem
Imagine you're building a fraud detection model. You have:
- Events table: transactions with timestamps and labels (fraud/not fraud)
- Features table: user behavior metrics updated throughout the day
Events:
| user_id | timestamp | is_fraud |
|---------|---------------------|----------|
| user_1 | 2024-01-15 10:00:00 | true |
| user_1 | 2024-01-15 14:00:00 | false |
Features (transaction_count_1h):
| user_id | timestamp | value |
|---------|---------------------|-------|
| user_1 | 2024-01-15 09:00:00 | 2 |
| user_1 | 2024-01-15 11:00:00 | 5 | <- Updated AFTER fraud
| user_1 | 2024-01-15 15:00:00 | 3 |

The wrong way:

```sql
SELECT e.*, f.value AS transaction_count
FROM events e
JOIN features f ON e.user_id = f.user_id
WHERE f.feature_name = 'transaction_count_1h'
```

This joins every feature row for the user, including rows written after the event. For the 10:00 fraud event, the training row can pick up the 11:00 or 15:00 feature value: data that didn't exist at prediction time.
The right way:

```sql
SELECT e.*, f.value AS transaction_count
FROM events e
ASOF JOIN (
    -- Filter to the feature first, so the ASOF match is taken
    -- only among rows for this feature
    SELECT *
    FROM features
    WHERE feature_name = 'transaction_count_1h'
) f
    ON e.user_id = f.user_id
    AND f.timestamp <= e.timestamp
```

This joins the most recent feature value that existed at event time. For the 10:00 fraud event, it uses the 09:00 feature value.
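If your training pipeline lives in pandas rather than SQL, `pd.merge_asof` implements the same backward-looking match. A minimal sketch using the toy data from the tables above, contrasting the leaky join with the correct one:

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": ["user_1", "user_1"],
    "timestamp": pd.to_datetime(["2024-01-15 10:00", "2024-01-15 14:00"]),
    "is_fraud": [True, False],
})
features = pd.DataFrame({
    "user_id": ["user_1", "user_1", "user_1"],
    "timestamp": pd.to_datetime(
        ["2024-01-15 09:00", "2024-01-15 11:00", "2024-01-15 15:00"]
    ),
    "value": [2, 5, 3],
})

# Leaky: a plain join fans out to every feature row for the user,
# including rows written after each event
leaky = events.merge(features, on="user_id", suffixes=("", "_feat"))

# Point-in-time correct: for each event, take the most recent feature
# row at or before the event (both frames must be sorted on the key)
correct = pd.merge_asof(
    events.sort_values("timestamp"),
    features.sort_values("timestamp"),
    on="timestamp",
    by="user_id",
    direction="backward",
)
print(correct)  # the 10:00 fraud event gets value 2 (from 09:00)
```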
## How Fabra Handles This
Fabra automatically logs feature values with timestamps:
```python
from datetime import timedelta
from fabra.core import feature

@feature(entity=User, refresh=timedelta(minutes=5))
def transaction_count_1h(user_id: str) -> int:
    return count_transactions(user_id, hours=1)
```

Every time this feature is computed, Fabra logs:
| user_id | feature_name | value | timestamp |
|---------|-----------------------|-------|---------------------|
| user_1 | transaction_count_1h | 2 | 2024-01-15 09:00:00 |
| user_1 | transaction_count_1h | 5 | 2024-01-15 09:05:00 |
| user_1 | transaction_count_1h | 5 | 2024-01-15 09:10:00 |

When you generate training data, Fabra uses the appropriate temporal join.
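Conceptually, each refresh appends a new row to this log rather than overwriting the old value. A minimal sketch of the idea (illustrative only, not Fabra's internals):

```python
import datetime as dt

# Append-only feature log: history is preserved so temporal joins
# can reconstruct what was known at any point in time
feature_log: list[dict] = []

def log_feature(user_id: str, feature_name: str, value: int) -> None:
    feature_log.append({
        "user_id": user_id,
        "feature_name": feature_name,
        "value": value,
        "timestamp": dt.datetime.now(dt.timezone.utc),
    })
```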
## DuckDB: ASOF JOIN
For local development, Fabra uses DuckDB's native ASOF JOIN:
```sql
SELECT
    events.*,
    features.value AS transaction_count
FROM events
ASOF JOIN features
    ON events.user_id = features.user_id
    AND features.timestamp <= events.timestamp
```

DuckDB's ASOF JOIN finds the most recent matching row efficiently.
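You can verify the semantics yourself with the `duckdb` Python package. A self-contained sketch (assumes `pip install duckdb pandas`; the data mirrors the tables above):

```python
import duckdb
import pandas as pd

events = pd.DataFrame({
    "user_id": ["user_1", "user_1"],
    "timestamp": pd.to_datetime(["2024-01-15 10:00", "2024-01-15 14:00"]),
    "is_fraud": [True, False],
})
features = pd.DataFrame({
    "user_id": ["user_1", "user_1", "user_1"],
    "timestamp": pd.to_datetime(
        ["2024-01-15 09:00", "2024-01-15 11:00", "2024-01-15 15:00"]
    ),
    "value": [2, 5, 3],
})

# DuckDB can query in-scope DataFrames by name (replacement scans)
result = duckdb.sql("""
    SELECT events.*, features.value AS transaction_count
    FROM events
    ASOF JOIN features
        ON events.user_id = features.user_id
        AND features.timestamp <= events.timestamp
""").df()

print(result)
# The 10:00 event gets value 2 (from 09:00); the 14:00 event gets 5 (from 11:00)
```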
## Postgres: LATERAL JOIN
Postgres doesn't have ASOF JOIN, so Fabra uses LATERAL JOIN:
```sql
SELECT
    e.*,
    f.value AS transaction_count
FROM events e
LEFT JOIN LATERAL (
    SELECT value
    FROM features
    WHERE user_id = e.user_id
      AND feature_name = 'transaction_count_1h'
      AND timestamp <= e.timestamp
    ORDER BY timestamp DESC
    LIMIT 1
) f ON true
```

Same semantics, slightly different syntax: for each event row, the subquery picks the most recent feature value at or before the event's timestamp.
## Generating Training Data
```python
import pandas as pd
from fabra.core import FeatureStore

store = FeatureStore()

# Your events DataFrame
events = pd.DataFrame({
    "user_id": ["user_1", "user_1", "user_2"],
    "timestamp": ["2024-01-15 10:00", "2024-01-15 14:00", "2024-01-15 12:00"],
    "label": [1, 0, 0],
})

# Get point-in-time correct features
training_data = await store.get_historical_features(
    entity_df=events,
    features=["transaction_count_1h", "avg_purchase_amount"],
)
```

The result includes feature values as they existed at each event's timestamp.
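`get_historical_features` is awaited, so outside a notebook (where top-level `await` works) you would wrap the call; a sketch, reusing the `store` and `events` defined above:

```python
import asyncio

async def build_training_data():
    # `store` and `events` come from the snippet above
    return await store.get_historical_features(
        entity_df=events,
        features=["transaction_count_1h", "avg_purchase_amount"],
    )

training_data = asyncio.run(build_training_data())
```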
## Time Travel for Debugging
Fabra supports "time travel" queries for debugging:
```python
# Get features as they existed at a specific time
async with store.time_travel(timestamp="2024-01-15 09:30:00"):
    value = await store.get_feature("transaction_count_1h", "user_1")
    # Returns the value from 09:00 (most recent before 09:30)
```

Useful for investigating production incidents: "What features did the model see when it made this prediction?"
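For example, to reconstruct the full input vector for one suspect prediction, a sketch built on the same `time_travel` API (the timestamp and feature list are illustrative):

```python
# Illustrative: replay what the model saw at the time of one prediction
incident_ts = "2024-01-15 09:30:00"
feature_names = ["transaction_count_1h", "avg_purchase_amount"]

async with store.time_travel(timestamp=incident_ts):
    snapshot = {
        name: await store.get_feature(name, "user_1")
        for name in feature_names
    }

print(snapshot)  # the exact feature values available at 09:30
```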
## Common Pitfalls
### 1. Joining on Latest Value
Wrong:

```sql
SELECT * FROM events e
JOIN features f ON e.user_id = f.user_id
WHERE f.feature_name = 'x'
  AND f.timestamp = (SELECT MAX(timestamp) FROM features)
```

Right: Use temporal joins (ASOF or LATERAL).
### 2. Aggregating Future Data
Wrong:

```python
@feature(entity=User)
def total_purchases(user_id: str) -> int:
    # Computed at training time, this counts ALL purchases,
    # including ones made after the event being labeled
    return db.query("SELECT COUNT(*) FROM purchases WHERE user_id = %s", user_id)
```

Right:
```python
from datetime import timedelta

@feature(entity=User, refresh=timedelta(minutes=5))
def total_purchases(user_id: str) -> int:
    # Log the current value on each refresh; the temporal join
    # supplies the point-in-time value at training time
    return db.query("SELECT COUNT(*) FROM purchases WHERE user_id = %s", user_id)
```

The temporal join ensures training uses the value that existed at prediction time.
### 3. Using Event Timestamps for Features
Wrong:

```python
def get_features(event_timestamp):
    # Computing the feature from scratch at event time:
    # slow and error-prone
    return expensive_computation(as_of=event_timestamp)
```

Right: Log feature values continuously and let the temporal join handle retrieval.
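Retrieval from an append-only log is then a cheap lookup rather than a recomputation. An illustrative helper in the spirit of the log sketched earlier (toy in-memory version of the idea, not Fabra's API):

```python
from datetime import datetime

def feature_at(log: list[dict], user_id: str, name: str, as_of: datetime) -> int | None:
    """Most recent logged value at or before `as_of` (None if nothing was logged yet)."""
    rows = [
        r for r in log
        if r["user_id"] == user_id
        and r["feature_name"] == name
        and r["timestamp"] <= as_of
    ]
    return max(rows, key=lambda r: r["timestamp"])["value"] if rows else None
```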
## The Accuracy Impact
Teams typically see a 5-15% improvement in production accuracy after fixing data leakage issues.
A fraud model we worked with went from 62% production accuracy to 78% after implementing proper point-in-time features. Training accuracy was unchanged (it was already high), but production finally matched it.
## Try It
```bash
pip install "fabra-ai[ui]"
```

```python
from datetime import timedelta
from fabra.core import FeatureStore, entity, feature

store = FeatureStore()

@entity(store)
class User:
    user_id: str

@feature(entity=User, refresh=timedelta(minutes=5))
def login_count(user_id: str) -> int:
    return get_login_count(user_id)

# Features are automatically logged with timestamps
# Use get_historical_features for training data
```