Asof / feature store / point-in-time join
sources → offline store → as-of join → training set | where leakage dies.

The join that keeps the future out of your training data.

Asof is a zero-dependency feature store written in plain Python. It is built around one operation: the point-in-time (as-of) join, which fetches the most recent feature value known as of each label's timestamp and never a microsecond later. This is the map of how the whole thing fits together.

The system map

One store, two paths out

[Diagram: system map]
Sources (orders.csv, JSONL, dicts) → feature views (schema, TTL) → offline store (full history, sqlite)
Point-in-time path (correct): offline store → as-of join (feature_ts ≤ label_ts) → training set, no leakage
Serving path: offline store → materialize (latest wins) → online store (latest per entity) → feature server → model
Naive last-value path (leaks): the naive last-value join grabs the latest row regardless of label time, which leaks the future and inflates metrics

Try it

Drag the label time. Watch which value gets picked.

One customer, six feature events

Each tick is a feature row with a timestamp and a rolling_7d_spend value. Move the label time t. The as-of join takes the latest event at or before t. The naive join always grabs event #6, even when it sits in the future of t.

as-of join · correct: most recent value at or before t
naive last-value · leaks: always event #6, the latest row
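
The difference between the two joins can be sketched for a single customer in a few lines. The timestamps and rolling_7d_spend values below are made up for illustration, and as_of_value and naive_value are hypothetical helper names, not part of the Asof API.

```python
from bisect import bisect_right

# Hypothetical data: six (timestamp, rolling_7d_spend) events, sorted by time.
events = [(10, 5.0), (20, 12.5), (30, 8.0), (40, 20.0), (50, 17.5), (60, 30.0)]
timestamps = [ts for ts, _ in events]

def as_of_value(label_ts):
    """Latest value with timestamp <= label_ts, or None if nothing is that old."""
    idx = bisect_right(timestamps, label_ts)  # count of events at or before label_ts
    return events[idx - 1][1] if idx else None

def naive_value(label_ts):
    """Ignores label_ts and returns the last row: this is the leaky behaviour."""
    return events[-1][1]

t = 35
print(as_of_value(t))  # 8.0, from the event at timestamp 30
print(naive_value(t))  # 30.0, from the event at timestamp 60, which is in the future of t
```

With t = 35 the naive join hands the model a value that did not exist yet at label time, which is exactly the leak the as-of join prevents.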

Why it is built this way

Three decisions that matter

01 · correctness

Linear, and provably right

The join is an O(n+m) two-pointer merge over sorted events. A brute-force O(n*m) reference runs beside it, and the test suite asserts they return identical results on random data every run.
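
As a rough illustration of the idea (not the project's actual code), a two-pointer merge and its brute-force reference might look like the sketch below. The function names, the tuple layouts, and the randomized check are assumptions made for this example.

```python
import random

def as_of_join_merge(labels, features):
    """labels: [(entity, label_ts)], features: [(entity, feature_ts, value)].
    Returns one value (or None) per label, in the original label order."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    # sorted() is stable, so rows with equal (entity, ts) keep their source order
    feats = sorted(features, key=lambda f: (f[0], f[1]))
    out = [None] * len(labels)
    j = 0
    current = None  # (entity, value) of the last feature row consumed
    for i in order:
        entity, label_ts = labels[i]
        # consume every feature row that sorts at or before (entity, label_ts)
        while j < len(feats) and (feats[j][0], feats[j][1]) <= (entity, label_ts):
            current = (feats[j][0], feats[j][2])
            j += 1
        if current is not None and current[0] == entity:
            out[i] = current[1]
    return out

def as_of_join_brute(labels, features):
    """O(n*m) reference: scan every feature row for every label."""
    out = []
    for entity, label_ts in labels:
        best = None
        for f_entity, f_ts, value in features:
            if f_entity == entity and f_ts <= label_ts:
                if best is None or f_ts >= best[0]:  # >= so the last source row wins ties
                    best = (f_ts, value)
        out.append(best[1] if best else None)
    return out

def check_random(trials=200, seed=0):
    """Assert both joins agree on small random inputs (narrow ts range forces ties)."""
    rng = random.Random(seed)
    for _ in range(trials):
        labels = [(rng.choice("abc"), rng.randint(0, 20)) for _ in range(rng.randint(0, 15))]
        feats = [(rng.choice("abc"), rng.randint(0, 20), rng.random())
                 for _ in range(rng.randint(0, 15))]
        assert as_of_join_merge(labels, feats) == as_of_join_brute(labels, feats)
    return True
```

Each pointer only moves forward, so the merge itself is linear once both inputs are sorted; the sorting step is the only part that costs more than O(n+m).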

02 · time

Microsecond ordering

Timestamps are normalized to tz-aware UTC and stored as epoch microseconds, so sqlite orders them exactly as integers and TTL math is plain integer arithmetic. Ties resolve to the last source row, deterministically.
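
A minimal sketch of that normalization, using only the standard library, could look like this. The helper names to_epoch_micros and within_ttl are hypothetical, and two behaviours are assumptions of this example rather than documented Asof behaviour: naive timestamps are treated as UTC, and a feature is considered fresh when its age is between 0 and the TTL inclusive.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_epoch_micros(ts):
    """Convert a datetime or ISO-8601 string to integer microseconds since the epoch."""
    if isinstance(ts, str):
        ts = datetime.fromisoformat(ts)
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    delta = ts.astimezone(timezone.utc) - EPOCH
    # integer arithmetic only, so no float rounding at microsecond precision
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds

def within_ttl(feature_us, label_us, ttl_us):
    """Assumed TTL rule: the feature is at most ttl_us older than the label."""
    return 0 <= label_us - feature_us <= ttl_us

a = to_epoch_micros("1970-01-01T00:00:00.000001+00:00")
b = to_epoch_micros("1970-01-01T02:00:00+02:00")
print(a, b)  # 1 0
```

Because the stored values are plain integers, comparing two timestamps or checking a TTL never depends on string formats or time zones.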

03 · serving

Materialize never regresses

Materialization sweeps offline rows into the online store with latest-wins upserts and refuses to move an entity backwards in time, so re-running a window is idempotent and safe.
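
The never-regress rule can be sketched with a plain dict standing in for the online store. This is an illustration only: the function below is not the real store.materialize, the row layout is made up, and the inclusive [start, end] window is an assumption.

```python
def materialize(online, offline_rows, start, end):
    """Upsert the latest row per entity from offline_rows into online.

    online: dict mapping entity -> (ts, value)
    offline_rows: iterable of (entity, ts, value)
    Window is treated as inclusive on both ends (an assumption of this sketch).
    """
    for entity, ts, value in offline_rows:
        if not (start <= ts <= end):
            continue
        current = online.get(entity)
        # latest wins; an older row never overwrites a newer one
        if current is None or ts >= current[0]:
            online[entity] = (ts, value)
    return online

rows = [("a", 10, 1.0), ("a", 30, 3.0), ("a", 20, 2.0), ("b", 15, 9.0)]
online = {}
materialize(online, rows, 0, 100)
first = dict(online)
materialize(online, rows, 0, 100)  # same window again: no change
materialize(online, rows, 0, 20)   # older window: entity "a" stays at ts 30
print(online == first)  # True
```

The timestamp guard is what makes replays safe: a backfill of an old window can add entities that were missing, but it cannot roll an existing entity back to a stale value.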

The whole API

Apply, train, materialize, serve

# register definitions, load history into the offline store
store.apply(customer, customer_stats)
store.ingest("customer_stats", source="orders.csv")

# TRAIN: as-of join, every value known only up to its label time
training = store.get_historical_features(labels, "churn_v1")

# SERVE: materialize latest values, then fetch them fast
store.materialize(start, end)
store.get_online_features(["cust_001"], "churn_v1")

# the WRONG join, shown so the demo can prove it leaks
leaky = naive_last_value_join(labels, feature_rows, "customer_id", ["rolling_7d_spend"])

Measured on the bundled dataset

Numbers from CI

336x faster than the naive nested loop (50k rows)
40 / 40 training rows the naive join leaks on
40+ unittest cases, zero dependencies
O(n+m) merge after the sort