Columna — a pure-Python columnar storage engine

Why columnar

Row storage reads everything. Columnar reads only what the query needs.

A CSV stores values row by row, so a query touching one column still drags the whole file off disk. Columna stores each column's values together, compressed and encoded, in independent row groups. The footer at the end of the file holds every column chunk's byte offset and min/max stats — so the reader knows exactly where to seek, and which chunks it can ignore entirely, before reading a single value.

# write a CSV straight to a columnar .cna file
import columna as cna
from columna import col

cna.write_csv("orders.csv", "orders.cna", row_group_size=5000)

# projection pushdown — only these two column chunks are read off disk
table = cna.read("orders.cna", columns=["order_id", "amount"])

# predicate pushdown — whole row groups skipped via footer min/max stats
result = cna.scan("orders.cna",
                  columns=["order_id", "amount"],
                  where=(col("amount") > 4500) & (col("region") == "EMEA"))

print(result.summary())
# 343 matched / 6000 scanned, 7/10 groups skipped, 21734 bytes read
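
The footer-driven seek described above can be sketched in a few lines. This is an illustration only, not Columna's on-disk format: the JSON footer, the trailing 4-byte footer length, and the helper names write_toy_file, read_footer and read_column are all assumptions made for the example.

```python
import io
import json
import struct

def write_toy_file(columns):
    """Write column chunks back to back, then a JSON footer, then the footer size."""
    buf = io.BytesIO()
    footer = {}
    for name, values in columns.items():
        payload = json.dumps(values).encode("utf-8")  # stand-in for a real encoding
        footer[name] = {
            "offset": buf.tell(),       # where this chunk starts
            "length": len(payload),     # how many bytes to read
            "min": min(values),
            "max": max(values),
        }
        buf.write(payload)
    footer_bytes = json.dumps(footer).encode("utf-8")
    buf.write(footer_bytes)
    buf.write(struct.pack("<I", len(footer_bytes)))  # last 4 bytes: footer size
    return buf.getvalue()

def read_footer(f):
    """Locate the footer from the end of the file without touching any chunk."""
    f.seek(-4, io.SEEK_END)
    (size,) = struct.unpack("<I", f.read(4))
    f.seek(-(4 + size), io.SEEK_END)
    return json.loads(f.read(size))

def read_column(f, footer, name):
    """Seek straight to one column chunk and decode only that chunk."""
    entry = footer[name]
    f.seek(entry["offset"])
    return json.loads(f.read(entry["length"]))

data = write_toy_file({"order_id": [1, 2, 3], "amount": [40, 4700, 12]})
f = io.BytesIO(data)
footer = read_footer(f)
amounts = read_column(f, footer, "amount")
```

Here the reader touches the footer and the amount chunk only; the order_id bytes are never read, which is the whole point of keeping offsets and stats at the end of the file.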

The heart of it

Five encodings. The writer trial-runs each and keeps the smallest.

No heuristics that can pick a worse encoding than the baseline — every column is actually encoded each way, and the fewest bytes wins. Then an optional DEFLATE pass mops up any byte-level redundancy left over.
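
A minimal sketch of the trial-run idea, assuming just two candidate encoders: a fixed-width PLAIN baseline and a simple RLE. The function names and byte layouts are hypothetical and only illustrate "encode every way, keep the fewest bytes".

```python
import struct

def encode_plain(values):
    # Baseline: a fixed 8 bytes per integer.
    return struct.pack(f"<{len(values)}q", *values)

def encode_rle(values):
    # (value, run_length) pairs, 8 + 4 bytes each.
    out = bytearray()
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        out += struct.pack("<qI", values[i], j - i)
        i = j
    return bytes(out)

# Hypothetical candidate table; a real writer would list every encoding here.
CANDIDATES = {"PLAIN": encode_plain, "RLE": encode_rle}

def pick_encoding(values):
    """Encode with every candidate and return (name, payload) of the smallest."""
    trials = {name: enc(values) for name, enc in CANDIDATES.items()}
    best = min(trials, key=lambda name: len(trials[name]))
    return best, trials[best]

name_runs, payload_runs = pick_encoding([7] * 1000)            # one long run
name_unique, payload_unique = pick_encoding(list(range(1000)))  # no repeats
```

Because the baseline is itself a candidate, the chosen payload can never be larger than the baseline, whatever the data looks like.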

DELTA

Monotonic integers → tiny gaps

Auto-increment ids and timestamps have huge values but small gaps. Store the first value, then zig-zag varints of each delta. A run like 100000, 100001, 100003 collapses to one full int plus single-byte deltas.

order_id, ts → DELTA + varint
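
As a rough sketch of delta plus zig-zag varint encoding (the helper names are hypothetical, and for simplicity the first value is also written as a varint delta from zero rather than as a fixed-width int):

```python
def zigzag(n):
    # Map signed to unsigned so small negatives stay small: 0,-1,1,-2 -> 0,1,2,3
    return n * 2 if n >= 0 else -n * 2 - 1

def unzigzag(z):
    return (z >> 1) ^ -(z & 1)

def write_varint(n, out):
    # 7 payload bits per byte; the high bit means "more bytes follow".
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)

def read_varint(data, pos):
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7

def delta_encode(values):
    out = bytearray()
    prev = 0
    for v in values:
        write_varint(zigzag(v - prev), out)  # store the gap, not the value
        prev = v
    return bytes(out)

def delta_decode(data):
    values, pos, prev = [], 0, 0
    while pos < len(data):
        z, pos = read_varint(data, pos)
        prev += unzigzag(z)
        values.append(prev)
    return values

encoded = delta_encode([100000, 100001, 100003])
decoded = delta_decode(encoded)
```

In this sketch the three ids take 5 bytes in total: 3 for the first value and 1 for each delta.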

DICTIONARY

Low-cardinality → index stream

Four distinct regions across 5,000 rows? Store the four values once, replace the column with small integer indices, then bit-pack (or RLE) the indices. The classic categorical win.

region, note → DICT + bitpack
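
A small sketch of dictionary encoding, with hypothetical helper names; bit-packing the resulting indices is left out and only the required bit width is computed:

```python
def dict_encode(values):
    """Return (dictionary, indices): each distinct value is stored once."""
    dictionary = []
    index_of = {}
    indices = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices

def dict_decode(dictionary, indices):
    return [dictionary[i] for i in indices]

regions = ["EMEA", "APAC", "EMEA", "AMER", "LATAM", "EMEA"]
dictionary, indices = dict_encode(regions)

# Bits needed per index once the index stream is bit-packed.
bits_per_index = max(1, (len(dictionary) - 1).bit_length())
```

With four distinct regions, each row needs only 2 bits of index instead of a full string.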

RLE

Long runs → (value, length)

Clustered status flags compress to a handful of (value, run_length) pairs instead of thousands of repeats.

status → RLE
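
Run-length encoding is short enough to sketch with the standard library alone. The function names are hypothetical, and the pairs are kept as Python tuples rather than serialized to bytes:

```python
from itertools import groupby

def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]

def rle_decode(pairs):
    out = []
    for value, length in pairs:
        out.extend([value] * length)
    return out

status = ["shipped"] * 3 + ["pending"] * 2 + ["shipped"]
pairs = rle_encode(status)
```

Note that RLE only pays off when equal values sit next to each other; the same flags in shuffled order would produce one pair per row.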

BITPACK

Tight ranges → minimum bits

Subtract the column min, then pack residuals into the fewest bits that fit. Quantities of 1–12 need 4 bits each, not a whole byte.

quantity → BITPACK (4 bits)
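
The subtract-then-pack step might look like the sketch below. It assumes a non-empty list of integers and uses one big Python integer as the bit accumulator, which is simple rather than fast; the names bitpack and bitunpack are hypothetical.

```python
def bitpack(values):
    """Return (minimum, bit_width, packed_bytes) for a non-empty list of ints."""
    lo = min(values)
    width = max(1, (max(values) - lo).bit_length())  # bits per residual
    acc = 0
    for i, v in enumerate(values):
        acc |= (v - lo) << (i * width)               # residual i sits at bit i*width
    nbytes = (len(values) * width + 7) // 8
    return lo, width, acc.to_bytes(nbytes, "little")

def bitunpack(lo, width, count, data):
    acc = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [lo + ((acc >> (i * width)) & mask) for i in range(count)]

quantities = [1, 12, 7, 3, 12, 1]
lo, width, packed = bitpack(quantities)
restored = bitunpack(lo, width, len(quantities), packed)
```

For quantities of 1 to 12 the residuals are 0 to 11, so the width comes out at 4 bits and six values fit in 3 bytes.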

NULLS

Definition bitmaps — one bit per row, zero wasted value bytes

Nullable columns store only their non-null values; a separate validity bitmap (1 bit/row) records which positions are null, much like Parquet's definition levels on a flat column. An all-null column costs ~1 bit/row and zero value bytes. A column with no nulls skips the bitmap entirely.
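
A minimal sketch of splitting a nullable column into a validity bitmap and a dense value list. It assumes a set bit means "value present" and least-significant-bit-first ordering within each byte; both conventions, and the function names, are choices made for this example.

```python
def split_nulls(values):
    """Return (bitmap, present): 1 bit per row plus the non-null values only."""
    bitmap = bytearray((len(values) + 7) // 8)
    present = []
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # mark row i as valid
            present.append(v)
    return bytes(bitmap), present

def merge_nulls(bitmap, present, count):
    """Rebuild the original column by re-inserting None at cleared bits."""
    it = iter(present)
    return [next(it) if (bitmap[i // 8] >> (i % 8)) & 1 else None
            for i in range(count)]

column = [10, None, None, 42, None]
bitmap, present = split_nulls(column)
restored = merge_nulls(bitmap, present, len(column))

all_null_bitmap, all_null_values = split_nulls([None] * 16)
```

The all-null case shows the cost described above: 16 rows take 2 bitmap bytes and no value bytes at all.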

The payoff

Predicate pushdown that's honest about what it skips.

Each row group records min/max per column in the footer. For amount > 4500, any group whose max amount is ≤ 4500 can't contain a match — so it's skipped without decoding a single byte. If a column's stats are missing or NaN-polluted, Columna refuses to skip: correctness always beats speed. The whole thing is verified in the test suite against a brute-force filter on random data — skipping never drops a matching row.

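# can_skip proves a group is empty using only footer stats
# (missing or NaN-polluted stats are assumed to arrive here as None)
def can_skip(self, stats):
    entry = stats.get(self.column)            # column absent from the footer stats
    if entry is None or entry["min"] is None or entry["max"] is None:
        return False                          # unreliable stats → never skip
    mn, mx = entry["min"], entry["max"]
    if self.op == ">":  return mx <= self.value
    if self.op == "<":  return mn >= self.value
    if self.op == "==": return self.value < mn or self.value > mx
    return False                              # unknown operator → never skip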
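
The brute-force check mentioned above can be approximated as follows. The Predicate dataclass and its matches method are hypothetical scaffolding around can_skip, not Columna's actual classes, and the random groups stand in for real row groups.

```python
import random
from dataclasses import dataclass

@dataclass
class Predicate:
    column: str
    op: str
    value: float

    def can_skip(self, stats):
        entry = stats.get(self.column)
        if entry is None or entry["min"] is None or entry["max"] is None:
            return False  # unreliable stats: never skip
        mn, mx = entry["min"], entry["max"]
        if self.op == ">":  return mx <= self.value
        if self.op == "<":  return mn >= self.value
        if self.op == "==": return self.value < mn or self.value > mx
        return False      # unknown operator: never skip

    def matches(self, row):
        # Row-level evaluation, used only as the brute-force reference.
        v = row[self.column]
        return {">": v > self.value, "<": v < self.value, "==": v == self.value}[self.op]

random.seed(0)
violations = 0
for _ in range(200):
    group = [{"amount": random.randint(0, 9000)} for _ in range(50)]
    amounts = [row["amount"] for row in group]
    stats = {"amount": {"min": min(amounts), "max": max(amounts)}}
    pred = Predicate("amount", random.choice([">", "<", "=="]), random.randint(0, 9000))
    # A skipped group must never contain a row the predicate would match.
    if pred.can_skip(stats) and any(pred.matches(row) for row in group):
        violations += 1
```

The property being tested is one-directional: skipping must never drop a match, while failing to skip a group that happens to be empty is merely a missed optimization.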

Real numbers, reproducible

Columna vs CSV on 50,000 rows.

Generated by the committed benchmark script. Bytes-read is what each format actually pulls off disk to answer the query — a CSV has to read all of it.

Operation	CSV	Columna	Improvement
File size on disk	2,592 KB	225 KB	91% smaller
Read 1 of 7 columns (projection)	2,592 KB	154 KB	94% less read
Filter amount > 4500 (predicate)	2,592 KB	47 KB	98% less read
Row groups touched by that filter	all	3 / 10	7 skipped

CSV must scan the entire file for any query; Columna seeks to the column chunks it needs and skips row groups it can prove are empty.
