Columna is a tiny, from-scratch reimplementation of the ideas behind Parquet — encodings, compression, row groups, and predicate/projection pushdown — in plain Python. No pandas. No pyarrow. No numpy. Just struct and zlib.
The benchmark numbers here come from `bench/benchmark.py`, run on a 50,000-row synthetic orders dataset. Reproduce them with `python bench/benchmark.py`.
A CSV stores values row by row, so a query touching one column still drags the whole file off disk. Columna stores each column's values together, compressed and encoded, in independent row groups. The footer at the end of the file holds every column chunk's byte offset and min/max stats — so the reader knows exactly where to seek, and which chunks it can ignore entirely, before reading a single value.
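The footer-last layout described above can be sketched in a few lines. This is a minimal illustration, not Columna's actual file format: the function names (`write_file`, `read_footer`, `read_column`), the JSON footer, and the trailing 4-byte length field are all assumptions made for the example, and `json` is used only to keep it short.

```python
import io
import json
import struct
import zlib

def write_file(buf, columns):
    """Write each column as one compressed chunk, then a footer, then the footer length."""
    footer = {}
    for name, values in columns.items():
        payload = zlib.compress(json.dumps(values).encode("utf-8"))
        footer[name] = {
            "offset": buf.tell(),      # where this chunk starts
            "length": len(payload),    # how many bytes to read
            "min": min(values),
            "max": max(values),
        }
        buf.write(payload)
    footer_bytes = json.dumps(footer).encode("utf-8")
    buf.write(footer_bytes)
    buf.write(struct.pack("<I", len(footer_bytes)))  # last 4 bytes: footer size

def read_footer(buf):
    """Read only the tail of the file to learn where every chunk lives."""
    buf.seek(-4, io.SEEK_END)
    (footer_len,) = struct.unpack("<I", buf.read(4))
    buf.seek(-4 - footer_len, io.SEEK_END)
    return json.loads(buf.read(footer_len))

def read_column(buf, footer, name):
    """Seek straight to one chunk; the other columns are never touched."""
    meta = footer[name]
    buf.seek(meta["offset"])
    return json.loads(zlib.decompress(buf.read(meta["length"])))
```

Because the footer is written last, the writer can stream chunks without knowing their sizes in advance, and the reader needs only two small reads (length, then footer) before it can plan every seek.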
```python
import columna as cna
from columna import col

# write a CSV straight to a columnar .cna file
cna.write_csv("orders.csv", "orders.cna", row_group_size=5000)

# projection pushdown: only these two column chunks are read off disk
table = cna.read("orders.cna", columns=["order_id", "amount"])

# predicate pushdown: whole row groups skipped via footer min/max stats
result = cna.scan(
    "orders.cna",
    columns=["order_id", "amount"],
    where=(col("amount") > 4500) & (col("region") == "EMEA"),
)
print(result.summary())
# 343 matched / 6000 scanned, 7/10 groups skipped, 21734 bytes read
```
Columna uses no heuristics that could pick a worse encoding than the baseline: every column is actually encoded each way, and the smallest result wins. An optional DEFLATE pass then mops up any byte-level redundancy left over.
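The "encode every way, keep the smallest" rule can be sketched as follows. The two encoders, the `CANDIDATES` table, and `choose_encoding` are hypothetical stand-ins rather than Columna's real encoders; the point is only that the plain baseline is always a candidate, so the winner can never be larger than it.

```python
import struct
import zlib

def encode_plain(values):
    # Baseline: every value as a fixed 8-byte little-endian signed int.
    return struct.pack(f"<{len(values)}q", *values)

def encode_delta8(values):
    # Toy delta encoding: first value in 8 bytes, then each delta in 1 signed byte.
    # Returns None when it does not apply, so the candidate is simply dropped.
    deltas = [b - a for a, b in zip(values, values[1:])]
    if not values or any(not -128 <= d <= 127 for d in deltas):
        return None
    return struct.pack(f"<q{len(deltas)}b", values[0], *deltas)

CANDIDATES = {"plain": encode_plain, "delta8": encode_delta8}

def choose_encoding(values, compress=True):
    """Encode with every candidate and return (encoding, codec, bytes) for the smallest."""
    encoded = {name: fn(values) for name, fn in CANDIDATES.items()}
    encoded = {name: data for name, data in encoded.items() if data is not None}
    name, data = min(encoded.items(), key=lambda item: len(item[1]))
    if compress:
        packed = zlib.compress(data)
        if len(packed) < len(data):   # keep DEFLATE only when it actually helps
            return name, "deflate", packed
    return name, "none", data
```

On very short columns the DEFLATE header outweighs the savings, which is why the sketch falls back to the uncompressed bytes in that case.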
Delta encoding: auto-increment ids and timestamps have huge values but small gaps. Store the first value, then a zig-zag varint for each delta. A run like 100000, 100001, 100003 collapses to one full int plus single-byte deltas.
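A minimal sketch of delta plus zig-zag varint encoding, using only built-ins. The function names are illustrative, and one detail is an assumption: the first value is written as a delta from zero, so it is also a varint rather than a fixed-width int.

```python
def zigzag(n):
    # Map signed to unsigned so small magnitudes stay small: 0, -1, 1, -2 become 0, 1, 2, 3.
    return (n << 1) if n >= 0 else ((-n << 1) - 1)

def unzigzag(z):
    return (z >> 1) if z % 2 == 0 else -((z + 1) >> 1)

def write_varint(out, z):
    # 7 payload bits per byte; the high bit means "more bytes follow".
    while z >= 0x80:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)

def read_varint(data, pos):
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7

def delta_encode(values):
    out = bytearray()
    prev = 0
    for v in values:
        write_varint(out, zigzag(v - prev))
        prev = v
    return bytes(out)

def delta_decode(data):
    values, pos, prev = [], 0, 0
    while pos < len(data):
        z, pos = read_varint(data, pos)
        prev += unzigzag(z)
        values.append(prev)
    return values
```

With this sketch, `delta_encode([100000, 100001, 100003])` is 5 bytes: three for the first value and one for each delta. Zig-zag matters because a plain varint of a negative delta would otherwise need the maximum number of bytes.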
Dictionary encoding: four distinct regions across 5,000 rows? Store the four values once, replace the column with small integer indices, then bit-pack (or RLE) the indices. The classic categorical win.
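Dictionary encoding can be sketched as below. The names `dict_encode` and `dict_decode` are hypothetical; the sketch stops at producing the indices and the bit width they need, leaving the packing of those indices to a separate step.

```python
def dict_encode(values):
    """Return (dictionary, indices, bits_per_index) for a low-cardinality column."""
    dictionary = []   # distinct values in first-seen order
    index_of = {}     # value -> position in dictionary
    indices = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    # Bits needed to address the largest index; at least 1 so every row occupies space.
    bits = max(1, (len(dictionary) - 1).bit_length())
    return dictionary, indices, bits

def dict_decode(dictionary, indices):
    return [dictionary[i] for i in indices]
```

For four regions the indices are 0 to 3, so each row costs 2 bits plus a one-time cost for the four strings.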
Run-length encoding (RLE): clustered status flags compress to a handful of (value, run_length) pairs instead of thousands of repeats.
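As an illustration, run-length encoding fits in a few lines with `itertools.groupby`. The function names are hypothetical, and the pairs are kept as Python tuples rather than serialized to bytes.

```python
from itertools import groupby

def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]

def rle_decode(pairs):
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out
```

RLE only pays off when equal values are adjacent; the same flags in shuffled order would produce nearly one pair per row.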
Bit-packing: subtract the column min, then pack the residuals into the fewest bits that fit. Quantities of 1–12 become residuals of 0–11, which need 4 bits each, not a whole byte.
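A compact sketch of min-subtraction plus bit-packing. It uses a Python big integer as the bit accumulator for brevity, which a production encoder would likely avoid; `bitpack` and `bitunpack` are illustrative names, and the input is assumed to be a non-empty list of ints.

```python
def bitpack(values):
    """Return (min, bit_width, packed_bytes) for a non-empty list of ints."""
    lo = min(values)
    width = max(1, (max(values) - lo).bit_length())  # bits per residual
    acc = 0
    for i, v in enumerate(values):
        acc |= (v - lo) << (i * width)               # residual i sits at bit offset i * width
    nbytes = (len(values) * width + 7) // 8          # round up to whole bytes
    return lo, width, acc.to_bytes(nbytes, "little")

def bitunpack(lo, width, count, data):
    acc = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [lo + ((acc >> (i * width)) & mask) for i in range(count)]
```

Six quantities in the range 1 to 12 pack into 3 bytes here (6 values × 4 bits), compared with 6 bytes at one byte per value.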
Nullable columns store only their non-null values; a separate validity bitmap (1 bit/row) records which positions are null, much like Parquet's definition levels for a flat optional column. An all-null column costs ~1 bit/row and zero value bytes. A column with no nulls skips the bitmap entirely.
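The validity-bitmap idea can be sketched like this. The names and the bit order (least significant bit first, 1 meaning "present") are assumptions made for the example.

```python
def split_nulls(values):
    """Return (bitmap or None, non_null_values). A set bit means the row has a value."""
    non_null = [v for v in values if v is not None]
    if len(non_null) == len(values):
        return None, non_null            # no nulls: skip the bitmap entirely
    bitmap = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap), non_null

def merge_nulls(bitmap, non_null, count):
    """Rebuild the original column from the bitmap and the dense value list."""
    if bitmap is None:
        return list(non_null)
    it = iter(non_null)
    return [next(it) if (bitmap[i // 8] >> (i % 8)) & 1 else None for i in range(count)]
```

The dense value list can then go through the same encoders as any non-nullable column, since it no longer contains gaps.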
Each row group records min/max per column in the footer. For amount > 4500, any group whose max amount is ≤ 4500 can't contain a match — so it's skipped without decoding a single byte. If a column's stats are missing or NaN-polluted, Columna refuses to skip: correctness always beats speed. The whole thing is verified in the test suite against a brute-force filter on random data — skipping never drops a matching row.
```python
# can_skip proves a group is empty using only footer stats
def can_skip(self, stats):
    mn, mx = stats[self.column]["min"], stats[self.column]["max"]
    if mn is None or mx is None:
        return False  # unreliable stats: never skip
    if self.op == ">":
        return mx <= self.value
    if self.op == "<":
        return mn >= self.value
    if self.op == "==":
        return self.value < mn or self.value > mx
    return False  # unknown operator: never skip
```
Generated by the committed benchmark script. Bytes-read is what each format actually pulls off disk to answer the query — a CSV has to read all of it.
| operation | CSV | Columna | win |
|---|---|---|---|
| File size on disk | 2,592 KB | 225 KB | 91% smaller |
| Read 1 of 7 columns (projection) | 2,592 KB | 154 KB | 94% less read |
| Filter amount > 4500 (predicate) | 2,592 KB | 47 KB | 98% less read |
| Row groups touched by that filter | all | 3 / 10 | 7 skipped |
CSV must scan the entire file for any query; Columna seeks to the column chunks it needs and skips row groups it can prove are empty. → Open the generated file inspector to see this on a real file.