Pure-Python · zero dependencies · ~3k LOC

A columnar file format you can actually read.

Columna is a tiny, from-scratch reimplementation of the ideas behind Parquet — encodings, compression, row groups, and predicate/projection pushdown — in plain Python. No pandas. No pyarrow. No numpy. Just struct and zlib.

orders.cna · byte layout (footer-last, like Parquet)

0x0000   file header: MAGIC "CNA1" · 4 bytes
0x0004   row group 0 · 500 rows
             order_id   DELTA
             ts         DELTA
             region     DICT
             status     RLE
             amount     PLAIN
             quantity   BITPACK
             note       DICT+nulls
         row groups 1 … 9 · column chunks → pages [encoding · codec · crc32]
             each page is self-describing and CRC-checked
             codec = NONE / DEFLATE, whichever is smaller
tail−N   footer
tail     trailer: FOOTER_LEN (uint32) + MAGIC · seek from the end

91% smaller than CSV
98% fewer bytes / query
7/10 row groups skipped
0 third-party deps

Numbers from bench/benchmark.py on a 50,000-row synthetic orders dataset. Reproduce with python bench/benchmark.py.

Why columnar

Row storage reads everything. Columnar reads only what the query needs.

A CSV stores values row by row, so a query touching one column still drags the whole file off disk. Columna stores each column's values together, compressed and encoded, in independent row groups. The footer at the end of the file holds every column chunk's byte offset and min/max stats — so the reader knows exactly where to seek, and which chunks it can ignore entirely, before reading a single value.
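As a rough illustration of that footer-first read, here is a minimal sketch of locating the footer from the trailer shown in the byte layout (FOOTER_LEN as a uint32, then MAGIC). The little-endian byte order, the `read_footer_bytes` name, and the stand-in footer payload are assumptions for illustration, not Columna's actual reader.

```python
import io
import struct

MAGIC = b"CNA1"
# Assumed trailer layout: little-endian uint32 FOOTER_LEN followed by MAGIC.
TRAILER = struct.Struct("<I4s")

def read_footer_bytes(f):
    """Return the raw footer bytes of a seekable binary file object."""
    f.seek(0)
    if f.read(len(MAGIC)) != MAGIC:
        raise ValueError("missing header magic")
    # The trailer is the last 8 bytes of the file.
    f.seek(-TRAILER.size, io.SEEK_END)
    footer_len, magic = TRAILER.unpack(f.read(TRAILER.size))
    if magic != MAGIC:
        raise ValueError("missing trailer magic")
    # The footer sits immediately before the trailer.
    f.seek(-(TRAILER.size + footer_len), io.SEEK_END)
    return f.read(footer_len)

# In-memory stand-in for a file: header, 16 bytes of "data", footer, trailer.
footer = b'{"row_groups": []}'   # hypothetical footer payload
blob = MAGIC + b"\x00" * 16 + footer + TRAILER.pack(len(footer), MAGIC)
assert read_footer_bytes(io.BytesIO(blob)) == footer
```

Only two small reads at the end of the file are needed before the reader can decide which column chunks to fetch.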

# write a CSV straight to a columnar .cna file
import columna as cna
from columna import col

cna.write_csv("orders.csv", "orders.cna", row_group_size=5000)

# projection pushdown — only these two column chunks are read off disk
table = cna.read("orders.cna", columns=["order_id", "amount"])

# predicate pushdown — whole row groups skipped via footer min/max stats
result = cna.scan("orders.cna",
                  columns=["order_id", "amount"],
                  where=(col("amount") > 4500) & (col("region") == "EMEA"))

print(result.summary())
# 343 matched / 6000 scanned, 7/10 groups skipped, 21734 bytes read

The heart of it

Five encodings. The writer trial-runs each and keeps the smallest.

No heuristic here can pick an encoding worse than the baseline: every column is actually encoded each way, and the fewest bytes wins. An optional DEFLATE pass then mops up any byte-level redundancy left over.
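The trial-run idea can be sketched in a few lines. This is a simplified illustration, not Columna's writer: the two stand-in encoders, the `ENCODERS` table, and the `choose_encoding` / `choose_codec` names are all hypothetical.

```python
import struct
import zlib

def encode_plain(values):
    # Baseline: every value as a fixed 8-byte signed integer.
    return struct.pack(f"<{len(values)}q", *values)

def encode_rle(values):
    # (value, run_length) pairs: 8-byte value plus 4-byte count.
    out = bytearray()
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        out += struct.pack("<qI", values[i], j - i)
        i = j
    return bytes(out)

# Stand-in candidate list; PLAIN comes first so it wins ties.
ENCODERS = {"PLAIN": encode_plain, "RLE": encode_rle}

def choose_encoding(values):
    """Encode with every candidate and keep the smallest output."""
    candidates = {name: enc(values) for name, enc in ENCODERS.items()}
    best = min(candidates, key=lambda name: len(candidates[name]))
    return best, candidates[best]

def choose_codec(payload):
    """Keep the compressed bytes only if they are actually smaller."""
    # zlib.compress emits a zlib-wrapped DEFLATE stream.
    packed = zlib.compress(payload)
    return ("DEFLATE", packed) if len(packed) < len(payload) else ("NONE", payload)

name, payload = choose_encoding([7] * 1000)
codec, stored = choose_codec(payload)
```

Because the baseline is always among the candidates, the chosen encoding can never be larger than it.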

DELTA

Monotonic integers → tiny gaps

Auto-increment ids and timestamps have huge values but small gaps. Store the first value, then zig-zag varints of each delta. A run like 100000, 100001, 100003 collapses to one full int plus single-byte deltas.
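A minimal sketch of that scheme, assuming the first value is stored as a fixed 8-byte little-endian integer and each delta as a zig-zag varint. The function names and the exact byte layout are illustrative assumptions, not Columna's on-disk format.

```python
import struct

def zigzag(n):
    # Map signed to unsigned so small negatives stay small: 0,-1,1,-2 -> 0,1,2,3
    return (n << 1) if n >= 0 else ((-n << 1) - 1)

def unzigzag(z):
    return (z >> 1) if z & 1 == 0 else -((z + 1) >> 1)

def write_varint(out, u):
    # 7 payload bits per byte; the high bit marks "more bytes follow".
    while u >= 0x80:
        out.append((u & 0x7F) | 0x80)
        u >>= 7
    out.append(u)

def read_varint(buf, pos):
    shift = result = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if b < 0x80:
            return result, pos
        shift += 7

def delta_encode(values):
    if not values:
        return b""
    out = bytearray(struct.pack("<q", values[0]))   # one full int
    for prev, cur in zip(values, values[1:]):
        write_varint(out, zigzag(cur - prev))       # then small deltas
    return bytes(out)

def delta_decode(data):
    if not data:
        return []
    (first,) = struct.unpack_from("<q", data, 0)
    values, pos = [first], 8
    while pos < len(data):
        z, pos = read_varint(data, pos)
        values.append(values[-1] + unzigzag(z))
    return values

encoded = delta_encode([100000, 100001, 100003])
assert len(encoded) == 10                 # 8-byte first value + two 1-byte deltas
assert delta_decode(encoded) == [100000, 100001, 100003]
```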

order_id, ts → DELTA + varint

DICTIONARY

Low-cardinality → index stream

Four distinct regions across 5,000 rows? Store the four values once, replace the column with small integer indices, then bit-pack (or RLE) the indices. The classic categorical win.
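A small sketch of the dictionary step, assuming indices are assigned in first-seen order. The `dict_encode` name and every region value other than "EMEA" are made up for illustration.

```python
def dict_encode(values):
    """Return (dictionary, indices, bits_per_index)."""
    dictionary = []      # distinct values, in first-seen order
    index_of = {}        # value -> position in dictionary
    indices = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    # Bits needed to address the largest index (at least 1).
    bits = max(1, (len(dictionary) - 1).bit_length())
    return dictionary, indices, bits

def dict_decode(dictionary, indices):
    return [dictionary[i] for i in indices]

regions = ["EMEA", "APAC", "EMEA", "AMER", "LATAM", "EMEA"]
dictionary, indices, bits = dict_encode(regions)
assert dictionary == ["EMEA", "APAC", "AMER", "LATAM"]
assert indices == [0, 1, 0, 2, 3, 0]
assert bits == 2                      # four values fit in 2 bits each
assert dict_decode(dictionary, indices) == regions
```

The index stream is then a column of small integers, which is exactly what bit-packing or RLE handle well.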

region, note → DICT + bitpack

RLE

Long runs → (value, length)

Clustered status flags compress to a handful of (value, run_length) pairs instead of thousands of repeats.
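Run-length encoding is short enough to sketch with the standard library alone. The status values and function names below are hypothetical, and a real page would serialize the pairs to bytes rather than keep a Python list.

```python
from itertools import groupby

def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(v, sum(1 for _ in run)) for v, run in groupby(values)]

def rle_decode(pairs):
    return [v for v, n in pairs for _ in range(n)]

status = ["shipped"] * 3000 + ["pending"] * 1500 + ["shipped"] * 500
pairs = rle_encode(status)
assert pairs == [("shipped", 3000), ("pending", 1500), ("shipped", 500)]
assert rle_decode(pairs) == status
```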

status → RLE

BITPACK

Tight ranges → minimum bits

Subtract the column min, then pack residuals into the fewest bits that fit. Quantities of 1–12 need 4 bits each, not a whole byte.
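The arithmetic can be sketched as follows, assuming a non-empty integer column and little-endian bit order. The function names are illustrative, and a single big-integer accumulator is used for brevity rather than speed.

```python
def bitpack(values):
    """Return (min, bit_width, packed_bytes) for a non-empty int list."""
    lo = min(values)
    width = max(1, (max(values) - lo).bit_length())   # bits per residual
    acc = 0
    for i, v in enumerate(values):
        acc |= (v - lo) << (i * width)                # residual i at bit i*width
    nbytes = (len(values) * width + 7) // 8
    return lo, width, acc.to_bytes(nbytes, "little")

def bitunpack(lo, width, count, data):
    acc = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [lo + ((acc >> (i * width)) & mask) for i in range(count)]

quantities = [1, 12, 7, 3]
lo, width, packed = bitpack(quantities)
assert (lo, width) == (1, 4)          # residuals 0..11 fit in 4 bits
assert len(packed) == 2               # four values in two bytes
assert bitunpack(lo, width, len(quantities), packed) == quantities
```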

quantity → BITPACK (4 bits)

NULLS

Definition bitmaps — one bit per row, zero wasted value bytes

Nullable columns store only their non-null values; a separate validity bitmap (1 bit per row) records which positions are null, much like Parquet's definition levels for a flat nullable column. An all-null column costs about 1 bit per row and zero value bytes. A column with no nulls skips the bitmap entirely.
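A minimal sketch of splitting a nullable column into a bitmap plus dense values. The bit convention (a set bit means "value present", least significant bit first) and the function names are assumptions for illustration.

```python
def split_nulls(values):
    """Return (bitmap_or_None, non_null_values)."""
    present = [v for v in values if v is not None]
    if len(present) == len(values):
        return None, present                  # no nulls: skip the bitmap
    bitmap = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i >> 3] |= 1 << (i & 7)    # set bit = value present
    return bytes(bitmap), present

def merge_nulls(bitmap, present, count):
    """Rebuild the original column from the bitmap and dense values."""
    if bitmap is None:
        return list(present)
    it = iter(present)
    return [next(it) if (bitmap[i >> 3] >> (i & 7)) & 1 else None
            for i in range(count)]

notes = ["gift", None, None, "fragile", None]
bitmap, present = split_nulls(notes)
assert bitmap == b"\x09"                      # bits 0 and 3 set
assert present == ["gift", "fragile"]
assert merge_nulls(bitmap, present, len(notes)) == notes
```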

The payoff

Predicate pushdown that's honest about what it skips.

Each row group records min/max per column in the footer. For amount > 4500, any group whose max amount is ≤ 4500 can't contain a match — so it's skipped without decoding a single byte. If a column's stats are missing or NaN-polluted, Columna refuses to skip: correctness always beats speed. The whole thing is verified in the test suite against a brute-force filter on random data — skipping never drops a matching row.
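The brute-force check described above might look roughly like this for the `>` case. It is a simplified stand-in, not the project's test suite: groups are plain Python lists, stats are computed on the fly instead of being read from a footer, and all names are hypothetical.

```python
import random

def group_stats(values):
    return {"min": min(values), "max": max(values)}

def can_skip_gt(stats, threshold):
    """True only when no value in the group can exceed threshold."""
    return stats["max"] <= threshold

def scan_gt(groups, threshold):
    """Filter value > threshold, skipping groups that stats prove empty."""
    matched, skipped = [], 0
    for group in groups:
        if can_skip_gt(group_stats(group), threshold):
            skipped += 1
            continue
        matched.extend(v for v in group if v > threshold)
    return matched, skipped

rng = random.Random(0)
for _ in range(200):
    groups = []
    for _ in range(10):
        base = rng.randint(0, 4000)           # clustered values make skips likely
        groups.append([rng.randint(base, base + 1000) for _ in range(20)])
    threshold = rng.randint(0, 5000)
    matched, _ = scan_gt(groups, threshold)
    brute_force = [v for g in groups for v in g if v > threshold]
    assert matched == brute_force             # skipping never drops a match
```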

# can_skip proves a group is empty using only footer stats
def can_skip(self, stats):
    col_stats = stats.get(self.column)
    if col_stats is None:
        return False          # no stats for this column → never skip
    mn, mx = col_stats["min"], col_stats["max"]
    if mn is None or mx is None or mn != mn or mx != mx:
        return False          # missing or NaN stats → never skip
    if self.op == ">":  return mx <= self.value
    if self.op == "<":  return mn >= self.value
    if self.op == "==": return self.value < mn or self.value > mx
    return False              # unrecognized operator → never skip

Real numbers, reproducible

Columna vs CSV on 50,000 rows.

Generated by the committed benchmark script. Bytes-read is what each format actually pulls off disk to answer the query — a CSV has to read all of it.

operation                          | CSV      | Columna | win
-----------------------------------|----------|---------|--------------
File size on disk                  | 2,592 KB | 225 KB  | 91% smaller
Read 1 of 7 columns (projection)   | 2,592 KB | 154 KB  | 94% less read
Filter amount > 4500 (predicate)   | 2,592 KB | 47 KB   | 98% less read
Row groups touched by that filter  | all      | 3 / 10  | 7 skipped

CSV must scan the entire file for any query; Columna seeks to the column chunks it needs and skips row groups it can prove are empty. → Open the generated file inspector to see this on a real file.
