columna · generated inspector

Inside a .cna file

A real report generated from orders.cna by columna inspect. Every number below was read straight from the file's footer and pages — nothing is mocked.

5,000

rows

7

columns

10

row groups

44 KB

file size

259 KB

source CSV

44 KB

columna .cna

83%

smaller on disk
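The 83% figure follows from the two sizes shown above. A quick check, assuming the displayed values are rounded to whole kilobytes (so the exact byte counts may differ slightly):

```python
csv_kb = 259  # source CSV size as displayed in the report
cna_kb = 44   # columna .cna size as displayed in the report

saving = 1 - cna_kb / csv_kb   # fraction of bytes saved
ratio = csv_kb / cna_kb        # compression ratio

print(f"{saving:.0%} smaller, {ratio:.1f}x compression")
```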

Byte layout

Footer-last, like Parquet: the writer streams row groups, then writes one compact metadata footer at the end. The reader seeks to the trailer, reads the footer, and from it learns every column chunk's byte offset and stats without scanning the data.

file header
MAGIC "CNA1" · 4 bytes

row group 0 · 500 rows · 7 column chunks → pages [encoding · codec · crc32]

row group 1 · 500 rows · 7 column chunks → pages [encoding · codec · crc32]

row group 2 · 500 rows · 7 column chunks → pages [encoding · codec · crc32]

row group 3 · 500 rows · 7 column chunks → pages [encoding · codec · crc32]

… 6 more row groups …

footer
zlib(JSON) — schema · per-chunk offsets · encodings · min/max stats

trailer
FOOTER_LEN (uint32) + MAGIC "CNA1" — seek from end
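The read path above can be sketched in a few lines. This is an illustration, not columna's actual reader: it assumes the trailer is exactly FOOTER_LEN followed by the magic (8 bytes in total) and that FOOTER_LEN is little-endian, which the report does not state. `write_minimal` and `read_footer` are hypothetical helper names.

```python
import io
import json
import struct
import zlib

MAGIC = b"CNA1"

def write_minimal(footer: dict) -> bytes:
    """Build a tiny file with the layout above (header, footer, trailer; no row groups)."""
    body = zlib.compress(json.dumps(footer).encode("utf-8"))
    # Assumption: FOOTER_LEN is a little-endian uint32.
    return MAGIC + body + struct.pack("<I", len(body)) + MAGIC

def read_footer(f) -> dict:
    """Seek to the 8-byte trailer, then read and decode the zlib(JSON) footer."""
    f.seek(-8, io.SEEK_END)
    footer_len, magic = struct.unpack("<I4s", f.read(8))
    if magic != MAGIC:
        raise ValueError("bad trailer magic")
    # The footer sits immediately before the trailer.
    f.seek(-(8 + footer_len), io.SEEK_END)
    return json.loads(zlib.decompress(f.read(footer_len)))

blob = write_minimal({"schema": ["order_id"], "row_groups": []})
footer = read_footer(io.BytesIO(blob))
print(footer)
```

Only two seeks and two small reads are needed, no matter how many row groups precede the footer.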

Per-column encoding

The writer trial-encodes each column with every applicable encoding and keeps the smallest. Here's what it actually chose for row group 0:

column	type	encoding	codec	min	max	why
order_id	int64	DELTA	DEFLATE	100000	100499	monotonic / clustered integers stored as small gaps
ts	datetime	BITPACK	DEFLATE	2026-01-01 08:02:00	2026-01-02 08:57:00	tight integer range packed into minimum bits
region	string	DICTIONARY	—	AMER	LATAM	low cardinality — values stored once, indices packed
status	string	DICTIONARY	DEFLATE	paid	refunded	low cardinality — values stored once, indices packed
amount	float64	PLAIN	DEFLATE	1.0	1059.83	high-entropy values, nothing smaller won
quantity	int64	BITPACK	—	1	12	tight integer range packed into minimum bits
note	string	DICTIONARY	—	follow-up	vip	low cardinality — values stored once, indices packed
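The "encode every way, keep the smallest" rule can be sketched for integer columns as follows. The byte layouts here are assumptions chosen for illustration (zigzag varints for DELTA, a 9-byte header for BITPACK), and the sketch leaves out DICTIONARY and the DEFLATE codec step; columna's real encoders may differ.

```python
import struct

def plain(values):
    # 8 bytes per int64 value
    return struct.pack(f"<{len(values)}q", *values)

def _varint(n):
    # Unsigned LEB128: 7 payload bits per byte, high bit marks continuation
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def _zigzag(n):
    # Map signed to unsigned so small negative gaps stay small
    return (n << 1) ^ (n >> 63)

def delta(values):
    # First value, then the gap to each following value
    gaps = [values[0]] + [b - a for a, b in zip(values, values[1:])]
    return b"".join(_varint(_zigzag(g)) for g in gaps)

def bitpack(values):
    # Store (value - min) in the fewest bits that cover the range
    lo = min(values)
    width = max(1, (max(values) - lo).bit_length())
    acc = 0
    for i, v in enumerate(values):
        acc |= (v - lo) << (i * width)
    nbytes = (len(values) * width + 7) // 8
    return struct.pack("<qB", lo, width) + acc.to_bytes(nbytes, "little")

ENCODERS = {"PLAIN": plain, "DELTA": delta, "BITPACK": bitpack}

def choose_encoding(values):
    """Trial-encode with every encoder and return (winner, sizes in bytes)."""
    sizes = {name: len(enc(values)) for name, enc in ENCODERS.items()}
    return min(sizes, key=sizes.get), sizes

order_ids = list(range(100000, 100500))               # monotonic, like order_id
quantities = [1 + (i * 7) % 12 for i in range(500)]   # synthetic values in 1..12

print(choose_encoding(order_ids))
print(choose_encoding(quantities))
```

With these synthetic inputs the monotonic column picks DELTA and the narrow-range column picks BITPACK, matching the choices in the table for order_id and quantity.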

Predicate pushdown

Query: scan(where = col("order_id") >= 104500). Each row group stores min/max for order_id in the footer, so groups whose max is below the threshold are skipped entirely — no bytes read, no decoding.

9/10

row groups skipped

500

rows actually scanned

500

rows matched

100%

fewer bytes read

The bar shows bytes read versus a full scan. Correctness is verified in the test suite against a brute-force filter: skipping never drops a matching row.
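The skip decision itself needs only the footer stats. A minimal sketch, assuming order_id values are consecutive across the ten row groups (the report confirms only the 100000 to 100499 range for row group 0); `RowGroupStats` and `groups_to_scan` are hypothetical names, not columna's API.

```python
from dataclasses import dataclass

@dataclass
class RowGroupStats:
    index: int
    min_value: int
    max_value: int

def groups_to_scan(stats, threshold):
    """Return indices of row groups that may hold a value >= threshold."""
    # A group whose max is below the threshold cannot contain a match.
    return [s.index for s in stats if s.max_value >= threshold]

# Assumed layout: 10 groups of 500 consecutive ids starting at 100000.
stats = [
    RowGroupStats(i, 100000 + 500 * i, 100000 + 500 * i + 499)
    for i in range(10)
]

selected = groups_to_scan(stats, 104500)
skipped = len(stats) - len(selected)
print(f"scan groups {selected}, skip {skipped}/{len(stats)}")
```

The test only ever rules a group out, so a group that straddles the threshold is still scanned and filtered row by row.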

Data preview

order_id	ts	region	status	amount	quantity	note
100000	2026-01-01 08:02:00	EMEA	paid	1.0	9	None
100001	2026-01-01 08:03:00	EMEA	paid	148.82	9	None
100002	2026-01-01 08:07:00	APAC	paid	45.24	2	None
100003	2026-01-01 08:11:00	EMEA	paid	126.27	10	vip
100004	2026-01-01 08:14:00	EMEA	paid	150.44	10	urgent
100005	2026-01-01 08:17:00	EMEA	paid	149.5	4	vip
100006	2026-01-01 08:19:00	APAC	paid	499.11	9	vip
100007	2026-01-01 08:21:00	AMER	paid	169.36	9	urgent
100008	2026-01-01 08:26:00	APAC	paid	91.39	6	urgent
100009	2026-01-01 08:27:00	EMEA	paid	1.0	10	urgent
100010	2026-01-01 08:32:00	LATAM	paid	187.62	6	urgent
100011	2026-01-01 08:34:00	LATAM	paid	92.67	6	urgent

Generated by columna 0.1.0 · pure-Python columnar storage engine · github.com/hajirufai/columna