SearchLite — Full-Text Search Engine from Scratch

194 passing tests. Zero external dependencies.

A full-text search engine built entirely from scratch in Python. Inverted indexing, BM25 ranking, Porter stemming, recursive descent query parsing, faceted search, and persistent storage — all from the standard library.


Quick start

from searchlite import SearchEngine, Schema, TextField, KeywordField

engine = SearchEngine(schema=Schema(
    title=TextField(boost=2.0),
    body=TextField(),
    tags=KeywordField(faceted=True),
))

engine.add({
    "title": "Building Data Pipelines",
    "body": "A guide to ETL with Python and Apache Kafka",
    "tags": ["data-engineering", "python"],
})

# BM25-ranked results with highlighting
results = engine.search("python AND kafka")
for hit in results:
    print(hit.score, hit.highlight("body"))

What's under the hood

Inverted Index

Term → document mapping with position tracking, term frequencies, and field-level storage. Add, remove, and serialize document sets.
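As a rough illustration of the idea (not SearchLite's actual implementation), a positional inverted index can be sketched as nested dictionaries keyed by field, term, and document ID. The class name `TinyInvertedIndex` and its methods are hypothetical, and tokenization is simplified to lowercase whitespace splitting.

```python
from collections import defaultdict


class TinyInvertedIndex:
    """Hypothetical sketch: field -> term -> doc_id -> list of token positions."""

    def __init__(self):
        self.postings = defaultdict(lambda: defaultdict(dict))
        self.docs = {}  # doc_id -> original fields, kept so documents can be removed

    def add(self, doc_id, fields):
        """Index a dict of field name -> text using lowercase whitespace tokens."""
        self.docs[doc_id] = fields
        for field, text in fields.items():
            for pos, term in enumerate(text.lower().split()):
                self.postings[field][term].setdefault(doc_id, []).append(pos)

    def remove(self, doc_id):
        """Drop a document and prune posting lists that become empty."""
        fields = self.docs.pop(doc_id, {})
        for field, text in fields.items():
            for term in set(text.lower().split()):
                by_doc = self.postings[field].get(term)
                if by_doc is None:
                    continue
                by_doc.pop(doc_id, None)
                if not by_doc:
                    del self.postings[field][term]

    def positions(self, field, term, doc_id):
        """Token positions of a term in one field of one document."""
        return self.postings.get(field, {}).get(term, {}).get(doc_id, [])

    def term_frequency(self, field, term, doc_id):
        """Term frequency falls out of the position list length."""
        return len(self.positions(field, term, doc_id))
```

Storing positions rather than bare counts is what makes phrase queries possible later, at the cost of larger posting lists.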

BM25 & TF-IDF Scoring

Okapi BM25 with tunable k1 and b parameters. Length normalization, field boosting, and diminishing-return term frequency saturation.
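To make the saturation and length-normalization behavior concrete, here is a minimal sketch of a per-term BM25 score. It assumes the common non-negative IDF variant `log(1 + (N - df + 0.5) / (df + 0.5))` and defaults of `k1=1.2`, `b=0.75`. The function name, the `boost` multiplier, and the defaults are illustrative assumptions, not values taken from SearchLite.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq,
                    k1=1.2, b=0.75, boost=1.0):
    """Score one term in one document field (illustrative sketch).

    tf        term frequency in the field
    doc_len   field length in tokens; avg_doc_len is the corpus average
    n_docs    total documents; doc_freq is how many contain the term
    """
    # Rare terms get a higher IDF; the "1 +" keeps it non-negative.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b controls how strongly longer-than-average fields are penalized.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    # As tf grows, this ratio approaches (k1 + 1): diminishing returns.
    saturation = tf * (k1 + 1) / (tf + k1 * length_norm)
    return boost * idf * saturation
```

With these formulas the score for a single term can never exceed `boost * idf * (k1 + 1)`, no matter how often the term repeats, which is the saturation property described above.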

Porter Stemmer

All five steps of the original Porter stemming algorithm. "running" → "run", "connected" → "connect", "generalization" → "gener".
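A full Porter stemmer is too long to show here, but two of its building blocks are small enough to sketch: the "measure" of a stem (the number of vowel-consonant sequences, which gates most suffix rules) and step 1a (plural stripping). The function names below are hypothetical and this is only a fragment of the algorithm, not SearchLite's code.

```python
def is_consonant(word, i):
    """Porter's definition: 'y' is a consonant unless it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-to-consonant transitions: m in the form [C](VC)^m[V]."""
    m = 0
    prev_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_was_vowel:
            m += 1
        prev_was_vowel = not consonant
    return m


def step_1a(word):
    """Step 1a only: SSES -> SS, IES -> I, SS -> SS, S -> (nothing)."""
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word
```

The measure explains the last example above: "gener" has m = 2, which satisfies the m > 1 condition that lets a later step strip "al" from "general".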

Query Parser

Recursive descent parser handling AND, OR, NOT, phrase queries, wildcards, field-specific search, boost operators, and parenthesized grouping.
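The boolean core of such a parser can be sketched in a few dozen lines. The version below is a simplified illustration, not SearchLite's parser: it handles AND, OR, NOT, implicit AND, phrases, and parentheses, and it leaves out wildcards, field prefixes, and boosts (those tokens pass through as plain terms). The tuple-based AST and all names are assumptions made for the example.

```python
import re

# Parentheses, quoted phrases, or runs of non-space characters.
TOKEN_RE = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')


def tokenize(query):
    return TOKEN_RE.findall(query)


class Parser:
    """Grammar, lowest to highest precedence:
        or_expr  := and_expr ("OR" and_expr)*
        and_expr := not_expr (["AND"] not_expr)*
        not_expr := "NOT" not_expr | atom
        atom     := "(" or_expr ")" | PHRASE | TERM
    """

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def parse(self):
        node = self.or_expr()
        if self.peek() is not None:
            raise ValueError(f"unexpected token {self.peek()!r}")
        return node

    def or_expr(self):
        node = self.and_expr()
        while self.peek() == "OR":
            self.advance()
            node = ("OR", node, self.and_expr())
        return node

    def and_expr(self):
        node = self.not_expr()
        # Anything that is not OR, ")" or end of input continues the AND chain,
        # so adjacent terms are joined by an implicit AND.
        while self.peek() not in (None, "OR", ")"):
            if self.peek() == "AND":
                self.advance()
            node = ("AND", node, self.not_expr())
        return node

    def not_expr(self):
        if self.peek() == "NOT":
            self.advance()
            return ("NOT", self.not_expr())
        return self.atom()

    def atom(self):
        tok = self.advance()
        if tok is None:
            raise ValueError("unexpected end of query")
        if tok == "(":
            node = self.or_expr()
            if self.advance() != ")":
                raise ValueError("missing closing parenthesis")
            return node
        if tok in (")", "AND", "OR"):
            raise ValueError(f"unexpected token {tok!r}")
        if tok.startswith('"'):
            return ("PHRASE", tok.strip('"').lower().split())
        return ("TERM", tok.lower())
```

Each precedence level gets its own method, and each method only calls the next-tighter level, which is what makes NOT bind tighter than AND and AND tighter than OR without any explicit precedence table.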

Faceted Search

Multi-field facet counting with value filtering. A search for "python" returns the distribution of tag values across the matching documents, and results can then be narrowed to a single facet value, similar to Elasticsearch aggregations.
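Conceptually, facet counting is a tally over the documents that matched the query. The sketch below assumes hits are plain dictionaries and that facet fields hold either a string or a list of strings; `facet_counts` and `filter_by_facet` are hypothetical helper names, not part of the SearchLite API.

```python
from collections import Counter


def _as_list(values):
    """Treat a single string as a one-element list of facet values."""
    return [values] if isinstance(values, str) else list(values)


def facet_counts(docs, facet_fields):
    """Count how many matching documents carry each value of each facet field."""
    counts = {field: Counter() for field in facet_fields}
    for doc in docs:
        for field in facet_fields:
            # set() so a value repeated within one document counts once
            counts[field].update(set(_as_list(doc.get(field, []))))
    return counts


def filter_by_facet(docs, field, value):
    """Keep only documents whose facet field contains the given value."""
    return [doc for doc in docs if value in _as_list(doc.get(field, []))]
```

Counts are computed over the current result set, so applying a filter and recounting gives the drill-down behavior familiar from faceted navigation.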

Persistent Storage

JSON-based segment files with metadata tracking and compaction. Save your index to disk, reload it later. Context-manager support for clean shutdown.
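One way to picture segment-based persistence is a directory of immutable JSON files plus a small metadata file listing them. The following is a minimal sketch under that assumption. The class name `SegmentStore`, the file naming scheme, and the `meta.json` layout are invented for illustration and do not describe SearchLite's on-disk format.

```python
import json
from pathlib import Path


class SegmentStore:
    """Hypothetical sketch: each commit writes one JSON segment file."""

    def __init__(self, directory):
        self.dir = Path(directory)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.meta_path = self.dir / "meta.json"
        if self.meta_path.exists():
            self.meta = json.loads(self.meta_path.read_text(encoding="utf-8"))
        else:
            self.meta = {"next_id": 0, "segments": []}
        self.pending = {}  # documents added since the last commit

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.commit()  # flush pending documents on clean shutdown
        return False

    def add(self, doc_id, doc):
        self.pending[str(doc_id)] = doc

    def _write_segment(self, docs):
        name = f"segment_{self.meta['next_id']:06d}.json"
        (self.dir / name).write_text(json.dumps(docs), encoding="utf-8")
        self.meta["next_id"] += 1
        return name

    def _write_meta(self):
        self.meta_path.write_text(json.dumps(self.meta), encoding="utf-8")

    def commit(self):
        if not self.pending:
            return
        self.meta["segments"].append(self._write_segment(self.pending))
        self.pending = {}
        self._write_meta()

    def load_all(self):
        """Merge segments in order; later segments override earlier ones."""
        docs = {}
        for name in self.meta["segments"]:
            docs.update(json.loads((self.dir / name).read_text(encoding="utf-8")))
        return docs

    def compact(self):
        """Rewrite all segments as one, then delete the old files."""
        self.commit()
        old = list(self.meta["segments"])
        if len(old) < 2:
            return
        self.meta["segments"] = [self._write_segment(self.load_all())]
        self._write_meta()
        for name in old:
            (self.dir / name).unlink()
```

Used as `with SegmentStore("index_dir") as store: ...`, any documents still pending when the block exits are written out. A production design would also write files atomically (for example, write to a temporary name and rename), which this sketch omits.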

Query syntax

Query	What it does
`python data`	Implicit AND — both terms required
`python OR java`	Either term matches
`NOT python`	Exclude documents containing "python"
`"machine learning"`	Exact phrase with positional matching
`title:python`	Search only the title field
`title:"data science"`	Phrase in a specific field
`pyth*`	Prefix wildcard expansion
`python^2.0`	Boost a term's weight
`(python OR java) AND data`	Grouped boolean logic
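Two rows of the table rely on index structure rather than plain term lookup: phrase queries need token positions, and prefix wildcards need an ordered term dictionary. The helpers below sketch both under simple assumptions (positions as lists of integers per term, and the vocabulary as a sorted list). The function names are hypothetical.

```python
import bisect


def phrase_match(positions_by_term, phrase_terms):
    """Return start positions where the phrase terms occur consecutively.

    positions_by_term maps term -> token positions within one document field.
    """
    if not phrase_terms:
        return []
    first = positions_by_term.get(phrase_terms[0], [])
    rest = [set(positions_by_term.get(t, [])) for t in phrase_terms[1:]]
    # The i-th following term must sit exactly i + 1 tokens after the start.
    return [p for p in first
            if all(p + i + 1 in later for i, later in enumerate(rest))]


def expand_prefix(sorted_terms, prefix):
    """Expand a prefix such as 'pyth' into every indexed term that starts with it."""
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order means no later term can match
        matches.append(term)
    return matches
```

A wildcard query can then be treated as an OR over the expanded terms, and a phrase query matches a document when `phrase_match` returns at least one start position.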

Architecture

Layer	Component	Role
API Layer	SearchEngine	High-level interface: add, search, commit, stats
↓
Execution	Searcher	Query → postings → score → rank
Ranking	BM25 Scorer	TF saturation, IDF, length norm
Parsing	Query Parser	Recursive descent with precedence
Display	Highlighter	Best-passage snippet extraction
↓
Core	Inverted Index	Posting lists, positions, term lookup
Analysis	Analyzer	Tokenize → normalize → stem → filter
Schema	Field Types	Text, Keyword, Numeric definitions
Persistence	Storage	JSON segments with compaction
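The Analyzer's tokenize → normalize → stem → filter chain listed above can be sketched as a small function pipeline. This is an assumption-laden illustration: the stop-word list is a made-up sample, normalization is taken to mean lowercasing plus accent stripping, and the stemmer is passed in as a callable (identity by default) rather than being a real Porter implementation.

```python
import re
import unicodedata

# Illustrative sample only; a real stop-word list would be longer.
STOP_WORDS = frozenset({"a", "an", "and", "of", "the", "to", "with"})


def normalize(token):
    """Lowercase and strip combining accents (e.g. 'Café' -> 'cafe')."""
    decomposed = unicodedata.normalize("NFKD", token)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def analyze(text, stem=lambda token: token, stop_words=STOP_WORDS):
    """Run the four stages in order and return the index terms."""
    tokens = re.findall(r"\w+", text)            # tokenize
    tokens = [normalize(t) for t in tokens]      # normalize
    tokens = [stem(t) for t in tokens]           # stem (pluggable)
    return [t for t in tokens if t not in stop_words]  # filter
```

The same pipeline has to run at index time and at query time; otherwise a query for "Running" would never meet an index entry stored as "run".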