passing tests. Zero external dependencies.
A full-text search engine built entirely from scratch in Python. Inverted indexing, BM25 ranking, Porter stemming, recursive descent query parsing, faceted search, and persistent storage — all from the standard library.
```python
from searchlite import SearchEngine, Schema, TextField, KeywordField

engine = SearchEngine(schema=Schema(
    title=TextField(boost=2.0),
    body=TextField(),
    tags=KeywordField(faceted=True),
))

engine.add({
    "title": "Building Data Pipelines",
    "body": "A guide to ETL with Python and Apache Kafka",
    "tags": ["data-engineering", "python"],
})

# BM25-ranked results with highlighting
results = engine.search("python AND kafka")
for hit in results:
    print(hit.score, hit.highlight("body"))
```
Term → document mapping with position tracking, term frequencies, and field-level storage. Add, remove, and serialize document sets.
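To make the term → document mapping concrete, here is a minimal sketch of a positional inverted index. The class name `TinyInvertedIndex` and its methods are hypothetical and chosen for illustration; they are not searchlite's internals, and tokenization is reduced to lowercasing and whitespace splitting.

```python
from collections import defaultdict


class TinyInvertedIndex:
    """Hypothetical structure: term -> {doc_id: [positions]}."""

    def __init__(self):
        self.postings = defaultdict(dict)

    def add(self, doc_id, text):
        # Record the position of every token so phrase queries stay possible.
        for position, term in enumerate(text.lower().split()):
            self.postings[term].setdefault(doc_id, []).append(position)

    def remove(self, doc_id):
        # Drop the document from every posting list, pruning empty terms.
        for term in list(self.postings):
            self.postings[term].pop(doc_id, None)
            if not self.postings[term]:
                del self.postings[term]

    def term_frequency(self, term, doc_id):
        # Term frequency is just the number of recorded positions.
        return len(self.postings.get(term, {}).get(doc_id, []))


index = TinyInvertedIndex()
index.add(1, "python data pipelines with python")
index.add(2, "kafka data streams")
print(index.postings["python"])           # {1: [0, 4]}
print(index.term_frequency("python", 1))  # 2
```

Storing positions rather than bare counts costs more memory, but it lets one structure serve both ranking (term frequency) and phrase matching.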
Okapi BM25 with tunable k1 and b parameters. Length normalization, field boosting, and diminishing-return term frequency saturation.
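As a rough illustration of how `k1` and `b` shape the score, the sketch below computes a single term's BM25 contribution. It assumes a commonly used IDF variant, `ln(1 + (N - n + 0.5) / (n + 0.5))`, and the conventional defaults `k1=1.2`, `b=0.75`; the function name and the exact formula used by this project are assumptions.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq,
                    k1=1.2, b=0.75, boost=1.0):
    """Score contribution of one term in one field (illustrative only)."""
    # Rare terms get a higher inverse document frequency.
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b controls how strongly longer-than-average documents are penalized.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences stop adding score.
    saturation = (tf * (k1 + 1.0)) / (tf + k1 * length_norm)
    return boost * idf * saturation


# Each doubling of term frequency adds less than the previous one.
for tf in (1, 2, 4, 8):
    print(tf, round(bm25_term_score(tf, 100, 100, 1000, 10), 3))
```

Setting `b=0` turns length normalization off entirely, while a larger `k1` delays saturation so that repeated terms keep contributing for longer.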
All five steps of the original Porter stemming algorithm. "running" → "run", "connected" → "connect", "generalization" → "gener".
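For a sense of what a single step looks like, here is a sketch of step 1a only, which handles plural suffixes. The rules follow Porter's original description; the function name is hypothetical, and the remaining steps (which depend on the consonant-vowel "measure" of the stem) are omitted.

```python
def porter_step_1a(word):
    """Step 1a of the Porter algorithm: plural suffixes only."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


for word in ("caresses", "ponies", "caress", "cats"):
    print(word, "->", porter_step_1a(word))
```

The rules are tried longest suffix first, which is why "caress" survives intact instead of losing its final "s".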
Recursive descent parser handling AND, OR, NOT, phrase queries, wildcards, field-specific search, boost operators, and parenthesized grouping.
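The sketch below shows the general shape of a recursive descent parser for a subset of this syntax: AND, OR, NOT, implicit AND, phrases, and parentheses. Precedence is assumed to be NOT over AND over OR. All names (`Parser`, `parse_query`, the tuple-based tree) are hypothetical, and field prefixes, wildcards, and boosts are deliberately left out.

```python
import re

# A quoted phrase, a parenthesis, or a run of other non-space characters.
TOKEN_RE = re.compile(r'"[^"]*"|\(|\)|[^\s()"]+')


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def parse(self):
        node = self.parse_or()
        if self.peek() is not None:
            raise ValueError(f"unexpected token: {self.peek()!r}")
        return node

    def parse_or(self):  # lowest precedence
        node = self.parse_and()
        while self.peek() == "OR":
            self.take()
            node = ("OR", node, self.parse_and())
        return node

    def parse_and(self):  # explicit AND, or two adjacent operands
        node = self.parse_not()
        while self.peek() not in (None, "OR", ")"):
            if self.peek() == "AND":
                self.take()
            node = ("AND", node, self.parse_not())
        return node

    def parse_not(self):  # highest-precedence operator
        if self.peek() == "NOT":
            self.take()
            return ("NOT", self.parse_not())
        return self.parse_atom()

    def parse_atom(self):
        token = self.take()
        if token is None or token in ("AND", "OR", ")"):
            raise ValueError(f"unexpected token: {token!r}")
        if token == "(":
            node = self.parse_or()
            if self.take() != ")":
                raise ValueError("missing closing parenthesis")
            return node
        if token.startswith('"'):
            return ("PHRASE", token.strip('"').split())
        return ("TERM", token.lower())


def parse_query(query):
    return Parser(TOKEN_RE.findall(query)).parse()


print(parse_query("(python OR java) AND data"))
# ('AND', ('OR', ('TERM', 'python'), ('TERM', 'java')), ('TERM', 'data'))
```

Each precedence level gets its own method, and each method calls the next-tighter level for its operands, so the grouping falls out of the call structure without a separate precedence table.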
Multi-field facet counting with value filtering. Search for "python", get back tag distributions and filter by category — like Elasticsearch aggregations.
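Facet counting can be sketched with `collections.Counter` over the documents a query matched. The helper names `facet_counts` and `filter_by_facet` and the plain-dict hit format are assumptions made for this example, not the library's API.

```python
from collections import Counter


def facet_counts(hits, field):
    """Count how many matching documents carry each facet value."""
    counts = Counter()
    for doc in hits:
        # set() so a value repeated within one document counts once.
        counts.update(set(doc.get(field, [])))
    return counts


def filter_by_facet(hits, field, value):
    """Keep only the hits whose facet field contains the given value."""
    return [doc for doc in hits if value in doc.get(field, [])]


hits = [
    {"title": "Building Data Pipelines", "tags": ["data-engineering", "python"]},
    {"title": "Python Tricks", "tags": ["python"]},
    {"title": "Kafka Basics", "tags": ["data-engineering", "kafka"]},
]

counts = facet_counts(hits, "tags")
print(counts["python"], counts["kafka"])  # 2 1
print([d["title"] for d in filter_by_facet(hits, "tags", "kafka")])  # ['Kafka Basics']
```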
JSON-based segment files with metadata tracking and compaction. Save your index to disk, reload it later. Context-manager support for clean shutdown.
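The save, reload, and context-manager behavior could look roughly like the sketch below. It writes a single JSON file with a small metadata header and skips segments and compaction entirely; `JsonIndexStore`, its methods, and the file layout are hypothetical.

```python
import json
import os
import tempfile


class JsonIndexStore:
    """Hypothetical single-file store; real segment handling is omitted."""

    def __init__(self, path):
        self.path = path
        self.documents = {}
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as handle:
                self.documents = json.load(handle)["documents"]

    def add(self, doc_id, document):
        # JSON object keys are strings, so normalize ids up front.
        self.documents[str(doc_id)] = document

    def commit(self):
        payload = {
            "metadata": {"doc_count": len(self.documents)},
            "documents": self.documents,
        }
        # Write to a temp file, then rename, so a crash mid-write
        # cannot leave a half-written index behind.
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
        os.replace(tmp_path, self.path)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Persist only when the block finished without an exception.
        if exc_type is None:
            self.commit()
        return False


path = os.path.join(tempfile.mkdtemp(), "segment.json")
with JsonIndexStore(path) as store:
    store.add(1, {"title": "Building Data Pipelines"})

reloaded = JsonIndexStore(path)
print(reloaded.documents["1"]["title"])  # Building Data Pipelines
```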
| Query | What it does |
|---|---|
| `python data` | Implicit AND: both terms required |
| `python OR java` | Either term matches |
| `NOT python` | Exclude documents containing "python" |
| `"machine learning"` | Exact phrase with positional matching |
| `title:python` | Search only the title field |
| `title:"data science"` | Phrase in a specific field |
| `pyth*` | Prefix wildcard expansion |
| `python^2.0` | Boost a term's weight |
| `(python OR java) AND data` | Grouped boolean logic |
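Positional matching for an exact phrase such as `"machine learning"` can be sketched as a check for consecutive positions. The example assumes postings shaped as `term -> {doc_id: [positions]}`; that layout and the `phrase_match` helper are illustrative assumptions.

```python
def phrase_match(postings, phrase_terms, doc_id):
    """True if the terms occur at consecutive positions in doc_id."""
    if not phrase_terms:
        return False
    position_sets = []
    for term in phrase_terms:
        positions = postings.get(term, {}).get(doc_id)
        if not positions:
            return False  # a missing term rules out the phrase
        position_sets.append(set(positions))
    # Try every occurrence of the first term as the start of the phrase.
    return any(
        all(start + offset in position_sets[offset]
            for offset in range(1, len(position_sets)))
        for start in position_sets[0]
    )


postings = {
    "machine": {1: [0, 5], 2: [3]},
    "learning": {1: [1], 2: [0]},
}
print(phrase_match(postings, ["machine", "learning"], 1))  # True
print(phrase_match(postings, ["machine", "learning"], 2))  # False
```

Document 2 contains both words, but "learning" comes before "machine", so it would match `machine learning` as an implicit AND and not as a phrase.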
High-level interface — add, search, commit, stats
Query → postings → score → rank
TF saturation, IDF, length norm
Recursive descent with precedence
Best-passage snippet extraction
Posting lists, positions, term lookup
Tokenize → normalize → stem → filter
Text, Keyword, Numeric definitions
JSON segments with compaction
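The tokenize → normalize → stem → filter stage listed above can be sketched as a chain of small functions. The stop-word list and `crude_stem` below are stand-ins invented for this example; in particular, `crude_stem` is a simple suffix stripper and not the Porter algorithm.

```python
import re

# Hypothetical stop-word list, kept short for illustration.
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "with"}


def crude_stem(token):
    """Placeholder suffix stripper (NOT Porter): keeps stems of 3+ letters."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    tokens = re.findall(r"[A-Za-z0-9]+", text)           # tokenize
    tokens = [token.lower() for token in tokens]         # normalize
    tokens = [crude_stem(token) for token in tokens]     # stem
    return [t for t in tokens if t not in STOP_WORDS]    # filter


print(analyze("A guide to ETL with Python and Apache Kafka"))
# ['guide', 'etl', 'python', 'apache', 'kafka']
```

Queries need to pass through the same `analyze` function as documents; otherwise a search for "Pipelines" would never meet the indexed term "pipeline".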