Guide¶
When to use what¶
FM-index¶
Use when you're searching for many patterns in the same text, need true substring search (not just word-level), the text is large (>10KB), or you want to discard the original text (self-index). The more queries you run, the more the one-time build cost is amortized.
from fast_suffix_array import FMIndex
# Build once (~1s for 10MB)
index = FMIndex(corpus)
# Each query: O(p log sigma) time, regardless of corpus size
for motif in thousand_motifs:
n = index.count(motif) # microseconds per query
Don't use when you're searching for fewer than ~130 patterns on a single text, the text changes between queries (must rebuild), you need regex (exact substring only), or the text is very small (<10KB).
Suffix array directly¶
Use when you need the raw SA/LCP for custom algorithms, one-shot string analysis (LCS, distinct substring count), BWT pipelines, or data compression applications.
Duplicate detection¶
Use when finding repeated phrases in manuscripts, documentation, note collections, or LLM training data. The same O(n) suffix array algorithm Google uses to deduplicate training corpora -- made accessible via pip/npm.
SA vs FM-index at a glance¶
| Suffix Array | FM-Index | |
|---|---|---|
| Pattern count | O(p log n) | O(p log sigma) |
| Pattern locate | O(p log n + occ) | O(p + occ * s) |
| Space | 5-9n bytes (SA + text) | ~1.8n bytes |
| Original text needed? | Yes | No (self-index) |
| Construction | O(n) | O(n) |
How the FM-index works¶
The FM-index combines three data structures:
1. BWT (Burrows-Wheeler Transform)¶
A reversible permutation of the text that groups similar contexts together.
2. Wavelet matrix¶
A succinct data structure over the BWT that answers "how many times does character c appear in BWT[0..k]?" in O(log sigma) time. This is the key to fast backward search.
3. Sampled suffix array¶
Stores every s-th suffix array position. To recover the actual text position of a match, the index walks backward via the BWT until it hits a sampled position. The locate_level parameter controls this tradeoff: lower = faster locate but more memory.
Backward search¶
Processes the pattern right-to-left, one character per step, narrowing a range in the BWT. Each step requires one wavelet matrix rank query (O(log sigma)), so counting all occurrences of a length-p pattern takes O(p log sigma) time -- regardless of text length.
How the duplicate detection pipeline works¶
The pipeline combines several algorithms into an 8-stage process. For the full technical deep dive with pseudocode, complexity proofs, and optimization history, see the Algorithm Deep Dive.
Stage 1: SA-IS construction (O(n))¶
Builds suffix array in linear time. Pure Rust, works everywhere including WASM. Optional libsais/divsufsort backends for native speed.
Stage 2: Kasai's LCP (O(n))¶
Computes longest common prefix array.
Stage 3: Boundary precomputation (O(n))¶
Identifies word starts/ends and sentence boundaries. Handles abbreviations (Mr., Dr., U.S.A.), decimal numbers, URLs, CJK punctuation, and ellipsis.
Stage 4: LCP walk (O(n x w))¶
Discovers repeated phrases by walking the LCP array once, only examining word-boundary-aligned positions. This replaces naive n-gram generation (which would be O(n x m) and produce millions of candidates).
Stage 5: Aho-Corasick deduplication (O(P x L))¶
Builds an automaton from all discovered phrases and batch-removes shorter phrases that are substrings of longer ones with equal or higher counts. On the King James Bible: 165ms vs 178 seconds for the naive O(P^2) approach.
Stage 6: Interval merging (O(P log P))¶
Combines adjacent phrases that always co-occur into longer, more meaningful units. Sorted linear pass: 7.9ms vs 37.8 seconds for the naive O(P^2) approach.
Stage 7: Block detection (optional, O(n))¶
Finds large duplicated paragraphs/sections beyond the n-gram window. Removes contained n-gram phrases that are redundant.
Stage 8: Byte-to-UTF-16 offset conversion¶
For VSCode/JavaScript interop.
Architecture¶
Repository structure¶
crates/core/ Rust core library — all algorithms live here
src/
suffix_array.rs SA-IS construction, LCP, BWT, search, LCS
fm_index.rs FM-index wrapper (delegates to fm-index crate)
duplicate_finder.rs 8-stage duplicate detection pipeline
interval_merger.rs Interval merging for adjacent co-occurring phrases
sentence_splitter.rs Sentence boundary detection
tokenizer.rs Word tokenization with Unicode support
types.rs PhraseResult, Occurrence, shared types
lib.rs Public API surface
bin/fsa.rs CLI entry point
bindings/python/ PyO3 bindings — wraps core crate for pip
bindings/wasm/ wasm-bindgen bindings — wraps core for npm
benchmark/ JavaScript benchmark scripts
bindings/python/benchmarks/ Python benchmark scripts
benchmark-results/ Committed JSON from canonical benchmark runs
docs/ MkDocs Material documentation site
packages/ npm package configuration
Build targets¶
| Target | Tool | Command |
|---|---|---|
| Rust crate | cargo | cargo build --release |
| Python wheel | maturin | cd bindings/python && uv run maturin build --release |
| WASM (Node.js) | wasm-pack | wasm-pack build bindings/wasm --target nodejs |
| WASM (browser) | wasm-pack | wasm-pack build bindings/wasm --target web |
| CLI | cargo | cargo install fast-suffix-array-core --features cli |
| Docs site | mkdocs | mkdocs serve (dev) / mkdocs build (prod) |