
Guide

When to use what

FM-index

Use when you're searching for many patterns in the same text, need true substring search (not just word-level), the text is large (>10KB), or you want to discard the original text (self-index). The more queries you run, the more the one-time build cost is amortized.

from fast_suffix_array import FMIndex

# Build once (~1s for 10MB)
index = FMIndex(corpus)

# Each query: O(p log sigma) time, regardless of corpus size
for motif in thousand_motifs:
    n = index.count(motif)  # microseconds per query

Don't use when you're searching for fewer than ~130 patterns on a single text, the text changes between queries (must rebuild), you need regex (exact substring only), or the text is very small (<10KB).

Suffix array directly

Use when you need the raw SA/LCP for custom algorithms, one-shot string analysis (LCS, distinct substring count), BWT pipelines, or data compression applications.
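As a concrete example of one-shot analysis, the distinct-substring count falls directly out of the SA plus adjacent-suffix LCPs. A minimal pure-Python sketch, with a naive O(n^2 log n) suffix sort standing in for the library's O(n) construction:

```python
def distinct_substrings(s: str) -> int:
    """Count distinct non-empty substrings: n(n+1)/2 total substrings,
    minus the prefixes each suffix shares with its predecessor in the SA."""
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])  # naive SA; the library builds this in O(n)

    def lcp(a: int, b: int) -> int:
        # length of the common prefix of suffixes starting at a and b
        k = 0
        while a + k < n and b + k < n and s[a + k] == s[b + k]:
            k += 1
        return k

    shared = sum(lcp(sa[i - 1], sa[i]) for i in range(1, n))
    return n * (n + 1) // 2 - shared
```

Every substring is a prefix of some suffix; the LCP subtraction removes each prefix that was already counted for the previous suffix in sorted order.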

Duplicate detection

Use when finding repeated phrases in manuscripts, documentation, note collections, or LLM training data. It applies the same O(n) suffix array algorithm Google uses to deduplicate training corpora, made accessible via pip/npm.

SA vs FM-index at a glance

|                       | Suffix Array            | FM-Index          |
|-----------------------|-------------------------|-------------------|
| Pattern count         | O(p log n)              | O(p log sigma)    |
| Pattern locate        | O(p log n + occ)        | O(p + occ * s)    |
| Space                 | 5-9n bytes (SA + text)  | ~1.8n bytes       |
| Original text needed? | Yes                     | No (self-index)   |
| Construction          | O(n)                    | O(n)              |

How the FM-index works

The FM-index combines three data structures:

1. BWT (Burrows-Wheeler Transform)

A reversible permutation of the text that groups similar contexts together.
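A toy version built from sorted rotations shows both the transform and its reversibility. (In practice the BWT is derived from the suffix array as BWT[i] = text[SA[i] - 1] rather than by materializing rotations.)

```python
def bwt(text: str) -> str:
    """Last column of the sorted rotation matrix of text + sentinel.
    O(n^2 log n) teaching version."""
    s = text + "\0"  # unique sentinel, smaller than every other character
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last: str) -> str:
    """Invert by repeatedly prepending the BWT column and re-sorting:
    after n rounds the table holds all rotations in sorted order."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    original = next(row for row in table if row.endswith("\0"))
    return original.rstrip("\0")
```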

2. Wavelet matrix

A succinct data structure over the BWT that answers "how many times does character c appear in BWT[0..k]?" in O(log sigma) time. This is the key to fast backward search.
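The query itself is easy to state. A dense prefix-count table answers it in O(1) but costs O(sigma x n) space; that trade is what the wavelet matrix avoids, getting O(log sigma) time in roughly n log sigma bits. An illustrative dense version:

```python
class RankIndex:
    """rank(c, k) = number of occurrences of c in s[0:k].
    Dense O(sigma * n) table for illustration only; a wavelet matrix
    answers the same query succinctly."""

    def __init__(self, s: str):
        alphabet = sorted(set(s))
        # _counts[c][k] holds the count of c in the first k characters
        self._counts = {c: [0] * (len(s) + 1) for c in alphabet}
        for i, ch in enumerate(s):
            for c in alphabet:
                self._counts[c][i + 1] = self._counts[c][i] + (ch == c)

    def rank(self, c: str, k: int) -> int:
        counts = self._counts.get(c)
        return counts[k] if counts else 0
```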

3. Sampled suffix array

Stores every s-th suffix array position. To recover the actual text position of a match, the index walks backward via the BWT until it hits a sampled position. The locate_level parameter controls this tradeoff: lower values give faster locate at the cost of more memory.
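The backward walk is the LF mapping: each step moves the text position back by one, so the answer is the sampled position plus the number of steps taken. A self-contained sketch with naive construction and a fixed sample rate (the library's actual sampling is governed by locate_level):

```python
def build_fm(text: str, sample_rate: int = 4):
    """Toy FM pieces: BWT, C table, and SA samples at every text position
    divisible by sample_rate."""
    s = text + "\0"
    sa = sorted(range(len(s)), key=lambda i: s[i:])  # naive SA for illustration
    bwt = "".join(s[i - 1] for i in sa)  # s[-1] wraps around to the sentinel
    C, total = {}, 0
    for c in sorted(set(s)):  # C[c] = number of characters smaller than c
        C[c] = total
        total += s.count(c)
    samples = {row: pos for row, pos in enumerate(sa) if pos % sample_rate == 0}
    return bwt, C, samples

def locate(bwt: str, C: dict, samples: dict, row: int) -> int:
    """LF-step backward until a sampled row; SA[LF(row)] = SA[row] - 1,
    so the text position is samples[row] + steps."""
    steps = 0
    while row not in samples:
        c = bwt[row]
        row = C[c] + bwt[:row].count(c)  # LF mapping, with naive O(n) rank
        steps += 1
    return samples[row] + steps
```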

Backward search

Search processes the pattern right-to-left, one character per step, narrowing a range of rows in the BWT. Each step requires one wavelet matrix rank query (O(log sigma)), so counting all occurrences of a length-p pattern takes O(p log sigma) time -- regardless of text length.
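A self-contained sketch of this backward search, with a naive O(n) string count standing in for the wavelet-matrix rank query (the interval arithmetic is the real algorithm):

```python
def backward_search(bwt: str, pattern: str) -> int:
    """Count pattern occurrences by shrinking an SA interval [lo, hi)
    one pattern character at a time, right to left."""
    n = len(bwt)
    # C[c] = number of characters in the text strictly smaller than c
    C, total = {}, 0
    for c in sorted(set(bwt)):
        C[c] = total
        total += bwt.count(c)
    lo, hi = 0, n
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + bwt[:lo].count(c)  # rank queries; O(log sigma) with a wavelet matrix
        hi = C[c] + bwt[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo
```

Here `"annb\0aa"` is the BWT of `"banana"` with sentinel, so searching "ana" yields 2.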


How the duplicate detection pipeline works

The pipeline combines several algorithms into an 8-stage process. For the full technical deep dive with pseudocode, complexity proofs, and optimization history, see the Algorithm Deep Dive.

Stage 1: SA-IS construction (O(n))

Builds suffix array in linear time. Pure Rust, works everywhere including WASM. Optional libsais/divsufsort backends for native speed.

Stage 2: Kasai's LCP (O(n))

Computes the LCP (longest common prefix) array: for each position i, the length of the common prefix of the suffixes at SA[i-1] and SA[i].
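Kasai's key observation: dropping the first character of a suffix shrinks its LCP with its SA neighbour by at most one, so a single running counter h amortizes the whole scan to O(n). A sketch:

```python
def kasai_lcp(s: str, sa: list[int]) -> list[int]:
    """LCP array in O(n) given the suffix array.
    lcp[i] = common prefix length of suffixes sa[i-1] and sa[i]; lcp[0] = 0."""
    n = len(s)
    rank = [0] * n  # inverse of sa: rank[p] = row of suffix starting at p
    for i, p in enumerate(sa):
        rank[p] = i
    lcp = [0] * n
    h = 0
    for i in range(n):  # visit suffixes in text order
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # previous suffix in sorted order
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h:
                h -= 1  # next suffix drops one leading character
        else:
            h = 0
    return lcp
```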

Stage 3: Boundary precomputation (O(n))

Identifies word starts/ends and sentence boundaries. Handles abbreviations (Mr., Dr., U.S.A.), decimal numbers, URLs, CJK punctuation, and ellipsis.

Stage 4: LCP walk (O(n x w))

Discovers repeated phrases by walking the LCP array once, only examining word-boundary-aligned positions. This replaces naive n-gram generation (which would be O(n x m) and produce millions of candidates).
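The core of the walk, stripped of the word-boundary filtering: any common prefix of two SA-adjacent suffixes is a substring that occurs at least twice, so one pass over the LCP values discovers every repeat. A sketch with a naive SA stand-in:

```python
def repeated_substrings(s: str, min_len: int = 3) -> set[str]:
    """One pass over adjacent suffix pairs in SA order: a common prefix of
    length >= min_len occurs at least twice in s. (The library's walk also
    restricts candidates to word-boundary-aligned positions.)"""
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])  # naive SA for illustration
    found = set()
    for i in range(1, n):
        a, b, k = sa[i - 1], sa[i], 0
        while a + k < n and b + k < n and s[a + k] == s[b + k]:
            k += 1  # k is lcp[i]
        if k >= min_len:
            found.add(s[b:b + k])
    return found
```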

Stage 5: Aho-Corasick deduplication (O(P x L))

Builds an automaton from all discovered phrases and batch-removes shorter phrases that are substrings of longer ones with equal or higher counts. On the King James Bible: 165ms vs 178 seconds for the naive O(P^2) approach.
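A compact sketch of the idea: build one automaton over all phrases, then scan each phrase through it once; any other phrase matching inside it is a substring and can be dropped when its count is no higher. This is illustrative only, but it shows why one automaton pass beats O(P^2) pairwise containment checks.

```python
from collections import deque

def dedupe_phrases(counts: dict[str, int]) -> dict[str, int]:
    """Remove each phrase that is a substring of a longer phrase whose
    count is equal or higher, via a single Aho-Corasick scan."""
    phrases = list(counts)
    goto, out = [{}], [[]]  # trie transitions and phrases ending per state
    for p in phrases:
        state = 0
        for ch in p:
            if ch not in goto[state]:
                goto.append({})
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(p)
    # BFS failure links; merge outputs so out[s] lists every phrase ending at s
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f][ch] if ch in goto[f] and goto[f][ch] != t else 0
            out[t] += out[fail[t]]
    removed = set()
    for long in phrases:
        s = 0
        for ch in long:  # any other phrase matched here is a substring of long
            while s and ch not in goto[s]:
                s = fail[s]
            s = goto[s].get(ch, 0)
            for short in out[s]:
                if short != long and counts[short] <= counts[long]:
                    removed.add(short)
    return {p: c for p, c in counts.items() if p not in removed}
```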

Stage 6: Interval merging (O(P log P))

Combines adjacent phrases that always co-occur into longer, more meaningful units. Sorted linear pass: 7.9ms vs 37.8 seconds for the naive O(P^2) approach.
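The linear pass is ordinary sorted interval merging. A simplified stand-in (the real stage additionally verifies that the phrases on both sides always co-occur before fusing them):

```python
def merge_intervals(intervals: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Merge [start, end) spans that overlap or touch, in one pass
    after an O(P log P) sort."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # extends or overlaps the previous span: fuse them
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```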

Stage 7: Block detection (optional, O(n))

Finds large duplicated paragraphs/sections beyond the n-gram window. Removes contained n-gram phrases that are redundant.

Stage 8: Byte-to-UTF-16 offset conversion

For VSCode/JavaScript interop.


Architecture

Repository structure

crates/core/                Rust core library — all algorithms live here
  src/
    suffix_array.rs         SA-IS construction, LCP, BWT, search, LCS
    fm_index.rs             FM-index wrapper (delegates to fm-index crate)
    duplicate_finder.rs     8-stage duplicate detection pipeline
    interval_merger.rs      Interval merging for adjacent co-occurring phrases
    sentence_splitter.rs    Sentence boundary detection
    tokenizer.rs            Word tokenization with Unicode support
    types.rs                PhraseResult, Occurrence, shared types
    lib.rs                  Public API surface
    bin/fsa.rs              CLI entry point

bindings/python/            PyO3 bindings — wraps core crate for pip
bindings/wasm/              wasm-bindgen bindings — wraps core for npm

benchmark/                  JavaScript benchmark scripts
bindings/python/benchmarks/ Python benchmark scripts
benchmark-results/          Committed JSON from canonical benchmark runs

docs/                       MkDocs Material documentation site
packages/                   npm package configuration

Build targets

| Target         | Tool      | Command                                                |
|----------------|-----------|--------------------------------------------------------|
| Rust crate     | cargo     | `cargo build --release`                                |
| Python wheel   | maturin   | `cd bindings/python && uv run maturin build --release` |
| WASM (Node.js) | wasm-pack | `wasm-pack build bindings/wasm --target nodejs`        |
| WASM (browser) | wasm-pack | `wasm-pack build bindings/wasm --target web`           |
| CLI            | cargo     | `cargo install fast-suffix-array-core --features cli`  |
| Docs site      | mkdocs    | `mkdocs serve` (dev) / `mkdocs build` (prod)           |