fast-suffix-array¶

Suffix arrays for every platform: Rust, Python, JavaScript/WASM, and the command line.

The suffix array is the foundational data structure behind full-text search, data compression, bioinformatics, and duplicate detection. This library provides O(n) suffix array construction in pure Rust, compiling to both native code and WebAssembly.

A single pip install or npm install gives you suffix arrays, LCP arrays, FM-index, BWT, longest common substring, duplicate phrase detection, LPF arrays, and sentence splitting. Ships as a Rust crate (fast-suffix-array-core), Python package (fast-suffix-array via PyO3), npm/WASM package (fast-suffix-array), and CLI tool (fsa).

It is the only WASM suffix array on npm (12-20x faster than mnemonist) and the only WASM FM-index on npm (1,074x faster than indexOf for multi-pattern search). On PyPI, it sits alongside math-hiyoko's fm-index as one of two actively maintained FM-indexes, differentiating on breadth: SA, LCP, BWT, LCS, LPF, duplicate detection, and WASM browser support in a single package. It is also the only library in any language for word-boundary-aware duplicate phrase detection.

The substring gap¶

Suffix-array-based search finds arbitrary substrings that word-tokenizing search libraries miss. Every popular JavaScript search library -- fuse.js (8.4M downloads/week), lunr (5.6M), minisearch (900K) -- tokenizes text into words. They answer "which documents contain these words?" but cannot find substrings like "rown fox" or "ATTA" (a DNA motif inside GATTACA).

Query	fuse.js	lunr	minisearch	FM-index
`"rown fox"`	MISS	FOUND	FOUND	FOUND
`"ATTA"`	MISS	MISS	MISS	FOUND
`"Script dev"`	MISS	MISS	MISS	FOUND
`"ows-Whe"`	MISS	MISS	MISS	FOUND
`"ix arr"`	MISS	MISS	MISS	FOUND
`"uick brown f"`	MISS	FOUND	FOUND	FOUND
Score	0/6	2/6	2/6	6/6

Performance at a glance¶

WASM suffix array vs mnemonist

FM-index query scaling in JavaScript/WASM

Install¶

PythonJavaScript / WASMRustCLI

pip install fast-suffix-array

npm install fast-suffix-array

[dependencies]
fast-suffix-array-core = "0.1"

cargo install fast-suffix-array-core --features cli

Platforms: Linux (x86_64, aarch64), macOS (x86_64, Apple Silicon), Windows (x86_64), WASM (any browser/runtime).

Quick start¶

PythonJavaScript / WASMCLIRust

# FM-index: build once, query in O(p log sigma) time
from fast_suffix_array import FMIndex

index = FMIndex(open("genome.fasta", "rb").read())
index.count(b"GATTACA")        # O(p log sigma) -- independent of text length
index.locate(b"GATTACA")       # numpy array of all positions

# Find all repeated phrases in a text
import fast_suffix_array

results = fast_suffix_array.find_duplicates(
    "the quick brown fox jumped over the lazy dog. "
    "the quick brown fox sat on the mat.",
    min_words=3,
)

for r in results:
    print(f'{r["count"]}x  ({r["number_of_words"]} words)  "{r["phrase"]}"')

// FM-index: compressed full-text index with O(p log sigma) queries
const { WasmFMIndex } = require('fast-suffix-array');
const index = new WasmFMIndex(text, 2);
index.count('pattern');     // O(p log sigma) time
index.locate('pattern');    // Uint32Array of positions
index.heapSize();            // ~1.8x text size (vs 5-9x for suffix array)

// Drop-in replacement for mnemonist (20x faster)
const SuffixArray = require('fast-suffix-array/suffix-array');
const sa = new SuffixArray('banana');
sa.array;   // [5, 3, 1, 0, 4, 2]

# Find duplicate phrases in a file
fsa scan manuscript.md --top 20

# JSON output for scripting
fsa scan manuscript.md --json

# Compare two files (longest common substring)
fsa lcs chapter1.md chapter2.md

# File statistics (word count, distinct substrings)
fsa info manuscript.md

use fast_suffix_array_core::duplicate_finder::process_test_string;

let results = process_test_string("", text, 4, 9, 50, 2, true);
for r in &results {
    println!("{}x  ({} words)  \"{}\"", r.count, r.number_of_words, r.phrase);
}

Headline benchmarks¶

All benchmarks on Apple Silicon. See detailed benchmarks for full tables and methodology.

Metric	vs	Context
1,650x faster	Python `str.count()`	FM-index, 100 patterns on 10MB text
55,040x faster	Python `bytes.count()`	FM-index, 19 DNA motifs on 50MB genome
27,008x faster	pandas `str.count()`	FM-index, 50 patterns on 500K rows
1,074x faster	JS `String.indexOf()`	WASM FM-index, 100 patterns on 10MB
721x faster	modern-ahocorasick	WASM FM-index, 100 patterns on 10MB
111x faster	ahocorasick-rs	FM-index, 200 patterns on 10MB
489x faster	pylcs	Longest common substring, 10KB
20x faster	mnemonist	WASM SA construction, 1M chars
64-68% smaller	suffix array + text	FM-index memory: ~1.8x text vs 5-9x
125 queries	break-even vs str.count()	On 1MB text; fewer for larger texts

Feature overview¶

Capability	What it does	Complexity	Status
Suffix array	O(n) SA-IS, 3 backends (pure Rust, divsufsort, libsais)	O(n)	Only WASM SA on npm. 20x faster than mnemonist.
FM-index	Compressed full-text search, count/locate, self-index	O(p log sigma) count	Only FM-index on npm. On PyPI alongside math-hiyoko.
Duplicate detection	Word-boundary phrases, Aho-Corasick dedup, block detection	O(n * w)	Only library for this in any language.
String analysis	LCS, distinct substrings, longest repeated, LPF	O(n+m) / O(n)	489x faster than pylcs for LCS.
BWT	Forward/inverse Burrows-Wheeler transform	O(n)	Enables compression pipelines.
Tokenization	Sentence splitting, word boundaries, 50+ abbreviations	O(n)	CJK, Devanagari, Arabic, Ethiopic, Myanmar.

Learn more¶

Benchmarks -- full benchmark tables for every comparison
API Reference -- complete API for Python, JS/WASM, CLI, and Rust
Guide -- when to use what, and how the algorithms work
Algorithm Deep Dive -- SA-IS, Kasai's LCP, Aho-Corasick dedup, and the 166× optimization story
Use Cases -- bioinformatics, browser search, writing tools, data pipelines, and more

License¶

MIT