Skip to content

fast-suffix-array

Suffix arrays for every platform: Rust, Python, JavaScript/WASM, and the command line.

Crates.io npm PyPI License: MIT


The suffix array is the foundational data structure behind full-text search, data compression, bioinformatics, and duplicate detection. This library provides O(n) suffix array construction in pure Rust, compiling to both native code and WebAssembly.

A single pip install or npm install gives you suffix arrays, LCP arrays, FM-index, BWT, longest common substring, duplicate phrase detection, LPF arrays, and sentence splitting. Ships as a Rust crate (fast-suffix-array-core), Python package (fast-suffix-array via PyO3), npm/WASM package (fast-suffix-array), and CLI tool (fsa).

It is the only WASM suffix array on npm (12-20x faster than mnemonist) and the only WASM FM-index on npm (1,074x faster than indexOf for multi-pattern search). On PyPI, it sits alongside math-hiyoko's fm-index as one of two actively maintained FM-indexes, differentiating on breadth: SA, LCP, BWT, LCS, LPF, duplicate detection, and WASM browser support in a single package. It is also the only library in any language for word-boundary-aware duplicate phrase detection.

The substring gap

Suffix-array-based search finds arbitrary substrings that word-tokenizing search libraries miss. Every popular JavaScript search library -- fuse.js (8.4M downloads/week), lunr (5.6M), minisearch (900K) -- tokenizes text into words. They answer "which documents contain these words?" but cannot find substrings like "rown fox" or "ATTA" (a DNA motif inside GATTACA).

Query fuse.js lunr minisearch FM-index
"rown fox" MISS FOUND FOUND FOUND
"ATTA" MISS MISS MISS FOUND
"Script dev" MISS MISS MISS FOUND
"ows-Whe" MISS MISS MISS FOUND
"ix arr" MISS MISS MISS FOUND
"uick brown f" MISS FOUND FOUND FOUND
Score 0/6 2/6 2/6 6/6

Performance at a glance

WASM suffix array vs mnemonist

FM-index query scaling in JavaScript/WASM

Install

pip install fast-suffix-array
npm install fast-suffix-array
[dependencies]
fast-suffix-array-core = "0.1"
cargo install fast-suffix-array-core --features cli

Platforms: Linux (x86_64, aarch64), macOS (x86_64, Apple Silicon), Windows (x86_64), WASM (any browser/runtime).

Quick start

# FM-index: build once, query in O(p log sigma) time
from fast_suffix_array import FMIndex

index = FMIndex(open("genome.fasta", "rb").read())
index.count(b"GATTACA")        # O(p log sigma) -- independent of text length
index.locate(b"GATTACA")       # numpy array of all positions
# Find all repeated phrases in a text
import fast_suffix_array

results = fast_suffix_array.find_duplicates(
    "the quick brown fox jumped over the lazy dog. "
    "the quick brown fox sat on the mat.",
    min_words=3,
)

for r in results:
    print(f'{r["count"]}x  ({r["number_of_words"]} words)  "{r["phrase"]}"')
// FM-index: compressed full-text index with O(p log sigma) queries
const { WasmFMIndex } = require('fast-suffix-array');
const index = new WasmFMIndex(text, 2);
index.count('pattern');     // O(p log sigma) time
index.locate('pattern');    // Uint32Array of positions
index.heapSize();            // ~1.8x text size (vs 5-9x for suffix array)
// Drop-in replacement for mnemonist (20x faster)
const SuffixArray = require('fast-suffix-array/suffix-array');
const sa = new SuffixArray('banana');
sa.array;   // [5, 3, 1, 0, 4, 2]
# Find duplicate phrases in a file
fsa scan manuscript.md --top 20

# JSON output for scripting
fsa scan manuscript.md --json

# Compare two files (longest common substring)
fsa lcs chapter1.md chapter2.md

# File statistics (word count, distinct substrings)
fsa info manuscript.md
use fast_suffix_array_core::duplicate_finder::process_test_string;

let results = process_test_string("", text, 4, 9, 50, 2, true);
for r in &results {
    println!("{}x  ({} words)  \"{}\"", r.count, r.number_of_words, r.phrase);
}

Headline benchmarks

All benchmarks on Apple Silicon. See detailed benchmarks for full tables and methodology.

Metric vs Context
1,650x faster Python str.count() FM-index, 100 patterns on 10MB text
55,040x faster Python bytes.count() FM-index, 19 DNA motifs on 50MB genome
27,008x faster pandas str.count() FM-index, 50 patterns on 500K rows
1,074x faster JS String.indexOf() WASM FM-index, 100 patterns on 10MB
721x faster modern-ahocorasick WASM FM-index, 100 patterns on 10MB
111x faster ahocorasick-rs FM-index, 200 patterns on 10MB
489x faster pylcs Longest common substring, 10KB
20x faster mnemonist WASM SA construction, 1M chars
64-68% smaller suffix array + text FM-index memory: ~1.8x text vs 5-9x
125 queries break-even vs str.count() On 1MB text; fewer for larger texts

Feature overview

Capability What it does Complexity Status
Suffix array O(n) SA-IS, 3 backends (pure Rust, divsufsort, libsais) O(n) Only WASM SA on npm. 20x faster than mnemonist.
FM-index Compressed full-text search, count/locate, self-index O(p log sigma) count Only FM-index on npm. On PyPI alongside math-hiyoko.
Duplicate detection Word-boundary phrases, Aho-Corasick dedup, block detection O(n * w) Only library for this in any language.
String analysis LCS, distinct substrings, longest repeated, LPF O(n+m) / O(n) 489x faster than pylcs for LCS.
BWT Forward/inverse Burrows-Wheeler transform O(n) Enables compression pipelines.
Tokenization Sentence splitting, word boundaries, 50+ abbreviations O(n) CJK, Devanagari, Arabic, Ethiopic, Myanmar.

Learn more

  • Benchmarks -- full benchmark tables for every comparison
  • API Reference -- complete API for Python, JS/WASM, CLI, and Rust
  • Guide -- when to use what, and how the algorithms work
  • Algorithm Deep Dive -- SA-IS, Kasai's LCP, Aho-Corasick dedup, and the 166× optimization story
  • Use Cases -- bioinformatics, browser search, writing tools, data pipelines, and more

License

MIT