fast-suffix-array¶
Suffix arrays for every platform: Rust, Python, JavaScript/WASM, and the command line.
The suffix array is the foundational data structure behind full-text search, data compression, bioinformatics, and duplicate detection. This library provides O(n) suffix array construction in pure Rust, compiling to both native code and WebAssembly.
A single pip install or npm install gives you suffix arrays, LCP arrays, FM-index, BWT, longest common substring, duplicate phrase detection, LPF arrays, and sentence splitting. Ships as a Rust crate (fast-suffix-array-core), Python package (fast-suffix-array via PyO3), npm/WASM package (fast-suffix-array), and CLI tool (fsa).
It is the only WASM suffix array on npm (12-20x faster than mnemonist) and the only WASM FM-index on npm (1,074x faster than indexOf for multi-pattern search). On PyPI, it sits alongside math-hiyoko's fm-index as one of two actively maintained FM-indexes, differentiating on breadth: SA, LCP, BWT, LCS, LPF, duplicate detection, and WASM browser support in a single package. It is also the only library in any language for word-boundary-aware duplicate phrase detection.
The substring gap¶
Suffix-array-based search finds arbitrary substrings that word-tokenizing search libraries miss. Every popular JavaScript search library -- fuse.js (8.4M downloads/week), lunr (5.6M), minisearch (900K) -- tokenizes text into words. They answer "which documents contain these words?" but cannot find substrings like "rown fox" or "ATTA" (a DNA motif inside GATTACA).
| Query | fuse.js | lunr | minisearch | FM-index |
|---|---|---|---|---|
"rown fox" |
MISS | FOUND | FOUND | FOUND |
"ATTA" |
MISS | MISS | MISS | FOUND |
"Script dev" |
MISS | MISS | MISS | FOUND |
"ows-Whe" |
MISS | MISS | MISS | FOUND |
"ix arr" |
MISS | MISS | MISS | FOUND |
"uick brown f" |
MISS | FOUND | FOUND | FOUND |
| Score | 0/6 | 2/6 | 2/6 | 6/6 |
Performance at a glance¶
Install¶
Platforms: Linux (x86_64, aarch64), macOS (x86_64, Apple Silicon), Windows (x86_64), WASM (any browser/runtime).
Quick start¶
// FM-index: compressed full-text index with O(p log sigma) queries
const { WasmFMIndex } = require('fast-suffix-array');
const index = new WasmFMIndex(text, 2);
index.count('pattern'); // O(p log sigma) time
index.locate('pattern'); // Uint32Array of positions
index.heapSize(); // ~1.8x text size (vs 5-9x for suffix array)
Headline benchmarks¶
All benchmarks on Apple Silicon. See detailed benchmarks for full tables and methodology.
| Metric | vs | Context |
|---|---|---|
| 1,650x faster | Python str.count() |
FM-index, 100 patterns on 10MB text |
| 55,040x faster | Python bytes.count() |
FM-index, 19 DNA motifs on 50MB genome |
| 27,008x faster | pandas str.count() |
FM-index, 50 patterns on 500K rows |
| 1,074x faster | JS String.indexOf() |
WASM FM-index, 100 patterns on 10MB |
| 721x faster | modern-ahocorasick | WASM FM-index, 100 patterns on 10MB |
| 111x faster | ahocorasick-rs | FM-index, 200 patterns on 10MB |
| 489x faster | pylcs | Longest common substring, 10KB |
| 20x faster | mnemonist | WASM SA construction, 1M chars |
| 64-68% smaller | suffix array + text | FM-index memory: ~1.8x text vs 5-9x |
| 125 queries | break-even vs str.count() | On 1MB text; fewer for larger texts |
Feature overview¶
| Capability | What it does | Complexity | Status |
|---|---|---|---|
| Suffix array | O(n) SA-IS, 3 backends (pure Rust, divsufsort, libsais) | O(n) | Only WASM SA on npm. 20x faster than mnemonist. |
| FM-index | Compressed full-text search, count/locate, self-index | O(p log sigma) count | Only FM-index on npm. On PyPI alongside math-hiyoko. |
| Duplicate detection | Word-boundary phrases, Aho-Corasick dedup, block detection | O(n * w) | Only library for this in any language. |
| String analysis | LCS, distinct substrings, longest repeated, LPF | O(n+m) / O(n) | 489x faster than pylcs for LCS. |
| BWT | Forward/inverse Burrows-Wheeler transform | O(n) | Enables compression pipelines. |
| Tokenization | Sentence splitting, word boundaries, 50+ abbreviations | O(n) | CJK, Devanagari, Arabic, Ethiopic, Myanmar. |
Learn more¶
- Benchmarks -- full benchmark tables for every comparison
- API Reference -- complete API for Python, JS/WASM, CLI, and Rust
- Guide -- when to use what, and how the algorithms work
- Algorithm Deep Dive -- SA-IS, Kasai's LCP, Aho-Corasick dedup, and the 166× optimization story
- Use Cases -- bioinformatics, browser search, writing tools, data pipelines, and more
License¶
MIT