API Reference
Getting started with the FM-index
Python
Build an index once, then run as many queries as you like in O(p log sigma) time:
from fast_suffix_array import FMIndex
# Index accepts bytes — encode your text first
data = open("corpus.txt", "rb").read()
index = FMIndex(data)
# Count occurrences (returns int)
n = index.count(b"search term")
# Locate all positions (returns numpy array of u64)
positions = index.locate(b"search term")
# Check index size
print(f"Index uses {index.heap_size():,} bytes for {len(index):,} bytes of text")
JavaScript / WASM (Node.js)
const { WasmFMIndex } = require('fast-suffix-array');
const text = 'your corpus text here...';
const index = new WasmFMIndex(text, 2); // 2 = locate sampling level
console.log(index.count('pattern')); // number
console.log(index.locate('pattern')); // Uint32Array
console.log(`${index.heapSize()} bytes`);
JavaScript / WASM (Browser / ESM)
<script type="module">
import init, { WasmFMIndex } from 'fast-suffix-array';
await init(); // fetches .wasm from same directory
const index = new WasmFMIndex(text, 2);
console.log(index.count('pattern'));
</script>
The init() function accepts:
- No argument: fetches
.wasm from the same directory via import.meta.url
string or URL: fetches from the given URL
Response or Promise<Response>: uses streaming instantiation
ArrayBuffer or Uint8Array: compiles from raw bytes
WebAssembly.Module: instantiates directly
// Obsidian plugin example — provide the WASM bytes yourself
import init, { WasmFMIndex } from 'fast-suffix-array';
const wasmBytes = await this.app.vault.adapter.readBinary('plugins/my-plugin/fast_suffix_array_bg.wasm');
await init(wasmBytes);
Node.js vs Browser
The npm package ships dual entry points. Node.js (require) loads the WASM synchronously via fs.readFileSync. Browser/ESM (import) requires calling init() first. The explicit fast-suffix-array/browser export is also available for bundlers that don't resolve the import condition.
Python
FMIndex class
from fast_suffix_array import FMIndex
index = FMIndex(data: bytes, locate_level=2)
| Method |
Returns |
Description |
index.count(pattern: bytes) |
int |
Count occurrences in O(p log sigma) time |
index.locate(pattern: bytes) |
numpy.ndarray (u64) |
All occurrence positions |
len(index) |
int |
Text length |
index.heap_size() |
int |
Bytes used by index |
Duplicate detection
find_duplicates(
text,
min_words=4,
min_chars=9,
max_words=50,
min_words_in_substring=2,
enable_block_detection=False,
)
# -> list of dicts: {phrase, count, string_length, number_of_words, occurrences}
String analysis
longest_common_substring(text1, text2) # -> str
distinct_substring_count(text) # -> int
Core suffix array operations
suffix_array(data: bytes, backend=None) # -> numpy.ndarray (u64)
lcp_array(data: bytes) # -> numpy.ndarray (u64)
count(data: bytes, pattern: bytes) # -> int
locate(data: bytes, pattern: bytes) # -> numpy.ndarray (u64)
search_range(data: bytes, pattern: bytes)# -> tuple (lo, hi)
bwt(data: bytes) # -> (bytes, int)
inverse_bwt(bwt: bytes, primary_index: int) # -> bytes
inverse_suffix_array(data: bytes) # -> numpy.ndarray (u64)
Serialization
sa_to_bytes(data: bytes) # -> bytes
sa_from_bytes(data: bytes) # -> numpy.ndarray (u64)
JavaScript / WASM
WasmFMIndex class
const { WasmFMIndex } = require('fast-suffix-array');
const index = new WasmFMIndex(text, locateLevel);
| Method / Property |
Returns |
Description |
index.count('pattern') |
number |
Count occurrences in O(p log sigma) time |
index.locate('pattern') |
Uint32Array |
All occurrence positions |
WasmFMIndex.fromBytes(uint8Array, locateLevel) |
WasmFMIndex |
Construct from raw bytes |
index.length |
number |
Text length |
index.heapSize() |
number |
Bytes used by index |
Duplicate detection
process_test_string(
sanitized, // sanitized_text (empty string = use source_text)
source, // source text
minWords, // min phrase length (words)
minChars, // min string length (chars)
maxWords, // max phrase length (words)
minSubWords, // min words in substring
blocks, // enable block detection (boolean)
)
String analysis
longest_common_substring(text1, text2) // -> string
distinct_substring_count(text) // -> number
longest_repeated_substring(text) // -> string
Suffix array (UTF-16 for JS compat)
suffix_array_u16(text) // -> Uint32Array
lcp_array_u16(text) // -> Uint32Array
search_pattern(text, pattern) // -> Uint32Array (positions)
Binary data (Uint8Array in, Uint32Array out)
suffix_array_bytes(data)
lcp_array_bytes(data)
search_pattern_bytes(data, pattern)
count_pattern_bytes(data, pattern)
burrows_wheeler_transform(text) // -> string
bwt_bytes(data) // -> Uint8Array
bwt_with_index(data) // -> Uint8Array (use with bwt_primary_index for roundtrip)
bwt_primary_index(data) // -> number
inverse_bwt_bytes(bwt, primaryIndex) // -> Uint8Array
BWT roundtrip
To reconstruct original data from its BWT, use bwt_primary_index to get the index needed by inverse_bwt_bytes:
const idx = bwt_primary_index(data);
const bwt = bwt_bytes(data);
const original = inverse_bwt_bytes(bwt, idx);
Mnemonist-compatible class
const SuffixArray = require('fast-suffix-array/suffix-array');
const sa = new SuffixArray('banana');
sa.array; // [5, 3, 1, 0, 4, 2]
sa.length; // 6
const GeneralizedSuffixArray = SuffixArray.GeneralizedSuffixArray;
const gsa = new GeneralizedSuffixArray(['a', 'b']);
gsa.longestCommonSubsequence();
CLI
fsa scan <file> [--min-words 4] [--min-chars 9] [--max-words 50] [--min-count 2] [--top N] [--json] [--no-blocks] [--quiet]
fsa search <file> <pattern> [--bytes] [--json] [--backend auto|sais|divsufsort|libsais]
fsa count <file> <pattern> [--bytes] [--backend auto|sais|divsufsort|libsais]
fsa lcs <file1> <file2>
fsa info <file> [--json]
fsa sa <file> [--bytes] [--backend auto|sais|divsufsort|libsais] [--json]
fsa lcp <file> [--bytes] [--backend auto|sais|divsufsort|libsais] [--json]
fsa bwt <file> [--bytes]
| Subcommand |
Description |
scan |
Find duplicate phrases in a file |
search |
Search for a pattern and return positions |
count |
Count occurrences of a pattern |
lcs |
Longest common substring between two files |
info |
File statistics (word count, distinct substrings) |
sa |
Build and display suffix array |
lcp |
Build and display LCP array |
bwt |
Compute Burrows-Wheeler transform |
Rust
[dependencies]
fast-suffix-array-core = "0.1"
| Module |
Key exports |
suffix_array |
build_suffix_array, build_lcp_array, sa_count, sa_locate, burrows_wheeler_transform_bytes, inverse_bwt, longest_common_substring_str, distinct_substring_count_str, SaBackend |
fm_index |
FMIndexBuilder, FMIndexFull, FMIndexCountOnly, FMIndexSearchable (requires fm-index-backend feature) |
duplicate_finder |
process_test_string |
sentence_splitter |
split_into_sentences, sentence_boundary_offsets |
tokenizer |
tokenize, word_start_offsets, word_end_offsets |
types |
PhraseResult, PhraseData, Occurrence |
interval_merger |
merge_intervals — merge overlapping character ranges (used internally by duplicate detection) |
Core suffix array operations
use fast_suffix_array_core::suffix_array;
let text = "banana";
let sa = suffix_array::build_suffix_array(text);
let lcp = suffix_array::build_lcp_array(text, &sa);
// Pattern search — needs pre-built SA
let count = suffix_array::sa_count(text.as_bytes(), &sa, b"ana");
let positions = suffix_array::sa_locate(text.as_bytes(), &sa, b"ana");
// Binary data works too
let data: &[u8] = b"\xDE\xAD\xBE\xEF";
let sa_bytes = suffix_array::build_suffix_array_bytes(data);
Backend selection
use fast_suffix_array_core::suffix_array::SaBackend;
// Use a specific backend (requires the corresponding feature)
let sa = suffix_array::build_suffix_array_with_backend(text, SaBackend::DivSufSort);
// Auto picks the fastest available backend
let sa = suffix_array::build_suffix_array_with_backend(text, SaBackend::Auto);
let bwt = suffix_array::burrows_wheeler_transform_bytes(text.as_bytes(), &sa);
// Use _with_index variant to get the primary index needed for inversion
let (bwt, idx) = suffix_array::burrows_wheeler_transform_bytes_with_index(text.as_bytes(), &sa);
let original = suffix_array::inverse_bwt(&bwt, idx);
FM-Index (requires fm-index-backend feature)
use fast_suffix_array_core::fm_index::{FMIndexBuilder, FMIndexSearchable};
let index = FMIndexBuilder::from_text("your corpus text")
.locate_level(2)
.build() // returns Result<FMIndexFull, FMIndexError>
.unwrap();
let count = index.count(b"pattern"); // via FMIndexSearchable trait
let positions = index.locate(b"pattern"); // Vec<usize>
String analysis
let lcs = suffix_array::longest_common_substring_str("abcdef", "cdefgh");
let distinct = suffix_array::distinct_substring_count_str("banana");
Cargo features
| Feature |
Description |
fm-index-backend |
FM-index support: compressed full-text index with O(p log sigma) pattern matching. ~1.8x text size in memory. |
divsufsort-backend |
~1.3-2.5x faster SA construction. Wraps libdivsufsort. No OpenMP. < 2 GB limit. Not WASM-compatible. Enabled by default. |
libsais-backend |
~2-4x faster SA construction. Wraps libsais. Requires OpenMP. < 2 GB limit. Not WASM-compatible. |
generate-expected |
Generate expected test output files. Adds serde_json dependency. |
cli |
Build the fsa CLI tool |
Compatibility
| Platform |
Minimum version |
Notes |
| Python |
3.9+ |
Wheels for Linux (x86_64, aarch64), macOS (x86_64, ARM), Windows (x86_64) |
| Node.js |
18+ |
WASM module, works in any runtime with WebAssembly support |
| Browsers |
Chrome 57+, Firefox 52+, Safari 11+, Edge 16+ |
Any browser with WebAssembly support |
| Rust |
1.91+ |
MSRV for the core crate |
Error behavior
| Scenario |
Python |
JavaScript |
| Empty input |
Returns 0 / empty array |
Returns 0 / empty Uint32Array |
| Pattern longer than text |
Returns 0 / empty array |
Returns 0 / empty Uint32Array |
| Non-UTF8 bytes (Python) |
Works fine — operates on raw bytes |
N/A (JS strings are UTF-16) |
| Empty pattern (FMIndex) |
Returns text length / all positions |
Returns text length / all positions |
| Empty pattern (SA-based) |
Returns 0 / empty array |
Returns 0 / empty Uint32Array |