Skip to content

API Reference

Getting started with the FM-index

Python

Build an index once, then run as many queries as you like in O(p log sigma) time:

from fast_suffix_array import FMIndex

# Index accepts bytes — encode your text first
data = open("corpus.txt", "rb").read()
index = FMIndex(data)

# Count occurrences (returns int)
n = index.count(b"search term")

# Locate all positions (returns numpy array of u64)
positions = index.locate(b"search term")

# Check index size
print(f"Index uses {index.heap_size():,} bytes for {len(index):,} bytes of text")

JavaScript / WASM (Node.js)

const { WasmFMIndex } = require('fast-suffix-array');

const text = 'your corpus text here...';
const index = new WasmFMIndex(text, 2);  // 2 = locate sampling level

console.log(index.count('pattern'));      // number
console.log(index.locate('pattern'));     // Uint32Array
console.log(`${index.heapSize()} bytes`);

JavaScript / WASM (Browser / ESM)

<script type="module">
import init, { WasmFMIndex } from 'fast-suffix-array';

await init();  // fetches .wasm from same directory
const index = new WasmFMIndex(text, 2);
console.log(index.count('pattern'));
</script>

The init() function accepts:

  • No argument: fetches .wasm from the same directory via import.meta.url
  • string or URL: fetches from the given URL
  • Response or Promise<Response>: uses streaming instantiation
  • ArrayBuffer or Uint8Array: compiles from raw bytes
  • WebAssembly.Module: instantiates directly
// Obsidian plugin example — provide the WASM bytes yourself
import init, { WasmFMIndex } from 'fast-suffix-array';
const wasmBytes = await this.app.vault.adapter.readBinary('plugins/my-plugin/fast_suffix_array_bg.wasm');
await init(wasmBytes);

Node.js vs Browser

The npm package ships dual entry points. Node.js (require) loads the WASM synchronously via fs.readFileSync. Browser/ESM (import) requires calling init() first. The explicit fast-suffix-array/browser export is also available for bundlers that don't resolve the import condition.


Python

FMIndex class

from fast_suffix_array import FMIndex

index = FMIndex(data: bytes, locate_level=2)
Method Returns Description
index.count(pattern: bytes) int Count occurrences in O(p log sigma) time
index.locate(pattern: bytes) numpy.ndarray (u64) All occurrence positions
len(index) int Text length
index.heap_size() int Bytes used by index

Duplicate detection

find_duplicates(
    text,
    min_words=4,
    min_chars=9,
    max_words=50,
    min_words_in_substring=2,
    enable_block_detection=False,
)
# -> list of dicts: {phrase, count, string_length, number_of_words, occurrences}

String analysis

longest_common_substring(text1, text2)   # -> str
distinct_substring_count(text)           # -> int

Core suffix array operations

suffix_array(data: bytes, backend=None)  # -> numpy.ndarray (u64)
lcp_array(data: bytes)                   # -> numpy.ndarray (u64)
count(data: bytes, pattern: bytes)       # -> int
locate(data: bytes, pattern: bytes)      # -> numpy.ndarray (u64)
search_range(data: bytes, pattern: bytes)# -> tuple (lo, hi)

Transforms

bwt(data: bytes)                         # -> (bytes, int)
inverse_bwt(bwt: bytes, primary_index: int) # -> bytes
inverse_suffix_array(data: bytes)        # -> numpy.ndarray (u64)

Serialization

sa_to_bytes(data: bytes)                 # -> bytes
sa_from_bytes(data: bytes)               # -> numpy.ndarray (u64)

JavaScript / WASM

WasmFMIndex class

const { WasmFMIndex } = require('fast-suffix-array');

const index = new WasmFMIndex(text, locateLevel);
Method / Property Returns Description
index.count('pattern') number Count occurrences in O(p log sigma) time
index.locate('pattern') Uint32Array All occurrence positions
WasmFMIndex.fromBytes(uint8Array, locateLevel) WasmFMIndex Construct from raw bytes
index.length number Text length
index.heapSize() number Bytes used by index

Duplicate detection

process_test_string(
  sanitized,   // sanitized_text (empty string = use source_text)
  source,      // source text
  minWords,    // min phrase length (words)
  minChars,    // min string length (chars)
  maxWords,    // max phrase length (words)
  minSubWords, // min words in substring
  blocks,      // enable block detection (boolean)
)

String analysis

longest_common_substring(text1, text2)   // -> string
distinct_substring_count(text)           // -> number
longest_repeated_substring(text)         // -> string

Suffix array (UTF-16 for JS compat)

suffix_array_u16(text)                   // -> Uint32Array
lcp_array_u16(text)                      // -> Uint32Array
search_pattern(text, pattern)            // -> Uint32Array (positions)

Binary data (Uint8Array in, Uint32Array out)

suffix_array_bytes(data)
lcp_array_bytes(data)
search_pattern_bytes(data, pattern)
count_pattern_bytes(data, pattern)

Transforms

burrows_wheeler_transform(text)          // -> string
bwt_bytes(data)                          // -> Uint8Array
bwt_with_index(data)                     // -> Uint8Array (use with bwt_primary_index for roundtrip)
bwt_primary_index(data)                  // -> number
inverse_bwt_bytes(bwt, primaryIndex)     // -> Uint8Array

BWT roundtrip

To reconstruct original data from its BWT, use bwt_primary_index to get the index needed by inverse_bwt_bytes:

const idx = bwt_primary_index(data);
const bwt = bwt_bytes(data);
const original = inverse_bwt_bytes(bwt, idx);

Mnemonist-compatible class

const SuffixArray = require('fast-suffix-array/suffix-array');

const sa = new SuffixArray('banana');
sa.array;   // [5, 3, 1, 0, 4, 2]
sa.length;  // 6

const GeneralizedSuffixArray = SuffixArray.GeneralizedSuffixArray;
const gsa = new GeneralizedSuffixArray(['a', 'b']);
gsa.longestCommonSubsequence();

CLI

fsa scan <file> [--min-words 4] [--min-chars 9] [--max-words 50] [--min-count 2] [--top N] [--json] [--no-blocks] [--quiet]
fsa search <file> <pattern> [--bytes] [--json] [--backend auto|sais|divsufsort|libsais]
fsa count <file> <pattern> [--bytes] [--backend auto|sais|divsufsort|libsais]
fsa lcs <file1> <file2>
fsa info <file> [--json]
fsa sa <file> [--bytes] [--backend auto|sais|divsufsort|libsais] [--json]
fsa lcp <file> [--bytes] [--backend auto|sais|divsufsort|libsais] [--json]
fsa bwt <file> [--bytes]
Subcommand Description
scan Find duplicate phrases in a file
search Search for a pattern and return positions
count Count occurrences of a pattern
lcs Longest common substring between two files
info File statistics (word count, distinct substrings)
sa Build and display suffix array
lcp Build and display LCP array
bwt Compute Burrows-Wheeler transform

Rust

[dependencies]
fast-suffix-array-core = "0.1"
Module Key exports
suffix_array build_suffix_array, build_lcp_array, sa_count, sa_locate, burrows_wheeler_transform_bytes, inverse_bwt, longest_common_substring_str, distinct_substring_count_str, SaBackend
fm_index FMIndexBuilder, FMIndexFull, FMIndexCountOnly, FMIndexSearchable (requires fm-index-backend feature)
duplicate_finder process_test_string
sentence_splitter split_into_sentences, sentence_boundary_offsets
tokenizer tokenize, word_start_offsets, word_end_offsets
types PhraseResult, PhraseData, Occurrence
interval_merger merge_intervals — merge overlapping character ranges (used internally by duplicate detection)

Core suffix array operations

use fast_suffix_array_core::suffix_array;

let text = "banana";
let sa = suffix_array::build_suffix_array(text);
let lcp = suffix_array::build_lcp_array(text, &sa);

// Pattern search — needs pre-built SA
let count = suffix_array::sa_count(text.as_bytes(), &sa, b"ana");
let positions = suffix_array::sa_locate(text.as_bytes(), &sa, b"ana");

// Binary data works too
let data: &[u8] = b"\xDE\xAD\xBE\xEF";
let sa_bytes = suffix_array::build_suffix_array_bytes(data);

Backend selection

use fast_suffix_array_core::suffix_array::SaBackend;

// Use a specific backend (requires the corresponding feature)
let sa = suffix_array::build_suffix_array_with_backend(text, SaBackend::DivSufSort);

// Auto picks the fastest available backend
let sa = suffix_array::build_suffix_array_with_backend(text, SaBackend::Auto);

Burrows-Wheeler transform

let bwt = suffix_array::burrows_wheeler_transform_bytes(text.as_bytes(), &sa);

// Use _with_index variant to get the primary index needed for inversion
let (bwt, idx) = suffix_array::burrows_wheeler_transform_bytes_with_index(text.as_bytes(), &sa);
let original = suffix_array::inverse_bwt(&bwt, idx);

FM-Index (requires fm-index-backend feature)

use fast_suffix_array_core::fm_index::{FMIndexBuilder, FMIndexSearchable};

let index = FMIndexBuilder::from_text("your corpus text")
    .locate_level(2)
    .build()       // returns Result<FMIndexFull, FMIndexError>
    .unwrap();

let count = index.count(b"pattern");       // via FMIndexSearchable trait
let positions = index.locate(b"pattern");  // Vec<usize>

String analysis

let lcs = suffix_array::longest_common_substring_str("abcdef", "cdefgh");
let distinct = suffix_array::distinct_substring_count_str("banana");

Cargo features

Feature Description
fm-index-backend FM-index support: compressed full-text index with O(p log sigma) pattern matching. ~1.8x text size in memory.
divsufsort-backend ~1.3-2.5x faster SA construction. Wraps libdivsufsort. No OpenMP. < 2 GB limit. Not WASM-compatible. Enabled by default.
libsais-backend ~2-4x faster SA construction. Wraps libsais. Requires OpenMP. < 2 GB limit. Not WASM-compatible.
generate-expected Generate expected test output files. Adds serde_json dependency.
cli Build the fsa CLI tool

Compatibility

Platform Minimum version Notes
Python 3.9+ Wheels for Linux (x86_64, aarch64), macOS (x86_64, ARM), Windows (x86_64)
Node.js 18+ WASM module, works in any runtime with WebAssembly support
Browsers Chrome 57+, Firefox 52+, Safari 11+, Edge 16+ Any browser with WebAssembly support
Rust 1.91+ MSRV for the core crate

Error behavior

Scenario Python JavaScript
Empty input Returns 0 / empty array Returns 0 / empty Uint32Array
Pattern longer than text Returns 0 / empty array Returns 0 / empty Uint32Array
Non-UTF8 bytes (Python) Works fine — operates on raw bytes N/A (JS strings are UTF-16)
Empty pattern (FMIndex) Returns text length / all positions Returns text length / all positions
Empty pattern (SA-based) Returns 0 / empty array Returns 0 / empty Uint32Array