Use Cases¶
Bioinformatics¶
Index a genome once, then search thousands of motifs in O(p log sigma) time each. 55,040x faster than bytes.count() on a 50MB genome. Viable for client-side genome search in the browser: the FM-index fits in WASM memory where a full suffix array doesn't.
In practice, Python bioinformaticians shell out to C tools like BWA and Bowtie2 for indexed genome search. This library brings FM-index capability natively into Python without external tool dependencies.
from fast_suffix_array import FMIndex
genome = open("genome.fasta", "rb").read()
index = FMIndex(genome)
motifs = [b"GAATTC", b"GGATCC", b"AAGCTT", b"TATAAA", b"CACGTG"]
for motif in motifs:
    print(f"{motif.decode()}: {index.count(motif)} occurrences")
Client-side search (browser)¶
The only WASM FM-index and suffix array on npm. Enables offline full-text substring search in local-first apps, documentation sites, and note-taking apps. FM-index uses ~1.8x text size (vs 5-9x for SA), fitting in browser memory budgets.
const { WasmFMIndex } = require('fast-suffix-array');
// Build index from page content
const index = new WasmFMIndex(allPageText, 2);
// Instant substring search -- no server required
const positions = index.locate(userQuery);
Writing and editing¶
Find unintentional repeated phrases in manuscripts, articles, and documentation. Plugin-ready: output includes UTF-16 offsets for VSCode and Obsidian.
import fast_suffix_array
results = fast_suffix_array.find_duplicates(
    manuscript_text,
    min_words=3,
)
for r in results:
    print(f'{r["count"]}x ({r["number_of_words"]} words) "{r["phrase"]}"')
Editor plugins¶
Real-time duplicate highlighting in dedicated editor extensions:
- concordance-vscode -- VSCode extension (coming soon)
- concordance-obsidian -- Obsidian plugin (coming soon)
Both use the WASM backend for instant in-editor duplicate detection. Repositories are not yet publicly available.
Data pipelines (pandas replacement)¶
Replace pandas str.contains() / str.count() loops with an FM-index for a 26-27,000x speedup on multi-pattern search.
from fast_suffix_array import FMIndex
import pandas as pd
# Concatenate all text once
corpus = "\n".join(df["text_column"]).encode()
index = FMIndex(corpus)
# Each pattern query: microseconds instead of seconds
for pattern in patterns:
    count = index.count(pattern.encode())
LLM training data dedup¶
The same O(n) suffix array algorithm Google uses to deduplicate training corpora ("Deduplicating Training Data Makes Language Models Better", 2022). Their approach needed 600GB+ RAM for C4. This library targets the small-to-medium data use case -- manuscripts, documentation, note collections -- running in milliseconds on a laptop.
import fast_suffix_array
# Find all repeated passages in a training corpus
duplicates = fast_suffix_array.find_duplicates(
    corpus_text,
    min_words=5,
    enable_block_detection=True,
)
Plagiarism detection¶
Find the longest common substring between two documents, or list every repeated phrase with its positions.
import fast_suffix_array
# Find the longest shared passage between two documents
lcs = fast_suffix_array.longest_common_substring(doc_a, doc_b)
print(f"Longest shared passage ({len(lcs)} chars): {lcs[:100]}...")
Data compression¶
Building blocks for compression pipelines:
- BWT for bzip2-style pipelines
- FM-index as a compressed self-index (discard the original text, query from the index)
- LPF array for Lempel-Ziv factorization
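The BWT falls straight out of a suffix array: it is the character preceding each suffix in sorted order. A minimal pure-Python sketch (naive O(n^2 log n) construction for illustration only; the library builds the suffix array in linear time):

```python
def bwt_via_suffix_array(text: bytes) -> bytes:
    s = text + b"\x00"  # unique sentinel, smaller than any byte in text
    # naive suffix array: sort suffix start positions by suffix content
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # BWT = character just before each suffix (wrapping at position 0)
    return bytes(s[i - 1] for i in sa)

print(bwt_via_suffix_array(b"banana"))  # b'annb\x00aa'
```

The transform groups characters with similar right-contexts together (note the run of a's), which is why move-to-front plus entropy coding compresses the result so well in bzip2-style pipelines.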