Algorithm Deep Dive¶
How the duplicate detection pipeline finds every repeated phrase in a document in O(n) time — and why the naive approach doesn't scale.
For the high-level overview, see the Guide. For benchmark data, see Benchmarks.
The problem¶
Given a document of n words, find every phrase (sequence of consecutive words) that appears more than once. Report its text, frequency, and all occurrence positions.
Naive approach: Generate all n-grams for window sizes 2..50, count each in a hash map, filter duplicates. This is O(n × m) where m is the window size, and produces millions of candidates that must be deduplicated pairwise — an O(P²) step that took 178 seconds on the King James Bible.
Our approach: Build a suffix array and LCP array (both O(n)), walk the LCP array once to discover only the phrases that actually repeat, then deduplicate with an Aho-Corasick automaton. Total: 1.26 seconds on the same text — a 166× speedup.
Pipeline overview¶
Input Text (UTF-8)
│
▼
┌─────────────────────────────────────┐
│ 1. Suffix Array (SA-IS) O(n) │
│ 2. LCP Array (Kasai) O(n) │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 3. Boundary Precomputation O(n) │
│ word starts, ends, sentences │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 4. LCP Walk O(n × w) │
│ phrase discovery at boundaries │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 5. Aho-Corasick Dedup O(P × L) │
│ 6. Interval Merging O(P log P) │
│ 7. Block Detection (opt.) O(n) │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ 8. Byte → UTF-16 Conversion O(n) │
└─────────────────────────────────────┘
│
▼
Vec<PhraseResult>
Where n = text length in bytes, w = max phrase length in words, P = distinct phrase count, L = average phrase length.
Stage 1: Suffix array (SA-IS)¶
A suffix array lists the starting positions of all suffixes of a text, sorted by the lexicographic order of those suffixes.
For text "banana", the suffixes sorted lexicographically are:
SA[0] → "a" (position 5)
SA[1] → "ana" (position 3)
SA[2] → "anana" (position 1)
SA[3] → "banana" (position 0)
SA[4] → "na" (position 4)
SA[5] → "nana" (position 2)
This lets you find all occurrences of any substring via binary search in O(|pattern| × log n) time. More importantly, it enables the LCP array (Stage 2), which is the key to efficient phrase discovery.
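The sorted order above can be reproduced with a naive construction: collect all suffix start positions and sort them by the suffixes they denote. This sketch is O(n² log n) in the worst case and is for illustration only; the pipeline uses linear-time SA-IS, described next.

```rust
/// Naive suffix array: sort start positions by their suffixes.
/// Illustration only; the real pipeline uses linear-time SA-IS.
fn naive_suffix_array(text: &[u8]) -> Vec<usize> {
    let mut sa: Vec<usize> = (0..text.len()).collect();
    sa.sort_by_key(|&i| &text[i..]);
    sa
}

fn main() {
    // Matches the table above: "a", "ana", "anana", "banana", "na", "nana".
    assert_eq!(naive_suffix_array(b"banana"), vec![5, 3, 1, 0, 4, 2]);
}
```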
SA-IS algorithm¶
We use SA-IS (Nong, Zhang & Chan, 2009), which constructs the suffix array in O(n) time and O(n) space:
1. Classify each suffix as S-type (smaller than its right neighbor) or L-type (larger). Scan right-to-left in one pass.
2. Find LMS positions — S-type positions whose left neighbor is L-type. These are the "seams" that anchor the recursive decomposition.
3. Induced sort — place LMS suffixes in their character buckets, then propagate the sorted order to L-type suffixes (left-to-right scan) and S-type suffixes (right-to-left scan).
4. Recurse — if any LMS substrings have duplicate names, build a reduced string and sort it recursively. Each level halves the problem (≤ n/2 LMS positions), giving total work n + n/2 + n/4 + … = O(n). If all names are unique, skip the recursion entirely.
5. Final induced sort — using the recursively sorted LMS order, repeat the induced sort to produce the final suffix array.
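The first two steps (type classification and LMS detection) are simple enough to sketch. The fragment below is an illustration under a virtual-sentinel convention, not the pipeline's implementation:

```rust
/// Suffix type: S (suffix smaller than its right neighbor) or L (larger).
#[derive(Clone, Copy, PartialEq, Debug)]
enum T {
    S,
    L,
}

/// Steps 1-2 of SA-IS: classify suffix types (with a virtual sentinel at
/// index n) and collect LMS positions. Illustration only.
fn classify(text: &[u8]) -> (Vec<T>, Vec<usize>) {
    let n = text.len();
    let mut t = vec![T::S; n + 1]; // t[n] is the sentinel, S-type by definition
    for i in (0..n).rev() {
        t[i] = if i == n - 1 {
            T::L // every real character sorts after the sentinel
        } else if text[i] < text[i + 1] {
            T::S
        } else if text[i] > text[i + 1] {
            T::L
        } else {
            t[i + 1] // equal characters inherit the type to their right
        };
    }
    // LMS: S-type positions whose left neighbor is L-type.
    let lms: Vec<usize> = (1..=n)
        .filter(|&i| t[i] == T::S && t[i - 1] == T::L)
        .collect();
    (t, lms)
}

fn main() {
    let (types, lms) = classify(b"banana");
    assert_eq!(types[0], T::L); // "banana" > "anana"
    assert_eq!(lms, vec![1, 3, 6]); // the two "a…" seams plus the sentinel
}
```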
Multiple backends¶
The pure Rust SA-IS implementation works everywhere, including WASM. For native builds, optional C backends provide additional speed:
| Backend | Speed | Limit | Availability |
|---|---|---|---|
| SA-IS (default) | baseline | unlimited | always (pure Rust) |
| divsufsort | ~2.5× faster | 2 GB | divsufsort-backend feature |
| libsais | ~6–8× faster | 2 GB | libsais-backend feature |
The Auto backend (default) selects the fastest available option at runtime,
falling back to SA-IS if no C backends are compiled in or the input exceeds
2 GB.
Binary data support
Both build_suffix_array() (UTF-8 text) and build_suffix_array_bytes()
(arbitrary &[u8]) are available. The byte variant works on DNA sequences,
network packets, binary files — anything.
Stage 2: LCP array (Kasai's algorithm)¶
The LCP array stores the length of the longest common prefix between consecutive entries in the sorted suffix array:
SA[0] → "a" LCP[0] = 0 (no predecessor)
SA[1] → "ana" LCP[1] = 1 (shares "a" with "a")
SA[2] → "anana" LCP[2] = 3 (shares "ana" with "ana")
SA[3] → "banana" LCP[3] = 0 (shares nothing with "anana")
SA[4] → "na" LCP[4] = 0 (shares nothing with "banana")
SA[5] → "nana" LCP[5] = 2 (shares "na" with "na")
Together with the suffix array, the LCP array identifies all repeated
substrings without generating them: if LCP[i] ≥ k, then the first k
characters of text[SA[i-1]..] and text[SA[i]..] are identical — a
repeated substring of length k.
Key insight¶
Kasai's algorithm computes the full LCP array in O(n) time by exploiting
a monotonicity property: if suffix at position i shares k characters with
its sorted predecessor, then suffix at position i+1 shares at least k-1.
This means each comparison starts from k-1 instead of 0, and the total
character comparisons across all iterations is bounded by O(n).
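Kasai's recurrence fits in a few lines; the following is an illustrative implementation (the carried-over counter `h` drops by at most 1 per iteration, which bounds total work at O(n)):

```rust
/// Kasai's algorithm: LCP of each suffix with its sorted predecessor.
/// Illustrative implementation, not the pipeline's code.
fn kasai_lcp(text: &[u8], sa: &[usize]) -> Vec<usize> {
    let n = text.len();
    let mut rank = vec![0usize; n];
    for (r, &pos) in sa.iter().enumerate() {
        rank[pos] = r; // inverse of the suffix array
    }
    let mut lcp = vec![0usize; n];
    let mut h = 0usize; // carried-over match length
    for i in 0..n {
        if rank[i] == 0 {
            h = 0; // first suffix in sorted order has no predecessor
            continue;
        }
        let j = sa[rank[i] - 1]; // sorted predecessor of suffix i
        // The first h characters are guaranteed to match; extend from there.
        while i + h < n && j + h < n && text[i + h] == text[j + h] {
            h += 1;
        }
        lcp[rank[i]] = h;
        h = h.saturating_sub(1); // dropping the first character loses at most 1
    }
    lcp
}

fn main() {
    // "banana" with SA = [5, 3, 1, 0, 4, 2] yields the LCP array shown above.
    assert_eq!(kasai_lcp(b"banana", &[5, 3, 1, 0, 4, 2]), vec![0, 1, 3, 0, 0, 2]);
}
```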
Stage 3: Boundary detection¶
Before the LCP walk, three sorted arrays of byte offsets are precomputed in O(n) time:
Word boundaries¶
A "word byte" is defined as ASCII alphanumeric (a-z, A-Z, 0-9) or
non-ASCII (≥ 128, i.e., UTF-8 multi-byte sequences). Everything else is a
delimiter.
- Word starts: positions where a word byte is preceded by a non-word byte (or string start)
- Word ends: positions after a word byte that is followed by a non-word byte (or string end)
This ensures "text." and "text " produce the same word endpoint, so
trailing punctuation doesn't affect phrase matching.
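These definitions translate directly into code; a minimal sketch (function names here are illustrative, not the library's API):

```rust
/// A "word byte" per the definition above: ASCII alphanumeric or any
/// non-ASCII byte (part of a UTF-8 multi-byte sequence).
fn is_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b >= 128
}

/// Offsets where a word byte follows a non-word byte (or string start).
fn word_starts(text: &[u8]) -> Vec<usize> {
    (0..text.len())
        .filter(|&i| is_word_byte(text[i]) && (i == 0 || !is_word_byte(text[i - 1])))
        .collect()
}

/// Offsets just past a word byte followed by a non-word byte (or string end).
fn word_ends(text: &[u8]) -> Vec<usize> {
    (0..text.len())
        .filter(|&i| is_word_byte(text[i]) && (i + 1 == text.len() || !is_word_byte(text[i + 1])))
        .map(|i| i + 1)
        .collect()
}

fn main() {
    // "text." and "text " both yield the word span [0, 4).
    assert_eq!(word_starts(b"text. more"), vec![0, 6]);
    assert_eq!(word_ends(b"text. more"), vec![4, 10]);
}
```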
Sentence boundaries¶
A rule-based splitter identifies sentence boundaries, handling:
- Abbreviations: 100+ entries (Mr., Dr., Prof., Jan., Ave., etc.)
- Multi-period abbreviations: U.S.A., Ph.D., e.g., i.e., etc.
- Decimal numbers: `3.14` is not a sentence break
- URLs and emails: `http://` and `user@domain.com` are not breaks
- File extensions: `.txt`, `.pdf`, `.jpg`, etc.
- CJK terminators: U+3002 (。), U+FF1F (?), U+FF01 (!)
- Other scripts: Arabic (U+061F), Devanagari (U+0964), Ethiopic, Myanmar
Newlines are always boundaries — phrases never span lines.
All three arrays support O(log n) lookup via binary search during the LCP walk.
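Those lookups reduce to standard binary searches on the sorted arrays; a sketch with hypothetical helper names:

```rust
/// Word-alignment check: is `pos` an entry in the sorted word-starts array?
/// (Hypothetical helper for illustration.)
fn is_word_aligned(word_starts: &[usize], pos: usize) -> bool {
    word_starts.binary_search(&pos).is_ok()
}

/// First sentence boundary at or after `pos`, which caps phrase extension.
fn next_sentence_bound(sentence_bounds: &[usize], pos: usize) -> Option<usize> {
    let idx = sentence_bounds.partition_point(|&b| b < pos);
    sentence_bounds.get(idx).copied()
}

fn main() {
    let starts = [0, 6, 11];
    let bounds = [10, 25];
    assert!(is_word_aligned(&starts, 6));
    assert!(!is_word_aligned(&starts, 7)); // mid-word position, skipped
    assert_eq!(next_sentence_bound(&bounds, 12), Some(25));
}
```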
Stage 4: LCP walk — phrase discovery¶
This is the core of the algorithm. Instead of generating all n-grams and counting them, we walk the LCP array once to discover only the phrases that actually repeat.
Algorithm¶
1. Compute minimum byte threshold. A phrase with `min_phrase_length` words needs at least `min_string_length + min_phrase_length - 1` bytes (accounting for inter-word spaces).
2. Find LCP runs. Walk `LCP[1..n]`, grouping consecutive entries where `LCP[i] ≥ min_byte_len`. Each run represents suffixes in sorted order that share a long common prefix — candidates for repeated phrases.
3. For each suffix position in a run:
    - Word-boundary filter: binary search the word-starts array to verify the position is word-aligned. Skip mid-word positions.
    - Sentence scope: binary search the sentence-bounds array to cap the phrase at the next sentence boundary.
    - Extract phrases: iterate over word-end positions from `min_phrase_length` to `max_phrase_length` words. For each:
        - Extract the substring and normalize whitespace
        - Check minimum string length (non-space characters)
        - Insert into `HashMap<phrase, positions>`
4. Filter. Deduplicate positions, discard entries with fewer than 2 occurrences, sort by length descending.
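The run grouping in step 2 can be sketched as follows (illustrative; `min_len` stands for the computed byte threshold):

```rust
/// Group consecutive LCP entries >= min_len into runs. Each returned pair
/// is an inclusive range of suffix-array indices sharing a long prefix.
fn lcp_runs(lcp: &[usize], min_len: usize) -> Vec<(usize, usize)> {
    let mut runs = Vec::new();
    let mut start: Option<usize> = None;
    for i in 1..lcp.len() {
        if lcp[i] >= min_len {
            // LCP[i] relates SA[i-1] and SA[i], so the run starts at i-1.
            if start.is_none() {
                start = Some(i - 1);
            }
        } else if let Some(s) = start.take() {
            runs.push((s, i - 1));
        }
    }
    if let Some(s) = start {
        runs.push((s, lcp.len() - 1));
    }
    runs
}

fn main() {
    // LCP array for "banana": [0, 1, 3, 0, 0, 2]. With a 2-byte threshold,
    // the runs are "ana"/"anana" (SA[1..=2]) and "na"/"nana" (SA[4..=5]).
    assert_eq!(lcp_runs(&[0, 1, 3, 0, 0, 2], 2), vec![(1, 2), (4, 5)]);
}
```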
Why this is better than n-gram counting
The naive approach generates all possible n-grams and counts them. On the King James Bible, that means millions of unique strings in a hash map. The LCP walk only visits suffixes that actually share a prefix — it skips the vast majority of the text where no repetition exists.
Stage 5: Aho-Corasick deduplication¶
The problem¶
If "the quick brown fox" appears 3 times, then "quick brown fox",
"the quick brown", and "quick brown" each also appear ≥ 3 times. These
shorter phrases are redundant — they add no information beyond what the
longer phrase already captures.
Rule: remove a shorter phrase if it is a substring of a longer phrase with equal or higher count. Keep it if its count is higher — that means it appears independently outside the containing contexts.
Solution¶
Build an Aho-Corasick automaton from all discovered phrases. The automaton is a trie with failure links that enables simultaneous matching of all patterns in a single pass.
For each phrase, search its text against the automaton to find all shorter phrases it contains. Mark those for removal if their count ≤ the container's count.
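The removal rule itself is easy to state in code. The sketch below applies it with naive substring checks so it stays self-contained; the pipeline's Aho-Corasick pass computes the same result at scale:

```rust
/// Apply the dedup rule: drop a phrase if it is a substring of a longer
/// phrase with an equal or higher count. Naive checks for illustration;
/// the real pipeline uses an Aho-Corasick automaton.
/// Input: (phrase, occurrence count) pairs.
fn dedup<'a>(phrases: &[(&'a str, usize)]) -> Vec<&'a str> {
    let mut kept = Vec::new();
    for &(p, count) in phrases {
        let redundant = phrases
            .iter()
            .any(|&(q, qc)| q.len() > p.len() && q.contains(p) && qc >= count);
        if !redundant {
            kept.push(p);
        }
    }
    kept
}

fn main() {
    let phrases = [("the quick brown fox", 3), ("quick brown", 3), ("quick", 5)];
    // "quick brown" never occurs outside the longer phrase (equal counts),
    // so it is dropped; "quick" appears independently (5 > 3) and survives.
    assert_eq!(dedup(&phrases), vec!["the quick brown fox", "quick"]);
}
```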
Optimization impact
On the King James Bible (114,993 candidate phrases):
| Approach | Time | Operations |
|---|---|---|
| Pairwise `.contains()` | 178 seconds | ~13 billion comparisons |
| Aho-Corasick automaton | 165 ms | single-pass matching |
| Speedup | 1,078× | |
Stage 6: Interval merging¶
The problem¶
After deduplication, adjacent phrases that always co-occur should merge. If
"hello" and "world" both appear at the same positions with the same
count, and "world" always starts right after "hello" ends, they should be
reported as "hello world".
Merge condition¶
Two phrases A and B can merge if:
- Same occurrence count
- Same number of occurrence positions
- At every corresponding position: B starts within 0–2 bytes after A ends (touching, or separated by a space/punctuation character)
Algorithm¶
1. Sort by (count descending, first start position ascending). This groups phrases with equal counts together, and within each group, spatially adjacent phrases become neighbors in the sorted order.
2. Single linear pass: greedily chain merges (A+B, then (A+B)+C, etc.).
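The merge condition for one pair can be sketched like this (`Phrase` is a hypothetical stand-in for the internal result type; equal occurrence counts reduce to equal position-list lengths here):

```rust
/// Hypothetical stand-in for the internal phrase result type.
struct Phrase {
    starts: Vec<usize>, // occurrence start offsets, in document order
    ends: Vec<usize>,   // matching end offsets
}

/// Merge check: same number of occurrences, and at every corresponding
/// position B starts within 0-2 bytes after A ends (touching, or separated
/// by a space/punctuation character).
fn can_merge(a: &Phrase, b: &Phrase) -> bool {
    a.starts.len() == b.starts.len()
        && a.ends
            .iter()
            .zip(&b.starts)
            .all(|(&e, &s)| s >= e && s - e <= 2)
}

fn main() {
    // "hello" and "world" separated by one space at both occurrences.
    let hello = Phrase { starts: vec![0, 20], ends: vec![5, 25] };
    let world = Phrase { starts: vec![6, 26], ends: vec![11, 31] };
    assert!(can_merge(&hello, &world));
}
```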
Optimization impact
On the King James Bible (46,468 phrases after dedup):
| Approach | Time |
|---|---|
| O(P²) nested loop with repeated passes | 37.8 seconds |
| Sort + linear scan | 7.9 ms |
| Speedup | 4,785× |
Stage 7: Block detection¶
Phrases are capped at max_phrase_length words (typically 50). Duplicated
paragraphs or entire sections longer than this are missed by the phrase walk.
Block detection reuses the same suffix array and LCP array to find maximal repeated substrings — the longest common prefix per LCP group, without word-boundary alignment:
- Walk the LCP array with run grouping (same technique as Stage 4)
- Emit only the longest shared prefix per group (byte-length bounded between `max_phrase_length × 4` and `max_phrase_length × 100`)
- Validate UTF-8 character boundaries (the LCP value is in bytes and can split multi-byte characters)
- Remove shorter substrings fully covered by longer ones via sorted range lookup
After block detection, any n-gram phrases whose every occurrence falls within a detected block are removed as redundant.
UTF-8 safety
LCP values are in bytes. Two multi-byte characters can share leading bytes
(e.g., U+2013 '–' and U+2014 '—' both start with 0xE2 0x80), causing
the LCP to end mid-character. The implementation rounds down to the nearest
valid character boundary using floor_char_boundary().
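On stable Rust the same rounding can be written with `is_char_boundary()`; a sketch:

```rust
/// Round a byte offset down to the nearest UTF-8 character boundary.
/// Stable-Rust sketch of what `floor_char_boundary()` does.
fn floor_boundary(s: &str, mut i: usize) -> usize {
    if i >= s.len() {
        return s.len();
    }
    while !s.is_char_boundary(i) {
        i -= 1; // offset 0 is always a boundary, so this terminates
    }
    i
}

fn main() {
    // U+2013 and U+2014 share the prefix 0xE2 0x80, so a 2-byte LCP lands
    // mid-character; round it down to offset 0.
    let s = "\u{2013}\u{2014}"; // "–—", 3 bytes each
    assert_eq!(floor_boundary(s, 2), 0);
    assert_eq!(floor_boundary(s, 3), 3);
}
```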
Stage 8: Byte → UTF-16 offset conversion¶
Internal processing uses byte offsets (required for Rust string slicing).
JavaScript strings and VS Code's positionAt() expect UTF-16 code unit
offsets.
A single O(n) pass builds a lookup table: for each byte offset, the cumulative UTF-16 code unit count. Characters in the Basic Multilingual Plane (U+0000–U+FFFF) contribute 1 code unit; characters above U+FFFF (emoji, mathematical symbols) contribute 2 (a surrogate pair).
The string_length field is also converted from bytes to character count.
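The table construction can be sketched as follows (illustrative, not the extension's code):

```rust
/// Map each byte offset (0..=len) to the cumulative UTF-16 code unit count.
fn utf16_offsets(text: &str) -> Vec<usize> {
    let mut table = vec![0usize; text.len() + 1];
    let mut units = 0;
    for (start, ch) in text.char_indices() {
        // Every byte inside a character maps to the count at its start.
        for b in start..start + ch.len_utf8() {
            table[b] = units;
        }
        units += ch.len_utf16(); // 1 for BMP, 2 for a surrogate pair
    }
    table[text.len()] = units;
    table
}

fn main() {
    // 'a' (1 byte, 1 unit), U+1F600 (4 bytes, 2 units), then 'b'.
    let table = utf16_offsets("a\u{1F600}b");
    assert_eq!(table[1], 1); // byte offset of the emoji: 1 UTF-16 unit so far
    assert_eq!(table[5], 3); // byte offset of 'b': 1 + 2 units
}
```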
Performance¶
| Stage | Time Complexity | Bible (824K words) |
|---|---|---|
| 1. Suffix array | O(n) | 277 ms |
| 2. LCP array | O(n) | 79 ms |
| 3. Boundaries | O(n) | 202 ms |
| 4. LCP walk | O(n × w) | 545 ms |
| 5. Aho-Corasick dedup | O(P × L) | 165 ms |
| 6. Interval merge | O(P log P) | 8 ms |
| 7. Block detection | O(n) | < 1 ms |
| 8. UTF-16 conversion | O(n) | < 1 ms |
| Total | | ~1.26 s |
Representative end-to-end benchmarks (native Rust, release build, Apple M1):
| Document | Words | Time | Phrases |
|---|---|---|---|
| Pride and Prejudice | 130K | 95 ms | 1,558 |
| King James Bible | 824K | 1.26 s | 46,468 |
See Benchmarks for WASM performance, competitor comparisons, and scaling data.
Optimization history¶
Two targeted changes to post-processing steps yielded a 166× overall speedup. The suffix array construction, LCP computation, and phrase walk were already efficient (< 1.1s combined) and required no optimization.
| Step | Before | After | Speedup |
|---|---|---|---|
| Substring dedup | 178 s (O(P²) pairwise `.contains()`) | 165 ms (Aho-Corasick) | 1,078× |
| Interval merge | 37.8 s (O(P²) nested loop) | 7.9 ms (sort + linear scan) | 4,785× |
| Total pipeline | 215 s | 1.26 s | 166× |
References¶
- G. Nong, S. Zhang, W. H. Chan. "Two Efficient Algorithms for Linear Time Suffix Array Construction." IEEE Transactions on Computers, 60(10): 1471–1484, 2011.
- T. Kasai, G. Lee, H. Arimura, S. Arikawa, K. Park. "Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications." Proceedings of CPM, LNCS 2089, pp. 181–192, 2001.
- A. V. Aho, M. J. Corasick. "Efficient String Matching: An Aid to Bibliographic Search." Communications of the ACM, 18(6): 333–340, 1975.