Algorithm Deep Dive

How the duplicate detection pipeline finds every repeated phrase in a document in O(n) time — and why the naive approach doesn't scale.

For the high-level overview, see the Guide. For benchmark data, see Benchmarks.


The problem

Given a document of n words, find every phrase (sequence of consecutive words) that appears more than once. Report its text, frequency, and all occurrence positions.

Naive approach: Generate all n-grams for window sizes 2..50, count each in a hash map, and filter for duplicates. This is O(n × m), where m is the maximum window size, and it produces millions of candidates that must then be deduplicated pairwise — an O(P²) step that took 178 seconds on the King James Bible.

Our approach: Build a suffix array and LCP array (both O(n)), walk the LCP array once to discover only the phrases that actually repeat, then deduplicate with an Aho-Corasick automaton. Total: 1.26 seconds on the same text — a 166× speedup.


Pipeline overview

Input Text (UTF-8)
┌─────────────────────────────────────┐
│ 1. Suffix Array (SA-IS)       O(n)  │
│ 2. LCP Array (Kasai)          O(n)  │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ 3. Boundary Precomputation    O(n)  │
│    word starts, ends, sentences     │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ 4. LCP Walk                O(n × w) │
│    phrase discovery at boundaries   │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ 5. Aho-Corasick Dedup      O(P × L) │
│ 6. Interval Merging      O(P log P) │
│ 7. Block Detection (opt.)     O(n)  │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ 8. Byte → UTF-16 Conversion   O(n)  │
└─────────────────────────────────────┘
  Vec<PhraseResult>

Where n = text length in bytes, w = max phrase length in words, P = distinct phrase count, L = average phrase length.


Stage 1: Suffix array (SA-IS)

A suffix array is a sorted permutation of all suffix starting positions. For text "banana", the suffixes sorted lexicographically are:

SA[0] → "a"           (position 5)
SA[1] → "ana"         (position 3)
SA[2] → "anana"       (position 1)
SA[3] → "banana"      (position 0)
SA[4] → "na"          (position 4)
SA[5] → "nana"        (position 2)

This lets you find all occurrences of any substring via binary search in O(|pattern| × log n) time. More importantly, it enables the LCP array (Stage 2), which is the key to efficient phrase discovery.
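The binary search can be sketched with a deliberately naive construction — sorting suffix slices directly, O(n² log n), fine for illustration but not for real input — plus `slice::partition_point` for the lower and upper bounds:

```rust
/// Illustrative only: a naive O(n² log n) construction, not SA-IS.
fn naive_suffix_array(text: &str) -> Vec<usize> {
    let mut sa: Vec<usize> = (0..text.len()).collect();
    sa.sort_by(|&a, &b| text[a..].cmp(&text[b..]));
    sa
}

/// All occurrences of `pat`, via binary search over the sorted suffixes.
/// In sorted order the suffixes fall into three contiguous groups:
/// those < pat, those starting with pat, and the rest.
fn find_occurrences(text: &str, sa: &[usize], pat: &str) -> Vec<usize> {
    let lo = sa.partition_point(|&i| &text[i..] < pat);
    let hi = sa.partition_point(|&i| &text[i..] < pat || text[i..].starts_with(pat));
    let mut occ = sa[lo..hi].to_vec();
    occ.sort_unstable();
    occ
}

fn main() {
    let text = "banana";
    let sa = naive_suffix_array(text);
    assert_eq!(sa, vec![5, 3, 1, 0, 4, 2]); // matches the table above
    assert_eq!(find_occurrences(text, &sa, "ana"), vec![1, 3]);
    assert_eq!(find_occurrences(text, &sa, "na"), vec![2, 4]);
}
```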

SA-IS algorithm

We use SA-IS (Nong, Zhang & Chan, 2009), which constructs the suffix array in O(n) time and O(n) space:

  1. Classify each suffix as S-type (smaller than its right neighbor) or L-type (larger). Scan right-to-left in one pass.

  2. Find LMS positions — S-type positions whose left neighbor is L-type. These are the "seams" that anchor the recursive decomposition.

  3. Induced sort — place LMS suffixes in their character buckets, then propagate the sorted order to L-type suffixes (left-to-right scan) and S-type suffixes (right-to-left scan).

  4. Recurse — if any LMS substrings have duplicate names, build a reduced string and sort it recursively. Each level halves the problem (≤ n/2 LMS positions), giving total work n + n/2 + n/4 + … = O(n). If all names are unique, skip the recursion entirely.

  5. Final induced sort — using the recursively sorted LMS order, repeat the induced sort to produce the final suffix array.
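Steps 1 and 2 are the cheap part and fit in a few lines; this is an illustrative fragment, not the production implementation:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum SuffixType {
    S, // suffix is lexicographically smaller than its right neighbor
    L, // suffix is larger
}

/// Step 1: classify every suffix in one right-to-left pass.
/// With the implicit empty sentinel at the end, the final suffix is
/// always L-type (any non-empty suffix is larger than the sentinel).
fn classify(text: &[u8]) -> Vec<SuffixType> {
    let n = text.len();
    let mut types = vec![SuffixType::S; n];
    if n == 0 {
        return types;
    }
    types[n - 1] = SuffixType::L;
    for i in (0..n - 1).rev() {
        types[i] = if text[i] < text[i + 1]
            || (text[i] == text[i + 1] && types[i + 1] == SuffixType::S)
        {
            SuffixType::S
        } else {
            SuffixType::L
        };
    }
    types
}

/// Step 2: LMS positions — S-type positions with an L-type left neighbor.
fn lms_positions(types: &[SuffixType]) -> Vec<usize> {
    (1..types.len())
        .filter(|&i| types[i] == SuffixType::S && types[i - 1] == SuffixType::L)
        .collect()
}

fn main() {
    use SuffixType::{L, S};
    let types = classify(b"banana");
    assert_eq!(types, vec![L, S, L, S, L, L]);
    assert_eq!(lms_positions(&types), vec![1, 3]); // the two "a…" seams
}
```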

Multiple backends

The pure Rust SA-IS implementation works everywhere, including WASM. For native builds, optional C backends provide additional speed:

Backend          Speed         Limit      Availability
SA-IS (default)  baseline      unlimited  always (pure Rust)
divsufsort       ~2.5× faster  2 GB       divsufsort-backend feature
libsais          ~6–8× faster  2 GB       libsais-backend feature

The Auto backend (default) selects the fastest available option at runtime, falling back to SA-IS if no C backends are compiled in or the input exceeds 2 GB.

Binary data support

Both build_suffix_array() (UTF-8 text) and build_suffix_array_bytes() (arbitrary &[u8]) are available. The byte variant works on DNA sequences, network packets, binary files — anything.


Stage 2: LCP array (Kasai's algorithm)

The LCP array stores the length of the longest common prefix between consecutive entries in the sorted suffix array:

SA[0] → "a"           LCP[0] = 0  (no predecessor)
SA[1] → "ana"         LCP[1] = 1  (shares "a" with "a")
SA[2] → "anana"       LCP[2] = 3  (shares "ana" with "ana")
SA[3] → "banana"      LCP[3] = 0  (shares nothing with "anana")
SA[4] → "na"          LCP[4] = 0  (shares nothing with "banana")
SA[5] → "nana"        LCP[5] = 2  (shares "na" with "na")

Together with the suffix array, the LCP array identifies all repeated substrings without generating them: if LCP[i] ≥ k, then the first k characters of text[SA[i-1]..] and text[SA[i]..] are identical — a repeated substring of length k.

Key insight

Kasai's algorithm computes the full LCP array in O(n) time by exploiting a monotonicity property: if suffix at position i shares k characters with its sorted predecessor, then suffix at position i+1 shares at least k-1. This means each comparison starts from k-1 instead of 0, and the total character comparisons across all iterations is bounded by O(n).
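A compact sketch of Kasai's algorithm, checked against the "banana" tables above (illustrative; the production version operates on the pipeline's own array types):

```rust
/// Kasai's O(n) LCP construction.
/// lcp[i] = length of the common prefix of text[sa[i-1]..] and text[sa[i]..].
fn kasai_lcp(text: &[u8], sa: &[usize]) -> Vec<usize> {
    let n = text.len();
    // rank[p] = position of suffix p in the sorted order (inverse of sa)
    let mut rank = vec![0usize; n];
    for (i, &p) in sa.iter().enumerate() {
        rank[p] = i;
    }
    let mut lcp = vec![0usize; n];
    let mut k = 0; // carried over: the next suffix shares at least k-1 chars
    for i in 0..n {
        if rank[i] == 0 {
            k = 0; // first entry has no predecessor
            continue;
        }
        let j = sa[rank[i] - 1]; // sorted predecessor of suffix i
        while i + k < n && j + k < n && text[i + k] == text[j + k] {
            k += 1;
        }
        lcp[rank[i]] = k;
        k = k.saturating_sub(1); // monotonicity: drop at most one
    }
    lcp
}

fn main() {
    let sa = vec![5, 3, 1, 0, 4, 2]; // suffix array of "banana"
    assert_eq!(kasai_lcp(b"banana", &sa), vec![0, 1, 3, 0, 0, 2]);
}
```

Because `k` only decreases by one per iteration and is bounded by n, the total work of the inner `while` loop is O(n) across the entire pass.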


Stage 3: Boundary detection

Before the LCP walk, three sorted arrays of byte offsets are precomputed in O(n) time:

Word boundaries

A "word byte" is defined as ASCII alphanumeric (a-z, A-Z, 0-9) or non-ASCII (≥ 128, i.e., UTF-8 multi-byte sequences). Everything else is a delimiter.

  • Word starts: positions of a word byte whose predecessor is a non-word byte (or the string start)
  • Word ends: positions immediately after a word byte whose successor is a non-word byte (or the string end)

This ensures "text." and "text " produce the same word endpoint, so trailing punctuation doesn't affect phrase matching.
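A minimal sketch of the boundary scan under this definition (illustrative, not the production code):

```rust
/// A "word byte" per the definition above: ASCII alphanumeric, or any
/// byte >= 128 (i.e., part of a UTF-8 multi-byte sequence).
fn is_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b >= 128
}

/// One O(n) pass producing sorted byte offsets of word starts and
/// word ends (exclusive).
fn word_boundaries(text: &[u8]) -> (Vec<usize>, Vec<usize>) {
    let mut starts = Vec::new();
    let mut ends = Vec::new();
    for i in 0..text.len() {
        let word = is_word_byte(text[i]);
        let prev_word = i > 0 && is_word_byte(text[i - 1]);
        if word && !prev_word {
            starts.push(i); // word byte preceded by delimiter or string start
        }
        if !word && prev_word {
            ends.push(i); // delimiter right after a word byte
        }
    }
    if text.last().map_or(false, |&b| is_word_byte(b)) {
        ends.push(text.len()); // word runs to the end of the string
    }
    (starts, ends)
}

fn main() {
    let (starts, ends) = word_boundaries(b"the cat.");
    assert_eq!(starts, vec![0, 4]);
    assert_eq!(ends, vec![3, 7]);
    // "text." and "text " produce the same word endpoint:
    assert_eq!(word_boundaries(b"text.").1, vec![4]);
    assert_eq!(word_boundaries(b"text ").1, vec![4]);
}
```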

Sentence boundaries

A rule-based splitter identifies sentence boundaries, handling:

  • Abbreviations: 100+ entries (Mr., Dr., Prof., Jan., Ave., etc.)
  • Multi-period abbreviations: U.S.A., Ph.D., e.g., i.e., etc.
  • Decimal numbers: 3.14 is not a sentence break
  • URLs and emails: http:// and user@domain.com are not breaks
  • File extensions: .txt, .pdf, .jpg, etc.
  • CJK terminators: U+3002 (。), U+FF1F (？), U+FF01 (！)
  • Other scripts: Arabic (U+061F), Devanagari (U+0964), Ethiopic, Myanmar

Newlines are always boundaries — phrases never span lines.

All three arrays support O(log n) lookup via binary search during the LCP walk.


Stage 4: LCP walk — phrase discovery

This is the core of the algorithm. Instead of generating all n-grams and counting them, we walk the LCP array once to discover only the phrases that actually repeat.

Algorithm

  1. Compute minimum byte threshold. A phrase with min_phrase_length words needs at least min_string_length + min_phrase_length - 1 bytes (accounting for inter-word spaces).

  2. Find LCP runs. Walk LCP[1..n], grouping consecutive entries where LCP[i] ≥ min_byte_len. Each run represents suffixes in sorted order that share a long common prefix — candidates for repeated phrases.

  3. For each suffix position in a run:

    • Word-boundary filter: binary search the word-starts array to verify the position is word-aligned. Skip mid-word positions.
    • Sentence scope: binary search the sentence-bounds array to cap the phrase at the next sentence boundary.
    • Extract phrases: iterate over word-end positions from min_phrase_length to max_phrase_length words. For each:
      • Extract the substring and normalize whitespace
      • Check minimum string length (non-space characters)
      • Insert into HashMap<phrase, positions>
  4. Filter. Deduplicate positions, discard entries with fewer than 2 occurrences, sort by length descending.
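The run grouping in step 2 can be sketched over the "banana" arrays from Stages 1–2 — a simplified fragment with the boundary filters and phrase extraction omitted:

```rust
/// Step 2 of the walk, simplified: group consecutive LCP entries that are
/// >= `min_byte_len` into runs. Each run yields the text positions of
/// suffixes sharing a prefix of at least `min_byte_len` bytes.
fn lcp_runs(sa: &[usize], lcp: &[usize], min_byte_len: usize) -> Vec<Vec<usize>> {
    let mut runs = Vec::new();
    let mut i = 1; // lcp[0] has no predecessor
    while i < lcp.len() {
        if lcp[i] >= min_byte_len {
            let start = i;
            while i < lcp.len() && lcp[i] >= min_byte_len {
                i += 1;
            }
            // lcp[start..i] links suffixes sa[start-1..i] together
            runs.push(sa[start - 1..i].to_vec());
        } else {
            i += 1;
        }
    }
    runs
}

fn main() {
    // Suffix array and LCP array of "banana"
    let sa = [5, 3, 1, 0, 4, 2];
    let lcp = [0, 1, 3, 0, 0, 2];
    // Threshold 2: one run for the repeated "ana" (positions 3, 1),
    // one for the repeated "na" (positions 4, 2).
    assert_eq!(lcp_runs(&sa, &lcp, 2), vec![vec![3, 1], vec![4, 2]]);
}
```

Everything the walk never visits — LCP entries below the threshold — costs O(1), which is why sparse repetition is cheap.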

Why this is better than n-gram counting

The naive approach generates all possible n-grams and counts them. On the King James Bible, that means millions of unique strings in a hash map. The LCP walk only visits suffixes that actually share a prefix — it skips the vast majority of the text where no repetition exists.


Stage 5: Aho-Corasick deduplication

The problem

If "the quick brown fox" appears 3 times, then "quick brown fox", "the quick brown", and "quick brown" each also appear ≥ 3 times. These shorter phrases are redundant — they add no information beyond what the longer phrase already captures.

Rule: remove a shorter phrase if it is a substring of a longer phrase with equal or higher count. Keep it if its count is higher — that means it appears independently outside the containing contexts.

Solution

Build an Aho-Corasick automaton from all discovered phrases. The automaton is a trie with failure links that enables simultaneous matching of all patterns in a single pass.

For each phrase, search its text against the automaton to find all shorter phrases it contains. Mark those for removal if their count ≤ the container's count.
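The keep/remove rule itself can be made explicit with a naive O(P²) check — exactly the approach the automaton replaces, shown here only to illustrate the condition:

```rust
/// The dedup rule only (not the automaton): drop a phrase if it is a
/// substring of a strictly longer phrase with an equal or higher count.
/// The real pipeline answers the containment queries with Aho-Corasick
/// in a single pass instead of this quadratic scan.
fn dedup(phrases: &[(&str, usize)]) -> Vec<String> {
    let mut kept = Vec::new();
    for &(p, count) in phrases {
        let redundant = phrases.iter().any(|&(q, qcount)| {
            q.len() > p.len() && q.contains(p) && qcount >= count
        });
        if !redundant {
            kept.push(p.to_string());
        }
    }
    kept
}

fn main() {
    let phrases = [
        ("the quick brown fox", 3),
        ("quick brown fox", 3), // contained, same count: redundant
        ("quick brown", 7),     // higher count: appears independently, kept
    ];
    assert_eq!(dedup(&phrases), vec!["the quick brown fox", "quick brown"]);
}
```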

Optimization impact

On the King James Bible (114,993 candidate phrases):

Approach                Time         Operations
Pairwise .contains()    178 seconds  ~13 billion comparisons
Aho-Corasick automaton  165 ms       single-pass matching
Speedup                 1,078×

Stage 6: Interval merging

The problem

After deduplication, adjacent phrases that always co-occur should merge. If "hello" and "world" both appear at the same positions with the same count, and "world" always starts right after "hello" ends, they should be reported as "hello world".

Merge condition

Two phrases A and B can merge if:

  • Same occurrence count
  • Same number of occurrence positions
  • At every corresponding position: B starts within 0–2 bytes after A ends (touching, or separated by a space/punctuation character)

Algorithm

  1. Sort by (count descending, first start position ascending). This groups phrases with equal counts together, and within each group, spatially adjacent phrases become neighbors in the sorted order.

  2. Single linear pass: greedily chain merges (A+B, then (A+B)+C, etc.).
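A simplified sketch of the merge condition and the sort-then-scan pass, using a hypothetical `Phrase` struct (the real pipeline tracks more state per phrase):

```rust
#[derive(Clone, Debug, PartialEq)]
struct Phrase {
    text: String,
    count: usize,
    starts: Vec<usize>, // sorted occurrence start offsets (bytes)
}

/// Can B merge onto A? Same count, same number of occurrences, and at
/// every corresponding position B starts 0–2 bytes after A ends.
fn can_merge(a: &Phrase, b: &Phrase) -> bool {
    a.count == b.count
        && a.starts.len() == b.starts.len()
        && a.starts.iter().zip(&b.starts).all(|(&sa, &sb)| {
            let a_end = sa + a.text.len();
            sb >= a_end && sb - a_end <= 2
        })
}

/// Sort by (count desc, first start asc), then greedily chain merges
/// in one linear pass.
fn merge_phrases(mut phrases: Vec<Phrase>, text: &str) -> Vec<Phrase> {
    phrases.sort_by(|a, b| {
        b.count.cmp(&a.count).then(a.starts[0].cmp(&b.starts[0]))
    });
    let mut out: Vec<Phrase> = Vec::new();
    for p in phrases {
        if let Some(last) = out.last_mut() {
            if can_merge(last, &p) {
                // extend the merged phrase through the end of p
                let end = p.starts[0] + p.text.len();
                last.text = text[last.starts[0]..end].to_string();
                continue;
            }
        }
        out.push(p);
    }
    out
}

fn main() {
    let text = "hello world! hello world!";
    let phrases = vec![
        Phrase { text: "hello".into(), count: 2, starts: vec![0, 13] },
        Phrase { text: "world".into(), count: 2, starts: vec![6, 19] },
    ];
    let merged = merge_phrases(phrases, text);
    assert_eq!(merged.len(), 1);
    assert_eq!(merged[0].text, "hello world");
}
```

The sort is what makes the single pass sufficient: phrases that can merge are guaranteed to be adjacent after ordering by (count, first position).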

Optimization impact

On the King James Bible (46,468 phrases after dedup):

Approach                                Time
O(P²) nested loop with repeated passes  37.8 seconds
Sort + linear scan                      7.9 ms
Speedup                                 4,785×

Stage 7: Block detection

Phrases are capped at max_phrase_length words (typically 50). Duplicated paragraphs or entire sections longer than this are missed by the phrase walk.

Block detection reuses the same suffix array and LCP array to find maximal repeated substrings — the longest common prefix per LCP group, without word-boundary alignment:

  1. Walk the LCP array with run grouping (same technique as Stage 4)
  2. Emit only the longest shared prefix per group (byte-length bounded between max_phrase_length × 4 and max_phrase_length × 100)
  3. Validate UTF-8 character boundaries (the LCP value is in bytes and can split multi-byte characters)
  4. Remove shorter substrings fully covered by longer ones via sorted range lookup

After block detection, any n-gram phrases whose every occurrence falls within a detected block are removed as redundant.

UTF-8 safety

LCP values are in bytes. Two multi-byte characters can share leading bytes (e.g., U+2013 '–' and U+2014 '—' both start with 0xE2 0x80), causing the LCP to end mid-character. The implementation rounds down to the nearest valid character boundary using floor_char_boundary().
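Since `str::floor_char_boundary` is unstable in std at the time of writing, the rounding step can be written on stable Rust with `is_char_boundary` — an illustrative stand-in, not necessarily the pipeline's exact helper:

```rust
/// Round a byte offset down to the nearest UTF-8 character boundary.
/// Stable-Rust equivalent of the unstable `str::floor_char_boundary`.
fn floor_char_boundary(s: &str, mut index: usize) -> usize {
    if index >= s.len() {
        return s.len();
    }
    while !s.is_char_boundary(index) {
        index -= 1; // walk back past UTF-8 continuation bytes
    }
    index
}

fn main() {
    // U+2013 '–' (E2 80 93) and U+2014 '—' (E2 80 94) share their first
    // two bytes, so a byte-level LCP can land mid-character.
    let a = "a–b";
    let b = "a—b";
    let lcp = a.bytes().zip(b.bytes()).take_while(|(x, y)| x == y).count();
    assert_eq!(lcp, 3); // 'a' plus two shared bytes of the dash
    assert_eq!(floor_char_boundary(a, lcp), 1); // rounded down to after 'a'
}
```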


Stage 8: Byte → UTF-16 offset conversion

Internal processing uses byte offsets (required for Rust string slicing). JavaScript strings and VS Code's positionAt() expect UTF-16 code unit offsets.

A single O(n) pass builds a lookup table: for each byte offset, the cumulative UTF-16 code unit count. Characters in the Basic Multilingual Plane (U+0000–U+FFFF) contribute 1 code unit; characters above U+FFFF (emoji, mathematical symbols) contribute 2 (a surrogate pair).
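The table construction can be sketched as follows (illustrative; field and function names are hypothetical):

```rust
/// One O(n) pass: table[b] = number of UTF-16 code units preceding byte
/// offset b. BMP characters contribute 1 unit; characters above U+FFFF
/// contribute 2 (a surrogate pair).
fn build_utf16_table(text: &str) -> Vec<usize> {
    let mut table = vec![0usize; text.len() + 1];
    let mut units = 0;
    for (byte_off, ch) in text.char_indices() {
        // every byte inside this character maps to the same UTF-16 offset
        for b in byte_off..byte_off + ch.len_utf8() {
            table[b] = units;
        }
        units += ch.len_utf16();
    }
    table[text.len()] = units;
    table
}

fn main() {
    let text = "a😀b"; // 'a' (1 byte), '😀' (4 bytes, surrogate pair), 'b'
    let table = build_utf16_table(text);
    assert_eq!(table[0], 0); // before 'a'
    assert_eq!(table[1], 1); // before '😀'
    assert_eq!(table[5], 3); // before 'b': 1 + 2 code units
    assert_eq!(table[6], 4); // end of string
}
```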

The string_length field is also converted from bytes to character count.


Performance

Stage                  Time Complexity  Bible (824K words)
1. Suffix array        O(n)             277 ms
2. LCP array           O(n)             79 ms
3. Boundaries          O(n)             202 ms
4. LCP walk            O(n × w)         545 ms
5. Aho-Corasick dedup  O(P × L)         165 ms
6. Interval merge      O(P log P)       8 ms
7. Block detection     O(n)             < 1 ms
8. UTF-16 conversion   O(n)             < 1 ms
Total                                   ~1.26 s

Representative end-to-end benchmarks (native Rust, release build, Apple M1):

Document             Words  Time    Phrases
Pride and Prejudice  130K   95 ms   1,558
King James Bible     824K   1.26 s  46,468

See Benchmarks for WASM performance, competitor comparisons, and scaling data.


Optimization history

Two targeted changes to post-processing steps yielded a 166× overall speedup. The suffix array construction, LCP computation, and phrase walk were already efficient (< 1.1s combined) and required no optimization.

Step             Before                              After                        Speedup
Substring dedup  178 s (O(P²) pairwise .contains())  165 ms (Aho-Corasick)        1,078×
Interval merge   37.8 s (O(P²) nested loop)          7.9 ms (sort + linear scan)  4,785×
Total pipeline   215 s                               1.26 s                       166×

References

  1. G. Nong, S. Zhang, W. H. Chan. "Two Efficient Algorithms for Linear Time Suffix Array Construction." IEEE Transactions on Computers, 60(10): 1471–1484, 2011.

  2. T. Kasai, G. Lee, H. Arimura, S. Arikawa, K. Park. "Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications." Proceedings of CPM, LNCS 2089, pp. 181–192, 2001.

  3. A. V. Aho, M. J. Corasick. "Efficient String Matching: An Aid to Bibliographic Search." Communications of the ACM, 18(6): 333–340, 1975.