N-gram Inverted Index

The n-gram index is Sieve’s primary candidate selection mechanism. It reduces the search space from tens of thousands of entities to a small set of plausible candidates before running the expensive Jaro-Winkler comparison.

How It Works

Build phase (at index load time):

  1. Every entity’s normalized primary name and aliases are decomposed into overlapping 3-character trigrams.

  2. An inverted map is built: trigram [entity IDs].

Query phase (per screening request):

  1. The normalized query is decomposed into trigrams.

  2. Entity IDs are collected by trigram overlap.

  3. Entities sharing ≥30% of query trigrams pass the filter.

  4. A length-ratio check eliminates entities whose names differ from the query by more than 3×.

Trigram Decomposition

A string is split into all overlapping substrings of length 3:

"putin"      → {"put", "uti", "tin"}
"vladimir"   → {"vla", "lad", "adi", "dim", "imi", "mir"}

For the full name "vladimir putin":

→ {"vla", "lad", "adi", "dim", "imi", "mir", "ir ", "r p", " pu", "put", "uti", "tin"}

Strings shorter than 3 characters produce no trigrams and fall back to a full scan (this is rare for real names).

Inverted Index Structure

        flowchart LR
    subgraph "Inverted Index"
        T1["'put'"] --> E1["entity-123, entity-456"]
        T2["'uti'"] --> E2["entity-123, entity-789"]
        T3["'vla'"] --> E3["entity-123"]
        T4["'mic'"] --> E4["entity-999, entity-888"]
    end
    

At query time, trigram lookups are O(1) hash map reads. The engine counts how many trigrams each entity shares with the query:

Query: "putin"  →  trigrams: {"put", "uti", "tin"}

entity-123: hit on "put" + "uti"        → 2/3 = 67% overlap ✅
entity-456: hit on "put"                → 1/3 = 33% overlap ✅
entity-789: hit on "uti"                → 1/3 = 33% overlap ✅
entity-999: no hits                     → 0/3 = 0%  overlap ❌

Filtering Criteria

Two filters reduce the candidate set:

1. Minimum trigram overlap (30%)

An entity must share at least 30% of the query’s trigrams to be considered a candidate. This ratio is tuned to cast a wide enough net for fuzzy matches while eliminating clearly unrelated entities.

int minHits = Math.max(1, (int) (queryTrigrams.size() * 0.3));

2. Length ratio filter (3×)

If the query and the candidate’s shortest/longest name differ by more than a factor of 3, the entity is skipped. This is because Jaro-Winkler cannot produce high scores when string lengths diverge significantly.

double ratio = (double) Math.max(queryLen, longest)
             / Math.min(queryLen, shortest);
if (ratio > 3.0) continue;  // skip — can't produce high JW score

Effectiveness

For a typical index of ~20,000 sanctioned entities:

Metric

Without n-gram

With n-gram

Candidates per query

20,000

~50–200

Jaro-Winkler comparisons

~100,000

~250–1,000

Query latency

~150ms

~0.4ms

Reduction factor

99.5–99.75%

The n-gram filter eliminates 99%+ of entities that have no character-level similarity to the query, while preserving all entities that could plausibly match at or above the configured threshold.

Thread Safety

  • The index rebuilds atomically under a synchronized lock with double-checked locking on the index size.

  • Volatile references to the trigram map and entity map ensure safe publication across threads.

  • Query-time reads are lock-free.

Why Trigrams?

  • Bigrams (n=2) produce too many false positives — common letter pairs like “an”, “er”, “in” appear in most names.

  • 4-grams are too specific and miss legitimate fuzzy matches where characters are transposed or substituted.

  • Trigrams strike the right balance for name matching in sanctions screening.