N-gram Inverted Index¶

The n-gram index is Sieve’s primary candidate selection mechanism. It reduces the search space from tens of thousands of entities to a small set of plausible candidates before running the expensive Jaro-Winkler comparison.

How It Works ¶

Build phase (at index load time):

Every entity’s normalized primary name and aliases are decomposed into overlapping 3-character trigrams.
An inverted map is built: trigram → [entity IDs].

Query phase (per screening request):

The normalized query is decomposed into trigrams.
Entity IDs are collected by trigram overlap.
Entities sharing ≥30% of query trigrams pass the filter.
A length-ratio check eliminates entities whose names differ from the query by more than 3×.

Trigram Decomposition ¶

A string is split into all overlapping substrings of length 3:

"putin"      → {"put", "uti", "tin"}
"vladimir"   → {"vla", "lad", "adi", "dim", "imi", "mir"}

For the full name "vladimir putin":

→ {"vla", "lad", "adi", "dim", "imi", "mir", "ir ", "r p", " pu", "put", "uti", "tin"}

Strings shorter than 3 characters produce no trigrams and fall back to a full scan (this is rare for real names).

Inverted Index Structure ¶

        flowchart LR
    subgraph "Inverted Index"
        T1["'put'"] --> E1["entity-123, entity-456"]
        T2["'uti'"] --> E2["entity-123, entity-789"]
        T3["'vla'"] --> E3["entity-123"]
        T4["'mic'"] --> E4["entity-999, entity-888"]
    end

At query time, trigram lookups are O(1) hash map reads. The engine counts how many trigrams each entity shares with the query:

Query: "putin"  →  trigrams: {"put", "uti", "tin"}

entity-123: hit on "put" + "uti"        → 2/3 = 67% overlap ✅
entity-456: hit on "put"                → 1/3 = 33% overlap ✅
entity-789: hit on "uti"                → 1/3 = 33% overlap ✅
entity-999: no hits                     → 0/3 = 0%  overlap ❌

Filtering Criteria ¶

Two filters reduce the candidate set:

1. Minimum trigram overlap (30%)

An entity must share at least 30% of the query’s trigrams to be considered a candidate. This ratio is tuned to cast a wide enough net for fuzzy matches while eliminating clearly unrelated entities.

int minHits = Math.max(1, (int) (queryTrigrams.size() * 0.3));

2. Length ratio filter (3×)

If the query and the candidate’s shortest/longest name differ by more than a factor of 3, the entity is skipped. This is because Jaro-Winkler cannot produce high scores when string lengths diverge significantly.

double ratio = (double) Math.max(queryLen, longest)
             / Math.min(queryLen, shortest);
if (ratio > 3.0) continue;  // skip — can't produce high JW score

Effectiveness ¶

For a typical index of ~20,000 sanctioned entities:

Metric	Without n-gram	With n-gram
Candidates per query	20,000	~50–200
Jaro-Winkler comparisons	~100,000	~250–1,000
Query latency	~150ms	~0.4ms
Reduction factor	—	99.5–99.75%

The n-gram filter eliminates 99%+ of entities that have no character-level similarity to the query, while preserving all entities that could plausibly match at or above the configured threshold.

Thread Safety ¶

The index rebuilds atomically under a synchronized lock with double-checked locking on the index size.
Volatile references to the trigram map and entity map ensure safe publication across threads.
Query-time reads are lock-free.

Why Trigrams?¶

Bigrams (n=2) produce too many false positives — common letter pairs like “an”, “er”, “in” appear in most names.
4-grams are too specific and miss legitimate fuzzy matches where characters are transposed or substituted.
Trigrams strike the right balance for name matching in sanctions screening.

N-gram Inverted Index¶

How It Works¶

Trigram Decomposition¶

Inverted Index Structure¶

Filtering Criteria¶

Effectiveness¶

Thread Safety¶