N-gram Inverted Index ===================== The n-gram index is Sieve's primary candidate selection mechanism. It reduces the search space from tens of thousands of entities to a small set of plausible candidates before running the expensive Jaro-Winkler comparison. .. contents:: On this page :local: :depth: 2 How It Works ------------ **Build phase** (at index load time): 1. Every entity's normalized primary name and aliases are decomposed into overlapping 3-character **trigrams**. 2. An inverted map is built: ``trigram → [entity IDs]``. **Query phase** (per screening request): 1. The normalized query is decomposed into trigrams. 2. Entity IDs are collected by trigram overlap. 3. Entities sharing ≥30% of query trigrams pass the filter. 4. A length-ratio check eliminates entities whose names differ from the query by more than 3×. Trigram Decomposition --------------------- A string is split into all overlapping substrings of length 3: .. code-block:: text "putin" → {"put", "uti", "tin"} "vladimir" → {"vla", "lad", "adi", "dim", "imi", "mir"} For the full name ``"vladimir putin"``: .. code-block:: text → {"vla", "lad", "adi", "dim", "imi", "mir", "ir ", "r p", " pu", "put", "uti", "tin"} Strings shorter than 3 characters produce no trigrams and fall back to a full scan (this is rare for real names). Inverted Index Structure ------------------------ .. mermaid:: flowchart LR subgraph "Inverted Index" T1["'put'"] --> E1["entity-123, entity-456"] T2["'uti'"] --> E2["entity-123, entity-789"] T3["'vla'"] --> E3["entity-123"] T4["'mic'"] --> E4["entity-999, entity-888"] end At query time, trigram lookups are O(1) hash map reads. The engine counts how many trigrams each entity shares with the query: .. code-block:: text Query: "putin" → trigrams: {"put", "uti", "tin"} entity-123: hit on "put" + "uti" → 2/3 = 67% overlap ✅ entity-456: hit on "put" → 1/3 = 33% overlap ✅ entity-789: hit on "uti" → 1/3 = 33% overlap ✅ entity-999: no hits → 0/3 = 0% overlap ❌ Filtering Criteria ------------------ Two filters reduce the candidate set: **1. Minimum trigram overlap (30%)** An entity must share at least 30% of the query's trigrams to be considered a candidate. This ratio is tuned to cast a wide enough net for fuzzy matches while eliminating clearly unrelated entities. .. code-block:: java int minHits = Math.max(1, (int) (queryTrigrams.size() * 0.3)); **2. Length ratio filter (3×)** If the query and the candidate's shortest/longest name differ by more than a factor of 3, the entity is skipped. This is because Jaro-Winkler cannot produce high scores when string lengths diverge significantly. .. code-block:: java double ratio = (double) Math.max(queryLen, longest) / Math.min(queryLen, shortest); if (ratio > 3.0) continue; // skip — can't produce high JW score Effectiveness ------------- For a typical index of ~20,000 sanctioned entities: .. list-table:: :header-rows: 1 :widths: 40 30 30 * - Metric - Without n-gram - With n-gram * - Candidates per query - 20,000 - ~50–200 * - Jaro-Winkler comparisons - ~100,000 - ~250–1,000 * - Query latency - ~150ms - ~0.4ms * - Reduction factor - — - **99.5–99.75%** The n-gram filter eliminates 99%+ of entities that have no character-level similarity to the query, while preserving all entities that could plausibly match at or above the configured threshold. Thread Safety ------------- - The index rebuilds atomically under a synchronized lock with double-checked locking on the index size. - Volatile references to the trigram map and entity map ensure safe publication across threads. - Query-time reads are lock-free. Why Trigrams? ------------- - **Bigrams** (n=2) produce too many false positives — common letter pairs like "an", "er", "in" appear in most names. - **4-grams** are too specific and miss legitimate fuzzy matches where characters are transposed or substituted. - **Trigrams** strike the right balance for name matching in sanctions screening.