Jaro-Winkler Similarity ======================= Sieve uses the Jaro-Winkler string similarity algorithm for fuzzy name matching. It is implemented from scratch in pure Java with no external dependencies. .. contents:: On this page :local: :depth: 2 Overview -------- Jaro-Winkler produces a similarity score between **0.0** (no similarity) and **1.0** (exact match). It is particularly well-suited for name matching because it rewards strings that share a common prefix — a property that aligns with how human names tend to vary (e.g., "Vladimir" vs "Vladimyr"). The algorithm consists of two phases: 1. **Jaro similarity** — counts matching characters and transpositions 2. **Winkler adjustment** — boosts the score for shared prefixes Jaro Similarity --------------- Given two strings *s₁* and *s₂*, the Jaro similarity is: .. math:: jaro = \frac{1}{3} \left( \frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m} \right) Where: - *m* = number of matching characters - *t* = number of transpositions / 2 - Characters are considered matching if they are the same and within ⌊max(\|s₁\|, \|s₂\|) / 2⌋ - 1 positions of each other **Example:** .. code-block:: text s₁ = "putin" s₂ = "poutin" Match window = max(5, 6) / 2 - 1 = 2 Matches: p-p, u-u, t-t, i-i, n-n → m = 5 Transpositions: 0 Jaro = (5/5 + 5/6 + 5/5) / 3 = 0.9444 Winkler Adjustment ------------------ The Winkler modification boosts the Jaro score for strings sharing a common prefix (up to 4 characters): .. math:: jw = jaro + \ell \cdot p \cdot (1 - jaro) Where: - *ℓ* = length of common prefix (max 4) - *p* = prefix scaling factor (default: 0.1) **Example:** .. code-block:: text "putin" vs "poutin" Common prefix: "p" → ℓ = 1 JW = 0.9444 + 1 × 0.1 × (1 - 0.9444) = 0.9500 Threshold-Aware Early Exit -------------------------- Sieve includes an optimized ``similarityWithThreshold()`` method that skips the full computation when the result cannot possibly meet the threshold: .. code-block:: java public static double similarityWithThreshold(String s1, String s2, double threshold) { // Upper bound: Jaro score <= (2 + min/max) / 3 int shorter = Math.min(s1.length(), s2.length()); int longer = Math.max(s1.length(), s2.length()); double maxPossibleJaro = (2.0 + (double) shorter / longer) / 3.0; double maxPossible = maxPossibleJaro + (4 * 0.1 * (1.0 - maxPossibleJaro)); if (maxPossible < threshold) { return 0.0; // impossible to meet threshold — skip } return similarity(s1, s2); } This eliminates unnecessary comparisons when string lengths diverge too much. For example, ``"al"`` (length 2) vs ``"international corporation"`` (length 27) has a maximum possible Jaro score of 0.358 — well below any practical threshold. Memory Optimization ------------------- The standard Jaro algorithm allocates two boolean arrays per comparison to track matched characters. At ~250 comparisons per query, this creates significant GC pressure. Sieve eliminates this by using **ThreadLocal reusable arrays**: .. code-block:: java private static final ThreadLocal TL_MATCHED_1 = ThreadLocal.withInitial(() -> new boolean[128]); static double jaro(String s1, String s2) { boolean[] s1Matched = borrowArray(TL_MATCHED_1, s1.length()); boolean[] s2Matched = borrowArray(TL_MATCHED_2, s2.length()); // ... compute matches and transpositions ... clearArray(s1Matched, s1.length()); // reset for reuse clearArray(s2Matched, s2.length()); } The arrays grow on demand but are never deallocated, achieving **zero allocation** in the steady state. Complexity ---------- .. list-table:: :header-rows: 1 :widths: 25 75 * - Aspect - Complexity * - Time - O(n × w) where n = max string length, w = match window * - Space - O(n) — reused via ThreadLocal * - Typical cost - ~1–2µs per comparison for names of length 10–40 References ---------- - Winkler, W. E. (1990). "String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage" - `Jaro–Winkler distance (Wikipedia) `_