Jaro-Winkler Similarity

Sieve uses the Jaro-Winkler string similarity algorithm for fuzzy name matching. It is implemented from scratch in pure Java with no external dependencies.

Overview

Jaro-Winkler produces a similarity score between 0.0 (no similarity) and 1.0 (exact match). It is particularly well-suited for name matching because it rewards strings that share a common prefix — a property that aligns with how human names tend to vary (e.g., “Vladimir” vs “Vladimyr”).

The algorithm consists of two phases:

  1. Jaro similarity — counts matching characters and transpositions

  2. Winkler adjustment — boosts the score for shared prefixes

Jaro Similarity

Given two strings s₁ and s₂, the Jaro similarity is:

\[jaro = \frac{1}{3} \left( \frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m} \right)\]

Where:

  • m = number of matching characters

  • t = number of transpositions / 2

  • Characters are considered matching if they are the same and within ⌊max(|s₁|, |s₂|) / 2⌋ - 1 positions of each other

Example:

s₁ = "putin"
s₂ = "poutin"

Match window = max(5, 6) / 2 - 1 = 2
Matches: p-p, u-u, t-t, i-i, n-n → m = 5
Transpositions: 0
Jaro = (5/5 + 5/6 + 5/5) / 3 = 0.9444

Winkler Adjustment

The Winkler modification boosts the Jaro score for strings sharing a common prefix (up to 4 characters):

\[jw = jaro + \ell \cdot p \cdot (1 - jaro)\]

Where:

  • = length of common prefix (max 4)

  • p = prefix scaling factor (default: 0.1)

Example:

"putin" vs "poutin"
Common prefix: "p" → ℓ = 1
JW = 0.9444 + 1 × 0.1 × (1 - 0.9444) = 0.9500

Threshold-Aware Early Exit

Sieve includes an optimized similarityWithThreshold() method that skips the full computation when the result cannot possibly meet the threshold:

public static double similarityWithThreshold(String s1, String s2, double threshold) {
    // Upper bound: Jaro score <= (2 + min/max) / 3
    int shorter = Math.min(s1.length(), s2.length());
    int longer = Math.max(s1.length(), s2.length());
    double maxPossibleJaro = (2.0 + (double) shorter / longer) / 3.0;
    double maxPossible = maxPossibleJaro + (4 * 0.1 * (1.0 - maxPossibleJaro));
    if (maxPossible < threshold) {
        return 0.0;  // impossible to meet threshold — skip
    }
    return similarity(s1, s2);
}

This eliminates unnecessary comparisons when string lengths diverge too much. For example, "al" (length 2) vs "international corporation" (length 27) has a maximum possible Jaro score of 0.358 — well below any practical threshold.

Memory Optimization

The standard Jaro algorithm allocates two boolean arrays per comparison to track matched characters. At ~250 comparisons per query, this creates significant GC pressure.

Sieve eliminates this by using ThreadLocal reusable arrays:

private static final ThreadLocal<boolean[]> TL_MATCHED_1 =
        ThreadLocal.withInitial(() -> new boolean[128]);

static double jaro(String s1, String s2) {
    boolean[] s1Matched = borrowArray(TL_MATCHED_1, s1.length());
    boolean[] s2Matched = borrowArray(TL_MATCHED_2, s2.length());
    // ... compute matches and transpositions ...
    clearArray(s1Matched, s1.length());  // reset for reuse
    clearArray(s2Matched, s2.length());
}

The arrays grow on demand but are never deallocated, achieving zero allocation in the steady state.

Complexity

Aspect

Complexity

Time

O(n × w) where n = max string length, w = match window

Space

O(n) — reused via ThreadLocal

Typical cost

~1–2µs per comparison for names of length 10–40

References