Jaro-Winkler Similarity¶
Sieve uses the Jaro-Winkler string similarity algorithm for fuzzy name matching. It is implemented from scratch in pure Java with no external dependencies.
Overview¶
Jaro-Winkler produces a similarity score between 0.0 (no similarity) and 1.0 (exact match). It is particularly well-suited for name matching because it rewards strings that share a common prefix — a property that aligns with how human names tend to vary (e.g., “Vladimir” vs “Vladimyr”).
The algorithm consists of two phases:
Jaro similarity — counts matching characters and transpositions
Winkler adjustment — boosts the score for shared prefixes
Jaro Similarity¶
Given two strings s₁ and s₂, the Jaro similarity is:
Where:
m = number of matching characters
t = number of transpositions / 2
Characters are considered matching if they are the same and within ⌊max(|s₁|, |s₂|) / 2⌋ - 1 positions of each other
Example:
s₁ = "putin"
s₂ = "poutin"
Match window = max(5, 6) / 2 - 1 = 2
Matches: p-p, u-u, t-t, i-i, n-n → m = 5
Transpositions: 0
Jaro = (5/5 + 5/6 + 5/5) / 3 = 0.9444
Winkler Adjustment¶
The Winkler modification boosts the Jaro score for strings sharing a common prefix (up to 4 characters):
Where:
ℓ = length of common prefix (max 4)
p = prefix scaling factor (default: 0.1)
Example:
"putin" vs "poutin"
Common prefix: "p" → ℓ = 1
JW = 0.9444 + 1 × 0.1 × (1 - 0.9444) = 0.9500
Threshold-Aware Early Exit¶
Sieve includes an optimized similarityWithThreshold() method that skips
the full computation when the result cannot possibly meet the threshold:
public static double similarityWithThreshold(String s1, String s2, double threshold) {
// Upper bound: Jaro score <= (2 + min/max) / 3
int shorter = Math.min(s1.length(), s2.length());
int longer = Math.max(s1.length(), s2.length());
double maxPossibleJaro = (2.0 + (double) shorter / longer) / 3.0;
double maxPossible = maxPossibleJaro + (4 * 0.1 * (1.0 - maxPossibleJaro));
if (maxPossible < threshold) {
return 0.0; // impossible to meet threshold — skip
}
return similarity(s1, s2);
}
This eliminates unnecessary comparisons when string lengths diverge too much.
For example, "al" (length 2) vs "international corporation" (length 27)
has a maximum possible Jaro score of 0.358 — well below any practical
threshold.
Memory Optimization¶
The standard Jaro algorithm allocates two boolean arrays per comparison to track matched characters. At ~250 comparisons per query, this creates significant GC pressure.
Sieve eliminates this by using ThreadLocal reusable arrays:
private static final ThreadLocal<boolean[]> TL_MATCHED_1 =
ThreadLocal.withInitial(() -> new boolean[128]);
static double jaro(String s1, String s2) {
boolean[] s1Matched = borrowArray(TL_MATCHED_1, s1.length());
boolean[] s2Matched = borrowArray(TL_MATCHED_2, s2.length());
// ... compute matches and transpositions ...
clearArray(s1Matched, s1.length()); // reset for reuse
clearArray(s2Matched, s2.length());
}
The arrays grow on demand but are never deallocated, achieving zero allocation in the steady state.
Complexity¶
Aspect |
Complexity |
|---|---|
Time |
O(n × w) where n = max string length, w = match window |
Space |
O(n) — reused via ThreadLocal |
Typical cost |
~1–2µs per comparison for names of length 10–40 |
References¶
Winkler, W. E. (1990). “String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage”