Name Normalization
Name normalization is the first step in the matching pipeline. It transforms raw name strings into a canonical form so that comparisons are consistent regardless of casing, spacing, or formatting differences.
What Normalization Does
Input: " Vladimir PUTIN "
Output: "vladimir putin"
Three operations are applied in sequence:
Strip — remove leading and trailing whitespace
Lowercase — convert to lowercase (locale-independent)
Collapse whitespace — replace sequences of whitespace with a single space
Implementation
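// NameNormalizer.java
import java.util.Locale;
import java.util.regex.Pattern;

// Compiled once and reused on every call.
private static final Pattern WHITESPACE = Pattern.compile("\\s+");

static String normalize(String name) {
    if (name == null || name.isBlank()) {
        return "";
    }
    // Locale.ROOT keeps lowercasing independent of the JVM's default locale
    // (for example, a Turkish default locale would map "I" to a dotless "ı").
    return WHITESPACE.matcher(name.strip().toLowerCase(Locale.ROOT)).replaceAll(" ");
}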
Key design decisions:
Pre-compiled regex: the \s+ pattern is compiled once as a static field, avoiding recompilation overhead on every call.
Memoized results: a ConcurrentHashMap caches up to 100,000 normalized strings. Entity names are cached at index load time; query strings benefit from cache hits on repeated screening.
Null-safe: null or blank inputs return an empty string, never null.
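The memoization mentioned above is not visible in the excerpt, so here is a minimal sketch of how a bounded ConcurrentHashMap cache could wrap the same normalization. The class name MemoizedNormalizerSketch, the constant MAX_ENTRIES, and the "stop inserting once full" policy are assumptions made for illustration; the real NameNormalizer may evict or bound its cache differently.

```java
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

// Hypothetical sketch; names and the bounding policy are assumptions, not the project's code.
final class MemoizedNormalizerSketch {
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");
    // Cap taken from the "up to 100,000 normalized strings" figure in the text.
    private static final int MAX_ENTRIES = 100_000;
    private static final ConcurrentHashMap<String, String> CACHE = new ConcurrentHashMap<>();

    static String normalize(String name) {
        if (name == null || name.isBlank()) {
            return ""; // null-safe: never returns null
        }
        String cached = CACHE.get(name);
        if (cached != null) {
            return cached; // cache hit: no regex work
        }
        String normalized =
            WHITESPACE.matcher(name.strip().toLowerCase(Locale.ROOT)).replaceAll(" ");
        // Assumed policy: stop adding entries once the cap is reached.
        // This is a soft bound; threads racing past the check may overshoot slightly.
        if (CACHE.size() < MAX_ENTRIES) {
            CACHE.putIfAbsent(name, normalized);
        }
        return normalized;
    }

    public static void main(String[] args) {
        System.out.println(normalize("  Vladimir   PUTIN ")); // prints "vladimir putin"
        System.out.println(normalize(null).isEmpty());        // prints "true"
    }
}
```

A soft cap like this keeps the hot path lock-free. If a strict bound or eviction of stale entries matters, an LRU structure would be the usual alternative, at the cost of extra synchronization.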
Pre-normalization Cache
At index load time, the NormalizedNameCache pre-computes and stores the normalized form of every entity’s primary name and aliases:
flowchart LR
L["Index Load"] --> C["NormalizedNameCache"]
C --> P["primaryName → normalized"]
C --> A["aliases[] → normalized[]"]
Q["Query time"] -->|"cache.get(entity)"| R["instant lookup"]
This eliminates ~100,000 string allocations and regex operations per query that were previously needed to normalize every entity name on every request.
// At query time — zero normalization cost per entity
NormalizedNameCache.NormalizedEntry cached = nameCache.get(entity);
double score = JaroWinkler.similarity(normalizedQuery, cached.primaryName());
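To make the load-time step concrete, the following is a rough sketch of how such a cache might be populated once and then read at query time. Only NormalizedEntry and its primaryName() accessor appear in the text above; the Entity record, the aliases() accessor, keying by entity id, and the class name NameCacheSketch are all assumptions for illustration. Records require Java 16 or later.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch of a pre-normalization cache; not the project's NormalizedNameCache.
final class NameCacheSketch {
    // Assumed entity shape; the real entity type is not shown in this document.
    record Entity(String id, String primaryName, List<String> aliases) {}

    // primaryName() mirrors the accessor used above; aliases() is assumed.
    record NormalizedEntry(String primaryName, List<String> aliases) {}

    private static final Pattern WHITESPACE = Pattern.compile("\\s+");
    private final Map<String, NormalizedEntry> byId = new ConcurrentHashMap<>();

    // Index load: normalize every primary name and alias exactly once.
    NameCacheSketch(List<Entity> index) {
        for (Entity e : index) {
            List<String> aliases = e.aliases().stream()
                .map(NameCacheSketch::normalize)
                .collect(Collectors.toList());
            byId.put(e.id(), new NormalizedEntry(normalize(e.primaryName()), aliases));
        }
    }

    // Query time: a map lookup, with no string or regex work per entity.
    NormalizedEntry get(Entity entity) {
        return byId.get(entity.id());
    }

    static String normalize(String name) {
        if (name == null || name.isBlank()) {
            return "";
        }
        return WHITESPACE.matcher(name.strip().toLowerCase(Locale.ROOT)).replaceAll(" ");
    }

    public static void main(String[] args) {
        Entity e = new Entity("e1", "  Vladimir PUTIN ", List.of("PUTIN,  Vladimir V."));
        NameCacheSketch cache = new NameCacheSketch(List.of(e));
        System.out.println(cache.get(e).primaryName()); // prints "vladimir putin"
        System.out.println(cache.get(e).aliases());
    }
}
```

With this shape, only the query string itself still needs normalizing on each request; every entity-side comparison reads an already-normalized value.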
Thread Safety
NormalizedNameCache uses a ConcurrentHashMap for thread-safe concurrent reads.
The cache rebuilds automatically (with double-checked locking) when the index size changes, e.g., after a list refresh.
NameNormalizer’s static cache is also a ConcurrentHashMap with a bounded size to prevent unbounded memory growth.
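The rebuild-on-size-change behavior could look something like the sketch below, which uses double-checked locking around a volatile snapshot. Everything here is an assumption for illustration: the class name RebuildingCacheSketch, the Snapshot holder, representing the index as a plain list of raw names, and the rebuild counter (included only so the behavior can be observed).

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a size-triggered rebuild with double-checked locking.
final class RebuildingCacheSketch {
    // Snapshot: the index size it was built for, plus the normalized names.
    private static final class Snapshot {
        final int indexSize;
        final Map<String, String> normalizedByRawName;

        Snapshot(int indexSize, Map<String, String> normalizedByRawName) {
            this.indexSize = indexSize;
            this.normalizedByRawName = normalizedByRawName;
        }
    }

    private final Object lock = new Object();
    private final AtomicInteger rebuilds = new AtomicInteger();
    // volatile so a freshly built snapshot is safely published to reader threads.
    private volatile Snapshot snapshot = new Snapshot(-1, Map.of());

    String get(List<String> index, String rawName) {
        Snapshot current = snapshot;
        if (current.indexSize != index.size()) {          // first check, no lock
            synchronized (lock) {
                current = snapshot;
                if (current.indexSize != index.size()) {  // second check, under the lock
                    current = build(index);
                    snapshot = current;
                }
            }
        }
        return current.normalizedByRawName.get(rawName);
    }

    private Snapshot build(List<String> index) {
        Map<String, String> map = new ConcurrentHashMap<>();
        for (String raw : index) {
            map.put(raw, raw.strip().toLowerCase(Locale.ROOT).replaceAll("\\s+", " "));
        }
        rebuilds.incrementAndGet();
        return new Snapshot(index.size(), map);
    }

    int rebuildCount() {
        return rebuilds.get();
    }

    public static void main(String[] args) {
        RebuildingCacheSketch cache = new RebuildingCacheSketch();
        List<String> index = new java.util.ArrayList<>(List.of("  Acme  CORP "));
        System.out.println(cache.get(index, "  Acme  CORP "));
        index.add("Globex  LLC"); // simulates a list refresh that changes the index size
        System.out.println(cache.get(index, "Globex  LLC"));
        System.out.println("rebuilds: " + cache.rebuildCount());
    }
}
```

One limitation of keying the rebuild on size alone: a refresh that replaces entries without changing the count would not trigger a rebuild in this sketch. A version stamp or load timestamp on the index would close that gap.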