Name Normalization

Name normalization is the first step in the matching pipeline. It transforms raw name strings into a canonical form so that comparisons are consistent regardless of casing, spacing, or formatting differences.

What Normalization Does

Input:  "  Vladimir   PUTIN  "
Output: "vladimir putin"

Three operations are applied in sequence:

Strip — remove leading and trailing whitespace
Lowercase — convert to lowercase (locale-independent)
Collapse whitespace — replace sequences of whitespace with a single space

Implementation

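// NameNormalizer.java (excerpt)
import java.util.Locale;
import java.util.regex.Pattern;

class NameNormalizer {
    // Compiled once and reused by every call
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String normalize(String name) {
        if (name == null || name.isBlank()) {
            return "";
        }
        // Locale.ROOT keeps lowercasing independent of the JVM default locale
        return WHITESPACE.matcher(name.strip().toLowerCase(Locale.ROOT)).replaceAll(" ");
    }
}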
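
The "locale-independent" property of the lowercase step only holds when an explicit locale is passed to toLowerCase. The sketch below illustrates the difference; the class name LocaleLowercaseDemo and the method normalizeWith are hypothetical and not part of the codebase described here.

```java
import java.util.Locale;
import java.util.regex.Pattern;

final class LocaleLowercaseDemo {
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Same three steps as normalize(), with the locale made an explicit parameter
    // (hypothetical helper, for illustration only).
    static String normalizeWith(String name, Locale locale) {
        if (name == null || name.isBlank()) {
            return "";
        }
        return WHITESPACE.matcher(name.strip().toLowerCase(locale)).replaceAll(" ");
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        // Locale.ROOT maps 'I' to the ASCII 'i'
        System.out.println(normalizeWith("  Vladimir   PUTIN  ", Locale.ROOT));
        // Turkish casing rules map 'I' to the dotless 'ı' (U+0131)
        System.out.println(normalizeWith("  Vladimir   PUTIN  ", turkish));
    }
}
```

Under Turkish casing rules the uppercase I in PUTIN lowercases to a dotless ı, so the same input would produce a different canonical form on a JVM whose default locale is Turkish. Fixing the locale avoids that mismatch between index and query.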

Key design decisions:

Pre-compiled regex — the \s+ pattern is compiled once as a static field, avoiding recompilation overhead on every call.
Memoized results — a ConcurrentHashMap caches up to 100,000 normalized strings. Entity names are cached at index load time; query strings benefit from cache hits on repeated screening.
Null-safe — null or blank inputs return an empty string, never null.
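
The memoization described in the list above could look roughly like the following. This is a minimal sketch, not the actual NameNormalizer code: the class name MemoizedNormalizerSketch, the cacheSize helper, and the "stop inserting once full" policy are assumptions; only the ConcurrentHashMap and the 100,000-entry cap come from the text.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

final class MemoizedNormalizerSketch {
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");
    private static final int MAX_ENTRIES = 100_000; // cap taken from the text above
    private static final Map<String, String> CACHE = new ConcurrentHashMap<>();

    static String normalize(String name) {
        if (name == null || name.isBlank()) {
            return ""; // null-safe: never returns null
        }
        String cached = CACHE.get(name);
        if (cached != null) {
            return cached; // cache hit: no regex work
        }
        String normalized =
                WHITESPACE.matcher(name.strip().toLowerCase(Locale.ROOT)).replaceAll(" ");
        // Assumed policy: stop adding entries once the cap is reached.
        // The size check is not atomic with the put, so concurrent callers
        // may overshoot the cap by a few entries.
        if (CACHE.size() < MAX_ENTRIES) {
            CACHE.putIfAbsent(name, normalized);
        }
        return normalized;
    }

    // Hypothetical helper, exposed only so the sketch can be inspected.
    static int cacheSize() {
        return CACHE.size();
    }
}
```

A stricter bound or an eviction policy (for example LRU) would need a different structure; the sketch only shows how a size check keeps memory growth limited.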

Pre-normalization Cache

At index load time, the NormalizedNameCache pre-computes and stores the normalized form of every entity’s primary name and aliases:

flowchart LR
    L["Index Load"] --> C["NormalizedNameCache"]
    C --> P["primaryName → normalized"]
    C --> A["aliases[] → normalized[]"]
    Q["Query time"] -->|"cache.get(entity)"| R["instant lookup"]

This eliminates ~100,000 string allocations and regex operations per query that were previously needed to normalize every entity name on every request.

// At query time — zero normalization cost per entity
NormalizedNameCache.NormalizedEntry cached = nameCache.get(entity);
double score = JaroWinkler.similarity(normalizedQuery, cached.primaryName());
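
To make the load-time step concrete, here is one way such a cache could be populated. It is a sketch under stated assumptions: the Entity record, its id field, and keying the map by id are hypothetical, since the real entity type and cache internals are not shown on this page. Only the NormalizedEntry name and its primaryName() accessor mirror the snippet above.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

final class NameCacheSketch {
    // Hypothetical entity shape, assumed for this example.
    record Entity(String id, String primaryName, List<String> aliases) {}

    // Mirrors the accessor used above: cached.primaryName()
    record NormalizedEntry(String primaryName, List<String> aliases) {}

    private static final Pattern WHITESPACE = Pattern.compile("\\s+");
    private final Map<String, NormalizedEntry> byId = new ConcurrentHashMap<>();

    // Called once at index load: normalize every primary name and alias up front.
    NameCacheSketch(List<Entity> index) {
        for (Entity e : index) {
            List<String> aliases =
                    e.aliases().stream().map(NameCacheSketch::normalize).toList();
            byId.put(e.id(), new NormalizedEntry(normalize(e.primaryName()), aliases));
        }
    }

    // Query-time lookup: a single map read, with no regex or string allocation.
    NormalizedEntry get(Entity entity) {
        return byId.get(entity.id());
    }

    static String normalize(String name) {
        if (name == null || name.isBlank()) {
            return "";
        }
        return WHITESPACE.matcher(name.strip().toLowerCase(Locale.ROOT)).replaceAll(" ");
    }
}
```

With this shape the normalization cost is paid once per entity at load time, and each query only normalizes its own input string.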

Thread Safety

NormalizedNameCache uses a ConcurrentHashMap for thread-safe concurrent reads.
The cache rebuilds automatically (with double-checked locking) when the index size changes, e.g., after a list refresh.
NameNormalizer’s static cache is also a ConcurrentHashMap with a bounded size to prevent unbounded memory growth.
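
The rebuild-on-size-change behaviour can be sketched as follows. This is an illustration of double-checked locking in general, not the actual NormalizedNameCache code: the class RebuildingCacheSketch, the Snapshot record, the rebuildCount helper, and the use of a plain List<String> of raw names as the index are all assumptions made for the example.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class RebuildingCacheSketch {
    // Immutable pairing of the index size with the names normalized for it.
    private record Snapshot(int indexSize, Map<String, String> normalized) {}

    private final Object lock = new Object();
    // volatile so the unlocked first check sees the latest published snapshot
    private volatile Snapshot snapshot = new Snapshot(-1, Map.of());
    private int rebuilds = 0; // guarded by lock

    String get(List<String> index, String rawName) {
        Snapshot s = snapshot;
        if (s.indexSize() != index.size()) {          // first check, no lock
            synchronized (lock) {
                s = snapshot;
                if (s.indexSize() != index.size()) {  // second check, under lock
                    s = rebuild(index);
                    snapshot = s;                     // publish the new snapshot
                    rebuilds++;
                }
            }
        }
        return s.normalized().get(rawName);
    }

    // Hypothetical helper, exposed only so the sketch can be inspected.
    int rebuildCount() {
        synchronized (lock) {
            return rebuilds;
        }
    }

    private static Snapshot rebuild(List<String> index) {
        Map<String, String> map = new ConcurrentHashMap<>();
        for (String raw : index) {
            map.put(raw, raw.strip().toLowerCase(Locale.ROOT).replaceAll("\\s+", " "));
        }
        return new Snapshot(index.size(), map);
    }
}
```

Readers take the lock only when the size check fails, so steady-state lookups stay lock-free. A size comparison alone would not notice a refresh that replaces entries while keeping the count the same; a version counter would be one way to cover that case.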