Normalization and Validation

OpenToken applies consistent normalization and validation rules to ensure accurate, reproducible token generation across different data sources.

Overview

Every attribute goes through two phases:

Normalization: Transform input to a standard format
Validation: Verify the normalized value meets requirements

Input → Normalize → Validate → Token Generation
         ↓            ↓
      Standard     Accept/
       Format      Reject

Name Attributes

First Name / Last Name

Normalization Steps:

Trim leading/trailing whitespace
Convert to uppercase
Remove diacritics (é → E, ñ → N)
Remove titles: Dr, Mr, Mrs, Ms, Miss, Prof
Remove suffixes: Jr, Sr, II, III, IV, PhD, MD
Remove non-alphabetic characters (spaces, apostrophes, hyphens, periods)

Examples:

Input	Normalized
`" John "`	`"JOHN"`
`"María"`	`"MARIA"`
`"Dr. Smith"`	`"SMITH"`
`"O'Brien"`	`"OBRIEN"`
`"García Jr."`	`"GARCIA"`

Validation:

Must not be empty after normalization
No numeric characters allowed

Birth Date

Normalization:

Parse multiple date formats
Output as yyyy-MM-dd (ISO 8601)

Accepted Input Formats:

Format	Example
`yyyy-MM-dd`	1985-03-15
`MM/dd/yyyy`	03/15/1985
`M/d/yyyy`	3/15/1985
`dd-MMM-yyyy`	15-Mar-1985

Validation Rules:

Rule	Value
Minimum date	1910-01-01
Maximum date	Today
Age range	0-115 years

Examples:

Input	Normalized	Valid
`"03/15/1985"`	`"1985-03-15"`	✓
`"1985-03-15"`	`"1985-03-15"`	✓
`"1899-01-01"`	—	✗ (before 1910)
`"2099-01-01"`	—	✗ (future date)

Normalization:

Remove all non-digit characters
Format as XXX-XX-XXXX

Validation Rules:

Rule	Description
Length	Exactly 9 digits
Area (first 3)	Not 000, 666, or 900-999
Group (middle 2)	Not 00
Serial (last 4)	Not 0000
Known invalid	Reject common test SSNs

Invalid SSN Patterns:

000-XX-XXXX  (Invalid area)
666-XX-XXXX  (Invalid area)
9XX-XX-XXXX  (Reserved for ITIN)
XXX-00-XXXX  (Invalid group)
XXX-XX-0000  (Invalid serial)
111-11-1111  (Common test pattern)
123-45-6789  (Common test pattern)

Examples:

Input	Normalized	Valid
`"123-45-6789"`	—	✗ (test SSN)
`"078-05-1120"`	`"078-05-1120"`	✓
`"000-12-3456"`	—	✗ (area = 000)
Digits-only test input	`"123-45-6789"`	✗ (test SSN)

Sex

Normalization:

Convert to uppercase
Map to single character

Accepted Values:

Input	Normalized
`"M"`, `"Male"`, `"m"`	`"M"`
`"F"`, `"Female"`, `"f"`	`"F"`
`"U"`, `"Unknown"`, `"u"`	`"U"`

Validation:

Must be M, F, or U after normalization

Postal Code

Normalization by Country:

US ZIP Codes:

Extract first 5 digits
Accept 5-digit or ZIP+4 format

Canadian Postal Codes:

Uppercase
Format as A1A 1A1

Examples:

Input	Normalized	Valid
`"98101"`	`"98101"`	✓
`"98101-1234"`	`"98101"`	✓
`"k1a 0b1"`	`"K1A 0B1"`	✓
`"00000"`	—	✗ (placeholder)
`"12345"`	—	✗ (test ZIP)

Invalid Postal Codes:

(Placeholder)
(Common test)
(Placeholder)

ZIP3 (Partial Postal Code)

Purpose: Broader geographic matching with reduced precision

Normalization:

Extract first 3 digits of postal code

Examples:

Postal Code	ZIP3
`98101`	`981`
`98101-1234`	`981`
`10001`	`100`

Validation Architecture

Composable Validators

Validators can be chained for complex rules:

// Java example
List<Validator> validators = List.of(
    new RegexValidator("^[A-Z]+$"),
    new MinLengthValidator(2),
    new MaxLengthValidator(50)
);

# Python example
validators = [
    RegexValidator(r"^[A-Z]+$"),
    MinLengthValidator(2),
    MaxLengthValidator(50)
]

Built-in Validators

Validator	Purpose
`RegexValidator`	Pattern matching
`DateRangeValidator`	Date bounds
`AgeRangeValidator`	Age at reference date
`MinLengthValidator`	Minimum string length
`MaxLengthValidator`	Maximum string length

Thread Safety

All normalization and validation operations are thread-safe:

Regex patterns are pre-compiled and immutable
Date formatters use DateTimeFormatter (thread-safe)
No shared mutable state

Handling Invalid Data

When validation fails:

Token is not generated for that rule
Processing continues to other rules
Metadata tracks aggregate counts (not per-record reasons)
Original data preserved (not modified)

{
  "TotalRows": 105,
  "TotalRowsWithInvalidAttributes": 12,
  "InvalidAttributesByType": {
    "BirthDate": 3,
    "FirstName": 1,
    "LastName": 2,
    "PostalCode": 2,
    "Sex": 0,
    "SocialSecurityNumber": 4
  },
  "BlankTokensByRule": {
    "T1": 6,
    "T2": 8,
    "T3": 6,
    "T4": 7,
    "T5": 3
  }
}

Notes:

.metadata.json does not include per-record invalid reasons (like INVALID_LENGTH) or per-record token skip details.
To understand why a specific row didn’t generate a token, inspect that row’s output token columns (blank means not generated).
For the full schema, see Metadata Format.

OpenToken

Privacy-preserving tokenization and matching library for secure PII-based person linkage

Normalization and Validation

Overview

Name Attributes

First Name / Last Name

Birth Date

Sex

Postal Code

ZIP3 (Partial Postal Code)

Validation Architecture

Composable Validators

Built-in Validators

Thread Safety

Handling Invalid Data

Further Reading

Normalization and Validation

Overview

Name Attributes

First Name / Last Name

Birth Date

Social Security Number (SSN)

Sex

Postal Code

ZIP3 (Partial Postal Code)

Validation Architecture

Composable Validators

Built-in Validators

Thread Safety

Handling Invalid Data

Further Reading