Normalization and Validation
OpenToken applies consistent normalization and validation rules to ensure accurate, reproducible token generation across different data sources.
Overview
Every attribute goes through two phases:
- Normalization: Transform input to a standard format
- Validation: Verify the normalized value meets requirements
Input → Normalize → Validate → Token Generation
↓ ↓
Standard Accept/
Format Reject
Name Attributes
First Name / Last Name
Normalization Steps:
- Trim leading/trailing whitespace
- Convert to uppercase
- Remove diacritics (é → E, ñ → N)
- Remove titles: Dr, Mr, Mrs, Ms, Miss, Prof
- Remove suffixes: Jr, Sr, II, III, IV, PhD, MD
- Remove non-alphabetic characters (spaces, apostrophes, hyphens, periods)
Examples:
| Input | Normalized |
|---|---|
" John " |
"JOHN" |
"María" |
"MARIA" |
"Dr. Smith" |
"SMITH" |
"O'Brien" |
"OBRIEN" |
"García Jr." |
"GARCIA" |
Validation:
- Must not be empty after normalization
- No numeric characters allowed
Birth Date
Normalization:
- Parse multiple date formats
- Output as
yyyy-MM-dd(ISO 8601)
Accepted Input Formats:
| Format | Example |
|---|---|
yyyy-MM-dd |
1985-03-15 |
MM/dd/yyyy |
03/15/1985 |
M/d/yyyy |
3/15/1985 |
dd-MMM-yyyy |
15-Mar-1985 |
Validation Rules:
| Rule | Value |
|---|---|
| Minimum date | 1910-01-01 |
| Maximum date | Today |
| Age range | 0-115 years |
Examples:
| Input | Normalized | Valid |
|---|---|---|
"03/15/1985" |
"1985-03-15" |
✓ |
"1985-03-15" |
"1985-03-15" |
✓ |
"1899-01-01" |
— | ✗ (before 1910) |
"2099-01-01" |
— | ✗ (future date) |
Social Security Number (SSN)
Normalization:
- Remove all non-digit characters
- Format as
XXX-XX-XXXX
Validation Rules:
| Rule | Description |
|---|---|
| Length | Exactly 9 digits |
| Area (first 3) | Not 000, 666, or 900-999 |
| Group (middle 2) | Not 00 |
| Serial (last 4) | Not 0000 |
| Known invalid | Reject common test SSNs |
Invalid SSN Patterns:
000-XX-XXXX (Invalid area)
666-XX-XXXX (Invalid area)
9XX-XX-XXXX (Reserved for ITIN)
XXX-00-XXXX (Invalid group)
XXX-XX-0000 (Invalid serial)
111-11-1111 (Common test pattern)
123-45-6789 (Common test pattern)
Examples:
| Input | Normalized | Valid |
|---|---|---|
"123-45-6789" |
— | ✗ (test SSN) |
"078-05-1120" |
"078-05-1120" |
✓ |
"000-12-3456" |
— | ✗ (area = 000) |
| Digits-only test input | "123-45-6789" |
✗ (test SSN) |
Sex
Normalization:
- Convert to uppercase
- Map to single character
Accepted Values:
| Input | Normalized |
|---|---|
"M", "Male", "m" |
"M" |
"F", "Female", "f" |
"F" |
"U", "Unknown", "u" |
"U" |
Validation:
- Must be M, F, or U after normalization
Postal Code
Normalization by Country:
US ZIP Codes:
- Extract first 5 digits
- Accept 5-digit or ZIP+4 format
Canadian Postal Codes:
- Uppercase
- Format as
A1A 1A1
Examples:
| Input | Normalized | Valid |
|---|---|---|
"98101" |
"98101" |
✓ |
"98101-1234" |
"98101" |
✓ |
"k1a 0b1" |
"K1A 0B1" |
✓ |
"00000" |
— | ✗ (placeholder) |
"12345" |
— | ✗ (test ZIP) |
Invalid Postal Codes:
00000 (Placeholder)
12345 (Common test)
99999 (Placeholder)
ZIP3 (Partial Postal Code)
Purpose: Broader geographic matching with reduced precision
Normalization:
- Extract first 3 digits of postal code
Examples:
| Postal Code | ZIP3 |
|---|---|
98101 |
981 |
98101-1234 |
981 |
10001 |
100 |
Validation Architecture
Composable Validators
Validators can be chained for complex rules:
// Java example
List<Validator> validators = List.of(
new RegexValidator("^[A-Z]+$"),
new MinLengthValidator(2),
new MaxLengthValidator(50)
);
# Python example
validators = [
RegexValidator(r"^[A-Z]+$"),
MinLengthValidator(2),
MaxLengthValidator(50)
]
Built-in Validators
| Validator | Purpose |
|---|---|
RegexValidator |
Pattern matching |
DateRangeValidator |
Date bounds |
AgeRangeValidator |
Age at reference date |
MinLengthValidator |
Minimum string length |
MaxLengthValidator |
Maximum string length |
Thread Safety
All normalization and validation operations are thread-safe:
- Regex patterns are pre-compiled and immutable
- Date formatters use
DateTimeFormatter(thread-safe) - No shared mutable state
Handling Invalid Data
When validation fails:
- Token is not generated for that rule
- Processing continues to other rules
- Metadata tracks aggregate counts (not per-record reasons)
- Original data preserved (not modified)
{
"TotalRows": 105,
"TotalRowsWithInvalidAttributes": 12,
"InvalidAttributesByType": {
"BirthDate": 3,
"FirstName": 1,
"LastName": 2,
"PostalCode": 2,
"Sex": 0,
"SocialSecurityNumber": 4
},
"BlankTokensByRule": {
"T1": 6,
"T2": 8,
"T3": 6,
"T4": 7,
"T5": 3
}
}
Notes:
.metadata.jsondoes not include per-record invalid reasons (likeINVALID_LENGTH) or per-record token skip details.- To understand why a specific row didn’t generate a token, inspect that row’s output token columns (blank means not generated).
- For the full schema, see Metadata Format.
Further Reading
- Matching Model - How normalized data becomes tokens
- Token Rules - Which attributes each rule uses
- API Reference - Programmatic attribute handling