Matching Model

OpenToken uses a multi-rule tokenization strategy to enable privacy-preserving person matching across datasets that contain PII.

Overview

The matching model generates cryptographically secure tokens from personally identifiable information (PII) without exposing the underlying data. Different token rules balance precision (fewer false positives) against recall (fewer missed matches).

┌─────────────────────────────────────────────────────────────────┐
│                    Person Record (PII)                          │
│  Name, DOB, SSN, Sex, Postal Code                              │
└───────────────────────┬─────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Normalization                                │
│  - Remove titles/suffixes                                       │
│  - Strip diacritics                                             │
│  - Standardize formats                                          │
└───────────────────────┬─────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Validation                                   │
│  - Date ranges                                                  │
│  - SSN patterns                                                 │
│  - Required fields                                              │
└───────────────────────┬─────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────────┐
│                Token Generation (T1-T5)                         │
│  - Concatenate attributes per rule                              │
│  - HMAC-SHA256 hash                                             │
│  - Optional AES-256 encryption                                  │
└─────────────────────────────────────────────────────────────────┘

Why Multiple Token Rules?

Real-world data is messy:

Names may have typos or variations
Dates may be recorded differently
SSNs may be missing or partially known
Addresses change over time

Using five distinct rules allows matching at different confidence levels:

OpenToken emits tokens with a RuleId of T1–T5. These identifiers are rule names, not “tiers” (they don’t imply an ordering). In practice, different rules tend to trade off precision vs. recall based on which attributes they include.

RuleId	Attributes (normalized signature)	Typical use
T1	Last + First initial + Sex + BirthDate	Higher recall; tolerates first-name variation
T2	Last + First + BirthDate + ZIP-3	Adds geography; helps when sex is unreliable or missing
T3	Last + First + Sex + BirthDate	Higher precision; stricter than T1
T4	SSN (digits) + Sex + BirthDate	Very high precision when SSN is present
T5	Last + First[0:3] + Sex	Highest recall / lowest precision; use cautiously
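The table above can be sketched as a small function that builds one signature per rule from an already-normalized record. This is a minimal illustration, not OpenToken's implementation: the function name, the dictionary field names, and the choice to return None when a required attribute is missing are all assumptions made for the example. The `|` delimiter follows the signatures shown in the worked example later on this page.

```python
def build_signatures(rec: dict) -> dict:
    """Return {rule_id: signature or None} for one normalized record.

    Field names ("last", "first", ...) are hypothetical, chosen for this sketch.
    """
    last, first = rec["last"], rec["first"]
    sex, dob = rec["sex"], rec["birth_date"]
    zip3 = rec["postal_code"][:3] if rec.get("postal_code") else None
    ssn = rec.get("ssn")
    ssn_digits = "".join(ch for ch in ssn if ch.isdigit()) if ssn else None

    def join(*parts):
        # Skip the rule entirely when any required attribute is missing or empty.
        return "|".join(parts) if all(parts) else None

    return {
        "T1": join(last, first[:1], sex, dob),   # first initial only
        "T2": join(last, first, dob, zip3),      # ZIP-3 instead of sex
        "T3": join(last, first, sex, dob),       # full first name
        "T4": join(ssn_digits, sex, dob),        # SSN digits only
        "T5": join(last, first[:3], sex),        # 3-character prefix, no birth date
    }


record = {
    "last": "GARCIA", "first": "MARIA", "sex": "F",
    "birth_date": "1988-03-22", "postal_code": "90210", "ssn": "452-38-7291",
}
for rule_id, signature in build_signatures(record).items():
    print(rule_id, signature)
```

Returning None for an incomplete rule keeps a missing attribute from silently producing a token that could match other incomplete records.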

Token Rules Summary

T1: Last Name + First Initial + Sex + BirthDate

LastName + FirstName (first initial) + Sex + BirthDate

Designed for higher recall when first names vary (e.g., nicknames) by using only the first initial.

Strengths: Tolerates first-name variation; good candidate generator
Limitations: Lower precision than full-name rules (only first initial)

T2: Last Name + First Name + BirthDate + ZIP-3

LastName + FirstName + BirthDate + PostalCode (ZIP-3)

Uses full first name and adds a coarse location signal (ZIP-3).

Strengths: Adds geography; useful when sex is missing/unreliable
Limitations: Requires postal code; ZIP can change over time

T3: Last Name + First Name + Sex + BirthDate

LastName + FirstName + Sex + BirthDate

More specific than T1 (uses full first name), so it tends to be higher precision but less tolerant of first-name variation.

Strengths: Higher precision when names are stable
Limitations: Full first name required; more sensitive to name variation/typos

T4: SSN (Digits) + Sex + BirthDate

SSN (digits only) + Sex + BirthDate

Very high precision when SSN is present and valid.

Strengths: Highest precision
Limitations: Requires SSN (often missing)

T5: Last Name + First 3 Letters + Sex

LastName + FirstName (first 3 chars) + Sex

Uses only the first 3 characters of first name.

Strengths: Highest recall for first-name variation
Limitations: Lower precision; no birth date in the signature

Matching Strategies

Union Matching (High Recall)

Match if any token rule matches:

Match = T1 ∨ T2 ∨ T3 ∨ T4 ∨ T5

Use for: Research studies, broad population analysis
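As a rough sketch of the union predicate, assume each record's tokens are held in a dictionary keyed by RuleId, with None for rules that could not be generated. The function name and the dictionary shape are assumptions for illustration only.

```python
def union_match(tokens_a: dict, tokens_b: dict) -> bool:
    """True if at least one rule produced the same token on both sides."""
    shared_rules = tokens_a.keys() & tokens_b.keys()
    return any(
        # A missing token (None) never counts as agreement.
        tokens_a[rule] is not None and tokens_a[rule] == tokens_b[rule]
        for rule in shared_rules
    )


a = {"T1": "tok-1", "T3": "tok-3a", "T4": None}
b = {"T1": "tok-1", "T3": "tok-3b", "T4": None}
print(union_match(a, b))  # True: T1 agrees
```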

Intersection Matching (High Precision)

Match only if multiple rules match:

Match = (T1 ∨ T2) ∧ (T3 ∨ T4)

Use for: Clinical record linkage, regulatory requirements
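The intersection formula above translates directly into a boolean expression. The sketch below assumes tokens are stored per record in a dictionary keyed by RuleId, with None for rules that could not be generated; the helper names are hypothetical.

```python
def rule_agrees(tokens_a: dict, tokens_b: dict, rule: str) -> bool:
    """True only when both sides have a token for the rule and the tokens are equal."""
    a, b = tokens_a.get(rule), tokens_b.get(rule)
    return a is not None and a == b


def intersection_match(tokens_a: dict, tokens_b: dict) -> bool:
    """Match = (T1 or T2) and (T3 or T4)."""
    def m(rule: str) -> bool:
        return rule_agrees(tokens_a, tokens_b, rule)

    return (m("T1") or m("T2")) and (m("T3") or m("T4"))


a = {"T1": "p", "T2": "q", "T3": "r", "T4": None}
b = {"T1": "p", "T2": "x", "T3": "y", "T4": None}
print(intersection_match(a, b))  # False: T1 agrees, but neither T3 nor T4 does
```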

Tiered Matching

Apply rules in sequence, stopping at first match:

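def tiered_match(record_a, record_b, rules):
    # rules: rule objects in the order your policy prefers, e.g. [T1, T2, T3, T4, T5]
    for rule in rules:
        if rule.matches(record_a, record_b):
            return (True, rule.confidence)
    return (False, None)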

Use for: Adaptive matching based on data quality

Collision Resistance

Token security relies on:

HMAC-SHA256: Keyed cryptographic hash (message authentication code)
Secret keys: Organization-specific hashing keys
Normalization: Consistent input formatting
AES-256: Optional encryption layer

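Two different inputs producing the same token (a hash collision) is statistically negligible when:

Keys are properly managed
Input data is correctly normalized

This applies to hash collisions only. Two different people whose normalized attributes are identical under a rule (for example, the same name, sex, and birth date under T3) produce the same token for that rule.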
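The properties listed above can be illustrated with Python's standard `hmac` and `hashlib` modules. This sketch assumes a token is simply HMAC-SHA256 over the signature string, Base64-encoded; it omits the optional AES-256 layer and any other steps the library performs, and the key is a placeholder.

```python
import base64
import hashlib
import hmac


def hmac_token(signature: str, key: bytes) -> str:
    """HMAC-SHA256 over the UTF-8 signature, Base64-encoded (illustrative only)."""
    digest = hmac.new(key, signature.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


key = b"placeholder-key-for-illustration"  # never hard-code real keys

t_norm = hmac_token("GARCIA|MARIA|F|1988-03-22", key)
t_again = hmac_token("GARCIA|MARIA|F|1988-03-22", key)
t_raw = hmac_token("García|María|F|1988-03-22", key)  # not normalized

print(len(t_norm))        # 44: Base64 of a 32-byte digest
print(t_norm == t_again)  # True: same key and input give the same token
print(t_norm == t_raw)    # False: un-normalized input gives a different token
```

The last comparison shows why normalization appears in the list: the hash treats any byte-level difference in the input as a different person.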

Worked Example: From Person Data to Tokens and Matches

This section walks through a complete example: raw input data → normalization → token generation → matching decisions.

Sample Dataset

Consider four fictional person records from two different systems:

RecordId	FirstName	LastName	BirthDate	Sex	PostalCode	SSN
HOS-101	María	García Jr.	03/22/1988	Female	90210	452-38-7291
HOS-102	tom	O’Reilly	1995-11-03	M	30301-4455	671-82-9134
CLN-201	Maria	Garcia	1988-03-22	F	90210	452-38-7291
CLN-202	Thomas	O’Reilly	11/03/1995	Male	30301	—

HOS- records come from a hospital system; CLN- records come from a clinic.

Step 1: Normalization

OpenToken normalizes each field before token generation. For full rules, see Normalization and Validation.

RecordId	FirstName	LastName	BirthDate	Sex	PostalCode	SSN
HOS-101	MARIA	GARCIA	1988-03-22	F	90210	452-38-7291
HOS-102	TOM	OREILLY	1995-11-03	M	30301	671-82-9134
CLN-201	MARIA	GARCIA	1988-03-22	F	90210	452-38-7291
CLN-202	THOMAS	OREILLY	1995-11-03	M	30301	—

What changed:

María → MARIA: Diacritic removed, uppercased
García Jr. → GARCIA: Suffix removed, diacritic removed, uppercased
tom → TOM: Uppercased
03/22/1988 → 1988-03-22: Date reformatted to ISO 8601
30301-4455 → 30301: ZIP+4 truncated to 5 digits
SSN missing (CLN-202): Noted; T4 will be skipped for this record
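The changes listed above can be approximated with the standard library. This is a simplified sketch that covers only the cases in the sample dataset; the suffix list, the accepted date formats, and the function names are assumptions for illustration, not OpenToken's actual rules.

```python
import re
import unicodedata
from datetime import datetime

SUFFIXES = {"JR", "SR", "II", "III", "IV"}   # hypothetical, not the library's list
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")      # only the two formats in the sample


def normalize_name(raw: str) -> str:
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", raw)
    unaccented = "".join(c for c in decomposed if not unicodedata.combining(c))
    words = [w.strip(".,").upper() for w in unaccented.split()]
    words = [w for w in words if w and w not in SUFFIXES]
    # Keep A-Z only, which also removes apostrophes and spaces.
    return re.sub(r"[^A-Z]", "", "".join(words))


def normalize_date(raw: str) -> str:
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")


def normalize_sex(raw: str) -> str:
    return raw.strip().upper()[:1]   # "Female" -> "F", "M" -> "M"


def normalize_zip(raw: str) -> str:
    return raw.strip()[:5]           # drops the ZIP+4 extension


print(normalize_name("García Jr."), normalize_date("03/22/1988"),
      normalize_sex("Female"), normalize_zip("30301-4455"))
```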

Step 2: Token Generation

Each record produces up to five tokens (T1–T5). Tokens are Base64-encoded hashes; the examples below are illustrative placeholders, truncated for display (encrypted tokens are roughly 88 characters long).

For detailed rule compositions, see Token Rules.

HOS-101 (María García, 1988-03-22):

Rule	Token Signature	Illustrative Token
T1	`GARCIA|M|F|1988-03-22`	`Xk9mT2pLc1VhR3dNZUZ...`
T2	`GARCIA|MARIA|1988-03-22|902`	`bHdRa0VuWXBCdkxhTnI...`
T3	`GARCIA|MARIA|F|1988-03-22`	`cTdYc1pNdkpUa2JQeHo...`
T4	`452387291|F|1988-03-22`	`ZnBOdFdtS2haQWdWcko...`
T5	`GARCIA|MAR|F`	`RWtqVXhMY0dTcldmbVk...`

CLN-201 (Maria Garcia, 1988-03-22):

Rule	Token Signature	Illustrative Token
T1	`GARCIA|M|F|1988-03-22`	`Xk9mT2pLc1VhR3dNZUZ...`
T2	`GARCIA|MARIA|1988-03-22|902`	`bHdRa0VuWXBCdkxhTnI...`
T3	`GARCIA|MARIA|F|1988-03-22`	`cTdYc1pNdkpUa2JQeHo...`
T4	`452387291|F|1988-03-22`	`ZnBOdFdtS2haQWdWcko...`
T5	`GARCIA|MAR|F`	`RWtqVXhMY0dTcldmbVk...`

Observation: HOS-101 and CLN-201 produce identical tokens for all five rules because their normalized attributes are identical.

HOS-102 (tom O’Reilly, 1995-11-03):

Rule	Token Signature	Illustrative Token
T1	`OREILLY|T|M|1995-11-03`	`UXdlcnR5VWlPcEFzRGZ...`
T2	`OREILLY|TOM|1995-11-03|303`	`WnhjdmJubUtMbUpIR2d...`
T3	`OREILLY|TOM|M|1995-11-03`	`QWxza2RqZmhHa0xQb1p...`
T4	`671829134|M|1995-11-03`	`TW5iVmN4WmFRd0VyVHl...`
T5	`OREILLY|TOM|M`	`SWp1aHlHdEZyRGVTd1d...`

CLN-202 (Thomas O’Reilly, 1995-11-03, no SSN):

Rule	Token Signature	Illustrative Token
T1	`OREILLY|T|M|1995-11-03`	`UXdlcnR5VWlPcEFzRGZ...`
T2	`OREILLY|THOMAS|1995-11-03|303`	`S2p1aHlHdEZyRGVWd1h...`
T3	`OREILLY|THOMAS|M|1995-11-03`	`VXl0ckVXcUFzRGZHaEp...`
T4	— (SSN missing)	Not generated
T5	`OREILLY|THO|M`	`QmFzZTY0UExhY2Vob2w...`

Observation: HOS-102 and CLN-202 match on T1 (first initial) even though the full first name differs (TOM vs THOMAS). They do not match on T2 or T3, which require the full first name, or on T5, whose 3-character prefixes differ (TOM vs THO). T4 cannot be compared because CLN-202 has no SSN.

Step 3: Matching Decisions

When comparing tokens across the two systems:

Record Pair	T1	T2	T3	T4	T5	Match?
HOS-101 ↔ CLN-201	✓	✓	✓	✓	✓	Yes (all rules)
HOS-102 ↔ CLN-202	✓	✗	✗	—	✗	Depends (T1 only)
HOS-101 ↔ HOS-102	✗	✗	✗	✗	✗	No
CLN-201 ↔ CLN-202	✗	✗	✗	✗	✗	No

Interpretation:

HOS-101 and CLN-201 are the same person. Despite surface differences (“María” vs “Maria”, suffix “Jr.”), normalization produces identical attributes, so all tokens match.
HOS-102 and CLN-202 may be the same person, but only match on T1. Depending on your matching policy, you might require multiple-rule agreement (higher precision) or accept single-rule matches (higher recall).
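The per-rule comparison in the table can be sketched as follows. Signatures stand in for tokens here, on the assumption that both systems share a key, so equal signatures yield equal tokens. The function name and the three outcome labels are choices made for this example.

```python
RULES = ("T1", "T2", "T3", "T4", "T5")


def compare_rules(tokens_a: dict, tokens_b: dict) -> dict:
    """Label each rule 'match', 'no match', or 'not comparable'."""
    outcome = {}
    for rule in RULES:
        a, b = tokens_a.get(rule), tokens_b.get(rule)
        if a is None or b is None:
            outcome[rule] = "not comparable"   # token missing on at least one side
        else:
            outcome[rule] = "match" if a == b else "no match"
    return outcome


hos_102 = {
    "T1": "OREILLY|T|M|1995-11-03",
    "T2": "OREILLY|TOM|1995-11-03|303",
    "T3": "OREILLY|TOM|M|1995-11-03",
    "T4": "671829134|M|1995-11-03",
    "T5": "OREILLY|TOM|M",
}
cln_202 = {
    "T1": "OREILLY|T|M|1995-11-03",
    "T2": "OREILLY|THOMAS|1995-11-03|303",
    "T3": "OREILLY|THOMAS|M|1995-11-03",
    "T4": None,                                # SSN missing
    "T5": "OREILLY|THO|M",
}

outcome = compare_rules(hos_102, cln_202)
matched = [rule for rule, label in outcome.items() if label == "match"]
print(outcome)
print(matched)  # ['T1']
```

Keeping "not comparable" distinct from "no match" lets a matching policy treat missing data differently from conflicting data.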

Key Takeaways

Normalization is critical. Two records with superficially different inputs (accents, suffixes, date formats) can match perfectly after normalization.
Missing attributes reduce matching opportunities. CLN-202 couldn’t generate T4 because SSN was missing.
Name variations may prevent matches. “Tom” vs “Thomas” is a common real-world issue; T5’s 3-character prefix helps only if the first 3 letters are identical.
Multiple rules provide fallback. Even if T4 can’t be generated (SSN missing), other rules may still match if other attributes align.

Related Pages

Token Rules: Detailed composition of T1–T5
Normalization and Validation: Full normalization and validation rules

Cross-Organization Matching

For records from different organizations to match:

Same keys: Both organizations must use identical hashing/encryption keys
Same normalization: Attribute processing must be identical
Same rules: Token generation logic must match exactly

OpenToken ensures this through:

Dual Java/Python implementations with byte-identical outputs
Comprehensive normalization documentation
Interoperability testing
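The "same keys" requirement can be demonstrated with a short experiment. The sketch below assumes tokens are plain HMAC-SHA256 values, Base64-encoded, and uses placeholder keys. It is not the library's full pipeline, only an illustration of why mismatched keys prevent any match.

```python
import base64
import hashlib
import hmac


def tokenize(signature: str, key: bytes) -> str:
    """Illustrative token: Base64 of HMAC-SHA256 over the signature."""
    mac = hmac.new(key, signature.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(mac).decode("ascii")


signature = "GARCIA|MARIA|F|1988-03-22"
shared_key = b"placeholder-shared-key"
other_key = b"placeholder-other-key"

hospital = tokenize(signature, shared_key)
clinic_same_key = tokenize(signature, shared_key)
clinic_other_key = tokenize(signature, other_key)

print(hospital == clinic_same_key)   # True: identical key, signature, and logic
print(hospital == clinic_other_key)  # False: a different key breaks matching
```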
