Overview

What is OpenToken?

OpenToken is a privacy-preserving tokenization and matching library for secure person linkage using PII-derived attributes. It generates cryptographically secure matching tokens from person attributes, enabling matching across datasets without directly comparing names, birthdates, SSNs, and other sensitive identifiers.

Both Java and Python implementations produce byte-identical tokens for the same normalized input, enabling flexible deployment and cross-language workflows.

The Problem

Organizations often need to match people across datasets, finding the same person across systems and over time. Direct comparison of names and birthdates raises privacy concerns and is error-prone because of typos and data quality variations. OpenToken addresses this by generating deterministic cryptographic tokens from normalized person data.

The Solution

Instead of storing or comparing raw person attributes:

John Doe | 1975-03-15 | 98004 → [STORED OR COMPARED]

OpenToken generates secure tokens derived from those attributes:

John Doe | 1975-03-15 | 98004 → SHA-256 HASH → HMAC-SHA256 → AES-256 ENCRYPT → Token

Matching is done by comparing the encrypted tokens, not the original data.
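
As a rough illustration of matching on tokens rather than raw attributes, here is a minimal Python sketch. It is not OpenToken's API: the `toy_token` helper, the `|` delimiter, the record IDs, and the key are all hypothetical, and a single keyed HMAC-SHA256 stands in for the full pipeline.

```python
import hashlib
import hmac

# Hypothetical shared secret; both dataset holders must use the same key.
SHARED_KEY = b"example-hashing-secret"

def toy_token(name: str, birthdate: str, postal_code: str) -> str:
    """Stand-in for a matching token: a keyed hash of the joined attributes."""
    signature = "|".join([name.upper(), birthdate, postal_code])
    return hmac.new(SHARED_KEY, signature.encode("utf-8"), hashlib.sha256).hexdigest()

# Each dataset holder tokenizes locally and shares only token -> record ID.
dataset_a = {
    toy_token("John Doe", "1975-03-15", "98004"): "A-001",
    toy_token("Jane Roe", "1980-07-01", "10001"): "A-002",
}
dataset_b = {
    toy_token("JOHN DOE", "1975-03-15", "98004"): "B-917",
    toy_token("Ann Poe", "1990-12-24", "60601"): "B-918",
}

# Records match when their tokens are equal; no raw attribute is compared.
matches = [(dataset_a[t], dataset_b[t]) for t in dataset_a.keys() & dataset_b.keys()]
print(matches)  # [('A-001', 'B-917')]
```

The two John Doe records match despite the different casing because the sketch uppercases names before hashing, a small stand-in for the normalization step.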

How It Works

Input: Person records with attributes (name, birthdate, SSN, postal code, sex)
Validation & Normalization: Attributes are validated and normalized (uppercase, diacritic removal, title stripping)
Token Generation: Multiple token rules (T1–T5) combine different attributes
Encryption: Token signatures are hashed with SHA-256 and HMAC-SHA256, then encrypted with AES-256
Output: Tokens for matching, plus metadata
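
The normalization step above (uppercase, diacritic removal, title stripping) could be sketched in Python as follows. This is an illustration under assumptions, not the library's implementation: `normalize_name` and the `TITLES` set are hypothetical, and the real title list and edge-case handling are not described here.

```python
import unicodedata

# Hypothetical title list; the library's actual list is not given in this overview.
TITLES = {"MR", "MRS", "MS", "DR", "PROF"}

def normalize_name(raw: str) -> str:
    """Uppercase a name, strip diacritics, and drop a leading title."""
    # NFKD splits an accented letter into its base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", raw)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # split() also collapses repeated whitespace.
    words = without_marks.upper().split()
    # Drop a leading title such as "DR." only when a name follows it.
    if len(words) > 1 and words[0].rstrip(".") in TITLES:
        words = words[1:]
    return " ".join(words)

print(normalize_name("Dr. José  García"))  # JOSE GARCIA
```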

Key Concepts

Token Generation Rules

OpenToken uses 5 distinct token rules (T1–T5) that define which attributes combine to form each token. Each rule targets different matching scenarios:

Rule	Definition	Use Case
T1	Last name + first initial + sex + birthdate	Standard matching
T2	Last name + full first name + birthdate + ZIP-3	Data with varied names
T3	Last name + full first name + sex + birthdate	Higher precision
T4	SSN + sex + birthdate	Authoritative identifier
T5	Last name + first 3 letters of first name + sex	Quick search
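
To make the table concrete, the sketch below builds one signature string per rule from an already-normalized record. The `Person` class, the `token_signatures` function, the field names, and the `|` delimiter are assumptions made for illustration. It also reads T5's "first 3 letters" as the first three letters of the first name, and ZIP-3 as the first three characters of the postal code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Person:
    """Hypothetical container for already-normalized attributes."""
    first_name: str
    last_name: str
    sex: str
    birth_date: str   # YYYY-MM-DD
    postal_code: str
    ssn: str

def token_signatures(p: Person) -> dict[str, str]:
    """Build one signature per rule; the '|' delimiter is an assumption."""
    rules = {
        "T1": [p.last_name, p.first_name[:1], p.sex, p.birth_date],
        "T2": [p.last_name, p.first_name, p.birth_date, p.postal_code[:3]],
        "T3": [p.last_name, p.first_name, p.sex, p.birth_date],
        "T4": [p.ssn, p.sex, p.birth_date],
        "T5": [p.last_name, p.first_name[:3], p.sex],
    }
    return {rule: "|".join(parts) for rule, parts in rules.items()}

# Obviously fake example record.
person = Person("JOHN", "DOE", "MALE", "1975-03-15", "98004", "123-45-6789")
for rule, signature in token_signatures(person).items():
    print(rule, signature)
```

Because each rule draws on a different subset of attributes, two records that disagree on one field (for example, a changed postal code) can still agree on the rules that do not use it.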

Validation & Normalization

Before tokens are generated, attributes are validated against practical, PII-focused rules:

FirstName/LastName: No placeholders, proper length, diacritics normalized
BirthDate: Between 1910 and today, in a valid format (YYYY-MM-DD)
SSN: Valid US social security number (area, group, serial checks)
PostalCode: Valid US ZIP or Canadian postal code
Sex: Male or Female

Invalid records are tracked and reported in metadata.
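
A simplified sketch of checks in the spirit of the rules above. The function names, the placeholder list, the inclusive 1910-01-01 lower bound, and the postal code patterns are assumptions; the library's actual rules may be stricter. The SSN check relies on the standard structure of US social security numbers: area 000, 666, and 900–999, group 00, and serial 0000 are never issued.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical placeholder list, for illustration only.
PLACEHOLDERS = {"UNKNOWN", "TEST", "N/A", "NONE"}
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")
SSN_PATTERN = re.compile(r"(\d{3})-?(\d{2})-?(\d{4})")
US_ZIP = re.compile(r"\d{5}(-\d{4})?")
CA_POSTAL = re.compile(r"[A-Z]\d[A-Z] ?\d[A-Z]\d")  # simplified pattern

def valid_name(value: str) -> bool:
    cleaned = value.strip().upper()
    return bool(cleaned) and cleaned not in PLACEHOLDERS

def valid_birth_date(value: str, today: Optional[date] = None) -> bool:
    today = today or date.today()
    if not DATE_PATTERN.fullmatch(value):
        return False
    try:
        parsed = date.fromisoformat(value)  # rejects dates like 1975-02-30
    except ValueError:
        return False
    return date(1910, 1, 1) <= parsed <= today

def valid_ssn(value: str) -> bool:
    match = SSN_PATTERN.fullmatch(value)
    if not match:
        return False
    area, group, serial = match.groups()
    # Area, group, and serial checks on fixed-width digit strings.
    return (area not in {"000", "666"} and area < "900"
            and group != "00" and serial != "0000")

def valid_postal_code(value: str) -> bool:
    cleaned = value.strip().upper()
    return bool(US_ZIP.fullmatch(cleaned) or CA_POSTAL.fullmatch(cleaned))

def valid_sex(value: str) -> bool:
    return value.strip().upper() in {"MALE", "FEMALE"}
```

A caller would typically run every check on a record and count the failures per attribute, which is one way the invalid-record metadata could be produced.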

Encryption Process

The token is transformed through a secure pipeline:

Token Signature → SHA-256 Hash → HMAC-SHA256 → AES-256 Encrypt → Base64 Encode

Or in hash-only mode:

Token Signature → SHA-256 Hash → HMAC-SHA256 → Base64 Encode
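
The hash-only flow can be sketched with Python's standard library alone. The function name `hash_only_token` and the secret are hypothetical, and feeding the raw SHA-256 bytes (rather than, say, a hex string) into the HMAC step is an assumption, so this sketch is not expected to reproduce OpenToken's actual token values. The AES-256 step is left out because the cipher mode and IV handling are not specified in this overview.

```python
import base64
import hashlib
import hmac

def hash_only_token(signature: str, hashing_secret: bytes) -> str:
    """SHA-256, then HMAC-SHA256, then Base64, following the hash-only flow."""
    # Step 1: SHA-256 digest of the token signature.
    digest = hashlib.sha256(signature.encode("utf-8")).digest()
    # Step 2: keyed HMAC-SHA256 over that digest (raw bytes is an assumption).
    mac = hmac.new(hashing_secret, digest, hashlib.sha256).digest()
    # Step 3: Base64-encode so the token can be stored in text formats.
    return base64.b64encode(mac).decode("ascii")

token = hash_only_token("DOE|J|MALE|1975-03-15", b"example-hashing-secret")
print(len(token))  # 44 (32 bytes of HMAC output, Base64-encoded)
```

The same signature and secret always yield the same token, while a different secret yields an unrelated one, which is why both parties must share the hashing key to match records.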

Data Flow

Input CSV/Parquet
       ↓
Validate & Normalize
       ↓
Generate Token Signatures (T1-T5)
       ↓
Hash & Encrypt
       ↓
Output CSV/Parquet + Metadata
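
The flow above can be mimicked end to end on a tiny in-memory CSV. Everything specific in this sketch is assumed for illustration: the column names, the metadata keys, the `process` function, the non-empty-field validation, and the use of only rule T1 in hash-only mode.

```python
import base64
import csv
import hashlib
import hmac
import io
import json

# Hypothetical input; the column names are assumptions.
INPUT_CSV = """RecordId,FirstName,LastName,Sex,BirthDate
r1,John,Doe,Male,1975-03-15
r2,,Roe,Female,1980-07-01
"""

def process(source: str, secret: bytes) -> tuple[str, dict]:
    """Return (tokens CSV text, metadata) for the given input CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["RecordId", "RuleId", "Token"])
    metadata = {"TotalRows": 0, "InvalidRows": 0}
    for row in csv.DictReader(io.StringIO(source)):
        metadata["TotalRows"] += 1
        # Validate & normalize (greatly simplified: non-empty, uppercased).
        fields = {key: value.strip().upper() for key, value in row.items()}
        if not all(fields.values()):
            metadata["InvalidRows"] += 1
            continue
        # Generate a token signature (only T1 here, for brevity).
        signature = "|".join([fields["LastName"], fields["FirstName"][:1],
                              fields["Sex"], fields["BirthDate"]])
        # Hash (hash-only mode; the AES-256 step is omitted in this sketch).
        digest = hashlib.sha256(signature.encode("utf-8")).digest()
        mac = hmac.new(secret, digest, hashlib.sha256).digest()
        writer.writerow([row["RecordId"], "T1", base64.b64encode(mac).decode("ascii")])
    return out.getvalue(), metadata

tokens_csv, metadata = process(INPUT_CSV, b"example-hashing-secret")
print(tokens_csv)
print(json.dumps(metadata))  # {"TotalRows": 2, "InvalidRows": 1}
```

Row `r2` has an empty first name, so it is counted as invalid and produces no token, while `r1` yields a single T1 row in the output.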

Multi-Language Parity

OpenToken is implemented in Java and Python. Both produce byte-identical tokens for the same normalized input using the same hashing and encryption keys. This enables:

Flexible deployment (choose Java or Python)
Cross-language processing (encrypt in one language, decrypt in another)
Distributed processing with PySpark

Security Properties

No Reversal: Tokens cannot be decrypted without the encryption key, and decryption yields only a keyed one-way hash, not the original data
Deterministic: The same normalized input and keys always produce the same token (enables matching)
Privacy-Focused: Designed for regulated environments where PII must be protected
Validation: Rejects invalid or placeholder values before processing

Who Uses OpenToken?

Data Engineers: Building person matching pipelines
Privacy/Infra Engineers: Securing sensitive data in regulated systems
Data/Platform Teams: Linking records across datasets while preserving privacy
Researchers: Linking datasets for cohort studies without exposing raw identifiers

Next Steps

→ Quickstarts – Try OpenToken in 5 minutes. Choose CLI (Docker), Python, or Java.

Once you’ve run through a quickstart:

Token Rules – Deep dive into T1–T5 and matching strategies
Security – Understand validation rules and cryptography