OpenToken Specification

Overview

OpenToken is a privacy-preserving token generation system for deterministic person matching across datasets. This specification defines the scope, inputs, processing steps, and outputs of the token generation pipeline.

Purpose: Generate cryptographically secure tokens from person attributes such that:

  • Identical inputs always produce identical tokens (deterministic)
  • Tokens reveal nothing about the underlying data (one-way)
  • Matching can occur on different attribute combinations via 5 distinct token rules (T1–T5)

Applicability: This specification applies to both Java and Python implementations. Cross-language output must be byte-identical for the same normalized inputs and secrets.


Scope and Goals

In Scope

  1. Person attribute normalization: Transformation of raw input data into canonical forms
  2. Token rule definitions: Five rules (T1–T5) combining attributes in distinct ways
  3. Token generation pipeline: Deterministic transformation of normalized attributes → final tokens
  4. Metadata tracking: Processing statistics, system info, and secret hashes for audit
  5. Error handling: Behavior when attributes fail validation
  6. Output formats: CSV and Parquet serialization

Out of Scope

  • User authentication or access control
  • Network transport or API specification (see implementation-specific documentation)
  • Data backup, archival, or long-term storage strategy
  • Performance tuning or optimization parameters (see Configuration)
  • Distributed/parallel processing details (handled by PySpark implementation separately)

Input Expectations

File Formats

Supported input formats:

  • CSV (comma-separated values, with header row)
  • Parquet (columnar binary format)

Size and Processing Model

OpenToken is designed for streaming-style processing: it reads records, normalizes and validates them, emits up to 5 tokens per record, and writes output without needing to hold the full dataset in memory.

Practical constraints:

  • There is no fixed maximum file size imposed by OpenToken itself; limits are driven by your machine/cluster resources (CPU, memory, disk) and the underlying CSV/Parquet libraries.
  • The output contains roughly 5× as many rows as the input (one row per rule per record), plus the metadata file.
  • For Parquet, performance and memory usage depend on row group sizing and the reader implementation.

Recommendations:

  • Prefer Parquet for large jobs (faster parsing, smaller I/O, better parallelism).
  • Ensure disk space for outputs (tokens + .metadata.json).
  • For very large datasets, use the PySpark integration to scale horizontally.

Required Attributes

All of the following must be provided per record:

  • FirstName (String): Non-empty after normalization. Examples: “John”, “José”, “JoAnn”. Normalization: remove titles, suffixes, and diacritics; uppercase.
  • LastName (String): Non-empty after normalization. Examples: “Smith”, “O’Brien”, “García”. Normalization: remove suffixes and diacritics; uppercase.
  • BirthDate (Date): 1910-01-01 through today. Examples: “1980-01-15”, “01/15/1980”, “15.01.1980”. Normalization: ISO 8601 YYYY-MM-DD.
  • Sex (String): “Male” or “Female” (case-insensitive). Examples: “M”, “F”, “male”, “FEMALE”. Normalization: uppercase; normalize M → MALE, F → FEMALE.
  • PostalCode (String): Valid US ZIP or Canadian postal code. Examples: “98004”, “K1A 1A1”, “98004-1234”. Normalization: remove dashes; pad ZIP to 5 digits.
  • SSN (String): 9 numeric digits (US Social Security Number). Example: “123-45-6789” (digits-only inputs are also accepted). Normalization: remove dashes.

Optional Attributes

  • RecordId: Unique identifier per record (defaults to UUID if omitted)

Validation Rules

Attributes are validated after normalization. See Concepts: Normalization and Validation for detailed rules:

  • FirstName/LastName: At least one alphabetic character after diacritic removal
  • BirthDate: Valid date within allowed range
  • Sex: Exactly “MALE” or “FEMALE” after normalization
  • PostalCode: Valid US ZIP-5 or Canadian postal code format
  • SSN: Area code ≠ 000/666/900–999; group ≠ 00; serial ≠ 0000; reject common placeholders

If any attribute fails validation, the record is marked invalid in metadata, and affected token rules produce blank tokens.


Processing Steps

1. Input Parsing

  • Read CSV or Parquet file with header
  • Validate schema (all required columns present)
  • Stream or batch records (implementation-dependent)
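
As an illustration, a minimal header check in Python might look like the sketch below. The column names and the csv-based reader are assumptions for this example; actual readers and column naming are implementation-defined.

import csv

# Assumed column names for illustration; RecordId is optional and therefore not checked.
REQUIRED_COLUMNS = {"FirstName", "LastName", "BirthDate", "Sex", "PostalCode", "SSN"}

def validate_schema(csv_path: str) -> None:
    # Read only the header row and confirm every required column is present.
    with open(csv_path, newline="", encoding="utf-8") as f:
        header = set(next(csv.reader(f)))
    missing = REQUIRED_COLUMNS - header
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")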

2. Attribute Normalization

Each attribute is normalized according to its type:

  • Names (FirstName, LastName): Remove titles/suffixes, diacritics, extra whitespace; uppercase
  • BirthDate: Parse input format (multiple formats supported) → ISO 8601 YYYY-MM-DD
  • Sex: Parse variants (M/male/Male → MALE; F/female/Female → FEMALE)
  • PostalCode: Remove dashes, zero-pad ZIP codes to 5 digits, uppercase Canadian postal codes
  • SSN: Remove dashes, validate 9-digit format

Details: See Concepts: Normalization and Validation
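
The Python sketch below illustrates the flavor of these steps. The title/suffix list, ZIP+4 handling, and the exact set of accepted date formats are assumptions here; the authoritative rules are in Concepts: Normalization and Validation.

import datetime
import re
import unicodedata

NOISE_TOKENS = {"MR", "MRS", "MS", "DR", "JR", "SR", "II", "III", "IV"}  # assumed illustrative titles/suffixes

def normalize_name(raw: str) -> str:
    # Strip diacritics (e.g. "José" -> "JOSE"), uppercase, drop title/suffix tokens, collapse whitespace.
    ascii_form = unicodedata.normalize("NFKD", raw).encode("ascii", "ignore").decode("ascii")
    tokens = [t for t in re.split(r"\s+", ascii_form.upper()) if t and t.strip(".") not in NOISE_TOKENS]
    return " ".join(tokens)

def normalize_birth_date(raw: str) -> str:
    # Accept a few common input formats and emit ISO 8601 (YYYY-MM-DD).
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y"):
        try:
            return datetime.datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")

def normalize_sex(raw: str) -> str:
    value = raw.strip().upper()
    return {"M": "MALE", "F": "FEMALE"}.get(value, value)

def normalize_postal(raw: str) -> str:
    # Remove dashes, uppercase, and zero-pad short all-digit US ZIPs to five digits.
    value = raw.replace("-", "").strip().upper()
    return value.zfill(5) if value.isdigit() and len(value) < 5 else value

def normalize_ssn(raw: str) -> str:
    return raw.replace("-", "").strip()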

3. Attribute Validation

Normalized attributes are validated against business rules:

  • Non-empty names
  • Valid date ranges
  • Valid postal code formats
  • SSN validation (area/group/serial constraints)

Invalid records are flagged and tracked in metadata; blank tokens are generated for affected rules.
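
A Python sketch of two of these checks (SSN structure and birth date range) is shown below; the placeholder SSN list is illustrative only, and the authoritative rules are in Concepts: Normalization and Validation.

import datetime
import re

PLACEHOLDER_SSNS = {"123456789", "111111111", "999999999"}  # assumed examples of common placeholders

def is_valid_ssn(ssn: str) -> bool:
    # Expects the normalized, dash-free 9-digit form.
    if not re.fullmatch(r"\d{9}", ssn) or ssn in PLACEHOLDER_SSNS:
        return False
    area, group, serial = int(ssn[:3]), int(ssn[3:5]), int(ssn[5:])
    return area not in (0, 666) and area < 900 and group != 0 and serial != 0

def is_valid_birth_date(iso_date: str) -> bool:
    # Expects the normalized ISO 8601 form; the allowed range is 1910-01-01 through today.
    try:
        date = datetime.date.fromisoformat(iso_date)
    except ValueError:
        return False
    return datetime.date(1910, 1, 1) <= date <= datetime.date.today()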

4. Token Rule Application

Apply each of the 5 token rules independently:

  • T1: U(LastName) | U(FirstName[0]) | U(Sex) | BirthDate. Standard match; higher recall.
  • T2: U(LastName) | U(FirstName) | BirthDate | PostalCode[0:3]. Geographic variation; uses ZIP-3.
  • T3: U(LastName) | U(FirstName) | U(Sex) | BirthDate. Higher-precision match; full name plus sex.
  • T4: SocialSecurityNumber | U(Sex) | BirthDate. Authoritative match; uses SSN.
  • T5: U(LastName) | U(FirstName[0:3]) | U(Sex). Quick search; no birth date.

(U = Uppercase, [0] = first char, [0:3] = first 3 chars)

Details: See Concepts: Token Rules
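
For illustration only, the five rule signatures could be assembled from normalized attributes as in the sketch below. The "|" delimiter mirrors the notation in the table above; the actual signature format is defined in Concepts: Token Rules.

def build_signatures(attrs: dict[str, str]) -> dict[str, str]:
    # attrs holds normalized attributes keyed by name (LastName, FirstName, Sex,
    # BirthDate, PostalCode, SocialSecurityNumber).
    last, first = attrs["LastName"].upper(), attrs["FirstName"].upper()
    sex, dob = attrs["Sex"].upper(), attrs["BirthDate"]
    return {
        "T1": "|".join([last, first[:1], sex, dob]),
        "T2": "|".join([last, first, dob, attrs["PostalCode"][:3]]),
        "T3": "|".join([last, first, sex, dob]),
        "T4": "|".join([attrs["SocialSecurityNumber"], sex, dob]),
        "T5": "|".join([last, first[:3], sex]),
    }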

5. Token Encryption / Hash Transformation

Each token rule signature is transformed through the cryptographic pipeline.

Default mode (encrypted):

Signature → SHA-256 → HMAC-SHA256 → AES-256-GCM → Base64

Hash-only mode (optional):

Signature → SHA-256 → HMAC-SHA256 → Base64

Parameters required:

  • hashing_secret: String used as the HMAC-SHA256 key (8+ characters recommended)
  • encryption_key: String of exactly 32 characters (or a 32-byte byte array) used for AES-256-GCM encryption

Details: See Security: Cryptographic Building Blocks
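
A minimal Python sketch of the hash-only pipeline is shown below, using only the standard library. The default encrypted mode additionally applies AES-256-GCM to the HMAC output before Base64 encoding; its key and nonce handling are specified in Security: Cryptographic Building Blocks and are not reproduced here. Whether the raw SHA-256 digest or its hex form feeds the HMAC step is an implementation detail; this sketch assumes the raw digest.

import base64
import hashlib
import hmac

def hash_only_token(signature: str, hashing_secret: str) -> str:
    # Step 1: SHA-256 over the UTF-8 signature.
    digest = hashlib.sha256(signature.encode("utf-8")).digest()
    # Step 2: HMAC-SHA256 keyed with the hashing secret.
    mac = hmac.new(hashing_secret.encode("utf-8"), digest, hashlib.sha256).digest()
    # Step 3: Base64-encode the MAC to produce the token.
    return base64.b64encode(mac).decode("ascii")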

6. Metadata Generation

During processing, OpenToken tracks:

  • Counts: Total rows, invalid attributes per type, blank tokens per rule
  • System Info: Platform (Java/Python), language version, library version
  • Secrets: SHA-256 hashes of hashing secret and encryption key (not the secrets themselves)
  • Timestamps: Processing start, completion, all in UTC
  • Paths: Input/output file paths

Metadata is written to .metadata.json alongside output files.

Details: See Reference: Metadata Format
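
For illustration, the secret hashes recorded in metadata could be computed as below; hex encoding of the digest is an assumption for this sketch.

import hashlib

def secret_hash(secret: str) -> str:
    # Only the SHA-256 of the secret is stored in metadata, never the secret itself.
    return hashlib.sha256(secret.encode("utf-8")).hexdigest()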


Outputs

Token Output (CSV)

Schema:

RecordId,RuleId,Token

Columns:

  • RecordId: From input (or auto-generated if omitted)
  • RuleId: T1, T2, T3, T4, or T5
  • Token: Base64-encoded token (or empty string if validation failed)

Rows per input record: 5 (one per rule); may be fewer if errors occur

Example:

RecordId,RuleId,Token
ID001,T1,aB7c9Dz1e4...
ID001,T2,fG3h5kL2m9...
ID001,T3,nP6q8sT1u0...
ID001,T4,vW9xY2zAbC...
ID001,T5,DeF3gHi6jK...

Token Output (Parquet)

Same schema as CSV but with native Parquet types:

RecordId (string)
RuleId (string)
Token (string)

Parquet format includes compression and is suitable for large datasets.
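
A sketch of a Parquet writer using pyarrow (an assumed dependency for this example; the actual writer and compression settings are implementation-specific):

import pyarrow as pa
import pyarrow.parquet as pq

def write_token_parquet(rows: list[tuple[str, str, str]], path: str) -> None:
    # rows are (RecordId, RuleId, Token) triples; all three columns are strings.
    record_ids, rule_ids, tokens = zip(*rows) if rows else ((), (), ())
    table = pa.table({
        "RecordId": pa.array(record_ids, type=pa.string()),
        "RuleId": pa.array(rule_ids, type=pa.string()),
        "Token": pa.array(tokens, type=pa.string()),
    })
    pq.write_table(table, path, compression="snappy")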

Metadata Output

Filename: <output_basename>.metadata.json

Contents:

  • Processing statistics (record counts, invalid attributes, blank tokens)
  • System information (platform, versions, timestamps)
  • Secret hashes (SHA-256 of hashing secret and encryption key)
  • File paths (input, output, metadata)

Example:

{
  "Platform": "Java",
  "JavaVersion": "21.0.0",
  "OpenTokenVersion": "1.7.0",
  "TotalRows": 100,
  "TotalRowsWithInvalidAttributes": 3,
  "InvalidAttributesByType": {
    "BirthDate": 2,
    "PostalCode": 1
  },
  "BlankTokensByRule": {
    "T1": 2,
    "T2": 1
  },
  "HashingSecretHash": "abc123...",
  "EncryptionSecretHash": "def456..."
}

Details: See Reference: Metadata Format


Versioning Notes

Current Version

OpenToken Specification v1.0 (as of 2024)

  • 5 token rules (T1–T5) finalized
  • Attribute set: FirstName, LastName, BirthDate, Sex, PostalCode, SSN
  • Normalization rules documented
  • Cryptographic pipeline: SHA-256 → HMAC-SHA256 → AES-256-GCM

Compatibility

  • Java: JDK 21+
  • Python: 3.10+
  • Cross-language parity: Java and Python implementations MUST produce byte-identical tokens for the same normalized inputs

Future Considerations

This section is non-normative (informational) and describes likely evolution areas:

  • Extension mechanism for new token rules (T6+) with explicit cross-language parity requirements
  • Support for additional attribute types (e.g., middle name, phone, email) behind versioned schemas
  • Metadata schema versioning for forward compatibility
  • Formal specification versioning and migration guidance
  • Published performance guidance (methodology, baselines by environment)

Breaking Changes

Any changes to:

  • Normalization rules
  • Token rule definitions
  • Cryptographic algorithms
  • Metadata schema

…will require a major version bump and a clear migration path.


Cross-References

For deeper information, see:

  • Concepts: Normalization and Validation
  • Concepts: Token Rules
  • Security: Cryptographic Building Blocks
  • Reference: Metadata Format

Document History

  • 2024-01-15, v1.0: Initial specification
  • Planned, v1.1: Formalize version field in metadata