Security

Cryptographic building blocks, key management expectations, and security considerations for privacy-preserving person matching.

Overview

OpenToken generates cryptographically secure tokens for privacy-preserving person matching across datasets. The system uses deterministic hashing and optional encryption to prevent re-identification while enabling matching on identical person attributes.

Key security properties:

Tokens are one-way (the original data cannot be recovered from a token without the secrets)
Same input produces same token (deterministic matching)
Metadata tracks processing statistics without exposing person data

Cryptographic Building Blocks

Token Transformation Pipeline

OpenToken transforms person attributes through multiple layers:

Encryption mode (default):

Token Signature (normalized attributes)
  ↓
SHA-256 Hash (one-way digest, 256-bit)
  ↓
HMAC-SHA256 (authenticated hash with hashing secret)
  ↓
AES-256-GCM Encrypt (symmetric encryption with encryption key)
  ↓
Base64 Encode (storable format)

Hash-only mode (alternative):

Token Signature
  ↓
SHA-256 Hash
  ↓
HMAC-SHA256 (with hashing secret)
  ↓
Base64 Encode
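
As a rough illustration of the hash-only pipeline above, the following Python sketch chains the three steps using only the standard library. The function name `hash_only_token`, the sample signature string, and the choice to feed the raw SHA-256 digest bytes into HMAC are assumptions made for illustration. The OpenToken specification defines the actual encodings between stages, so output from this sketch should not be expected to match real OpenToken tokens.

```python
import base64
import hashlib
import hmac


def hash_only_token(token_signature: str, hashing_secret: str) -> str:
    """Sketch of: signature -> SHA-256 -> HMAC-SHA256 -> Base64."""
    # Step 1: fixed-size, one-way digest of the normalized signature.
    digest = hashlib.sha256(token_signature.encode("utf-8")).digest()
    # Step 2: keyed hash; without the secret the value cannot be recomputed.
    mac = hmac.new(hashing_secret.encode("utf-8"), digest, hashlib.sha256).digest()
    # Step 3: Base64 text that can be stored in a CSV or database column.
    return base64.b64encode(mac).decode("ascii")


# Hypothetical signature string and placeholder secrets, for illustration only.
signature = "DOE|JOHN|M|1990-01-15"
token_a = hash_only_token(signature, "HashingKey")
token_b = hash_only_token(signature, "HashingKey")
token_c = hash_only_token(signature, "AnotherKey")
```

Here `token_a` and `token_b` are identical because the input and secret are the same, while `token_c` differs because only the hashing secret changed.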

SHA-256 (Secure Hash Algorithm)

Standard: FIPS 180-4
Output: 256-bit (32-byte) fixed-size digest
Collision resistance: ~2^128 computational effort
Purpose: Convert variable-length token signatures to fixed-size digests

Properties:

One-way function (cannot reverse hash to input)
Avalanche effect (small input change produces completely different hash)
Deterministic (same input always produces same hash)
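
The determinism and avalanche properties can be observed directly with Python's `hashlib`. This is a small sketch; the input strings are made-up examples, not real token signatures, and the helper names are hypothetical.

```python
import hashlib


def sha256_bytes(text: str) -> bytes:
    """Return the 32-byte SHA-256 digest of a UTF-8 string."""
    return hashlib.sha256(text.encode("utf-8")).digest()


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length digests differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


first = sha256_bytes("DOE|JOHN")   # hypothetical input
second = sha256_bytes("DOE|JOHM")  # same input with one character changed
changed = differing_bits(first, second)
print(f"{changed} of 256 bits differ")
```

For a well-behaved hash, roughly half of the 256 output bits are expected to flip after a one-character change, and hashing the same string again always returns the same digest.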

HMAC-SHA256 (Hash-based Message Authentication Code)

Standard: FIPS 198-1
Input: SHA-256 hash + hashing secret
Output: 256-bit authenticated hash
Purpose: Prevent rainbow table attacks and verify secret usage

Security benefits:

Requires secret key to generate matching hashes
Prevents pre-computation of token values
Different secret produces completely different output for same input

Formula:

HMAC-SHA256(message, key) = SHA256((key ⊕ opad) || SHA256((key ⊕ ipad) || message))
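
To make the formula concrete, the sketch below builds HMAC-SHA256 by hand and compares it with Python's standard `hmac` module. In the formula, `key` stands for the key after it has been zero-padded to the 64-byte SHA-256 block size (or hashed first if it is longer than one block), `ipad` is the byte `0x36` repeated, and `opad` is the byte `0x5C` repeated. The function name and sample values are illustrative only; production code should call a library implementation rather than this one.

```python
import hashlib
import hmac

BLOCK_SIZE = 64  # SHA-256 processes input in 64-byte blocks


def hmac_sha256_manual(key: bytes, message: bytes) -> bytes:
    # Keys longer than one block are hashed first; shorter keys are zero-padded.
    if len(key) > BLOCK_SIZE:
        key = hashlib.sha256(key).digest()
    key = key.ljust(BLOCK_SIZE, b"\x00")

    inner_key = bytes(b ^ 0x36 for b in key)  # key XOR ipad
    outer_key = bytes(b ^ 0x5C for b in key)  # key XOR opad

    inner = hashlib.sha256(inner_key + message).digest()
    return hashlib.sha256(outer_key + inner).digest()


key = b"HashingKey"  # placeholder secret
message = hashlib.sha256(b"example token signature").digest()
manual = hmac_sha256_manual(key, message)
reference = hmac.new(key, message, hashlib.sha256).digest()
```

Both `manual` and `reference` hold the same 32-byte value, which shows that the nested construction in the formula is what the library computes.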

AES-256-GCM (Advanced Encryption Standard with Galois/Counter Mode)

Standard: FIPS 197 (AES); GCM is specified in NIST SP 800-38D
Key size: 256-bit (32-byte)
Mode: GCM (Galois/Counter Mode) with authentication
Purpose: Encrypt tokens to prevent re-identification

Technical details:

Initialization Vector (IV): 12 bytes, randomly generated per token
Authentication tag: 128-bit (16-byte) GCM tag for integrity
Padding: NoPadding (GCM is a counter-based mode, so plaintext of any length is encrypted without padding)
Algorithm: AES/GCM/NoPadding

Security properties:

Authenticated encryption (detects tampering)
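Unique random IV per token means identical plaintexts encrypt to different ciphertexts, which prevents pattern analysis
Computationally infeasible to brute-force when the key is generated randomly (up to 2^256 possible keys)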
Reversible only with correct encryption key

Key Management & Secrets

This section consolidates practical guidance for managing the cryptographic secrets OpenToken requires.

Types of Secrets

OpenToken expects two secrets (one required, one optional depending on mode):

| Secret | CLI Flag | Purpose | Requirements |
| --- | --- | --- | --- |
| Hashing Secret | `-h` / `--hashingsecret` | HMAC-SHA256 key for deterministic hashing | Required in all modes; 8+ characters recommended, 16+ ideal |
| Encryption Key | `-e` / `--encryptionkey` | AES-256-GCM symmetric key | Required for encryption mode; exactly 32 characters |

Hash-only mode (--hash-only) skips AES encryption; only the hashing secret is needed.
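
The requirements above can be checked before invoking OpenToken. The helper below is a hypothetical pre-flight check, not part of OpenToken; whether the CLI performs equivalent validation itself is not covered here. It also assumes ASCII secrets, so that character count and byte count are the same.

```python
def validate_secrets(hashing_secret, encryption_key=None, hash_only=False):
    """Return a list of problems; an empty list means the inputs look usable."""
    problems = []
    if not hashing_secret:
        problems.append("hashing secret is required in all modes")
    elif len(hashing_secret) < 8:
        problems.append("hashing secret is shorter than the recommended 8 characters")
    if not hash_only:
        # The encryption key only matters when AES encryption is applied.
        if not encryption_key:
            problems.append("encryption key is required in encryption mode")
        elif len(encryption_key) != 32:
            problems.append("encryption key must be exactly 32 characters")
    return problems
```

For example, `validate_secrets("HashingKey", hash_only=True)` returns an empty list, while passing a 5-character encryption key in encryption mode returns a single message about the 32-character requirement.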

Handling Secrets in Practice

Development / Local Testing

Use clearly marked placeholder values:

# Placeholder secrets for local testing only
java -jar opentoken-cli-*.jar \
  -i sample.csv -t csv -o output.csv \
  -h "HashingKey" \
  -e "Secret-Encryption-Key-Goes-Here."

Store these in a local .env file (not committed):

# .env (add to .gitignore)
OPENTOKEN_HASHING_SECRET=HashingKey
OPENTOKEN_ENCRYPTION_KEY=Secret-Encryption-Key-Goes-Here.

Load and use:

source .env
java -jar opentoken-cli-*.jar \
  -i sample.csv -t csv -o output.csv \
  -h "$OPENTOKEN_HASHING_SECRET" \
  -e "$OPENTOKEN_ENCRYPTION_KEY"

Production

Store secrets in a managed secret store and inject via environment variables at runtime:

| Platform | Secret Store | Injection Method |
| --- | --- | --- |
| AWS | Secrets Manager | `aws secretsmanager get-secret-value` or ECS/Lambda secrets |
| Azure | Key Vault | `az keyvault secret show` or App Service key references |
| GCP | Secret Manager | `gcloud secrets versions access` or workload identity |
| On-prem | HashiCorp Vault | `vault kv get` or agent auto-auth |
| Databricks | Databricks Secrets | `dbutils.secrets.get("scope", "key")` |

Example (AWS Secrets Manager):

export OPENTOKEN_HASHING_SECRET=$(aws secretsmanager get-secret-value \
  --secret-id opentoken-hash-key --query SecretString --output text)
export OPENTOKEN_ENCRYPTION_KEY=$(aws secretsmanager get-secret-value \
  --secret-id opentoken-enc-key --query SecretString --output text)

java -jar opentoken-cli-*.jar \
  -i data.csv -t csv -o tokens.csv \
  -h "$OPENTOKEN_HASHING_SECRET" \
  -e "$OPENTOKEN_ENCRYPTION_KEY"

Example (Databricks):

from opentoken_pyspark import SparkPersonTokenProcessor

processor = SparkPersonTokenProcessor(
    spark=spark,
    hashing_secret=dbutils.secrets.get("opentoken", "hashing_secret"),
    encryption_key=dbutils.secrets.get("opentoken", "encryption_key")
)
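
When a Python entry point reads injected secrets from environment variables, a fail-fast lookup avoids running with an empty secret. The helper below is a hypothetical sketch; the variable names follow the shell examples above, and nothing here implies that OpenToken reads these variables on its own.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a safe message."""
    value = os.environ.get(name, "")
    if not value:
        # Report only the variable name, never the value.
        raise RuntimeError(f"missing required environment variable: {name}")
    return value
```

A caller would use `require_env("OPENTOKEN_HASHING_SECRET")` and pass the result straight to the processor, without printing or logging it.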

Secret Rotation

Generate new secrets – use a cryptographically secure generator.
Re-run token generation – tokens are deterministic, so the same input and the same secrets always yield the same tokens. New secrets yield new tokens, which will not match tokens generated under the old secrets.
Version secrets in your store – keep old versions for auditability.
Coordinate downstream – any system that decrypts tokens needs the matching encryption key.
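
For the first step, Python's standard `secrets` module is one cryptographically secure generator. The sketch below is one possible approach, with hypothetical function names; it produces URL-safe text so the values can be passed on a command line or stored in a secret store without escaping.

```python
import secrets


def new_hashing_secret() -> str:
    # 32 random bytes encoded as 43 URL-safe characters.
    return secrets.token_urlsafe(32)


def new_encryption_key() -> str:
    # 24 random bytes encode to exactly 32 URL-safe characters,
    # matching the 32-character requirement for the encryption key.
    return secrets.token_urlsafe(24)
```

Note that a 32-character URL-safe key carries 192 bits of randomness rather than the full 256 bits of a raw binary key, which is still far beyond brute-force reach.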

What NOT to Do

Never commit secrets to source control. Add .env and similar files to .gitignore.
Never log secrets. CLI output and metadata files contain hashes of secrets, not the secrets themselves.
Never hard-code secrets in scripts checked into git. Use environment variables or secret-store references.

Secret Verification via Metadata

Each run produces a .metadata.json with SHA-256 hashes of secrets:

{
  "HashingSecretHash": "e0b4e60b...",
  "EncryptionSecretHash": "a1b2c3d4..."
}

Use tools/hash_calculator.py to verify:

python tools/hash_calculator.py \
  --hashing-secret "YourSecret" \
  --encryption-key "YourEncryptionKey"
# Compare output hashes to metadata file
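
The comparison itself can be scripted. The sketch below assumes the metadata fields hold the lowercase hex SHA-256 digest of the UTF-8 encoded secret; that encoding is an assumption, and `tools/hash_calculator.py` remains the authoritative way to compute the values. The function names are hypothetical, and the `metadata` dictionary stands in for the parsed contents of a `.metadata.json` file.

```python
import hashlib
import hmac


def secret_fingerprint(secret: str) -> str:
    # Assumption: the metadata stores the hex SHA-256 of the UTF-8 secret.
    return hashlib.sha256(secret.encode("utf-8")).hexdigest()


def secrets_match_metadata(metadata: dict, hashing_secret: str, encryption_key: str) -> bool:
    expected = {
        "HashingSecretHash": secret_fingerprint(hashing_secret),
        "EncryptionSecretHash": secret_fingerprint(encryption_key),
    }
    # compare_digest avoids leaking how many leading characters matched.
    return all(
        hmac.compare_digest(str(metadata.get(field, "")), value)
        for field, value in expected.items()
    )


# Stand-in for json.load() on a real metadata file; placeholder secrets only.
metadata = {
    "HashingSecretHash": secret_fingerprint("HashingKey"),
    "EncryptionSecretHash": secret_fingerprint("Secret-Encryption-Key-Goes-Here."),
}
```

A `False` result indicates that at least one of the supplied secrets differs from the one used for the run that produced the metadata, under the encoding assumption stated above.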

Cross-References

CLI flags for secrets: CLI Reference
Environment variable usage: Configuration
Databricks / Spark secrets: Spark or Databricks
Running the CLI: Running OpenToken
Metadata format (hash fields): Reference: Metadata Format

Security Considerations and Limitations

What OpenToken Protects Against

✓ Re-identification without secrets:

Encrypted tokens cannot be reversed without encryption key
Hashed tokens cannot be reversed (one-way HMAC-SHA256)
Attacker with tokens alone cannot recover person data

✓ Rainbow table attacks:

HMAC-SHA256 with secret prevents pre-computed lookup tables
Different secret produces different tokens for same input

✓ Data quality issues: Metadata captures processing statistics without exposing person data, so processing problems can be reviewed safely; data quality guidance lives in the concepts documentation.

What OpenToken Does NOT Protect Against

✗ Compromise of secrets:

If an attacker obtains the hashing secret and encryption key, they can decrypt tokens and regenerate tokens from known or guessed person data
Token security depends entirely on secret protection

✗ Side-channel attacks:

Timing attacks and memory access pattern analysis are not specifically mitigated
Use secure execution environments for sensitive workloads

✗ Statistical analysis with auxiliary data:

If an attacker has auxiliary demographic data and token frequency distributions, statistical attacks may be possible
Consider differential privacy techniques for high-risk scenarios

✗ Token distribution analysis:

Tokens are deterministic (same person always produces same token)
Frequency analysis may reveal population patterns
Mitigate by limiting token distribution and enforcing access controls
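
The frequency concern can be illustrated with a few lines of Python. The token values below are invented placeholders; the point is only that deterministic tokens, although opaque, still repeat whenever the same person appears more than once.

```python
from collections import Counter

# Hypothetical token column from a shared dataset. The values reveal nothing
# on their own, but repeated values show how often each person appears.
tokens = ["tokA", "tokB", "tokA", "tokC", "tokA", "tokB"]

frequencies = Counter(tokens)
most_common_token, count = frequencies.most_common(1)[0]
```

An observer who also knows, from auxiliary data, which individual appears most often in the source population could link that person to `most_common_token` without ever learning a secret, which is why limiting distribution and enforcing access controls matter.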

User Responsibilities

OpenToken provides cryptographic primitives, but users are responsible for:

Secret management: Storing, rotating, and protecting hashing secrets and encryption keys
Access control: Limiting who can generate, access, or decrypt tokens
Token storage: Encrypting token files at rest (file system encryption, database encryption)
Audit logging: Tracking token generation, access, and decryption events
Data minimization: Deleting raw person data after token generation
Compliance: Ensuring usage aligns with HIPAA, GDPR, or organizational policies

Threat Model Assumptions

Assumptions:

Secrets are stored securely and not accessible to unauthorized parties
Execution environment is trusted (no malware or unauthorized access)
Token outputs are protected with access controls
Users validate data quality before token generation

Out of scope:

Protection against compromised execution environments
Protection after decryption (decrypted tokens are plaintext hashes)
Protection against authorized users misusing tokens

Data Quality: Normalization and Validation

Normalization and validation rules are documented separately to keep this page focused on cryptography and secret management.

See Concepts: Normalization and Validation.

Next Steps

View detailed crypto pipeline: Specification
Understand metadata security: Reference: Metadata Format
Review validation rules: Concepts: Normalization and Validation
Configure OpenToken: Configuration
Share tokens across organizations: Sharing Tokenized Data