Security
Cryptographic building blocks, key management expectations, and security considerations for privacy-preserving person matching.
Overview
OpenToken generates cryptographically secure tokens for privacy-preserving person matching across datasets. The system uses deterministic hashing and optional encryption to prevent re-identification while enabling matching on identical person attributes.
Key security properties:
- Tokens are one-way (cannot reverse to original data without secrets)
- Same input produces same token (deterministic matching)
- Metadata tracks processing statistics without exposing person data
Cryptographic Building Blocks
Token Transformation Pipeline
OpenToken transforms person attributes through multiple layers:
Encryption mode (default):
```
Token Signature (normalized attributes)
        ↓
SHA-256 Hash (one-way digest, 256-bit)
        ↓
HMAC-SHA256 (authenticated hash with hashing secret)
        ↓
AES-256-GCM Encrypt (symmetric encryption with encryption key)
        ↓
Base64 Encode (storable format)
```
Hash-only mode (alternative):
```
Token Signature
        ↓
SHA-256 Hash
        ↓
HMAC-SHA256 (with hashing secret)
        ↓
Base64 Encode
```
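The hash-only pipeline can be sketched in a few lines of standard-library Python. This is an illustration only: the signature string format and the exact bytes fed into each stage are assumptions for demonstration, not OpenToken's actual wire format.

```python
import base64
import hashlib
import hmac

def hash_only_token(signature: str, hashing_secret: str) -> str:
    """Illustrative sketch of the hash-only pipeline:
    SHA-256 -> HMAC-SHA256 -> Base64."""
    # Step 1: fixed-size SHA-256 digest of the normalized signature
    digest = hashlib.sha256(signature.encode("utf-8")).digest()
    # Step 2: authenticate the digest with the hashing secret
    mac = hmac.new(hashing_secret.encode("utf-8"), digest, hashlib.sha256).digest()
    # Step 3: Base64-encode the 32-byte MAC into a storable 44-character token
    return base64.b64encode(mac).decode("ascii")

# Hypothetical signature format, for illustration only
token = hash_only_token("doe|john|1990-01-01", "HashingKey")
```

Note that the final token is deterministic for a given (signature, secret) pair, which is what enables cross-dataset matching.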
SHA-256 (Secure Hash Algorithm)
- Standard: FIPS 180-4
- Output: 256-bit (32-byte) fixed-size digest
- Collision resistance: ~2^128 computational effort
- Purpose: Convert variable-length token signatures to fixed-size digests
Properties:
- One-way function (cannot reverse hash to input)
- Avalanche effect (small input change produces completely different hash)
- Deterministic (same input always produces same hash)
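The determinism and avalanche properties are easy to observe with Python's standard `hashlib`; the signature strings below are illustrative, not OpenToken's real format:

```python
import hashlib

# Determinism: the same input always yields the same digest
a = hashlib.sha256(b"doe|john|1990-01-01").hexdigest()
b = hashlib.sha256(b"doe|john|1990-01-01").hexdigest()

# Avalanche effect: a one-character change produces an unrelated digest
c = hashlib.sha256(b"doe|john|1990-01-02").hexdigest()

print(a == b)  # True
print(a == c)  # False
```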
HMAC-SHA256 (Hash-based Message Authentication Code)
- Standard: FIPS 198-1
- Input: SHA-256 hash + hashing secret
- Output: 256-bit authenticated hash
- Purpose: Prevent rainbow table attacks and verify secret usage
Security benefits:
- Requires secret key to generate matching hashes
- Prevents pre-computation of token values
- Different secret produces completely different output for same input
Formula:
```
HMAC-SHA256(message, key) = SHA256((key ⊕ opad) || SHA256((key ⊕ ipad) || message))
```
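The formula can be checked directly against Python's standard `hmac` module. This is a from-scratch sketch of the FIPS 198-1 construction with arbitrary key and message values:

```python
import hashlib
import hmac

def hmac_sha256_manual(key: bytes, message: bytes) -> bytes:
    """HMAC-SHA256 built directly from the FIPS 198-1 formula."""
    block_size = 64  # SHA-256 block size in bytes
    # Keys longer than the block size are hashed first, then zero-padded
    if len(key) > block_size:
        key = hashlib.sha256(key).digest()
    key = key.ljust(block_size, b"\x00")
    ipad = bytes(b ^ 0x36 for b in key)
    opad = bytes(b ^ 0x5C for b in key)
    inner = hashlib.sha256(ipad + message).digest()
    return hashlib.sha256(opad + inner).digest()

# Matches the standard library implementation
key, msg = b"hashing-secret", b"token-signature-digest"
assert hmac_sha256_manual(key, msg) == hmac.new(key, msg, hashlib.sha256).digest()
```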
AES-256-GCM (Advanced Encryption Standard with Galois/Counter Mode)
- Standard: FIPS 197
- Key size: 256-bit (32-byte)
- Mode: GCM (Galois/Counter Mode) with authentication
- Purpose: Encrypt tokens to prevent re-identification
Technical details:
- Initialization Vector (IV): 12 bytes, randomly generated per token
- Authentication tag: 128-bit (16-byte) GCM tag for integrity
- Padding: NoPadding (GCM is a counter-based mode, so no block padding is required)
- Algorithm: `AES/GCM/NoPadding`
Security properties:
- Authenticated encryption (detects tampering)
- Unique IV per token prevents pattern analysis
- Computationally infeasible to brute-force (2^256 possible keys)
- Reversible only with correct encryption key
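A minimal round-trip sketch of this layer using the third-party `cryptography` package. It assumes, for illustration, that the 32-character key string is used directly as the 256-bit AES key; how OpenToken actually derives and serializes the key, IV, and ciphertext is not shown here.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Assumption for illustration: the 32-character key string is the raw AES key
key = b"Secret-Encryption-Key-Goes-Here."  # exactly 32 bytes
aesgcm = AESGCM(key)

hmac_output = b"\x01" * 32   # stand-in for the HMAC-SHA256 token bytes
iv = os.urandom(12)          # fresh 12-byte IV per token

# encrypt() returns ciphertext with the 16-byte GCM tag appended
ciphertext = aesgcm.encrypt(iv, hmac_output, None)

# Reversible only with the correct key and IV; tampering raises InvalidTag
assert aesgcm.decrypt(iv, ciphertext, None) == hmac_output
```

Because the IV is random per token, encrypting the same HMAC output twice yields different ciphertexts, which is what prevents pattern analysis on the encrypted tokens.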
Key Management & Secrets
This section consolidates practical guidance for managing the cryptographic secrets OpenToken requires.
Types of Secrets
OpenToken expects two secrets (one required, one optional depending on mode):
| Secret | CLI Flag | Purpose | Requirements |
|---|---|---|---|
| Hashing Secret | -h / --hashingsecret | HMAC-SHA256 key for deterministic hashing | Required in all modes; 8+ characters recommended, 16+ ideal |
| Encryption Key | -e / --encryptionkey | AES-256-GCM symmetric key | Required for encryption mode; exactly 32 characters |
Hash-only mode (--hash-only) skips AES encryption; only the hashing secret is needed.
Handling Secrets in Practice
Development / Local Testing
Use clearly marked placeholder values:
```bash
# Placeholder secrets for local testing only
java -jar opentoken-cli-*.jar \
  -i sample.csv -t csv -o output.csv \
  -h "HashingKey" \
  -e "Secret-Encryption-Key-Goes-Here."
```
Store these in a local .env file (not committed):
```bash
# .env (add to .gitignore)
OPENTOKEN_HASHING_SECRET=HashingKey
OPENTOKEN_ENCRYPTION_KEY=Secret-Encryption-Key-Goes-Here.
```
Load and use:
```bash
source .env
java -jar opentoken-cli-*.jar \
  -i sample.csv -t csv -o output.csv \
  -h "$OPENTOKEN_HASHING_SECRET" \
  -e "$OPENTOKEN_ENCRYPTION_KEY"
```
Production
Store secrets in a managed secret store and inject via environment variables at runtime:
| Platform | Secret Store | Injection Method |
|---|---|---|
| AWS | Secrets Manager | aws secretsmanager get-secret-value or ECS/Lambda secrets |
| Azure | Key Vault | az keyvault secret show or App Service key references |
| GCP | Secret Manager | gcloud secrets versions access or workload identity |
| On-prem | HashiCorp Vault | vault kv get or agent auto-auth |
| Databricks | Databricks Secrets | dbutils.secrets.get("scope", "key") |
Example (AWS Secrets Manager):
```bash
export OPENTOKEN_HASHING_SECRET=$(aws secretsmanager get-secret-value \
  --secret-id opentoken-hash-key --query SecretString --output text)
export OPENTOKEN_ENCRYPTION_KEY=$(aws secretsmanager get-secret-value \
  --secret-id opentoken-enc-key --query SecretString --output text)

java -jar opentoken-cli-*.jar \
  -i data.csv -t csv -o tokens.csv \
  -h "$OPENTOKEN_HASHING_SECRET" \
  -e "$OPENTOKEN_ENCRYPTION_KEY"
```
Example (Databricks):
```python
from opentoken_pyspark import SparkPersonTokenProcessor

processor = SparkPersonTokenProcessor(
    spark=spark,
    hashing_secret=dbutils.secrets.get("opentoken", "hashing_secret"),
    encryption_key=dbutils.secrets.get("opentoken", "encryption_key"),
)
```
Secret Rotation
- Generate new secrets – use a cryptographically secure generator.
- Re-run token generation – tokens are deterministic; same input + same secrets = same tokens. New secrets = new tokens.
- Version secrets in your store – keep old versions for auditability.
- Coordinate downstream – any system that decrypts tokens needs the matching encryption key.
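The "new secrets = new tokens" point in step 2 can be demonstrated with a small hash-only sketch (standard library only; the `token` function below is illustrative, not OpenToken's implementation):

```python
import base64
import hashlib
import hmac
import secrets

def token(signature: str, hashing_secret: str) -> str:
    # Hash-only sketch: SHA-256 -> HMAC-SHA256 -> Base64
    digest = hashlib.sha256(signature.encode("utf-8")).digest()
    mac = hmac.new(hashing_secret.encode("utf-8"), digest, hashlib.sha256).digest()
    return base64.b64encode(mac).decode("ascii")

old_secret = "OldHashingKey"
new_secret = secrets.token_urlsafe(32)  # cryptographically secure generator

# Same input + same secret = same token
assert token("doe|john|1990-01-01", old_secret) == token("doe|john|1990-01-01", old_secret)
# New secret = entirely new token, so all datasets must be re-tokenized together
assert token("doe|john|1990-01-01", old_secret) != token("doe|john|1990-01-01", new_secret)
```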
What NOT to Do
- Never commit secrets to source control. Add `.env` and similar files to `.gitignore`.
- Never log secrets. CLI output and metadata files contain hashes of secrets, not the secrets themselves.
- Never hard-code secrets in scripts checked into git. Use environment variables or secret-store references.
Secret Verification via Metadata
Each run produces a .metadata.json with SHA-256 hashes of secrets:
```json
{
  "HashingSecretHash": "e0b4e60b...",
  "EncryptionSecretHash": "a1b2c3d4..."
}
```
Use tools/hash_calculator.py to verify:
```bash
python tools/hash_calculator.py \
  --hashing-secret "YourSecret" \
  --encryption-key "YourEncryptionKey"
# Compare output hashes to metadata file
```
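If the helper script is unavailable, the check can be approximated in a few lines, assuming the metadata fields are plain SHA-256 hex digests of the UTF-8 secret strings (treat `tools/hash_calculator.py` as the authoritative check if the formats differ):

```python
import hashlib

def secret_hash(secret: str) -> str:
    # Assumed metadata format: SHA-256 hex digest of the UTF-8 secret string
    return hashlib.sha256(secret.encode("utf-8")).hexdigest()

# Compare against HashingSecretHash / EncryptionSecretHash in .metadata.json
print(secret_hash("YourSecret"))
print(secret_hash("YourEncryptionKey"))
```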
Cross-References
- CLI flags for secrets: CLI Reference
- Environment variable usage: Configuration
- Databricks / Spark secrets: Spark or Databricks
- Running the CLI: Running OpenToken
- Metadata format (hash fields): Reference: Metadata Format
Security Considerations and Limitations
What OpenToken Protects Against
✓ Re-identification without secrets:
- Encrypted tokens cannot be reversed without encryption key
- Hashed tokens cannot be reversed (one-way HMAC-SHA256)
- Attacker with tokens alone cannot recover person data
✓ Rainbow table attacks:
- HMAC-SHA256 with secret prevents pre-computed lookup tables
- Different secret produces different tokens for same input
✓ Data quality issues: Metadata captures processing statistics; data quality guidance lives in the concepts documentation.
What OpenToken Does NOT Protect Against
✗ Compromise of secrets:
- If attacker obtains hashing secret + encryption key, they can regenerate tokens from known person data
- Token security depends entirely on secret protection
✗ Side-channel attacks:
- Timing attacks, memory access patterns not specifically mitigated
- Use secure execution environments for sensitive workloads
✗ Statistical analysis with auxiliary data:
- If attacker has auxiliary demographic data and token frequency distributions, statistical attacks may be possible
- Consider differential privacy techniques for high-risk scenarios
✗ Token distribution analysis:
- Tokens are deterministic (same person always produces same token)
- Frequency analysis may reveal population patterns
- Mitigate by limiting token distribution and enforcing access controls
User Responsibilities
OpenToken provides the cryptographic primitives, but users are responsible for:
- Secret management: Storing, rotating, and protecting hashing secrets and encryption keys
- Access control: Limiting who can generate, access, or decrypt tokens
- Token storage: Encrypting token files at rest (file system encryption, database encryption)
- Audit logging: Tracking token generation, access, and decryption events
- Data minimization: Deleting raw person data after token generation
- Compliance: Ensuring usage aligns with HIPAA, GDPR, or organizational policies
Threat Model Assumptions
Assumptions:
- Secrets are stored securely and not accessible to unauthorized parties
- Execution environment is trusted (no malware or unauthorized access)
- Token outputs are protected with access controls
- Users validate data quality before token generation
Out of scope:
- Protection against compromised execution environments
- Protection after decryption (decrypted tokens are plaintext hashes)
- Protection against authorized users misusing tokens
Data Quality: Normalization and Validation
Normalization and validation rules are documented separately to keep this page focused on cryptography and secret management.
See Concepts: Normalization and Validation.
Next Steps
- View detailed crypto pipeline: Specification
- Understand metadata security: Reference: Metadata Format
- Review validation rules: Concepts: Normalization and Validation
- Configure OpenToken: Configuration
- Share tokens across organizations: Sharing Tokenized Data