Sharing Tokenized Data Between Organizations

How to securely exchange OpenToken outputs for cross-organization person matching.


Overview

Organizations often need to identify overlapping individuals across datasets without exposing raw person data. OpenToken enables this by generating deterministic, cryptographically secure tokens that can be shared and matched externally.

Typical scenario:

  1. Organization A and Organization B each hold patient records
  2. Both organizations run OpenToken on their data using the same secrets
  3. They exchange only the token output (no raw PII)
  4. Matching tokens indicate the same person exists in both datasets
┌─────────────────┐                    ┌─────────────────┐
│  Organization A │                    │  Organization B │
│  (Patient Data) │                    │  (Patient Data) │
└────────┬────────┘                    └────────┬────────┘
         │                                      │
         ▼                                      ▼
┌─────────────────┐                    ┌─────────────────┐
│   OpenToken     │                    │   OpenToken     │
│ (same secrets)  │                    │ (same secrets)  │
└────────┬────────┘                    └────────┬────────┘
         │                                      │
         ▼                                      ▼
┌─────────────────┐     Exchange       ┌─────────────────┐
│  Tokens + Meta  │ ←───────────────→  │  Tokens + Meta  │
└────────┬────────┘                    └────────┬────────┘
         │                                      │
         └───────────────┬──────────────────────┘
                         ▼
               ┌─────────────────┐
               │  Token Matching │
               │  (find overlap) │
               └─────────────────┘

Sender Workflow

The sending organization prepares tokenized data for sharing.

Step 1: Agree on Shared Secrets

Before tokenization, both parties must agree on:

  • Hashing secret (required): Used for HMAC-SHA256
  • Encryption key (recommended for external sharing): Used for AES-256-GCM

Best practice: Use a secure channel (encrypted email, secure file transfer, or direct key exchange in person) to share secrets. Never send secrets alongside token files.

Step 2: Run OpenToken

Generate tokens using the agreed-upon secrets.

Encrypted mode (recommended for external sharing):

java -jar opentoken-cli-*.jar \
  -i patient_data.csv \
  -t csv \
  -o tokens_for_partner.csv \
  -h "$SHARED_HASHING_SECRET" \
  -e "$SHARED_ENCRYPTION_KEY"

Hash-only mode (overlap analysis helper, internal artifact):

Hash-only output is primarily used inside your environment to support overlap analysis against encrypted tokens received from a partner:

java -jar opentoken-cli-*.jar \
  --hash-only \
  -i local_patient_data.csv \
  -t csv \
  -o local_hash_only_tokens.csv \
  -h "$SHARED_HASHING_SECRET"

Typical pattern:

  1. Both parties exchange encrypted tokens only.
  2. You decrypt the partner’s encrypted tokens to the hash-only equivalent (see Decrypting Tokens).
  3. You generate hash-only tokens for your own dataset.
  4. You perform overlap analysis by joining the two hash-only datasets.

See Hash-Only Mode for trade-offs between encrypted and hash-only tokens.

Step 3: Review Metadata

Check the generated .metadata.json for processing statistics:

{
  "TotalRows": 50000,
  "TotalRowsWithInvalidAttributes": 120,
  "InvalidAttributesByType": {
    "SocialSecurityNumber": 80,
    "BirthDate": 40
  },
  "BlankTokensByRule": {
    "T1": 80,
    "T3": 80
  },
  "HashingSecretHash": "e0b4e60b...",
  "EncryptionSecretHash": "a1b2c3d4..."
}

Key checks:

  • TotalRowsWithInvalidAttributes: High counts may indicate data quality issues
  • BlankTokensByRule: T1 and T3 require SSN; blanks are expected if SSN is often missing
  • HashingSecretHash / EncryptionSecretHash: Share these hashes (not the secrets) so the recipient can verify they used the correct keys

Step 4: Prepare Transfer Package

Include in the transfer:

File Purpose Contains Secrets?
tokens.csv (or .parquet) Token output No
tokens.metadata.json Processing stats, secret hashes Hashes only (not secrets)
Data dictionary (optional) Column definitions, RecordId mapping No

Do NOT include:

  • Raw input data (PII)
  • Hashing secret or encryption key (share separately via secure channel)
  • Decrypted tokens

Step 5: Transfer Securely

Use encrypted file transfer:

  • SFTP with encryption
  • Cloud storage with encryption at rest (S3, Azure Blob, GCS)
  • Secure email with encrypted attachment

Recipient Workflow

The receiving organization ingests shared tokens and matches against their own data.

Step 1: Obtain Shared Secrets

Receive the hashing secret (and encryption key, if applicable) through a secure channel separate from the token files.

Step 2: Verify Secret Hashes

Before processing, verify that your secrets match the sender’s:

python tools/hash_calculator.py \
  --hashing-secret "$SHARED_HASHING_SECRET" \
  --encryption-key "$SHARED_ENCRYPTION_KEY"

Compare the output hashes with HashingSecretHash and EncryptionSecretHash in the received metadata file. If they don’t match, tokens will not match correctly.

Step 3: Generate Your Own Tokens

Run OpenToken on your local data using the same secrets:

java -jar opentoken-cli-*.jar \
  -i local_patient_data.csv \
  -t csv \
  -o local_tokens.csv \
  -h "$SHARED_HASHING_SECRET" \
  -e "$SHARED_ENCRYPTION_KEY"

Step 4: Match Tokens

Join token files to find matching records:

-- Find overlapping patients
SELECT
    partner.RecordId AS PartnerRecordId,
    local.RecordId AS LocalRecordId,
    partner.RuleId
FROM partner_tokens partner
JOIN local_tokens local
    ON partner.Token = local.Token
    AND partner.RuleId = local.RuleId;

Interpretation:

  • T4 match: Very high confidence (SSN + Sex + BirthDate)
  • T3 match: High confidence (Last name + full first name + Sex + BirthDate)
  • Multiple rule matches: Stronger confidence (more agreement across attributes)

See Matching Model for matching strategies.

Step 5: Handle Encrypted Tokens (If Applicable)

If tokens are encrypted and you need to debug or verify:

java -jar opentoken-cli-*.jar \
  -d \
  -i partner_tokens.csv \
  -t csv \
  -o partner_decrypted.csv \
  -e "$SHARED_ENCRYPTION_KEY"

See Decrypting Tokens for details.


Security Considerations

Use Encrypted Tokens for External Sharing

Encrypted mode (-e flag) adds AES-256-GCM encryption on top of HMAC-SHA256:

Mode External Sharing Defense in Depth Reversible
Encrypted ✓ Recommended Yes To HMAC hash (with key)
Hash-only ⚠ Use with caution No Not reversible

Encrypted tokens provide an additional security layer if token files are intercepted.

Protect Shared Secrets

  • Never send secrets with token files. Use a separate secure channel.
  • Store secrets in a vault (AWS Secrets Manager, HashiCorp Vault, Azure Key Vault)
  • Limit access to secrets to authorized personnel only
  • Rotate secrets periodically and re-tokenize as needed

Verify Partner Identity

Before sharing:

  • Confirm the partner organization’s identity through established business relationships
  • Use signed data-sharing agreements
  • Verify contact information through independent channels

Audit and Logging

  • Log token generation events: Who ran OpenToken, when, with which input file
  • Log token transfers: When tokens were sent/received, to/from whom
  • Log matching events: Who performed matching, what results were produced

Minimize Data Exposure

  • Share only tokens: Never share raw input data alongside tokens
  • Limit token rules if appropriate: If only T2 matching is needed, consider generating only T2 tokens
  • Use RecordId mappings: Map internal IDs to opaque identifiers before sharing

What Tokens Do NOT Reveal

Tokens are one-way transformations. Without the hashing secret:

  • Attackers cannot reverse tokens to original attributes
  • Attackers cannot generate new tokens for known individuals
  • Attackers cannot determine which attributes produced a token

With only the token file, an attacker cannot identify individuals.


Common Pitfalls

Issue Cause Solution
Zero matches between datasets Different secrets used Verify secret hashes match in metadata files
Partial matches only Normalization differences Ensure both parties use the same OpenToken version
High invalid record counts Data quality issues Clean data before tokenization; review validation rules
Secrets exposed in logs Logging misconfiguration Configure logging to exclude sensitive parameters
Token file intercepted Insecure transfer Use encrypted file transfer; prefer encrypted tokens

Checklist: Before Sharing

Sender:

  • Agreed on secrets with recipient (via secure channel)
  • Generated tokens with correct secrets
  • Verified metadata shows expected row counts
  • Prepared transfer package (tokens + metadata only)
  • Using encrypted file transfer

Recipient:

  • Received secrets via secure channel (separate from token files)
  • Verified secret hashes match sender’s metadata
  • Generated own tokens with same secrets
  • Matching logic uses correct RuleId and Token columns