Configuration

Configuration options for OpenToken inputs, outputs, secrets, and runtime behavior.

CLI Arguments

OpenToken can be run from Java or Python CLIs, or via the helper shell/PowerShell scripts.

At a high level, every run specifies the following (only the last item is optional):

the input path and type (CSV or Parquet)
an output path for tokens
a hashing secret (required)
either an encryption key (for encrypted mode) or --hash-only (for hash-only mode)
optionally --decrypt when reading previously encrypted tokens

For the complete, authoritative list of flags, short options, and defaults, see the CLI Reference.
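
The requirements above can be sketched as a small argument builder. This is a minimal illustration, not part of OpenToken: `build_opentoken_args` is a hypothetical helper, it uses only the flags that appear on this page (`-i`, `-t`, `-o`, `-h`, `-e`, `--hash-only`, `--decrypt`), and it assumes that decrypt mode needs the encryption key.

```python
from typing import List, Optional


def build_opentoken_args(
    input_path: str,
    input_type: str,
    output_path: str,
    hashing_secret: str,
    encryption_key: Optional[str] = None,
    hash_only: bool = False,
    decrypt: bool = False,
) -> List[str]:
    """Assemble CLI arguments (hypothetical helper, not shipped with OpenToken)."""
    if input_type not in ("csv", "parquet"):
        raise ValueError("input type must be 'csv' or 'parquet'")
    if not hashing_secret:
        raise ValueError("a hashing secret is always required")
    # Exactly one of the two must be chosen: an encryption key or hash-only mode.
    if hash_only == bool(encryption_key):
        raise ValueError("pass either an encryption key or hash_only=True")
    # Assumption: decrypting previously encrypted tokens needs the encryption key.
    if decrypt and not encryption_key:
        raise ValueError("decrypt mode needs the encryption key")

    args = ["-i", input_path, "-t", input_type, "-o", output_path, "-h", hashing_secret]
    if hash_only:
        args.append("--hash-only")
    else:
        args += ["-e", encryption_key]
    if decrypt:
        args.append("--decrypt")
    return args
```

The returned list can be appended to whichever launcher is in use (for example the `java -jar` command), which keeps the mode checks in one place.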

Environment Variables

To avoid hard-coding secrets in scripts, pass them through environment variables:

export OPENTOKEN_HASHING_SECRET="MyHashingKey"
export OPENTOKEN_ENCRYPTION_KEY="MyEncryptionKey_32CharactersLong"

java -jar opentoken-cli-*.jar \
  -i data.csv -t csv -o tokens.csv \
  -h "$OPENTOKEN_HASHING_SECRET" \
  -e "$OPENTOKEN_ENCRYPTION_KEY"
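
When the CLI is launched from a wrapper script, the same variables can be read and checked before anything runs. The sketch below is illustrative only: `load_secrets` is a hypothetical helper, and the variable names simply follow the example above (it does not assume the CLI reads them on its own).

```python
import os
from typing import Mapping, Optional, Tuple


def load_secrets(env: Mapping[str, str] = os.environ) -> Tuple[str, Optional[str]]:
    """Read the hashing secret and optional encryption key from the environment."""
    hashing_secret = env.get("OPENTOKEN_HASHING_SECRET", "")
    if not hashing_secret:
        # The hashing secret is required in every mode, so fail early.
        raise RuntimeError("OPENTOKEN_HASHING_SECRET is not set")
    # An unset or empty encryption key is reported as None (hash-only mode).
    encryption_key = env.get("OPENTOKEN_ENCRYPTION_KEY") or None
    return hashing_secret, encryption_key
```

Passing a plain dictionary as `env` makes the function easy to exercise in tests without touching the real environment.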

Docker Environment

# Assumes both variables are already exported in the host shell.
# "$VAR" below is expanded by the host shell, not inside the container.
docker run --rm \
  -e OPENTOKEN_HASHING_SECRET \
  -e OPENTOKEN_ENCRYPTION_KEY \
  -v "$(pwd)/resources:/app/resources" \
  opentoken:latest \
  -i /app/resources/sample.csv \
  -t csv \
  -o /app/resources/output.csv \
  -h "$OPENTOKEN_HASHING_SECRET" \
  -e "$OPENTOKEN_ENCRYPTION_KEY"

Input File Format

Supported Formats

Format	Extension	Description
CSV	`.csv`	Comma-separated values with header row
Parquet	`.parquet`	Columnar binary format (recommended for large files)

Column Names & Aliases

Input column names are matched case-insensitively and support common aliases:

Attribute	Accepted Column Names	Required	Type
Record ID	`RecordId`, `Id`	Optional	String
First Name	`FirstName`, `GivenName`	Yes	String
Last Name	`LastName`, `Surname`	Yes	String
Birth Date	`BirthDate`, `DateOfBirth`	Yes	Date
Sex	`Sex`, `Gender`	Yes	String
Postal Code	`PostalCode`, `ZipCode`, `ZIP3`, `ZIP4`, `ZIP5`	Yes	String
SSN	`SocialSecurityNumber`, `NationalIdentificationNumber`	Yes	String
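
The alias table above can be expressed as a case-insensitive lookup. The following is a minimal sketch of that idea, not OpenToken's own resolver; `ALIASES` and `resolve_columns` are hypothetical names, and treating the first matching column as the winner is an assumption.

```python
from typing import Dict, Iterable

# Attribute -> accepted column names, copied from the table above.
ALIASES = {
    "RecordId": ["RecordId", "Id"],
    "FirstName": ["FirstName", "GivenName"],
    "LastName": ["LastName", "Surname"],
    "BirthDate": ["BirthDate", "DateOfBirth"],
    "Sex": ["Sex", "Gender"],
    "PostalCode": ["PostalCode", "ZipCode", "ZIP3", "ZIP4", "ZIP5"],
    "SocialSecurityNumber": ["SocialSecurityNumber", "NationalIdentificationNumber"],
}


def resolve_columns(header: Iterable[str]) -> Dict[str, str]:
    """Map each attribute to the input column that matches one of its aliases."""
    lookup = {
        alias.lower(): attribute
        for attribute, aliases in ALIASES.items()
        for alias in aliases
    }
    resolved: Dict[str, str] = {}
    for column in header:
        attribute = lookup.get(column.strip().lower())
        # Assumption: the first matching column wins if several aliases appear.
        if attribute and attribute not in resolved:
            resolved[attribute] = column
    # RecordId is optional; every other attribute is required.
    missing = [a for a in ALIASES if a != "RecordId" and a not in resolved]
    if missing:
        raise ValueError("missing required columns: " + ", ".join(missing))
    return resolved
```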

Date Formats Accepted

YYYY-MM-DD (recommended)
MM/DD/YYYY
MM-DD-YYYY
DD.MM.YYYY
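
To pre-normalize dates before handing a file to OpenToken, the four accepted formats can be tried in order. This is a sketch under the assumption that the recommended `YYYY-MM-DD` form is the desired target; `normalize_birth_date` is a hypothetical helper, not an OpenToken function.

```python
from datetime import datetime

# strptime patterns for the four formats listed above, in the same order.
ACCEPTED_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%m-%d-%Y", "%d.%m.%Y")


def normalize_birth_date(value: str) -> str:
    """Return the date as YYYY-MM-DD, or raise ValueError if no format matches."""
    text = value.strip()
    for fmt in ACCEPTED_DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next accepted format
    raise ValueError(f"unsupported date format: {value!r}")
```

The separators and the position of the four-digit year keep the formats from overlapping, so the order of attempts does not change the result.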

Sex Values Accepted

Male, M
Female, F

(Case-insensitive)
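
A case-insensitive mapping of the accepted values might look like the sketch below. `normalize_sex` is a hypothetical helper, and returning the spelled-out form (`Male` or `Female`) is an assumption made for illustration.

```python
# Accepted inputs (lowercased) mapped to an assumed canonical spelling.
SEX_VALUES = {"male": "Male", "m": "Male", "female": "Female", "f": "Female"}


def normalize_sex(value: str) -> str:
    """Return 'Male' or 'Female' for any accepted spelling, ignoring case."""
    key = value.strip().lower()
    if key not in SEX_VALUES:
        raise ValueError(f"unsupported sex value: {value!r}")
    return SEX_VALUES[key]
```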

SSN Formats Accepted

123-45-6789 (preferred input format)
Digits-only values (normalized automatically; dashes removed internally)
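
The two accepted shapes can be checked with a short pattern before the dashes are dropped. This sketch only covers the formatting described above; `normalize_ssn` is a hypothetical helper, and any further validity rules the tool may apply to the digits are not modeled here.

```python
import re

# Either the dashed form 123-45-6789 or nine consecutive digits.
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}|\d{9}")


def normalize_ssn(value: str) -> str:
    """Return the nine digits of an SSN with dashes removed."""
    text = value.strip()
    if not SSN_PATTERN.fullmatch(text):
        raise ValueError(f"unsupported SSN format: {value!r}")
    return text.replace("-", "")
```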

Postal Code Formats

US ZIP Codes:

98004 (5 digits)
98004-1234 (9 digits, dash removed)
980 (ZIP-3, auto-padded to 98000)

Canadian Postal Codes:

K1A 1A1 (with space)
K1A1A1 (without space, auto-formatted)
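
The postal code rules above can be sketched as a single normalizer. `normalize_postal_code` is a hypothetical helper; it assumes that a ZIP+4 value keeps all nine digits once the dash is removed and that Canadian codes are uppercased, neither of which is stated on this page.

```python
import re


def normalize_postal_code(value: str) -> str:
    """Normalize US ZIP and Canadian postal codes as described above."""
    text = value.strip().upper()
    if re.fullmatch(r"\d{5}", text):
        return text  # 5-digit ZIP is already in its final form
    if re.fullmatch(r"\d{5}-\d{4}", text):
        return text.replace("-", "")  # assumption: keep all nine digits
    if re.fullmatch(r"\d{3}", text):
        return text + "00"  # ZIP-3 padded to five digits
    match = re.fullmatch(r"([A-Z]\d[A-Z]) ?(\d[A-Z]\d)", text)
    if match:
        # Canadian code: insert the space if it was omitted.
        return f"{match.group(1)} {match.group(2)}"
    raise ValueError(f"unsupported postal code: {value!r}")
```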

Output Configuration

Output Type Override

Use -ot to specify a different output format:

# Input CSV, output Parquet
java -jar opentoken-cli-*.jar \
  -i data.csv -t csv \
  -o tokens.parquet -ot parquet \
  -h "HashingKey" -e "EncryptionKey-32-Characters-Long"

Output Files Generated

Each run produces two files:

Tokens file: <output_path> (CSV or Parquet)
Metadata file: <output_path>.metadata.json (always JSON)

Processing Modes

OpenToken supports three processing modes that control how token signatures are transformed:

Encryption (default) – produces encrypted tokens suitable for external exchange; requires both a hashing secret and an encryption key.
Hash-only – produces one-way hashed tokens for internal matching and overlap analysis; requires only the hashing secret.
Decrypt – takes previously encrypted tokens and decrypts them back to their hashed form (equivalent to hash-only output).

For the exact CLI flags that enable each mode, see the CLI Reference.

Secret Requirements

Hashing Secret

Purpose: HMAC-SHA256 key for deterministic hashing
Minimum length: 8 characters recommended
Best practice: 16+ characters with mixed case and digits
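
To illustrate why the hashing secret matters, the sketch below computes a keyed HMAC-SHA256 digest with Python's standard library. It shows the primitive only: `hmac_sha256_hex` is a hypothetical helper, and the exact input string and output encoding that OpenToken uses for its tokens are not assumed here.

```python
import hashlib
import hmac


def hmac_sha256_hex(secret: str, message: str) -> str:
    """Return the hex HMAC-SHA256 digest of message under the given secret."""
    return hmac.new(
        secret.encode("utf-8"), message.encode("utf-8"), hashlib.sha256
    ).hexdigest()


# Same secret and input always give the same digest (deterministic),
# while a different secret gives an unrelated digest.
same = hmac_sha256_hex("MyHashingKey", "example") == hmac_sha256_hex("MyHashingKey", "example")
different = hmac_sha256_hex("MyHashingKey", "example") != hmac_sha256_hex("OtherKey", "example")
```

Determinism is what lets two parties holding the same secret produce matching values, and it is also why the secret must be identical across every run that is meant to be compared.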

Encryption Key

Purpose: AES-256-GCM symmetric encryption key
Required length: Exactly 32 characters (32 bytes)
Error if wrong length: “Key must be 32 characters long”
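
A length check before launching a job can surface this error early. The sketch below is a hypothetical pre-flight helper, not OpenToken code; it assumes the key is UTF-8 encoded, so it also rejects keys whose 32 characters would encode to more than 32 bytes.

```python
def validate_encryption_key(key: str) -> bytes:
    """Return the key as bytes if it is exactly 32 characters and 32 bytes."""
    raw = key.encode("utf-8")
    # Non-ASCII characters take more than one byte, so check both lengths.
    if len(key) != 32 or len(raw) != 32:
        raise ValueError("Key must be 32 characters long")
    return raw
```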

Environment-Specific Configuration

Local Development

# Java
cd lib/java
mvn clean install -DskipTests
java -jar opentoken-cli/target/opentoken-cli-*.jar \
  -i ../../resources/sample.csv -t csv -o ../../resources/output.csv \
  -h "HashingKey" -e "EncryptionKey-32-Characters-Long"

# Python
source /workspaces/OpenLinkToken/.venv/bin/activate
python -m opentoken_cli.main \
  -i ../../../resources/sample.csv -t csv -o ../../../resources/output.csv \
  -h "HashingKey" -e "EncryptionKey-32-Characters-Long"

Docker Container

./run-opentoken.sh \
  -i ./resources/sample.csv \
  -o ./resources/output.csv \
  -t csv \
  -h "HashingKey" \
  -e "EncryptionKey-32-Characters-Long"

Spark/Databricks Cluster

from opentoken_pyspark import OpenTokenProcessor

processor = OpenTokenProcessor(
    hashing_secret=dbutils.secrets.get("opentoken", "hashing_secret"),
    encryption_key=dbutils.secrets.get("opentoken", "encryption_key")
)

See Spark or Databricks for cluster configuration.

Handling Missing/Invalid Data

Scenario	Behavior
RecordId missing	Auto-generates a UUID for each record
Required column missing	Processing fails with a column name mismatch error
NULL/empty value	Record marked invalid; counted in metadata
Invalid attribute	Record marked invalid; blank token for affected rules

Next Steps

Batch processing: Running Batch Jobs
Metadata format: Reference: Metadata Format