Running OpenToken
Guides for generating tokens in different environments and use cases.
CLI Guide
The OpenToken CLI accepts command-line arguments for flexible token generation. Both Java and Python CLIs support identical options.
Basic Syntax
opentoken-cli [OPTIONS] -i <input> -t <type> -o <output> -h <hashing-secret> [-e <encryption-key>]
Arguments
Required
| Argument | Alias | Description | Example |
|---|---|---|---|
| -i | --input | Input file path (CSV or Parquet) | -i data.csv |
| -t | --type | Input file type | -t csv or -t parquet |
| -o | --output | Output file path | -o tokens.csv |
| -h | --hashingsecret | HMAC-SHA256 hashing secret | -h "MyHashingKey" |
Optional
| Argument | Alias | Description | Default | Example |
|---|---|---|---|---|
| -e | --encryptionkey | AES-256 encryption key | Required (unless --hash-only) | -e "MyEncryptionKey" |
| -ot | --output-type | Output file type | Same as input type | -ot parquet |
| --hash-only | | Hash-only mode (no encryption) | False | --hash-only |
| -d | --decrypt | Decrypt mode (reverse previous encryption) | False | -d |
Usage Examples
Token Generation (Encryption Mode)
Generates encrypted tokens. Both the hashing secret and the encryption key are required.
Java:
cd lib/java
mvn clean install -DskipTests
java -jar opentoken-cli/target/opentoken-cli-*.jar \
-i ../../resources/sample.csv \
-t csv \
-o ../../resources/output.csv \
-h "HashingKey" \
-e "Secret-Encryption-Key-Goes-Here."
Python:
cd lib/python/opentoken-cli
source ../../.venv/bin/activate
pip install -r requirements.txt -e . -e ../opentoken
python -m opentoken_cli.main \
-i ../../../resources/sample.csv \
-t csv \
-o ../../../resources/output.csv \
-h "HashingKey" \
-e "Secret-Encryption-Key-Goes-Here."
Token Generation (Hash-Only Mode)
Generates HMAC-SHA256 hashed tokens without AES encryption. Only the hashing secret is required.
Java:
java -jar opentoken-cli/target/opentoken-cli-*.jar \
--hash-only \
-i ../../resources/sample.csv \
-t csv \
-o ../../resources/hashed-output.csv \
-h "HashingKey"
Python:
python -m opentoken_cli.main \
--hash-only \
-i ../../../resources/sample.csv \
-t csv \
-o ../../../resources/hashed-output.csv \
-h "HashingKey"
Token Decryption
Decrypts previously encrypted tokens. Only the encryption key is required.
Java:
java -jar opentoken-cli/target/opentoken-cli-*.jar \
-d \
-i ../../resources/output.csv \
-t csv \
-o ../../resources/decrypted.csv \
-e "Secret-Encryption-Key-Goes-Here."
Python:
python -m opentoken_cli.main \
-d \
-i ../../../resources/output.csv \
-t csv \
-o ../../../resources/decrypted.csv \
-e "Secret-Encryption-Key-Goes-Here."
Output Files
Token generation produces two files:
Tokens File (CSV or Parquet):
RecordId,RuleId,Token
record1,T1,Gn7t1Zj16E5Qy+z9iINtczP6fRDYta6C0XFrQtpjnVQSEZ5pQXAzo02Aa9LS9oNMOog6Ssw9GZE6fvJrX2sQ/cThSkB6m91L
record1,T2,pUxPgYL9+cMxkA+8928Pil+9W+dm9kISwHYPdkZS+I2nQ/bQ/8HyL3FOVf3NYPW5NKZZO1OZfsz7LfKYpTlaxyzMLqMF2Wk7
...
Metadata File (always JSON, suffixed .metadata.json):
{
  "JavaVersion": "21.0.0",
  "OpenTokenVersion": "1.13.2",
  "Platform": "Java",
  "TotalRows": 1,
  "TotalRowsWithInvalidAttributes": 0,
  "InvalidAttributesByType": {},
  "BlankTokensByRule": {},
  "HashingSecretHash": "abc123...",
  "EncryptionSecretHash": "def456..."
}
See Reference: Metadata Format for detailed field descriptions.
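The metadata file is useful for catching data-quality problems before trusting the token file. A minimal Python sketch (field names follow the example above; the sample values here are illustrative, not real output):

```python
import json

# Sample metadata in the format shown above; values are illustrative.
metadata_json = """{
  "Platform": "Java",
  "TotalRows": 100,
  "TotalRowsWithInvalidAttributes": 3,
  "InvalidAttributesByType": {"BirthDate": 2, "SocialSecurityNumber": 1},
  "BlankTokensByRule": {}
}"""

metadata = json.loads(metadata_json)

# Summarize validation failures before using the tokens downstream.
invalid = metadata["TotalRowsWithInvalidAttributes"]
if invalid:
    print(f"{invalid} of {metadata['TotalRows']} rows had invalid attributes:")
    for attr, count in metadata["InvalidAttributesByType"].items():
        print(f"  {attr}: {count}")
```

In practice you would load the real `.metadata.json` file produced next to your output file instead of the embedded sample string.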
Docker
Use Docker for a containerized, dependency-free environment.
Option 1: Convenience Scripts (Recommended)
Scripts automatically build and run the container.
Bash (Linux/Mac):
cd /path/to/OpenLinkToken
./run-opentoken.sh \
-i ./resources/sample.csv \
-o ./resources/output.csv \
-t csv \
-h "HashingKey" \
-e "Secret-Encryption-Key-Goes-Here."
PowerShell (Windows):
cd C:\path\to\OpenLinkToken
.\run-opentoken.ps1 `
-i .\resources\sample.csv `
-o .\resources\output.csv `
-FileType csv `
-h "HashingKey" `
-e "Secret-Encryption-Key-Goes-Here."
Script Options
| Option | Bash Alias | PowerShell | Description |
|---|---|---|---|
| File type | -t | -FileType | csv or parquet |
| Skip rebuild | -s | -SkipBuild | Reuse existing image |
| Verbose | -v | -Verbose | Show detailed output |
Run with --help (Bash) or -Help (PowerShell) for full usage.
Option 2: Manual Docker Commands
Build and run the image manually from the repository root.
# Build the image
docker build -t opentoken:latest .
# Run with sample data
docker run --rm -v $(pwd)/resources:/app/resources \
opentoken:latest \
-i /app/resources/sample.csv \
-t csv \
-o /app/resources/output.csv \
-h "HashingKey" \
-e "Secret-Encryption-Key-Goes-Here."
# View output
cat resources/output.csv
cat resources/output.metadata.json
Dev Container: If running in a dev container, use an absolute path for the volume mount:
docker run --rm -v /workspaces/OpenLinkToken/resources:/app/resources \
opentoken:latest ...
PySpark Bridge
For large-scale distributed token generation and dataset overlap analysis, use the PySpark bridge.
When to Use PySpark
- Large datasets: Millions of records across multiple files
- Distributed processing: Leverage cluster computing
- Overlap analysis: Find matching records across datasets at scale
- Cost-effective: Process on cloud infrastructure (AWS, GCP, Azure)
Installation
Ensure the Python root venv is active, then install:
source /workspaces/OpenLinkToken/.venv/bin/activate
cd lib/python/opentoken-pyspark
pip install -r requirements.txt -e .
Basic Usage
from pyspark.sql import SparkSession
from opentoken_pyspark import SparkPersonTokenProcessor

# Initialize Spark session
spark = SparkSession.builder \
    .appName("OpenToken") \
    .getOrCreate()

# Create processor
processor = SparkPersonTokenProcessor(
    spark=spark,
    hashing_secret="HashingKey",
    encryption_key="Secret-Encryption-Key"
)

# Process dataset
tokens_df = processor.process_dataframe(
    input_df=input_spark_df,
    input_type="csv"  # or "parquet"
)

# Write output
tokens_df.coalesce(1).write \
    .mode("overwrite") \
    .csv("output/tokens")
Examples
See example notebooks in lib/python/opentoken-pyspark/notebooks/:
- Custom_Token_Definition_Guide.ipynb – Define custom token rules
- Dataset_Overlap_Analysis_Guide.ipynb – Find overlapping records across datasets
Troubleshooting
“Encryption key not provided”
Problem: Error when running without -e flag and without --hash-only.
Solution: Either provide an encryption key with -e "YourKey" or use --hash-only:
java -jar opentoken-cli-*.jar --hash-only -i data.csv -t csv -o output.csv -h "HashingKey"
“Invalid BirthDate” or “Date out of range”
Problem: BirthDate attribute fails validation.
Causes:
- Date is before January 1, 1910
- Date is in the future
- Format is not recognized
Solution: Use YYYY-MM-DD or one of the other accepted formats (MM/DD/YYYY, MM-DD-YYYY, DD.MM.YYYY):
Correct: 1980-01-15, 01/15/1980
Wrong: 1905-01-01, 2025-12-31, 01-15-80
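Dates can be pre-validated and normalized before running the CLI. A minimal Python sketch of the rules above (an illustrative helper, not the library's own parser):

```python
from datetime import datetime

# Accepted input formats per the list above; output is normalized to YYYY-MM-DD.
FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%m-%d-%Y", "%d.%m.%Y")

def normalize_birthdate(value: str) -> str:
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue
        # Reject out-of-range dates: before January 1, 1910, or in the future.
        if parsed.year < 1910 or parsed > datetime.now():
            raise ValueError(f"Date out of range: {value}")
        return parsed.strftime("%Y-%m-%d")
    raise ValueError(f"Unrecognized date format: {value}")

print(normalize_birthdate("01/15/1980"))  # 1980-01-15
```

Running this over the BirthDate column before tokenization surfaces bad rows early instead of letting them show up as invalid attributes in the metadata file.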
“Invalid SSN” or “SSN area/group/serial invalid”
Problem: SSN fails validation (area, group, or serial validation).
Causes:
- Area: 000, 666, or 900–999
- Group: 00
- Serial: 0000
- Common invalid sequences: 111-11-1111, 222-22-2222, etc.
Solution: Validate SSN before processing or regenerate test data. See Security for full rules.
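The rules above can be checked in a quick pre-processing pass. An illustrative Python sketch (the library's own validation remains authoritative; see Security for the full rules):

```python
import re

def is_plausible_ssn(ssn: str) -> bool:
    """Mirror the area/group/serial rules listed above."""
    digits = re.sub(r"\D", "", ssn)
    if len(digits) != 9:
        return False
    area, group, serial = digits[:3], digits[3:5], digits[5:]
    if area in ("000", "666") or "900" <= area <= "999":
        return False
    if group == "00" or serial == "0000":
        return False
    # Reject repeated-digit sequences such as 111-11-1111.
    if digits == digits[0] * 9:
        return False
    return True

print(is_plausible_ssn("123-45-6789"))  # True
print(is_plausible_ssn("666-45-6789"))  # False
```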
“Invalid LastName” or “Name is placeholder”
Problem: Name is rejected as placeholder or invalid.
Causes:
- Value is placeholder: “Unknown”, “Test”, “N/A”, “Anonymous”, “Missing”
- LastName is too short (< 2 chars) without being special case (“Ng”)
- Null or empty
Solution: Clean data before processing. Remove or replace placeholder values.
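A cleanup pass along these lines can be run before tokenization. An illustrative Python sketch (the placeholder list and length rule follow the causes above; the library's own checks are authoritative):

```python
# Placeholder values per the causes above; matched case-insensitively.
PLACEHOLDERS = {"unknown", "test", "n/a", "anonymous", "missing"}

def clean_last_name(name):
    """Return the cleaned name, or None if it should be rejected."""
    if not name or not name.strip():
        return None
    value = name.strip()
    if value.lower() in PLACEHOLDERS:
        return None
    if len(value) < 2:  # too short to be a real surname
        return None
    return value

print(clean_last_name("Doe"))      # Doe
print(clean_last_name("Unknown"))  # None
```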
“Docker image not found” or build fails
Problem: Docker image won’t build or run.
Causes:
- Docker daemon not running
- Insufficient disk space
- File path issues on Windows
Solution:
- Ensure Docker is running: docker --version
- Use absolute paths, not relative: /workspaces/OpenLinkToken/resources
- Clear Docker cache if needed: docker system prune
- Check file permissions: ls -la resources/sample.csv
Tokens don’t match across Java/Python
Problem: Same input produces different tokens in Java vs. Python.
Causes:
- Different hashing/encryption secrets
- Different attribute normalization
- Unicode handling differences
Solution:
- Verify secrets match exactly
- Check attribute normalization (see Concepts: Normalization)
- Run the interoperability test suite: tools/interoperability/java_python_interoperability_test.py
- Decrypt and compare hashes to isolate the issue
CSV parsing errors or column not found
Problem: “Column ‘FirstName’ not found” or CSV parse error.
Causes:
- Column names don’t match expected aliases
- Commas within values without quoting
- Encoding issues (non-UTF-8)
Solution:
- Verify column names match accepted aliases (see Configuration)
- Quote values containing commas: "Doe, Jr."
- Ensure UTF-8 encoding
- Use Parquet format if CSV parsing continues to fail
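The quoting rule is handled automatically when the CSV is written with a proper CSV library rather than by string concatenation. A small Python sketch using the standard csv module:

```python
import csv
import io

# csv.writer quotes values containing commas automatically, avoiding
# parse errors from unquoted commas inside fields.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["RecordId", "FirstName", "LastName"])
writer.writerow(["record1", "John", "Doe, Jr."])

data_row = buf.getvalue().splitlines()[1]
print(data_row)  # record1,John,"Doe, Jr."
```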
Next Steps
- Get started: Quickstarts
- Configure input formats: Configuration
- Understand token matching: Concepts: Token Rules
- Read metadata format: Reference: Metadata Format
- Contribute improvements: Community: Contributing