Mock Data Workflows

How to generate and process mock datasets for testing OpenToken pipelines.


Overview

OpenToken includes a mock data generator for testing and development. The tool creates realistic person records with configurable duplicate rates for overlap analysis testing.


Mock Data Generator

Location: tools/mockdata/data_generator.py

Usage

cd tools/mockdata
python data_generator.py <num_lines> <repeat_probability> <output_file>

Parameters

Parameter Description Example
num_lines Total number of records to generate 100
repeat_probability Fraction of records that are duplicates (0.0–1.0) 0.05
output_file Output CSV file path test_data.csv

Examples

# Generate 100 records with 5% duplicates
python data_generator.py 100 0.05 test_data.csv

# Generate 10,000 records with 10% duplicates
python data_generator.py 10000 0.10 large_test.csv

# Generate 1,000 records with no duplicates
python data_generator.py 1000 0.0 unique_records.csv

Default Values

If no arguments provided:

  • num_lines: 100
  • repeat_probability: 0.05 (5%)
  • output_file: test_data.csv

Output Format

Generated CSV files include all required OpenToken columns:

RecordId,BirthDate,FirstName,LastName,PostalCode,Sex,SocialSecurityNumber
550e8400-e29b-41d4-a716-446655440000,1985-03-15,John,Smith,98004,Male,123-45-6789
550e8400-e29b-41d4-a716-446655440001,1975-06-20,Jane,Doe,10001,Female,987-65-4321
...

Generated Fields

Field Generator Example
RecordId UUID4 550e8400-e29b-...
BirthDate Random date (0–90 years ago) 1985-03-15
FirstName Faker first_name() John
LastName Faker last_name() Smith
PostalCode Faker zipcode() 98004
Sex Random: Male/Female Male
SocialSecurityNumber Faker ssn() 123-45-6789

Duplicate Records

Duplicates are created by copying existing records with new RecordId values. This simulates:

  • Same person appearing multiple times in a dataset
  • Overlap between datasets for testing matching algorithms

How Duplicates Work

Total records: 100
Repeat probability: 0.05

Unique records generated: 95
Duplicates added: 5 (randomly selected from the 95)
Total output: 100 records (95 unique + 5 duplicates)

Duplicates have different RecordIds but identical attributes, so they produce matching tokens.


Testing Workflows

Basic Token Generation Test

# 1. Generate test data
cd tools/mockdata
python data_generator.py 100 0.05 test_data.csv

# 2. Process with OpenToken
cd ../../
./run-opentoken.sh \
  -i tools/mockdata/test_data.csv \
  -o resources/test_output.csv \
  -t csv \
  -h "HashingKey" \
  -e "Secret-Encryption-Key-Goes-Here."

# 3. Check output
cat resources/test_output.csv
cat resources/test_output.metadata.json

Overlap Analysis Test

Generate two datasets with controlled overlap:

# Generate two datasets
cd tools/mockdata
python data_generator.py 1000 0.0 dataset_a.csv
python data_generator.py 1000 0.0 dataset_b.csv

# Add some common records manually or use a script
# Then process both with OpenToken using the same secrets

For automated overlap analysis, see Spark or Databricks.


Sample Data Files

Pre-generated sample files are available in resources/:

File Description
sample.csv Small test file for quickstart
mockdata/test_data.csv 100 records with duplicates
mockdata/test_overlap1.csv Dataset for overlap testing
mockdata/test_overlap2.csv Dataset for overlap testing

Requirements

The mock data generator requires:

pip install faker

Or install from the tools requirements:

cd tools
pip install -r requirements.txt

Large-Scale Testing

For generating large test datasets:

# Generate 1 million records (takes a few minutes)
python data_generator.py 1000000 0.02 large_test.csv

# Progress is printed every 1000 records
# Writing record 0
# Writing record 1000
# Writing record 2000
# ...

Tip: For very large files, use Parquet format for processing:

# Convert CSV to Parquet (using pandas or spark)
python -c "
import pandas as pd
df = pd.read_csv('large_test.csv')
df.to_parquet('large_test.parquet')
"

# Process with OpenToken
java -jar opentoken-cli-*.jar \
  -i large_test.parquet -t parquet \
  -o tokens.parquet \
  -h "HashingKey" -e "EncryptionKey"

Next Steps