
How do we set up sensitive data discovery rules in Tonic Structural for PII/PHI?
Most teams hit the same wall: you know there’s PII/PHI scattered across your databases, but you don’t have a reliable, automatable way to find it before it leaks into dev, staging, QA, or AI pipelines. In Tonic Structural, sensitive data discovery rules are the foundation for everything else—if you get this right, you can safely generate high-fidelity test data and stay ahead of compliance instead of chasing it.
This guide walks through how to set up PII/PHI discovery in Tonic Structural so you can continuously identify sensitive fields, apply the right transformations, and keep production realism without exposing real identities.
Quick Answer: You set up PII/PHI discovery in Tonic Structural by combining built-in data discovery with custom sensitivity rules. Start by scanning your schemas, then layer in pattern-based, metadata-based, and column-level rules to automatically tag sensitive fields and drive de-identification and synthesis at scale.
The Quick Overview
- What It Is: Sensitive data discovery in Tonic Structural is a rules-driven engine that automatically identifies PII/PHI and other regulated data across your structured and semi-structured sources. It powers consistent de-identification and synthesis without manual column chasing.
- Who It Is For: Data, platform, and security teams responsible for keeping lower environments and AI workflows privacy-safe while still giving engineering realistic, production-shaped data.
- Core Problem Solved: It eliminates blind spots and ad-hoc scripts by systematically detecting sensitive fields—so you don’t rely on tribal knowledge or manual audits to avoid copying live PII/PHI into dev.
How It Works
At a high level, Tonic Structural turns sensitive data discovery into a repeatable workflow:
- Scan and classify your schemas: Structural connects to your production databases, inspects schemas, and runs built-in data discovery to propose sensitivity classifications for likely PII/PHI fields.
- Define and refine sensitivity rules: You layer on custom discovery rules—based on patterns (e.g., SSNs), column names, data types, and metadata—to encode your organization’s definition of “sensitive.”
- Bind rules to transformations: Once fields are tagged, you attach consistent transforms (masking, synthesis, format-preserving encryption) so every new dataset or schema change inherits the right protections automatically.
The result: instead of manually hunting for PII every time you create test data, sensitive data discovery rules become the policy-as-code layer that Structural executes on every run.
Step 1: Establish Your Sensitivity Model
Before you configure anything in Tonic Structural, decide what “sensitive” means in your environment. That sounds obvious, but it’s where most teams get tripped up.
1.1 Define PII/PHI categories
Start by grouping data into concrete buckets:
- Direct identifiers (PII): Names, SSNs, national IDs, email addresses, phone numbers, full addresses, account numbers, license numbers, etc.
- Quasi-identifiers: Date of birth, ZIP + gender + age, IP addresses, device IDs, session IDs, etc.
- PHI (for HIPAA / healthcare): Diagnosis codes (ICD), procedure codes (CPT), lab results, medication names, encounter IDs tied to a person, notes fields that may embed health details.
- Regulatory-specific fields: Anything explicitly called out in GDPR, CCPA, PCI DSS, or your BAA/data processing agreements.
This model will map directly to sensitivity tags and rule scopes in Structural.
1.2 Decide protection levels
For each category, define:
- Required transform type: e.g., format-preserving encryption for emails, synthetic generation for patients, deterministic masking for IDs used in joins.
- Reversibility: What must be irreversibly de-identified vs what can be reversibly tokenized under strict key management.
- Usage constraints: Fields that can be generalized (birth year instead of full DOB) vs those that must keep fine-grained distributions for analytics and tests.
You’ll encode these decisions into Tonic via sensitivity categories and associated transforms.
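Before configuring anything, it can help to write the model down in a form you can review and diff. The sketch below is plain Python, not Structural configuration: the `Category` and `ProtectionPolicy` names, the transform labels, and the default values are all illustrative assumptions that your own compliance decisions would replace.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    DIRECT_IDENTIFIER = "direct_identifier"
    QUASI_IDENTIFIER = "quasi_identifier"
    PHI = "phi"
    REGULATORY = "regulatory"


@dataclass(frozen=True)
class ProtectionPolicy:
    transform: str        # hypothetical label, e.g. "synthesize"
    reversible: bool      # may the original be recovered under key management?
    generalizable: bool   # may the value be coarsened (DOB -> birth year)?


# Hypothetical defaults; legal/compliance teams own the real values.
POLICIES = {
    Category.DIRECT_IDENTIFIER: ProtectionPolicy("synthesize", False, False),
    Category.QUASI_IDENTIFIER: ProtectionPolicy("generalize", False, True),
    Category.PHI: ProtectionPolicy("synthesize", False, False),
    Category.REGULATORY: ProtectionPolicy("tokenize", True, False),
}


def policy_for(category: Category) -> ProtectionPolicy:
    """Look up the protection level agreed for a sensitivity category."""
    return POLICIES[category]
```

Keeping this table in version control gives reviewers a single place to see which categories may be reversed or generalized.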
Step 2: Run Initial Data Discovery in Tonic Structural
Now you use Structural’s built-in discovery capabilities to get a baseline map of PII/PHI.
2.1 Connect your source databases
Typical sources include:
- OLTP databases (PostgreSQL, MySQL, SQL Server, Oracle)
- Cloud data warehouses (Snowflake, BigQuery, Redshift)
- Other structured/semi-structured stores
You point Tonic Structural at your production (or production-like) instance with read access; it inspects:
- Schemas
- Tables and views
- Column names and data types
- Sampling of values (depending on configuration and access policies)
2.2 Run schema scanning and auto-classification
With the connection established, run Structural’s data discovery:
- Schema scanning: Structural analyzes table/column names, types, and in some cases value patterns to flag likely PII/PHI.
- Predefined detectors: Built-in heuristics for common fields like:
  - email, user_email, contact_email
  - phone, mobile, cell
  - ssn, social_security_number, tax IDs
  - address, street, zip, postal_code
  - dob, birthdate, date_of_birth
- Candidate sensitivity tags: Structural suggests sensitivity classifications (PII, PHI, financial, etc.) for your review.
This gives you a draft sensitivity map without writing a single rule, which you’ll harden with custom rules in the next step.
Step 3: Configure Custom Sensitivity Rules for PII/PHI
Built-in discovery is a fast start, but it doesn’t know your domain-specific naming, your legacy quirks, or how your particular schemas express PHI. That’s where custom sensitivity rules come in.
Think of these as “pattern matchers plus actions” that Structural runs continuously.
3.1 Name-based discovery rules
Use column, table, and schema naming conventions to drive discovery:
- Column name patterns:
  - Regex-based rules like:
    - Columns matching (?i).*email.* → tag as PII: Email
    - Columns matching (?i).*(ssn|social_security).* → tag as PII: National ID
    - Columns matching (?i).*(mrn|medical_record_number).* → tag as PHI: Medical ID
  - Case-insensitive, with room for legacy variants.
- Table context:
  - Any column named id in tables whose names match (?i).*patient.* → tag as PHI: Patient Identifier.
  - Any column named member_id in claims_* tables → tag as PHI: Member ID.
This codifies tribal knowledge that only your team currently remembers.
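To see how such rules behave before you enter them in the product, you can prototype them outside it. The following is a minimal sketch in plain Python, not Structural's rule syntax: `NameRule`, `NAME_RULES`, and `classify_by_name` are hypothetical names, and the patterns and tags are copied from the examples above.

```python
import re
from typing import NamedTuple, Optional


class NameRule(NamedTuple):
    column_pattern: str
    tag: str
    table_pattern: Optional[str] = None  # None means "any table"


# First matching rule wins, so order rules from most to least specific as needed.
NAME_RULES = [
    NameRule(r".*email.*", "PII: Email"),
    NameRule(r".*(ssn|social_security).*", "PII: National ID"),
    NameRule(r".*(mrn|medical_record_number).*", "PHI: Medical ID"),
    NameRule(r"id", "PHI: Patient Identifier", table_pattern=r".*patient.*"),
    NameRule(r"member_id", "PHI: Member ID", table_pattern=r"claims_.*"),
]


def classify_by_name(table: str, column: str) -> Optional[str]:
    """Return the tag of the first rule matching this table/column, else None."""
    for rule in NAME_RULES:
        if rule.table_pattern and not re.fullmatch(
            rule.table_pattern, table, re.IGNORECASE
        ):
            continue
        if re.fullmatch(rule.column_pattern, column, re.IGNORECASE):
            return rule.tag
    return None


print(classify_by_name("users", "Contact_Email"))  # PII: Email
print(classify_by_name("patients", "id"))          # PHI: Patient Identifier
print(classify_by_name("orders", "id"))            # None
```

Prototyping also surfaces over-matching early: the broad pattern `.*ssn.*` matches a column called `classname`, because that name happens to contain the letters "ssn". Table scoping or tighter patterns are the usual remedies.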
3.2 Data type and length rules
Some fields are sensitive not because of names but because of types and shapes:
- Type + length combos:
  - VARCHAR(9) or CHAR(9) with US-SSN pattern → PII: SSN.
  - CHAR(16) numeric in payments schemas → PCI-like card token or account number.
- Date fields:
  - Date columns in tables matching patient, encounter, or admission → PHI: Event Dates (subject to HIPAA date handling rules).
You can scope these rules to specific schemas or databases to avoid over-tagging.
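The type-and-shape checks above can be sketched as a single function over column metadata. This is an illustrative prototype, not Structural's rule engine: `classify_by_type`, its parameters, and the returned labels are assumptions, and a nine-character column is treated as holding undashed, nine-digit SSNs.

```python
import re
from typing import Optional, Sequence

DATE_TABLE_HINTS = ("patient", "encounter", "admission")


def classify_by_type(
    schema: str,
    table: str,
    data_type: str,
    length: Optional[int],
    sample: Sequence[str],
) -> Optional[str]:
    """Suggest a tag from a column's type, declared length, and sampled values."""
    dtype = data_type.upper()
    values = [v for v in sample if v]

    # Nine-character text columns whose sampled values are all nine digits.
    if dtype in ("CHAR", "VARCHAR") and length == 9:
        if values and all(re.fullmatch(r"[0-9]{9}", v) for v in values):
            return "PII: SSN"

    # Sixteen-digit values, scoped to payments schemas to limit over-tagging.
    if dtype == "CHAR" and length == 16 and "payment" in schema.lower():
        if values and all(re.fullmatch(r"[0-9]{16}", v) for v in values):
            return "PCI: Card or Account Number"

    # Date columns in clinical-looking tables.
    if dtype in ("DATE", "TIMESTAMP"):
        if any(hint in table.lower() for hint in DATE_TABLE_HINTS):
            return "PHI: Event Date"

    return None
```

Requiring both the declared shape and matching sample values keeps a generic `CHAR(9)` code column from being tagged as an SSN on length alone.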
3.3 Value-pattern rules (regex-based)
When you enable value sampling, Structural can evaluate actual contents:
- Regex examples:
  - Email: [A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,} (case-insensitive)
  - Phone: \+?\d[\d\-\s()]{7,}
  - US SSN: \d{3}-\d{2}-\d{4} (with additional logic to exclude invalid ranges)
- PHI patterns:
  - ICD-10 codes: [A-TV-Z][0-9][A-Z0-9](\.[A-Z0-9]{1,4})?
  - CPT codes: \d{5} in procedure_code columns.
These rules help catch “generic” column names (value, data, info) that actually store identifiers.
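A value-pattern rule usually needs two things beyond the regex itself: a match threshold across the sample, and the "additional logic" mentioned for SSNs. The sketch below shows one way to combine them in plain Python. The function names and the 0.8 threshold are assumptions. The SSN filter rejects numbers that are never issued (area 000, 666, or 900-999; group 00; serial 0000).

```python
import re

# Order matters: the phone pattern is loose enough to match SSN-shaped
# strings, so the more specific SSN pattern is checked before it.
VALUE_PATTERNS = {
    "PII: Email": re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}", re.IGNORECASE),
    "PII: SSN": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "PII: Phone": re.compile(r"\+?\d[\d\-\s()]{7,}"),
}


def plausible_ssn(value: str) -> bool:
    """Reject SSN-shaped strings that fall in never-issued ranges."""
    area, group, serial = value.split("-")
    return not (
        area in ("000", "666")
        or area.startswith("9")
        or group == "00"
        or serial == "0000"
    )


def classify_by_values(sample, threshold=0.8):
    """Return the first tag whose pattern matches at least `threshold` of the sample."""
    values = [v.strip() for v in sample if v and v.strip()]
    if not values:
        return None
    for tag, pattern in VALUE_PATTERNS.items():
        hits = [v for v in values if pattern.fullmatch(v)]
        if tag == "PII: SSN":
            hits = [v for v in hits if plausible_ssn(v)]
        if len(hits) / len(values) >= threshold:
            return tag
    return None


print(classify_by_values(["a@example.com", "B.C@test.org", "d@x.io"]))  # PII: Email
print(classify_by_values(["123-45-6789", "078-05-1120"]))               # PII: SSN
print(classify_by_values(["hello", "world"]))                           # None
```

A threshold below 1.0 tolerates a few dirty values in the sample, while still preventing a single stray email in a free-text column from tagging the whole column.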
3.4 Metadata-driven rules (classification tags)
If you already classify data upstream (e.g., in your data catalog or warehouse), Structural can align to that:
- Using existing tags:
  - Columns tagged sensitivity=pii or classification=restricted in your warehouse → auto-tag as sensitive in Tonic.
- Policy propagation:
  - Align Structural rules with tags like GDPR_subject, HIPAA_phi, or PCI_pan, so governance is consistent across the stack.
This turns your discovery into part of a broader data governance workflow rather than a standalone island.
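If your catalog can export column tags, the alignment step reduces to a lookup. The sketch below assumes a hypothetical export shaped as `{column: [tags]}`. The tag names come from the examples above, while the labels on the right-hand side and the function name are illustrative.

```python
# Hypothetical mapping from upstream catalog tags to sensitivity labels.
TAG_TO_SENSITIVITY = {
    "sensitivity=pii": "PII",
    "classification=restricted": "Restricted",
    "GDPR_subject": "PII",
    "HIPAA_phi": "PHI",
    "PCI_pan": "PCI",
}


def labels_from_catalog(catalog):
    """Translate {column: [catalog tags]} into {column: sorted sensitivity labels}.

    Columns whose tags map to no sensitivity label are omitted from the result.
    """
    result = {}
    for column, tags in catalog.items():
        labels = sorted(
            {TAG_TO_SENSITIVITY[t] for t in tags if t in TAG_TO_SENSITIVITY}
        )
        if labels:
            result[column] = labels
    return result
```

Keeping the mapping in one dictionary makes it easy to review with governance owners whenever the catalog's tag vocabulary changes.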
Step 4: Map Sensitivity Tags to De-identification & Synthesis
Discovery rules are only useful if they drive the right transformations. In Tonic Structural, that mapping is explicit.
4.1 Attach transforms to sensitivity categories
For each sensitivity type, define defaults:
- Direct identifiers (PII/PHI)
  - Use synthesis (generate entirely new values) for:
    - Names, addresses, patient identifiers, member IDs
  - Use format-preserving encryption or deterministic masking for:
    - Emails, user IDs, account numbers used in joins
- Quasi-identifiers
  - Use generalization:
    - DOB → birth year or age band
    - ZIP code → 3-digit prefix
- Medical details (PHI)
  - Use category-preserving synthesis:
    - Keep diagnosis and procedure distributions realistic while swapping individual codes.
    - Maintain referential integrity between encounters, patients, and claims.
Structural’s engine preserves cross-table consistency and referential integrity, so your application logic, joins, and workflows still behave like they’re hitting production data—without live PII/PHI.
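To make the generalization option concrete, here is a small sketch of the two quasi-identifier transforms listed above. It is plain Python for illustration, not a Structural generator, and the function names are assumptions. The age cap follows the HIPAA Safe Harbor convention of aggregating ages over 89.

```python
from datetime import date


def birth_year(dob: date) -> int:
    """Generalize a full date of birth to its year."""
    return dob.year


def age_band(dob: date, today: date, width: int = 10) -> str:
    """Generalize a date of birth to an age band such as '30-39'.

    Ages of 90 and above collapse into '90+'.
    """
    # Subtract one year if this year's birthday has not happened yet.
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    if age >= 90:
        return "90+"
    low = (age // width) * width
    high = min(low + width - 1, 89)
    return f"{low}-{high}"


def zip3(zip_code: str) -> str:
    """Keep only the first three characters of a US ZIP or ZIP+4 code."""
    return zip_code.strip()[:3]


print(age_band(date(1990, 6, 15), date(2024, 6, 14)))  # 30-39
print(zip3("94107-1234"))                              # 941
```

Safe Harbor additionally requires replacing the prefix with 000 for three-digit ZIP areas containing 20,000 or fewer people; that population lookup is not implemented in this sketch.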
4.2 Override at column or table level when needed
Not every column in a sensitivity category should behave the same:
- Override for:
  - Primary keys that must remain joinable across multiple systems.
  - Analytics-critical measures that require preserving statistical properties.
- Example:
  - patient_id → deterministic tokenization so related tables all point to the same synthetic patient.
  - diagnosis_code → synthetic but distribution-preserving so cohorts and risk models still behave realistically.
This lets you keep fidelity where it matters while staying aligned with your privacy posture.
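The reason deterministic tokenization keeps joins intact is easy to demonstrate with a keyed hash. The sketch below uses HMAC-SHA256 from the Python standard library as a generic stand-in. It is not a description of how Structural implements its transforms, and the key, prefix, and row data are illustrative.

```python
import hashlib
import hmac


def tokenize(value: str, key: bytes, prefix: str = "tok_") -> str:
    """Map a value to a stable pseudonym using HMAC-SHA256.

    The same (value, key) pair always yields the same token, so foreign keys
    still line up once every table is processed with the same key.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return prefix + digest[:16]


# Illustrative key and rows; a real key would come from a secrets manager.
KEY = b"demo-key-do-not-use"
patients = [{"patient_id": "P001"}, {"patient_id": "P002"}]
encounters = [
    {"encounter_id": 1, "patient_id": "P001"},
    {"encounter_id": 2, "patient_id": "P001"},
]

masked_patients = [
    {**row, "patient_id": tokenize(row["patient_id"], KEY)} for row in patients
]
masked_encounters = [
    {**row, "patient_id": tokenize(row["patient_id"], KEY)} for row in encounters
]

# Both encounters still point at the first masked patient.
print(masked_encounters[0]["patient_id"] == masked_patients[0]["patient_id"])  # True
```

Truncating the digest to 16 hex characters keeps tokens short at the cost of a small, nonzero collision risk; keep more of the digest for very large tables. Anyone holding the key can recompute the mapping, so key handling falls under the reversibility decisions from Step 1.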
Step 5: Validate and Iterate on Discovery Rules
You don’t get this perfect in one pass. The goal is a ruleset that you trust enough to run automatically with new schemas and datasets.
5.1 Review discovery results across schemas
Use Structural’s UI to:
- Inspect flagged columns and confirm or correct their sensitivity.
- Highlight:
  - Columns you expected to be sensitive but weren’t flagged.
  - Columns incorrectly tagged as sensitive (to refine patterns).
- Save corrections as new rules where appropriate.
Over time, this reduces manual overrides to edge cases only.
5.2 Test against known PII/PHI inventories
If you have:
- A data dictionary listing PII/PHI fields.
- Existing DLP or catalog classifications.
Use them as regression tests:
- Compare Structural’s discovered PII/PHI against your inventory.
- Patch gaps via:
  - New name-based or value-based rules.
  - Adjusting rule scopes (schemas, tables).
This turns your discovery rules into living policy-as-code.
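The comparison itself is a pair of set differences. A minimal sketch, assuming you can export both the discovered columns and your inventory as (table, column) pairs; the function name and report keys are illustrative:

```python
def compare_with_inventory(discovered, inventory):
    """Compare discovered sensitive columns against a trusted inventory.

    Both arguments are iterables of (table, column) pairs.
    """
    discovered, inventory = set(discovered), set(inventory)
    found = discovered & inventory
    return {
        # In the inventory but not discovered: add or widen rules.
        "missed": sorted(inventory - discovered),
        # Discovered but not in the inventory: over-tagging or a stale inventory.
        "unexpected": sorted(discovered - inventory),
        # Share of inventoried columns that discovery found.
        "recall": len(found) / len(inventory) if inventory else 1.0,
    }


inventory = {("users", "email"), ("users", "ssn"), ("patients", "mrn")}
discovered = {("users", "email"), ("patients", "mrn"), ("orders", "note")}
report = compare_with_inventory(discovered, inventory)
print(report["missed"])      # [('users', 'ssn')]
print(report["unexpected"])  # [('orders', 'note')]
```

Running this check after every rule change shows whether a fix for one gap quietly opened another.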
5.3 Monitor schema changes with alerts
Real-world schemas evolve. New columns get added, and that’s when sensitive data quietly leaks into dev unless you catch it.
Use Structural’s schema change alerts to:
- Automatically detect new tables/columns.
- Trigger discovery rules on new fields.
- Notify owners when:
  - A new column is classified as PII/PHI.
  - A previously non-sensitive column starts matching sensitive patterns.
This builds continuous compliance into your CI/CD and schema migration workflows.
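Conceptually, detecting new fields is a diff between two schema snapshots. The sketch below illustrates that idea in plain Python. It does not call Structural's alerting feature, and the snapshot shape (table name mapped to column names) is an assumption.

```python
def new_columns(previous, current):
    """Return sorted (table, column) pairs present in `current` but not in `previous`.

    Each snapshot maps a table name to an iterable of its column names.
    """
    added = []
    for table, columns in current.items():
        known = set(previous.get(table, ()))
        added.extend((table, column) for column in columns if column not in known)
    return sorted(added)


before = {"users": ["id", "email"]}
after = {"users": ["id", "email", "phone"], "patients": ["id", "mrn"]}
print(new_columns(before, after))
# [('patients', 'id'), ('patients', 'mrn'), ('users', 'phone')]
```

Each pair in the result is a candidate for your discovery rules, so a migration pipeline could fail or notify when any of them is classified as sensitive.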
Step 6: Operationalize Discovery in Your Test Data Pipelines
Once you trust your rules, wire them into your standard workflows so sensitive data discovery just happens.
6.1 Bake discovery into refresh jobs
For every environment refresh or subset creation:
- Run Structural’s discovery step first.
- Apply transforms bound to sensitivity tags.
- Generate:
  - Full test databases for staging.
  - Smaller, referentially intact subsets for QA and local dev.
Because discovery rules are declarative, you don’t need to re-approve every dataset manually—teams get updated, safe data as part of their normal cycle.
6.2 Integrate with CI/CD and infra-as-code
Tie discovery runs to:
- Database migration pipelines (e.g., whenever Alembic/Flyway migrates schemas).
- Infrastructure changes to data stores.
- New microservice onboarding (when they register new tables).
You can automate via Tonic Structural’s API or SDK so discovery and generation are triggered alongside builds, not as a separate, manual step.
6.3 Support AI and analytics workflows
For AI model training and RAG pipelines, discovery rules:
- Identify PII/PHI in structured sources feeding feature stores.
- Tag fields that must be de-identified before exporting to:
  - Data lakes backing ML platforms.
  - External vendors or shared sandboxes.
When combined with Tonic Textual for unstructured data, you get consistent PII/PHI handling across both tables and text.
Ideal Use Cases
- Best for regulated orgs modernizing test data: It automates PII/PHI detection across sprawling schemas, so you can replace ad-hoc masking scripts and reduce the number of unofficial production copies in lower environments.
- Best for teams unblocking AI while satisfying compliance: It gives you a deterministic way to strip identifiers out of training data and feature stores without destroying the distributions and relationships your models rely on.
Limitations & Considerations
- Discovery is only as good as your rules and visibility: If you severely restrict value sampling or have highly non-standard naming, you’ll rely more on carefully crafted custom rules and upstream metadata. Plan for a few iterations.
- PII/PHI definitions vary by jurisdiction and risk appetite: Structural gives you the tools, but your legal/compliance teams define what must be tagged. Align on classification policy before you lock in transforms.
Frequently Asked Questions
Do we have to tag every PII/PHI column manually in Tonic Structural?
Short Answer: No. You can start with built-in discovery and then layer automated sensitivity rules so most PII/PHI is tagged automatically.
Details: Structural’s schema scanning gives you an initial pass at likely sensitive columns. From there, you define custom rules based on column names, table context, data types, and value patterns. Over time, the combination of base discovery + your ruleset will cover the vast majority of PII/PHI with minimal manual intervention; you only review edge cases and new schema changes.
How does Tonic Structural handle newly added columns that contain PII/PHI?
Short Answer: Structural detects schema changes, re-runs discovery on new columns, and applies your sensitivity rules automatically.
Details: When a database schema changes—new tables, columns, or altered types—Structural’s schema change alerts kick in. Discovery rules run against those new fields, tagging any PII/PHI they match and applying the associated transforms (masking, synthesis, encryption). You can configure notifications so data owners and security teams are aware when new sensitive fields appear, but the protections themselves are applied automatically in your test data pipelines.
Summary
Setting up sensitive data discovery rules in Tonic Structural for PII/PHI is about turning privacy into an engineering workflow instead of a spreadsheet and a set of tribal norms. You:
- Define your PII/PHI model and protection levels.
- Use Structural’s built-in discovery to map likely sensitive columns.
- Encode your domain knowledge into custom name, type, pattern, and metadata rules.
- Bind sensitivity tags to high-fidelity transforms that preserve referential integrity and statistical properties.
- Continuously validate, alert on schema changes, and integrate discovery into your refresh and CI/CD workflows.
The payoff is concrete: production-shaped, privacy-safe datasets that hydrate staging and QA, unblock AI initiatives, and reduce escaped defects—without copying raw PII/PHI into places it doesn’t belong.