
How do we set up sensitive data discovery rules in Tonic Structural for PII/PHI?


Engineering teams run into the same brick wall over and over: you need production-shaped data in lower environments, but you can’t safely move raw PII and PHI downstream. Sensitive data discovery rules in Tonic Structural are how you break that deadlock. You describe what “sensitive” means for your org, Structural finds it automatically, and every downstream de-identification or synthesis run stays aligned with your privacy model.

Quick Answer: In Tonic Structural, you set up sensitive data discovery for PII/PHI by combining built-in classifiers with custom sensitivity rules tied to patterns, column metadata, and data samples. Once configured, Structural automatically flags regulated fields and routes them through the right de-identification and synthesis transforms on every run.

The Quick Overview

  • What It Is: A configurable sensitive data discovery and classification system inside Tonic Structural that automatically identifies PII/PHI across your structured and semi-structured sources.
  • Who It Is For: Data, security, and platform teams who need high-fidelity test and AI training data without manually hunting for social security numbers, medical IDs, or other regulated fields.
  • Core Problem Solved: You stop relying on ad-hoc knowledge and brittle scripts to find sensitive columns, and instead build a repeatable, governed discovery pipeline that keeps pace with schema changes.

How It Works

Tonic Structural sits between your production data stores and your lower environments. Before it ever generates a de-identified or synthetic dataset, it runs discovery rules against schemas and samples of your data to classify sensitive fields. These rules can use:

  • Built-in detection for common PII/PHI patterns
  • Column names, data types, and tags
  • Regular expressions and custom logic on sample values

Once a field is classified as sensitive, Structural applies your chosen transforms (masking, synthesis, format-preserving encryption, etc.) while preserving referential integrity and statistical properties. The result: realistic, compliant datasets without dragging PII/PHI into dev, QA, or AI pipelines.
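To make the classification logic concrete, here is a minimal Python sketch of how a discovery rule can combine column-name globs with regexes over sampled values. This is illustrative only: the rule structure, globs, and regexes are assumptions for the example, not Structural's configuration format (which you manage through the product itself).

```python
# Hypothetical sketch of discovery-rule matching -- NOT the Tonic API.
# Rule fields, globs, and regexes here are illustrative assumptions.
import fnmatch
import re

# Each rule matches on a column-name glob, optionally refined by a value regex.
RULES = [
    {"category": "PII", "name_glob": "*email*", "value_regex": r"^[^@\s]+@[^@\s]+\.[^@\s]+$"},
    {"category": "PII", "name_glob": "*_ssn",   "value_regex": r"^\d{3}-\d{2}-\d{4}$"},
    {"category": "PHI", "name_glob": "patient_*", "value_regex": None},
]

def classify(column_name, sample_values):
    """Return the first sensitivity category whose rule matches, else None."""
    for rule in RULES:
        if not fnmatch.fnmatch(column_name.lower(), rule["name_glob"]):
            continue
        regex = rule["value_regex"]
        if regex is None or all(re.match(regex, v) for v in sample_values):
            return rule["category"]
    return None

print(classify("contact_email", ["a@example.com", "b@example.org"]))  # PII
print(classify("patient_mrn", ["MRN-0012"]))                          # PHI
print(classify("order_total", ["19.99"]))                             # None
```

The key design point this mirrors: metadata checks (names, types) are cheap and run first; value sampling confirms or rejects the match, so a column named `contact_email` full of numeric codes would not be misclassified.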

  1. Scan & Catalog: Structural connects to your databases, warehouses, or lakes, ingests schemas, and runs automated discovery to produce an initial sensitivity map.
  2. Refine with Custom Rules: You add or adjust sensitivity rules for your specific PII/PHI definitions—like plan IDs, internal customer keys, or region-specific identifiers—and map them to transforms.
  3. Enforce in Generation: Every time you generate a subset or synthetic dataset, Structural re-runs discovery with your rules, applies the right transforms, and alerts on new or drifting sensitive fields.

Features & Benefits Breakdown

  • Built-in PII/PHI classifiers: Uses out-of-the-box patterns and heuristics to detect common sensitive fields (names, emails, SSNs, medical IDs) during schema scans. Benefit: cuts setup time and finds obvious exposure points without bespoke configuration.
  • Custom sensitivity rules: Lets you define your own discovery logic based on column names, types, tags, and regex against sample data. Benefit: aligns Structural with your exact risk model and domain-specific PII/PHI.
  • Schema change & drift awareness: Re-runs discovery and flags new or modified columns that match your rules on subsequent runs. Benefit: prevents silent leaks from new features, schema updates, or vendor changes.

Ideal Use Cases

  • Best for staging and QA environment refreshes: Because it enforces consistent PII/PHI discovery before every generation, keeping non-production environments realistic but safe.
  • Best for AI training and analytics sandboxes: Because it identifies and removes PII/PHI from structured sources feeding your models, without flattening distributions or breaking joins your feature pipelines depend on.

Limitations & Considerations

  • Discovery isn’t magic: Automated rules will surface the majority of sensitive fields, but you still need to review results with data owners to capture edge cases and confirm classification choices.
  • Unstructured data needs Textual: Structural is optimized for structured/semi-structured sources; for free text (tickets, notes, documents), pair it with Tonic Textual’s NER-powered detection and redaction pipeline.

Pricing & Plans

Tonic Structural is licensed as part of the broader Tonic suite, with options for cloud or self-hosted deployment and enterprise-grade security (SOC 2 Type II, HIPAA, GDPR, AWS Qualified Software). Pricing depends on footprint (data sources, scale) and required features.

  • Growth / Team: Best for engineering teams needing reliable test data de-identification across a handful of core databases and warehouses.
  • Enterprise: Best for regulated organizations needing large-scale coverage, SSO/SAML, advanced governance, and tight integration into CI/CD and AI workflows.

For detailed pricing and plan specifics, talk directly with the Tonic team.

Frequently Asked Questions

How do we actually create and manage sensitive data discovery rules for PII/PHI in Structural?

Short Answer: You combine Structural’s built-in classifiers with your own custom sensitivity rules, then tie those rules to transforms that run automatically in every generation job.

Details: A practical workflow looks like this:

  1. Connect your sources and run an initial scan

    • Point Structural at your production-like sources (e.g., Postgres, MySQL, SQL Server, Snowflake, BigQuery).
    • Run a discovery scan so Structural can:
      • Read schemas (tables, columns, types, constraints)
      • Sample data (within the access controls you define)
      • Apply built-in PII/PHI classifiers

    That produces a first pass at a sensitivity catalog: emails, names, phone numbers, SSNs, account IDs, etc.

  2. Review detected PII/PHI and refine classifications

    • Use Structural’s UI to inspect flagged columns and confirm or adjust their sensitivity level.
    • Decide which fields truly need strong protection (e.g., PHI, national IDs) vs. those you might treat as lower sensitivity (e.g., derived non-identifying codes).
  3. Define custom sensitivity rules for your domain
    Built-in classifiers won’t know your internal semantics; this is where custom rules matter. You typically encode rules along three axes:

    • Names and metadata:

      • Match column names: *_ssn, *_dob, patient_*, member_id
      • Match schemas or databases that are always sensitive (e.g., ehr_*, claims_*)
      • Leverage tags or comments you already maintain in your catalog
    • Data type and constraints:

      • Flag DATE or TIMESTAMP columns labeled as birth dates or admission dates
      • Flag integer or string keys that match your definition of “user identifier”
    • Pattern-based rules:

      • Apply regex to sampled values (e.g., US SSNs, MRN formats, ICD codes)
      • Use country or business-unit specific patterns

    In Structural, you capture these rules as “sensitivity rules” and assign them a category (e.g., PII, PHI, internal secret) and downstream treatment policy.
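As an illustration of the pattern-based axis, the sketch below checks sampled values against a few domain-specific regexes using a hit-rate threshold. The regexes (a simplified SSN, an invented MRN format, and a simplified ICD-10 shape) are assumptions for the example; substitute your organization's real formats.

```python
# Hypothetical pattern rules for domain-specific identifiers.
# These regexes are illustrative assumptions, not Tonic's built-in classifiers.
import re

PATTERNS = {
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    # Example in-house MRN format: 2-letter site code + 8 digits (an assumption).
    "mrn":    re.compile(r"^[A-Z]{2}\d{8}$"),
    # Simplified ICD-10 diagnosis code shape: letter, two digits, optional subcode.
    "icd10":  re.compile(r"^[A-Z]\d{2}(\.\d{1,4})?$"),
}

def detect_patterns(sample_values, min_hit_rate=0.8):
    """Report which patterns match at least min_hit_rate of sampled values.

    A hit-rate threshold tolerates a few dirty rows without missing a
    genuinely sensitive column.
    """
    hits = {}
    for name, regex in PATTERNS.items():
        matched = sum(1 for v in sample_values if regex.match(v))
        rate = matched / len(sample_values) if sample_values else 0.0
        if rate >= min_hit_rate:
            hits[name] = rate
    return hits

print(detect_patterns(["123-45-6789", "987-65-4321", "n/a"], min_hit_rate=0.6))
```

The threshold matters in practice: real columns contain nulls, placeholders, and typos, so requiring 100% matches would miss sensitive fields, while a very low threshold produces false positives on free-text columns.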

  4. Attach transforms to each sensitivity category
    Once discovery rules are in place, you map each sensitivity category to a default transform strategy, such as:

    • Names, emails, phone numbers: generate synthetic replacements that preserve format and cross-table consistency.
    • Account and patient IDs: apply deterministic masking or format-preserving encryption to keep joins working across tables.
    • Highly sensitive PHI (e.g., freeform diagnosis, rare IDs): use heavier synthesis or generalization to remove re-identification risk.

    Structural then uses these mappings to automatically configure transforms whenever a rule matches a column.
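To see why deterministic masking keeps joins working, here is a small sketch using a keyed hash (HMAC). This stands in for Structural's built-in transforms (which also include format-preserving encryption); the HMAC approach is just one common technique, and the key here is a placeholder.

```python
# Sketch of deterministic masking for ID columns: the same input always maps
# to the same token, so foreign-key joins survive de-identification.
# This is an illustrative technique, not Structural's internal implementation.
import hmac
import hashlib

SECRET_KEY = b"rotate-me-per-environment"  # placeholder key, an assumption

def mask_id(value: str, length: int = 12) -> str:
    """Deterministically pseudonymize an identifier with a keyed hash."""
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return digest[:length]

# The same patient_id yields the same token in every table it appears in.
orders_fk   = mask_id("patient-8842")
patients_pk = mask_id("patient-8842")
print(orders_fk == patients_pk)                              # True: joins still work
print(mask_id("patient-8842") == mask_id("patient-8843"))    # False: distinct IDs stay distinct
```

Keying the hash (rather than using a bare SHA-256) matters: without a secret, anyone with a list of candidate IDs could recompute the hashes and re-identify rows.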

  5. Bake discovery into your generation jobs
    Sensitive data discovery isn’t a one-and-done exercise; schemas evolve. You should:

    • Configure Structural to re-run sensitivity rules on every generation job.
    • Enable schema change alerts so new columns that match your PII/PHI rules are immediately flagged and transformed, not silently passed through.
    • Integrate this into CI/CD so database migrations automatically trigger re-scans before being used to hydrate lower environments.
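A CI gate for this step can be sketched as a schema diff against the last approved snapshot, failing the build when a new column looks sensitive but has no transform assigned. The snapshot format, glob list, and function below are assumptions for illustration, not Tonic artifacts.

```python
# Minimal sketch of a schema-drift gate for CI. The approved snapshot and
# sensitivity globs are illustrative assumptions, not Tonic artifacts.
import fnmatch

APPROVED = {"users": {"id", "email", "created_at"}}
SENSITIVE_GLOBS = ["*email*", "*_ssn", "*_dob", "patient_*"]

def drift_check(current_schema, transforms):
    """Return new columns that look sensitive but lack a transform mapping."""
    unprotected = []
    for table, columns in current_schema.items():
        new_cols = columns - APPROVED.get(table, set())
        for col in new_cols:
            sensitive = any(fnmatch.fnmatch(col, g) for g in SENSITIVE_GLOBS)
            if sensitive and (table, col) not in transforms:
                unprotected.append((table, col))
    return unprotected

# A migration added users.billing_ssn but nobody mapped a transform to it.
issues = drift_check({"users": {"id", "email", "created_at", "billing_ssn"}},
                     transforms=set())
print(issues)  # [('users', 'billing_ssn')]
```

Wiring a check like this into the migration pipeline turns "someone forgot to classify the new column" from a silent leak into a failed build.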
  6. Iterate with data owners and auditors

    • Share Structural’s sensitivity catalog with security, compliance, and key application owners.
    • Refine rules where needed (e.g., distinguishing clinical IDs from internal operational codes).
    • Document rules and mapping to transforms as part of your DPIA or internal data inventory.

Teams that take this approach typically move from ad-hoc, spreadsheet-based inventories to a living, enforceable discovery system embedded in their data pipeline. In customer deployments, that’s translated into things like a 75% faster path to usable test data and materially fewer escaped defects due to more realistic, regulated-safe data in lower environments.

How does sensitive data discovery in Structural support AI training and RAG workflows?

Short Answer: Structural removes PII/PHI from the structured and semi-structured sources feeding your AI pipelines, while Tonic Textual handles unstructured text before it’s ingested into vector stores or used for training.

Details: For AI workloads, you usually have two streams:

  1. Structured sources (features, labels, event logs):

    • Structural runs PII/PHI discovery rules across your tables and views.
    • Fields like user IDs, contact info, medical identifiers, and fine-grained timestamps are classified and transformed.
    • Because Structural preserves referential integrity and statistical properties, your models still see realistic distributions and relationships, but never see real identities.
  2. Unstructured sources (support tickets, notes, emails, PDFs):

    • This is where Tonic Textual comes in. You run a Textual workflow to:
      • Use proprietary NER models to detect sensitive entities in free text (names, addresses, medical details).
      • Apply redaction or reversible tokenization.
      • Optionally synthesize realistic replacements to keep semantic context intact for RAG and LLM training.

Combined, you get:

  • An end-to-end sanitization pipeline for both structured and unstructured data.
  • Continuous PII/PHI discovery that keeps up with schema changes and new document types.
  • AI datasets and knowledge bases that mirror production complexity without embedding live PII/PHI in your models.

Summary

Setting up sensitive data discovery rules for PII/PHI in Tonic Structural is about turning privacy into a first-class engineering workflow. You define what “sensitive” means for your business, encode that as rules on schema, metadata, and patterns, and let Structural enforce it every time it generates de-identified or synthetic datasets.

Instead of chasing shadow copies of production and hoping someone remembers which columns are dangerous, you get a system that:

  • Automatically finds PII/PHI across your structured landscape
  • Keeps up with schema changes and new features
  • Routes sensitive fields through transforms that preserve utility—cross-table consistency, statistical properties, and application behavior

That’s how teams accelerate releases, unblock AI initiatives, and respect data privacy as a human right, without slowing engineers down.

Next Step

Get Started