
How do I connect Tonic Structural to our Postgres/MySQL database and run a first de-identified dataset?
Shipping your first de-identified dataset from Postgres or MySQL shouldn’t take days of schema archaeology and hand-rolled masking scripts. With Tonic Structural, you connect directly to your production database, auto-discover sensitive columns and relationships, choose transforms that preserve behavior, and write high-fidelity data into a safe target schema or database—usually in under an afternoon.
Quick Answer: Connect Tonic Structural to your Postgres/MySQL via a read-access source connection, define a secure target (new schema or database), let Structural profile your schema and apply privacy transforms, then run a generation job to produce a de-identified, production-shaped dataset you can hydrate into dev or staging.
The Quick Overview
- What It Is: A structured and semi-structured data de-identification and synthesis engine that connects to your existing Postgres/MySQL, discovers sensitive data, and generates high-fidelity test datasets that maintain referential integrity.
- Who It Is For: Engineering, QA, data, and platform teams that need production-like behavior in lower environments without copying raw production data.
- Core Problem Solved: Eliminates unsafe production clones and brittle masking scripts by automatically transforming live schemas into safe, realistic datasets that keep foreign keys and statistical properties intact.
How It Works
At a high level, Tonic Structural sits between your production database and your lower environments. It uses native connectors to profile your Postgres/MySQL, identify sensitive fields, and map cross-table relationships. You then configure privacy transforms (e.g., deterministic masking, synthesis, format-preserving encryption) and tell Structural where the de-identified data should land—a new schema, a separate database, or a CI/CD artifact. Structural writes the transformed data directly to that target, preserving referential integrity and core distributions so your apps and tests behave like they’re hitting production.
From first connection to first dataset, you can think in three phases:
1. Connect & Profile:
   - Point Structural at your Postgres/MySQL with a least-privilege service account.
   - Run automated profiling to capture schema, statistics, and foreign key relationships.
2. Configure Transforms & Targets:
   - Define a target database or schema for de-identified output.
   - Tag and transform sensitive fields using Tonic’s library of privacy-preserving generators.
3. Generate & Iterate:
   - Run your first de-identified generation job.
   - Validate behavior in dev/staging, then wire it into your regular refresh or CI/CD pipelines.
The rest of this guide walks through those steps in concrete terms so you can go from “we shouldn’t have prod in dev” to “we have a repeatable de-identified pipeline” without getting bogged down in custom scripts.
Step 1: Prepare Access to Postgres/MySQL
1.1 Create a database user with least-privilege access
For the source (your production or prod-like database):
- Access type:
- Read-only for initial profiling and pull-based generation.
- Optional limited write permissions if you later choose in-place masking (most teams don’t for lower environments; they generate into a dedicated target instead).
- Scope:
- Restrict to the schemas you intend Structural to read.
- Use IP allowlists / security groups so only your Structural deployment can connect.
Postgres example privileges (source):

```sql
CREATE ROLE tonic_structural_source LOGIN PASSWORD '********';
GRANT CONNECT ON DATABASE your_prod_db TO tonic_structural_source;
GRANT USAGE ON SCHEMA public TO tonic_structural_source;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO tonic_structural_source;
ALTER DEFAULT PRIVILEGES IN SCHEMA public
  GRANT SELECT ON TABLES TO tonic_structural_source;
```
MySQL example privileges (source):

```sql
CREATE USER 'tonic_structural_source'@'%' IDENTIFIED BY '********';
GRANT SELECT ON your_prod_db.* TO 'tonic_structural_source'@'%';
FLUSH PRIVILEGES;
```
For the target (where de-identified data will land):
- Create a separate database or schema (e.g., `test_data`, `staging_masked`).
- Grant `CREATE`/`INSERT`/`UPDATE`/`DELETE` on that schema only.
Postgres example privileges (target):

```sql
CREATE ROLE tonic_structural_target LOGIN PASSWORD '********';
CREATE SCHEMA test_data AUTHORIZATION tonic_structural_target;
GRANT CONNECT ON DATABASE your_nonprod_db TO tonic_structural_target;
GRANT USAGE ON SCHEMA test_data TO tonic_structural_target;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA test_data TO tonic_structural_target;
ALTER DEFAULT PRIVILEGES IN SCHEMA test_data
  GRANT ALL ON TABLES TO tonic_structural_target;
```
MySQL example privileges (target):

```sql
CREATE USER 'tonic_structural_target'@'%' IDENTIFIED BY '********';
CREATE DATABASE test_data;
GRANT ALL PRIVILEGES ON test_data.* TO 'tonic_structural_target'@'%';
FLUSH PRIVILEGES;
```
This separation ensures Structural can’t accidentally mutate production, while still having full control over its own de-identified output.
Step 2: Deploy Tonic Structural
Tonic Structural is built to run where your data lives:
- Deployment options: Docker, Kubernetes, or cloud (AWS, GCP, Azure).
- Connectivity: Connects directly to Postgres/MySQL using native connectors.
- Security posture: Designed for regulated environments; customers commonly run Structural inside their VPC so no sensitive data ever leaves their control.
Typical deployment patterns:
- Single Docker container for a quick PoC against a non-production copy of Postgres/MySQL.
- Kubernetes deployment in your primary cloud for ongoing, scheduled refreshes and CI/CD integration.
- Cloud-hosted by Tonic (where allowed) with connectors back into your databases via VPN/peering and tight firewall rules.
Once Structural is up, you’ll access it via a web UI (and optionally via a REST API / SDK for automation).
Step 3: Connect Structural to Postgres/MySQL
3.1 Add a source connection
In the Tonic Structural UI:
- Navigate to Connections → Add Connection.
- Choose PostgreSQL or MySQL as the source type.
- Provide:
  - Host and port
  - Database name
  - Username/password for `tonic_structural_source`
  - SSL options according to your org’s standards (optional)
- Test the connection to confirm network and credential access.
Structural will use this connection to:
- Scan schema metadata (tables, columns, types).
- Sample values and distributions to understand cardinality and patterns.
- Infer explicit and implicit foreign key relationships for cross-table consistency.
3.2 Add a target connection
Still in Connections:
- Add another connection for PostgreSQL or MySQL as the target.
- Use the `tonic_structural_target` user and the database/schema you created for de-identified data.
- Test and save.
You can point both source and target at the same physical server with different DBs/schemas or separate instances. Structurally, Tonic sees them as distinct “where to read” and “where to write.”
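Before blaming credentials, it helps to confirm the network path itself. The sketch below is a minimal, generic TCP reachability check (not part of Tonic's tooling) you can run from wherever Structural is deployed to verify that firewalls, security groups, and allowlists actually permit the connection:

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout.

    This only proves network reachability; authentication is a separate step
    that Structural's built-in connection test covers.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: check the default Postgres port on a hypothetical source host.
# can_reach("db.internal.example.com", 5432)
```

If this returns `False` while your credentials are known-good, the fix lives in networking (VPC peering, security groups, allowlists), not in Structural's connection settings.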
Step 4: Profile Your Database
This is where Tonic replaces days of manual schema spelunking.
4.1 Create a new project
- Go to Projects → New Project.
- Select your Postgres/MySQL source connection.
- Choose the schemas/tables you want to include (start with a focused subset if you’re just proving it out, e.g., `customers`, `orders`, `payments`).
Structural will:
- Capture full schema metadata (types, constraints, indexes).
- Compute column-level statistics (cardinality, null rates, distributions).
- Map foreign key and inferred relationships to preserve referential integrity.
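Structural's relationship inference is considerably more sophisticated than this, but a toy sketch of the naming-convention heuristic behind "inferred relationships" makes the idea concrete: a column like `user_id` in one table probably points at the `id` column of a `users` table, even when no foreign key constraint is declared. The schema shape and pluralization rule here are illustrative assumptions:

```python
def infer_relationships(schema):
    """Guess implicit foreign keys from column naming conventions.

    schema maps table name -> list of column names.
    Returns (child_table, child_column, parent_table) triples where a column
    named '<noun>_id' matches a table '<noun>s' that has an 'id' column.
    """
    relations = []
    for table, columns in schema.items():
        for col in columns:
            if col.endswith("_id"):
                parent = col[:-3] + "s"  # naive pluralization: user_id -> users
                if parent in schema and parent != table and "id" in schema[parent]:
                    relations.append((table, col, parent))
    return relations
```

Real production schemas break naive rules like this constantly (irregular plurals, composite keys), which is exactly why automated profiling plus a human review pass beats hand-maintained relationship maps.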
4.2 Auto-detect sensitive data
Structural will automatically analyze columns to flag likely PII/PHI:
- Emails, phone numbers, addresses.
- Names, government IDs, payment details.
- Free-text columns with high risk of leakage.
You can:
- Review and confirm or override tags.
- Add custom sensitivity rules tailored to your domain (e.g., internal IDs that may still be re-identifiable via external datasets).
This sensitivity mapping drives what gets masked, synthesized, or otherwise transformed.
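Structural's detection combines NER models and statistics; the sketch below is a deliberately simplified stand-in showing the sampling-plus-pattern idea for a few common PII types. The patterns and threshold are illustrative assumptions, not Tonic's actual rules:

```python
import re

# Hypothetical patterns; real sensitivity scanning goes far beyond regexes.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def classify_column(samples, threshold=0.5):
    """Tag a column as sensitive if most sampled values match a PII pattern.

    samples is a list of string values sampled from one column; returns the
    first matching tag, or None if nothing clears the threshold.
    """
    for tag, pattern in PATTERNS.items():
        hits = sum(1 for s in samples if pattern.search(s))
        if samples and hits / len(samples) >= threshold:
            return tag
    return None
```

The threshold matters: requiring a majority of sampled values to match keeps a single stray email inside a free-text column from mislabeling the whole column, which is also why free-text fields deserve their own, stricter review.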
Step 5: Configure Privacy Transforms
This is the heart of getting “safe but still behaves like prod.”
5.1 Choose transforms per column
For each sensitive column, assign a transform from Structural’s library, such as:
- Deterministic masking: same input → same masked output.
  - Use when application logic relies on equality across tables (e.g., joining on email).
- Format-preserving encryption / hashing:
  - Use when you need stable, opaque identifiers that preserve length/format but not meaning.
- Synthetic data generators:
  - Realistic, NER-informed names, addresses, and phone numbers that follow natural distributions but don’t map to real individuals.
- Randomization / bucketing:
  - For dates, ages, and salaries: preserve ranges and distributions but remove exact values.
Structural applies these transforms while:
- Maintaining cross-table consistency (e.g., the same user ID across `users`, `orders`, `support_tickets`).
- Protecting referential integrity so foreign keys still work and your app doesn’t throw constraint errors.
- Preserving statistical properties critical for realistic performance and regression testing.
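To make the deterministic-masking property concrete: the same input must always produce the same output, so equality joins keep working after transformation. This sketch is not Tonic's implementation, just a minimal keyed-HMAC illustration of the property; the secret and email format are placeholder assumptions:

```python
import hashlib
import hmac

SECRET = b"rotate-me-per-environment"  # hypothetical project-level key

def mask_email(value: str) -> str:
    """Deterministically pseudonymize an email address.

    Identical inputs (case-insensitively) map to identical outputs, so a join
    on email between, say, users and orders still succeeds after masking,
    while the original address is unrecoverable without the key.
    """
    digest = hmac.new(SECRET, value.lower().encode(), hashlib.sha256).hexdigest()[:12]
    return f"user_{digest}@example.com"
```

Keying the hash matters: an unkeyed hash of low-entropy values like emails can be reversed by brute force against a dictionary of candidate addresses, whereas an HMAC with a protected key cannot.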
5.2 Configure subsetting (optional but recommended)
If your production dataset is huge, you can define subsetting rules:
- Row-level filters: e.g., “only customers from last 90 days.”
- Relational subsetting: maintain all related rows across tables when you sample (e.g., keep all orders, payments, and events for a sampled set of users).
This is how some customers have shrunk multi-petabyte datasets down to GB-scale test datasets while keeping relational behavior intact.
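The essence of relational subsetting is that sampling one table forces you to walk its relationships so every foreign key in the subset still resolves. The three-table sketch below is an illustrative toy (in-memory dicts, hypothetical column names), not Structural's engine:

```python
def subset(users, orders, payments, keep_user_ids):
    """Keep a sample of users plus every row that transitively references them.

    Dropping a user without dropping their orders (or keeping an order whose
    payment is gone) would break foreign keys in the target database.
    """
    kept_users = [r for r in users if r["id"] in keep_user_ids]
    kept_orders = [r for r in orders if r["user_id"] in keep_user_ids]
    kept_order_ids = {r["id"] for r in kept_orders}
    kept_payments = [r for r in payments if r["order_id"] in kept_order_ids]
    return kept_users, kept_orders, kept_payments
```

In a real schema the traversal runs over the full relationship graph (including the inferred relationships from profiling), which is why row-level filters alone are not enough for a loadable subset.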
Step 6: Configure the Target Dataset
Now define how the de-identified data will be written.
6.1 Map source to target
In the project settings:
- Select your target connection (Postgres/MySQL).
- Choose whether:
- Structural should create tables in the target schema to mirror source structure, or
- Map to existing tables in a pre-defined schema.
Most first runs use the “let Structural create the schema” path. This keeps things simple and ensures constraints and indexes are compatible with the transformed data.
6.2 Destination settings
Tonic Structural can:
- Write directly into:
- A test schema in an existing DB.
- A dedicated test database for dev/staging.
- Or produce artifacts for CI/CD (e.g., export data that can be baked into container images or seeded via SQL scripts in pipelines).
Per Tonic’s documentation, Structural writes masked data directly to target databases via native connectors (PostgreSQL, MySQL, Snowflake, files) and can also export datasets as container images for seamless use in containerized workflows.
Pick the destination that matches how your teams hydrate lower environments today.
Step 7: Run Your First De-Identified Generation
7.1 Execute the job
In your Structural project:
- Click Generate Data (or equivalent run button).
- Confirm:
- The source connection.
- The target connection and schema.
- Any subsetting or filtering rules.
Structural will:
- Pull data from the source.
- Apply configured transforms in a streaming fashion.
- Write de-identified rows to the target while enforcing referential integrity.
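"Streaming fashion" means rows are read, transformed, and written one at a time rather than loading whole tables into memory. This is a generic sketch of that pipeline shape (the function names and transform-map structure are illustrative, not Structural's API):

```python
def stream_rows(rows):
    """Stand-in for reading from the source; a real reader would use a
    server-side cursor so large tables never fit in memory at once."""
    yield from rows

def transform(row, transforms):
    """Apply a per-column transform map to one row, passing untouched
    columns through unchanged."""
    return {col: transforms.get(col, lambda v: v)(val) for col, val in row.items()}

def generate(rows, transforms, write):
    """Read, transform, and write row by row."""
    for row in stream_rows(rows):
        write(transform(row, transforms))
```

The practical consequence of this shape is that memory stays flat regardless of table size, and throughput is bounded by network and database I/O rather than by the masking logic itself.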
7.2 Validate the output
Once the job completes:
- Point a dev or staging environment at the target Postgres/MySQL.
- Run:
- Core regression test suites.
- Key application flows (login, search, checkout, reporting).
- Verify:
- No real PII/PHI is visible.
- Foreign keys and joins work.
- Query performance and edge cases resemble production.
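Alongside manual spot checks, a cheap automated guard is to scan the target for values that should never survive masking. This sketch is an illustrative leak check, not a Tonic feature; the `known_real_domains` tuple is a hypothetical placeholder for domains that only occur in real customer data:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def leaked_emails(rows, known_real_domains=("corp-customers.com",)):
    """Return (column, value) pairs from de-identified rows that still look
    like real emails from domains that should never appear post-masking."""
    hits = []
    for row in rows:
        for col, val in row.items():
            m = EMAIL.search(str(val))
            if m and m.group().endswith(known_real_domains):
                hits.append((col, m.group()))
    return hits
```

Running a check like this after every generation job turns "no real PII is visible" from a one-time eyeball test into a repeatable assertion you can fail a pipeline on.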
Teams routinely see test data generation go from days/weeks to hours. For example, Patterson boosted test data delivery speed by 75% and increased developer productivity by 25% after moving to this pattern.
Features & Benefits Breakdown
| Core Feature | What It Does | Primary Benefit |
|---|---|---|
| Native Postgres/MySQL connectors | Connects directly to your databases and writes masked data to target schemas | Zero manual exports/imports; de-identified datasets land exactly where dev/staging needs them |
| Automated profiling & relationship mapping | Scans schema, statistics, and foreign keys across tables | Preserves referential integrity so apps and tests behave like they’re running on production |
| Configurable privacy transforms & subsetting | Applies deterministic masking, synthesis, and relational subsetting | Eliminates PII/PHI while keeping realistic distributions and smaller, performant test datasets |
Ideal Use Cases
- Best for hydrating dev and staging with production-like data: Because Structural keeps cross-table consistency and referential integrity intact, your environments behave like prod without shipping real PII/PHI downstream.
- Best for repeatable, compliant CI/CD pipelines: Because you can script or schedule generation jobs and export container-ready datasets, you can rebuild clean environments on demand and keep compliance embedded in the release pipeline rather than as an after-the-fact approval.
Limitations & Considerations
- Network and access constraints: Structural needs network-level access to your Postgres/MySQL instances. In highly locked-down environments, you’ll need coordination with infra/security to set up VPC peering, VPNs, or firewall rules.
- Initial transform design effort: While Structural accelerates profiling, you still need to make deliberate choices about transforms per use case. The first project might take a few working sessions to fine-tune, but afterwards refreshes are automated and repeatable.
Pricing & Plans
Tonic Structural is licensed on an annual basis, typically aligned with:
- Data volume and number of source/target environments.
- Deployment model (Tonic Cloud vs self-hosted).
- Required integrations and enterprise features (SSO/SAML, advanced governance, support).
Common patterns:
- Team Plan: Best for engineering teams needing 1–2 core databases de-identified for dev/staging, with straightforward schemas and a single CI/CD pipeline to integrate.
- Enterprise Plan: Best for platform/data teams responsible for multiple Postgres/MySQL instances (plus other stores), stringent compliance requirements, and integration into complex, multi-environment CI/CD and AI pipelines.
For exact pricing and plan fit, Tonic normally walks through your database footprint, compliance requirements, and refresh frequency.
Frequently Asked Questions
Do we have to expose production directly to Tonic Structural?
Short Answer: No, but it’s often the most efficient pattern; you can also point Structural at a “near-prod” source if that’s your policy.
Details:
Many teams run Structural inside their own VPC and connect it directly to production Postgres/MySQL with a read-only account. That gives you the most accurate profiling and the highest-fidelity test data. If your policies forbid any non-application process touching production, you can:
- Create a secure, internal replica of production and point Structural there.
- Keep the same connection pattern (source → target) while treating the replica as “source of truth for test data” instead of prod itself.
In all cases, you avoid ad-hoc CSV exports and local snapshots that are hard to govern and easy to leak.
How often can we refresh the de-identified dataset?
Short Answer: As often as your infrastructure allows—many teams schedule daily or per-branch refreshes.
Details:
Once your project is configured, refreshes are just new generation runs:
- You can schedule them to run nightly or weekly, keeping dev/staging reasonably close to production patterns.
- You can trigger them via CI/CD so every major release, or even every PR environment, gets a fresh, safe dataset.
- For heavy databases, you may parallelize or subset to keep runtime within your change windows.
Because Structural preserves schema and relationships and writes directly via native connectors, the bottleneck is typically your network and DB throughput, not the masking logic.
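Since the article mentions Structural exposes a REST API for automation, a CI job can trigger a refresh with an ordinary HTTP call. The endpoint path, payload shape, and auth scheme below are illustrative placeholders only; consult your Structural deployment's API documentation for the real ones:

```python
import json
import urllib.request

def build_generation_request(base_url, api_key, workspace_id):
    """Construct an HTTP request that would kick off a generation run.

    Everything Tonic-specific here (path, payload keys, auth header format)
    is a hypothetical sketch, not the documented API.
    """
    url = f"{base_url}/api/generate"  # hypothetical endpoint
    body = json.dumps({"workspaceId": workspace_id}).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Apikey {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# A CI step would then send it, e.g.:
# with urllib.request.urlopen(build_generation_request(...)) as resp:
#     print(resp.status)
```

Wiring this into a nightly schedule or a per-release pipeline stage is what turns the first manual run into the repeatable refresh the rest of this guide describes.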
Summary
Connecting Tonic Structural to your Postgres/MySQL and running a first de-identified dataset boils down to:
- Give Structural secure, least-privilege connections to your source and target.
- Let it profile your schema and relationships so referential integrity is not an afterthought.
- Configure transforms that remove risk without breaking behavior.
- Run a generation job and validate that your dev/staging environment acts like prod—for your tests and applications, not your auditors.
The payoff is a repeatable pipeline: production-shaped data where your engineers need it, without the uncontrolled copies and privacy exposure that come with cloning raw production.