
How do we use Tonic Structural to subset a huge database for local dev without breaking relationships?
Shipping from a massive production database to a small, fast local dev environment is where most teams either break their app—or break their privacy posture. You need a subset that’s small enough to run on a laptop, realistic enough to flush out real bugs, and safe enough that copying it around doesn’t create a compliance nightmare. That’s exactly the workflow Tonic Structural is built to automate.
Quick Answer: Tonic Structural lets you subset huge production databases into small, referentially intact datasets that behave like production in your app—without exposing raw sensitive data. You define the slice and privacy rules; Structural’s patented subsetter pulls the right rows and dependencies while preserving relationships and statistical properties.
The Quick Overview
- What It Is: Tonic Structural is a test data platform for structured and semi-structured data that transforms sensitive production databases into high‑fidelity, privacy-safe subsets for dev, QA, and staging—while preserving referential integrity and realistic behavior.
- Who It Is For: Engineering, QA, and data platform teams that need production-like data for local development and CI/CD pipelines, but can’t safely snapshot full production into lower environments.
- Core Problem Solved: Structural solves the “giant prod DB vs. tiny dev environment” problem by automatically subsetting and de-identifying production data so you can ship faster without breaking foreign keys or leaking PII/PHI.
How It Works
At a high level, you connect Tonic Structural to your production (or prod-adjacent) database, define what “local dev subset” means for your team, and let Structural generate a smaller, consistent dataset that keeps joins, constraints, and application logic intact. Under the hood, it uses a patented graph-based subsetter and privacy-aware transformations to ensure every pulled record brings the minimum necessary dependencies—no orphaned rows, no broken relationships.
- Connect & Profile Your Source: Structural connects to your database (on-prem or cloud) and automatically profiles schemas, relationships, and sensitive fields. It builds an internal graph of tables, foreign keys, and dependency chains so subsetting doesn’t become a fragile manual script.
- Define Subset Scope & Privacy Rules: You specify how small and how representative your local dev dataset should be: by row counts, business rules (e.g., “only the last 30 days”), or seed entities (e.g., specific customers or regions). In the same project you define de-identification, synthesis, or masking rules so PII/PHI is transformed before it ever lands in dev.
- Generate, Refresh, and Reuse Subsets: Structural executes its subset plan, pulling only the required rows and their dependencies while maintaining referential integrity. It applies your transformations, then delivers a production-shaped, smaller dataset you can hydrate into local dev. The same configuration can be re-run on demand or wired into CI/CD to keep environments fresh.
Features & Benefits Breakdown
| Core Feature | What It Does | Primary Benefit |
|---|---|---|
| Patented graph-based subsetting | Analyzes table relationships and foreign keys to pull only the minimal necessary rows and related entities for your subset. | Maintains referential integrity so joins, constraints, and app logic behave like production—even in a tiny local dev copy. |
| Integrated de-identification & synthesis | Applies deterministic masking, format-preserving encryption, and synthetic replacements across related tables. | Removes sensitive data while preserving cross-table consistency and statistical properties, so tests still catch real issues. |
| Reusable subset blueprints & automation | Stores subsetting + transformation rules as reusable configurations, with API/CLI integration for pipelines. | Makes local dev refresh a one-click or automated step in CI/CD, instead of a ticket and a week-long copy job. |
Ideal Use Cases
- Best for local dev on laptops or small containers: Because it can take an 8 PB-scale production system and produce a referentially intact 1 GB subset that still mirrors real-world complexity and edge cases.
- Best for shared dev/staging environments with many teams: Because Structural lets you standardize one “golden subset” configuration that anyone can refresh on demand, reducing coordination overhead and drift across environments.
How to Use Tonic Structural to Subset a Huge Database for Local Dev
Let’s walk through the practical workflow teams use when they ask: “How do we subset our huge production database for local dev without breaking relationships?”
1. Connect Structural to Your Source Database
You start by connecting Tonic Structural to a prod or prod-like database:
- Supports common production engines such as Postgres, MySQL, SQL Server, and Snowflake.
- Deployed as Tonic Cloud or self-hosted, so you can keep data inside your own VPC where required.
- Permissions scoped to the schemas you want to transform.
Once connected, Structural profiles:
- Tables and columns
- Existing foreign keys and inferred relationships
- Sensitive fields (PII/PHI/PCI) via built-in detection
This profiling is what powers the graph view of its subsetter: Structural sees your database as a dependency network, not a flat list of tables.
2. Let Structural Discover Relationships (So You Don’t Have To)
Subsetting is hard because dependencies hide everywhere:
- Explicit foreign keys
- Implicit relationships (IDs reused without constraints)
- Many-to-many junction tables
- Nested dependencies several hops away from a root entity
Structural’s graph-based subsetter traverses these relationships so it can:
- Understand which tables must be included for your app to function.
- Calculate the minimum set of rows that keeps referential integrity.
- Avoid “partial copies” that cause 500 errors when the app joins to missing data.
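To make the idea concrete, here is a minimal sketch of relationship-aware subsetting on a toy in-memory dataset. It is not Structural's actual algorithm; the table names, the `FOREIGN_KEYS` map, and the `subset` function are all assumptions made up for illustration. It shows the two moves a subsetter has to make: pull the rows that hang off your seed, then pull every parent row those rows point to so nothing is orphaned.

```python
from collections import deque

# Toy "database" (hypothetical): table name -> {primary key -> row}.
TABLES = {
    "customers": {1: {"id": 1}, 2: {"id": 2}},
    "products": {10: {"id": 10}, 11: {"id": 11}},
    "orders": {
        100: {"id": 100, "customer_id": 1, "product_id": 10},
        101: {"id": 101, "customer_id": 2, "product_id": 11},
    },
    "payments": {
        1000: {"id": 1000, "order_id": 100},
        1001: {"id": 1001, "order_id": 101},
    },
}

# Foreign keys (hypothetical): (child table, child column) -> parent table.
FOREIGN_KEYS = {
    ("orders", "customer_id"): "customers",
    ("orders", "product_id"): "products",
    ("payments", "order_id"): "orders",
}


def subset(seeds):
    """Return {table: set of primary keys} for a referentially intact subset.

    seeds is an iterable of (table, primary_key) pairs.
    """
    selected = {table: set() for table in TABLES}

    # Phase 1: take the seeds plus everything that hangs off them
    # (children, grandchildren, and so on).
    queue = deque(seeds)
    while queue:
        table, pk = queue.popleft()
        if pk in selected[table]:
            continue
        selected[table].add(pk)
        for (child, column), parent in FOREIGN_KEYS.items():
            if parent == table:
                for child_pk, row in TABLES[child].items():
                    if row[column] == pk:
                        queue.append((child, child_pk))

    # Phase 2: add every parent row a selected row points to, so no
    # selected row is left with a dangling foreign key.
    queue = deque((t, pk) for t, pks in selected.items() for pk in pks)
    while queue:
        table, pk = queue.popleft()
        row = TABLES[table][pk]
        for (child, column), parent in FOREIGN_KEYS.items():
            if child == table:
                parent_pk = row[column]
                if parent_pk not in selected[parent]:
                    selected[parent].add(parent_pk)
                    queue.append((parent, parent_pk))
    return selected


result = subset([("customers", 1)])
```

Parents pulled in during the second phase deliberately do not drag in their other children. Without that restriction, one shared product could pull in every order that ever referenced it, and the subset would grow back toward the full database.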
You can review and adjust this model in the UI, but you don’t maintain it by hand—Structural keeps it up to date as schemas evolve, with schema change alerts so new sensitive columns don’t quietly slip into dev.
3. Define Your Subset Strategy for Local Dev
Next, you define what “local dev-sized” means for your environment. Common patterns:
- Row-based limits: “Give me 5,000 customers plus all their orders, payments, and support tickets.”
- Time-based slices: “Subset the last 30 days of activity across all event tables plus reference data.”
- Rule-based cohorts: “Only include EU users and their data,” or “Only include test merchants tagged ‘sandbox’.”
In Structural, you configure:
- Seed tables/rows – the starting points for the subset graph (e.g., customers, accounts, tenants).
- Selection rules – SQL filters, percentages, or fixed counts for each seed.
- Propagation depth – how far dependencies should be pulled along the graph.
Structural then calculates the full subset plan: given those seeds, which rows in which tables are required to keep relationships intact.
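If it helps to reason about those three knobs before you touch the UI, the sketch below models them in plain Python. It is not Structural's configuration format: `SeedRule`, `seed_query`, and `tables_in_scope` are hypothetical names, the `LIMIT` clause assumes Postgres/MySQL-style SQL, and the filter string is interpolated directly, which is only acceptable for trusted, hand-written config.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SeedRule:
    """Hypothetical description of one seed: a table, a filter, an optional cap."""
    table: str
    where: str = "1=1"
    limit: Optional[int] = None


def seed_query(rule: SeedRule) -> str:
    """Render the SQL that would pick the seed rows for this rule."""
    sql = f"SELECT * FROM {rule.table} WHERE {rule.where}"
    if rule.limit is not None:
        sql += f" LIMIT {rule.limit}"
    return sql


def tables_in_scope(seed_tables, fk_edges, max_depth):
    """Breadth-first walk over the table graph, stopping after max_depth hops.

    fk_edges is a list of (child_table, parent_table) pairs. Returns a dict
    mapping each reachable table to its hop distance from the nearest seed.
    """
    neighbors = {}
    for child, parent in fk_edges:
        neighbors.setdefault(child, set()).add(parent)
        neighbors.setdefault(parent, set()).add(child)

    distance = {table: 0 for table in seed_tables}
    queue = deque(seed_tables)
    while queue:
        table = queue.popleft()
        if distance[table] == max_depth:
            continue  # do not expand past the configured depth
        for other in sorted(neighbors.get(table, ())):
            if other not in distance:
                distance[other] = distance[table] + 1
                queue.append(other)
    return distance


rule = SeedRule(table="customers", where="region = 'EU'", limit=5000)
edges = [("orders", "customers"), ("payments", "orders"), ("refunds", "payments")]
scope = tables_in_scope([rule.table], edges, max_depth=2)
```

With a depth of 2, `refunds` falls outside the plan because it sits three hops from `customers`. A depth cut like this is a sizing decision, so it is worth checking that no table inside the scope still points at a parent table outside it.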
4. Configure Privacy: De-identification, Masking, and Synthesis
Now you address the second half of the problem: you can’t just shrink production—you need to make it safe.
Within the same Structural project, you apply transformations such as:
- Deterministic masking: Replace the same email in multiple tables with the same fake email, so joins and application logic based on identity still work.
- Format-preserving encryption: Encrypt card numbers or IDs while keeping length and format constraints intact.
- Synthetic value generation: Replace names, addresses, or freeform text with synthetic alternatives that look real but don’t map back to actual customers.
- Custom sensitivity rules: Define your own patterns and tags to ensure domain-specific sensitive attributes are always transformed.
Because these transforms are applied after subsetting, you get:
- High-fidelity structure and distributions (mirrors production complexity).
- No raw PII/PHI leaking into developer laptops or shared test DBs.
- Cross-table consistency (the same original identity maps to the same fake identity everywhere).
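The cross-table consistency point is easiest to see in code. The sketch below uses a keyed hash (HMAC-SHA256 from Python's standard library) as a stand-in for deterministic masking; it is not Structural's generator, and `SECRET_KEY` and `mask_email` are hypothetical names. The property it demonstrates is the one that matters: the same input always yields the same fake value, so a join on the masked column still lines up.

```python
import hashlib
import hmac

# Hypothetical demo key. In practice the key would be a managed secret,
# because anyone holding it can recompute the mapping for a guessed email.
SECRET_KEY = b"demo-key-not-for-production"


def mask_email(email: str, key: bytes = SECRET_KEY) -> str:
    """Map an email to a stable fake address under a secret key."""
    # Normalize first so "Alice@Corp.com" and "alice@corp.com " match.
    normalized = email.strip().lower()
    digest = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
    # example.com is reserved for documentation, so these never route anywhere.
    return f"user_{digest[:12]}@example.com"


# The same identity appearing in two different tables masks identically.
users_row = {"id": 1, "email": "Alice@Corp.com"}
tickets_row = {"id": 77, "requester_email": " alice@corp.com"}

masked_user = mask_email(users_row["email"])
masked_ticket = mask_email(tickets_row["requester_email"])
```

Truncating the digest to 12 hex characters keeps the addresses readable at the cost of a small collision probability, which is usually acceptable for a dev-sized subset but worth revisiting for larger ones.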
5. Generate the Subset and Hydrate Local Dev
With subsetting and transformation rules defined, you run the job:
- Structural executes the subset plan against the source.
- It pulls the relevant rows, maintains referential integrity, and applies transformations in one pass.
- It then outputs the subset in the form your dev teams need—often as a separate database instance ready to be hydrated into Docker, local Postgres/MySQL, or a dev/staging cluster.
Key properties of the resulting dataset:
- Smaller footprint: Designed to run comfortably on laptops or small containers.
- Referentially intact: Foreign keys, joins, and constraints still work.
- Statistically realistic: Distributions and patterns are preserved so performance, edge cases, and analytics behave like production.
- Safe to copy: Sensitive data is de-identified, so local snapshots and backups don’t expand your breach surface area with real PII.
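Whatever tool produced the subset, it is cheap to confirm the "referentially intact" property yourself after hydrating. The sketch below uses an in-memory SQLite database as a stand-in for your local dev database and relies on SQLite's built-in `PRAGMA foreign_key_check`; the schema and the `build_demo_db` and `find_orphans` helpers are assumptions for illustration. On Postgres or MySQL you would run an equivalent anti-join query per foreign key instead.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny hydrated 'subset' with one customer and one order."""
    conn = sqlite3.connect(":memory:")
    # Leave enforcement off so the check below, not the insert, catches problems.
    conn.execute("PRAGMA foreign_keys = OFF")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(id)
        );
        INSERT INTO customers VALUES (1, 'masked-a');
        INSERT INTO orders VALUES (100, 1);
        """
    )
    return conn


def find_orphans(conn: sqlite3.Connection):
    """Return one tuple per row whose foreign key points at a missing parent.

    Each tuple is (child table, rowid, parent table, foreign key index).
    """
    return conn.execute("PRAGMA foreign_key_check").fetchall()


conn = build_demo_db()
clean = find_orphans(conn)

# Simulate a bad partial copy: an order whose customer was not pulled.
conn.execute("INSERT INTO orders VALUES (101, 2)")
broken = find_orphans(conn)
```

An empty result means every foreign key in the hydrated copy resolves. A check like this fits naturally as the last step of a refresh script, failing the run before anyone starts developing against a broken copy.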
6. Automate Refreshes via CI/CD and APIs
Once you’ve dialed in a useful subset for local dev, you don’t want to manually rebuild it every time.
Structural treats your configuration as a reusable “subset blueprint”:
- Re-run the same job via UI, CLI, Python SDK, or REST API.
- Integrate with CI/CD (e.g., a nightly job or per-branch environment spin-up).
- Version and evolve your strategy as schemas, test needs, and privacy requirements change.
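As a rough picture of what "wired into CI/CD" can look like, the sketch below builds, without sending, an authenticated POST request that a pipeline step could use to start a refresh job. Everything specific here is an assumption: the URL is a placeholder, the `Bearer` auth scheme is a guess, and `build_refresh_request` is a hypothetical helper. Take the real endpoint, parameters, and header format from Tonic's API documentation.

```python
import urllib.request


def build_refresh_request(job_url: str, api_key: str, auth_scheme: str = "Bearer"):
    """Build (but do not send) a POST request that would start a subset job.

    job_url and auth_scheme are placeholders; substitute the documented
    values for your Structural deployment.
    """
    return urllib.request.Request(
        job_url,
        data=b"",  # empty body; the real API may expect a JSON payload
        method="POST",
        headers={"Authorization": f"{auth_scheme} {api_key}"},
    )


# Placeholder URL on a reserved example domain, not a real endpoint.
request = build_refresh_request(
    "https://structural.example.com/REPLACE-WITH-DOCUMENTED-ENDPOINT",
    api_key="read-this-from-your-ci-secret-store",
)

# In a pipeline step you would then send it, for example:
#   with urllib.request.urlopen(request, timeout=30) as response:
#       response.read()
```

Keeping request construction separate from sending makes the step easy to unit test, and it keeps the API key in the CI secret store rather than in the repository.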
Teams use this to:
- Keep local dev environments close to production reality without manual data pulls.
- Shorten feedback loops by avoiding ticket-based refreshes.
- Ensure every environment stays within privacy and compliance guardrails by default.
Customers see outcomes like:
- 75% faster test data generation.
- 25% uplift in developer productivity because engineers aren’t blocked waiting for usable data.
- Massive reductions in dataset size (e.g., an 8 PB environment subsetted down to ~1 GB) without losing relational behavior.
Limitations & Considerations
- Subset fidelity vs. size tradeoff: If you try to shrink too aggressively (e.g., only a handful of entities in a highly connected graph), some edge-case behaviors may disappear. The fix is to tune your selection rules and seed size until the subset still reflects real-world complexity.
- Schema and relationship quality: Structural can infer a lot, but messy schemas with no foreign keys and inconsistent ID patterns can require more upfront configuration. Plan for an initial pass to define custom relationships and validate the graph, then you benefit from automation going forward.
Pricing & Plans
Tonic’s pricing is designed for teams who treat test data as part of their delivery pipeline, not an afterthought. Plans vary based on:
- Number and size of source systems you connect.
- Deployment model (Tonic Cloud vs. self-hosted).
- Required features (e.g., enterprise SSO/SAML, advanced governance, multi-region).
Typical patterns:
- Growth / Team Plan: Best for fast-moving engineering teams needing reliable, safe subsets for local dev and QA across a handful of core databases.
- Enterprise Plan: Best for larger organizations with multiple regulated data sources, complex CI/CD needs, and strict compliance requirements (SOC 2 Type II, HIPAA, GDPR, AWS Qualified Software).
For precise pricing and plan details, you’ll want to talk directly with Tonic’s team.
Frequently Asked Questions
How is Tonic Structural different from writing our own SQL subsetting scripts?
Short Answer: Structural automates subsetting and de-identification with a graph-based engine that preserves relationships and privacy; SQL scripts are brittle, manual, and hard to scale.
Details: DIY scripts usually start as a one-off export and quickly turn into an unmaintainable tangle:
- Every schema change requires hand edits and testing.
- Hidden dependencies and implicit relationships are easy to miss, leading to broken joins and failing tests.
- Privacy is bolted on with ad-hoc masking, which often breaks formats or introduces inconsistencies across tables.
Tonic Structural centralizes this into a single, repeatable workflow:
- Discovers and maintains a graph of relationships.
- Calculates minimal, referentially intact subsets automatically.
- Applies consistent de-identification and synthesis across all related tables.
- Exposes everything through UI, API, and automation hooks so test data is part of your CI/CD, not a side project.
Can we use Tonic Structural for both local dev and larger staging environments?
Short Answer: Yes, you can use the same Structural configuration to generate differently sized subsets for laptops, shared dev, and staging.
Details: Structural separates:
- Selection logic (what portion of production you want) from
- Transformation logic (how you de-identify and synthesize data).
That means you can:
- Define one set of privacy rules for a domain (e.g., customer data).
- Create multiple subset profiles, like:
  - A small, fast subset for local dev (e.g., 5k customers).
  - A medium subset for shared dev/QA.
  - A larger, near-complete subset for performance or regression testing.
All share the same privacy posture and transformation rules, so you don’t have fragmented approaches across environments.
Summary
Tonic Structural gives you a practical way to subset a huge production database for local dev without breaking relationships or violating privacy. It uses a graph-based engine to preserve referential integrity, built-in de-identification and synthesis to keep data safe, and reusable blueprints so refreshing dev data is an automated step in your pipeline—not a manual fire drill.
Instead of choosing between unsafe production clones and useless dummy data, you get small, realistic, production-shaped datasets that unlock developer velocity and keep governance intact.