We have a multi-terabyte database—how can we create a smaller dataset for local dev that still behaves like prod?

Most engineering teams eventually hit the same wall: the database is multiple terabytes, local dev and CI are choking, but no one wants to test against fake toy data that bears no resemblance to production. You need a smaller dataset that still behaves like prod—same schemas, same relationships, same edge cases—without dragging a multi-TB anchor into every environment.

Quick Answer: Use relationship-aware subsetting and de-identification to carve out a “relationship-closed” slice of your database, starting from key entities like customers or tenants, so you preserve referential integrity, realistic distributions, and critical edge cases in a fraction of the footprint. Tonic Structural applies this approach to multi-terabyte databases and de-identifies sensitive data automatically, so you’re not copying raw production onto every laptop and into every CI job.

The Quick Overview

  • What It Is: A way to generate smaller, production-behaving datasets from multi-terabyte databases using relationship-aware subsetting, de-identification, and synthesis.
  • Who It Is For: Engineering, QA, and data teams who need fast local dev and CI against realistic data, but can’t keep dragging full production copies into lower environments.
  • Core Problem Solved: You get test and dev environments that mirror production behavior—joins, edge cases, workloads—without the cost, latency, and privacy risk of cloning the full multi-terabyte database.

Why naïve approaches break down

Before we get into how to fix this, it’s worth naming the failure modes you’ve probably already seen:

  • Random sampling drops critical relationships. Pulling a 1% sample from each table independently is fast, but it destroys referential integrity. Foreign keys point to missing rows, joins return empty sets, and your app logic behaves nothing like prod.
  • Hand-rolled scripts don’t scale. Writing custom SQL to chase dependencies works for a few tables. With hundreds of tables and complex dependency chains, you end up with brittle scripts that break every time the schema changes.
  • Full prod copies are unsustainable. Copying multi-terabyte snapshots into every dev and QA environment is slow, expensive, and risky. Storage costs balloon, refreshes take days, and every environment becomes a breach surface.
  • Overzealous masking ruins utility. Manual “scrub everything” masking can strip formats, break login flows, and flatten distributions so much that performance and edge-case bugs never show up until production.

The target state is clear: a smaller dataset that’s fast to spin up, cheap to run, safe to share—and still behaves like production.
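To make the random-sampling failure mode concrete, here is a minimal sketch in plain Python. The `customers` and `orders` tables, their sizes, and the helper names are hypothetical toy values chosen for illustration; they are not taken from any real schema.

```python
import random


def sample_independently(customers, orders, fraction, seed=0):
    """Naively sample each table on its own, ignoring foreign keys."""
    rng = random.Random(seed)
    kept_customers = [c for c in customers if rng.random() < fraction]
    kept_orders = [o for o in orders if rng.random() < fraction]
    return kept_customers, kept_orders


def find_orphans(customers, orders):
    """Return orders whose customer_id has no matching customer row."""
    known_ids = {c["id"] for c in customers}
    return [o for o in orders if o["customer_id"] not in known_ids]


# Hypothetical toy data: 1,000 customers with 5 orders each.
customers = [{"id": i} for i in range(1000)]
orders = [{"id": n, "customer_id": n % 1000} for n in range(5000)]

kept_customers, kept_orders = sample_independently(customers, orders, fraction=0.01)
orphans = find_orphans(kept_customers, kept_orders)
print(f"{len(orphans)} of {len(kept_orders)} sampled orders are orphans")
```

Because each table is sampled without looking at the other, a sampled order only keeps its parent if that customer also happened to be sampled, which at a 1% rate is rare.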

How it works: relationship-aware subsetting for local dev

The core idea is simple: don’t think in terms of “rows per table,” think in terms of root entities and their full relationship neighborhoods.

Instead of taking a blind 1% from every table, you:

  1. Pick your roots
    Start from the entities that define your workflows: customers, accounts, tenants, merchants, etc. These are the rows that anchor real user journeys.

  2. Pull a relationship-closed slice
    From those roots, automatically traverse your foreign-key graph to pull in all dependent rows: orders, payments, tickets, events, sessions, addresses, etc. The subset is relationship-closed: foreign keys resolve, joins work, and you don’t get orphan rows.

  3. Preserve distributions and edge cases
    Within that slice, you ensure critical properties match production:

    • Distribution of order sizes and types
    • Mix of active/inactive customers
    • Rare error codes or payment statuses
    • Long-tail behaviors that often cause bugs
  4. De-identify sensitive data on the way out
    As you subset, you also transform PII/PHI and other sensitive fields:

    • Deterministic masking or format-preserving encryption for identifiers
    • Synthetic replacement for high-risk attributes (names, emails, addresses)
    • Rules that keep formats, uniqueness, and constraints intact
  5. Validate referential integrity and shape
    Finally, you run integrity and shape checks: foreign key constraints, join cardinalities, and assertions like “no orphan rows,” so the dataset behaves like prod when your test suite hits it.
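Steps 1, 2, and 5 above can be sketched with Python's built-in `sqlite3` module. This is an illustration of the general idea, not Tonic Structural's algorithm. The schema, the `EDGES` list, and the helpers `select_subset` and `copy_subset` are hypothetical, and the sketch assumes single-column integer keys and a schema where every foreign key points back toward the root table. A real subsetter also has to pull in upstream lookup rows and handle cycles and composite keys.

```python
import sqlite3

# Hypothetical schema: customers <- orders <- payments.
SCHEMA = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY,
                     customer_id INTEGER REFERENCES customers(id));
CREATE TABLE payments (id INTEGER PRIMARY KEY,
                       order_id INTEGER REFERENCES orders(id));
"""

# (child_table, fk_column, parent_table), listed parent-first so one pass suffices.
EDGES = [("orders", "customer_id", "customers"),
         ("payments", "order_id", "orders")]


def select_subset(conn, root_table, root_ids, edges):
    """Walk foreign keys downward from the roots; return {table: set of ids}."""
    selected = {root_table: set(root_ids)}
    for child, fk, parent in edges:
        parent_ids = sorted(selected.get(parent, set()))
        child_ids = selected.setdefault(child, set())
        if parent_ids:
            # Table and column names come from the trusted EDGES list above.
            marks = ",".join("?" * len(parent_ids))
            rows = conn.execute(
                f"SELECT id FROM {child} WHERE {fk} IN ({marks})", parent_ids)
            child_ids.update(row[0] for row in rows)
    return selected


def copy_subset(src, dst, selected):
    """Copy the selected rows into a target database with the same schema."""
    for table, ids in selected.items():
        for row_id in sorted(ids):
            row = src.execute(
                f"SELECT * FROM {table} WHERE id = ?", (row_id,)).fetchone()
            marks = ",".join("?" * len(row))
            dst.execute(f"INSERT INTO {table} VALUES ({marks})", row)


src = sqlite3.connect(":memory:")
src.executescript(SCHEMA)
src.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "US"), (2, "EU"), (3, "US")])
src.executemany("INSERT INTO orders VALUES (?, ?)",
                [(10, 1), (11, 1), (12, 2), (13, 3)])
src.executemany("INSERT INTO payments VALUES (?, ?)",
                [(100, 10), (101, 12), (102, 13)])

# Step 1: pick roots. Step 2: pull their dependents.
selected = select_subset(src, "customers", [1], EDGES)

dst = sqlite3.connect(":memory:")
dst.executescript(SCHEMA)
copy_subset(src, dst, selected)

# Step 5: an empty result means no foreign key points at a missing row.
violations = dst.execute("PRAGMA foreign_key_check").fetchall()
print(selected, violations)
```

The `IN (...)` query with one placeholder per parent id is fine for a toy example; at production scale you would more likely stage the selected ids in temporary tables and join against them.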

This is the workflow Tonic Structural is built around.

Step-by-step with Tonic Structural

  1. Profile and map your database

    • Connect Structural to your source database (Postgres, MySQL, SQL Server, Snowflake, etc.).
    • It automatically detects schemas, foreign keys, and sensitivity patterns (emails, SSNs, phone numbers, etc.).
    • You get a graph view of your database so you can see how tables are actually connected.
  2. Define subsetting rules using roots

    • Choose “root” tables like customers, tenants, or accounts.
    • Filter them (e.g., “US customers only,” “top 100 tenants by activity,” “last 30 days of data”).
    • Structural’s patented subsetter walks the dependency graph from those roots and selects all related rows across downstream tables, preserving referential integrity.
  3. Apply de-identification and synthesis

    • Configure transformations per column:
      • Deterministic masking for IDs so the same entity maps consistently across tables and refreshes.
      • Format-preserving encryption where downstream systems validate shape and check digits.
      • Synthetic data generators for names, emails, addresses, and free-text fields that need privacy and realism.
    • Structural preserves cross-table consistency and statistical properties so joins still behave, and distributions still look like production.
  4. Generate and hydrate target environments

    • Execute the pipeline on a schedule or on-demand.
    • Hydrate:
      • Local dev databases (Dockerized DBs, developer machines)
      • Shared QA/staging environments
      • Ephemeral CI test databases spun up per branch or per pipeline run
    • Structural pushes data directly into your target DB or exports as SQL/CSV where needed.
  5. Monitor and adapt

    • Schema change alerts surface new columns or tables so sensitive data doesn’t “slip in through the side.”
    • You adjust subsetting rules as new workflows, features, or tenants become important to test.
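The idea behind schema change alerts in step 5 can be approximated by diffing the current column inventory against the columns someone has already reviewed. The sketch below is a generic illustration and does not describe how Structural implements its alerts; the `unreviewed_columns` helper and the table and column names are hypothetical.

```python
def unreviewed_columns(current_schema, reviewed_schema):
    """Return sorted (table, column) pairs that exist now but were never reviewed."""
    flagged = []
    for table, columns in current_schema.items():
        known = reviewed_schema.get(table, set())
        flagged.extend((table, column) for column in columns - known)
    return sorted(flagged)


# Hypothetical inventories: {table: set of column names}.
reviewed = {"customers": {"id", "email", "region"}}
current = {
    "customers": {"id", "email", "region", "phone"},  # new column
    "support_notes": {"id", "customer_id", "body"},   # new table
}

new_columns = unreviewed_columns(current, reviewed)
for table, column in new_columns:
    print(f"needs review before next refresh: {table}.{column}")
```

Treating every unreviewed column as a blocker until someone classifies it is the conservative default: a new `phone` column then cannot reach a lower environment untransformed just because nobody noticed it.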

Features & Benefits Breakdown

| Core Feature | What It Does | Primary Benefit |
| --- | --- | --- |
| Relationship-aware subsetting | Builds “relationship-closed” slices around root entities and their dependents | Local datasets that keep foreign keys and joins working like production. |
| Cross-table consistent transforms | Applies deterministic masking, FPE, and synthesis while preserving relationships | Privacy-safe data that still passes application and validation logic. |
| Schema change alerts | Detects and flags new or modified columns that may carry sensitive data | Prevents silent data leakage into lower environments over time. |

Ideal use cases

  • Best for local development against realistic data: Each engineer gets a small, realistic dataset that mirrors production behavior, without hauling multi-terabyte dumps onto laptops or breaking privacy rules.
  • Best for faster, reliable CI/CD testing: Subsets stay referentially intact and statistically representative, so regression tests catch real-world issues without the runtime and storage penalty of full prod copies.

Limitations & considerations

  • You need clear root entities: Subsetting works best when you can anchor on customers/accounts/tenants. If your schema has weak or missing foreign keys, you’ll need to define relationships or fix them as part of the onboarding.
  • Not all analytics workloads fit tiny subsets: For global analytics (e.g., full-funnel conversion, rare anomaly detection), you may still need larger environments or fully synthetic data generated via Tonic Fabricate instead of a subset.

Pricing & plans

Tonic’s pricing is tiered by deployment model, data volume, and product mix (Structural, Fabricate, Textual), with enterprise support for regulated teams.

  • Team / Mid-Market: Best for engineering orgs that need safer local dev and staging data for a handful of critical systems, with predictable schemas and moderate refresh cadence.
  • Enterprise: Best for large, regulated organizations with multiple multi-terabyte databases, complex data estates, and requirements around self-hosting, SSO/SAML, SOC 2/HIPAA/GDPR alignment, and deep CI/CD integration.

(Details are tailored per account. Tonic reports outcomes such as 20x faster regression testing and 75% faster test data delivery for teams that replace manual workflows.)

Frequently asked questions

How small can we get our multi-terabyte database while keeping it useful?

Short Answer: Many teams reduce multi-terabyte sources down to gigabytes—enough for laptops and CI—while still preserving end-to-end workflows.

Details: The size reduction depends on how many root entities you include and how “wide” their dependency graph is. For example, one customer reduced an 8 PB environment down to a ~1 GB dataset by:

  • Selecting a small set of representative tenants as roots.
  • Limiting to recent activity windows for time-series data (e.g., last 30–60 days).
  • Excluding tables irrelevant to application behavior (e.g., raw telemetry backups).

Because the subset is relationship-closed, that ~1 GB still behaves like production when the app and tests run.
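One way to pick a “representative” set of root tenants is stratified sampling: group tenants by an attribute you care about and take a few from every group, so rare segments are not lost. The sketch below is a minimal illustration; the `plan` attribute, the tenant counts, and the `stratified_roots` helper are hypothetical.

```python
import random
from collections import defaultdict


def stratified_roots(tenants, key, per_stratum, seed=0):
    """Pick up to `per_stratum` tenants from every group so rare groups survive."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for tenant in tenants:
        groups[tenant[key]].append(tenant)
    chosen = []
    for name in sorted(groups):
        members = groups[name]
        chosen.extend(rng.sample(members, min(per_stratum, len(members))))
    return chosen


# Hypothetical tenants: 98 on a common plan, 2 on a rare legacy plan.
tenants = [{"id": i, "plan": "standard"} for i in range(98)]
tenants += [{"id": 98 + i, "plan": "enterprise_legacy"} for i in range(2)]

roots = stratified_roots(tenants, key="plan", per_stratum=2)
print(sorted(t["plan"] for t in roots))
```

Taking a fixed number per group deliberately over-represents rare groups, which suits edge-case coverage. If the goal is to match production proportions instead, allocate the sample size per group in proportion to group size and set a minimum of one.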

How do we ensure we’re not leaking PII while doing all this?

Short Answer: You apply de-identification and synthesis as part of the subsetting pipeline, not as an afterthought.

Details: In Tonic Structural, every subset run is also a transformation run:

  • Columns tagged as sensitive (via profiling + your custom rules) are transformed with deterministic masking, format-preserving encryption, or synthetic generators.
  • Identifiers are transformed consistently across tables and across refreshes, so your fixtures and test expectations remain stable.
  • NER-powered pipelines in Tonic Textual handle unstructured fields (support tickets, notes) when those are part of your workflows.

The result: your local dev and QA environments see realistic, production-shaped data, but no real customer identities.
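Consistent transformation of identifiers across tables and refreshes can be illustrated with a keyed hash. The sketch below uses HMAC-SHA256 from the Python standard library as a stand-in. It is not Structural's implementation and it is not format-preserving encryption; the `mask_id` and `mask_email` helpers, the secret, and the `example.com` output domain are assumptions made for illustration.

```python
import hashlib
import hmac


def mask_id(value, secret, length=16):
    """Map an identifier to a stable pseudonym using HMAC-SHA256."""
    digest = hmac.new(secret, str(value).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


def mask_email(email, secret):
    """Replace an email with a synthetic one that stays stable per input."""
    token = mask_id(email.lower(), secret, length=12)
    return f"user_{token}@example.com"


# Hypothetical secret; in practice it would live in a secrets manager.
SECRET = b"rotate-me-outside-version-control"

# The same customer id masks to the same value wherever it appears,
# so joins between masked tables still line up.
customers_row = {"id": mask_id(4211, SECRET),
                 "email": mask_email("Ada@Corp.test", SECRET)}
orders_row = {"customer_id": mask_id(4211, SECRET)}
print(customers_row["id"] == orders_row["customer_id"])
```

Keeping the secret fixed is what makes outputs stable across refreshes; rotating it changes every pseudonym at once. Truncating the digest keeps values short but does not guarantee uniqueness, so a column with a unique constraint would need a collision check or a longer token.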

Summary

If you’re sitting on a multi-terabyte database, the choice is not “full prod clone or fake toy dataset.” The scalable path is relationship-aware subsetting plus de-identification: start from root entities, pull a relationship-closed slice, preserve statistical properties, and transform sensitive data on the way out.

Tonic Structural turns that workflow into a repeatable, observable pipeline. You get smaller, cheaper, faster environments that still behave like production, while shrinking your breach surface and keeping compliance teams onside.
