Structify API: how do I create a dataset schema, run a document processing job, and pull results programmatically?

Most teams hit the same wall with Structify-style workflows: they know Structify can rip through PDFs and decks, but they want the whole thing automated—define a schema once, send documents in, and pull structured results out via the Structify API without ever touching the UI. This guide walks through that end‑to‑end flow: how to create a dataset schema, run a document processing job, and fetch the output programmatically.

Quick Answer: With the Structify API, you 1) define a dataset schema that describes the structured fields you want, 2) submit one or more documents to a processing job that uses that schema, and 3) poll the job status and retrieve normalized results as JSON or push them into downstream tools. You set it up once, then keep sending documents through the same pipeline without manual work or SQL.

Why This Matters

If you’re still copying tables out of PDFs or patching together ad‑hoc scripts to parse contracts, you’re wasting the exact time you should be using to answer revenue questions—like “Which contracts have auto‑renew?” or “Which competitors show up most in lost deals?” A repeatable Structify API workflow means you can:

  • Standardize your document structure once (as a dataset schema).
  • Push anything from contracts to call transcripts through the same pipeline.
  • Pull analysis‑ready data directly into your CRM, data warehouse, or internal tools.

Instead of treating every new document set as a one‑off project, you get a system: connect → process → sync to your revenue stack automatically.

Key Benefits:

  • No more manual extraction: Replace copy‑paste and fragile scripts with a stable, schema‑driven pipeline.
  • Consistent, analysis‑ready data: Every document type maps to a well‑defined dataset you can trust across teams.
  • Faster revenue answers: Go from “we just got 200 new contracts” to “here’s the risk by renewal date and discount” in hours, not weeks.

Core Concepts & Key Points

  • Dataset schema: A structured definition of the fields, types, and relationships that describe how documents should be turned into rows and columns. Why it matters: it turns messy PDFs, decks, and transcripts into consistent tables you can join with CRM, billing, and marketing data.
  • Document processing job: An API‑driven task that takes one or more documents, applies a schema, and outputs normalized structured data. Why it matters: this is how you automate extraction at scale—no UI, no manual uploads, just programmatic jobs.
  • Programmatic results access: Fetching processed output (JSON, CSV, or streaming) via API to push into tools like Snowflake, HubSpot, or internal services. Why it matters: it closes the loop so Structify becomes part of your data pipeline, not a one‑off reporting layer.

How It Works (Step‑by‑Step)

At a high level, the Structify API flow looks like this:

  1. Model the data you want (create a dataset schema).
  2. Send documents to process (run a document processing job).
  3. Consume the output (pull results programmatically and sync downstream).

Below is a conceptual walkthrough with pseudo‑endpoints and payloads; adapt to your actual Structify API client or language of choice.


1. Create a Dataset Schema

First, decide what you actually need out of your documents. For example, if you’re processing sales contracts, you might want:

  • Contract ID
  • Customer name
  • Start and end dates
  • Renewal terms (auto‑renew, notice period)
  • Total contract value
  • Discount percentage
  • Competitors mentioned
  • Product SKUs

In Structify terms, that becomes a dataset schema: a blueprint that tells the platform how to structure extracted data.

Example: Create a contract dataset schema

POST /v1/datasets
Content-Type: application/json
Authorization: Bearer YOUR_API_KEY

{
  "name": "contract_terms",
  "description": "Structured fields extracted from customer contracts",
  "fields": [
    { "name": "contract_id", "type": "string", "required": true },
    { "name": "customer_name", "type": "string" },
    { "name": "effective_date", "type": "date" },
    { "name": "end_date", "type": "date" },
    { "name": "auto_renew", "type": "boolean" },
    { "name": "notice_period_days", "type": "integer" },
    { "name": "total_contract_value", "type": "number" },
    { "name": "currency", "type": "string" },
    { "name": "discount_percent", "type": "number" },
    { "name": "competitors_mentioned", "type": "array<string>" }
  ],
  "primary_key": ["contract_id"]
}

The API will respond with a dataset identifier:

{
  "dataset_id": "ds_1234567890",
  "name": "contract_terms",
  "status": "active"
}

You’ll reuse this dataset_id for all future document processing jobs that should conform to this schema.

Practical tips when designing schemas:

  • Think in joins, not documents. Choose fields that will join cleanly with CRM (e.g., account_id, opportunity_id) and billing (e.g., invoice_id).
  • Model arrays for multi‑value fields. Competitors, product SKUs, and signatories often belong as arrays, not comma‑separated strings.
  • Keep types strict. Using the right date, number, and boolean types up front saves you from endless downstream casting in Snowflake or dbt.

2. Run a Document Processing Job

Once you’ve defined the dataset, you can send documents to be processed. Structify is built to handle real‑world formats—PDFs, spreadsheets, and presentations—so you’re not limited to “clean” text.

Under the hood, Structify:

  • Extracts tables, text, numbers, and charts from your documents.
  • Normalizes and deduplicates where necessary.
  • Maps the content into your dataset schema, field by field.

You have two common integration patterns:

  1. Direct file upload with the job request.
  2. Pre‑upload file(s) to storage, then reference by URL or file ID.

Example: Submit a document processing job

POST /v1/jobs/document-processing
Content-Type: application/json
Authorization: Bearer YOUR_API_KEY

{
  "dataset_id": "ds_1234567890",
  "job_name": "q2_contract_batch",
  "documents": [
    {
      "document_id": "doc_001",
      "source_type": "url",
      "source": "https://your-bucket.s3.amazonaws.com/contracts/acme_contract_q2.pdf"
    },
    {
      "document_id": "doc_002",
      "source_type": "url",
      "source": "https://your-bucket.s3.amazonaws.com/contracts/zenith_msa.pdf"
    }
  ],
  "options": {
    "priority": "normal",
    "deduplicate_entities": true,
    "extract_tables": true,
    "extract_text": true
  }
}

Typical response:

{
  "job_id": "job_987654321",
  "status": "queued",
  "dataset_id": "ds_1234567890",
  "submitted_at": "2026-04-01T10:15:00Z"
}
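For larger batches, building that documents array by hand gets tedious. A small helper can generate one entry per source URL; the `doc_001`-style IDs are just our own naming convention here, and the option flags mirror the request above:

```python
def job_payload(dataset_id: str, job_name: str, urls: list[str]) -> dict:
    """Build a document-processing job request with one entry per source URL."""
    return {
        "dataset_id": dataset_id,
        "job_name": job_name,
        "documents": [
            # Stable, zero-padded IDs make per-document errors easy to trace later.
            {"document_id": f"doc_{i:03d}", "source_type": "url", "source": url}
            for i, url in enumerate(urls, start=1)
        ],
        "options": {
            "priority": "normal",
            "deduplicate_entities": True,
            "extract_tables": True,
            "extract_text": True,
        },
    }
```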

Poll job status

Document processing is asynchronous. For large batches, Structify might be OCR’ing hundreds of pages and extracting multiple tables per file. Poll the job until it completes:

GET /v1/jobs/document-processing/job_987654321
Authorization: Bearer YOUR_API_KEY

Example response:

{
  "job_id": "job_987654321",
  "status": "completed",
  "dataset_id": "ds_1234567890",
  "metrics": {
    "documents_total": 2,
    "documents_succeeded": 2,
    "documents_failed": 0,
    "rows_extracted": 14
  },
  "completed_at": "2026-04-01T10:17:42Z"
}

If the status comes back as failed or partial, use the per‑document error details to decide whether to retry, fix the source files, or adjust the schema.
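A polling loop is easy to get wrong (no timeout, hammering the API). One sketch that stays agnostic about the HTTP layer: `fetch_status` is any callable you supply that returns the job JSON for a given ID (e.g. a wrapper around the GET request above), and the terminal-state names follow the statuses described in this guide:

```python
import time

TERMINAL_STATES = {"completed", "failed", "partial"}

def is_terminal(status: str) -> bool:
    """True once the job will no longer change state."""
    return status in TERMINAL_STATES

def poll_job(fetch_status, job_id: str,
             interval: float = 5.0, timeout: float = 600.0) -> dict:
    """Poll until the job reaches a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status(job_id)
        if is_terminal(job["status"]):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['status']} after {timeout}s")
        time.sleep(interval)
```

Keeping the fetch callable injectable also makes the loop trivially testable with a fake status sequence.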


3. Pull Results Programmatically

Once the job completes, you can fetch the processed data. This is the point where Structify stops being “a document tool” and becomes part of your core revenue pipeline—because you can now push this structured dataset into:

  • Your data warehouse (Snowflake, BigQuery, Redshift).
  • Your CRM (Salesforce, HubSpot, etc.).
  • Your analytics layer or internal apps.

Example: Retrieve job results

GET /v1/jobs/document-processing/job_987654321/results
Authorization: Bearer YOUR_API_KEY
Accept: application/json

Example results payload:

{
  "job_id": "job_987654321",
  "dataset_id": "ds_1234567890",
  "results": [
    {
      "document_id": "doc_001",
      "rows": [
        {
          "contract_id": "ACME-Q2-2026",
          "customer_name": "Acme Corporation",
          "effective_date": "2026-04-01",
          "end_date": "2027-03-31",
          "auto_renew": true,
          "notice_period_days": 60,
          "total_contract_value": 250000,
          "currency": "USD",
          "discount_percent": 15.0,
          "competitors_mentioned": ["Contoso", "Globex"]
        }
      ]
    },
    {
      "document_id": "doc_002",
      "rows": [
        {
          "contract_id": "ZENITH-MSA-2026",
          "customer_name": "Zenith Analytics",
          "effective_date": "2026-02-15",
          "end_date": "2028-02-14",
          "auto_renew": false,
          "notice_period_days": 0,
          "total_contract_value": 400000,
          "currency": "USD",
          "discount_percent": 10.0,
          "competitors_mentioned": []
        }
      ]
    }
  ]
}

Many teams then:

  • Stream this into a warehouse via their ingestion tool of choice (or a Structify connector).
  • Join it with CRM pipeline data to answer questions like “Which competitor is most associated with lost deals above $100K?”
  • Feed it into Slack workflows so execs can ask “How many contracts over $200K are up for renewal in the next 90 days?” and get an answer, not a ticket.
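As a toy illustration of that renewal question, here is a filter over already-extracted rows, written in pure Python; in practice you would run the equivalent as SQL in your warehouse. The threshold and window defaults are ours, and the field names come from the contract_terms schema above:

```python
from datetime import date, timedelta

def renewals_at_risk(rows: list[dict], today: date,
                     window_days: int = 90, min_value: float = 200_000) -> list[dict]:
    """Contracts above the value threshold that end within the window."""
    cutoff = today + timedelta(days=window_days)
    return [
        r for r in rows
        if r["total_contract_value"] >= min_value
        and today <= date.fromisoformat(r["end_date"]) <= cutoff
    ]
```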

If you prefer a flat table view, you can query the dataset directly:

GET /v1/datasets/ds_1234567890/rows?limit=1000
Authorization: Bearer YOUR_API_KEY
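Before loading into a warehouse, most ingestion tools want one flat list of rows rather than the nested per-document results payload. A small flattener (our own helper; it assumes the results shape shown above and tags each row with its source document for lineage):

```python
def flatten_results(payload: dict) -> list[dict]:
    """Turn the nested job-results payload into a flat list of row dicts."""
    rows = []
    for doc in payload.get("results", []):
        for row in doc.get("rows", []):
            # Carry the source document ID so every row stays traceable.
            rows.append({"document_id": doc["document_id"], **row})
    return rows
```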

Common Mistakes to Avoid

  • Treating each document type as a new schema:
    Instead of spinning up a new dataset for every minor variation, design flexible schemas with optional fields and arrays. This dramatically cuts maintenance and keeps analysis unified.

  • Skipping entity keys and relationships:
    If you don’t include IDs (account, opportunity, invoice, ticket) in your schema, you’ll end up with isolated tables that don’t join back to Salesforce, HubSpot, or Snowflake. Always plan for joins from day one.

  • Overloading one “catch‑all” dataset:
    The opposite problem: stuffing contracts, sales decks, survey PDFs, and call transcripts into a single dataset. Use a few well‑targeted datasets (e.g., contract_terms, competitive_mentions, call_outcomes) so your downstream models stay clean.


Real-World Example

Imagine you’re the RevOps lead at a B2B SaaS company and you’ve just acquired another vendor. You inherit:

  • 1,200 legacy contracts as PDFs and scanned images.
  • A pile of renewal schedules in Excel.
  • Slides from old pricing decks scattered across SharePoint.

Leadership asks two questions:

  1. “Which acquired customers are at risk because of aggressive discounts or short terms?”
  2. “Where did we discount heavily against specific competitors?”

You wire Structify into your stack as follows:

  1. Create two dataset schemas:

    • contract_terms for the core legal/financial details.
    • competitive_signals for any mention of competitors, pricing battle cards, and concessions.
  2. Kick off a document processing job that points Structify at your contract PDFs and pricing decks stored in S3. Structify extracts tables, text, numbers, and charts automatically—no one has to manually tag or pre‑format files.

  3. Pull the results programmatically into your warehouse and join them with:

    • Salesforce opportunities and accounts.
    • Your billing system’s MRR and churn data.

Within a day, you can answer:

  • “Show me all acquired accounts with discounts over 20% whose contracts end in the next 6 months.”
  • “Rank competitors by total ARR where they appear in contract or deck content.”

That’s the kind of cross‑source analysis Structify was built for: bridging PDFs, spreadsheets, presentations, and your structured systems, so you can stop guessing what’s driving (or blocking) revenue.

Pro Tip: Start with one tightly defined use case (e.g., contract renewals or competitive intel in lost‑deal decks), build a clean schema, and wire the API into that workflow. Once that’s stable, reuse the same pattern for other document types instead of designing everything at once.

Summary

To use the Structify API for document processing end‑to‑end, treat it like any other core data pipeline:

  • Create a dataset schema that reflects the structured fields you need for real revenue decisions—not just what’s easy to extract.
  • Run document processing jobs that turn PDFs, spreadsheets, and presentations into normalized rows mapped to that schema.
  • Pull results programmatically and sync them into your warehouse, CRM, and reporting layer so operators can ask questions in plain English (including in Slack) without waiting on manual extraction.

When you do this well, Structify stops being “a clever way to read PDFs” and becomes a foundation for answering the hard questions: why deals win or lose, where pipeline is leaking, and which customers are truly at risk.

Next Step

Get Started