
Structify API: how do I create a dataset schema, run a document processing job, and pull results programmatically?
Most teams hit the same wall the moment they try to operationalize Structify: “This works great in the UI, but how do I wire it into our own systems so we can push files in and pull structured data out automatically?” The answer is a simple flow: define your dataset schema, kick off a document processing job, and fetch results over the Structify API.
Quick Answer: With the Structify API, you (1) define a dataset schema that describes the structured fields you want, (2) start a document processing job that ingests PDFs, spreadsheets, presentations, and more, and (3) pull results programmatically as structured JSON or rows ready for your warehouse. No custom parsers, no spreadsheet gymnastics, and no manual copy-paste.
Why This Matters
If you’re still downloading PDFs from vendors, exporting CSVs from tools, and manually keying data into your CRM or warehouse, you’re burning hours on work that should happen in the background. Structify’s API lets you turn unstructured documents into consistent, analysis-ready data as part of your existing pipelines—so RevOps, GTM, and data teams can answer revenue questions faster.
Instead of treating every new contract template, pricing sheet, or competitor one-pager as a one-off project, you define the schema once and let Structify handle the ugly parts (extracting tables, text, numbers, and charts). Then you consume it like any other clean dataset.
Key Benefits:
- Repeatable, schema-driven extraction: Define exactly which fields you care about and keep them stable as documents change.
- Automated document processing at scale: Turn contracts, call transcripts, decks, and spreadsheets into structured records in minutes, not days.
- Programmatic access to results: Pull structured outputs directly into your data warehouse, CRM, or internal tools without manual exports.
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| Dataset schema | A structured definition of the fields, types, and relationships you want Structify to extract into a dataset. | Keeps your outputs consistent and queryable across thousands of documents and evolving templates. |
| Document processing job | An asynchronous job that takes one or more documents (PDFs, spreadsheets, presentations, transcripts) and transforms them into structured data based on your schema. | Automates the “read, copy, paste, normalize” work your team is currently doing by hand. |
| Programmatic results access | API endpoints that let you fetch processed records, monitor job status, and sync data into your own systems. | Turns Structify into part of your data pipeline instead of a one-off tool someone has to log into. |
How It Works (Step-by-Step)
At a high level, the Structify API follows the same three-step flow Structify uses everywhere: Bring In → Clean/Merge/Analyze → Visualize/Share. For document processing, that looks like:
- Define your dataset schema (what you want out).
- Run a document processing job (what you’re sending in).
- Pull results programmatically (how you consume the output).
Below is an end-to-end example using pseudo-REST calls and JSON payloads. Exact routes and auth details will be in your Structify API docs, but the flow will follow this pattern.
1. Create a Dataset Schema
First, you define the shape of the structured data you want. Think “table definition,” not “AI prompt.”
You might create a contracts dataset with fields like:
- counterparty_name (string)
- start_date (date)
- end_date (date)
- total_contract_value (number)
- auto_renew (boolean)
- product_list (array of strings)
- payment_terms_days (integer)
Example: create a dataset schema
POST /api/v1/datasets
Authorization: Bearer <API_TOKEN>
Content-Type: application/json
{
  "name": "contracts",
  "description": "Structured contract metadata extracted from uploaded PDFs.",
  "fields": [
    {
      "name": "counterparty_name",
      "type": "string",
      "description": "Legal name of the customer or vendor."
    },
    {
      "name": "start_date",
      "type": "date",
      "description": "Contract effective start date."
    },
    {
      "name": "end_date",
      "type": "date",
      "description": "Contract end or renewal date."
    },
    {
      "name": "total_contract_value",
      "type": "number",
      "description": "Total recurring + one-time value in contract currency."
    },
    {
      "name": "auto_renew",
      "type": "boolean",
      "description": "Whether the contract auto-renews by default."
    },
    {
      "name": "product_list",
      "type": "array<string>",
      "description": "List of SKUs or product names included in the contract."
    },
    {
      "name": "payment_terms_days",
      "type": "integer",
      "description": "Days between invoice and payment due date."
    }
  ]
}
Expected response
{
  "id": "ds_123456789",
  "name": "contracts",
  "status": "active",
  "created_at": "2026-04-12T10:15:00Z"
}
You’ll use ds_123456789 as the dataset ID when creating jobs and fetching results.
Structify’s semantic layer keeps this schema aligned with your broader “business wiki,” so you’re not reinventing definitions every time. That means fewer broken dashboards and fewer “what does this field actually mean?” threads in Slack.
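The schema request above can also be assembled in code, which makes it easier to keep the definition in version control and catch typos before anything is sent. The following is a minimal sketch that only builds and validates the JSON body. The `Field` class, `build_schema_payload`, and the `KNOWN_TYPES` set are hypothetical helpers introduced for illustration, and the set of accepted types is inferred from the example payload, so check it against your Structify API docs.

```python
import json
from dataclasses import asdict, dataclass

# Types that appear in the example payload above. Whether Structify accepts
# exactly this set is an assumption to confirm against your API docs.
KNOWN_TYPES = {"string", "date", "number", "boolean", "integer", "array<string>"}


@dataclass(frozen=True)
class Field:
    name: str
    type: str
    description: str


def build_schema_payload(name: str, description: str, fields: list[Field]) -> dict:
    """Build the JSON body for the illustrative POST /api/v1/datasets request."""
    seen = set()
    for field in fields:
        if field.type not in KNOWN_TYPES:
            raise ValueError(f"unknown type {field.type!r} on field {field.name!r}")
        if field.name in seen:
            raise ValueError(f"duplicate field name {field.name!r}")
        seen.add(field.name)
    return {
        "name": name,
        "description": description,
        "fields": [asdict(field) for field in fields],
    }


CONTRACT_FIELDS = [
    Field("counterparty_name", "string", "Legal name of the customer or vendor."),
    Field("start_date", "date", "Contract effective start date."),
    Field("end_date", "date", "Contract end or renewal date."),
    Field("total_contract_value", "number",
          "Total recurring + one-time value in contract currency."),
    Field("auto_renew", "boolean", "Whether the contract auto-renews by default."),
    Field("product_list", "array<string>",
          "List of SKUs or product names included in the contract."),
    Field("payment_terms_days", "integer", "Days between invoice and payment due date."),
]

payload = build_schema_payload(
    "contracts",
    "Structured contract metadata extracted from uploaded PDFs.",
    CONTRACT_FIELDS,
)
print(json.dumps(payload, indent=2))
```

Validating names and types locally means a misspelled type fails in your own code rather than as an API error.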
2. Run a Document Processing Job
With the dataset schema in place, you tell Structify: “Here are the files; extract them into this dataset.” Structify handles document parsing: reading PDFs, spreadsheets, presentations, call transcripts, and pulling out tables, text, numbers, charts, and other structured elements.
You can either:
- Upload raw files directly, or
- Provide URLs/paths to files stored in S3/GCS/internal storage (depending on your setup).
2.1 Upload a document (if needed)
POST /api/v1/files
Authorization: Bearer <API_TOKEN>
Content-Type: multipart/form-data
file=@/path/to/contract_acme_2026.pdf
Response
{
  "file_id": "file_987654321",
  "filename": "contract_acme_2026.pdf",
  "status": "uploaded"
}
Repeat for as many documents as you need, or use a bulk upload endpoint if available.
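If you are not using an HTTP client library, the upload request can be built with the Python standard library alone. The sketch below encodes one file as a multipart/form-data body and wraps it in a `urllib.request.Request` without sending it. The base URL is a placeholder, the `/api/v1/files` route and the `file` form field are taken from the illustrative request above, and `encode_multipart_file` and `build_upload_request` are hypothetical helper names.

```python
import urllib.request
import uuid


def encode_multipart_file(field_name: str, filename: str, content: bytes,
                          content_type: str = "application/pdf") -> tuple[bytes, str]:
    """Encode a single file part; return (body, Content-Type header value).

    Assumes filename contains no double quotes or line breaks.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def build_upload_request(base_url: str, token: str, filename: str,
                         content: bytes) -> urllib.request.Request:
    """Prepare (but do not send) the illustrative POST /api/v1/files request."""
    body, content_type = encode_multipart_file("file", filename, content)
    return urllib.request.Request(
        url=f"{base_url}/api/v1/files",
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {token}", "Content-Type": content_type},
    )


# Placeholder host and token; replace with the values from your API docs.
request = build_upload_request(
    "https://api.example.invalid", "API_TOKEN",
    "contract_acme_2026.pdf", b"%PDF-1.7 demo bytes",
)
print(request.get_method(), request.full_url)
# Sending would be: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes the encoding easy to test offline.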
2.2 Start a document processing job
POST /api/v1/document-jobs
Authorization: Bearer <API_TOKEN>
Content-Type: application/json
{
  "dataset_id": "ds_123456789",
  "files": [
    { "file_id": "file_987654321", "external_id": "ACME-2026-MSA" },
    { "file_id": "file_987654322", "external_id": "ACME-2026-ORDER-FORM" }
  ],
  "options": {
    "deduplicate_by": ["counterparty_name", "start_date", "end_date"],
    "merge_related_files": true,
    "timezone": "UTC"
  }
}
- external_id lets you tie Structify’s records back to your CRM or contract system.
- deduplicate_by hints how Structify should merge/normalize overlapping records.
- merge_related_files is useful when one deal spans multiple documents (MSA + SOW + order form).
Response
{
  "job_id": "job_246813579",
  "dataset_id": "ds_123456789",
  "status": "queued",
  "created_at": "2026-04-12T10:20:00Z"
}
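When jobs are created from code, it helps to build the request body from your own file-to-ID mapping and to check that every deduplicate_by key is actually a field in the dataset schema. This is a minimal sketch of that idea. `build_job_payload` and `SCHEMA_FIELDS` are hypothetical names, and the option keys are copied from the illustrative request above and are not confirmed API parameters.

```python
def build_job_payload(
    dataset_id: str,
    files: dict[str, str],
    schema_fields: set[str],
    deduplicate_by: tuple[str, ...] = (),
    merge_related_files: bool = False,
    timezone: str = "UTC",
) -> dict:
    """Build the JSON body for the illustrative POST /api/v1/document-jobs request.

    files maps each file_id to the external_id used in your own systems.
    """
    if not files:
        raise ValueError("at least one file is required")
    unknown = [name for name in deduplicate_by if name not in schema_fields]
    if unknown:
        raise ValueError(f"deduplicate_by references fields not in the schema: {unknown}")
    return {
        "dataset_id": dataset_id,
        "files": [
            {"file_id": file_id, "external_id": external_id}
            for file_id, external_id in files.items()
        ],
        "options": {
            "deduplicate_by": list(deduplicate_by),
            "merge_related_files": merge_related_files,
            "timezone": timezone,
        },
    }


# Field names from the contracts schema defined earlier in this article.
SCHEMA_FIELDS = {
    "counterparty_name", "start_date", "end_date", "total_contract_value",
    "auto_renew", "product_list", "payment_terms_days",
}

job_payload = build_job_payload(
    "ds_123456789",
    {"file_987654321": "ACME-2026-MSA", "file_987654322": "ACME-2026-ORDER-FORM"},
    SCHEMA_FIELDS,
    deduplicate_by=("counterparty_name", "start_date", "end_date"),
    merge_related_files=True,
)
print(job_payload)
```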
2.3 Poll job status
GET /api/v1/document-jobs/job_246813579
Authorization: Bearer <API_TOKEN>
Example response (completed)
{
  "job_id": "job_246813579",
  "dataset_id": "ds_123456789",
  "status": "completed",
  "started_at": "2026-04-12T10:20:05Z",
  "completed_at": "2026-04-12T10:22:30Z",
  "summary": {
    "files_processed": 2,
    "records_created": 2,
    "records_merged": 0,
    "errors": []
  }
}
At this point, your documents have been turned into structured rows in the contracts dataset.
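Because the job is asynchronous, a client has to poll until the status settles. The sketch below polls with exponential backoff and an overall timeout. The status fetch is passed in as a callable, so the loop itself makes no assumption about routes or HTTP libraries. Only "queued" and "completed" appear in the examples above; the "failed" status and the `JobFailed` exception are assumptions added for illustration.

```python
import time
from typing import Callable


class JobFailed(RuntimeError):
    """Raised when the job reports the (assumed) terminal status "failed"."""


def wait_for_job(
    get_job: Callable[[], dict],
    *,
    timeout_s: float = 600.0,
    initial_delay_s: float = 2.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll get_job() until the job completes, fails, or the timeout passes.

    get_job() should return the parsed JSON of one job-status response.
    """
    deadline = clock() + timeout_s
    delay = initial_delay_s
    while True:
        job = get_job()
        status = job.get("status")
        if status == "completed":
            return job
        if status == "failed":
            raise JobFailed(f"job {job.get('job_id')} failed: {job.get('summary')}")
        # Any other status (for example "queued") is treated as still in progress.
        if clock() + delay > deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_s} seconds")
        sleep(delay)
        delay = min(delay * 2, max_delay_s)


# Demo with canned responses and a recording sleep, so nothing actually waits.
responses = iter([
    {"job_id": "job_246813579", "status": "queued"},
    {"job_id": "job_246813579", "status": "completed",
     "summary": {"files_processed": 2, "errors": []}},
])
waits = []
finished = wait_for_job(lambda: next(responses), sleep=waits.append)
print(finished["status"], waits)
```

Injecting `sleep` and `clock` keeps the loop testable without real delays. In production you would leave the defaults in place.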
3. Pull Results Programmatically
Once the job is complete, you can fetch records via the API and feed them into your warehouse, internal dashboards, or revenue workflows.
3.1 Fetch all records from a dataset
GET /api/v1/datasets/ds_123456789/records?limit=100&offset=0
Authorization: Bearer <API_TOKEN>
Example response (abbreviated to the first record)
{
  "dataset_id": "ds_123456789",
  "records": [
    {
      "record_id": "rec_001",
      "external_id": "ACME-2026-MSA",
      "data": {
        "counterparty_name": "Acme Corporation",
        "start_date": "2026-01-01",
        "end_date": "2028-01-01",
        "total_contract_value": 350000.00,
        "auto_renew": true,
        "product_list": ["Enterprise Platform", "Premium Support"],
        "payment_terms_days": 30
      },
      "source": {
        "file_id": "file_987654321",
        "filename": "contract_acme_2026.pdf"
      },
      "created_at": "2026-04-12T10:22:25Z"
    }
  ],
  "pagination": {
    "limit": 100,
    "offset": 0,
    "has_more": false
  }
}
Iterate by increasing offset until has_more is false. Cursor-based pagination may also be available, depending on your plan.
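The limit/offset loop can be wrapped in a generator so the rest of your code sees one stream of records. This is a sketch under the assumption that each page has the shape of the example response above (a records array plus pagination.has_more). `iter_records` and the in-memory `fake_fetch_page` are hypothetical names used only to demonstrate the loop.

```python
from typing import Callable, Iterator


def iter_records(fetch_page: Callable[[int, int], dict], limit: int = 100) -> Iterator[dict]:
    """Yield every record, following limit/offset until has_more is false.

    fetch_page(limit, offset) must return the parsed JSON of one
    records response for that limit and offset.
    """
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        records = page.get("records", [])
        yield from records
        has_more = page.get("pagination", {}).get("has_more", False)
        # Stop on an empty page too, so a wrong has_more cannot loop forever.
        if not has_more or not records:
            return
        offset += len(records)


# In-memory stand-in for the API, used only to exercise the loop.
FAKE_RECORDS = [{"record_id": f"rec_{i:03d}"} for i in range(1, 6)]


def fake_fetch_page(limit: int, offset: int) -> dict:
    chunk = FAKE_RECORDS[offset:offset + limit]
    return {
        "records": chunk,
        "pagination": {
            "limit": limit,
            "offset": offset,
            "has_more": offset + limit < len(FAKE_RECORDS),
        },
    }


all_ids = [record["record_id"] for record in iter_records(fake_fetch_page, limit=2)]
print(all_ids)
```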
3.2 Filter by job or time window
If you’re syncing regularly, you typically only want new/updated records.
GET /api/v1/datasets/ds_123456789/records?since=2026-04-12T00:00:00Z
Authorization: Bearer <API_TOKEN>
Or filter by job:
GET /api/v1/document-jobs/job_246813579/records
Authorization: Bearer <API_TOKEN>
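For a recurring sync, one common approach is to keep a watermark: remember the newest timestamp you have seen and pass it as since on the next run. The sketch below builds the query path and advances the watermark. It assumes created_at is the right field to track and does not know whether since is inclusive, so downstream writes should be idempotent (keyed on record_id). `records_path`, `next_watermark`, `parse_ts`, and `format_ts` are hypothetical helpers.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlencode


def parse_ts(value: str) -> datetime:
    """Parse a timestamp like 2026-04-12T10:22:25Z into an aware datetime."""
    # Older Python versions reject a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def format_ts(value: datetime) -> str:
    """Format an aware datetime as UTC with a trailing Z."""
    return value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def records_path(dataset_id: str, since: Optional[datetime] = None, limit: int = 100) -> str:
    """Build the illustrative records path with an optional since filter."""
    params = {"limit": limit}
    if since is not None:
        params["since"] = format_ts(since)
    return f"/api/v1/datasets/{dataset_id}/records?{urlencode(params)}"


def next_watermark(records: list[dict], previous: Optional[datetime]) -> Optional[datetime]:
    """Return the newest created_at seen so far, never moving backwards."""
    stamps = [parse_ts(r["created_at"]) for r in records if r.get("created_at")]
    if previous is not None:
        stamps.append(previous)
    return max(stamps) if stamps else None


batch = [
    {"record_id": "rec_001", "created_at": "2026-04-12T10:22:25Z"},
    {"record_id": "rec_002", "created_at": "2026-04-12T10:20:00Z"},
]
watermark = next_watermark(batch, previous=None)
print(records_path("ds_123456789", since=watermark))
```

Persist the watermark only after the batch has been written successfully, so a failed run is retried from the same point.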
3.3 Upsert into your warehouse or CRM
Once you have structured JSON, you can:
- Upsert into a contracts table in Snowflake/BigQuery/Redshift.
- Enrich accounts/opportunities in Salesforce or HubSpot.
- Feed your internal pricing, churn, or expansion models.
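The upsert step can be sketched with SQLite from the Python standard library standing in for a warehouse. Each API record is flattened into one row and written with INSERT OR REPLACE keyed on record_id, so re-running a sync overwrites rows instead of duplicating them. The table layout and the `to_row` and `upsert_records` helpers are illustrative assumptions, and Snowflake, BigQuery, or Redshift would typically use their own MERGE syntax instead.

```python
import json
import sqlite3

COLUMNS = (
    "record_id", "external_id", "counterparty_name", "start_date", "end_date",
    "total_contract_value", "auto_renew", "product_list", "payment_terms_days",
    "source_file_id",
)


def to_row(record: dict) -> dict:
    """Flatten one API record into a column -> value mapping."""
    data = record.get("data", {})
    return {
        "record_id": record["record_id"],
        "external_id": record.get("external_id"),
        "counterparty_name": data.get("counterparty_name"),
        "start_date": data.get("start_date"),
        "end_date": data.get("end_date"),
        "total_contract_value": data.get("total_contract_value"),
        "auto_renew": data.get("auto_renew"),
        # Arrays are stored as JSON text here; a warehouse may offer an ARRAY type.
        "product_list": json.dumps(data.get("product_list") or []),
        "payment_terms_days": data.get("payment_terms_days"),
        "source_file_id": record.get("source", {}).get("file_id"),
    }


def upsert_records(conn: sqlite3.Connection, records: list[dict]) -> int:
    """Insert new rows and overwrite existing ones, keyed on record_id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS contracts ("
        "record_id TEXT PRIMARY KEY, external_id TEXT, counterparty_name TEXT, "
        "start_date TEXT, end_date TEXT, total_contract_value REAL, "
        "auto_renew INTEGER, product_list TEXT, payment_terms_days INTEGER, "
        "source_file_id TEXT)"
    )
    placeholders = ", ".join(f":{column}" for column in COLUMNS)
    sql = f"INSERT OR REPLACE INTO contracts ({', '.join(COLUMNS)}) VALUES ({placeholders})"
    rows = [to_row(record) for record in records]
    with conn:  # commits on success, rolls back on error
        conn.executemany(sql, rows)
    return len(rows)


example_record = {
    "record_id": "rec_001",
    "external_id": "ACME-2026-MSA",
    "data": {
        "counterparty_name": "Acme Corporation",
        "start_date": "2026-01-01",
        "end_date": "2028-01-01",
        "total_contract_value": 350000.00,
        "auto_renew": True,
        "product_list": ["Enterprise Platform", "Premium Support"],
        "payment_terms_days": 30,
    },
    "source": {"file_id": "file_987654321", "filename": "contract_acme_2026.pdf"},
}

connection = sqlite3.connect(":memory:")
upsert_records(connection, [example_record])
upsert_records(connection, [example_record])  # second run replaces, not duplicates
print(connection.execute("SELECT COUNT(*) FROM contracts").fetchone()[0])
```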
This is where Structify’s broader value kicks in: those contract terms can now sit next to CRM fields, support tickets, call transcripts, and competitor intel in one place—so you can ask questions like “Which products and contract terms correlate with higher upsell rates?”
Common Mistakes to Avoid
- Treating schemas as one-off experiments: If you tweak field names/types every week, your downstream models and dashboards will constantly break. Decide on a stable schema upfront and evolve it intentionally, the same way you’d manage your CRM or warehouse schema.
- Ignoring document diversity: Not all PDFs are created equal—scanned documents, different languages, wildly different templates. Bake in error handling and monitoring so you can catch outliers and refine your schema or processing options instead of assuming 100% perfect extraction.
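One lightweight way to catch outliers is to check every extracted record against the schema before loading it, and route anything with missing or mistyped fields to a review queue. The sketch below is an assumption about how such a check could look, not a Structify feature. `find_problems`, `CHECKS`, and `CONTRACT_SCHEMA` are hypothetical names, and the type rules mirror the field types used in the schema example earlier in this article.

```python
from datetime import date


def _is_iso_date(value) -> bool:
    try:
        date.fromisoformat(value)
        return True
    except (TypeError, ValueError):
        return False


# One predicate per schema type; bool is excluded from the numeric checks
# because Python treats True/False as integers.
CHECKS = {
    "string": lambda v: isinstance(v, str) and v.strip() != "",
    "date": _is_iso_date,
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
    "array<string>": lambda v: isinstance(v, list) and all(isinstance(x, str) for x in v),
}

CONTRACT_SCHEMA = {
    "counterparty_name": "string",
    "start_date": "date",
    "end_date": "date",
    "total_contract_value": "number",
    "auto_renew": "boolean",
    "product_list": "array<string>",
    "payment_terms_days": "integer",
}


def find_problems(data: dict, schema: dict[str, str]) -> list[str]:
    """Return one message per missing or mistyped field, in schema order."""
    problems = []
    for name, type_name in schema.items():
        value = data.get(name)
        if value is None:
            problems.append(f"{name}: missing")
        elif not CHECKS[type_name](value):
            problems.append(f"{name}: expected {type_name}, got {value!r}")
    return problems


suspect = {
    "counterparty_name": "Acme Corporation",
    "start_date": "01/01/2026",          # not ISO 8601
    "end_date": "2028-01-01",
    "total_contract_value": 350000.0,
    "product_list": ["Enterprise Platform"],
    "payment_terms_days": "30",          # string instead of integer
}
for problem in find_problems(suspect, CONTRACT_SCHEMA):
    print(problem)
```

Tracking how often each message appears per document template gives an early signal that a new layout needs a schema or option change.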
Real-World Example
A RevOps lead at a B2B SaaS company wanted every enterprise contract and order form in a structured format so they could finally answer: “Which products and terms predict expansion and lower churn?” Historically, that meant weeks of manual review across hundreds of PDFs.
With Structify, they:
- Created a contracts dataset schema with fields for ACV, products purchased, contract length, payment terms, and renewal clauses.
- Ran a document processing job across their historical contract library (MSAs, SOWs, order forms) stored in cloud drives and internal systems.
- Pulled the structured output via API directly into Snowflake, joined it with Salesforce opportunities and Zendesk tickets, and surfaced everything in dashboards that don’t need manual updates.
What used to be a one-off “let’s hire interns to read contracts” project turned into a reusable pipeline. Instead of guessing why some accounts expand faster, they could see patterns across contract terms, product mix, and support history.
Pro Tip: When you define your dataset schema, mirror the fields you actually use in analysis and reporting (ACV, renewal date, product SKUs, region, segment). That way, the JSON coming out of Structify drops straight into your existing models and dashboards—no extra transformation layer needed.
Summary
Using the Structify API to create dataset schemas, run document processing jobs, and pull results programmatically turns unstructured docs into a reliable, queryable data source. You define the fields once, let Structify handle parsing and normalization, and wire the output into your warehouse, CRM, or internal tools.
The payoff is simple: faster answers to hard revenue questions—without more manual exports, copy-paste, or custom parsing scripts. You get the same “conversation, not a query builder” experience Structify offers in the UI, but embedded directly into your data and revenue workflows.