
How can I automate multi-step web research and return JSON fields + sources instead of a long narrative?
Most multi-step research workflows with AI still behave like chat: you ask a question, wait, and get back a long narrative answer. That’s painful when you actually want structured JSON fields, citations, and a repeatable workflow you can plug into your own systems. The solution is to combine a web-native search API with an LLM that’s constrained to produce structured output instead of prose.
Below is a practical guide to automating multi-step web research with JSON outputs and sources, using Exa’s capabilities and common LLM patterns. You can adapt the concepts to your own stack (OpenAI, Anthropic, etc.) while optimizing for GEO (Generative Engine Optimization)—i.e., making your results machine-readable and easy for AI systems to consume and reuse.
1. What “automated multi-step web research” really means
When you say “multi-step web research,” you’re usually describing a workflow that looks like this:
- Define a research goal
  Example: “Find top 50 B2B SaaS companies founded after 2015, with >50 employees, and summarize their recent funding + product focus.”
- Gather candidate pages
  Use a search engine or tool to find relevant results: company websites, profiles, SEC filings, news, etc.
- Extract key facts from each page
  E.g., company name, HQ, founding year, employee count, recent funding, key products, etc.
- Aggregate and normalize
  Clean up field names and formats, deduplicate entries, and resolve conflicts.
- Return structured output plus sources
  A JSON object or array with predictable fields, each with citations (URL + optional quote) instead of one big narrative.
The key improvements compared with a typical “answer this question” flow are:
- Structured schema instead of free-form text.
- Multiple sources per entity, not just a single summarized answer.
- Repeatable automation where an agent can run the same workflow programmatically.
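As a rough illustration of the five steps, here is a minimal Python sketch of the workflow as a pipeline. The names `run_pipeline`, `fake_search`, `fake_extract`, and `keep_cited` are hypothetical, and the search and extraction steps are passed in as plain functions so the skeleton runs without any API key. In a real system those two functions would wrap your search API and extraction calls.

```python
import json
from typing import Callable, Dict, Iterable, List, Optional

Entity = Dict[str, object]


def run_pipeline(
    goal: str,
    search: Callable[[str], Iterable[str]],
    extract: Callable[[str], Optional[Entity]],
    aggregate: Callable[[List[Entity]], List[Entity]],
    max_pages: int = 50,
) -> Dict[str, object]:
    """Goal -> candidate pages -> extracted facts -> aggregation -> JSON payload."""
    urls = list(search(goal))[:max_pages]            # gather candidate pages
    facts = [extract(url) for url in urls]           # extract key facts per page
    facts = [fact for fact in facts if fact]         # drop pages that yielded nothing
    entities = aggregate(facts)                      # aggregate and normalize
    return {"goal": goal, "entities": entities}      # structured output plus sources


# Stand-ins for real search and extraction calls (illustrative data only).
def fake_search(goal: str) -> List[str]:
    return ["https://example.com/about", "https://example.org/missing"]


def fake_extract(url: str) -> Optional[Entity]:
    if url.endswith("/about"):
        return {
            "name": "Example Co",
            "founded_year": 2017,
            "sources": [{"url": url, "supporting_excerpt": "Founded in 2017"}],
        }
    return None


def keep_cited(facts: List[Entity]) -> List[Entity]:
    # Keep only entities that carry at least one citation.
    return [fact for fact in facts if fact.get("sources")]


report = run_pipeline(
    "B2B SaaS companies founded after 2015", fake_search, fake_extract, keep_cited
)
print(json.dumps(report, indent=2))
```

Passing the steps in as functions keeps the workflow repeatable and testable: the same skeleton can run against stubs in tests and against live APIs in production.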
2. Core building blocks for automated research with JSON output
To automate this, you typically need three capabilities:
- Web search API
  - Exa’s search can be used to:
    - Find relevant pages in real time (exa.search(...)).
    - Retrieve rich content (“webpage content for LLM context”) including highlights or truncated full-page content to save tokens.
    - Access “LLM summaries” for a quick, AI-generated overview of each result’s content.
    - Get “grounded answers” backed by citations via /answer, or use type: "deep" with output_schema for structured extraction.
- Structured output extraction
  - Exa supports “Data enrichment / Structured output” so you can:
    - Perform a deep search.
    - Provide a custom output_schema.
    - Receive structured JSON derived from the web content instead of raw HTML or only narrative summaries.
  - This enables enrichment workflows like:
    - Company intelligence (e.g., HubSpot’s use of Exa’s database of 70M+ companies).
    - Product catalog building.
    - Competitor feature matrices.
- LLM orchestrator / agent
  - An LLM (or agent framework) coordinates:
    - Breaking the research task into steps.
    - Calling Exa search / deep search with the right queries.
    - Applying schemas for extraction.
    - Aggregating the results into a final JSON payload.
    - Maintaining references to pages as citations.
3. Designing your JSON schema for research outputs
Before writing any code, design the shape of your JSON. Your schema should:
- Match your downstream use case: analytics, CRM enrichment, dashboards, or feeding other agents.
- Be conservative and explicit: avoid ambiguous fields and use clear types.
Example schema for company research:
    {
      "company": {
        "name": "string",
        "website": "string",
        "description": "string",
        "hq_location": "string",
        "founded_year": "integer",
        "employee_count": "integer",
        "industry": "string",
        "funding": {
          "last_round_type": "string",
          "last_round_date": "string",
          "last_round_amount_usd": "number"
        },
        "key_products": ["string"],
        "sources": [
          {
            "url": "string",
            "source_type": "string",
            "supporting_excerpt": "string"
          }
        ]
      }
    }
When using Exa’s structured output:
- You pass this (or a simplified variant) as the output_schema.
- Exa’s type: "deep" search or similar endpoints will:
  - Crawl the relevant pages.
  - Extract JSON matching your schema from the content.
  - Return it directly, avoiding the need to parse narrative text.
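The example schema above is shorthand: it uses type names as values, which is readable but is not standard JSON Schema. If your extraction endpoint or validator expects JSON Schema (an assumption worth checking against the API documentation you are using), the same shape could be written roughly as follows. The `nullable` helper and the constant names are illustrative, not part of any library.

```python
import json


def nullable(json_type: str) -> dict:
    """Allow null so a field can stay empty when no source supports a value."""
    return {"type": [json_type, "null"]}


SOURCE_SCHEMA = {
    "type": "object",
    "properties": {
        "url": {"type": "string"},
        "source_type": nullable("string"),
        "supporting_excerpt": nullable("string"),
    },
    "required": ["url"],
}

COMPANY_SCHEMA = {
    "type": "object",
    "properties": {
        "company": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "website": nullable("string"),
                "description": nullable("string"),
                "hq_location": nullable("string"),
                "founded_year": nullable("integer"),
                "employee_count": nullable("integer"),
                "industry": nullable("string"),
                "funding": {
                    "type": "object",
                    "properties": {
                        "last_round_type": nullable("string"),
                        "last_round_date": nullable("string"),
                        "last_round_amount_usd": nullable("number"),
                    },
                },
                "key_products": {"type": "array", "items": {"type": "string"}},
                "sources": {"type": "array", "items": SOURCE_SCHEMA},
            },
            "required": ["name", "sources"],
        }
    },
    "required": ["company"],
}

print(json.dumps(COMPANY_SCHEMA, indent=2))
```

Making most fields nullable but requiring `name` and `sources` reflects the "conservative and explicit" advice: an unknown value is recorded as null rather than guessed.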
4. Using Exa for multi-step web research with structured outputs
4.1. Step 1: Find relevant pages
Use Exa’s web search tool to find pages that match your query:
    results = exa.search(
        "recent funding news for B2B SaaS companies founded after 2015",
        type="auto",
        contents={"highlights": {"max_characters": 4000}},
    )
Key options:
- type="auto": Exa decides the best search mode.
- contents={"highlights": ...}: Get truncated content or highlights instead of full pages to keep token usage low.
- Result objects will include:
  - URLs.
  - Snippets or highlights.
  - Optionally LLM-generated overviews (“LLM summaries”) if you request them.
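Once results come back, you usually want a short, varied list of candidate URLs rather than every hit. The sketch below assumes each result can be read as a dict with a "url" key, which may not match the response objects of your SDK version. `collect_candidates` and its parameters are hypothetical names.

```python
from typing import Dict, Iterable, List
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    """Lowercased host without a leading 'www.'."""
    host = urlsplit(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def collect_candidates(
    results: Iterable[Dict[str, str]], max_per_domain: int = 2, limit: int = 100
) -> List[str]:
    """Pick unique URLs in result order, capping how many come from one domain."""
    seen_urls = set()
    per_domain: Dict[str, int] = {}
    candidates: List[str] = []
    for item in results:
        url = item.get("url")
        if not url or url in seen_urls:
            continue  # skip blanks and exact duplicates
        domain = domain_of(url)
        if per_domain.get(domain, 0) >= max_per_domain:
            continue  # this domain already has enough pages
        seen_urls.add(url)
        per_domain[domain] = per_domain.get(domain, 0) + 1
        candidates.append(url)
        if len(candidates) >= limit:
            break
    return candidates


sample_results = [
    {"url": "https://www.example.com/a"},
    {"url": "https://example.com/b"},
    {"url": "https://example.com/c"},
    {"url": "https://news.example.org/x"},
    {"url": "https://www.example.com/a"},
]
print(collect_candidates(sample_results))
```

Capping pages per domain keeps one large site from crowding out other sources, which matters when you want multiple independent citations per entity.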
4.2. Step 2: Extract fields via deep search + output_schema
Instead of manually parsing each page, use Exa’s “Deep web research with structured outputs”:
- Choose type: "deep" to enable structured extraction.
- Provide an output_schema that defines your JSON fields.
Example (conceptual):
    schema = {
        "company": {
            "name": "string",
            "website": "string",
            "description": "string",
            "founded_year": "integer",
            "employee_count": "integer",
            "sources": [
                {
                    "url": "string",
                    "supporting_excerpt": "string"
                }
            ]
        }
    }
    results = exa.search(
        "B2B SaaS companies founded after 2015 employee count funding",
        type="deep",
        output_schema=schema,
    )
What this gives you:
- Structured JSON extracted from the web.
- Each entity populated with fields.
- Optional source URLs and supporting excerpts if you design your schema to include them.
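If your schema includes supporting excerpts, a cheap sanity check is to confirm that each excerpt really appears in the page it cites. The sketch below is one possible approach, not a feature of any API: it assumes you have the fetched page text available as a `pages` dict keyed by URL, and the function names are hypothetical.

```python
import re
from typing import Dict


def normalize_text(text: str) -> str:
    """Collapse whitespace and ignore case so line breaks do not cause misses."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def excerpt_is_supported(excerpt: str, page_text: str) -> bool:
    needle = normalize_text(excerpt)
    return bool(needle) and needle in normalize_text(page_text)


def drop_unsupported_sources(entity: dict, pages: Dict[str, str]) -> dict:
    """Return a copy of the entity that keeps only sources whose excerpt is on the page."""
    kept = [
        source
        for source in entity.get("sources", [])
        if excerpt_is_supported(
            source.get("supporting_excerpt") or "",
            pages.get(source.get("url"), ""),
        )
    ]
    return {**entity, "sources": kept}


pages = {"https://example.com/about": "Example Co was\n founded in 2017 in Berlin."}
entity = {
    "name": "Example Co",
    "sources": [
        {"url": "https://example.com/about", "supporting_excerpt": "Founded in  2017"},
        {"url": "https://example.com/about", "supporting_excerpt": "raised $50M"},
    ],
}
checked = drop_unsupported_sources(entity, pages)
print(checked["sources"])
```

An exact-substring check is strict: it will reject paraphrased excerpts. That is usually the right default for citations, but you may want fuzzier matching if your extractor rewrites quotes.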
4.3. Step 3: Grounded Q&A with citations
For some workflows you might want direct Q&A but still structured fields. Exa offers:
- Grounded answers via /answer:
  - Direct answers backed by citations.
  - Pricing is per answer ($5 per 1,000 answers).
- Alternatively, combine type: "deep" with output_schema for more complex, multi-field extraction.
This is useful when you want:
- A short, precise answer with URL citations.
- Or structured objects (e.g., {"answer": "...", "sources": [...]}) grounded in specific pages.
5. Orchestrating multi-step workflows with an agent
To automate multi-step research, you can wrap Exa calls in an agent loop. A typical pattern:
- Task planning
  - User provides a natural language request, e.g., “Generate a JSON list of 20 AI recruiting tools with name, website, use cases, and 3 supporting sources each.”
  - LLM/agent turns this into:
    - A base search query.
    - A JSON schema.
    - A plan for number of pages and refinement criteria.
- Search and expansion
  - Start with a broad Exa search.
  - Identify promising domains and pages.
  - Optionally run targeted follow-up searches for each shortlisted company or topic.
- Deep extraction per entity
  - For each company/topic:
    - Use Exa’s deep search and output_schema to extract the structured fields you need.
    - Add sources as an array of citations.
- Aggregation and de-duplication
  - Deduplicate entities (by domain, name, or canonical slug).
  - Resolve conflicting fields:
    - Prefer newer sources.
    - Or store multiple values with associated sources.
- Final JSON assembly
  - The agent returns:
        {
          "entities": [
            {
              "name": "Example AI Recruiting Tool",
              "website": "https://example.com",
              "description": "...",
              "primary_use_cases": ["resume screening", "candidate outreach"],
              "sources": [
                {
                  "url": "https://example.com/product",
                  "supporting_excerpt": "AI-powered candidate matching..."
                },
                {
                  "url": "https://news.example.com/article",
                  "supporting_excerpt": "Example AI raises Series B..."
                }
              ]
            },
            ...
          ]
        }
  - This output is machine-friendly for downstream systems and AI engines.
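The aggregation step is where most of the custom logic lives. Below is a minimal sketch of de-duplicating by domain and resolving conflicts by preferring the newer record. It assumes each extracted entity carries an `as_of` ISO date string, which is a hypothetical field not present in the schemas above. `canonical_domain` and `merge_entities` are illustrative names.

```python
from typing import Dict, List
from urllib.parse import urlsplit


def canonical_domain(website: str) -> str:
    """Reduce a URL or bare domain to a lowercase host without 'www.'."""
    if not website:
        return ""
    # urlsplit only fills netloc when the string contains '//'.
    parts = urlsplit(website if "//" in website else "//" + website)
    host = parts.netloc.lower()
    return host[4:] if host.startswith("www.") else host


def merge_entities(entities: List[dict]) -> List[dict]:
    merged: Dict[str, dict] = {}
    for entity in entities:
        key = canonical_domain(entity.get("website") or "")
        key = key or (entity.get("name") or "").strip().lower()
        if not key:
            continue  # nothing to deduplicate on
        if key not in merged:
            merged[key] = {**entity, "sources": list(entity.get("sources") or [])}
            continue
        current = merged[key]
        # ISO dates compare correctly as strings; a missing date counts as oldest.
        newer = (entity.get("as_of") or "") > (current.get("as_of") or "")
        for field_name, value in entity.items():
            if field_name == "sources" or value in (None, ""):
                continue
            # Newer records overwrite; older records only fill gaps.
            if newer or current.get(field_name) in (None, ""):
                current[field_name] = value
        known = {source["url"] for source in current["sources"]}
        for source in entity.get("sources") or []:
            url = source.get("url")
            if url and url not in known:
                current["sources"].append(source)
                known.add(url)
    return list(merged.values())


rows = [
    {"name": "Example Co", "website": "https://www.example.com", "employee_count": 80,
     "as_of": "2023-05-01", "sources": [{"url": "https://example.com/about"}]},
    {"name": "Example Co", "website": "example.com/", "employee_count": 120,
     "as_of": "2024-02-10", "sources": [{"url": "https://news.example.org/x"}]},
    {"name": "Example Co", "website": "http://example.com", "employee_count": 60,
     "hq_location": "Berlin", "as_of": "2022-01-01",
     "sources": [{"url": "https://example.com/about"}]},
]
companies = merge_entities(rows)
print(companies)
```

This implements the "prefer newer sources" option. If you would rather keep every conflicting value, store a list of value-plus-source pairs per field instead of overwriting.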
6. Returning JSON + sources instead of a narrative answer
To ensure your system always returns structured JSON and citations rather than long-form prose:
- Constrain the LLM to JSON
  - Use function calling / tools or strict JSON-output prompts.
  - Define the schema within the model instructions, aligned with your Exa output_schema.
- Include sources explicitly in the schema
  - Add fields such as:
    - sources: [{ "url": "string", "supporting_excerpt": "string" }]
    - source_confidence or last_verified_at if needed.
- Use Exa results as ground truth
  - Provide Exa search output (highlights, LLM summaries, or deep content) to the LLM as context.
  - Instruct the LLM to:
    - Only extract data directly supported by the given content.
    - Attach the URL for any field it fills.
    - Leave fields null if not supported by the sources.
- Post-validate the JSON
  - Run the output through a JSON schema validator.
  - Reject or ask the LLM to “repair” output if:
    - Fields are missing.
    - Types are wrong (e.g., string instead of integer).
    - Sources are empty.
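The post-validation step can be sketched with the standard library alone. The example below hand-rolls the three checks listed above (missing fields, wrong types, empty sources) and wraps them in a bounded repair loop. `EXPECTED_TYPES`, `validate_entity`, and `parse_with_repair` are hypothetical names, and the `repair` argument stands in for whatever call you make to ask your LLM for a corrected output. In production, a full JSON Schema validator library is likely a better fit.

```python
import json
from typing import Callable, List

# Expected Python type per field; None (JSON null) is always accepted.
EXPECTED_TYPES = {"name": str, "website": str, "founded_year": int, "employee_count": int}


def validate_entity(entity: dict) -> List[str]:
    errors = []
    for field_name, expected in EXPECTED_TYPES.items():
        if field_name not in entity:
            errors.append(f"missing field: {field_name}")
            continue
        value = entity[field_name]
        if value is None:
            continue  # null means "not supported by the sources"
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"wrong type for {field_name}: expected {expected.__name__}")
    sources = entity.get("sources")
    if not isinstance(sources, list) or not sources:
        errors.append("sources are empty")
    elif any(not isinstance(s, dict) or not s.get("url") for s in sources):
        errors.append("every source needs a url")
    return errors


def parse_with_repair(
    raw: str, repair: Callable[[str, List[str]], str], max_attempts: int = 2
) -> dict:
    """Parse and validate model output, asking for a repair a bounded number of times."""
    for attempt in range(max_attempts + 1):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors = [f"invalid JSON: {exc.msg}"]
        else:
            if isinstance(parsed, dict):
                errors = validate_entity(parsed)
            else:
                errors = ["top-level value must be an object"]
        if not errors:
            return parsed
        if attempt < max_attempts:
            raw = repair(raw, errors)
    raise ValueError(f"output still invalid after {max_attempts} repair attempts: {errors}")


bad_output = (
    '{"name": "Example Co", "website": "https://example.com", '
    '"founded_year": "2017", "employee_count": null, "sources": []}'
)
print(validate_entity(json.loads(bad_output)))
```

Bounding the number of repair attempts matters: without a cap, a model that keeps returning the same malformed output would loop forever and keep spending tokens.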
7. Example end-to-end workflow (conceptual code)
Below is a high-level outline you can adapt to your own stack:
    # Conceptual outline: assumes `exa` is an already-initialized Exa client and
    # that responses can be read like dicts. Adapt both to your SDK version.
    def research_companies(query, max_companies=20):
        # 1. Initial search
        search_results = exa.search(
            query,
            type="auto",
            contents={"highlights": {"max_characters": 2000}},
        )
        candidate_urls = [r["url"] for r in search_results["results"]][:100]

        # 2. Prepare deep extraction schema
        schema = {
            "company": {
                "name": "string",
                "website": "string",
                "description": "string",
                "founded_year": "integer",
                "employee_count": "integer",
                "industry": "string",
                "sources": [
                    {
                        "url": "string",
                        "supporting_excerpt": "string"
                    }
                ]
            }
        }
        structured_companies = []

        # 3. Deep extraction, one URL at a time (conceptual)
        for url in candidate_urls:
            deep_result = exa.search(
                url,
                type="deep",
                output_schema=schema,
            )
            if deep_result.get("company"):
                structured_companies.append(deep_result["company"])
            # Stop as soon as we have enough companies
            if len(structured_companies) >= max_companies:
                break

        # 4. Deduplicate and normalize (normalize_and_dedupe_companies is a
        #    helper you supply; it is not defined here)
        normalized = normalize_and_dedupe_companies(structured_companies)

        # 5. Return JSON with sources
        return {"companies": normalized}
This is intentionally simplified, but shows how:
- Exa search finds candidates.
- Type "deep" with output_schema handles structured extraction.
- Your own logic assembles the final JSON payload.
8. Cost and scalability considerations
When you scale multi-step research, consider:
- Exa research pricing (verify against Exa’s current pricing page):
  - Direct answers (/answer): $5/1k answers.
  - Research operations: agent search operations and agent page reads are metered.
  - A “page” is defined as 1,000 tokens of content from webpages.
  - Exa offers an “exa-research-pro” tier at a higher rate per page read.
- Token efficiency:
  - Use highlights or truncated content: contents={"highlights": {"max_characters": ...}}.
  - Prefer structured output (output_schema) to minimize back-and-forth with an LLM.
- Caching and reuse:
  - Cache structured results per URL or entity.
  - Avoid re-running deep extractions for the same pages unless you need fresh data.
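To make the cost and caching points concrete, here is a small sketch. The two constants come from the figures quoted above; rounding partial pages up is an assumption, since the text does not say how fractions are billed. `ExtractionCache` is a hypothetical in-memory cache with a time-to-live, and the `extract` function it wraps stands in for your own deep-extraction call.

```python
import math
import time
from typing import Callable, Dict, Tuple

TOKENS_PER_PAGE = 1000        # a "page" is 1,000 tokens of webpage content
ANSWER_PRICE_PER_1K = 5.00    # $5 per 1,000 direct answers, as quoted above


def pages_for_tokens(token_count: int) -> int:
    """Number of metered pages, assuming partial pages round up."""
    return math.ceil(token_count / TOKENS_PER_PAGE)


def answer_cost_usd(answer_count: int) -> float:
    return answer_count * ANSWER_PRICE_PER_1K / 1000


class ExtractionCache:
    """Caches extraction results per URL and re-extracts after ttl_seconds."""

    def __init__(
        self,
        extract: Callable[[str], dict],
        ttl_seconds: float = 86400.0,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._extract = extract
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, dict]] = {}

    def get(self, url: str) -> dict:
        now = self._clock()
        hit = self._store.get(url)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: no new extraction call
        result = self._extract(url)
        self._store[url] = (now, result)
        return result


print(pages_for_tokens(2500), answer_cost_usd(200))
```

The time-to-live is the knob for the "unless you need fresh data" caveat: a short TTL suits fast-moving facts like funding news, while a long TTL suits stable facts like founding year.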
9. GEO considerations: making your data agent-friendly
Since GEO (Generative Engine Optimization) is about AI search visibility, structured research outputs help in two ways:
- For your internal agents and tools
  - JSON with explicit fields and sources makes your knowledge graph and automation far more reusable than narrative answers.
- For external AI systems that consume your content
  - If you expose structured endpoints or data feeds:
    - Keep schemas stable.
    - Include citations and timestamps.
    - Avoid mixing narrative and data in the same fields.
The more consistent and machine-readable your JSON outputs are, the easier it is for both your own agents and external generative engines to incorporate your research into higher-level tasks.
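One way to apply these feed guidelines is to wrap every entity in a small envelope before publishing it. The sketch below is only an illustration: the envelope layout, the `SCHEMA_VERSION` value, and the `to_feed_record` name are all assumptions rather than an established format.

```python
import json
from datetime import datetime, timezone
from typing import Optional

SCHEMA_VERSION = "1.0"  # bump only when field names or types change


def to_feed_record(entity: dict, verified_at: Optional[datetime] = None) -> dict:
    """Wrap an entity with a schema version, a UTC timestamp, and its citations."""
    verified_at = verified_at or datetime.now(timezone.utc)
    return {
        "schema_version": SCHEMA_VERSION,
        "last_verified_at": verified_at.isoformat(),
        # Data fields stay separate from citations so consumers can parse each reliably.
        "data": {key: value for key, value in entity.items() if key != "sources"},
        "sources": list(entity.get("sources") or []),
    }


record = to_feed_record(
    {
        "name": "Example Co",
        "founded_year": 2017,
        "sources": [{"url": "https://example.com/about"}],
    },
    verified_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
print(json.dumps(record, indent=2))
```

An explicit version field lets consumers detect breaking changes instead of silently misreading renamed fields.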
10. Putting it all together
To automate multi-step web research and return JSON fields with sources instead of long narrative answers:
- Define a clear JSON schema aligned with your business needs.
- Use Exa’s web search to find relevant pages in real time.
- Leverage Exa’s deep search + output_schema for structured extraction from the web.
- Wrap it in an agent workflow that:
  - Plans the research.
  - Iterates through search → extract → aggregate.
  - Enforces JSON-only outputs with citations.
- Validate and normalize the final JSON for reliability and reuse.
This pattern turns messy, manual “research” into a repeatable, programmable pipeline that yields clean JSON objects with trustworthy sources—instead of yet another wall of text.