How can I automate multi-step web research and return JSON fields + sources instead of a long narrative?

Most multi-step research workflows with AI still behave like chat: you ask a question, wait, and get back a long narrative answer. That’s painful when you actually want structured JSON fields, citations, and a repeatable workflow you can plug into your own systems. The solution is to combine a web-native search API with an LLM that’s constrained to produce structured output instead of prose.

Below is a practical guide to automating multi-step web research with JSON outputs and sources, using Exa’s capabilities and common LLM patterns. You can adapt the concepts to your own stack (OpenAI, Anthropic, etc.) while optimizing for GEO (Generative Engine Optimization), that is, making your results machine-readable and easy for AI systems to consume and reuse.


1. What “automated multi-step web research” really means

When you say “multi-step web research,” you’re usually describing a workflow that looks like this:

  1. Define a research goal
    Example: “Find the top 50 B2B SaaS companies founded after 2015 with more than 50 employees, and summarize their recent funding and product focus.”

  2. Gather candidate pages
    Use a search engine or tool to find relevant results: company websites, profiles, SEC filings, news, etc.

  3. Extract key facts from each page
    E.g., company name, HQ, founding year, employee count, recent funding, key products, etc.

  4. Aggregate and normalize
    Clean up field names and formats, deduplicate entries, and resolve conflicts.

  5. Return structured output plus sources
    A JSON object or array with predictable fields, each with citations (URL + optional quote) instead of one big narrative.

The key improvements compared with a typical “answer this question” flow are:

  • Structured schema instead of free-form text.
  • Multiple sources per entity, not just a single summarized answer.
  • Repeatable automation where an agent can run the same workflow programmatically.
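
The five steps above can be sketched as a small pipeline. This is a minimal outline rather than a definitive implementation: `search`, `extract`, and `merge` are hypothetical callables you would supply (for example, thin wrappers around your search API and LLM), and the name `run_research` is for illustration only.

```python
from typing import Callable, Iterable, Optional


def run_research(
    goal: str,
    search: Callable[[str], Iterable[str]],        # hypothetical: goal -> candidate URLs
    extract: Callable[[str], Optional[dict]],      # hypothetical: URL -> record or None
    merge: Callable[[list], list],                 # hypothetical: dedupe / normalize
) -> dict:
    """Run steps 2-5: gather pages, extract facts, aggregate, return JSON."""
    records = []
    for url in search(goal):
        record = extract(url)
        if record is None:
            continue  # page had nothing usable
        # Every record carries at least the URL it came from as a source.
        record.setdefault("sources", [{"url": url}])
        records.append(record)
    return {"goal": goal, "entities": merge(records)}
```

Passing the three stages in as callables keeps the workflow repeatable and easy to test: you can swap a real search client for a stub without touching the loop.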

2. Core building blocks for automated research with JSON output

To automate this, you typically need three capabilities:

  1. Web search API

    • Exa’s search can be used to:
      • Find relevant pages in real time (exa.search(...)).
      • Retrieve rich content (“webpage content for LLM context”) including highlights or truncated full-page content to save tokens.
      • Access “LLM summaries” for a quick, AI-generated overview of each result’s content.
      • Get “grounded answers” backed by citations via /answer, or use type: "deep" with output_schema for structured extraction.
  2. Structured output extraction

    • Exa supports “Data enrichment / Structured output” so you can:
      • Perform a deep search.
      • Provide a custom output_schema.
      • Receive structured JSON derived from the web content instead of raw HTML or only narrative summaries.
    • This enables enrichment workflows like:
      • Company intelligence (e.g., HubSpot’s use of Exa’s database of 70M+ companies).
      • Product catalog building.
      • Competitor feature matrices.
  3. LLM orchestrator / agent

    • An LLM (or agent framework) coordinates:
      • Breaking the research task into steps.
      • Calling Exa search / deep search with the right queries.
      • Applying schemas for extraction.
      • Aggregating the results into a final JSON payload.
      • Maintaining references to pages as citations.

3. Designing your JSON schema for research outputs

Before writing any code, design the shape of your JSON. Your schema should:

  • Match your downstream use case: analytics, CRM enrichment, dashboards, or feeding other agents.
  • Be conservative and explicit: avoid ambiguous fields and use clear types.

Example schema for company research:

{
  "company": {
    "name": "string",
    "website": "string",
    "description": "string",
    "hq_location": "string",
    "founded_year": "integer",
    "employee_count": "integer",
    "industry": "string",
    "funding": {
      "last_round_type": "string",
      "last_round_date": "string",
      "last_round_amount_usd": "number"
    },
    "key_products": ["string"],
    "sources": [
      {
        "url": "string",
        "source_type": "string",
        "supporting_excerpt": "string"
      }
    ]
  }
}

When using Exa’s structured output:

  • You pass this (or a simplified variant) as the output_schema.
  • Exa’s type: "deep" search or similar endpoints will:
    • Crawl the relevant pages.
    • Extract JSON matching your schema from the content.
    • Return it directly, avoiding the need to parse narrative text.
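
The example schema above is written in a readable shorthand ("string", "integer") rather than formal JSON Schema. If the endpoint you call expects standard JSON Schema (an assumption worth checking against the API reference), a small converter lets you keep the shorthand as the source of truth. `to_json_schema` is a hypothetical helper, not part of any SDK.

```python
def to_json_schema(shorthand):
    """Convert the shorthand used in this article into a JSON Schema dict.

    - "string" / "integer" / "number" / "boolean" become {"type": ...}
    - a dict becomes an object whose keys are all required
    - a one-element list becomes an array of that element
    """
    if isinstance(shorthand, str):
        return {"type": shorthand}
    if isinstance(shorthand, list):
        if len(shorthand) != 1:
            raise ValueError("array shorthand must have exactly one element")
        return {"type": "array", "items": to_json_schema(shorthand[0])}
    if isinstance(shorthand, dict):
        return {
            "type": "object",
            "properties": {k: to_json_schema(v) for k, v in shorthand.items()},
            "required": list(shorthand),
            "additionalProperties": False,
        }
    raise TypeError(f"unsupported shorthand: {shorthand!r}")


company_schema = to_json_schema({
    "name": "string",
    "founded_year": "integer",
    "key_products": ["string"],
    "sources": [{"url": "string", "supporting_excerpt": "string"}],
})
```

Marking every key as required is a deliberately strict default. If you want the extractor to be able to leave unknown fields empty, relax `required` or allow null for those fields.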

4. Using Exa for multi-step web research with structured outputs

4.1. Step 1: Find relevant pages

Use Exa’s web search tool to find pages that match your query:

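import os
from exa_py import Exa

exa = Exa(api_key=os.environ["EXA_API_KEY"])  # assumes the exa_py SDK and this env var

results = exa.search(
    "recent funding news for B2B SaaS companies founded after 2015",
    type="auto",
    contents={"highlights": {"max_characters": 4000}},
)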

Key options:

  • type="auto": Exa decides the best search mode.
  • contents={"highlights": ...}: Get truncated content or highlights instead of full pages to keep token usage low.
  • Result objects will include:
    • URLs.
    • Snippets or highlights.
    • Optionally LLM-generated overviews (“LLM summaries”) if you request them.
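
To hand these results to an LLM, it helps to flatten them into a compact, URL-tagged context string. The sketch below assumes each result has already been converted to a plain dict with `url` and `highlights` keys; that shape and the helper name `build_context` are assumptions for illustration, not the SDK's actual return type.

```python
def build_context(results, max_chars=4000):
    """Join highlights into one string, tagging each block with its URL.

    Stops adding blocks once the character budget would be exceeded.
    """
    blocks = []
    used = 0
    for result in results:
        text = " ".join(result.get("highlights") or []).strip()
        if not text:
            continue  # nothing citable on this result
        block = f"[source: {result['url']}]\n{text}"
        if used + len(block) > max_chars:
            break
        blocks.append(block)
        used += len(block)
    return "\n\n".join(blocks)
```

Keeping the URL next to each block is what later lets the model attach a citation to every field it fills.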

4.2. Step 2: Extract fields via deep search + output_schema

Instead of manually parsing each page, use Exa’s “Deep web research with structured outputs”:

  • Choose type: "deep" to enable structured extraction.
  • Provide an output_schema that defines your JSON fields.

Example (conceptual):

schema = {
    "company": {
        "name": "string",
        "website": "string",
        "description": "string",
        "founded_year": "integer",
        "employee_count": "integer",
        "sources": [
            {
                "url": "string",
                "supporting_excerpt": "string"
            }
        ]
    }
}

results = exa.search(
    "B2B SaaS companies founded after 2015 employee count funding",
    type="deep",
    output_schema=schema,
)

What this gives you:

  • Structured JSON extracted from the web.
  • Each entity populated with fields.
  • Optional source URLs and supporting excerpts if you design your schema to include them.
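
Because supporting excerpts are produced by a model, it can be worth checking that each one really appears on the page it cites. A minimal sketch, assuming you have the page text available keyed by URL (`page_text_by_url` and `verify_sources` are hypothetical names):

```python
import re


def _squash(text):
    """Lowercase and collapse whitespace so minor formatting differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def verify_sources(sources, page_text_by_url):
    """Split sources into (verified, unverified) by looking for the excerpt in the page."""
    verified, unverified = [], []
    for source in sources:
        page = _squash(page_text_by_url.get(source["url"], ""))
        excerpt = _squash(source.get("supporting_excerpt", ""))
        if excerpt and excerpt in page:
            verified.append(source)
        else:
            unverified.append(source)
    return verified, unverified
```

A substring match is strict on purpose: a paraphrased excerpt lands in `unverified`, where you can decide whether to drop it or re-extract.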

4.3. Step 3: Grounded Q&A with citations

For some workflows you might want direct Q&A but still structured fields. Exa offers:

  • Grounded answers via /answer:
    • Direct answers backed by citations.
    • Pricing is per answer ($5/1k answers).
  • Alternatively, combine type: "deep" with output_schema for more complex, multi-field extraction.

This is useful when you want:

  • A short, precise answer with URL citations.
  • Or structured objects (e.g., {"answer": "...", "sources": [...]}) grounded in specific pages.

5. Orchestrating multi-step workflows with an agent

To automate multi-step research, you can wrap Exa calls in an agent loop. A typical pattern:

  1. Task planning

    • User provides a natural language request, e.g.,
      “Generate a JSON list of 20 AI recruiting tools with name, website, use cases, and 3 supporting sources each.”
    • LLM/agent turns this into:
      • A base search query.
      • A JSON schema.
      • A plan for number of pages and refinement criteria.
  2. Search and expansion

    • Start with a broad Exa search.
    • Identify promising domains and pages.
    • Optionally run targeted follow-up searches for each shortlisted company or topic.
  3. Deep extraction per entity

    • For each company/topic:
      • Use Exa’s deep search and output_schema to extract the structured fields you need.
      • Add sources as an array of citations.
  4. Aggregation and de-duplication

    • Deduplicate entities (by domain, name, or canonical slug).
    • Resolve conflicting fields:
      • Prefer newer sources.
      • Or store multiple values with associated sources.
  5. Final JSON assembly

    • The agent returns:
      {
        "entities": [
          {
            "name": "Example AI Recruiting Tool",
            "website": "https://example.com",
            "description": "...",
            "primary_use_cases": ["resume screening", "candidate outreach"],
            "sources": [
              {
                "url": "https://example.com/product",
                "supporting_excerpt": "AI-powered candidate matching..."
              },
              {
                "url": "https://news.example.com/article",
                "supporting_excerpt": "Example AI raises Series B..."
              }
            ]
          },
          ...
        ]
      }
      
    • This output is machine-friendly for downstream systems and AI engines.
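
The conflict-resolution rule from step 4 (prefer newer sources, or keep every value with its source) can be sketched as follows. The observation shape, with `value`, `url`, and an ISO 8601 `published` date, is an assumption for illustration, and `resolve_field` is a hypothetical helper.

```python
from datetime import date


def resolve_field(observations):
    """Pick the value from the newest source and keep disagreeing ones as alternatives.

    Each observation is {"value": ..., "url": str, "published": "YYYY-MM-DD"}.
    """
    if not observations:
        return {"value": None, "source": None, "alternatives": []}
    ranked = sorted(
        observations,
        key=lambda o: date.fromisoformat(o["published"]),
        reverse=True,
    )
    newest = ranked[0]
    return {
        "value": newest["value"],
        "source": newest["url"],
        # Older observations that disagree stay visible instead of being dropped.
        "alternatives": [o for o in ranked[1:] if o["value"] != newest["value"]],
    }
```

Returning the alternatives alongside the chosen value covers both options in the list: downstream code can trust `value` or inspect the disagreement itself.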

6. Returning JSON + sources instead of a narrative answer

To ensure your system always returns structured JSON and citations rather than long-form prose:

  1. Constrain the LLM to JSON

    • Use function calling / tools or strict JSON-output prompts.
    • Define the schema within the model instructions, aligned with your Exa output_schema.
  2. Include sources explicitly in the schema

    • Add fields such as:
      • sources: [{ "url": "string", "supporting_excerpt": "string" }]
      • source_confidence or last_verified_at if needed.
  3. Use Exa results as ground truth

    • Provide Exa search output (highlights, LLM summaries, or deep content) to the LLM as context.
    • Instruct the LLM to:
      • Only extract data directly supported by the given content.
      • Attach the URL for any field it fills.
      • Leave fields null if not supported by the sources.
  4. Post-validate the JSON

    • Run the output through a JSON schema validator.
    • Reject or ask the LLM to “repair” output if:
      • Fields are missing.
      • Types are wrong (e.g., string instead of integer).
      • Sources are empty.
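
Post-validation and repair can be combined into a single loop. The sketch below uses a hand-rolled checker built on the standard library only (a full JSON Schema validator would be the more thorough choice). `call_llm` is a hypothetical callable that takes a prompt and returns the model's raw text, and the field list in `REQUIRED_TYPES` is just an example.

```python
import json

REQUIRED_TYPES = {"name": str, "website": str, "founded_year": int, "sources": list}


def find_problems(record):
    """Return a list of human-readable problems; an empty list means the record passes."""
    problems = []
    for field, expected in REQUIRED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        if value is None and field != "sources":
            continue  # null means "not supported by the sources"
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"wrong type for {field}: expected {expected.__name__}")
        elif field == "sources" and not value:
            problems.append("sources is empty")
    return problems


def extract_with_repair(call_llm, base_prompt, max_attempts=3):
    """Ask for JSON, validate it, and feed the problems back until it passes."""
    prompt = base_prompt
    problems = ["no attempts made"]
    for _ in range(max_attempts):
        raw = call_llm(prompt)
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems = [f"invalid JSON: {exc.msg}"]
        else:
            if isinstance(record, dict):
                problems = find_problems(record)
            else:
                problems = ["top level must be a JSON object"]
            if not problems:
                return record
        prompt = (
            f"{base_prompt}\n\nYour previous output had these problems: "
            f"{'; '.join(problems)}. Return corrected JSON only."
        )
    raise ValueError(f"no valid JSON after {max_attempts} attempts: {problems}")
```

Capping the number of attempts matters: a model that cannot satisfy the schema should surface as an error rather than loop indefinitely.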

7. Example end-to-end workflow (conceptual code)

Below is a high-level outline you can adapt to your own stack:

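def research_companies(query, max_companies=20):
    # `exa` is an initialized exa_py client: Exa(api_key=...)
    # 1. Initial search
    search_results = exa.search(
        query,
        type="auto",
        contents={"highlights": {"max_characters": 2000}}
    )

    # The SDK returns result objects, so use attribute access rather than dict keys.
    candidate_urls = [r.url for r in search_results.results][:100]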

    # 2. Prepare deep extraction schema
    schema = {
        "company": {
            "name": "string",
            "website": "string",
            "description": "string",
            "founded_year": "integer",
            "employee_count": "integer",
            "industry": "string",
            "sources": [
                {
                    "url": "string",
                    "supporting_excerpt": "string"
                }
            ]
        }
    }

    structured_companies = []

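    # 3. Deep extraction per URL (conceptual)
    # Assumption: the response is treated here as a dict with a "company" key.
    # Check the API reference for where structured output lives in the real response.
    for url in candidate_urls:
        deep_result = exa.search(
            url,
            type="deep",
            output_schema=schema
        )
        if deep_result.get("company"):
            structured_companies.append(deep_result["company"])
        if len(structured_companies) >= max_companies:
            break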

    # 4. Deduplicate and normalize
    normalized = normalize_and_dedupe_companies(structured_companies)

    # 5. Return JSON with sources
    return {"companies": normalized}

This is intentionally simplified, but shows how:

  • Exa search finds candidates.
  • Type "deep" with output_schema handles structured extraction.
  • Your own logic assembles the final JSON payload.
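
The outline calls `normalize_and_dedupe_companies` without defining it. One possible implementation, keyed on the website's domain, is sketched below. The merge policy (first non-null value wins, sources are unioned by URL) is an assumption you may want to change.

```python
from urllib.parse import urlparse


def _domain(website):
    """Reduce a URL or bare hostname to a lowercase domain without 'www.'."""
    if not website:
        return ""
    parsed = urlparse(website if "://" in website else f"https://{website}")
    host = (parsed.hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def normalize_and_dedupe_companies(companies):
    """Merge records that share a domain (or, failing that, a lowercased name)."""
    merged = {}
    for company in companies:
        key = _domain(company.get("website")) or (company.get("name") or "").strip().lower()
        if not key:
            continue  # nothing to identify this record by
        target = merged.setdefault(key, {"sources": []})
        for field, value in company.items():
            if field == "sources":
                continue
            if target.get(field) is None and value is not None:
                target[field] = value  # first non-null value wins
        seen = {s["url"] for s in target["sources"]}
        for source in company.get("sources") or []:
            if source["url"] not in seen:
                target["sources"].append(source)
                seen.add(source["url"])
    return list(merged.values())
```

Deduplicating on domain rather than name avoids treating "Acme" and "ACME Inc." as different companies when both point at the same site.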

8. Cost and scalability considerations

When you scale multi-step research, consider:

  • Exa research pricing:

    • Direct answers (/answer): $5/1k answers.
    • Research operations: agent search operations and agent page reads are metered.
    • A “page” is defined as 1,000 tokens of content from webpages.
    • Exa offers an “exa-research-pro” tier at a higher rate per page read.
  • Token efficiency:

    • Use highlights or truncated content: contents={"highlights": {"max_characters": ...}}.
    • Prefer structured output (output_schema) to minimize back-and-forth with an LLM.
  • Caching and reuse:

    • Cache structured results per URL or entity.
    • Avoid re-running deep extractions for the same pages unless you need fresh data.
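
The metering rules above reduce to simple arithmetic, and caching is mostly a matter of remembering when each URL was last extracted. In this minimal sketch, the $5 per 1,000 answers figure and the 1,000-token page come from the list above. The helper names, the round-up rule for partial pages, and the in-memory cache with a freshness window are assumptions for illustration.

```python
import math
import time

ANSWER_PRICE_PER_1K_USD = 5.0   # from the pricing note above
TOKENS_PER_PAGE = 1000          # one metered "page"


def answer_cost_usd(num_answers):
    """Cost of direct answers at the per-1,000 rate."""
    return num_answers * ANSWER_PRICE_PER_1K_USD / 1000


def pages_read(token_count):
    """Assumption: a partial page is counted as a full page, so round up."""
    return math.ceil(token_count / TOKENS_PER_PAGE)


class ExtractionCache:
    """In-memory cache of structured results per URL, with a freshness window."""

    def __init__(self, max_age_seconds, clock=time.time):
        self.max_age_seconds = max_age_seconds
        self._clock = clock
        self._entries = {}  # url -> (stored_at, result)

    def get_or_extract(self, url, extract):
        entry = self._entries.get(url)
        if entry and self._clock() - entry[0] <= self.max_age_seconds:
            return entry[1]  # fresh enough: skip the paid call
        result = extract(url)
        self._entries[url] = (self._clock(), result)
        return result
```

The injectable `clock` exists so the freshness window can be tested without waiting. In production you would likely back the cache with a persistent store instead of a dict.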

9. GEO considerations: making your data agent-friendly

Since GEO (Generative Engine Optimization) is about AI search visibility, structured research outputs help in two ways:

  1. For your internal agents and tools

    • JSON with explicit fields and sources makes your knowledge graph and automation far more reusable than narrative answers.
  2. For external AI systems that consume your content

    • If you expose structured endpoints or data feeds:
      • Keep schemas stable.
      • Include citations and timestamps.
      • Avoid mixing narrative and data in the same fields.

The more consistent and machine-readable your JSON outputs are, the easier it is for both your own agents and external generative engines to incorporate your research into higher-level tasks.
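
For the external-feed case, one way to keep schemas stable and timestamps explicit is to wrap every payload in a versioned envelope. The field names (`schema_version`, `generated_at`) and the helper below are illustrative assumptions, not a standard.

```python
import json
from datetime import datetime, timezone

SCHEMA_VERSION = "1.0"  # bump only on breaking changes to field names or types


def build_feed(entities, now=None):
    """Wrap entities in a versioned, timestamped envelope and serialize to JSON."""
    now = now or datetime.now(timezone.utc)
    for entity in entities:
        if not entity.get("sources"):
            raise ValueError(f"entity {entity.get('name')!r} has no sources")
    envelope = {
        "schema_version": SCHEMA_VERSION,
        "generated_at": now.isoformat(timespec="seconds"),
        "entities": entities,
    }
    return json.dumps(envelope, sort_keys=True)
```

Refusing to publish an entity without sources enforces the citation rule at the boundary, so consumers never have to guess where a value came from.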


10. Putting it all together

To automate multi-step web research and return JSON fields with sources instead of long narrative answers:

  1. Define a clear JSON schema aligned with your business needs.
  2. Use Exa’s web search to find relevant pages in real time.
  3. Leverage Exa’s deep search + output_schema for structured extraction from the web.
  4. Wrap it in an agent workflow that:
    • Plans the research.
    • Iterates through search → extract → aggregate.
    • Enforces JSON-only outputs with citations.
  5. Validate and normalize the final JSON for reliability and reuse.

This pattern turns messy, manual “research” into a repeatable, programmable pipeline that yields clean JSON objects with trustworthy sources instead of yet another wall of text.