How can I generate a structured list of companies that match criteria (industry, size, funding stage) from the web?

Most teams eventually hit the same wall: you can describe your ideal target companies in detail—industry, size, funding stage, location—but turning that description into a clean, structured list from the web is painful and time‑consuming.

This guide walks through practical, repeatable ways to generate a structured list of companies that match specific criteria, using modern web search and structured outputs. The focus is on automation, reliability, and how to set yourself up so your data is ready for enrichment, outreach, or analysis.


1. Clarify your company criteria before you touch any tools

Well-defined criteria make everything else easier—especially when you want clean, structured output.

Common dimensions you’ll want to lock down:

  • Industry / vertical
    • Primary industry (e.g., “B2B SaaS”, “aerospace”, “fintech”, “healthcare AI”)
    • Sub-verticals (e.g., “revenue intelligence”, “AI copilots for devs”)
  • Company size
    • By headcount (e.g., 1–10, 11–50, 51–200, 201–500, 500+)
    • By revenue band (e.g., <$1M, $1–10M, $10–50M, $50M+)
  • Funding stage
    • Pre-seed, Seed, Series A/B/C, Growth, Public, Bootstrapped
  • Location
    • Regions (e.g., “North America”), countries, or cities (“NYC”, “SF Bay Area”)
    • Remote‑first vs HQ‑based
  • Other useful filters
    • Tech stack (e.g., “uses Stripe”, “built on React”)
    • Hiring status (e.g., “hiring engineers”, “hiring sales”)
    • ICP‑specific attributes (e.g., “serves mid‑market ecommerce brands”)

Write these down in plain language first. You’ll convert them into prompts and schemas in later steps.
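One way to keep the written criteria reusable is to hold them in a small data structure and render the query text from it. The sketch below is only illustrative: the CompanyCriteria class and its field names are assumptions made for this example, not part of any search API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CompanyCriteria:
    """Plain-language targeting criteria (class and field names are illustrative)."""
    industry: str
    size_range_employees: Optional[str] = None
    funding_stages: List[str] = field(default_factory=list)
    location: Optional[str] = None
    extra_filters: List[str] = field(default_factory=list)

    def to_query(self) -> str:
        """Join whichever criteria are set into one natural-language query."""
        parts = [f"{self.industry} companies"]
        if self.size_range_employees:
            parts.append(f"{self.size_range_employees} employees")
        if self.funding_stages:
            parts.append(" or ".join(self.funding_stages))
        if self.location:
            parts.append(f"based in {self.location}")
        parts.extend(self.extra_filters)
        return ", ".join(parts)


criteria = CompanyCriteria(
    industry="B2B SaaS",
    size_range_employees="11-50",
    funding_stages=["Seed", "Series A"],
    location="NYC",
)
print(criteria.to_query())
```

Keeping the criteria as data means the same object can later drive both the query text and any local filtering you apply to the results.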


2. Understand what “structured list of companies” actually means

A “structured list” means you don’t just have names—you have standardized fields you can feed into downstream systems (CRMs, analytics, LLM agents).

For example, a simple structured output might look like:

{
  "companies": [
    {
      "company_name": "General Electric",
      "website": "https://www.ge.com",
      "industry": "Aerospace",
      "size_range_employees": "10,000+",
      "founded_year": 1892,
      "funding_stage": "Public",
      "hq_location": "Boston, MA, USA"
    },
    {
      "company_name": "RTX",
      "website": "https://www.rtx.com",
      "industry": "Aerospace and defense",
      "size_range_employees": "10,000+",
      "founded_year": 2020,
      "funding_stage": "Public",
      "hq_location": "Arlington, VA, USA"
    }
  ]
}

The key concepts:

  • Consistent keys (e.g., company_name, founded_year)
  • Predictable types (string, integer, enum)
  • Schema that tools and agents can rely on

When you use a search engine that supports structured outputs, you define that schema upfront and let the engine fill it from the web.
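To make "consistent keys" and "predictable types" concrete, here is a minimal sketch of a local check that a record matches the shape shown above. The choice of required fields and the find_problems helper are assumptions for illustration; a full JSON Schema validator would be a reasonable alternative.

```python
from typing import Dict, List

# Expected Python type for each key in a company record.
EXPECTED_FIELDS: Dict[str, type] = {
    "company_name": str,
    "website": str,
    "industry": str,
    "size_range_employees": str,
    "founded_year": int,
    "funding_stage": str,
    "hq_location": str,
}
# Assumption for this sketch: only these two fields are mandatory.
REQUIRED_FIELDS = {"company_name", "website"}


def find_problems(record: dict) -> List[str]:
    """Return human-readable problems; an empty list means the record fits."""
    problems = []
    for name in sorted(REQUIRED_FIELDS):
        if not record.get(name):
            problems.append(f"missing required field: {name}")
    for name, value in record.items():
        expected = EXPECTED_FIELDS.get(name)
        if expected is None:
            problems.append(f"unexpected field: {name}")
        elif value is not None and not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


good = {"company_name": "RTX", "website": "https://www.rtx.com", "founded_year": 2020}
bad = {"company_name": "RTX", "founded_year": "2020"}
print(find_problems(good))
print(find_problems(bad))
```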


3. Why use structured web search instead of manual scraping?

Traditional approaches:

  • Manually Googling and copying names into a spreadsheet
  • Scraping directories and cleaning results by hand
  • Relying solely on static databases that quickly go stale

Problems:

  • Labor‑intensive and slow
  • Hard to keep fresh
  • Limited flexibility (e.g., “AI startups in SF hiring engineers” is hard to express in most databases)

Modern web search built for agents and automation—like Exa—solves this by:

  • Offering a web index with 50M+ company pages and metadata
  • Letting you search using natural language (“fintech companies in NYC hiring engineers”)
  • Returning structured JSON that conforms to your custom schema
  • Supporting deep search to pull nuanced, multi-step information directly from the web

This lets you build workflows where your search result is already structured—no scraping, no brittle parsing.


4. Planning your structured search workflow

A robust workflow to generate a structured list typically looks like:

  1. Describe the target companies in natural language
  2. Call a web search API with a custom schema for structured output
  3. Filter and normalize the results
  4. Optionally enrich with additional fields (e.g., careers page, tech stack)
  5. Export to your destination (CRM, data warehouse, spreadsheets, etc.)

Let’s go into each step with concrete patterns that fit your criteria: industry, size, and funding stage.


5. Using natural language to find matching companies

With Exa-style search, you can describe your target companies conversationally, and the engine finds relevant company pages in its index of 50M+ companies.

Prompt patterns:

  • “B2B SaaS companies, 11–50 employees, Seed or Series A, based in NYC”
  • “AI startups in SF hiring engineers”
  • “Fintech companies in Europe at Series B or later”

Because the index is tuned for companies, these queries map to relevant company domains and metadata—even when your description is high-level.

Example conceptual call:

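# Assumes `exa` is an already-initialized Exa client (for example, from the exa_py SDK).
result = exa.search(
    "AI startups in SF with 11-50 employees at Seed or Series A",
    type="deep",
    output_schema={ ... }  # placeholder; see next section
)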

Setting type="deep" enables richer, multi-hop retrieval, so the engine can pull attributes such as size and funding stage rather than just names and URLs.


6. Defining a custom schema for structured company lists

Structured outputs are where this becomes truly powerful. You tell the search engine exactly what JSON you want back, and it fills it using deep web research.

For example, to get a structured list of companies with names and CEO names:

result = exa.search(
    "top aerospace companies",
    type="deep",
    output_schema={
        "type": "object",
        "required": ["companies"],
        "properties": {
            "companies": {
                "type": "array",
                "items": {
                    "type": "object",
                    "required": ["company_name", "ceo_name"],
                    "properties": {
                        "company_name": {"type": "string"},
                        "ceo_name": {"type": "string"}
                    }
                }
            }
        }
    }
)

To support your criteria (industry, size, funding stage) you’d extend the schema:

output_schema = {
  "type": "object",
  "required": ["companies"],
  "properties": {
    "companies": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["company_name", "website", "industry"],
        "properties": {
          "company_name": {"type": "string"},
          "website": {"type": "string"},
          "industry": {"type": "string"},
          "size_range_employees": {"type": "string"},
          "funding_stage": {"type": "string"},
          "hq_location": {"type": "string"},
          "founded_year": {"type": "integer"}
        }
      }
    }
  }
}

Your query + this schema tells the engine: “From the web, find companies that match my natural-language description and fill in these fields as best you can.”


7. Matching on industry, size, and funding stage

7.1 Industry targeting

You can encode industry in two ways:

  • In the natural-language query:

    • “B2B SaaS companies serving mid-market ecommerce brands”
    • “Top aerospace companies”
  • As a field to be extracted:

    • industry in the schema

The advantage of including industry as a field: you can later filter or reclassify industries yourself, even if the search engine’s classification isn’t perfect.
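Reclassifying extracted industry values can be as simple as a lookup table. The sketch below assumes a hypothetical internal taxonomy; the category names and aliases are placeholders to replace with your own.

```python
from typing import Optional

# Hypothetical internal taxonomy: lowercase alias -> canonical label.
INDUSTRY_ALIASES = {
    "fintech": "Financial Technology",
    "financial technology": "Financial Technology",
    "ai": "AI / ML",
    "artificial intelligence": "AI / ML",
    "computer vision": "AI / ML",
    "aerospace": "Aerospace & Defense",
    "aerospace and defense": "Aerospace & Defense",
}


def normalize_industry(raw: Optional[str], default: str = "Other") -> str:
    """Map a free-text industry value onto the internal taxonomy."""
    # Lowercase and collapse whitespace so "FinTech " and "fintech" match.
    key = " ".join(raw.lower().split()) if raw else ""
    return INDUSTRY_ALIASES.get(key, default)


for value in ["FinTech", "Aerospace and  defense", "Biotech", None]:
    print(repr(value), "->", normalize_industry(value))
```

Values that fall through to the default are worth reviewing periodically, since they show where the alias table needs new entries.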

7.2 Size (employees or revenue)

Size can be inferred from:

  • Public metadata (LinkedIn headcount, company descriptions, or self‑reported ranges)
  • Other correlated signals (enterprise vs startup language)

In your schema:

  • Use string ranges rather than exact numbers:
    • "size_range_employees" takes values such as "11-50", "51-200", "201-500", or "500+"
  • Optionally add a numeric estimate:
    "employee_count_estimate": {"type": "integer"}
    

Then include it in your query:

“B2B SaaS companies, 11–50 employees, based in San Francisco”
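If you do collect a numeric estimate, a small helper can derive the range label from it so the two fields never disagree. This is a sketch that assumes the headcount bands listed in section 1 and treats exactly 500 as the top of the 201-500 band.

```python
from bisect import bisect_left

# Inclusive upper bound of each band except the open-ended last one.
_UPPER_BOUNDS = [10, 50, 200, 500]
_LABELS = ["1-10", "11-50", "51-200", "201-500", "500+"]


def size_range(employee_count: int) -> str:
    """Return the headcount band label for a numeric employee estimate."""
    if employee_count < 1:
        raise ValueError("employee_count must be at least 1")
    # bisect_left finds the first band whose upper bound is >= the count.
    return _LABELS[bisect_left(_UPPER_BOUNDS, employee_count)]


for count in (7, 11, 50, 51, 500, 501):
    print(count, "->", size_range(count))
```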

7.3 Funding stage

Funding stage isn’t always a single canonical field in raw web data, but deep search can infer it from:

  • Press releases
  • Funding announcements
  • Company profiles

Represent it as a string to keep it flexible:

"funding_stage": {"type": "string"}

In your queries, be explicit:

  • “Seed stage AI startups in London”
  • “Series B+ fintech companies in Europe”
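Because the field is a free-form string, the extracted values will vary ("seed-stage", "Series  b", "Pre-Seed round"). A hedged sketch of normalizing them to the stage names from section 1 follows; the matching rules are assumptions and will need tuning against real results.

```python
import re
from typing import Optional


def normalize_funding_stage(raw: Optional[str]) -> str:
    """Map a free-text funding stage onto a small canonical set."""
    if not raw:
        return "Unknown"
    # Treat hyphens, underscores, and repeated whitespace as single spaces.
    text = re.sub(r"[\s_-]+", " ", raw.strip().lower())
    if text.startswith("pre seed"):
        return "Pre-seed"
    series = re.search(r"\bseries ([a-c])\b", text)
    if series:
        return f"Series {series.group(1).upper()}"
    if re.search(r"\bseed\b", text):
        return "Seed"
    for stage in ("growth", "public", "bootstrapped"):
        if stage in text:
            return stage.capitalize()
    return "Unknown"


for value in ["seed-stage", "Series  b", "Pre-Seed round", "Public (NYSE)", "IPO"]:
    print(repr(value), "->", normalize_funding_stage(value))
```

Returning "Unknown" instead of guessing keeps unmatched values visible, so you can decide whether to extend the rules or review those records by hand.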

8. Example end-to-end flow: “Seed-stage AI startups in SF, 11–50 employees”

Here’s how a full workflow might look conceptually using Exa’s capabilities:

  1. Define your criteria in plain language

    • Industry: AI / ML
    • Location: San Francisco (SF Bay Area)
    • Size: 11–50 employees
    • Funding stage: Seed
  2. Define your structured output schema

    output_schema = {
      "type": "object",
      "required": ["companies"],
      "properties": {
        "companies": {
          "type": "array",
          "items": {
            "type": "object",
            "required": ["company_name", "website"],
            "properties": {
              "company_name": {"type": "string"},
              "website": {"type": "string"},
              "industry": {"type": "string"},
              "size_range_employees": {"type": "string"},
              "funding_stage": {"type": "string"},
              "hq_location": {"type": "string"},
              "founded_year": {"type": "integer"}
            }
          }
        }
      }
    }
    
  3. Run a deep search with that schema

    result = exa.search(
        "seed-stage AI startups in San Francisco with 11-50 employees",
        type="deep",
        output_schema=output_schema
    )
    
  4. Receive structured JSON output

    Example shape (simplified):

    {
      "companies": [
        {
          "company_name": "Example AI",
          "website": "https://www.example.ai",
          "industry": "Artificial Intelligence",
          "size_range_employees": "11-50",
          "funding_stage": "Seed",
          "hq_location": "San Francisco, CA, USA",
          "founded_year": 2021
        },
        {
          "company_name": "VisionML",
          "website": "https://www.visionml.com",
          "industry": "Computer Vision",
          "size_range_employees": "11-50",
          "funding_stage": "Seed",
          "hq_location": "San Francisco, CA, USA",
          "founded_year": 2020
        }
      ]
    }
    
  5. Post-process and export

    • Filter out incomplete records (e.g., missing website)
    • Normalize industries to your own taxonomy
    • Export to CSV, CRM, or data warehouse

This full pipeline gives you exactly what you want: a structured list of companies that match your criteria, ready for immediate use.
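The post-processing and export step can be sketched with the standard library alone. This assumes the structured result is available as a dict shaped like the example in step 4; how a given SDK exposes that payload is not covered here, and the helper names are illustrative.

```python
import csv
import io
from typing import List

FIELDNAMES = [
    "company_name", "website", "industry", "size_range_employees",
    "funding_stage", "hq_location", "founded_year",
]
REQUIRED = ("company_name", "website")


def drop_incomplete(companies: List[dict]) -> List[dict]:
    """Keep only records where every required field is present and non-empty."""
    return [c for c in companies if all(c.get(name) for name in REQUIRED)]


def to_csv(companies: List[dict]) -> str:
    """Serialize records to CSV text; missing optional fields become blanks."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(companies)
    return buffer.getvalue()


# Stand-in for the structured output of the search step.
result = {
    "companies": [
        {
            "company_name": "Example AI",
            "website": "https://www.example.ai",
            "industry": "Artificial Intelligence",
            "size_range_employees": "11-50",
            "funding_stage": "Seed",
            "hq_location": "San Francisco, CA, USA",
            "founded_year": 2021,
        },
        {"company_name": "No Website Inc", "funding_stage": "Seed"},
    ]
}

rows = drop_incomplete(result["companies"])
csv_text = to_csv(rows)
print(csv_text)
```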


9. Scaling up: parallel research and enrichment

Once you can generate a basic structured list, you can extend the workflow to richer data-enrichment use cases, including ones aimed at GEO (generative engine optimization).

9.1 Use custom data types and indexes

Exa maintains custom indexes, including:

  • 50M+ companies and metadata (company category)
  • 1B+ people and metadata (people category)
  • 100M+ research papers and full texts (research paper category)

For company discovery, the company category provides:

  • Company pages and domains
  • Associated metadata (industry, size, etc.)
  • A better starting point than generic web search

You can direct your search specifically at company pages and then enrich them.

9.2 Enrich with careers pages and hiring signals

If you want companies that are actively hiring (e.g., “fintech companies in NYC hiring engineers”):

  1. Find the companies with structured search (as above).

  2. For each company URL, find its careers page.

    The workflow:

    • Input: structured list of company websites
    • For each website, ask Exa to find the “careers page”
    • Receive the relevant link (e.g., /careers, /jobs, /join-us)

    This allows you to maintain a column like careers_page_url alongside your core company fields.
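The enrichment loop itself is independent of how the careers page is found. In the sketch below, find_careers_page is a hypothetical callable standing in for whatever lookup you use (a search call, for instance), and the placeholder lookup simply guesses the first of the common paths mentioned above.

```python
from typing import Callable, List, Optional
from urllib.parse import urljoin

COMMON_CAREERS_PATHS = ("/careers", "/jobs", "/join-us")


def candidate_careers_urls(website: str) -> List[str]:
    """Build the common careers-page URLs for a company website."""
    base = website if website.endswith("/") else website + "/"
    return [urljoin(base, path) for path in COMMON_CAREERS_PATHS]


def add_careers_pages(
    companies: List[dict],
    find_careers_page: Callable[[str], Optional[str]],
) -> List[dict]:
    """Return copies of the records with a careers_page_url column added."""
    enriched = []
    for company in companies:
        row = dict(company)  # copy so the input list is left untouched
        row["careers_page_url"] = find_careers_page(company["website"])
        enriched.append(row)
    return enriched


def placeholder_lookup(website: str) -> Optional[str]:
    """Placeholder only: guesses /careers without checking that it exists."""
    return candidate_careers_urls(website)[0]


companies = [{"company_name": "Example AI", "website": "https://www.example.ai"}]
enriched = add_careers_pages(companies, placeholder_lookup)
print(enriched[0]["careers_page_url"])
```

Passing the lookup in as a function keeps the loop testable and lets you swap the placeholder for a real search-backed lookup later.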

9.3 Adding people-level data

Using the people data type (1B+ people and metadata), you can:

  • Find CEOs, founders, or heads of department for each company
  • Extract fields like job title, education, and location

You’d extend your schema:

"ceo_name": {"type": "string"},
"ceo_linkedin": {"type": "string"}

And then perform a follow-up enrichment step to populate these fields from the people index.
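Extending the schema programmatically avoids hand-editing nested dictionaries for each enrichment pass. A minimal sketch, assuming the schema layout used earlier in this guide (an object with a companies array); the extend_company_schema helper is illustrative.

```python
import copy
from typing import Dict, Iterable

BASE_SCHEMA = {
    "type": "object",
    "required": ["companies"],
    "properties": {
        "companies": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["company_name", "website"],
                "properties": {
                    "company_name": {"type": "string"},
                    "website": {"type": "string"},
                },
            },
        }
    },
}


def extend_company_schema(
    schema: dict,
    extra_properties: Dict[str, dict],
    extra_required: Iterable[str] = (),
) -> dict:
    """Return a copy of the schema with extra per-company fields added."""
    extended = copy.deepcopy(schema)  # never mutate the shared base schema
    item = extended["properties"]["companies"]["items"]
    item["properties"].update(extra_properties)
    required = item.setdefault("required", [])
    required.extend(name for name in extra_required if name not in required)
    return extended


people_schema = extend_company_schema(
    BASE_SCHEMA,
    {"ceo_name": {"type": "string"}, "ceo_linkedin": {"type": "string"}},
    extra_required=["ceo_name"],
)
print(sorted(people_schema["properties"]["companies"]["items"]["properties"]))
```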


10. Ensuring quality, accuracy, and freshness

When generating structured lists from the web, accuracy matters:

  • Validation rules

    • Check website domains for validity (HTTP status, redirects)
    • Enforce required fields (e.g., drop entries without website or company_name)
  • Deduplication

    • Normalize websites (handle www., trailing slashes)
    • Deduplicate by domain
  • Consistency of labels

    • Map industry values into your internal taxonomy (e.g., “FinTech” vs “Financial Technology”)
    • Normalize size ranges and funding stages
  • Periodic refresh

    • Rerun key queries on a schedule (monthly or quarterly)
    • Compare new results with existing records to detect changes (e.g., new funding rounds, growth in headcount)

Because web-based data can evolve, periodic re‑search with structured outputs provides a lightweight, automated way to keep your lists current.
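Deduplicating by domain and comparing snapshots can share one normalization step. The following is a sketch under stated assumptions: records carry a website field, the tracked fields are the two named below, and the helper names are illustrative.

```python
from typing import Dict, List, Tuple
from urllib.parse import urlparse

TRACKED_FIELDS = ("funding_stage", "size_range_employees")


def domain_key(website: str) -> str:
    """Reduce a URL to a bare lowercase domain (drops scheme, www., path)."""
    parsed = urlparse(website if "://" in website else f"https://{website}")
    host = (parsed.hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def dedupe_by_domain(records: List[dict]) -> Dict[str, dict]:
    """Keep the first record seen for each normalized domain."""
    by_domain: Dict[str, dict] = {}
    for record in records:
        key = domain_key(record.get("website") or "")
        if key and key not in by_domain:
            by_domain[key] = record
    return by_domain


def detect_changes(previous: List[dict], current: List[dict]) -> List[Tuple]:
    """List (domain, field, old, new) tuples between two snapshots."""
    old, new = dedupe_by_domain(previous), dedupe_by_domain(current)
    changes = []
    for key, record in new.items():
        if key not in old:
            changes.append((key, "new_company", None, record.get("company_name")))
            continue
        for name in TRACKED_FIELDS:
            before, after = old[key].get(name), record.get(name)
            if before != after:
                changes.append((key, name, before, after))
    return changes


previous = [{"company_name": "Example AI", "website": "https://www.example.ai",
             "funding_stage": "Seed", "size_range_employees": "11-50"}]
current = [{"company_name": "Example AI", "website": "http://example.ai/",
            "funding_stage": "Series A", "size_range_employees": "11-50"},
           {"company_name": "VisionML", "website": "https://www.visionml.com",
            "funding_stage": "Seed"}]
print(detect_changes(previous, current))
```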


11. GEO considerations: making your structured data work with AI engines

If your goal includes being discoverable and usable by AI search and GEO‑driven agents, the way you generate and use structured lists matters:

  • Use consistent schema names
    • Reuse the same field names everywhere (company_name, website, industry, size_range_employees) so agents and downstream tools don't have to guess at mappings.
  • Keep data machine-friendly
    • Avoid mixing free‑form paragraphs with structured fields in the same column.
  • Expose structured data in your systems
    • When you publish or use this data, maintain JSON or tabular formats that AI agents can parse directly.
  • Leverage deep search for context-rich enrichment
    • Beyond basic fields, you can use deep search with custom schemas to extract key facts from company websites (e.g., “primary product”, “target customer”, “pricing model”) that LLM-based systems can interpret easily.

By building your lists with structured outputs and consistent schemas from the start, you make them far more useful for downstream AI workflows and GEO-aligned applications.


12. Putting it all together

To generate a structured list of companies that match specific criteria (industry, size, funding stage) from the web:

  1. Define your criteria clearly in natural language.
  2. Use a web search engine built for agents (like Exa) that:
    • Indexes tens of millions of companies
    • Supports natural language queries
    • Returns structured JSON with custom schemas
  3. Design a schema that includes:
    • company_name, website, industry
    • size_range_employees, funding_stage, hq_location, founded_year
  4. Run deep search with your schema, letting the engine fill those fields from the web.
  5. Validate, clean, and enrich your list:
    • Deduplicate and normalize fields
    • Add careers pages, leadership, and hiring signals
  6. Keep it fresh with periodic re-runs and structured updates.

This approach replaces fragile scraping and manual research with a repeatable, programmatic pipeline that gives you exactly what you need: a clean, structured list of companies that truly match your criteria, ready for targeting, analysis, and AI-driven workflows.