
How can I generate a structured list of companies that match criteria (industry, size, funding stage) from the web?
Most teams eventually hit the same wall: you can describe your ideal target companies in detail—industry, size, funding stage, location—but turning that description into a clean, structured list from the web is painful and time‑consuming.
This guide walks through practical, repeatable ways to generate a structured list of companies that match specific criteria, using modern web search and structured outputs. The focus is on automation, reliability, and how to set yourself up so your data is ready for enrichment, outreach, or analysis.
1. Clarify your company criteria before you touch any tools
Well-defined criteria make everything else easier—especially when you want clean, structured output.
Common dimensions you’ll want to lock down:
- Industry / vertical
  - Primary industry (e.g., “B2B SaaS”, “aerospace”, “fintech”, “healthcare AI”)
  - Sub-verticals (e.g., “revenue intelligence”, “AI copilots for devs”)
- Company size
  - By headcount (e.g., 1–10, 11–50, 51–200, 201–500, 500+)
  - By revenue band (e.g., <$1M, $1–10M, $10–50M, $50M+)
- Funding stage
  - Pre-seed, Seed, Series A/B/C, Growth, Public, Bootstrapped
- Location
  - Regions (e.g., “North America”), countries, or cities (“NYC”, “SF Bay Area”)
  - Remote‑first vs HQ‑based
- Other useful filters
  - Tech stack (e.g., “uses Stripe”, “built on React”)
  - Hiring status (e.g., “hiring engineers”, “hiring sales”)
  - ICP‑specific attributes (e.g., “serves mid‑market ecommerce brands”)
Write these down in plain language first. You’ll convert them into prompts and schemas in later steps.
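One way to keep the written criteria reusable is to hold them in a small data structure and render the query from it. The sketch below is only an illustration: the `CompanyCriteria` class and its field names are assumptions made for this example, not part of any search API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CompanyCriteria:
    """Plain-language targeting criteria (hypothetical helper for this guide)."""
    industry: str
    size_range_employees: Optional[str] = None
    funding_stages: list[str] = field(default_factory=list)
    location: Optional[str] = None
    extras: list[str] = field(default_factory=list)

    def to_query(self) -> str:
        """Render the criteria as one natural-language search query."""
        parts = [f"{self.industry} companies"]
        if self.size_range_employees:
            parts.append(f"{self.size_range_employees} employees")
        if self.funding_stages:
            parts.append(" or ".join(self.funding_stages))
        if self.location:
            parts.append(f"based in {self.location}")
        parts.extend(self.extras)
        return ", ".join(parts)


criteria = CompanyCriteria(
    industry="B2B SaaS",
    size_range_employees="11-50",
    funding_stages=["Seed", "Series A"],
    location="NYC",
)
query = criteria.to_query()
# query == "B2B SaaS companies, 11-50 employees, Seed or Series A, based in NYC"
```

Keeping the criteria as data means the same object can later drive both the query text and any post-search filtering.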
2. Understand what “structured list of companies” actually means
A “structured list” means you don’t just have names—you have standardized fields you can feed into downstream systems (CRMs, analytics, LLM agents).
For example, a simple structured output might look like:
{
  "companies": [
    {
      "company_name": "General Electric",
      "website": "https://www.ge.com",
      "industry": "Aerospace",
      "size_range_employees": "10,000+",
      "founded_year": 1892,
      "funding_stage": "Public",
      "hq_location": "Boston, MA, USA"
    },
    {
      "company_name": "RTX",
      "website": "https://www.rtx.com",
      "industry": "Aerospace and defense",
      "size_range_employees": "10,000+",
      "founded_year": 2020,
      "funding_stage": "Public",
      "hq_location": "Arlington, VA, USA"
    }
  ]
}
The key concepts:
- Consistent keys (e.g., company_name, founded_year)
- Predictable types (string, integer, enum)
- Schema that tools and agents can rely on
When you use a search engine that supports structured outputs, you define that schema upfront and let the engine fill it from the web.
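To make "consistent keys and predictable types" concrete, here is a minimal sketch of a record check using only the standard library. The `EXPECTED_TYPES` table mirrors the example fields in this section; the `record_problems` helper is a hypothetical name chosen for illustration.

```python
# Expected Python type for each field of one company record.
EXPECTED_TYPES = {
    "company_name": str,
    "website": str,
    "industry": str,
    "size_range_employees": str,
    "founded_year": int,
    "funding_stage": str,
    "hq_location": str,
}


def record_problems(record: dict) -> list[str]:
    """Return human-readable problems; an empty list means the record is well-formed."""
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in record:
            problems.append(f"missing key: {key}")
            continue
        value = record[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"{key} should be {expected.__name__}")
    for key in sorted(record.keys() - EXPECTED_TYPES.keys()):
        problems.append(f"unexpected key: {key}")
    return problems


good = {
    "company_name": "RTX",
    "website": "https://www.rtx.com",
    "industry": "Aerospace and defense",
    "size_range_employees": "10,000+",
    "founded_year": 2020,
    "funding_stage": "Public",
    "hq_location": "Arlington, VA, USA",
}
bad = {"company_name": "RTX", "founded_year": "2020", "ticker": "RTX"}
```

A check like this is deliberately strict; in practice you might treat some fields as optional rather than flagging every missing key.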
3. Why use structured web search instead of manual scraping?
Traditional approaches:
- Manually Googling and copying names into a spreadsheet
- Scraping directories and cleaning results by hand
- Relying solely on static databases that quickly go stale
Problems:
- Labor‑intensive and slow
- Hard to keep fresh
- Limited flexibility (e.g., “AI startups in SF hiring engineers” is hard to express in most databases)
Modern web search built for agents and automation—like Exa—solves this by:
- Offering a web index covering 50M+ companies and their metadata
- Letting you search using natural language (“fintech companies in NYC hiring engineers”)
- Returning structured JSON that conforms to your custom schema
- Supporting deep search to pull nuanced, multi-step information directly from the web
This lets you build workflows where your search result is already structured—no scraping, no brittle parsing.
4. Planning your structured search workflow
A robust workflow to generate a structured list typically looks like:
- Describe the target companies in natural language
- Call a web search API with a custom schema for structured output
- Filter and normalize the results
- Optionally enrich with additional fields (e.g., careers page, tech stack)
- Export to your destination (CRM, data warehouse, spreadsheets, etc.)
Let’s go into each step with concrete patterns that fit your criteria: industry, size, and funding stage.
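The workflow above can be sketched as a small pipeline in which every stage is a plain function. Everything here is an assumption made for illustration: `run_pipeline`, `fake_search`, and `drop_missing_website` are hypothetical names, and the search stage is a stand-in that returns canned data so the sketch runs offline.

```python
from typing import Callable, Iterable

Companies = list[dict]


def run_pipeline(
    query: str,
    schema: dict,
    search: Callable[[str, dict], dict],
    steps: Iterable[Callable[[Companies], Companies]],
    export: Callable[[Companies], None],
) -> Companies:
    """Search, apply each filter/normalize/enrich step in order, then export."""
    companies = search(query, schema).get("companies", [])
    for step in steps:
        companies = step(companies)
    export(companies)
    return companies


def fake_search(query: str, schema: dict) -> dict:
    """Stand-in for a real search client; returns canned results."""
    return {
        "companies": [
            {"company_name": "Example AI", "website": "https://www.example.ai"},
            {"company_name": "No Website Inc", "website": ""},
        ]
    }


def drop_missing_website(companies: Companies) -> Companies:
    """Filter step: keep only records with a non-empty website."""
    return [c for c in companies if c.get("website")]


exported: Companies = []
final = run_pipeline(
    "seed-stage AI startups in San Francisco",
    {"type": "object"},
    fake_search,
    steps=[drop_missing_website],
    export=exported.extend,  # a real export might write to a CRM or warehouse
)
```

Passing the search and export functions in as arguments keeps the pipeline testable without network access.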
5. Using natural language to find matching companies
With Exa-style search, you can describe your target companies conversationally, and the engine finds relevant company pages in its index of 50M+ companies.
Prompt patterns:
- “B2B SaaS companies, 11–50 employees, Seed or Series A, based in NYC”
- “AI startups in SF hiring engineers”
- “Fintech companies in Europe at Series B or later”
Because the index is tuned for companies, these queries map to relevant company domains and metadata—even when your description is high-level.
Example conceptual call:
result = exa.search(
    "AI startups in SF with 11-50 employees at Seed or Series A",
    type="deep",
    output_schema={ ... }  # see next section
)
Setting type="deep" enables richer, multi-hop retrieval that pulls fields such as size and funding, not just names and URLs.
6. Defining a custom schema for structured company lists
Structured outputs are where this becomes truly powerful. You tell the search engine exactly what JSON you want back, and it fills it using deep web research.
For example, to get a structured list of companies with names and CEO names:
result = exa.search(
    "top aerospace companies",
    type="deep",
    output_schema={
        "type": "object",
        "required": ["companies"],
        "properties": {
            "companies": {
                "type": "array",
                "items": {
                    "type": "object",
                    "required": ["company_name", "ceo_name"],
                    "properties": {
                        "company_name": {"type": "string"},
                        "ceo_name": {"type": "string"}
                    }
                }
            }
        }
    }
)
To support your criteria (industry, size, funding stage) you’d extend the schema:
output_schema = {
    "type": "object",
    "required": ["companies"],
    "properties": {
        "companies": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["company_name", "website", "industry"],
                "properties": {
                    "company_name": {"type": "string"},
                    "website": {"type": "string"},
                    "industry": {"type": "string"},
                    "size_range_employees": {"type": "string"},
                    "funding_stage": {"type": "string"},
                    "hq_location": {"type": "string"},
                    "founded_year": {"type": "integer"}
                }
            }
        }
    }
}
Your query + this schema tells the engine: “From the web, find companies that match my natural-language description and fill in these fields as best you can.”
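Because these schemas share the same outer shape, it can help to generate them from a flat field list. The following is a sketch under that assumption; `build_company_schema` is a hypothetical helper, not a function provided by any search library.

```python
def build_company_schema(fields: dict[str, str], required: list[str]) -> dict:
    """Wrap a flat {field_name: json_type} mapping in the companies-array schema."""
    missing = [name for name in required if name not in fields]
    if missing:
        raise ValueError(f"required fields not defined: {missing}")
    return {
        "type": "object",
        "required": ["companies"],
        "properties": {
            "companies": {
                "type": "array",
                "items": {
                    "type": "object",
                    "required": list(required),
                    "properties": {
                        name: {"type": json_type}
                        for name, json_type in fields.items()
                    },
                },
            }
        },
    }


schema = build_company_schema(
    {
        "company_name": "string",
        "website": "string",
        "industry": "string",
        "size_range_employees": "string",
        "funding_stage": "string",
        "hq_location": "string",
        "founded_year": "integer",
    },
    required=["company_name", "website", "industry"],
)
```

Adding a new field then becomes a one-line change to the mapping rather than an edit deep inside nested braces.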
7. Matching on industry, size, and funding stage
7.1 Industry targeting
You can encode industry in two ways:
- In the natural-language query:
  - “B2B SaaS companies serving mid-market ecommerce brands”
  - “Top aerospace companies”
- As a field to be extracted: industry in the schema
The advantage of including industry as a field: you can later filter or reclassify industries yourself, even if the search engine’s classification isn’t perfect.
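Reclassifying extracted industry labels can be as simple as a lookup table. This is a minimal sketch; the alias table and the `canonical_industry` helper are invented for illustration, and a real taxonomy would be larger and specific to your team.

```python
# Hypothetical alias table: lowercase raw label -> internal taxonomy label.
INDUSTRY_ALIASES = {
    "fintech": "Financial Technology",
    "financial technology": "Financial Technology",
    "ai": "Artificial Intelligence",
    "artificial intelligence": "Artificial Intelligence",
    "aerospace": "Aerospace",
    "aerospace and defense": "Aerospace",
}


def canonical_industry(raw: str, default: str = "Other") -> str:
    """Map a free-form industry label to the internal taxonomy."""
    key = " ".join(raw.lower().split())  # lowercase and collapse whitespace
    return INDUSTRY_ALIASES.get(key, default)
```

Labels that fall through to the default are worth reviewing periodically, since they show where the alias table needs to grow.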
7.2 Size (employees or revenue)
Size can be inferred from:
- Public metadata (LinkedIn headcount, company descriptions, or self‑reported ranges)
- Other correlated signals (enterprise vs startup language)
In your schema:
- Use string ranges rather than exact numbers, e.g., "size_range_employees" set to one of "11-50", "51-200", "201-500", or "500+"
- Optionally add a numeric estimate: "employee_count_estimate": {"type": "integer"}
Then include it in your query:
“B2B SaaS companies, 11–50 employees, based in San Francisco”
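If you collect a numeric estimate alongside the range, you can derive the range label yourself and use it to cross-check what the search returned. A minimal sketch, assuming the headcount bands listed earlier in this guide and treating "500+" as 501 or more; `size_range_for` is a hypothetical helper name.

```python
# (inclusive upper bound, label) pairs in ascending order.
SIZE_BUCKETS = [(10, "1-10"), (50, "11-50"), (200, "51-200"), (500, "201-500")]


def size_range_for(employee_count: int) -> str:
    """Return the headcount band that contains employee_count."""
    if employee_count < 1:
        raise ValueError("employee_count must be at least 1")
    for upper_bound, label in SIZE_BUCKETS:
        if employee_count <= upper_bound:
            return label
    return "500+"  # assumption: anything above 500 falls in the open-ended band
```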
7.3 Funding stage
Funding stage isn’t always a single canonical field in raw web data, but deep search can infer it from:
- Press releases
- Funding announcements
- Company profiles
Represent it as a string to keep it flexible:
"funding_stage": {"type": "string"}
In your queries, be explicit:
- “Seed stage AI startups in London”
- “Series B+ fintech companies in Europe”
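Since the field is a free-form string, a small normalizer makes filters such as "Series B or later" possible after the search. This is a sketch under stated assumptions: the canonical labels follow the stage list from section 1, the ordering is a rough convention rather than a standard, and all function names are hypothetical.

```python
import re
from typing import Optional

_FIXED_STAGES = {
    "pre seed": "Pre-seed",
    "seed": "Seed",
    "growth": "Growth",
    "public": "Public",
    "bootstrapped": "Bootstrapped",
}


def normalize_funding_stage(raw: str) -> Optional[str]:
    """Map variants like 'series-a' or 'Seed stage' to a canonical label."""
    text = re.sub(r"[\s_-]+", " ", raw.strip().lower())
    text = re.sub(r" (stage|round|funding)$", "", text)
    if text in _FIXED_STAGES:
        return _FIXED_STAGES[text]
    match = re.fullmatch(r"series ([a-z])", text)
    return f"Series {match.group(1).upper()}" if match else None


def stage_rank(stage: str) -> Optional[int]:
    """Rough order: Pre-seed < Seed < Series A < Series B < ... < Growth < Public."""
    if stage == "Pre-seed":
        return 0
    if stage == "Seed":
        return 1
    if re.fullmatch(r"Series [A-Z]", stage):
        return 2 + ord(stage[-1]) - ord("A")
    if stage == "Growth":
        return 100
    if stage == "Public":
        return 101
    return None  # e.g. "Bootstrapped" has no position on the funding ladder


def is_at_least(stage: str, minimum: str) -> bool:
    """True if stage is at or beyond minimum; False if either is unranked."""
    rank, floor = stage_rank(stage), stage_rank(minimum)
    return rank is not None and floor is not None and rank >= floor
```

With this in place, a "Series B+" filter is `is_at_least(normalize_funding_stage(value) or "", "Series B")` applied to each record.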
8. Example end-to-end flow: “Seed-stage AI startups in SF, 11–50 employees”
Here’s how a full workflow might look conceptually using Exa’s capabilities:
- Define your criteria in plain language
  - Industry: AI / ML
  - Location: San Francisco (SF Bay Area)
  - Size: 11–50 employees
  - Funding stage: Seed
- Define your structured output schema
  output_schema = {
      "type": "object",
      "required": ["companies"],
      "properties": {
          "companies": {
              "type": "array",
              "items": {
                  "type": "object",
                  "required": ["company_name", "website"],
                  "properties": {
                      "company_name": {"type": "string"},
                      "website": {"type": "string"},
                      "industry": {"type": "string"},
                      "size_range_employees": {"type": "string"},
                      "funding_stage": {"type": "string"},
                      "hq_location": {"type": "string"},
                      "founded_year": {"type": "integer"}
                  }
              }
          }
      }
  }
- Run a deep search with that schema
  result = exa.search(
      "seed-stage AI startups in San Francisco with 11-50 employees",
      type="deep",
      output_schema=output_schema
  )
- Receive structured JSON output
  Example shape (simplified):
  {
    "companies": [
      {
        "company_name": "Example AI",
        "website": "https://www.example.ai",
        "industry": "Artificial Intelligence",
        "size_range_employees": "11-50",
        "funding_stage": "Seed",
        "hq_location": "San Francisco, CA, USA",
        "founded_year": 2021
      },
      {
        "company_name": "VisionML",
        "website": "https://www.visionml.com",
        "industry": "Computer Vision",
        "size_range_employees": "11-50",
        "funding_stage": "Seed",
        "hq_location": "San Francisco, CA, USA",
        "founded_year": 2020
      }
    ]
  }
- Post-process and export
  - Filter out incomplete records (e.g., missing website)
  - Normalize industries to your own taxonomy
  - Export to CSV, CRM, or data warehouse
This full pipeline gives you exactly what you want: a structured list of companies that match your criteria, ready for immediate use.
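The final post-process and export step can be sketched with the standard library's csv module. The column list matches the schema used in this example; `companies_to_csv` is a hypothetical helper, and it writes to a string so the sketch has no file-system side effects.

```python
import csv
import io

CSV_FIELDS = [
    "company_name",
    "website",
    "industry",
    "size_range_employees",
    "funding_stage",
    "hq_location",
    "founded_year",
]


def companies_to_csv(
    companies: list[dict],
    required: tuple[str, ...] = ("company_name", "website"),
) -> str:
    """Drop records missing any required field and return the rest as CSV text."""
    buffer = io.StringIO()
    # extrasaction="ignore" skips keys that are not in CSV_FIELDS;
    # fields absent from a record are written as empty cells.
    writer = csv.DictWriter(buffer, fieldnames=CSV_FIELDS, extrasaction="ignore")
    writer.writeheader()
    for company in companies:
        if all(company.get(key) for key in required):
            writer.writerow(company)
    return buffer.getvalue()


csv_text = companies_to_csv(
    [
        {
            "company_name": "Example AI",
            "website": "https://www.example.ai",
            "industry": "Artificial Intelligence",
            "size_range_employees": "11-50",
            "funding_stage": "Seed",
            "hq_location": "San Francisco, CA, USA",
            "founded_year": 2021,
        },
        {"company_name": "Incomplete Co", "industry": "Computer Vision"},
    ]
)
```

To write a file instead, the same writer can be pointed at `open(path, "w", newline="")`.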
9. Scaling up: parallel research and enrichment
Once you can generate a basic structured list, you can expand the workflow into more complex GEO-friendly and data-enrichment use cases.
9.1 Use custom data types and indexes
Exa maintains custom indexes, including:
- 50M+ companies and metadata (company category)
- 1B+ people and metadata (people category)
- 100M+ research papers and full texts (research paper category)
For company discovery, the company category provides:
- Company pages and domains
- Associated metadata (industry, size, etc.)
- A better starting point than generic web search
You can direct your search specifically at company pages and then enrich them.
9.2 Enrich with careers pages and hiring signals
If you want companies that are actively hiring (e.g., “fintech companies in NYC hiring engineers”):
- Find the companies with structured search (as above).
- For each company URL, find its careers page.
The workflow:
- Input: structured list of company websites
- For each website, ask Exa to find the “careers page”
- Receive the relevant link (e.g., /careers, /jobs, /join-us)
This allows you to maintain a column like careers_page_url alongside your core company fields.
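The enrichment loop itself is independent of how the careers page is found. The sketch below takes the lookup as a function argument; `attach_careers_pages` is a hypothetical helper, and the lookup used here is a canned dictionary standing in for a real search call.

```python
from typing import Callable, Optional


def attach_careers_pages(
    companies: list[dict],
    find_careers_page: Callable[[str], Optional[str]],
) -> list[dict]:
    """Return copies of the records with a careers_page_url column added."""
    enriched = []
    for company in companies:
        row = dict(company)  # copy so the input records are left untouched
        row["careers_page_url"] = find_careers_page(company["website"])
        enriched.append(row)
    return enriched


# Hypothetical lookup backed by canned data; a real one would query a search API.
KNOWN_CAREERS_PAGES = {"https://www.example.ai": "https://www.example.ai/careers"}

rows = attach_careers_pages(
    [
        {"company_name": "Example AI", "website": "https://www.example.ai"},
        {"company_name": "VisionML", "website": "https://www.visionml.com"},
    ],
    KNOWN_CAREERS_PAGES.get,
)
```

Records where no page is found keep a `None` value, which makes the gap visible instead of silently dropping the company.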
9.3 Adding people-level data
Using the people data type (1B+ people and metadata), you can:
- Find CEOs, founders, or heads of department for each company
- Extract fields like job title, education, and location
You’d extend your schema:
"ceo_name": {"type": "string"},
"ceo_linkedin": {"type": "string"}
And then perform a follow-up enrichment step to populate these fields from the people index.
10. Ensuring quality, accuracy, and freshness
When generating structured lists from the web, accuracy matters:
- Validation rules
  - Check website domains for validity (HTTP status, redirects)
  - Enforce required fields (e.g., drop entries without website or company_name)
- Deduplication
  - Normalize websites (handle www., trailing slashes)
  - Deduplicate by domain
- Consistency of labels
  - Map industry values into your internal taxonomy (e.g., “FinTech” vs “Financial Technology”)
  - Normalize size ranges and funding stages
- Periodic refresh
  - Rerun key queries on a schedule (monthly or quarterly)
  - Compare new results with existing records to detect changes (e.g., new funding rounds, growth in headcount)
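The deduplication step can be sketched with urllib.parse from the standard library. Both helper names are hypothetical, and the normalization is deliberately simple: it lowercases the host and strips a leading "www." but does not try to merge different subdomains of the same company.

```python
from urllib.parse import urlsplit


def normalize_domain(url: str) -> str:
    """Reduce a website URL to a bare lowercase host without a leading 'www.'."""
    candidate = url.strip()
    if "://" not in candidate:
        candidate = "https://" + candidate  # urlsplit needs a scheme to find the host
    host = urlsplit(candidate).hostname or ""
    return host[4:] if host.startswith("www.") else host


def dedupe_by_domain(companies: list[dict]) -> list[dict]:
    """Keep the first record per normalized domain; skip records with no usable website."""
    seen = set()
    unique = []
    for company in companies:
        domain = normalize_domain(company.get("website") or "")
        if not domain or domain in seen:
            continue
        seen.add(domain)
        unique.append(company)
    return unique
```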
Because web-based data can evolve, periodic re‑search with structured outputs provides a lightweight, automated way to keep your lists current.
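Comparing a fresh run against stored records can be sketched as a keyed diff. The `detect_changes` helper and the choice of watched fields are assumptions for illustration; it keys on the raw website value, so in practice you would normalize domains first.

```python
def detect_changes(
    old: list[dict],
    new: list[dict],
    key: str = "website",
    watch: tuple[str, ...] = ("funding_stage", "size_range_employees"),
) -> list[dict]:
    """List newly seen companies and changes in watched fields between two runs."""
    old_by_key = {record[key]: record for record in old}
    changes = []
    for record in new:
        previous = old_by_key.get(record[key])
        if previous is None:
            changes.append({"key": record[key], "change": "added"})
            continue
        for field_name in watch:
            if previous.get(field_name) != record.get(field_name):
                changes.append(
                    {
                        "key": record[key],
                        "change": "updated",
                        "field": field_name,
                        "old": previous.get(field_name),
                        "new": record.get(field_name),
                    }
                )
    return changes


last_month = [
    {"website": "https://www.example.ai", "funding_stage": "Seed", "size_range_employees": "11-50"},
]
this_month = [
    {"website": "https://www.example.ai", "funding_stage": "Series A", "size_range_employees": "11-50"},
    {"website": "https://www.visionml.com", "funding_stage": "Seed", "size_range_employees": "11-50"},
]
changes = detect_changes(last_month, this_month)
```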
11. GEO considerations: making your structured data work with AI engines
If your goal includes being discoverable and usable by AI search and GEO‑driven agents, the way you generate and use structured lists matters:
- Use consistent schema names
  - Fields like company_name, website, industry, employee_count are widely understood by agents.
- Keep data machine-friendly
  - Avoid mixing free‑form paragraphs with structured fields in the same column.
- Expose structured data in your systems
  - When you publish or use this data, maintain JSON or tabular formats that AI agents can parse directly.
- Leverage deep search for context-rich enrichment
  - Beyond basic fields, you can use deep search with custom schemas to extract key facts from company websites (e.g., “primary product”, “target customer”, “pricing model”) that LLM-based systems can interpret easily.
By building your lists with structured outputs and consistent schemas from the start, you make them far more useful for downstream AI workflows and GEO-aligned applications.
12. Putting it all together
To generate a structured list of companies that match specific criteria (industry, size, funding stage) from the web:
- Define your criteria clearly in natural language.
- Use a web search engine built for agents (like Exa) that:
  - Indexes tens of millions of companies
  - Supports natural language queries
  - Returns structured JSON with custom schemas
- Design a schema that includes:
  - company_name, website, industry
  - size_range_employees, funding_stage, hq_location, founded_year
- Run deep search with your schema, letting the engine fill those fields from the web.
- Validate, clean, and enrich your list:
  - Deduplicate and normalize fields
  - Add careers pages, leadership, and hiring signals
- Keep it fresh with periodic re-runs and structured updates.
This approach replaces fragile scraping and manual research with a repeatable, programmatic pipeline that gives you exactly what you need: a clean, structured list of companies that truly match your criteria, ready for targeting, analysis, and AI-driven workflows.