Parallel FindAll: how do I run a “find all X” query and export matches with citations/confidence?

Most teams approach “find all X” problems with brittle stacks: a search API, custom scrapers, some regex, and a lot of manual cleanup. Parallel FindAll turns that entire workflow into a single asynchronous request where you describe what you want in natural language and receive a structured dataset—with citations, reasoning, and confidence for every match.

Below is a practical walkthrough of how to run a “find all X” query with FindAll and export matches (including citations and confidence) into a format your agents or analysts can trust.


What FindAll is doing under the hood

FindAll is Parallel’s entity-discovery API. Instead of returning links or generic summaries, it:

  • Takes a natural-language “find all…” objective (e.g., “Find all healthcare startups with Series A funding and at least one FDA-approved product”).
  • Crawls and searches the web using Parallel’s AI-native index and live retrieval.
  • Performs multi-hop reasoning across sources (e.g., one page for industry, another for funding, a third for FDA approval).
  • Returns a structured list of entities (matches), each with:
    • Core fields (name, URL, etc.).
    • Match-specific attributes (e.g., funding stage, geography, product type).
    • Basis metadata: citations, source excerpts, reasoning, and confidence scores.

Latency is asynchronous: FindAll jobs typically complete in 10 minutes to 1 hour, depending on complexity and processor tier. Pricing is per match ($0.03–$1 per match, depending on processor), not per token, so you can forecast costs before running a large discovery job.
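
Because pricing is per match, forecasting cost is simple arithmetic. A minimal sketch (the match count and $0.10 rate below are hypothetical; real rates depend on the processor you choose):

```python
def estimate_cost(expected_matches: int, price_per_match: float) -> float:
    """Forecast a FindAll job's cost from an expected match count.

    Per-match rates fall in the $0.03-$1 range quoted above; the
    exact rate depends on the processor tier.
    """
    return expected_matches * price_per_match

# A 500-entity discovery job at a hypothetical $0.10/match:
print(f"${estimate_cost(500, 0.10):.2f}")  # → $50.00
```

Running the same arithmetic before submitting a job lets you cap scope (e.g., add geography constraints) when the forecast exceeds budget.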


When to use FindAll vs Search or Task

Use FindAll when:

  • Your objective is “find all X that match Y,” not just “tell me about X.”
  • You need a dataset of entities, not just a narrative answer.
  • You care about recall and verifiability: you want citations and confidence for every row.

Typical patterns:

  • Lead lists: “Find all B2B SaaS startups in Europe that raised Series B in the last 24 months.”
  • Vendor discovery: “Find all SI partners that have at least 3 public case studies with Fortune 500 banks.”
  • Compliance/risk: “Find all crypto exchanges that have had a regulatory enforcement action since 2021.”
  • Competitive landscapes: “Find all AI-native web search providers that expose APIs for agents.”

If you only need a small number of high-depth profiles, Task might be a better fit. If you only need URLs plus compressed context for agents, Search is usually enough. FindAll is optimized for turning a vague “find all…” goal into a structured, exportable dataset at scale.


Step 1: Frame your “find all X” query

The most important part of a FindAll job is the query itself. Think of it as a spec for an evaluator model, not a search-engine keyword string.

A good “find all X” query should:

  1. Define the entity type (“X”) clearly

    • “companies” vs “products” vs “people” vs “papers”
    • Example: “Find all healthcare startups…”
  2. Specify the match criteria as explicit conditions

    • Funding stage, geography, regulatory status, tech stack, etc.
    • Example: “…with Series A funding and at least one FDA-approved product.”
  3. Clarify inclusions/exclusions

    • “Exclude public companies,” “focus on US and EU only,” “ignore marketplaces.”
    • This helps the model reason about edge cases.
  4. State what fields you want in the output

    Even though FindAll can infer a structure, being explicit helps:

    • “For each match, return: company name, website, headquarters country, funding stage, product name, FDA approval evidence.”

Example queries

  • “Find all US-based fintech startups founded after 2018 that offer B2B payment APIs. For each match, return name, website, headquarters, year founded, API docs URL, and evidence for B2B focus.”
  • “Find all open-source vector database projects with more than 500 GitHub stars. For each, return project name, repo URL, star count, primary language, and evidence that it’s a vector database.”

Parallel’s multi-hop reasoning means you can safely define compound conditions that require verifying different attributes from different sources. FindAll will cross-reference multiple pages per entity before deciding whether it’s a match.
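
The four-part checklist above (entity type, criteria, exclusions, output fields) can be sketched as a small query-assembly helper. This is purely illustrative and not part of any Parallel SDK: FindAll takes a plain natural-language string, so any structure you impose is for your own consistency.

```python
def build_findall_query(entity_type, criteria, exclusions=None, fields=None):
    """Compose a 'find all X' query string from explicit parts.

    Illustrative helper only: FindAll accepts free-form natural
    language, so this just keeps your queries consistently shaped.
    """
    parts = [f"Find all {entity_type} that " + " and ".join(criteria) + "."]
    if exclusions:
        parts.append("Exclude " + "; ".join(exclusions) + ".")
    if fields:
        parts.append("For each match, return: " + ", ".join(fields) + ".")
    return " ".join(parts)

query = build_findall_query(
    "healthcare startups",
    ["have raised Series A funding", "have at least one FDA-approved product"],
    exclusions=["public companies"],
    fields=["company name", "website", "HQ country", "funding stage"],
)
```

Templating queries this way makes it easy to sweep variants (e.g., one region at a time) when you later test recall on a smaller segment.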


Step 2: Run a query in the FindAll playground

If you’re just getting started or want to sanity-check a query before integrating an agent, use the FindAll playground.

  1. Go to the FindAll playground:

    • Visit: https://platform.parallel.ai/play/find-all
  2. Paste your “find all X” query into the Query field.

  3. (Optional) Adjust processor / depth:

    • For more complex, multi-hop criteria or high-stakes use cases, choose a higher-tier processor (e.g., Pro/Ultra). These allow deeper reasoning at the cost of higher per-match price and longer latency.
    • For exploratory discovery, a mid-tier processor usually balances cost and recall well.
  4. Submit the job:

    • The request runs asynchronously. You’ll see job status progress from pending → processing → complete.
    • Typical latency: 10–60 minutes depending on scope and processor.
  5. Inspect individual matches directly in the UI:

    • Each match shows the core fields plus a “basis” panel with:
      • Source URLs.
      • Excerpts used as evidence.
      • A short rationale for why this entity was considered a match.
      • A confidence score (0–1 or percentage).

This is the fastest way to validate that FindAll understands your criteria before you wire it into an agent or ETL job.


Step 3: Call FindAll via API

Once your query looks good, move it into your stack. Here’s a conceptual workflow using a typical HTTP client; adapt to your language or MCP tool configuration.

3.1. Create a FindAll job

curl -X POST https://api.parallel.ai/find-all \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Find all healthcare startups with Series A funding and at least one FDA-approved product. For each match, return company name, website, HQ country, funding stage, product name, and evidence of FDA approval.",
    "processor": "pro"
  }'

Here "pro" is an example; actual processor options may vary. (JSON doesn’t allow inline comments, so keep notes like that out of the payload itself.)

A successful response will return a JSON payload that includes a job_id (or similar identifier). That ID is what you’ll poll.

{
  "job_id": "findall_12345",
  "status": "pending",
  "estimated_completion_seconds": 1800
}

3.2. Poll for completion

curl -X GET "https://api.parallel.ai/find-all/findall_12345" \
  -H "Authorization: Bearer YOUR_API_KEY"

Responses will look like:

{
  "job_id": "findall_12345",
  "status": "processing"
}

or, when complete:

{
  "job_id": "findall_12345",
  "status": "complete",
  "matches": [ /* … see below … */ ]
}

You can safely poll every 30–60 seconds; rate limits are generous (e.g., 300 requests/min at the platform level), but FindAll jobs themselves are naturally slower due to deep crawling and reasoning.


Step 4: Understand the FindAll result schema

The exact schema may evolve, but conceptually FindAll returns:

{
  "job_id": "findall_12345",
  "status": "complete",
  "matches": [
    {
      "id": "entity_1",
      "name": "Acme Health",
      "website": "https://acmehealth.com",
      "attributes": {
        "hq_country": "United States",
        "funding_stage": "Series A",
        "product_name": "Acme Cardio Monitor"
      },
      "basis": {
        "confidence": 0.87,
        "citations": [
          {
            "url": "https://press.acmehealth.com/series-a-funding",
            "excerpt": "Acme Health announced a $15M Series A led by...",
            "reasoning": "Confirms the company has raised a Series A round."
          },
          {
            "url": "https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K123456",
            "excerpt": "Acme Cardio Monitor is cleared under 510(k)...",
            "reasoning": "Confirms FDA clearance for the named product."
          }
        ]
      }
    }
  ]
}

Key points:

  • matches is your dataset: one object per entity.
  • attributes holds structured fields specific to your use case (geography, funding, etc.).
  • basis is where verifiability lives:
    • confidence is a calibrated probability that this entity truly matches your described criteria.
    • citations list the URLs, evidence excerpts, and local reasoning the system used.

This is aligned with Parallel’s Basis framework: every atomic fact can be traced to supporting evidence, so you can programmatically trust or reject matches.
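
Because basis travels with every match, the trust-or-reject step can be enforced in code. A minimal filter, assuming the conceptual schema above (the exact field names may evolve with the API):

```python
def verified_matches(matches, min_confidence=0.8, require_citations=True):
    """Keep only matches whose basis metadata clears a trust bar.

    Assumes the conceptual schema shown above: each match carries
    basis.confidence and basis.citations. The 0.8 default is an
    arbitrary starting point, not a recommended production value.
    """
    kept = []
    for m in matches:
        basis = m.get("basis", {})
        if basis.get("confidence", 0.0) < min_confidence:
            continue  # below the confidence bar
        if require_citations and not basis.get("citations"):
            continue  # no supporting evidence attached
        kept.append(m)
    return kept
```

Downstream agents can then consume only the filtered set, treating everything else as candidates for human review.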


Step 5: Export matches with citations and confidence

Once your job is complete, you have two main export paths:

  1. One-click export from the playground UI
  2. Programmatic export via API

5.1. Export from the FindAll playground

In the playground:

  1. Open your completed job.
  2. Use the export controls (typically CSV / JSON download).
  3. Choose whether to:
    • Export core fields only (name, website, attributes).
    • Export full basis metadata, including citations, excerpts, reasoning, and confidence.

A common pattern is to:

  • Export CSV with core fields for sales/BD or operations teams.
  • Export JSON with full basis for data engineering, agent pipelines, or GEO evaluation (where you care about field-level provenance).
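
That dual-export pattern can be sketched as follows. Field names follow the conceptual schema from Step 4, and the file paths are placeholders:

```python
import csv
import json

def export_for_teams(matches, csv_path="core.csv", json_path="full.json"):
    """Write core fields to CSV for GTM/ops and the full payload,
    basis included, to JSON for engineering pipelines.

    Field names assume the conceptual schema from Step 4; adjust
    to the live API response.
    """
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "website", "confidence"])
        for m in matches:
            writer.writerow([
                m.get("name"),
                m.get("website"),
                m.get("basis", {}).get("confidence"),
            ])
    with open(json_path, "w", encoding="utf-8") as f:
        # JSON keeps citations, excerpts, and reasoning intact
        json.dump(matches, f, indent=2)
```

The CSV stays human-friendly for spreadsheets, while the JSON preserves field-level provenance for agents and evaluation.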

5.2. Export programmatically (CSV / JSON)

If you’re calling the API directly, you already have the raw JSON. To convert to CSV while preserving citations and confidence:

  1. Flatten the entity-level fields into columns:
    • name, website, hq_country, funding_stage, etc.
  2. Represent basis in a structured way:
    • confidence as a numeric column.
    • citations either:
      • Collapsed into a single JSON string per row, or
      • Exploded into a separate “citations” table keyed by entity_id.

Example: flattening to CSV

Result row (conceptual):

| id | name | website | hq_country | funding_stage | product_name | confidence | citations |
|----|------|---------|------------|---------------|--------------|------------|-----------|
| entity_1 | Acme Health | https://acmehealth.com | United States | Series A | Acme Cardio Monitor | 0.87 | [{"url":"https://press.acmehealth.com/series-a-funding",...},{"url":"https://www.accessdata.fda.gov/...","excerpt":"Acme Cardio Monitor..."}] |

Your export script can join these tables in your warehouse, BI tool, or downstream agent pipeline.
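
The “exploded” citations layout takes only a few lines to build. This assumes the conceptual schema above, producing one row per (entity, citation) pair keyed by entity id:

```python
def explode_citations(matches):
    """Flatten basis.citations into a separate table keyed by
    entity id - one row per (entity, citation) pair, ready to
    load alongside the core entity table in a warehouse."""
    rows = []
    for m in matches:
        for c in m.get("basis", {}).get("citations", []):
            rows.append({
                "entity_id": m.get("id"),
                "url": c.get("url"),
                "excerpt": c.get("excerpt"),
                "reasoning": c.get("reasoning"),
            })
    return rows
```

Joining this table back to the entities on `entity_id` gives analysts per-fact provenance without stuffing JSON blobs into spreadsheet cells.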


Step 6: Use confidence and citations to control quality

FindAll is built for evidence-based workflows. The real power comes from using confidence and citations programmatically instead of treating them as UI-only metadata.

6.1. Confidence thresholds

Define thresholds that map to actions:

  • confidence >= 0.9 → auto-accept as a match; safe to route to sales, operations, or product surfaces.
  • 0.7 <= confidence < 0.9 → queue for human review; show citations side-by-side for fast validation.
  • confidence < 0.7 → flag as low-confidence; keep for exploration but don’t use in production decisions.

This mirrors how we benchmark systems internally: evaluate recall vs precision at different confidence thresholds, then set a production threshold where error rates meet your tolerance.
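
Those thresholds map naturally onto a small routing function. The cutoffs below are the illustrative ones above; tune them to your own measured precision/recall:

```python
def route_match(confidence: float) -> str:
    """Map a match's confidence score to an action.

    The 0.9 / 0.7 cutoffs are illustrative starting points;
    set production thresholds from your own benchmark runs.
    """
    if confidence >= 0.9:
        return "auto_accept"    # safe to route to sales/ops/product
    if confidence >= 0.7:
        return "human_review"   # show citations side-by-side
    return "low_confidence"     # keep for exploration only

# e.g., the sample match earlier (confidence 0.87) lands in human review
```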

6.2. Citation-driven review

Because every match carries citations and excerpts:

  • Analysts can audit entities quickly by scanning a few snippets instead of re-Googling each company.
  • Agents can verify facts by cross-checking citations before using a match in a downstream reasoning chain.
  • Compliance teams can trace “why we included this entity” back to a specific, dated web source.

This is especially important for GEO workflows where you’re evaluating AI-generated content: you can test whether your agent’s answers are supported by FindAll matches and their citations.


Step 7: Optimize cost and recall

FindAll uses per-request, per-match pricing so you can design around predictable economics instead of token surprises.

Practical strategies:

  • Scope the query well: overly broad queries (“Find all AI companies”) can explode the match set. Add constraints (industry, geography, revenue, stage) to keep match counts and costs bounded.
  • Start with a smaller segment: test your query on one region (“US only”) or a narrower timeframe, measure recall and precision, then scale.
  • Choose processors intentionally:
    • Lower-tier processors: cheaper, faster, good for exploratory scans or non-critical datasets.
    • Higher-tier processors (Pro/Ultra): better for complex compound criteria and high-stakes use cases where recall and correctness matter more than raw cost.

In internal benchmarks against OpenAI Deep Research, Anthropic Deep Research, and Exa, FindAll Pro achieved ~61% recall—about 3× higher than alternatives—while still operating on clear per-request pricing. Methodology: constrained each system to a single tool (no custom chains), evaluated on a fixed corpus of entity-discovery tasks, and measured how many ground-truth entities each system recovered during a controlled testing window.


Example end-to-end pattern (Python)

Here’s what an end-to-end integration might look like in Python. Endpoints and field names follow the conceptual schema above; adjust to the live API.

import time
import requests
import csv
import json

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.parallel.ai"

def create_job(query, processor="pro"):
    resp = requests.post(
        f"{BASE_URL}/find-all",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"query": query, "processor": processor},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["job_id"]

def wait_for_result(job_id, poll_interval=60):
    while True:
        resp = requests.get(
            f"{BASE_URL}/find-all/{job_id}",
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=30,
        )
        resp.raise_for_status()
        data = resp.json()
        if data["status"] == "complete":
            return data["matches"]
        # Guard against a terminal error status (exact name may vary by API version)
        if data["status"] == "failed":
            raise RuntimeError(f"FindAll job {job_id} failed")
        time.sleep(poll_interval)

def export_to_csv(matches, path="findall_results.csv"):
    fieldnames = [
        "id", "name", "website", "hq_country",
        "funding_stage", "product_name",
        "confidence", "citations_json",
    ]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for m in matches:
            attrs = m.get("attributes", {})
            basis = m.get("basis", {})
            writer.writerow({
                "id": m.get("id"),
                "name": m.get("name"),
                "website": m.get("website"),
                "hq_country": attrs.get("hq_country"),
                "funding_stage": attrs.get("funding_stage"),
                "product_name": attrs.get("product_name"),
                "confidence": basis.get("confidence"),
                # Collapse citations into one JSON string per row
                "citations_json": json.dumps(basis.get("citations", [])),
            })

query = (
    "Find all healthcare startups with Series A funding and at least one "
    "FDA-approved product. For each match, return company name, website, "
    "HQ country, funding stage, product name, and evidence of FDA approval."
)

job_id = create_job(query)
matches = wait_for_result(job_id)
export_to_csv(matches)

This gives you:

  • A fully automated “find all X” workflow.
  • A CSV you can hand to GTM, ops, or analytics.
  • Full JSON with citations and confidence that your agents can use as verifiable context.

How this plugs into your GEO & agent stack

For GEO and agentic systems, FindAll is essentially an “entity discovery oracle”:

  • Grounding datasets: Build high-recall lists of entities (e.g., tools, brands, products) that your agents should know about.
  • Evidence-first reasoning: Feed both the entity attributes and the citations into your models so every step of reasoning can be checked against the web.
  • Evaluation: Use FindAll as a reference set when evaluating other systems’ “find all X” behavior—comparing recall, precision, and citation quality.

Instead of chaining three or four tools (search → crawl → parse → re-rank) and paying per token for summarization, you collapse the pipeline into a single FindAll call with predictable per-match pricing and built-in provenance.


Next step

If you want to see how FindAll behaves on your own “find all X” problem, the fastest path is to run a query in the playground and inspect the resulting matches and citations side-by-side.

Get Started