
Per-request pricing vs token-based pricing for web-grounded agents (unit economics and forecasting)
Most teams discover the pricing trap for web-grounded agents the hard way: the prototype works, users love it, and then the first real bill lands—driven not by queries, but by a long tail of unpredictable tokens. If your agent has to browse, extract, and reason over the open web, the difference between per-request pricing and token-based pricing isn’t academic; it’s the difference between a system you can forecast and one you’re constantly throttling.
This guide walks through how the two models behave for web-grounded agents, how they affect unit economics, and how to design pricing-aware architectures that won’t blow up when you move from 100 to 10M requests.
Why pricing models matter more for web-grounded agents
Web-grounded agents don’t just call a model once. A single user query can trigger:
- Multiple search calls
- Dozens of page fetches and parses
- Long-context prompts to synthesize evidence and generate an answer
- Follow-up calls when facts conflict or pages change
Each of those steps expands or contracts token usage in ways that are hard to predict:
- A “simple” question may resolve in 2–3 pages and a short answer.
- A compliance or due-diligence query might traverse 100+ pages, citations, and long reasoning chains.
With token-based pricing, your cost per task scales with all of that complexity—especially if you chain search, browse, summarize, and re-ask steps. With per-request pricing, your cost per task is dominated by the number of API calls, not the variability of the content those calls return.
For teams deploying production agents with web grounding, that difference is the core of unit economics and forecasting.
Definitions: per-request vs token-based pricing
Before comparing economics, it’s important to be precise about the two models.
Per-request pricing
- What it is: You’re charged a fixed amount for each API call, independent of result size.
- Common for: Web search APIs, extraction/crawling APIs, enrichment pipelines, some retrieval providers.
- Behavior: Cost scales linearly with calls (queries, pages fetched, tasks), not with how much text each call returns.
Examples in a web-grounded stack:
- 1 Search API call → fixed fee
- 10 Extract API calls for 10 URLs → 10 × fixed fee, regardless of page length
- 1 Task/FindAll job → fixed fee, even if it processes 50 or 500 pages under the hood
Token-based pricing
- What it is: You’re charged based on tokens in and/or out (context length + output length).
- Common for: Foundation model APIs, chat/browsing tools that fetch-and-summarize, some retrieval providers that meter by returned text length.
- Behavior: Cost scales with:
- Number of pages fetched and summarized
- Length of those pages
- Prompt complexity and system instructions
- Answer length and style (short vs long-form)
In practice, web-grounded agents often pay for tokens:
- When they call “browser + summarize” tools that meter by tokens processed
- When they feed raw scraped content into an LLM for reasoning
How each pricing model behaves in web-grounded workflows
Let’s look at the same workload under both models: a web-grounded agent answering user questions.
Under per-request pricing
Cost per query is a simple function:
Cost per query ≈ [(search calls × search CPM) + (extract calls × extract CPM) + (tasks × task CPM)] / 1,000
Where CPM is the cost per 1,000 requests, so dividing by 1,000 converts each term to a per-call price.
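As a minimal sketch, the per-request cost function can be written in a few lines of Python. The `ApiUsage` type, the `per_request_cost` helper, and the CPM values are illustrative assumptions, not any provider's actual API or price list. Because CPM is quoted per 1,000 requests, the sum is divided by 1,000 to get a per-query figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiUsage:
    """Hypothetical record of how one API is used per user query."""
    calls_per_query: float  # average calls this API receives per user query
    cpm_usd: float          # assumed price per 1,000 requests, in USD


def per_request_cost(usages: dict[str, ApiUsage]) -> float:
    """Return the cost of one user query in USD.

    CPM is a price per 1,000 requests, so the summed terms are divided by 1,000.
    """
    return sum(u.calls_per_query * u.cpm_usd for u in usages.values()) / 1_000


# Illustrative workload: 3 searches and 10 extracts per query, no tasks.
workload = {
    "search": ApiUsage(calls_per_query=3, cpm_usd=1.00),
    "extract": ApiUsage(calls_per_query=10, cpm_usd=2.00),
    "task": ApiUsage(calls_per_query=0, cpm_usd=50.00),
}

cost = per_request_cost(workload)
print(f"${cost:.4f} per query")
```

Nothing in the function depends on page length or output size, which is the property the rest of this section relies on.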
Key properties:
- Predictable: Once you know your average call counts (e.g., 3 searches + 10 extracts per user query), you can forecast spend at 1K, 1M, or 100M queries.
- Bounded by design: You can hard-cap:
- Max search calls per question
- Max pages extracted
- Max deep research tasks initiated
- Decoupled from page length: A 40KB legal opinion and a 4KB blog post cost the same to extract.
This is how Parallel’s APIs are designed: you pay per query or per task, not per token, with a clear curve across processors (Lite/Base/Core/Pro/Ultra) so you can choose how much compute to allocate per request.
Under token-based pricing
Cost per query looks more like:
Cost per query ≈ Σ (tokens in per call + tokens out per call) × token rate
But each component varies:
- Some pages are 500 tokens; others are 20,000+.
- Some questions resolve quickly; others trigger long multi-hop chains.
- Agent bugs can balloon token usage (looping, redundant browsing, verbose reasoning).
Key properties:
- Highly variable: The same endpoint can cost 10x or more for complex questions (roughly 30x between the two illustrative scenarios below).
- Harder to cap without degrading quality: Strict token ceilings often truncate context or reasoning just when you most need depth.
- Opaque in advance: You rarely know token counts before running the workflow, especially if browsing is dynamic.
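The token-based formula can be sketched the same way. This example assumes a single blended token rate, as in the formula above; many providers price input and output tokens differently, so treat the `ModelCall` type, the `token_cost` helper, and the $5.00 rate as illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCall:
    """Hypothetical record of one metered model call."""
    tokens_in: int
    tokens_out: int


def token_cost(calls: list[ModelCall], usd_per_million_tokens: float) -> float:
    """Sum tokens in and out over every call, then apply one blended rate."""
    total_tokens = sum(c.tokens_in + c.tokens_out for c in calls)
    return total_tokens * usd_per_million_tokens / 1_000_000


RATE = 5.00  # assumed USD per 1M tokens

# A question that resolves in one short call.
short_run = [ModelCall(tokens_in=3_500, tokens_out=150)]

# A multi-hop chain that keeps re-reading long pages.
long_run = [
    ModelCall(tokens_in=20_000, tokens_out=500),
    ModelCall(tokens_in=25_000, tokens_out=500),
    ModelCall(tokens_in=30_000, tokens_out=2_000),
]

short_cost = token_cost(short_run, RATE)
long_cost = token_cost(long_run, RATE)
print(f"short: ${short_cost:.5f}  long: ${long_cost:.5f}")
```

The inputs to this function are only known after the workflow has run, which is why the result is hard to predict in advance.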
Unit economics: concrete numbers instead of abstractions
Let’s quantify how this plays out for web-grounded agents. Numbers are illustrative; the shape is what matters.
Scenario 1: FAQ-grade questions (simple web grounding)
- Avg search calls per query: 1
- Avg pages needed: 3
- Avg context per page: 1,000 tokens
- Answer: 150 tokens
Per-request stack:
- 1 Search call + 3 Extract calls
- Suppose:
- Search CPM = $1.00 (USD per 1,000 queries)
- Extract CPM = $2.00
- Cost per query:
- Search: $1.00 / 1,000 = $0.001
- Extract: 3 × ($2.00 / 1,000) = $0.006
- Total ≈ $0.007/query
Token-based “browse+summarize” stack:
- Tokens per query:
- Input pages: 3 × 1,000 = 3,000
- System + question: 500
- Answer: 150
- Total ≈ 3,650 tokens
- If rate is $5.00 per 1M tokens:
- 3,650 / 1,000,000 × $5.00 ≈ $0.01825
- Total ≈ $0.018/query
Even with simple questions, small variations in page length drive token cost; per-request cost stays flat as long as call counts are stable.
Scenario 2: Deep research (complex web grounding)
Realistic for compliance, legal, or due-diligence workflows.
- Avg search calls: 4 (multi-hop)
- Avg pages needed: 40
- Avg context per page: 2,500 tokens (long docs)
- Final report: 2,000 tokens of structured analysis
Per-request stack (Search + Extract + Task):
- Calls:
- 4 Search
- 40 Extract
- 1 Task (deep research job)
- Suppose:
- Search CPM = $1.00
- Extract CPM = $2.00
- Task CPM (Core/Pro processor) = $50.00
- Cost per query:
- Search: 4 × ($1.00 / 1,000) = $0.004
- Extract: 40 × ($2.00 / 1,000) = $0.08
- Task: $50.00 / 1,000 = $0.05
- Total ≈ $0.134/query
Token-based “automated browsing + long-context LLM” stack:
- Tokens per query:
- Pages: 40 × 2,500 = 100,000
- Reasoning chain + prompts: ~10,000
- Report: 2,000
- Total ≈ 112,000 tokens
- At $5.00 per 1M tokens:
- 112,000 / 1,000,000 × $5.00 = $0.56
- Total ≈ $0.56/query
Here you see why deep research is where token-based pricing hurts most: performance tends to require more context, which directly increases cost. Under per-request pricing, cost is set by the number of calls (here, mostly the 40 Extract calls and the single Task), not by how much text each one processes.
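To make the contrast concrete, the following sketch reruns Scenario 2 while varying only the average page length. It reuses the illustrative prices from the scenario ($1.00, $2.00, and $50.00 CPM, and $5.00 per 1M tokens); the function names are assumptions introduced for this example.

```python
def per_request_cost(searches: int, extracts: int, tasks: int,
                     search_cpm: float = 1.00, extract_cpm: float = 2.00,
                     task_cpm: float = 50.00) -> float:
    """Cost in USD of one query under per-request pricing (CPM = per 1,000 calls)."""
    return (searches * search_cpm + extracts * extract_cpm + tasks * task_cpm) / 1_000


def token_cost(pages: int, tokens_per_page: int, overhead_tokens: int,
               output_tokens: int, usd_per_million: float = 5.00) -> float:
    """Cost in USD of one query under token pricing with a single blended rate."""
    total = pages * tokens_per_page + overhead_tokens + output_tokens
    return total * usd_per_million / 1_000_000


# Scenario 2 call counts, with average page length swept from 1,250 to 10,000 tokens.
rows = []
for tokens_per_page in (1_250, 2_500, 5_000, 10_000):
    flat = per_request_cost(searches=4, extracts=40, tasks=1)
    metered = token_cost(pages=40, tokens_per_page=tokens_per_page,
                         overhead_tokens=10_000, output_tokens=2_000)
    rows.append((tokens_per_page, flat, metered))
    print(f"{tokens_per_page:>6} tokens/page  per-request ${flat:.3f}  token-based ${metered:.2f}")
```

The per-request column does not move across rows, while the token-based column grows almost in proportion to page length.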
Forecasting: why per-request pricing is easier to model
For web-grounded workloads, finance and infra leads care about two questions:
- What is our cost per user query at steady-state?
- How does cost scale when usage grows 10x?
Forecasting with per-request pricing
You can build a straightforward model from logs:
1. Measure call counts per logical task
- Average Search calls per user query
- Average Extract calls per user query
- % of queries that trigger Task/FindAll jobs, and how many per query
2. Multiply by CPM
- For each API: unit_cost = (avg_calls_per_query × CPM) / 1,000
- Sum across Search, Extract, Task, etc.
3. Scenario-plan usage
- Cost scales linearly with volume: 100K, 1M, and 10M queries differ only by a multiplier.
- You can confidently say: “At our current behavior, 1M monthly queries cost ≈ $X.”
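The three steps above can be sketched as a small script. The log format (one dictionary of call counts per user query), the `unit_cost_per_query` helper, and the CPM values are hypothetical; a real pipeline would read these counts from its own request logs.

```python
from collections import Counter

# Assumed prices per 1,000 requests, in USD.
CPM_USD = {"search": 1.00, "extract": 2.00, "task": 50.00}

# Hypothetical log sample: call counts recorded for each user query.
query_logs = [
    {"search": 1, "extract": 3, "task": 0},
    {"search": 2, "extract": 8, "task": 0},
    {"search": 4, "extract": 40, "task": 1},
    {"search": 1, "extract": 3, "task": 0},
]


def unit_cost_per_query(logs: list[dict[str, int]], cpm: dict[str, float]) -> float:
    """Average calls per query for each API, priced at its CPM, summed across APIs."""
    totals = Counter()
    for row in logs:
        totals.update(row)  # adds this query's call counts to the running totals
    n = len(logs)
    return sum((totals[api] / n) * cpm[api] / 1_000 for api in cpm)


unit_cost = unit_cost_per_query(query_logs, CPM_USD)

# Scenario planning is a straight multiplication because cost is linear in volume.
forecast = {volume: unit_cost * volume for volume in (100_000, 1_000_000, 10_000_000)}
for volume, usd in forecast.items():
    print(f"{volume:>10,} queries/month  ~ ${usd:,.0f}")
```

The forecast holds only while average call counts stay stable, so it is worth recomputing `unit_cost` whenever agent behavior or the query mix changes.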
This is particularly clean with Parallel’s architecture because:
- Each API uses a per-request model.
- Processor tiers (Lite/Base/Core/Pro/Ultra) give you explicit cost bands per call.
- You can route queries to different processors based on complexity, but each branch still has fixed per-request pricing.
Forecasting with token-based pricing
Forecasting token spend requires assumptions on multiple axes:
- Avg pages visited per query
- Avg tokens per page, by domain
- Avg reasoning tokens per query
- Agent loop behaviors (retries, follow-ups)
- Output length and style
Even if you backtest on logs, two issues remain:
- Long tail risk: A small fraction of “complex” queries can dominate total tokens, especially where web context is dense (e.g., PDFs, regulatory filings).
- Model drift and prompt changes: Any adjustment to system prompts, citation behavior, or reporting style can change tokens per query, invalidating prior forecasts.
Mitigation patterns (token ceilings, truncation, summarizing before retrieval) often degrade answer quality or verifiability precisely where stakes are highest.
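One way to quantify long-tail risk during a backtest is to measure what share of all tokens the heaviest queries consume. The sketch below uses a made-up sample built from the two illustrative scenarios earlier in this guide (3,650 and 112,000 tokens per query); `tail_share` is a helper name introduced here, not a library function.

```python
def tail_share(tokens_per_query: list[int], top_fraction: float) -> float:
    """Fraction of total tokens consumed by the heaviest `top_fraction` of queries."""
    if not tokens_per_query:
        raise ValueError("need at least one query")
    ranked = sorted(tokens_per_query, reverse=True)
    k = max(1, round(len(ranked) * top_fraction))  # always include at least one query
    return sum(ranked[:k]) / sum(ranked)


# Hypothetical mix: 95 FAQ-grade queries and 5 deep-research queries.
sample = [3_650] * 95 + [112_000] * 5

share = tail_share(sample, top_fraction=0.05)
print(f"Top 5% of queries consume {share:.0%} of tokens")
```

In this sample, 5% of queries account for well over half of token spend, so a small shift in the share of complex queries moves the total bill far more than the average suggests.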
Comparing per-request vs token-based for web-grounded agents
Accuracy and reliability
Per-request:
- Neutral to accuracy by itself (it is a billing model), but it allows you to:
- Run deep retrieval without worrying about token explosion.
- Use specialized tools (Search, Extract, Task, FindAll, Monitor) that return structured, citation-rich outputs.
- In Parallel’s case, outputs use the Basis framework (citations, rationale, confidence) so agents can trust or programmatically reject each atomic fact.
Token-based:
- Creates pressure to truncate and summarize aggressively to control cost.
- Browsing/summarization stacks often compress away field-level evidence, reducing verifiability.
Cost and unit economics
Per-request:
- Linear with call counts.
- Clear CPM per API, easy to model.
- Lets you design hard limits per workflow step (max URLs, max tasks).
Token-based:
- Scales with tokens consumed, which vary widely with query complexity.
- Exposed to long-tail costs on deep research.
- Difficult to forecast confidently as usage or question mix changes.
Latency and performance
Pricing doesn’t directly dictate latency, but it shapes architecture:
Per-request stacks like Parallel:
- Search: synchronous, <5 seconds for ranked URLs + compressed excerpts.
- Extract: ~1–3 seconds from cache; 60–90 seconds for live crawling.
- Task: asynchronous, 5 seconds to ~30 minutes depending on processor (Lite → Ultra8x).
- FindAll: asynchronous, 10 minutes–1 hour for large entity sets.
- You choose processors to trade off latency vs depth on a clear per-request cost curve.
Token-based browsing + summarize:
- Latency grows with:
- Number of pages browsed
- Model size
- Token count per call
- Controlling latency often means limiting pages (hurting recall) or context size (hurting reasoning).
Verifiability and provenance
Per-request with structured web APIs:
- Search returns URLs + compressed excerpts you can store.
- Extract returns full page contents + excerpts.
- Task/FindAll/Monitor can attach citations, rationale, and confidence per field via Basis.
- You can reconstruct and audit every atomic fact.
Token-based summarization stacks:
- Often return free-form text with partial or missing citations.
- Often return free-form text with partial or missing citations.
- Evidence is not first-class; outputs are.
- Harder to implement programmatic checks (“accept this field only if confidence > 0.9 and at least 2 independent sources agree”).
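A programmatic check like the one quoted above might look as follows. The `FieldEvidence` shape is a hypothetical stand-in for a structured output with per-field confidence and citations; it is not the actual Basis schema. The sketch also simplifies "independent sources agree" to "citations come from at least two distinct hostnames", on the assumption that every attached citation supports the field's value.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class FieldEvidence:
    """Hypothetical per-field output: a value plus its confidence and citations."""
    name: str
    value: str
    confidence: float
    citation_urls: list[str] = field(default_factory=list)


def independent_sources(urls: list[str]) -> int:
    """Count distinct hostnames as a rough proxy for source independence."""
    hosts = {urlparse(u).hostname for u in urls}
    hosts.discard(None)  # URLs without a parseable hostname do not count
    return len(hosts)


def accept(f: FieldEvidence, min_confidence: float = 0.9, min_sources: int = 2) -> bool:
    """Accept only if confidence exceeds the threshold and enough sources back it."""
    return f.confidence > min_confidence and independent_sources(f.citation_urls) >= min_sources


well_supported = FieldEvidence(
    "founded_year", "2015", 0.95,
    ["https://a.example/about", "https://b.example/profile"],
)
single_source = FieldEvidence(
    "founded_year", "2015", 0.95,
    ["https://a.example/about", "https://a.example/press"],
)
low_confidence = FieldEvidence(
    "founded_year", "2015", 0.70,
    ["https://a.example/about", "https://b.example/profile"],
)

for candidate in (well_supported, single_source, low_confidence):
    print(candidate.name, "accepted" if accept(candidate) else "rejected")
```

Checks of this kind are only possible when evidence is returned as data; free-form summaries would first need to be parsed back into fields.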
Designing pricing-aware architectures for web-grounded agents
If you want predictable unit economics, you need to design retrieval and research flows to match per-request pricing, not just model calls.
1. Separate retrieval from reasoning
- Use web-native retrieval APIs (Search, Extract, FindAll, Monitor) with per-request pricing.
- Use LLMs for reasoning over compressed excerpts, not for browsing the web themselves.
- This lets you:
- Control retrieval cost via request counts.
- Control reasoning cost via a smaller, denser context.
With Parallel:
- Search API: One call returns ranked URLs + token-dense excerpts tuned for LLM consumption.
- Extract API: One call per URL gives you full contents + compressed excerpts (cached vs live).
- Task API: One call per research/enrichment task yields structured JSON with Basis (citations, reasoning, confidence).
2. Allocate compute by task complexity
Instead of letting token count implicitly determine cost, decide up front how much compute each task deserves.
- Route:
- Simple Q&A → Search + small Base/Core processor.
- Mid-tier research → Search + Extract + Task on Core/Pro.
- Heavy-duty diligence → Task on Pro/Ultra/Ultra8x.
Because each processor has a known per-request price, you’re effectively implementing “compute classes” instead of letting tokens decide spend post hoc.
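A minimal sketch of compute classes: each complexity class maps to a processor tier with a fixed per-request price, so blended cost is a weighted sum over the traffic mix. The tier names follow the ones mentioned above, but the string identifiers, the prices, and the traffic mix are all assumptions for illustration. How a query gets classified is left out; the sketch assumes an upstream step has already assigned a `Complexity`.

```python
from enum import Enum


class Complexity(Enum):
    SIMPLE = "simple"  # FAQ-grade Q&A
    MID = "mid"        # mid-tier research
    HEAVY = "heavy"    # heavy-duty diligence


# Hypothetical routing table: processor tier and assumed USD price per request.
ROUTES = {
    Complexity.SIMPLE: {"processor": "base", "usd_per_request": 0.01},
    Complexity.MID: {"processor": "core", "usd_per_request": 0.05},
    Complexity.HEAVY: {"processor": "ultra", "usd_per_request": 0.30},
}


def route(complexity: Complexity) -> dict:
    """Pick the compute class up front instead of letting tokens decide spend."""
    return ROUTES[complexity]


def blended_cost(mix: dict[Complexity, float]) -> float:
    """Expected cost per query for a traffic mix whose shares sum to 1."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * ROUTES[c]["usd_per_request"] for c, share in mix.items())


# Assumed traffic mix: mostly simple questions with a small heavy tail.
traffic_mix = {Complexity.SIMPLE: 0.80, Complexity.MID: 0.15, Complexity.HEAVY: 0.05}

expected = blended_cost(traffic_mix)
print(f"Heavy queries run on: {route(Complexity.HEAVY)['processor']}")
print(f"Expected cost per query: ${expected:.4f}")
```

Because each branch has a fixed price, a change in the traffic mix translates directly into a new expected cost without re-measuring token usage.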
3. Define hard caps on API calls, not tokens
For web-grounded agents:
- Cap:
- Max search calls per query (e.g., 3–5)
- Max pages extracted (e.g., 20–50 per research job)
- Max Task or FindAll tasks per user action
Soft caps can be dynamic:
- Start with a small number of pages; expand only if confidence is low or sources conflict.
- Stop when Basis confidence crosses a threshold across fields.
This way, your worst-case cost is well-bounded and directly inspectable from logs.
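The hard-cap and soft-cap pattern can be sketched as a batched loop. The `extract` and `confidence` callables are placeholders supplied by the caller; they stand in for a real extraction call and a real confidence score, and nothing here reflects a specific provider's client library.

```python
from typing import Callable


def gather_evidence(
    candidate_urls: list[str],
    extract: Callable[[str], str],
    confidence: Callable[[list[str]], float],
    *,
    max_pages: int = 20,
    batch_size: int = 5,
    target_confidence: float = 0.9,
) -> list[str]:
    """Extract pages in small batches, stopping early once confidence is high enough."""
    pages: list[str] = []
    capped = candidate_urls[:max_pages]  # hard cap: never extract beyond this many pages
    for start in range(0, len(capped), batch_size):
        for url in capped[start:start + batch_size]:
            pages.append(extract(url))
        if confidence(pages) >= target_confidence:  # soft cap: expand only while unsure
            break
    return pages


# Stub implementations for demonstration only.
urls = [f"https://example.com/page/{i}" for i in range(100)]
fake_extract = lambda url: f"contents of {url}"
rising_confidence = lambda pages: min(1.0, len(pages) / 10)  # confident after 10 pages
never_confident = lambda pages: 0.0

early_stop = gather_evidence(urls, fake_extract, rising_confidence)
worst_case = gather_evidence(urls, fake_extract, never_confident)
print(len(early_stop), "pages when confidence rises;", len(worst_case), "pages at the hard cap")
```

Since one page costs one extract call, the worst-case spend for this step is `max_pages` times the per-extract price, whatever the pages contain.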
4. Use evidence-based outputs to reduce downstream calls
Dense, citation-rich outputs reduce the need for extra model calls:
- Compressed excerpts from Search/Extract carry more signal per token than raw HTML.
- Task/FindAll outputs arrive as structured JSON with citations and confidence per field.
- Agents can:
- Avoid re-querying the web for the same facts.
- Skip re-running the LLM when confidence is already high and evidence is strong.
That means fewer total requests and fewer tokens per downstream reasoning call, improving both cost and latency while maintaining verifiability.
When token-based pricing can still make sense
Per-request pricing is a better fit for web-grounded retrieval and research, but token-based pricing remains logical for:
- Core LLM calls for generation (chat interfaces, long-form writing) where:
- You explicitly want output length to scale with user needs.
- The model isn’t doing open-ended web browsing.
- One-off exploration or prototyping where:
- You’re experimenting with agents that directly use “browser + summarize” tools.
- You haven’t yet stabilized your retrieval patterns.
In those cases, apply guardrails:
- Set maximum tokens per call and per user session.
- Monitor top percentiles (P95, P99) of tokens per request, not just averages.
- Gradually move web grounding to per-request retrieval APIs once patterns stabilize.
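As a rough sketch of the first two guardrails, the snippet below computes token percentiles with Python's `statistics.quantiles` and enforces per-call and per-session ceilings. The `SessionBudget` class, its limits, and the sample data are illustrative assumptions.

```python
import statistics


def token_percentiles(tokens_per_request: list[int]) -> dict[str, float]:
    """Return P50, P95, and P99 of tokens per request."""
    # n=100 yields 99 cut points; index i-1 holds the i-th percentile.
    cuts = statistics.quantiles(tokens_per_request, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


class SessionBudget:
    """Hypothetical guard enforcing per-call and per-session token ceilings."""

    def __init__(self, max_tokens_per_call: int, max_tokens_per_session: int) -> None:
        self.max_tokens_per_call = max_tokens_per_call
        self.max_tokens_per_session = max_tokens_per_session
        self.used = 0

    def charge(self, tokens: int) -> bool:
        """Record the call and return True, or return False if a limit would be exceeded."""
        if tokens > self.max_tokens_per_call:
            return False
        if self.used + tokens > self.max_tokens_per_session:
            return False
        self.used += tokens
        return True


# Made-up sample: mostly small requests with a few very large ones.
observed = [3_650] * 95 + [112_000] * 5
stats = token_percentiles(observed)
print(f"mean={statistics.mean(observed):,.0f}  p50={stats['p50']:,.0f}  p99={stats['p99']:,.0f}")

budget = SessionBudget(max_tokens_per_call=20_000, max_tokens_per_session=50_000)
decisions = [budget.charge(t) for t in (15_000, 15_000, 112_000, 15_000, 15_000)]
print(decisions, "tokens used:", budget.used)
```

The median here sits far below the P99, which is why averages alone hide the requests that drive the bill.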
Practical decision framework
To decide between per-request and token-based pricing for your web-grounded agents, ask:
- Does your agent rely heavily on the web?
- If yes (search, verification, deep research), favor per-request retrieval.
- Do you need predictable costs at scale?
- If budget must be forecastable for 10x usage, per-request is easier to model.
- Do you operate in regulated or audited environments?
- If you need field-level provenance and citations, use web APIs that produce structured, evidence-based outputs.
- What’s the mix of simple vs complex queries?
- If there’s a heavy tail of complex research questions, token-based pricing exposes you to cost spikes while per-request keeps a linear spend curve.
- Can you segment compute by task type?
- If yes, combining per-request web APIs with selectable processor tiers (Lite → Ultra) lets you match spend to complexity explicitly.
For most production-grade, web-grounded agents, the dominant pattern looks like:
- Per-request for web retrieval and research (Search, Extract, Task, FindAll, Monitor).
- Token-based for core LLM reasoning/generation, but with bounded contexts fed by compressed, citation-rich excerpts.
That architecture gives you the best of both worlds: predictable per-query web costs plus flexible reasoning overhead.
Final verdict
For web-grounded agents, per-request pricing is the more stable unit of account. It lets you:
- Tie cost directly to observable behavior (how many searches, how many pages, how many research tasks), not emergent token dynamics.
- Design workflows where maximum spend is known before execution.
- Run deeper, evidence-based retrieval without being penalized for page length or model verbosity.
- Attach verifiability—citations, rationale, calibrated confidence—to every atomic fact, rather than compressing everything into opaque summaries.
Token-based pricing still has a role for core LLM calls, but if your agent’s value depends on live, accurate, and auditable web grounding, you’ll get better economics and better control by moving web access to per-request infrastructure.