
Per-request pricing vs token-based pricing for web-grounded agents (unit economics and forecasting)
Most teams don’t lose control of web-grounded agent costs because of models—they lose it in the retrieval layer. The difference between per-request pricing and token-based pricing is often the difference between forecastable unit economics and “why did this browsing run cost 10x more than last week?”
This guide breaks down how each pricing model behaves under real workloads and how to design predictable unit economics for web-grounded agents.
Perspective: I’ve shipped agents that had to ground answers on the open web with hard budget caps and auditable provenance. After watching token-metered browsing stacks explode costs unpredictably, I now treat “pay per query, not per token” as a design requirement for production systems.
Why pricing models matter more for web-grounded agents
Web-grounded agents have a very different cost profile from static-RAG use cases:
- They fan out across pages: one user query may trigger dozens or hundreds of fetches and extractions.
- They vary wildly in complexity: “what’s the latest CPI print?” vs. “summarize every open-source vector DB’s TCO” are not in the same league.
- They are long-lived and autonomous: you often don’t have a human watching every tool call.
That means variance is the enemy. Two tasks that look similar at the UX layer can differ by 10–100x in retrieval work, and token-based pricing happily passes that variance straight into your bill.
To reason about unit economics, we need to know:
- What is my cost per query or per user task?
- How does that cost scale as agents run more complex jobs?
- What is the most a single worst-case request can cost?
Pricing model choice decides how answerable those questions are.
Definitions: per-request pricing vs token-based pricing
Per-request pricing
Per-request pricing charges a fixed amount per API call, regardless of result size.
- You pay the same when a provider returns 3 compressed excerpts as when it returns 30, as long as it’s the same endpoint.
- Some web providers also break this down by processor tier (e.g., Lite/Base/Core/Pro/Ultra), where each tier has:
- A fixed cost per request (e.g., a CPM: USD per 1,000 requests);
- A predictable latency band (e.g., “Search <5s,” “Task 5s–30min”).
For web-grounded agents, this model lets you treat each tool call as a line item in your unit economics: search calls cost X, extract calls cost Y, deep research tasks cost Z.
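The "line item" framing can be made literal. A minimal sketch in Python, with hypothetical CPMs (not any real provider's prices):

```python
# Hypothetical per-request prices, expressed as CPM (USD per 1,000 requests).
CPM = {"search": 1.00, "extract": 2.00, "task": 25.00}

def call_cost(endpoint: str) -> float:
    """Cost of one request to an endpoint, derived from its CPM."""
    return CPM[endpoint] / 1000

def task_cost(calls: dict) -> float:
    """Retrieval cost for one user task, as a sum of per-endpoint line items."""
    return sum(n * call_cost(endpoint) for endpoint, n in calls.items())

# "Search calls cost X, extract calls cost Y": 3 searches + 5 extracts.
cost = task_cost({"search": 3, "extract": 5})
```

Because every term is a fixed price times a call count, this number is exact before the task runs, not an estimate after it.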
Token-based pricing
Token-based pricing charges based on tokens processed, usually:
- Tokens generated by a model (“output tokens”);
- Tokens in your prompts and retrieved context (“input tokens”);
- In browsing setups, sometimes tokens scraped from the web pages as part of a summarization/browsing tool.
Implications:
- The cost of a single “web research” call can vary by an order of magnitude depending on:
- How many pages were fetched;
- How long those pages were;
- How verbose the summarization and reasoning chain was.
Token-based pricing itself isn’t bad. But it makes unit economics for web-grounded agents far harder to forecast, because retrieval workloads are inherently variable.
How each model behaves for web-grounded agents
Cost structure: deterministic vs variable
Per-request pricing
- Deterministic per call: same price no matter how much content is returned.
- Good fit for:
- Search APIs returning token-dense compressed excerpts instead of full pages;
- Extraction APIs where you’re billed on each extraction, not on page length;
- Asynchronous research tasks with a fixed “ticket price” per task.
Token-based pricing
- Stochastic per call: effective price per “web research” varies with:
- Number of URLs visited;
- Length and complexity of pages;
- How much “thinking” the model does in the browsing chain.
- A single investigation can process hundreds of pages, generating massive token bills.
From a unit economics standpoint:
- Per-request lets you say “this user action uses ~3 search calls + ~5 extracts = $X.”
- Token-based forces you into probabilistic estimates: “typical tasks cost between $A and $10A, depending on how deep the agent goes.”
Forecasting & budgeting
Per-request pricing
- Forecastable: you model usage in requests, not tokens:
  - "Our agent does 5 searches and 10 extracts per user task."
  - "At 100k tasks/day, that's 1.5M requests/day."
- Easy to do CPM-style planning: if an endpoint costs $Y per 1,000 requests, you can translate that directly to:
  - Cost per query;
  - Cost per dataset row;
  - Cost per monitored entity.
- Works well with rate limits and SLOs:
  - You know how many requests/sec your agents can make;
  - You know the max daily cost at that rate.
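The request-based forecast above reduces to a few lines of arithmetic. A sketch using the same illustrative numbers (5 searches + 10 extracts per task, 100k tasks/day) and hypothetical CPMs:

```python
# Hypothetical CPMs (USD per 1,000 requests).
SEARCH_CPM, EXTRACT_CPM = 1.00, 2.00

TASKS_PER_DAY = 100_000
SEARCHES_PER_TASK, EXTRACTS_PER_TASK = 5, 10

# 5 + 10 = 15 requests per task -> 1.5M requests/day at 100k tasks.
requests_per_day = TASKS_PER_DAY * (SEARCHES_PER_TASK + EXTRACTS_PER_TASK)

daily_cost = (
    TASKS_PER_DAY * SEARCHES_PER_TASK * SEARCH_CPM / 1000
    + TASKS_PER_DAY * EXTRACTS_PER_TASK * EXTRACT_CPM / 1000
)
```

There is no probabilistic term anywhere in the model: usage in requests maps one-to-one to dollars.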
Token-based pricing
- Hard to forecast for retrieval-heavy tasks:
- A “simple” user question might resolve quickly with minimal browsing;
- A “complex” one might trigger recursive research across hundreds of pages.
- Complexity grows with:
- Larger context windows;
- More agentic chains (e.g., browse → summarize → re-browse);
- Long-lived sessions.
Mitigation strategies many teams end up building:
- Explicit token budgets per research task;
- Monitoring/alerting on average tokens per request and 95th percentile outliers;
- Guardrails that truncate long pages or limit browsing depth.
Those are all additional systems you have to build and maintain.
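To make the "additional systems" point concrete, here is a sketch of the kind of token-budget guard teams end up writing around token-metered browsing. The class name and limits are illustrative, not any vendor's API:

```python
class TokenBudget:
    """Hard cap on tokens a single research task may consume.

    This is one of the extra guardrail systems token-metered browsing
    forces you to build and maintain yourself.
    """

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> None:
        """Record token usage; refuse the call if it would blow the cap."""
        if self.used + tokens > self.max_tokens:
            raise RuntimeError(
                f"token budget exceeded: {self.used + tokens} > {self.max_tokens}"
            )
        self.used += tokens

budget = TokenBudget(max_tokens=50_000)
budget.charge(12_000)  # first page fetch + summary
budget.charge(30_000)  # deeper dive
# A further charge of 20_000 would raise, truncating the investigation.
```

Note that the guard caps spend only by cutting research short, which is exactly the recall-vs-cost trade-off discussed later.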
Latency and UX implications
Pricing and latency are coupled in practice.
Per-request with processor tiers
Some AI-native web providers (like Parallel) expose processor tiers:
- Lite/Base for fast, shallow tasks;
- Core/Pro/Ultra for deeper, more compute-heavy tasks.
You choose a tier per request, trading off:
- Latency: e.g., Search <5s vs. Task 5s–30min;
- Depth of processing: how exhaustive the cross-referencing and extraction is;
- CPM: higher tiers cost more per 1,000 requests.
Effectively, you allocate compute budget based on task complexity with a known price and latency band per tier.
Token-based browsing stacks
Token-based browsing tools often have:
- Less explicit control over how many pages the agent will read;
- Less direct linkage between “depth of research” and “price band,” because both are functions of tokens, not requests.
You can impose timeouts to cap latency, but that doesn’t give you a clean per-task price curve. Long investigations can still spike both time and cost.
Comparing unit economics in practice
Let’s walk through two concrete scenarios.
Scenario 1: Web-grounded chat agent
You’re building a customer-facing agent that:
- Calls a Search API to get 10–20 highly relevant URLs with compressed excerpts;
- Calls an Extract API to fetch full content for 3–5 of those URLs;
- Sends the excerpts + question to your LLM to answer with citations.
Under per-request pricing
Assume (hypothetical numbers for illustration):
- Search: $1 CPM (per 1,000 requests), <5s latency;
- Extract: $2 CPM, ~1–3s cached, ~60–90s live.
If each user query triggers:
- 1 Search call;
- 4 Extract calls (average).
Then per 1,000 queries, you pay for:
- 1,000 Search requests = $1;
- 4,000 Extract requests = $8.
Total retrieval cost: $9 per 1,000 user queries.
Your unit economics are:
- Cost per query: $0.009 (just under one cent);
- Scales linearly with query volume;
- Worst-case: an especially complex query might add a few more Extract calls—but still in the same order of magnitude.
You can commit in advance: “1M queries/month ≈ $9,000 in retrieval.”
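The scenario's arithmetic, written out so the linear scaling is visible (same hypothetical CPMs as above):

```python
# Hypothetical CPMs from the scenario: Search $1, Extract $2 per 1,000 requests.
SEARCH_CPM, EXTRACT_CPM = 1.00, 2.00
SEARCHES_PER_QUERY, EXTRACTS_PER_QUERY = 1, 4

def retrieval_cost(queries: int) -> float:
    """Total retrieval spend for a given query volume."""
    searches = queries * SEARCHES_PER_QUERY
    extracts = queries * EXTRACTS_PER_QUERY
    return searches * SEARCH_CPM / 1000 + extracts * EXTRACT_CPM / 1000

per_thousand = retrieval_cost(1_000)  # $1 search + $8 extract = $9
per_query = retrieval_cost(1)         # just under one cent
monthly = retrieval_cost(1_000_000)   # the "1M queries/month" commitment
```

Since `retrieval_cost` is linear in `queries`, the monthly figure is the per-query figure times volume, with no tail-risk term.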
Under token-based pricing
Now imagine a browsing stack where you pay:
- Per page fetched and summarized;
- Per token of the combined prompt and reasoning.
For “easy” queries, maybe the agent:
- Visits 3–5 pages;
- Summarizes them once.
For “hard” queries, maybe it:
- Visits 20+ pages across multiple sites;
- Generates multiple internal summaries before answering.
Result:
- Cost per query ranges from X to 10X, depending on how many pages and tokens are involved.
- It’s harder to tell, before launch, whether 1M queries/month will cost $5k or $50k.
The story is similar for API integrations: there’s no simple “cost per search call” line item; you’re estimating average tokens per browsing task and hoping agent behavior doesn’t shift.
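The easy/hard spread can be sketched with a toy token-metered cost model. Every number here is an illustrative assumption (blended token rate, average page size), not a real provider's pricing:

```python
# Illustrative token-metered browsing costs; all constants are assumptions.
PRICE_PER_1K_TOKENS = 0.003  # blended input/output rate, USD
TOKENS_PER_PAGE = 2_000      # average scraped page length

def browse_cost(pages: int, reasoning_tokens: int) -> float:
    """Cost of one browsing task: scraped page tokens plus reasoning tokens."""
    scraped = pages * TOKENS_PER_PAGE
    return (scraped + reasoning_tokens) * PRICE_PER_1K_TOKENS / 1000

easy = browse_cost(pages=4, reasoning_tokens=3_000)    # shallow task
hard = browse_cost(pages=25, reasoning_tokens=40_000)  # deep task
spread = hard / easy  # same user-facing action, ~8x the cost
```

The point is not the specific constants but the shape: cost is a function of agent behavior, so any shift in browsing depth silently reprices the whole workload.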
Scenario 2: Dataset enrichment / Find all entities
You’re using a “find all…” style agent to:
- Discover all vendors in a category;
- Enrich each with metadata (pricing, features, URLs, etc.);
- Produce a structured JSON dataset.
Per-request pricing
With a system like Parallel’s FindAll + Task APIs, you might:
- Pay a fixed amount per FindAll objective (e.g., “Find all SOC2-compliant AI search providers”);
- Pay a fixed amount per Task-based enrichment request for each entity.
If one FindAll run produces 200 entities:
- FindAll call = 1 request;
- 200 Task calls (one per entity) to enrich with fields like “pricing page URL,” “SOC2 status,” “API latency band,” each output carrying citations and confidence.
Your economics:
- Cost per dataset row is deterministic:
- Cost per row = (FindAll cost / number of rows) + Task cost per row.
- You can safely run large crawls, because:
- Per-request step function: doubling entities doubles cost—no surprises;
- You can choose cheaper processors (Lite/Base) for lightweight fields and more expensive processors only where needed.
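The per-row economics can be sketched directly. The FindAll flat cost and Task CPM below are hypothetical placeholders:

```python
# Hypothetical prices: one flat FindAll objective plus one Task call per entity.
FINDALL_COST = 5.00  # USD per objective (assumed)
TASK_CPM = 10.00     # USD per 1,000 enrichment requests (assumed)

def cost_per_row(rows: int) -> float:
    """FindAll cost amortized across rows, plus one Task call per row."""
    return FINDALL_COST / rows + TASK_CPM / 1000

run_200 = 200 * cost_per_row(200)  # total for a 200-entity run
run_400 = 400 * cost_per_row(400)  # doubling entities doubles the Task term
```

The Task term scales linearly with entity count while the FindAll term is fixed, so a dataset's budget can be quoted before the crawl starts.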
Token-based pricing
Under a token-based “autonomous browsing” agent:
- Cost depends on:
- How many candidate websites the agent explores;
- How deep each crawl goes;
- How verbose the internal reasoning and summarization is;
- Number and length of extraction prompts.
Again, you can cap the number of pages or tokens, but then you’re trading off recall and accuracy against cost with much fuzzier control.
From an economic modeling standpoint:
- It’s much harder to say “this dataset will cost $500” upfront;
- You’re more likely to either:
- Over-crawl and blow the budget;
- Or over-constrain the agent and miss entities/fields.
Mixed models: when token-based pricing makes sense
To be clear: token-based pricing is perfectly reasonable for LLM inference. Where it gets painful is when the same token meter is used for web retrieval and browsing.
A pattern I’ve seen work well:
- Use per-request pricing for web intelligence:
- Search, Extract, Task, FindAll, Monitor;
- Each with a known CPM and latency band.
- Use token-based pricing for:
- The final answer generation;
- Internal reasoning steps that don’t massively scale with page length or web depth.
This hybrid approach gives you:
- Predictable retrieval costs (the biggest variable in web-grounded tasks);
- Flexible model choices for generation, where token variance is smaller relative to the total bill.
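The hybrid split is easy to express as a two-line bill: a fixed retrieval term driven by call counts, and a token-metered generation term. Prices are illustrative assumptions:

```python
# Hybrid bill for one user task; all prices are illustrative assumptions.
SEARCH_CPM, EXTRACT_CPM = 1.00, 2.00  # per-request retrieval, USD per 1,000
GEN_PRICE_PER_1K_TOKENS = 0.01        # token-metered generation, USD

def task_bill(searches: int, extracts: int, gen_tokens: int) -> dict:
    """Split one task's cost into a fixed retrieval term and a variable generation term."""
    retrieval = searches * SEARCH_CPM / 1000 + extracts * EXTRACT_CPM / 1000
    generation = gen_tokens * GEN_PRICE_PER_1K_TOKENS / 1000
    return {"retrieval": retrieval, "generation": generation}

bill = task_bill(searches=3, extracts=5, gen_tokens=800)
# Only the generation term varies with model verbosity; retrieval is
# pinned by call counts, so the bill's dominant term is deterministic.
```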
Design principles for forecastable web-grounded agents
If you care about unit economics and forecasting, your retrieval stack should satisfy these constraints:
- Predictable per-call cost
  - Retrieval APIs should be per-request priced, not per-token, especially for:
    - Web search;
    - Web extraction;
    - Deep research / enrichment tasks;
    - Monitoring jobs.
  - You want to budget in requests, not tokens.
- Composable processor tiers
  - You need a way to allocate more compute to complex tasks without blowing up unit economics:
    - Fast, cheap processors for lightweight lookups;
    - Slower, more expensive processors for deep research.
  - Each tier should come with:
    - A clear CPM;
    - A latency band (seconds vs minutes);
    - A known accuracy profile (ideally with benchmarks).
- Separation of retrieval and generation costs
  - Retrieval (search/extract/research) should not depend on:
    - How verbose the model’s final answer is;
    - How many internal reasoning tokens the agent burns.
  - Generation can stay token-based; retrieval should be query-based.
- Evidence-based outputs by default
  - Each retrieval call should return:
    - Citations (URLs, anchors);
    - Rationale for why a fact was extracted or an entity was matched;
    - Calibrated confidence per field.
  - This lets you:
    - Programmatically reject low-confidence fields;
    - Avoid extra LLM passes for manual validation (which would otherwise add token costs).
- Rate limits aligned with economics
  - Rate limits (requests/sec, requests/day) and per-request pricing should be aligned:
    - At max throughput, you should be able to compute a hard upper bound on daily spend.
    - For example: “At 100 requests/sec, with $X CPM, max daily spend is $Y.”
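The rate-limit bound is a one-line calculation: the limit caps requests per day, and the CPM converts that cap into dollars. A sketch with a hypothetical $2 CPM:

```python
# Hard upper bound on daily spend, derived from the rate limit alone.
def max_daily_spend(requests_per_sec: float, cpm_usd: float) -> float:
    """Worst-case daily cost: run at the rate limit for all 86,400 seconds."""
    max_requests = requests_per_sec * 86_400  # seconds per day
    return max_requests * cpm_usd / 1000

# "At 100 requests/sec, with a (hypothetical) $2 CPM, max daily spend is $17,280."
bound = max_daily_spend(100, 2.00)
```

No token-metered stack offers an equivalent bound, because tokens per request are unbounded by the rate limit.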
How Parallel approaches per-request economics
Parallel is built explicitly around per-request, tiered pricing for web intelligence:
- Search:
- Returns ranked URLs + token-dense compressed excerpts in <5s.
- Priced per request; you know the cost of every search tool call your agent makes.
- Extract:
- Returns full page contents + compressed excerpts.
- Cached responses ~1–3s, live crawling ~60–90s.
- Again, per request, not per token.
- Task:
- Asynchronous deep research and enrichment (5s–30min).
- Outputs structured JSON conforming to your schema, with field-level citations, reasoning, and confidence via the Basis framework.
- FindAll:
- “Find all entities that match this objective” into a dataset.
- 10–60 minutes typical, depending on scope.
- Per-request pricing makes “cost per dataset row” modelable.
- Monitor:
- “Monitor any event on the web,” emitting new events with citations when changes occur.
- You pay per monitored request and per event, not per token of every change summary.
Economically, this means:
- You can design your agent’s tools as a cost graph:
- “This tool calls Search once and Task twice.”
- “This monitoring workflow calls Monitor for 10k entities.”
- You can forecast:
- Cost per user interaction;
- Cost per research job;
- Cost per monitored entity per month.
- You can confidently scale to millions of daily requests without discovering afterward that a handful of complex tasks consumed disproportionate spend.
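The "cost graph" idea can be sketched as a mapping from tools to per-request calls, so any workflow's cost falls out of a lookup. Tool names and CPMs here are hypothetical:

```python
# An agent's tools as a cost graph: each tool is a bag of per-request calls.
# Tool names and CPMs are hypothetical, for illustration only.
CPM = {"search": 1.00, "task": 25.00, "monitor": 5.00}

TOOLS = {
    "answer_with_citations": {"search": 1, "task": 2},
    "watch_entity": {"monitor": 1},
}

def tool_cost(tool: str) -> float:
    """Cost of one invocation of a tool: sum of its per-request line items."""
    return sum(n * CPM[api] / 1000 for api, n in TOOLS[tool].items())

interaction = tool_cost("answer_with_citations")  # cost per user interaction
monitoring = 10_000 * tool_cost("watch_entity")   # 10k monitored entities
```

Every forecast in the list above is then a product of a tool cost and a volume, which is what makes scale planning a spreadsheet exercise rather than a postmortem.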
Parallel publishes benchmarks (HLE, BrowseComp, DeepResearch Bench, RACER, WISER-Atomic, WISER-FindAll) to show accuracy across the cost vs latency Pareto frontier, and offers SOC 2 Type 2 controls for teams with regulatory constraints.
When to choose per-request vs token-based pricing
Use this as a decision heuristic for web-grounded systems:
- Choose per-request retrieval when:
  - You’re building long-lived or high-volume agents;
  - You need predictable spending across highly variable queries;
  - You care about field-level provenance (citations, rationale, confidence);
  - You expect to hit non-trivial rate limits and need per-request economics.
- Accept token-based retrieval when:
  - Your browsing workloads are small and manually monitored;
  - Variance in cost is acceptable;
  - You’re experimenting, not yet operating at scale.
In production, most teams end up discovering the hard way that retrieval variance dominates their cost model. Moving that part of the stack to per-request, processor-tiered APIs is usually where predictability returns.
Final verdict
For web-grounded agents, per-request pricing wins on unit economics and forecasting. It lets you:
- Model cost per query, per research task, or per dataset row;
- Control cost vs depth explicitly via processor tiers;
- Put hard upper bounds on daily and monthly spend.
Token-based pricing remains useful for LLM generation and reasoning, but tying your web retrieval and browsing directly to tokens leads to unpredictable, sometimes runaway costs—especially as agents become more autonomous.
If you’re designing agents to treat the web as their second native environment, structure your retrieval layer around per-request, evidence-based APIs and keep token meters focused on generation where variance is easier to manage.