
Parallel Chat API: how do I use the OpenAI-compatible streaming endpoint with web grounding and citations?
Most teams trying to add “web-aware” streaming chat to their agents hit the same wall: you can stream tokens, or you can ground on the web with citations, but not both in a clean, OpenAI-compatible way. Parallel’s Chat API is designed to close that gap: it gives you OpenAI-style /v1/chat/completions and SSE streaming, but the underlying response is grounded on Parallel’s AI-native web index with citations and rationale attached.
This guide walks through how to use the OpenAI-compatible streaming endpoint, how web grounding is actually wired in, and how to surface citations in your UI or downstream agent logic.
At-a-Glance: What You Get with Parallel Chat
Parallel Chat is built for agents, not humans skimming a SERP. When you call the OpenAI-compatible /v1/chat/completions endpoint with streaming enabled, you get:
- Web-grounded answers by default (when you use Parallel’s tools), backed by:
  - Live crawling plus an AI-native web index
  - Token-dense compressed excerpts instead of snippet-style search results
- OpenAI-compatible interface:
  - Same HTTP method, path, and core request schema (model, messages, stream)
  - SSE streaming format that follows data: { "id": ..., "choices": [...] }
- Citations and rationale via Basis:
  - Each grounded fact is backed by one or more URLs and excerpts
  - Parallel attaches provenance and calibrated confidence so you can trace every atomic fact
Think of this as “ChatGPT-style streaming,” but where the content is produced on top of Parallel Search / Extract / Task rather than opaque browsing + summarization.
Core Concepts: Web Grounding and Basis in Chat
Before touching code, it helps to be explicit about the mechanics.
Web grounding
In Parallel, “web grounding” is not just “the model can browse.” Instead, the Chat Processor can:
- Call Parallel Search to retrieve ranked URLs plus compressed excerpts (in <5 seconds)
- Use Extract and live crawling to pull full page contents when needed
- Feed that high-density context into the model with a fixed, predictable retrieval cost
You can either:
- Let the system handle retrieval automatically (recommended for most chat use cases), or
- Wire your own tool schema / MCP-style tools that call Search/Extract/Task from your agent loop.
Basis: citations, rationale, confidence
Parallel’s Basis framework shows up in Chat in two main ways:
- Citations: URLs and short excerpts tied to specific claims
- Rationale and confidence: model-visible reasoning plus calibrated confidence scores for each atomic fact
You can surface this in your UI as “Sources” under an answer or use it programmatically to:
- Reject low-confidence facts
- Ask the model to re-verify a specific field
- Log provenance for regulated workflows
Endpoint Overview: OpenAI-Compatible Streaming
Parallel exposes a chat endpoint that mirrors OpenAI’s standard:
POST https://api.parallel.ai/v1/chat/completions
Authorization: Bearer YOUR_PARALLEL_API_KEY
Content-Type: application/json
Accept: text/event-stream
Key points:
- Path: /v1/chat/completions
- Method: POST
- Streaming: set stream: true in the JSON body
- Response: Server-Sent Events (SSE) with data: ... chunks, plus a final [DONE] sentinel
If you’re already using the OpenAI SDKs, you can often swap the base URL and API key, then enable Parallel-specific configuration.
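As a rough sketch of the request described above, the helper below assembles the URL, headers, and body for a streaming call. The path and headers come from this section; `buildChatRequest`, `ChatMessage`, and `ChatRequest` are hypothetical names introduced for illustration, and the default model name is only the example used in this guide.

```typescript
// Hypothetical helper (not part of any SDK): builds a streaming chat request.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

interface ChatRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildChatRequest(
  apiKey: string,
  messages: ChatMessage[],
  model = "parallel-chat-core", // example name; check the current model list
  baseUrl = "https://api.parallel.ai"
): ChatRequest {
  if (apiKey.trim() === "") {
    throw new Error("Missing API key");
  }
  return {
    // Strip trailing slashes so the path is not doubled up
    url: baseUrl.replace(/\/+$/, "") + "/v1/chat/completions",
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
      Accept: "text/event-stream",
    },
    // stream: true asks for Server-Sent Events instead of one JSON body
    body: JSON.stringify({ model, stream: true, messages }),
  };
}

const req = buildChatRequest("test-key", [{ role: "user", content: "Hello" }]);
console.log(req.url); // https://api.parallel.ai/v1/chat/completions
```

The result can be handed to `fetch(req.url, req)`. Keeping request construction separate from the network call makes it easy to unit-test without hitting the API.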
Minimal Streaming Example (with Web Grounding Enabled)
Assume you want an answer grounded on the current web with citations.
Basic request payload
{
  "model": "parallel-chat-core",
  "stream": true,
  "messages": [
    { "role": "system", "content": "You are a helpful assistant that always cites web sources." },
    { "role": "user", "content": "Summarize the latest research on retrieval-augmented generation evaluation, with citations." }
  ],
  "parallel": {
    "web_grounding": {
      "enabled": true,
      "max_search_calls": 3
    },
    "citations": {
      "enabled": true
    }
  }
}
Notes:
- model: pick a model hosted by Parallel for chat (e.g., parallel-chat-core, parallel-chat-pro). The exact names will track Parallel’s current offerings.
- parallel.web_grounding.enabled: tells the Processor to call Search/Extract behind the scenes.
- parallel.citations.enabled: exposes Basis-style citations in the response metadata.
Node.js streaming example
// Requires Node.js 18+, whose built-in fetch exposes response.body as a web
// ReadableStream. node-fetch returns a Node.js stream without getReader(),
// so it is not imported here.
async function run() {
  const response = await fetch("https://api.parallel.ai/v1/chat/completions", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.PARALLEL_API_KEY}`,
      "Content-Type": "application/json",
      "Accept": "text/event-stream"
    },
    body: JSON.stringify({
      model: "parallel-chat-core",
      stream: true,
      messages: [
        {
          role: "system",
          content: "You are a research assistant that always cites web sources."
        },
        {
          role: "user",
          content: "Compare Parallel’s web retrieval to Tavily and Exa, and include citations."
        }
      ],
      parallel: {
        web_grounding: { enabled: true },
        citations: { enabled: true }
      }
    })
  });

  if (!response.ok || !response.body) {
    console.error(`HTTP ${response.status}: ${await response.text()}`);
    return;
  }

  const decoder = new TextDecoder();
  const reader = response.body.getReader();
  let buffer = "";

  while (true) {
    const { value, done } = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, { stream: true });

    // SSE events are separated by a blank line; the last piece may be a
    // partial event, so keep it in the buffer until the next read.
    const parts = buffer.split("\n\n");
    buffer = parts.pop() || "";

    for (const part of parts) {
      if (!part.startsWith("data:")) continue;
      const data = part.slice(5).trim();

      if (data === "[DONE]") {
        console.log("\n[stream complete]");
        return;
      }

      const chunk = JSON.parse(data);
      const delta = chunk.choices?.[0]?.delta?.content || "";
      process.stdout.write(delta);
    }
  }
}

run().catch(console.error);
This will stream the assistant’s text content token-by-token. Citations and Basis metadata are delivered alongside the main content (explained below).
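Network reads do not line up with SSE event boundaries, so a single event can arrive split across two reads. The sketch below isolates that buffering logic into a small class that works on already-decoded text. It follows general SSE framing (events end with a blank line, payloads sit in `data:` fields) rather than anything Parallel-specific, and `SseParser` is a hypothetical name.

```typescript
// Minimal incremental SSE parser (illustrative only).
// Feed it decoded text as it arrives; it returns the data payloads of every
// event completed so far and keeps any partial event buffered.
class SseParser {
  private buffer = "";

  push(text: string): string[] {
    // Normalize CRLF so events always end with "\n\n". A lone trailing "\r"
    // stays in the buffer and is normalized once its "\n" arrives.
    this.buffer = (this.buffer + text).replace(/\r\n/g, "\n");

    const events = this.buffer.split("\n\n");
    this.buffer = events.pop() ?? ""; // last piece may be incomplete

    const payloads: string[] = [];
    for (const event of events) {
      const dataLines = event
        .split("\n")
        .filter((line) => /^data:/.test(line))
        // Drop the "data:" prefix and the single optional space after it
        .map((line) => line.slice(5).replace(/^ /, ""));
      if (dataLines.length > 0) {
        payloads.push(dataLines.join("\n"));
      }
    }
    return payloads;
  }
}

const parser = new SseParser();
const first = parser.push('data: {"choices":[{"delta":{"content":"Hel');
const second = parser.push('lo"}}]}\n\ndata: [DONE]\n\n');
console.log(first.length);  // 0 (the event is not complete yet)
console.log(second.length); // 2 (the JSON chunk, then the [DONE] sentinel)
```

Each returned payload is either the `[DONE]` sentinel or a JSON string that can be passed to `JSON.parse`.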
Surfacing Citations from Streaming Responses
Parallel’s OpenAI-compatible streaming returns chunks with the familiar structure:
{
  "id": "chatcmpl-123",
  "object": "chat.completion.chunk",
  "created": 1710000000,
  "model": "parallel-chat-core",
  "choices": [
    {
      "index": 0,
      "delta": {
        "role": "assistant",
        "content": "Parallel’s web retrieval benchmarks show..."
      },
      "finish_reason": null
    }
  ]
}
When citations and Basis metadata are enabled, you’ll see additional fields in:
- The final non-streaming object (if you request a non-streaming completion), or
- A final summary event / ancillary metadata structure, depending on your configuration.
A typical pattern:
{
  "parallel": {
    "basis": {
      "facts": [
        {
          "id": "fact-1",
          "claim": "Parallel achieves higher recall than Tavily and Exa at comparable CPM on DeepResearch Bench.",
          "citations": [
            {
              "url": "https://parallel.ai/benchmarks/deepresearch",
              "excerpt": "Parallel outperforms Tavily and Exa across recall and answer accuracy at each cost tier.",
              "confidence": 0.92
            }
          ],
          "confidence": 0.9,
          "rationale": "Benchmarked on DeepResearch Bench with judge model GPT-4, testing window Jan–Feb 2026."
        }
      ]
    }
  }
}
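To turn a payload like the one above into a "Sources" list, something along these lines may be enough. The interfaces mirror the example payload only; which fields are optional is an assumption, and `formatSources` is a hypothetical helper.

```typescript
// Types inferred from the example payload; treat optionality as an assumption.
interface Citation {
  url: string;
  excerpt?: string;
  confidence?: number;
}

interface BasisFact {
  id: string;
  claim: string;
  citations: Citation[];
  confidence: number;
  rationale?: string;
}

interface Basis {
  facts: BasisFact[];
}

// Collects each cited URL once, in first-seen order, and numbers it.
function formatSources(basis: Basis): string[] {
  const urls: string[] = [];
  for (const fact of basis.facts) {
    for (const citation of fact.citations) {
      if (urls.indexOf(citation.url) === -1) {
        urls.push(citation.url);
      }
    }
  }
  return urls.map((url, i) => `[${i + 1}] ${url}`);
}

const exampleBasis: Basis = {
  facts: [
    {
      id: "fact-1",
      claim: "Example claim",
      citations: [{ url: "https://parallel.ai/benchmarks/deepresearch", confidence: 0.92 }],
      confidence: 0.9,
    },
  ],
};
console.log(formatSources(exampleBasis).join("\n"));
// [1] https://parallel.ai/benchmarks/deepresearch
```

De-duplicating by URL keeps the list short when several facts cite the same page.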
Practical integration patterns
You have two main options:
- Render text as it streams, attach sources at the end
  - Stream delta.content into your chat UI.
  - Once the stream completes, fetch the parallel.basis block from the final payload and render the citations under the message (“Sources: …”).
- Programmatic fact-level handling
  - After completion, iterate parallel.basis.facts and:
    - Drop or flag facts below a confidence threshold (e.g., < 0.6)
    - Store id, claim, citations in your own datastore for audits
    - Allow users to click “Why?” and display rationale
Because streaming chunks are optimized for incremental text, Basis metadata is typically best consumed after the stream completes, not per token.
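The fact-level handling described above could be sketched as a simple partition step run once the stream has finished. The 0.6 default is the illustrative threshold from the text, not a recommended value, and `partitionByConfidence` is a hypothetical helper.

```typescript
interface ScoredFact {
  id: string;
  claim: string;
  confidence: number;
}

// Splits facts into those at or above the threshold and those below it.
function partitionByConfidence<T extends { confidence: number }>(
  facts: T[],
  threshold = 0.6
): { kept: T[]; flagged: T[] } {
  const kept: T[] = [];
  const flagged: T[] = [];
  for (const fact of facts) {
    (fact.confidence < threshold ? flagged : kept).push(fact);
  }
  return { kept, flagged };
}

const facts: ScoredFact[] = [
  { id: "fact-1", claim: "Example claim A", confidence: 0.9 },
  { id: "fact-2", claim: "Example claim B", confidence: 0.4 },
];
const { kept, flagged } = partitionByConfidence(facts);
console.log(kept.map((f) => f.id).join(", "));    // fact-1
console.log(flagged.map((f) => f.id).join(", ")); // fact-2
```

Flagged facts can then be dropped, shown with a warning, or sent back to the model for re-verification, depending on the workflow.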
Controlling Web Grounding Behavior
You can tune how aggressively Parallel pulls web context via the parallel.web_grounding block.
Example configuration
"parallel": {
  "web_grounding": {
    "enabled": true,
    "search": {
      "max_calls": 3,
      "processor": "search-core",
      "latency_budget_seconds": 5
    },
    "extract": {
      "max_pages": 5,
      "processor": "extract-base"
    }
  },
  "citations": {
    "enabled": true,
    "include_excerpts": true
  }
}
Common tuning levers:
- search.max_calls
  - Fewer calls → lower cost, lower recall
  - More calls → deeper coverage, more tokens in context
- Processor selection (search-core, extract-base, etc.)
  - Parallel’s Processor architecture lets you trade latency vs depth:
    - Lite/Base → faster, cheaper, shallower
    - Core/Pro/Ultra → richer multi-hop reasoning, more context, higher CPM
- Latency budget
  - Set latency_budget_seconds to keep search under a known ceiling (e.g., 5s) while streaming.
This lets you treat web grounding as a bounded, predictable component of your chat runtime, rather than an open-ended browsing session.
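One way to keep these levers consistent across call sites is to derive the `parallel` block from a small set of named presets. In the sketch below the field names follow the example configuration above, but the preset names, the specific numbers, and the `groundingConfig` helper are all assumptions made for illustration.

```typescript
type Depth = "quick" | "deep";

interface GroundingConfig {
  web_grounding: {
    enabled: boolean;
    search: { max_calls: number; processor: string; latency_budget_seconds: number };
    extract: { max_pages: number; processor: string };
  };
  citations: { enabled: boolean; include_excerpts: boolean };
}

// Hypothetical presets: "quick" favors latency and cost, "deep" favors coverage.
function groundingConfig(depth: Depth): GroundingConfig {
  const deep = depth === "deep";
  return {
    web_grounding: {
      enabled: true,
      search: {
        max_calls: deep ? 3 : 1,
        processor: deep ? "search-pro" : "search-lite",
        latency_budget_seconds: 5,
      },
      extract: { max_pages: deep ? 5 : 1, processor: "extract-base" },
    },
    citations: { enabled: true, include_excerpts: deep },
  };
}

// The result is meant to be placed under the "parallel" key of the request body.
console.log(JSON.stringify({ parallel: groundingConfig("quick") }, null, 2));
```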
Using Chat with Your Own Agent Loop and Tools
If you’re running a more complex agent with tools (MCP tools, function calling, etc.), you can still keep web grounding inside Parallel instead of building your own search → scrape → parse → re-rank stack.
Two common patterns:
1. Chat as your main orchestrator
Let Parallel Chat call web tools internally:
- parallel.web_grounding.enabled: true
- Keep your external tools focused on non-web tasks (e.g., DB reads, internal APIs)
- Chat produces an answer plus Basis citations with minimal tool orchestration on your side
2. External agent orchestrating Parallel APIs directly
Use Parallel Chat strictly as a “reasoning and synthesis” layer:
- Your orchestrator calls Parallel Search directly (/v1/search) to fetch URLs + compressed excerpts.
- Optionally call Extract to expand specific URLs.
- Call Chat with:
  - stream: true
  - messages containing:
    - User question
    - A system or tool message containing the retrieved snippets
- Use Chat purely for streaming synthesis and answer generation, with citations coming from the evidence you’ve injected.
This second pattern is useful when you want explicit control over which sources the model sees.
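For the second pattern, the injected evidence can be numbered so the model has something concrete to cite. The sketch below assumes your orchestrator has already collected URL and excerpt pairs. The `Snippet` shape, the prompt wording, and `buildEvidenceMessages` are illustrative choices, not a documented format.

```typescript
interface Snippet {
  url: string;
  excerpt: string;
}

type EvidenceMessage = { role: "system" | "user"; content: string };

// Builds a messages array where the system message carries numbered sources.
function buildEvidenceMessages(question: string, snippets: Snippet[]): EvidenceMessage[] {
  const evidence = snippets
    .map((s, i) => `[${i + 1}] ${s.url}\n${s.excerpt.trim()}`)
    .join("\n\n");
  return [
    {
      role: "system",
      content:
        "Answer using only the numbered sources below. " +
        "Cite them inline as [n].\n\n" +
        evidence,
    },
    { role: "user", content: question },
  ];
}

const evidenceMessages = buildEvidenceMessages("What changed in the latest release?", [
  { url: "https://example.com/changelog", excerpt: "  Added streaming support. " },
  { url: "https://example.com/blog", excerpt: "The release focuses on latency." },
]);
console.log(evidenceMessages[0].content);
```

Because the source numbers are assigned in your code, a `[2]` in the answer can be mapped back to its URL without relying on the model to reproduce links correctly.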
Example: Building a Web-Grounded QA Endpoint
A typical production flow looks like:
- Frontend sends question to your backend.
- Backend calls Parallel Chat streaming endpoint with web_grounding.enabled: true.
- Backend streams tokens to the client and captures the final Basis metadata.
- Backend stores:
  - Question
  - Final answer
  - Basis facts and citations
  - Confidence scores
Pseudocode (TypeScript-style)
type ChatResult = {
  text: string;
  basis?: any;
};

async function webGroundedChat(question: string): Promise<ChatResult> {
  const response = await fetch("https://api.parallel.ai/v1/chat/completions", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.PARALLEL_API_KEY}`,
      "Content-Type": "application/json",
      "Accept": "text/event-stream"
    },
    body: JSON.stringify({
      model: "parallel-chat-core",
      stream: true,
      messages: [
        { role: "system", content: "You are an AI assistant that answers with citations." },
        { role: "user", content: question }
      ],
      parallel: {
        web_grounding: { enabled: true },
        citations: { enabled: true }
      }
    })
  });

  // Fail fast on HTTP errors instead of parsing an error body as SSE
  if (!response.ok || !response.body) {
    throw new Error(`HTTP ${response.status}: ${await response.text()}`);
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();

  let buffer = "";
  let fullText = "";
  let finalBasis: any = null;
  let finished = false;

  while (!finished) {
    const { value, done } = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, { stream: true });
    const parts = buffer.split("\n\n");
    buffer = parts.pop() || "";

    for (const part of parts) {
      if (!part.startsWith("data:")) continue;
      const data = part.slice(5).trim();

      if (data === "[DONE]") {
        // A bare break only exits this for loop; the flag also stops the read loop
        finished = true;
        break;
      }

      const chunk = JSON.parse(data);
      const choice = chunk.choices?.[0];

      // Stream text
      if (choice?.delta?.content) {
        const deltaText = choice.delta.content;
        fullText += deltaText;
        // optionally forward deltaText to client
      }

      // Capture Basis metadata if this chunk contains it
      if (chunk.parallel?.basis) {
        finalBasis = chunk.parallel.basis;
      }
    }
  }

  return { text: fullText, basis: finalBasis };
}
You can then:
- Render text in your chat UI
- Render citations from basis.facts[].citations
- Use basis.facts[].confidence for QA thresholds
Latency and Cost Considerations
Because this endpoint is built for agents in production, you should design around known bands:
- Search: typically under 5 seconds per call
- Extract: cached pages ~1–3 seconds; live crawl ~60–90 seconds worst case
- Chat completion: model-dependent, plus retrieval time
With the Processor architecture:
- You allocate compute based on task:
  - e.g., search-lite + chat-base for quick Q&A
  - or search-pro + chat-ultra for deep research
- You pay per request, not per token, which makes CPM predictable across runs.
When web grounding is enabled, you can bound total retrieval cost by:
- Limiting search calls (max_calls)
- Restricting extract pages
- Fixing processor tiers
This is critical if you’re coming from OpenAI-style “browsing + summarization” stacks where both time and spend are hard to forecast.
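The latency bands above can be turned into a back-of-the-envelope upper bound for a given retrieval plan. The sketch assumes every step runs sequentially and uses the top of each band (5 s per search call, 3 s per cached page, 90 s per live crawl). Search is only described as typically under 5 seconds, so these are planning numbers rather than guarantees, and `worstCaseRetrievalSeconds` is a hypothetical helper.

```typescript
// Upper ends of the latency bands listed in this section, in seconds.
const WORST_CASE_SECONDS = {
  searchCall: 5,
  cachedExtractPage: 3,
  liveExtractPage: 90,
};

interface RetrievalPlan {
  searchCalls: number;
  cachedPages: number; // pages expected to be served from cache
  livePages: number; // pages expected to need a live crawl
}

// Sequential execution is assumed, which yields an upper bound;
// any parallelism in retrieval would finish sooner.
function worstCaseRetrievalSeconds(plan: RetrievalPlan): number {
  return (
    plan.searchCalls * WORST_CASE_SECONDS.searchCall +
    plan.cachedPages * WORST_CASE_SECONDS.cachedExtractPage +
    plan.livePages * WORST_CASE_SECONDS.liveExtractPage
  );
}

console.log(worstCaseRetrievalSeconds({ searchCalls: 3, cachedPages: 5, livePages: 0 })); // 30
console.log(worstCaseRetrievalSeconds({ searchCalls: 3, cachedPages: 4, livePages: 1 })); // 117
```

The second example shows why live crawls dominate the budget: one uncached page adds more time than all the search calls and cached pages combined.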
Debugging and Verifying Grounding
To ensure your streaming chat is actually web-grounded and verifiable:
- Inspect Basis output in dev
  - Log parallel.basis for each request during development.
  - Check that citations are from expected domains and carry sensible confidence scores.
- Add a “show sources” toggle in your UI
  - Let users expand a panel showing:
    - Fact text
    - Source URLs
    - Excerpts and rationale
- Monitor recall via spot-check tasks
  - Periodically send evaluation prompts (e.g., from DeepResearch Bench or WISER-Atomic style tasks).
  - Verify that the cited sources match the facts and that low-confidence items are appropriately flagged.
- Set confidence-based policies
  - Example: if any critical field has confidence < 0.7, require a second pass or human review.
This keeps your streaming chat not just fast and web-aware, but auditable.
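The "expected domains" check from the list above is easy to automate in development. A minimal sketch, assuming citations carry a `url` field as in the example payload earlier in this guide; the allowlist and the `unexpectedCitations` helper are hypothetical.

```typescript
interface CitedFact {
  citations: { url: string }[];
}

// Returns the lowercase hostname, or null when the string is not a valid URL.
function hostOf(url: string): string | null {
  try {
    return new URL(url).hostname.toLowerCase();
  } catch {
    return null;
  }
}

// A host is allowed when it equals an allowed domain or is a subdomain of one.
function isAllowedHost(host: string, allowedDomains: string[]): boolean {
  return allowedDomains.some((domain) => {
    const suffix = "." + domain;
    return host === domain || host.slice(-suffix.length) === suffix;
  });
}

// Lists every cited URL that is malformed or falls outside the allowlist.
function unexpectedCitations(facts: CitedFact[], allowedDomains: string[]): string[] {
  const unexpected: string[] = [];
  for (const fact of facts) {
    for (const citation of fact.citations) {
      const host = hostOf(citation.url);
      if (host === null || !isAllowedHost(host, allowedDomains)) {
        unexpected.push(citation.url);
      }
    }
  }
  return unexpected;
}

const offList = unexpectedCitations(
  [{ citations: [{ url: "https://docs.parallel.ai/a" }, { url: "https://example.org/b" }] }],
  ["parallel.ai"]
);
console.log(offList.join(", ")); // https://example.org/b
```

Matching on a leading dot prevents a lookalike host such as `evilparallel.ai` from passing as a subdomain of `parallel.ai`.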
Final Verdict
If you need streaming chat that is both OpenAI-compatible and genuinely grounded on the web with citations, Parallel’s Chat API is designed for exactly that. You keep the ergonomics of /v1/chat/completions and SSE, but move onto an AI-native web stack where:
- Retrieval is programmatic and bounded by clear cost/latency parameters.
- Answers carry Basis metadata (citations, rationale, and confidence) for every atomic fact.
- You avoid maintaining your own search → crawl → scrape → re-rank pipeline.
Use the parallel.web_grounding and parallel.citations blocks to dial in behavior, stream tokens directly into your UI, and consume Basis metadata once the stream completes to attach verifiable sources.