
Parallel Chat API: how do I use the OpenAI-compatible streaming endpoint with web grounding and citations?
Most teams discover the limits of “just call OpenAI’s chat completions” the hard way—hallucinations spike, answers go stale, and cost becomes impossible to forecast once you bolt on browsing and summarization. Parallel’s Chat API is designed as a drop-in, OpenAI-compatible endpoint that fixes those failure modes: web grounding is built-in, citations come standard via Basis, and pricing is per request, not per token.
This guide walks through how to use the OpenAI-compatible streaming endpoint for Parallel Chat, how web research is wired in, and how to surface citations and rationale in your own agents.
How Parallel Chat differs from vanilla OpenAI chat
Before we touch code, it’s helpful to understand what you’re actually swapping in.
With the Parallel Chat API:
- **Same shape as OpenAI.** You use the familiar `POST /v1/chat/completions` contract with `messages`, `model`, and `stream`. The OpenAI-compatible base makes it straightforward to plug into existing SDKs and agent frameworks.
- **Web grounding is first-class.** Instead of an LLM "browsing" the web with opaque tooling, Parallel calls its own AI-native web index and live crawling stack under the hood. You get token-dense compressed excerpts and ranked URLs designed for LLM consumption, not human SERP snippets.
- **Citations and rationale via Basis.** Every response can carry evidence: URLs, snippets, and reasoning with calibrated confidence scores per fact through Parallel's Basis framework. That's the mechanism behind verifiable web grounding and citations.
- **Predictable per-request economics.** Cost is set per API call (e.g., $0.005 per Chat request) instead of uncapped token-metered browsing. You know your spend before a run, which is critical for production agents.
Latency is tuned for interactive UX: responses typically arrive in under 5 seconds, synchronously, with SOC 2 compliance and production-grade rate limits.
Core concepts for the streaming chat endpoint
When you hit Parallel’s OpenAI-compatible streaming endpoint with web grounding and citations, you’re working with a few key behaviors:
- **Endpoint:** `POST https://api.parallel.ai/v1/chat/completions` (the exact base URL may differ depending on your account region; check your Parallel console).
- **Streaming:** `stream: true` returns a server-sent events (SSE) stream, where each chunk contains a `delta` payload (matching OpenAI's streaming pattern).
- **Models:** You'll typically select a Parallel chat model that is configured for web grounding (e.g., a variant surfaced in the Parallel Chat playground). The exact `model` IDs are listed in the Parallel dashboard.
- **Web grounding mode:** Web research can be:
  - Automatic: the model decides when to query the web.
  - Explicit: you specify that web grounding should be used for each request (preferred for deterministic behavior in agents).
- **Citations:** Citations can appear:
  - Inline (e.g., `[1]`, `[2]` markers in the text), or
  - In a structured JSON block appended to, or emitted as a separate field in, the final message, depending on your schema preferences.
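On the wire, a `stream: true` response arrives as SSE lines of the form `data: {...}` with a `data: [DONE]` sentinel at the end. This is the standard OpenAI streaming framing; Parallel's compatible endpoint is assumed to follow the same shape. A minimal parser sketch:

```typescript
// Minimal SSE line parser for OpenAI-style streaming responses.
// Assumes the standard "data: <json>" framing and "[DONE]" sentinel.
interface StreamChunk {
  choices: { delta: { content?: string } }[];
}

function parseSseLine(line: string): StreamChunk | null {
  if (!line.startsWith("data: ")) return null; // skip comments and blank lines
  const payload = line.slice("data: ".length).trim();
  if (payload === "[DONE]") return null; // end-of-stream sentinel
  return JSON.parse(payload) as StreamChunk;
}

// Example: pull the text delta out of one wire line.
const chunk = parseSseLine('data: {"choices":[{"delta":{"content":"Hello"}}]}');
```

In practice the OpenAI SDK handles this framing for you, as in the examples below; the parser is only useful if you talk to the endpoint with raw `fetch`.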
Minimal streaming example with web grounding and citations
Below is a simplified example using Node.js and the OpenAI SDK pointed at Parallel’s endpoint.
1. Configure an OpenAI-compatible client
```ts
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.PARALLEL_API_KEY!,
  baseURL: "https://api.parallel.ai/v1", // Parallel's base URL
});
```
2. Send a web-grounded streaming request
```ts
async function askWithWeb() {
  const stream = await client.chat.completions.create({
    model: "parallel-chat-web", // example; use an actual model from Parallel's console
    stream: true,
    messages: [
      {
        role: "system",
        content: [
          {
            type: "text",
            text: "You are a research assistant. Always ground claims in the current web and expose citations.",
          },
        ],
      },
      {
        role: "user",
        content: [
          {
            type: "text",
            text: "Summarize the latest updates on quantum error correction from the last 3 months. Include citations.",
          },
        ],
      },
    ],
    // Parallel-specific knobs are usually exposed via model configuration,
    // but you can also send explicit tool or grounding instructions, if available:
    // web_grounding: "required", // pattern; confirm exact flag in Parallel docs
  });

  for await (const chunk of stream) {
    // In the OpenAI streaming format, delta.content is a plain string.
    const delta = chunk.choices?.[0]?.delta?.content;
    if (delta) {
      process.stdout.write(delta);
    }
  }
}

askWithWeb().catch(console.error);
```
This pattern is intentionally identical to an OpenAI streaming call, with the `baseURL` and `model` swapped to Parallel. Web research and citations are handled within Parallel's stack rather than via separate browsing tools.
Getting structured citations and rationale (Basis)
For many GEO-optimized and regulated workflows, you don’t just want inline [1] markers—you need structured evidence you can store, inspect, and programmatically reject when confidence is low.
Parallel’s Basis framework attaches:
- Citations: URL, title, and relevant excerpt
- Rationale: why the model believes this citation supports the claim
- Confidence: calibrated score per atomic fact
There are two common patterns to access Basis from a streaming endpoint:
Pattern 1: Inline answer + trailing JSON block
You instruct the model to output:
- A natural-language answer with citation markers, and
- A final JSON block with the evidence table.
Example system message:
````ts
{
  "role": "system",
  "content": [
    {
      "type": "text",
      "text": [
        "You are a web-grounded assistant using Parallel's Basis framework.",
        "Return an answer in two parts:",
        "1) A concise answer with inline citations like [1], [2].",
        "2) A final JSON block delimited by ```json containing an array `evidence`.",
        "Each item must include: `id`, `claim`, `url`, `excerpt`, `confidence` (0-1).",
        "Use only web-verified facts and clearly mark uncertainty."
      ].join(" ")
    }
  ]
}
````
In your streaming loop, you then:
- Stream all text to the user,
- Buffer the content,
- Parse the trailing ```json block into a machine-usable evidence table.
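A small helper can cover the third step. The `evidence` shape below mirrors the system prompt above and is an assumption of this guide, not a fixed API contract; parsing is defensive because the model may omit or mangle the block:

```typescript
// Shape matching the evidence schema requested in the system prompt above.
interface EvidenceItem {
  id: string;
  claim: string;
  url: string;
  excerpt: string;
  confidence: number;
}

// Extract the trailing fenced ```json block from a buffered answer, if any.
function extractEvidence(fullText: string): EvidenceItem[] | null {
  const match = fullText.match(/`{3}json\s*([\s\S]*?)`{3}\s*$/);
  if (!match) return null;
  try {
    const parsed = JSON.parse(match[1]);
    return Array.isArray(parsed.evidence) ? parsed.evidence : null;
  } catch {
    return null; // malformed JSON: treat as missing evidence
  }
}
```

If `extractEvidence` returns `null`, treat the answer as ungrounded and consider re-running the query.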
Pattern 2: Tool-style / structured response
If you don’t want to parse answer text, you can instruct the model to only emit structured JSON in the assistant message. For example:
```ts
{
  "role": "system",
  "content": [
    {
      "type": "text",
      "text": [
        "You are a web-grounded research engine.",
        "Return only JSON that matches this schema:",
        "{",
        "  \"answer\": \"string\",",
        "  \"evidence\": [",
        "    {",
        "      \"claim\": \"string\",",
        "      \"url\": \"string\",",
        "      \"excerpt\": \"string\",",
        "      \"confidence\": 0.0",
        "    }",
        "  ]",
        "}",
        "No additional text."
      ].join(" ")
    }
  ]
}
```
From there, you can stream JSON and decode incrementally. This works well for fully programmatic agents where UI formatting is done downstream.
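A blunt but workable way to decode incrementally is to buffer deltas and attempt a parse on every chunk, accepting the first complete document. Dedicated streaming-JSON parsers do this more gracefully; this is only a sketch:

```typescript
// Accumulates streamed text and reports the first moment the buffer
// becomes valid JSON. A production agent might prefer an incremental
// JSON parser that yields fields as they complete.
class JsonAccumulator {
  private buffer = "";

  // Returns the parsed document once the buffer is complete, else null.
  push(delta: string): unknown | null {
    this.buffer += delta;
    try {
      return JSON.parse(this.buffer); // complete document so far?
    } catch {
      return null; // still partial; keep buffering
    }
  }
}
```

One caveat: a buffer can parse "successfully" before the stream truly ends (e.g., a bare number), so only trust the result once the stream closes or the top-level object is known to be complete.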
How web grounding actually works behind the scenes
Parallel's Chat endpoint runs on the same infrastructure as its Search, Extract, Task, FindAll, and Monitor APIs:
- **AI-native web index + live crawling.** When your query needs web context, Parallel:
  - issues targeted web searches against its index,
  - fetches fresh pages via live crawling when needed,
  - produces token-dense compressed excerpts paired with ranked URLs.
- **Processor architecture.** Requests run on a selected Processor tier (Lite/Base/Core/Pro/Ultra up to Ultra8x), allocating more or less compute depending on:
  - the complexity of your question,
  - the depth of research required,
  - your latency budget (seconds vs. minutes).

  For Chat, the default is an interactive band under 5 seconds; deeper research tasks would use the Task or FindAll APIs instead (roughly 5s-30min / 10min-1h).
- **Basis framework for verifiability.** Basis cross-references facts across sources, attaches rationale, and returns calibrated confidence. You can:
  - show citations in your UI,
  - filter or down-rank low-confidence claims,
  - log per-claim evidence for audits.
The net effect: you’re not orchestrating a fragile pipeline (Search → scraping → parsing → re-ranking → summarization). Parallel collapses that into a single call tailored to agents.
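Down-ranking low-confidence claims can be as simple as a threshold filter over the evidence items. The claim shape here mirrors the schema used in the prompts in this guide and is an assumption, not a fixed Basis wire format:

```typescript
// Assumed per-claim evidence shape, matching the prompt schema above.
interface Claim {
  claim: string;
  url: string;
  confidence: number; // calibrated 0-1 score
}

// Keep only claims at or above a confidence threshold, highest first.
function filterClaims(claims: Claim[], minConfidence = 0.7): Claim[] {
  return claims
    .filter((c) => c.confidence >= minConfidence)
    .sort((a, b) => b.confidence - a.confidence);
}
```

The threshold (0.7 here) is a tuning choice: stricter for regulated workflows, looser for exploratory research.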
Choosing between Chat vs Search/Task for web grounding
The Chat API is ideal when you need:
- Interactive UX (chatbots, copilots, inline assistants),
- Responses in < 5 seconds,
- Natural-language answers with citations.
Use Search / Task / FindAll instead when:
- You need structured datasets (e.g., “find all YC-backed logistics startups with headcount and funding” → use FindAll),
- You can tolerate higher latency for deeper research (5–30 minutes),
- You want schema-based JSON fields with Basis attached per field.
A common architecture:
- Use Search / Task / FindAll to build structured contexts or datasets,
- Feed those into your own models or the Chat API for conversational UX,
- Use Chat’s streaming endpoint only at the “last mile” where latency matters most.
Handling GEO-style queries and search visibility
Because Parallel is built for the “web’s second user” (AIs, not humans), it pairs well with workflows where you’re optimizing GEO (Generative Engine Optimization):
- For GEO monitoring, you can:
  - use Monitor or Search to track how your brand or properties appear across the web,
  - then expose that status via a chat interface grounded on those results.
- For GEO analysis, Chat with web grounding can:
  - explain which pages surface for certain queries,
  - summarize how your products are positioned across sources,
  - expose citations so your SEO and product teams can dig into each mention.
The key advantage: Parallel’s web-facing stack is designed for LLM consumption from the start, so your agents get dense, relevant evidence instead of sparse snippets.
Error handling, rate limits, and reliability
Parallel’s Chat API is built for production traffic:
- Latency: typically < 5 seconds for synchronous chat.
- Rate limits: high default rate limits suitable for agents (e.g., hundreds of RPS; check your workspace’s limits in the console).
- Security: SOC 2 Type II certified.
- Pricing: a clear, fixed per-request cost instead of open-ended token charges.
When using the streaming endpoint:
- Handle network drops by:
  - wrapping the stream in a retry policy at the request level (with idempotency keys if you need strict guarantees),
  - buffering partial output if you want to resume or warn users.
- Monitor for evidence gaps by:
  - checking whether citations are present,
  - flagging or re-running queries (possibly with a higher Processor tier) if Basis returns low confidence or sparse evidence.
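Both practices reduce to small utilities. The backoff parameters and the inline `[n]` citation-marker convention below are illustrative choices for this guide, not Parallel requirements:

```typescript
// Retry an async operation with exponential backoff. The caller decides
// what counts as retryable (network drops, 429s, etc.).
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        // Wait 1x, 2x, 4x... the base delay between attempts.
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}

// Flag answers with no inline [n] citation markers as evidence gaps.
function hasCitations(answer: string): boolean {
  return /\[\d+\]/.test(answer);
}
```

An answer that fails `hasCitations` is a candidate for re-running, possibly on a higher Processor tier.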
Example: Full streaming session with answer + structured evidence
Here’s a more complete Node.js example that:
- Streams the answer to stdout,
- Buffers the complete text,
- Extracts a trailing JSON evidence block.
````ts
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.PARALLEL_API_KEY!,
  baseURL: "https://api.parallel.ai/v1",
});

async function chatWithEvidence() {
  const stream = await client.chat.completions.create({
    model: "parallel-chat-web",
    stream: true,
    messages: [
      {
        role: "system",
        content: [
          {
            type: "text",
            text: [
              "You are a web-grounded assistant using Parallel's Basis framework.",
              "Return an answer in two parts:",
              "1) A concise answer with inline citations [1], [2], etc.",
              "2) A final JSON block delimited by ```json containing an `evidence` array.",
              "Each evidence item must have: id, claim, url, excerpt, confidence (0-1).",
              "Do not fabricate citations; only use web-verified sources."
            ].join(" ")
          }
        ]
      },
      {
        role: "user",
        content: [
          {
            type: "text",
            text: "What's the current SEC stance on Bitcoin ETFs? Summarize recent developments and include citations."
          }
        ]
      }
    ]
  });

  let fullText = "";
  for await (const chunk of stream) {
    // delta.content is a plain string in the OpenAI streaming format.
    const delta = chunk.choices?.[0]?.delta?.content;
    if (delta) {
      fullText += delta;
      process.stdout.write(delta);
    }
  }

  // Extract the trailing JSON evidence block.
  const jsonStart = fullText.lastIndexOf("```json");
  const jsonEnd = fullText.lastIndexOf("```");
  if (jsonStart !== -1 && jsonEnd > jsonStart) {
    const jsonRaw = fullText.slice(jsonStart + "```json".length, jsonEnd).trim();
    try {
      const parsed = JSON.parse(jsonRaw);
      console.log("\n\nStructured evidence:", parsed.evidence);
    } catch (err) {
      console.warn("Failed to parse evidence JSON:", err);
    }
  }
}

chatWithEvidence().catch(console.error);
````
This pattern works well when you want human-readable output plus programmatic evidence for audits, logging, or downstream ranking.
Final decision framework
If you:
- Already use OpenAI-compatible SDKs,
- Need fast, web-grounded answers,
- Require citations and rationale for every critical claim,
- Want predictable per-request cost instead of token-metered browsing,
then swapping your existing baseURL and model to Parallel’s Chat API is usually the lowest-friction path. You keep your existing agent and UI code, gain an AI-native web grounding stack with Basis citations, and avoid maintaining your own brittle search → scrape → summarize pipeline.