Parallel Chat API: how do I use the OpenAI-compatible streaming endpoint with web grounding and citations?

Most teams trying to add “web-aware” streaming chat to their agents hit the same wall: you can stream tokens, or you can ground on the web with citations, but not both in a clean, OpenAI-compatible way. Parallel’s Chat API is designed to close that gap: it gives you OpenAI-style /v1/chat/completions and SSE streaming, but the underlying response is grounded on Parallel’s AI-native web index with citations and rationale attached.

This guide walks through how to use the OpenAI-compatible streaming endpoint, how web grounding is actually wired in, and how to surface citations in your UI or downstream agent logic.


At-a-Glance: What You Get with Parallel Chat

Parallel Chat is built for agents, not humans skimming a SERP. When you call the OpenAI-compatible /v1/chat/completions endpoint with streaming enabled, you get:

  • Web-grounded answers whenever web grounding is enabled, backed by:
    • Live crawling plus an AI-native web index
    • Token-dense compressed excerpts instead of snippet-style search results
  • OpenAI-compatible interface:
    • Same HTTP method, path, and core request schema (model, messages, stream)
    • SSE streaming format that follows data: { "id": ..., "choices": [...] }
  • Citations and rationale via Basis:
    • Each grounded fact is backed by one or more URLs and excerpts
    • Parallel attaches provenance and calibrated confidence so you can trace every atomic fact

Think of this as “ChatGPT-style streaming,” but with the content produced on top of Parallel Search / Extract / Task rather than opaque browsing + summarization.


Core Concepts: Web Grounding and Basis in Chat

Before touching code, it helps to be explicit about the mechanics.

Web grounding

In Parallel, “web grounding” is not just “the model can browse.” Instead, the Chat Processor can:

  • Call Parallel Search to retrieve ranked URLs plus compressed excerpts (in <5 seconds)
  • Use Extract and live crawling to pull full page contents when needed
  • Feed that high-density context into the model with a fixed, predictable retrieval cost

You can either:

  1. Let the system handle retrieval automatically (recommended for most chat use cases), or
  2. Wire your own tool schema / MCP-style tools that call Search/Extract/Task from your agent loop.

Basis: citations, rationale, confidence

Parallel’s Basis framework shows up in Chat in two main ways:

  • Citations: URLs and short excerpts tied to specific claims
  • Rationale and confidence: model-visible reasoning plus calibrated confidence scores for each atomic fact

You can surface this in your UI as “Sources” under an answer or use it programmatically to:

  • Reject low-confidence facts
  • Ask the model to re-verify a specific field
  • Log provenance for regulated workflows
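
As a rough illustration of the programmatic uses listed above, the sketch below splits facts by a confidence threshold. The BasisFact shape is modeled on the example payload shown later in this guide and is an assumption, not a confirmed schema; splitByConfidence and its 0.6 default are hypothetical.

```typescript
// Assumed shapes, modeled on this guide's example payload (not a confirmed schema).
interface Citation {
  url: string;
  excerpt?: string;
  confidence?: number;
}

interface BasisFact {
  id: string;
  claim: string;
  citations: Citation[];
  confidence: number; // assumed to lie in the range 0..1
  rationale?: string;
}

// Hypothetical helper: separate facts that meet a threshold from those that do not.
function splitByConfidence(
  facts: BasisFact[],
  threshold = 0.6
): { accepted: BasisFact[]; flagged: BasisFact[] } {
  const accepted: BasisFact[] = [];
  const flagged: BasisFact[] = [];
  for (const fact of facts) {
    if (fact.confidence >= threshold) {
      accepted.push(fact);
    } else {
      flagged.push(fact);
    }
  }
  return { accepted, flagged };
}
```

Flagged facts can then be dropped, sent back for re-verification, or logged, depending on the workflow.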

Endpoint Overview: OpenAI-Compatible Streaming

Parallel exposes a chat endpoint that mirrors OpenAI’s standard:

POST https://api.parallel.ai/v1/chat/completions
Authorization: Bearer YOUR_PARALLEL_API_KEY
Content-Type: application/json
Accept: text/event-stream

Key points:

  • Path: /v1/chat/completions
  • Method: POST
  • Streaming: set stream: true in the JSON body
  • Response: Server-Sent Events (SSE) delivered as data: ... chunks, ending with a data: [DONE] sentinel
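
To make the SSE framing concrete, here is a minimal sketch of a buffer parser that works on plain strings. It assumes events are separated by a blank line and that the stream ends with a [DONE] payload, as described above; drainSseBuffer and SseParseResult are hypothetical names, not part of any SDK.

```typescript
// Result of draining complete SSE events from a text buffer.
interface SseParseResult {
  payloads: string[]; // data payloads of complete events, in arrival order
  done: boolean; // true once the [DONE] sentinel has been seen
  rest: string; // trailing partial event to prepend to the next read
}

// Hypothetical helper: split a buffer on blank lines and collect data payloads.
function drainSseBuffer(buffer: string): SseParseResult {
  // Normalize CRLF so "\r\n\r\n" separators are handled as well.
  const events = buffer.replace(/\r\n/g, "\n").split("\n\n");
  const rest = events.pop() ?? "";
  const payloads: string[] = [];
  let done = false;

  for (const event of events) {
    // An event may span several lines; only "data:" lines carry the payload.
    const data = event
      .split("\n")
      .filter((line) => line.startsWith("data:"))
      .map((line) => line.slice(5).trimStart())
      .join("\n");
    if (data === "") continue;
    if (data === "[DONE]") {
      done = true;
      break;
    }
    payloads.push(data);
  }
  return { payloads, done, rest };
}
```

Call it after each network read, carry rest into the next call, and stop reading once done is true.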

If you’re already using the OpenAI SDKs, you can often swap the base URL and API key, then enable Parallel-specific configuration.


Minimal Streaming Example (with Web Grounding Enabled)

Assume you want an answer grounded on the current web with citations.

Basic request payload

{
  "model": "parallel-chat-core",
  "stream": true,
  "messages": [
    { "role": "system", "content": "You are a helpful assistant that always cites web sources." },
    { "role": "user", "content": "Summarize the latest research on retrieval-augmented generation evaluation, with citations." }
  ],
  "parallel": {
    "web_grounding": {
      "enabled": true,
      "search": { "max_calls": 3 }
    },
    "citations": {
      "enabled": true
    }
  }
}

Notes:

  • model: pick a model hosted by Parallel for chat (e.g., parallel-chat-core or parallel-chat-pro). Exact names track Parallel’s current offerings, so confirm them before deploying.
  • parallel.web_grounding.enabled: tells the Processor to call Search/Extract behind the scenes.
  • parallel.citations.enabled: exposes Basis-style citations in the response metadata.
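
If you build this payload in code, a typed helper keeps the fields in one place. The sketch below assumes the parallel block has the shape used in this guide's example payload; GroundedChatRequest and buildGroundedRequest are hypothetical names.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Assumed request shape: the parallel block follows this guide's example payload.
interface GroundedChatRequest {
  model: string;
  stream: boolean;
  messages: ChatMessage[];
  parallel: {
    web_grounding: { enabled: boolean };
    citations: { enabled: boolean };
  };
}

// Hypothetical helper: build a streaming, web-grounded request body for one question.
function buildGroundedRequest(
  question: string,
  model = "parallel-chat-core"
): GroundedChatRequest {
  return {
    model,
    stream: true,
    messages: [
      { role: "system", content: "You are a helpful assistant that always cites web sources." },
      { role: "user", content: question },
    ],
    parallel: {
      web_grounding: { enabled: true },
      citations: { enabled: true },
    },
  };
}
```

Pass JSON.stringify(buildGroundedRequest(question)) as the request body.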

Node.js streaming example

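// Requires Node.js 18+, which provides a global fetch whose response.body is a
// WHATWG ReadableStream. node-fetch returns a Node.js stream without getReader().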

async function run() {
  const response = await fetch("https://api.parallel.ai/v1/chat/completions", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.PARALLEL_API_KEY}`,
      "Content-Type": "application/json",
      "Accept": "text/event-stream"
    },
    body: JSON.stringify({
      model: "parallel-chat-core",
      stream: true,
      messages: [
        {
          role: "system",
          content: "You are a research assistant that always cites web sources."
        },
        {
          role: "user",
          content: "Compare Parallel’s web retrieval to Tavily and Exa, and include citations."
        }
      ],
      parallel: {
        web_grounding: { enabled: true },
        citations: { enabled: true }
      }
    })
  });

  if (!response.ok || !response.body) {
    console.error(`HTTP ${response.status}: ${await response.text()}`);
    return;
  }

  const decoder = new TextDecoder();
  const reader = response.body.getReader();

  let buffer = "";

  while (true) {
    const { value, done } = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, { stream: true });

    // SSE events are separated by a blank line (\n\n)
    const parts = buffer.split("\n\n");
    buffer = parts.pop() || "";

    for (const part of parts) {
      if (!part.startsWith("data:")) continue;
      const data = part.slice(5).trim();

      if (data === "[DONE]") {
        console.log("\n[stream complete]");
        return;
      }

      const chunk = JSON.parse(data);
      const delta = chunk.choices?.[0]?.delta?.content || "";
      process.stdout.write(delta);
    }
  }
}

run().catch(console.error);

This streams the assistant’s text to stdout as each delta arrives. Citations and Basis metadata arrive separately from the text deltas (explained below).


Surfacing Citations from Streaming Responses

Parallel’s OpenAI-compatible streaming returns chunks with the familiar structure:

{
  "id": "chatcmpl-123",
  "object": "chat.completion.chunk",
  "created": 1710000000,
  "model": "parallel-chat-core",
  "choices": [
    {
      "index": 0,
      "delta": {
        "role": "assistant",
        "content": "Parallel’s web retrieval benchmarks show..."
      },
      "finish_reason": null
    }
  ]
}
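
A small, defensive accessor for this chunk structure might look like the sketch below. It types only the fields it reads and returns an empty string for anything it cannot parse; ChunkView and deltaText are hypothetical names.

```typescript
// Minimal view of a chat.completion.chunk; only the fields read here are typed.
interface ChunkView {
  choices?: { delta?: { content?: string } }[];
}

// Hypothetical helper: return the text delta from one SSE payload, or "" if there is none.
function deltaText(payload: string): string {
  let chunk: ChunkView | null;
  try {
    chunk = JSON.parse(payload) as ChunkView | null;
  } catch {
    return ""; // skip malformed or non-JSON payloads instead of aborting the stream
  }
  return chunk?.choices?.[0]?.delta?.content ?? "";
}
```

Chunks that carry only a role or a finish_reason yield an empty string, so the result can be appended to the message unconditionally.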

When citations and Basis metadata are enabled, additional fields appear in one of two places:

  • The response object, if you request a non-streaming completion, or
  • A final summary event or ancillary metadata structure on the stream, depending on your configuration.

A typical pattern (values are illustrative):

{
  "parallel": {
    "basis": {
      "facts": [
        {
          "id": "fact-1",
          "claim": "Parallel achieves higher recall than Tavily and Exa at comparable CPM on DeepResearch Bench.",
          "citations": [
            {
              "url": "https://parallel.ai/benchmarks/deepresearch",
              "excerpt": "Parallel outperforms Tavily and Exa across recall and answer accuracy at each cost tier.",
              "confidence": 0.92
            }
          ],
          "confidence": 0.9,
          "rationale": "Benchmarked on DeepResearch Bench with judge model GPT-4, testing window Jan–Feb 2026."
        }
      ]
    }
  }
}

Practical integration patterns

You have two main options:

  1. Render text as it streams, attach sources at the end
    • Stream delta.content into your chat UI.
    • Once the stream completes, read the parallel.basis block from the final payload and render the citations under the message (“Sources: …”).
  2. Programmatic fact-level handling
    • After completion, iterate parallel.basis.facts and:
      • Drop or flag facts below a confidence threshold (e.g., < 0.6)
      • Store id, claim, citations in your own datastore for audits
      • Allow users to click “Why?” and display rationale

Because streaming chunks are optimized for incremental text, Basis metadata is typically best consumed after the stream completes, not per token.
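
For the first pattern, one way to turn the facts into a “Sources” footer is sketched below. It assumes the facts[].citations[].url shape from the example payload above; SourcedFact and formatSources are hypothetical names.

```typescript
// Assumed shape, modeled on this guide's example parallel.basis payload.
interface SourcedFact {
  claim: string;
  citations: { url: string; excerpt?: string }[];
}

// Hypothetical helper: list each cited URL once, numbered in first-seen order.
function formatSources(facts: SourcedFact[]): string {
  const seen = new Set<string>();
  const lines: string[] = [];
  for (const fact of facts) {
    for (const citation of fact.citations) {
      if (seen.has(citation.url)) continue;
      seen.add(citation.url);
      lines.push(`[${lines.length + 1}] ${citation.url}`);
    }
  }
  // Return an empty string when nothing was cited so the UI can skip the footer.
  return lines.length > 0 ? `Sources:\n${lines.join("\n")}` : "";
}
```

Deduplicating by URL keeps the footer short when several facts cite the same page.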


Controlling Web Grounding Behavior

You can tune how aggressively Parallel pulls web context via the parallel.web_grounding block.

Example configuration

"parallel": {
  "web_grounding": {
    "enabled": true,
    "search": {
      "max_calls": 3,
      "processor": "search-core",
      "latency_budget_seconds": 5
    },
    "extract": {
      "max_pages": 5,
      "processor": "extract-base"
    }
  },
  "citations": {
    "enabled": true,
    "include_excerpts": true
  }
}

Common tuning levers:

  • search.max_calls
    • Fewer calls → lower cost, lower recall
    • More calls → deeper coverage, more tokens in context
  • Processor selection (search-core, extract-base, etc.)
    • Parallel’s Processor architecture lets you trade latency vs depth:
      • Lite/Base → faster, cheaper, shallower
      • Core/Pro/Ultra → richer multi-hop reasoning, more context, higher CPM (cost per 1,000 requests)
  • Latency budget
    • Set latency_budget_seconds to keep search under a known ceiling (e.g., 5s) while streaming.

This lets you treat web grounding as a bounded, predictable component of your chat runtime, rather than an open-ended browsing session.
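
As a back-of-the-envelope check on that bound, the sketch below multiplies the two search limits. It assumes the latency budget applies to each search call and that calls run one after another; neither is confirmed here, and maxSearchSeconds is a hypothetical helper.

```typescript
// Field names follow the search block in this guide's example configuration.
interface SearchLimits {
  max_calls: number;
  latency_budget_seconds: number;
}

// Upper bound on search time, ASSUMING a per-call budget and sequential calls.
function maxSearchSeconds(limits: SearchLimits): number {
  if (limits.max_calls < 0 || limits.latency_budget_seconds < 0) {
    throw new RangeError("limits must be non-negative");
  }
  return limits.max_calls * limits.latency_budget_seconds;
}

// The example configuration above: 3 calls with a 5-second budget.
const exampleBoundSeconds = maxSearchSeconds({ max_calls: 3, latency_budget_seconds: 5 });
```

Under those assumptions the example configuration spends at most 15 seconds on search before synthesis begins.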


Using Chat with Your Own Agent Loop and Tools

If you’re running a more complex agent with tools (MCP tools, function calling, etc.), you can still keep web grounding inside Parallel instead of building your own search → scrape → parse → re-rank stack.

Two common patterns:

1. Chat as your main orchestrator

Let Parallel Chat call web tools internally:

  • Set parallel.web_grounding.enabled: true
  • Keep your external tools focused on non-web tasks (e.g., DB reads, internal APIs)
  • Chat produces an answer plus Basis citations with minimal tool orchestration on your side

2. External agent orchestrating Parallel APIs directly

Use Parallel Chat strictly as a “reasoning and synthesis” layer:

  1. Your orchestrator calls Parallel Search directly (/v1/search) to fetch URLs + compressed excerpts.
  2. Optionally call Extract to expand specific URLs.
  3. Call Chat with:
    • stream: true
    • messages containing:
      • User question
      • A system or tool message containing the retrieved snippets
  4. Use Chat purely for streaming synthesis and answer generation, with citations coming from the evidence you’ve injected.

This second pattern is useful when you want explicit control over which sources the model sees.
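
A minimal sketch of step 3 in that pattern: packing retrieved evidence into a system message with numbered sources so the model can cite them as [n]. The Evidence shape and the prompt wording are assumptions for illustration; buildEvidenceMessages is a hypothetical helper, and the actual Search response fields may differ.

```typescript
// Assumed shape of one retrieved item; real Search results may use other field names.
interface Evidence {
  url: string;
  excerpt: string;
}

interface EvidenceMessage {
  role: "system" | "user";
  content: string;
}

// Hypothetical helper: number the sources and place them ahead of the user question.
function buildEvidenceMessages(question: string, evidence: Evidence[]): EvidenceMessage[] {
  const sources = evidence
    .map((item, i) => `[${i + 1}] ${item.url}\n${item.excerpt}`)
    .join("\n\n");
  return [
    {
      role: "system",
      content: "Answer using only the numbered sources below and cite them as [n].\n\n" + sources,
    },
    { role: "user", content: question },
  ];
}
```

Because you assign the numbers yourself, citation markers in the answer map back to URLs you already hold.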


Example: Building a Web-Grounded QA Endpoint

A typical production flow looks like:

  1. Frontend sends question to your backend.
  2. Backend calls Parallel Chat streaming endpoint with web_grounding.enabled: true.
  3. Backend streams tokens to the client and captures the final Basis metadata.
  4. Backend stores:
    • Question
    • Final answer
    • Basis facts and citations
    • Confidence scores

Pseudocode (TypeScript-style)

type ChatResult = {
  text: string;
  basis?: any;
};

async function webGroundedChat(question: string): Promise<ChatResult> {
  const response = await fetch("https://api.parallel.ai/v1/chat/completions", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.PARALLEL_API_KEY}`,
      "Content-Type": "application/json",
      "Accept": "text/event-stream"
    },
    body: JSON.stringify({
      model: "parallel-chat-core",
      stream: true,
      messages: [
        { role: "system", content: "You are an AI assistant that answers with citations." },
        { role: "user", content: question }
      ],
      parallel: {
        web_grounding: { enabled: true },
        citations: { enabled: true }
      }
    })
  });

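  // Fail fast on HTTP errors instead of trying to parse an error body as SSE
  if (!response.ok || !response.body) {
    throw new Error(`HTTP ${response.status}: ${await response.text()}`);
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  let fullText = "";
  let finalBasis: any = null;

  // Labeled so the [DONE] sentinel can exit both loops at once
  readLoop: while (true) {
    const { value, done } = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, { stream: true });
    const parts = buffer.split("\n\n"); // SSE events end with a blank line
    buffer = parts.pop() || "";

    for (const part of parts) {
      if (!part.startsWith("data:")) continue;
      const data = part.slice(5).trim();

      if (data === "[DONE]") {
        break readLoop;
      }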

      const chunk = JSON.parse(data);
      const choice = chunk.choices?.[0];

      // Stream text
      if (choice?.delta?.content) {
        const deltaText = choice.delta.content;
        fullText += deltaText;
        // optionally forward deltaText to client
      }

      // Capture Basis metadata if this chunk contains it
      if (chunk.parallel?.basis) {
        finalBasis = chunk.parallel.basis;
      }
    }
  }

  return { text: fullText, basis: finalBasis };
}

You can then:

  • Render text in your chat UI
  • Render citations from basis.facts[].citations
  • Use basis.facts[].confidence for QA thresholds

Latency and Cost Considerations

Because this endpoint is built for agents in production, you should design around known latency bands:

  • Search: typically under 5 seconds per call
  • Extract: cached pages ~1–3 seconds; live crawl ~60–90 seconds worst case
  • Chat completion: model-dependent, plus retrieval time

With the Processor architecture:

  • You allocate compute based on task:
    • e.g., search-lite + chat-base for quick Q&A
    • or search-pro + chat-ultra for deep research
  • You pay per request, not per token, which makes CPM predictable across runs.
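
Per-request pricing makes spend a simple multiplication. The sketch below assumes CPM means USD per 1,000 requests; the request volume and CPM figures are invented for illustration and are not published prices. estimateCostUsd is a hypothetical helper.

```typescript
// CPM is taken here to mean USD per 1,000 requests (an assumption about pricing units).
function estimateCostUsd(requests: number, cpmUsd: number): number {
  if (requests < 0 || cpmUsd < 0) {
    throw new RangeError("requests and cpmUsd must be non-negative");
  }
  return (requests / 1000) * cpmUsd;
}

// Hypothetical figures for illustration only, not published prices.
const monthlyRequests = 20000;
const exampleCpmUsd = 5;
const monthlyCostUsd = estimateCostUsd(monthlyRequests, exampleCpmUsd); // 20 * 5 = 100
```

The estimate does not depend on prompt or answer length, which is what makes it easy to forecast.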

When web grounding is enabled, you can bound total retrieval cost by:

  • Limiting search calls (max_calls)
  • Restricting extract pages
  • Fixing processor tiers

This is critical if you’re coming from OpenAI-style “browsing + summarization” stacks where both time and spend are hard to forecast.


Debugging and Verifying Grounding

To ensure your streaming chat is actually web-grounded and verifiable:

  1. Inspect Basis output in dev
    • Log parallel.basis for each request during development.
    • Check that citations are from expected domains and carry sensible confidence scores.
  2. Add a “show sources” toggle in your UI
    • Let users expand a panel showing:
      • Fact text
      • Source URLs
      • Excerpts and rationale
  3. Monitor recall via spot-check tasks
    • Periodically send evaluation prompts (e.g., from DeepResearch Bench or WISER-Atomic style tasks).
    • Verify that the cited sources match the facts and that low-confidence items are appropriately flagged.
  4. Set confidence-based policies
    • Example: if any critical field has confidence < 0.7, require a second pass or human review.
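
The policy in step 4 could be expressed as a small gate like the sketch below. It assumes each fact carries an id and a confidence between 0 and 1, as in this guide's example payload; reviewDecision and the choice to treat a missing critical fact as a failure are hypothetical.

```typescript
type ReviewDecision = "accept" | "review";

// Assumed minimal shape: only the fields the policy needs.
interface ScoredFact {
  id: string;
  confidence: number;
}

// Hypothetical policy: a critical fact that is missing or below the threshold triggers review.
function reviewDecision(
  facts: ScoredFact[],
  criticalIds: string[],
  threshold = 0.7
): ReviewDecision {
  const confidenceById = new Map<string, number>();
  for (const fact of facts) {
    confidenceById.set(fact.id, fact.confidence);
  }
  for (const id of criticalIds) {
    const confidence = confidenceById.get(id);
    if (confidence === undefined || confidence < threshold) {
      return "review";
    }
  }
  return "accept";
}
```

A "review" result can route the answer to a second pass or a human reviewer, while non-critical facts pass through untouched.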

This keeps your streaming chat not just fast and web-aware, but auditable.


Final Verdict

If you need streaming chat that is both OpenAI-compatible and genuinely grounded on the web with citations, Parallel’s Chat API is designed for exactly that. You keep the ergonomics of /v1/chat/completions and SSE, but move onto an AI-native web stack where:

  • Retrieval is programmatic and bounded by clear cost/latency parameters.
  • Answers carry Basis metadata—citations, rationale, and confidence—for every atomic fact.
  • You avoid maintaining your own search → crawl → scrape → re-rank pipeline.

Use the parallel.web_grounding and parallel.citations blocks to dial in behavior, stream tokens directly into your UI, and consume Basis metadata once the stream completes to attach verifiable sources.

