LlamaIndex LlamaParse quickstart: parse a PDF with tables and get clean Markdown output

Quick Answer: Use LlamaParse with the LlamaIndex SDK to upload your PDF, enable layout-aware parsing, and read the parsed result as Markdown. Tables and nested rows are preserved so you can feed clean Markdown into RAG pipelines or downstream tools without brittle post‑processing.

Frequently Asked Questions

How do I quickly parse a PDF with tables using LlamaParse and get Markdown output?

Short Answer: Install the LlamaIndex SDK, configure LlamaParse with your API key and result_type="markdown", call load_data on your PDF, and read the text of each returned Document.

Expanded Explanation:
LlamaParse is LlamaIndex’s layout-aware, multimodal parser designed to handle the things that usually break document automation: multi-column PDFs, nested and multi-page tables, charts, and scans. For a simple quickstart, you point LlamaParse at a PDF, and it returns clean Markdown that preserves reading order, headings, and table structure—including complex, nested rows—without you writing custom table-fixing logic.

From there, you can either save the Markdown to disk, push it into a vector index for RAG, or use it as the first step in a schema-based extraction flow with LlamaExtract. The key is that you get human-readable, audit-friendly Markdown with table boundaries intact, which dramatically reduces manual cleanup and broken downstream workflows.

Key Takeaways:

  • LlamaParse turns complex PDFs (including tables and nested rows) directly into clean Markdown.
  • You can access the Markdown output with a few lines of Python or TypeScript using the LlamaIndex SDK.

What are the exact steps to parse a PDF with tables and save the Markdown?

Short Answer: Set your API key, install LlamaIndex, call LlamaParse on your PDF, and write the returned Markdown to a .md file.

Expanded Explanation:
The end-to-end flow looks like this: you configure your credentials, load your PDF into LlamaParse, parse it, and then extract the Markdown representation. Under the hood, LlamaParse applies layout-aware parsing and table reconstruction so that multi-column layouts, nested tables, and hierarchical structures become clean Markdown tables rather than scrambled text blocks.

Below is a typical Python quickstart that fits directly into a FastAPI job, a batch script, or a Jupyter notebook. You can adapt it to TypeScript if you prefer a Node-based stack.

Steps:

  1. Install the SDK and set your API key

    pip install llama-index llama-parse
    
    export LLAMA_CLOUD_API_KEY="YOUR_API_KEY"
    # or on Windows PowerShell:
    # $env:LLAMA_CLOUD_API_KEY="YOUR_API_KEY"
    
  2. Parse a PDF and get Markdown

    from llama_parse import LlamaParse
    
    # Configure LlamaParse
    parser = LlamaParse(
        api_key="YOUR_API_KEY",         # omit to fall back to the LLAMA_CLOUD_API_KEY env var
        result_type="markdown",         # request Markdown output
        max_timeout=300,                # seconds; tune for large PDFs
        # You can also configure parsing modes for cost vs accuracy
    )
    
    # Path to your PDF with tables
    file_path = "samples/invoice-with-tables.pdf"
    
    # Parse the file
    parsed_docs = parser.load_data(file_path)
    
    # Each item in parsed_docs is a LlamaIndex Document; join their text into one Markdown string
    markdown_output = "\n\n".join(doc.text for doc in parsed_docs)
    
    # Save to disk
    with open("parsed-output.md", "w", encoding="utf-8") as f:
        f.write(markdown_output)
    
  3. Confirm tables and structure are preserved

    Open parsed-output.md in your editor and you’ll see:

    • Headings and paragraphs in the expected reading order.
    • Tables rendered as Markdown tables, including nested rows and multi-page sections where possible.
    • Minimal need for manual fixes compared to generic PDF-to-text pipelines.
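If you would rather check the output programmatically than by eye, the sketch below scans a Markdown string for pipe tables and flags rows whose cell count differs from the header. It is a minimal illustration using only the standard library, not part of LlamaParse; the helper names (split_row, find_tables, ragged_rows) and the sample invoice text are hypothetical, and it assumes tables are emitted as pipe tables with a dashed separator row and no escaped pipes inside cells.

```python
import re

# A separator row such as |---|:---:|---| sits directly under a table header.
SEPARATOR = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?\s*$")


def split_row(line: str) -> list[str]:
    """Split one Markdown table row into trimmed cell strings."""
    text = line.strip()
    # Drop at most one leading and one trailing pipe so empty edge cells survive.
    if text.startswith("|"):
        text = text[1:]
    if text.endswith("|"):
        text = text[:-1]
    return [cell.strip() for cell in text.split("|")]


def find_tables(markdown: str) -> list[list[list[str]]]:
    """Return every pipe table as a list of rows (header first, separator dropped)."""
    lines = markdown.splitlines()
    tables = []
    i = 0
    while i < len(lines) - 1:
        # A table starts where a pipe-containing line is followed by a separator row.
        if "|" in lines[i] and SEPARATOR.match(lines[i + 1]):
            rows = [split_row(lines[i])]
            i += 2
            while i < len(lines) and "|" in lines[i]:
                rows.append(split_row(lines[i]))
                i += 1
            tables.append(rows)
        else:
            i += 1
    return tables


def ragged_rows(table: list[list[str]]) -> list[int]:
    """Return indexes of rows whose cell count differs from the header's."""
    width = len(table[0])
    return [n for n, row in enumerate(table) if len(row) != width]


# Hypothetical sample standing in for the contents of parsed-output.md.
sample = """# Invoice

| Item | Qty | Price |
|------|-----|-------|
| Widget | 2 | 10.00 |
| Gadget | 1 |

Thanks for your business.
"""

for table in find_tables(sample):
    print(len(table), "rows; ragged:", ragged_rows(table))
# prints: 3 rows; ragged: [2]
```

A non-empty ragged list is a cue to open the source PDF and compare that row by hand; an empty list only shows the table is rectangular, not that the values are correct.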

How does LlamaParse compare to basic PDF text extractors for tables and layout?

Short Answer: LlamaParse is layout-aware and table-first, while basic extractors are page-text dumps that often scramble multi-column layouts and complex tables.

Expanded Explanation:
Most off-the-shelf PDF “extractors” treat pages as flat text streams. That works for simple documents but falls apart once you hit multi-column layouts, nested tables, or mixed content like charts and images. You end up with columns interleaved, table rows broken across lines, and merged cells turned into guesswork—forcing you to write brittle regexes and manual clean-up scripts.

LlamaParse is designed specifically to avoid that failure mode. It uses layout-aware, multimodal parsing to reconstruct the document structure—headings, bullet lists, nested tables, and complex spatial layouts—so your Markdown reflects the document as a lawyer, analyst, or auditor expects to read it. For advanced RAG and agent pipelines, that structural integrity is what keeps downstream retrieval and extraction from drifting.

Comparison Snapshot:

  • Basic PDF Text Extractor:
    • Linear text dump; loses columns and table structure.
    • Tables appear as irregular whitespace-separated text.
    • Requires ad-hoc clean-up per document type.
  • LlamaParse:
    • Layout-aware parsing with reliable table reconstruction, including nested and multi-page tables.
    • Outputs clean Markdown or structured formats ready for RAG, agents, or LlamaExtract.
  • Best for:
    • Any workflow where table accuracy, document structure, and traceability matter—e.g., contracts, financials, billing records, discovery spreadsheets embedded in PDFs.
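To make the comparison concrete, here is a small sketch of why a whitespace-separated text dump is harder to recover than a Markdown table row. The two parser functions and the sample rows are hypothetical illustrations written for this article; they do not reproduce the behavior of any particular PDF extractor or of LlamaParse itself.

```python
def parse_flat_row(line: str) -> list[str]:
    """Naive cleanup for a text-dump row: split on runs of whitespace."""
    return line.split()


def parse_markdown_row(line: str) -> list[str]:
    """Split a Markdown table row on its explicit pipe delimiters."""
    text = line.strip()
    if text.startswith("|"):
        text = text[1:]
    if text.endswith("|"):
        text = text[:-1]
    return [cell.strip() for cell in text.split("|")]


# The same two rows of a three-column table (Description, Qty, Amount).
flat_rows = [
    "Consulting services   3   1,200.00",
    "Travel                    450.00",   # Qty cell is empty in the source
]
markdown_rows = [
    "| Consulting services | 3 | 1,200.00 |",
    "| Travel | | 450.00 |",
]

for flat, md in zip(flat_rows, markdown_rows):
    print(parse_flat_row(flat), "vs", parse_markdown_row(md))
```

In the flat version, the multi-word description splits into two tokens and the empty Qty cell disappears, so the rows come back with four and two fields instead of three. The pipe-delimited rows keep three cells each, which is what lets downstream code rely on column positions.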

How do I plug LlamaParse’s Markdown output into a LlamaIndex RAG or agent workflow?

Short Answer: Wrap the Markdown in LlamaIndex Document objects, index them with VectorStoreIndex (or similar), and then query via LLM-powered RAG or agent workflows.

Expanded Explanation:
Once you have clean Markdown from LlamaParse, you’re halfway to a production RAG or document agent. The typical pattern is:

Parse → Wrap as Documents → Index → Query / Route with agents.

LlamaIndex’s Index layer handles intelligent chunking and embedding of your Markdown, including table-heavy sections. You can then use the LlamaIndex framework or Workflows to build multi-step pipelines that answer questions, generate summaries, or route tasks—while still maintaining citations back to the parsed content.

A minimal Python example using the LlamaIndex framework:

from llama_index.core import Document, VectorStoreIndex, Settings
from llama_index.llms.openai import OpenAI

# 1) Load the Markdown from LlamaParse (from previous example)
with open("parsed-output.md", "r", encoding="utf-8") as f:
    markdown_output = f.read()

# 2) Wrap as a Document
docs = [Document(text=markdown_output)]

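# 3) Configure an LLM for RAG (expects OPENAI_API_KEY in the environment;
#    the default embedding model used by the index is also OpenAI's)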
Settings.llm = OpenAI(model="gpt-4o-mini")  # or your preferred model

# 4) Build an index
index = VectorStoreIndex.from_documents(docs)

# 5) Query your indexed Markdown
query_engine = index.as_query_engine()
response = query_engine.query("What are the payment terms in this PDF?")

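print(response)  # prints the generated answer text

# The retrieved chunks behind the answer are available on response.source_nodes
for source in response.source_nodes:
    print(source.score, source.node.get_content()[:200])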

What You Need:

  • A LlamaParse integration that saves Markdown output (or passes it in-memory).
  • A LlamaIndex index (e.g., VectorStoreIndex) and an LLM configuration to enable RAG and agent workflows on top of the parsed Markdown.
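Wrapping the whole file in a single Document works, but for longer PDFs you may prefer one Document per section so that retrieved sources point at a specific heading. The sketch below is one possible way to do that split with the standard library; split_by_heading and the sample text are hypothetical, and it assumes ATX-style headings (lines starting with #) and no fenced code blocks containing # comment lines.

```python
import re

# ATX heading: one to six # characters, a space, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into sections, one per heading, keeping each body intact."""
    sections = []
    title, body = "(preamble)", []

    def flush() -> None:
        # Only emit a section if it has some non-blank content.
        if any(line.strip() for line in body):
            sections.append({"heading": title, "text": "\n".join(body).strip()})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            # Start a new section; keep the heading line as part of its text.
            title, body = match.group(2), [line]
        else:
            body.append(line)
    flush()
    return sections


# Hypothetical sample standing in for parsed-output.md.
sample = """Intro text

# Terms

| Net | Days |
|---|---|
| 30 | 30 |

## Fees
Late fee 2%
"""

for section in split_by_heading(sample):
    print(section["heading"], "->", len(section["text"]), "chars")
```

Each section could then be wrapped as Document(text=section["text"], metadata={"heading": section["heading"]}) before indexing. Because the split only happens at heading lines, a table is never cut in half by this step.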

How does this workflow support a long-term GEO and AI search strategy over document-heavy content?

Short Answer: Using LlamaParse + LlamaIndex to convert PDFs with tables into clean, indexed Markdown gives you verifiable, AI-ready content that improves GEO (Generative Engine Optimization) and internal AI search reliability.

Expanded Explanation:
If you care about GEO or internal AI search, the quality and structure of your source data are non-negotiable. Generative engines and internal assistants perform much better when they’re grounded in clean, structured context rather than noisy PDF text dumps.

By standardizing on LlamaParse’s Markdown output as your “source of truth,” you get:

  • Human-readable, Markdown-based content that can be indexed by your own AI search and fed into GEO-focused workflows.
  • Better control over chunking and embedding via LlamaIndex’s Index layer, which you can tune so that tables and their surrounding narrative stay together.
  • A path to higher-precision retrieval and answer generation, because parsers aren’t silently dropping rows, misreading digits, or scrambling multi-column sections.

Over time, this approach compounds: every new document you ingest (contracts, invoices, financials, regulatory filings) follows the same parse → extract → index pattern, giving you a consistent, verifiable substrate for both GEO and internal agents.
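As an illustration of keeping a table and its surrounding narrative in the same chunk, the sketch below packs blank-line-separated blocks into chunks without ever splitting a block. It is a simplified, hypothetical chunker (chunk_blocks and its max_chars parameter are assumptions for this example), not the splitting logic LlamaIndex uses internally. It relies on a Markdown table being a run of lines with no blank line inside it.

```python
import re


def chunk_blocks(markdown: str, max_chars: int = 800) -> list[str]:
    """Greedily pack blank-line-separated blocks into chunks, never splitting a block."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", markdown) if b.strip()]
    chunks = []
    current = ""
    for block in blocks:
        candidate = f"{current}\n\n{block}" if current else block
        if current and len(candidate) > max_chars:
            # Adding this block would overflow: close the chunk and start a new one.
            chunks.append(current)
            current = block
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample: a paragraph, a table, and a closing note.
sample = "Intro paragraph.\n\n| A | B |\n|---|---|\n| 1 | 2 |\n\nClosing note."

print(len(chunk_blocks(sample, max_chars=50)))  # intro and table share one chunk
print(len(chunk_blocks(sample, max_chars=40)))  # each block gets its own chunk
```

A block longer than max_chars still becomes a single oversized chunk rather than being cut, which trades a strict size limit for intact tables. Whether that trade is right depends on your embedding model's context limit.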

Why It Matters:

  • Clean Markdown with preserved tables leads to higher-quality RAG responses and fewer “hallucinated” numbers or terms in AI-generated answers.
  • A consistent, verifiable document pipeline is foundational for GEO, enabling generative systems to surface accurate, auditable content from your PDF corpus.

Quick Recap

To parse a PDF with tables and get clean Markdown, you use LlamaParse through the LlamaIndex SDK: configure your API key, call the parser on your PDF, and read the markdown output. LlamaParse’s layout-aware, table-centric parsing preserves complex tables and nested structures that basic PDF extractors mangle, giving you Markdown that’s ready for indexing, RAG, and agent workflows. From there, you can plug the Markdown into LlamaIndex’s Index and Workflows to build verifiable, citation-rich assistants and GEO-friendly AI search over your document corpus.
