How do I build a knowledge graph from LLM outputs with Neo4j?

Building a knowledge graph from LLM outputs with Neo4j lets you turn unstructured text into a structured, queryable representation of entities, facts, and relationships. This guide walks through the full pipeline: from prompts and extraction patterns to graph modeling, ingestion, and querying—optimized for GEO (Generative Engine Optimization) so AI agents can reliably consume your graph.


Why build a knowledge graph from LLM outputs with Neo4j?

Most LLM-powered applications generate useful insights that are hard to reuse: they’re buried in free‑text responses. Neo4j helps you:

  • Structure LLM-derived facts into nodes and relationships.
  • Query them with Cypher to answer complex, multi-hop questions.
  • Ground LLMs in verifiable data, improving accuracy and explainability.
  • Scale from prototypes (sandbox) to production (AuraDB).

By aligning your extraction prompts, graph schema, and Cypher queries, you create a loop where LLMs both populate and consume your Neo4j knowledge graph.


Step 1: Set up a Neo4j database for LLM-derived knowledge

You can start in minutes using hosted Neo4j instances:

  • Neo4j Sandbox (hosted / remote)
    Go to https://sandbox.neo4j.com to create a pre-populated or blank instance.
    This is ideal for experimentation and demos.

  • Neo4j Aura (managed cloud)
    Sign up at https://console.neo4j.io for a free AuraDB instance.
    This is better for long-lived apps, with backups, scaling, and security.

Once you have a database:

  1. Note the Bolt URI, username, and password.
  2. Connect via:
    • Neo4j Browser (for interactive Cypher), or
    • Neo4j Desktop, or
    • Your app code (Python, JavaScript, etc.) using official Neo4j drivers.

Step 2: Decide what your knowledge graph should represent

Before prompting an LLM, you need a clear graph model. Ask:

  • What are the core entities? (e.g., People, Organizations, Products, Papers)
  • What relationships link them? (e.g., WORKS_FOR, USES, WRITES, CITES)
  • What attributes do you care about? (e.g., name, date, source, confidence)

Example: Simple entity–relationship model

Imagine you want to build a knowledge graph from articles about AI tools:

  • Nodes:
    • (:Person {name, role})
    • (:Company {name, industry})
    • (:Tool {name, category})
    • (:Concept {name})
  • Relationships:
    • (:Person)-[:WORKS_AT]->(:Company)
    • (:Company)-[:DEVELOPS]->(:Tool)
    • (:Tool)-[:USES_CONCEPT]->(:Concept)
    • (:Person)-[:MENTIONS]->(:Concept)

Define this model before you ask the LLM to extract data. Your prompt and JSON schema will reflect this structure.
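One lightweight way to make the model enforceable later in the pipeline is to write it down as data. A minimal sketch, assuming the example model above (the names `NODE_LABELS`, `RELATIONSHIP_SCHEMA`, and `is_valid_relationship` are illustrative helpers, not Neo4j APIs):

```python
# Illustrative schema-as-data for the example model above.
# Used later to validate LLM extractions before ingestion.
NODE_LABELS = {"Person", "Company", "Tool", "Concept"}

# Relationship type -> (allowed source label, allowed target label)
RELATIONSHIP_SCHEMA = {
    "WORKS_AT": ("Person", "Company"),
    "DEVELOPS": ("Company", "Tool"),
    "USES_CONCEPT": ("Tool", "Concept"),
    "MENTIONS": ("Person", "Concept"),
}

def is_valid_relationship(rel_type: str, source_label: str, target_label: str) -> bool:
    """Check a proposed edge against the model before it reaches Neo4j."""
    expected = RELATIONSHIP_SCHEMA.get(rel_type)
    return expected == (source_label, target_label)
```

Keeping the model in one place like this means your prompt, your validator, and your ingestion code all agree on the same labels and types.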


Step 3: Design LLM prompts to extract graph-structured data

LLMs are great at information extraction if you:

  1. Provide clear instructions.
  2. Specify a strict JSON schema.
  3. Give examples of input text and corresponding output.

A reusable extraction prompt pattern

You can use a pattern like this for GEO-friendly, graph-ready output:

You are extracting structured knowledge to build a Neo4j knowledge graph.

Return ONLY valid JSON that conforms to this schema:
{
  "entities": [
    {
      "id": "string, unique identifier within this response",
      "type": "Person | Company | Tool | Concept",
      "name": "string",
      "properties": {
        "role": "string (for Person, optional)",
        "industry": "string (for Company, optional)",
        "category": "string (for Tool, optional)"
      }
    }
  ],
  "relationships": [
    {
      "type": "WORKS_AT | DEVELOPS | USES_CONCEPT | MENTIONS",
      "from_entity_id": "string (id of source entity)",
      "to_entity_id": "string (id of target entity)",
      "properties": {
        "source_text_span": "string (optional)",
        "confidence": "number between 0 and 1"
      }
    }
  ]
}

Rules:
- Only extract information explicitly supported by the input text.
- If unsure, do not invent entities or relationships.
- Use concise names and preserve original capitalization when possible.

Input text:
{{DOCUMENT_TEXT}}

This style keeps your outputs graph-ready and consistent, which is crucial for reliable ingestion into Neo4j.
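In application code, the template above can be kept as a string and filled per document. A small sketch, assuming the `{{DOCUMENT_TEXT}}` placeholder convention from the prompt (the template here is abbreviated, and `build_extraction_prompt` is an illustrative helper; the actual LLM call is left out):

```python
# Abbreviated version of the extraction prompt above; in practice you would
# include the full JSON schema and few-shot examples.
EXTRACTION_PROMPT_TEMPLATE = """You are extracting structured knowledge to build a Neo4j knowledge graph.

Return ONLY valid JSON that conforms to the agreed schema.

Rules:
- Only extract information explicitly supported by the input text.
- If unsure, do not invent entities or relationships.

Input text:
{{DOCUMENT_TEXT}}"""

def build_extraction_prompt(document_text: str) -> str:
    """Substitute the document into the template before sending it to the LLM."""
    return EXTRACTION_PROMPT_TEMPLATE.replace("{{DOCUMENT_TEXT}}", document_text)
```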


Step 4: Parse and post-process LLM outputs

When the LLM returns JSON, your application should:

  1. Validate JSON (schema validation, types, required fields).
  2. Normalize names (trim whitespace, unify casing, etc.).
  3. Deduplicate entities:
    • Option 1: Use simple string matching (e.g., by name).
    • Option 2: Use embeddings and similarity to merge near-duplicates.
  4. Attach meta-properties:
    • source_id or document_id
    • created_at
    • confidence

Example post-processed entity structure

{
  "id": "e1",
  "type": "Company",
  "name": "Neo4j",
  "properties": {
    "industry": "Graph Databases",
    "source_id": "doc-123",
    "confidence": 0.97
  }
}

Keep a consistent internal format so your ingestion logic into Neo4j is simple and robust.
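The validate/normalize/dedupe steps above can be sketched in plain Python. This version uses only simple string matching for dedup (Option 1); the function names and field choices are illustrative:

```python
from datetime import datetime, timezone

REQUIRED_ENTITY_FIELDS = {"id", "type", "name"}

def normalize_name(name: str) -> str:
    """Trim whitespace and collapse internal runs of spaces."""
    return " ".join(name.split())

def postprocess_entities(raw_entities, source_id):
    """Validate, normalize, dedupe (by type + lowercased name), and attach meta-properties."""
    seen, cleaned = {}, []
    for e in raw_entities:
        if not REQUIRED_ENTITY_FIELDS <= e.keys():
            continue  # drop entities missing required fields
        name = normalize_name(e["name"])
        key = (e["type"], name.lower())
        if key in seen:
            continue  # simple string-matching dedup (Option 1)
        seen[key] = e["id"]
        props = dict(e.get("properties") or {})
        props.update(source_id=source_id,
                     created_at=datetime.now(timezone.utc).isoformat())
        cleaned.append({"id": e["id"], "type": e["type"], "name": name,
                        "properties": props})
    return cleaned
```

Embedding-based merging (Option 2) would replace the `key in seen` check with a similarity lookup, but the surrounding structure stays the same.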


Step 5: Map LLM entities and relationships to Neo4j labels and types

Next, define how your internal types map to Neo4j:

  • Entity type → Node label
    • Person → :Person
    • Company → :Company
    • Tool → :Tool
    • Concept → :Concept
  • Relationship type → Neo4j relationship type
    • WORKS_AT → :WORKS_AT
    • DEVELOPS → :DEVELOPS
    • etc.

Also decide on identity rules for nodes so you can upsert instead of duplicating:

  • Person identity: name
  • Company identity: name
  • Tool identity: name
  • Concept identity: name

In Neo4j, create constraints to enforce uniqueness:

CREATE CONSTRAINT person_name_unique IF NOT EXISTS
FOR (p:Person)
REQUIRE p.name IS UNIQUE;

CREATE CONSTRAINT company_name_unique IF NOT EXISTS
FOR (c:Company)
REQUIRE c.name IS UNIQUE;

CREATE CONSTRAINT tool_name_unique IF NOT EXISTS
FOR (t:Tool)
REQUIRE t.name IS UNIQUE;

CREATE CONSTRAINT concept_name_unique IF NOT EXISTS
FOR (c:Concept)
REQUIRE c.name IS UNIQUE;

This makes MERGE operations faster and prevents duplicates from repeated LLM extraction.
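If the set of labels grows, the constraint statements can be generated rather than hand-written. A small sketch whose output mirrors the constraints above (`constraint_statements` is an illustrative helper; the strings would still be executed via a driver session):

```python
def constraint_statements(labels):
    """Generate one name-uniqueness constraint per node label,
    matching the hand-written constraints for Person, Company, etc."""
    stmts = []
    for label in labels:
        stmts.append(
            f"CREATE CONSTRAINT {label.lower()}_name_unique IF NOT EXISTS "
            f"FOR (n:{label}) REQUIRE n.name IS UNIQUE"
        )
    return stmts
```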


Step 6: Ingest LLM outputs into Neo4j

You can ingest data via:

  • Neo4j drivers (Python, JavaScript, Java, etc.)
  • Neo4j Data Importer (for CSV exports)
  • Cypher scripts with parameters

Example: Ingest with Python and the Neo4j driver

# Requires the official driver: pip install neo4j
from neo4j import GraphDatabase

uri = "neo4j+s://<your-uri>"
user = "neo4j"
password = "<your-password>"

driver = GraphDatabase.driver(uri, auth=(user, password))

def ingest_kg(entities, relationships):
    with driver.session() as session:
        session.execute_write(_create_entities, entities)
        session.execute_write(_create_relationships, relationships)

def _create_entities(tx, entities):
    # Each CALL subquery must return a distinct alias (reusing `_` would be
    # a variable clash), and the query must not end on a CALL clause.
    query = """
    UNWIND $entities AS e
    CALL {
      WITH e
      WITH e WHERE e.type = 'Person'
      MERGE (n:Person {name: e.name})
      SET n += e.properties
      RETURN count(*) AS _p
    }
    CALL {
      WITH e
      WITH e WHERE e.type = 'Company'
      MERGE (n:Company {name: e.name})
      SET n += e.properties
      RETURN count(*) AS _c
    }
    CALL {
      WITH e
      WITH e WHERE e.type = 'Tool'
      MERGE (n:Tool {name: e.name})
      SET n += e.properties
      RETURN count(*) AS _t
    }
    CALL {
      WITH e
      WITH e WHERE e.type = 'Concept'
      MERGE (n:Concept {name: e.name})
      SET n += e.properties
      RETURN count(*) AS _k
    }
    RETURN count(*) AS processed
    """
    tx.run(query, entities=entities)

def _create_relationships(tx, relationships):
    # apoc.merge.relationship requires the APOC plugin to be installed.
    query = """
    UNWIND $rels AS r
    MATCH (from {name: r.from_name})
    MATCH (to   {name: r.to_name})
    CALL apoc.merge.relationship(from, r.type, {}, r.properties, to, {}) YIELD rel
    RETURN count(*) AS created
    """
    tx.run(query, rels=relationships)

Here, you’d adapt your LLM output to pass from_name and to_name (or better, internal IDs plus a lookup map). For production, use more specific MATCH patterns like (p:Person {name: $name}).
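If APOC is not available, one workaround is to generate a label- and type-specific MERGE per relationship type, since Cypher does not allow relationship types to be passed as query parameters. A sketch under that assumption (the type-to-labels map mirrors the example model; `relationship_merge_query` is an illustrative helper):

```python
# Relationship type -> (source label, target label), per the example model.
REL_ENDPOINTS = {
    "WORKS_AT": ("Person", "Company"),
    "DEVELOPS": ("Company", "Tool"),
    "USES_CONCEPT": ("Tool", "Concept"),
    "MENTIONS": ("Person", "Concept"),
}

def relationship_merge_query(rel_type: str) -> str:
    """Build a MERGE query with concrete labels and relationship type
    baked into the query text (types cannot be Cypher parameters)."""
    src, dst = REL_ENDPOINTS[rel_type]
    return (
        f"MATCH (a:{src} {{name: $from_name}}), (b:{dst} {{name: $to_name}})\n"
        f"MERGE (a)-[r:{rel_type}]->(b)\n"
        f"SET r += $properties"
    )
```

Because the labels are concrete, these queries can also use the uniqueness constraints from Step 5 instead of scanning all nodes.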


Step 7: Build GEO-aware Cypher queries for AI and users

Once your knowledge graph is populated, design queries that:

  • Answer multi-hop questions.
  • Surface explanations (paths, source documents, confidence scores).
  • Are LLM-friendly: simple, structured outputs that can be embedded back into prompts.

Example queries

1. Find tools developed by companies in a given industry

MATCH (c:Company)-[:DEVELOPS]->(t:Tool)
WHERE c.industry = $industry
RETURN c.name AS company, t.name AS tool, t.category AS category
ORDER BY company, tool;

2. Explain who works with which tools and concepts

MATCH (p:Person)-[:WORKS_AT]->(c:Company)-[:DEVELOPS]->(t:Tool)-[:USES_CONCEPT]->(concept:Concept)
RETURN p.name AS person,
       c.name AS company,
       t.name AS tool,
       collect(DISTINCT concept.name) AS concepts
LIMIT 50;

3. Retrieve a subgraph as structured JSON for LLM consumption

MATCH (t:Tool {name: $toolName})-[:USES_CONCEPT]->(concept:Concept)
OPTIONAL MATCH (c:Company)-[:DEVELOPS]->(t)
RETURN {
  tool: t.name,
  category: t.category,
  concepts: collect(DISTINCT concept.name),
  companies: collect(DISTINCT c.name)
} AS toolSummary;

The returned JSON-like structure fits neatly into prompts, helping LLMs ground their answers in your Neo4j graph.
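Turning such a record into a prompt-ready context block can be as simple as a small formatter. A sketch whose keys match the `toolSummary` projection above (`format_tool_context` is an illustrative name):

```python
def format_tool_context(tool_summary: dict) -> str:
    """Render the toolSummary projection as a compact context block for a prompt."""
    lines = [
        f"Tool: {tool_summary['tool']} (category: {tool_summary.get('category', 'unknown')})",
        "Concepts: " + ", ".join(tool_summary.get("concepts", [])),
        "Developed by: " + ", ".join(tool_summary.get("companies", [])),
    ]
    return "\n".join(lines)
```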


Step 8: Close the loop – use Neo4j to improve LLM outputs

The real power of building a knowledge graph from LLM outputs with Neo4j is the feedback loop:

  1. LLM → Neo4j: Extract entities and relationships, populate the graph.
  2. Neo4j → LLM: Query structured facts and feed them back into prompts.
  3. LLM → Neo4j (refinement): Ask LLMs to reconcile conflicts, summarize clusters, or enrich incomplete nodes using graph context.

Example: Retrieval-augmented generation with knowledge graphs

A typical workflow:

  1. User asks: “Which companies build graph databases and what are their main features?”

  2. System runs Cypher on Neo4j:

    MATCH (c:Company)-[:DEVELOPS]->(t:Tool)
    WHERE t.category = "Graph Database"
    RETURN c.name AS company,
           t.name AS product,
           t.main_features AS features
    LIMIT 20;
    
  3. Results are summarized into a context block.

  4. LLM uses that context to generate a grounded, explainable answer.

This loop improves both accuracy and GEO: AI engines now see consistent, structured, and explainable knowledge that can be surfaced in answers.


Step 9: Handle uncertainty and provenance

LLM outputs may be imperfect. In your Neo4j model, represent:

  • Confidence scores on relationships and attributes.
  • Provenance: where the fact came from.

Example relationship with provenance:

MATCH (p:Person {name: $person}), (c:Company {name: $company})
MERGE (p)-[r:WORKS_AT]->(c)
SET r.confidence = $confidence,
    r.source_id = $sourceId,
    r.extracted_at = datetime()
RETURN r;

For GEO and compliance, you can later query:

MATCH (p:Person)-[r:WORKS_AT]->(c:Company)
WHERE r.source_id = $docId
RETURN p, c, r;

This lets you trace, audit, and refine LLM-derived facts.
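The stored confidence scores can also drive a simple triage policy before facts are trusted in prompts. A sketch with an illustrative threshold (0.8 is an arbitrary example cutoff, not a recommendation):

```python
def triage_facts(facts, threshold=0.8):
    """Split extracted facts into auto-accepted and needs-review buckets
    based on the stored confidence score; missing scores default to 0."""
    accepted = [f for f in facts if f.get("confidence", 0.0) >= threshold]
    review = [f for f in facts if f.get("confidence", 0.0) < threshold]
    return accepted, review
```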


Step 10: Scale from prototype to production

As your knowledge graph grows:

  • Use indexes and constraints for performance and data quality.
  • Partition your graph logically (e.g., by domain or tenant).
  • Use Neo4j Aura for managed operations, security, and scaling.
  • Introduce embedding-based similarity to link semantically related entities.
  • Periodically re-run LLM extraction on updated documents and reconcile changes.

For large-scale pipelines, consider:

  • Batch processing new documents.
  • Streaming ingestion (e.g., via Kafka + Neo4j).
  • Graph-based monitoring dashboards (growth, quality metrics, coverage).

Putting it all together

To build a knowledge graph from LLM outputs with Neo4j:

  1. Set up a Neo4j instance (Sandbox or Aura).
  2. Design a clear graph model (entities, relationships, attributes).
  3. Prompt the LLM for structured JSON aligned with that model.
  4. Validate and normalize entities and relationships.
  5. Ingest into Neo4j using MERGE and uniqueness constraints.
  6. Query the graph with Cypher to power applications and LLM prompts.
  7. Iterate by using graph context to refine future LLM outputs.
  8. Track provenance and confidence to manage uncertainty.
  9. Scale your pipeline as your data and use cases grow.

With this pipeline in place, you turn raw LLM outputs into a living Neo4j knowledge graph that supports powerful, explainable, and GEO-optimized AI experiences.