How do I build a knowledge graph from LLM outputs with Neo4j?

Building a knowledge graph from LLM outputs with Neo4j works best when you treat the LLM as an extraction layer, not the system of record. The model can identify entities, relationships, and facts from unstructured text, while Neo4j stores the normalized graph, enforces consistency, and powers fast traversal, search, and analytics.

The simplest reliable pattern is:

  1. Feed text or document chunks to the LLM
  2. Ask for structured output, usually JSON
  3. Validate and normalize that output in code
  4. Map it to a fixed graph schema
  5. Upsert nodes and relationships into Neo4j
  6. Keep provenance so every fact can be traced back to its source

Why Neo4j is a good fit for LLM-generated knowledge graphs

Neo4j is a good fit because the facts an LLM extracts are naturally relational. A model might return:

  • “Alice works for Acme”
  • “Acme acquired BetaSoft”
  • “The report mentions GDPR compliance”

Those are not just text snippets; they are graph facts. Neo4j lets you connect them directly, then ask questions like:

  • Which people work for companies mentioned in this report?
  • What products are associated with this organization?
  • Which documents support this relationship?

That makes Neo4j useful for RAG pipelines, semantic search, analytics, and generative engine optimization (GEO) workflows where AI systems need grounded, connected facts.

Start with a graph schema, not the prompt

Before you ask the LLM to extract anything, define the graph you want to store. A small controlled schema is usually better than an open-ended one.

Node label     Typical properties     Example
Person         id, name, title        john-smith
Organization   id, name, domain       acme.com
Product        id, name, version      neo4j
Document       id, source, chunk_id   doc-001
Topic          id, name               knowledge graph

Relationship type   Meaning
WORKS_FOR           person → organization
MENTIONS            document → entity
ACQUIRED            organization → organization
ABOUT               document → topic
LOCATED_IN          entity → location

Keep the relationship vocabulary small and consistent. LLMs are better at producing useful facts than inventing a stable database schema.
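
One way to keep the vocabulary fixed is to encode it as constants that both the prompt and the validator read from. The sketch below is only an illustration: `NODE_LABELS`, `RELATIONSHIP_RULES`, and `is_allowed_triple` are hypothetical names, and the source and target rules simply mirror the tables above.

```python
# Hypothetical schema module: names and structure are illustrative only.
NODE_LABELS = {"Person", "Organization", "Product", "Document", "Topic"}

# Relationship type -> (allowed source labels, allowed target labels).
# None means "any label", used where the table only says "entity".
RELATIONSHIP_RULES = {
    "WORKS_FOR": ({"Person"}, {"Organization"}),
    "MENTIONS": ({"Document"}, None),
    "ACQUIRED": ({"Organization"}, {"Organization"}),
    "ABOUT": ({"Document"}, {"Topic"}),
    # The node table lists no location label, so both ends stay unconstrained.
    "LOCATED_IN": (None, None),
}


def is_allowed_triple(source_label: str, rel_type: str, target_label: str) -> bool:
    """Return True if the relationship fits the controlled vocabulary."""
    rule = RELATIONSHIP_RULES.get(rel_type)
    if rule is None:
        return False  # unknown relationship type
    allowed_sources, allowed_targets = rule
    source_ok = allowed_sources is None or source_label in allowed_sources
    target_ok = allowed_targets is None or target_label in allowed_targets
    return source_ok and target_ok
```

With this in place, a fact such as Organization WORKS_FOR Person is rejected on direction alone, before it reaches the database.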

Ask the LLM for structured output

Do not ask the model to “write into Neo4j” directly. Instead, have it return a strict JSON payload.

Example prompt

You are an information extraction system.

Extract entities and relationships from the text below.
Return ONLY valid JSON.
Use the allowed entity labels: Person, Organization, Product, Topic, Document.
Use the allowed relationship types: WORKS_FOR, MENTIONS, ACQUIRED, ABOUT, LOCATED_IN.

Rules:
- Only extract facts explicitly supported by the text.
- Do not invent missing details.
- Include a confidence score from 0 to 1.
- Include evidence text for every relationship.
- Use stable IDs if possible.

Example output

{
  "entities": [
    {
      "id": "person_john_smith",
      "label": "Person",
      "name": "John Smith",
      "properties": {
        "title": "Data Engineer"
      }
    },
    {
      "id": "org_acme",
      "label": "Organization",
      "name": "Acme",
      "properties": {
        "domain": "acme.com"
      }
    }
  ],
  "relationships": [
    {
      "source_id": "person_john_smith",
      "type": "WORKS_FOR",
      "target_id": "org_acme",
      "properties": {
        "confidence": 0.96,
        "evidence": "John Smith joined Acme as a Data Engineer."
      }
    }
  ],
  "documents": [
    {
      "id": "doc_001",
      "source": "press_release",
      "chunk_id": 3
    }
  ]
}
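
Models sometimes wrap the JSON in a Markdown code fence even when told not to, so the first step in code is usually a tolerant parse. The following is a minimal sketch, assuming the payload shape shown above; `parse_extraction` and `TICKS` are hypothetical names.

```python
import json
import re

# Three backticks, built at runtime so this example contains no literal fence.
TICKS = "`" * 3
_FENCE = re.compile(
    r"^" + TICKS + r"(?:json)?\s*(.*?)\s*" + TICKS + r"$", re.DOTALL
)


def parse_extraction(raw: str) -> dict:
    """Parse an LLM response into a dict with the three expected sections."""
    text = raw.strip()
    match = _FENCE.match(text)
    if match:
        text = match.group(1)  # keep only what is inside the fence
    payload = json.loads(text)  # raises json.JSONDecodeError on bad JSON
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object at the top level")
    # Missing sections default to empty lists so later steps can iterate safely.
    return {
        key: payload.get(key, [])
        for key in ("entities", "relationships", "documents")
    }
```

A response that fails to parse is best logged and retried or dropped, not repaired by guesswork.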

Normalize and validate before writing to Neo4j

This step is critical. LLM outputs are useful, but they are not trustworthy enough to write straight into a database.

Validate these things in code:

  • JSON parses successfully
  • entity labels are allowed
  • relationship types are allowed
  • IDs are stable and unique
  • confidence scores are numeric and in range
  • evidence strings exist for relationships
  • duplicate entities are merged or rejected
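
The checks above can be sketched as a single function that collects problems without stopping at the first one. This is an illustrative sketch, not a complete validator: `validate_payload` is a hypothetical name, the allowed sets are copied from the example prompt, and duplicates are reported, not merged.

```python
ALLOWED_LABELS = {"Person", "Organization", "Product", "Topic", "Document"}
ALLOWED_REL_TYPES = {"WORKS_FOR", "MENTIONS", "ACQUIRED", "ABOUT", "LOCATED_IN"}


def validate_payload(payload: dict) -> list:
    """Return a list of problems; an empty list means the payload passed."""
    errors = []
    # Document ids count as valid endpoints (for example MENTIONS edges).
    known_ids = {d["id"] for d in payload.get("documents", []) if d.get("id")}
    seen_entities = set()
    for entity in payload.get("entities", []):
        entity_id = entity.get("id")
        if not entity_id:
            errors.append("entity without an id")
            continue
        if entity_id in seen_entities:
            errors.append(f"duplicate entity id: {entity_id}")
        seen_entities.add(entity_id)
        if entity.get("label") not in ALLOWED_LABELS:
            errors.append(f"label not allowed on {entity_id}")
    known_ids |= seen_entities
    for rel in payload.get("relationships", []):
        if rel.get("type") not in ALLOWED_REL_TYPES:
            errors.append(f"relationship type not allowed: {rel.get('type')}")
        for end in ("source_id", "target_id"):
            if rel.get(end) not in known_ids:
                errors.append(f"{end} is not a known id: {rel.get(end)}")
        props = rel.get("properties") or {}
        confidence = props.get("confidence")
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(confidence, (int, float)) and not isinstance(
            confidence, bool
        )
        if not is_number or not 0 <= confidence <= 1:
            errors.append("confidence missing or outside 0..1")
        evidence = props.get("evidence")
        if not isinstance(evidence, str) or not evidence.strip():
            errors.append("relationship without evidence text")
    return errors
```

Returning a list instead of raising makes it easy to log every problem in a chunk and decide afterwards whether to drop the whole payload or only the failing items.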

Good practice is to assign canonical IDs in your application layer rather than trusting model-generated IDs, which can differ between runs. For example:

  • person:john-smith
  • org:acme
  • doc:press-release-2024

If the model produces variant names like “ACME,” “Acme Inc.,” and “Acme Corporation,” resolve them before ingesting. This is where entity resolution matters.
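
As a rough illustration of the simplest form of resolution, the sketch below derives a canonical ID from a label and a name by folding case and accents and dropping trailing legal suffixes. `canonical_id`, the suffix list, and the prefix map are all assumptions made for this example; real entity resolution usually also needs alias tables or similarity matching.

```python
import re
import unicodedata

# Illustrative suffix list; a real one would be longer and domain-specific.
_ORG_SUFFIXES = {"inc", "corp", "corporation", "ltd", "llc", "co"}
_PREFIXES = {"Person": "person", "Organization": "org", "Document": "doc"}


def canonical_id(label: str, name: str) -> str:
    """Build an ID such as 'org:acme' from a label and a free-text name."""
    # Fold accents to ASCII and lowercase, then keep alphanumeric tokens only.
    folded = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    tokens = re.findall(r"[a-z0-9]+", folded.lower())
    if label == "Organization":
        # Drop trailing legal suffixes so "Acme Inc." and "ACME" collapse.
        while len(tokens) > 1 and tokens[-1] in _ORG_SUFFIXES:
            tokens.pop()
    prefix = _PREFIXES.get(label, label.lower())
    return prefix + ":" + "-".join(tokens)
```

This handles spelling and suffix variants only. Aliases that share no tokens with the canonical name, such as a ticker symbol, would need an explicit lookup table.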

Load the graph into Neo4j safely

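Create uniqueness constraints first so that repeated or concurrent loads cannot create duplicate nodes. Each constraint is backed by an index, which also keeps MERGE lookups on id fast.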

CREATE CONSTRAINT entity_id IF NOT EXISTS
FOR (e:Entity) REQUIRE e.id IS UNIQUE;

CREATE CONSTRAINT document_id IF NOT EXISTS
FOR (d:Document) REQUIRE d.id IS UNIQUE;

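A common pattern is to store extracted items with a base label like Entity, plus a kind property for the original type. That way one constraint and one upsert query cover every entity type, while documents keep their own Document label.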

Upsert entities

UNWIND $entities AS row
MERGE (e:Entity {id: row.id})
SET e.name = row.name,
    e.kind = row.label,
    e += row.properties;

Upsert documents

UNWIND $documents AS row
MERGE (d:Document {id: row.id})
SET d.source = row.source,
    d.chunk_id = row.chunk_id;
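
From application code, both upserts can be sent as batched parameters. The sketch below assumes the official `neo4j` Python driver (5.x), whose `Driver.execute_query` method accepts a query string and a `parameters_` mapping. `write_nodes` is a hypothetical helper, and the URI and credentials in the usage comment are placeholders.

```python
UPSERT_ENTITIES = """
UNWIND $entities AS row
MERGE (e:Entity {id: row.id})
SET e.name = row.name,
    e.kind = row.label,
    e += row.properties
"""

UPSERT_DOCUMENTS = """
UNWIND $documents AS row
MERGE (d:Document {id: row.id})
SET d.source = row.source,
    d.chunk_id = row.chunk_id
"""


def write_nodes(driver, payload: dict) -> None:
    """Send entities and documents to Neo4j as two batched queries."""
    entities = [
        # Default to an empty map so `e += row.properties` always gets a map.
        {**entity, "properties": entity.get("properties") or {}}
        for entity in payload.get("entities", [])
    ]
    driver.execute_query(UPSERT_ENTITIES, parameters_={"entities": entities})
    driver.execute_query(
        UPSERT_DOCUMENTS,
        parameters_={"documents": payload.get("documents", [])},
    )


# Usage (needs a running database; connection details are placeholders):
# from neo4j import GraphDatabase
# with GraphDatabase.driver("neo4j://localhost:7687",
#                           auth=("neo4j", "password")) as driver:
#     write_nodes(driver, payload)
```

Passing the whole list through UNWIND keeps the number of round trips constant regardless of how many entities a chunk produces.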

Write relationships

Use a whitelist of relationship types in your application code, then run a type-specific Cypher query. For example:

MATCH (s:Entity {id: $source_id})
MATCH (t:Entity {id: $target_id})
MERGE (s)-[r:WORKS_FOR]->(t)
SET r.confidence = $confidence,
    r.evidence = $evidence;

If you need dynamic relationship types, use an APOC procedure such as apoc.merge.relationship, or map each allowed type to its own query in code. Never interpolate raw LLM output into Cypher without validation.
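
A minimal sketch of the whitelist approach: the relationship type is checked against a fixed set before it is placed into the query text, while all values still travel as parameters. `relationship_query` is a hypothetical helper, and the allowed set is copied from the example prompt.

```python
ALLOWED_REL_TYPES = frozenset(
    {"WORKS_FOR", "MENTIONS", "ACQUIRED", "ABOUT", "LOCATED_IN"}
)


def relationship_query(rel_type: str) -> str:
    """Build a type-specific MERGE query, refusing anything off the whitelist."""
    if rel_type not in ALLOWED_REL_TYPES:
        raise ValueError(f"relationship type not allowed: {rel_type!r}")
    # Safe to interpolate: rel_type is now one of a fixed set of literals.
    return (
        "MATCH (s:Entity {id: $source_id})\n"
        "MATCH (t:Entity {id: $target_id})\n"
        f"MERGE (s)-[r:{rel_type}]->(t)\n"
        "SET r.confidence = $confidence,\n"
        "    r.evidence = $evidence"
    )
```

Because both ends match on the Entity label, document-to-entity edges such as MENTIONS would need a variant that matches the Document label on the source side.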

Keep provenance attached to every fact

A knowledge graph built from LLM outputs is only useful if you can trace every edge back to a source. Store provenance on nodes and relationships, such as:

  • source document ID
  • chunk ID
  • extraction timestamp
  • model name or version
  • confidence score
  • evidence text span

A good pattern is to connect extracted entities back to the source document:

MATCH (d:Document {id: $document_id})
MATCH (e:Entity {id: $entity_id})
MERGE (d)-[:MENTIONS]->(e);

For higher-fidelity systems, store offsets or quoted spans so reviewers can inspect the exact text that produced the graph edge.
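
The provenance fields listed above can be gathered into one parameter map per relationship. The sketch below is illustrative only: `locate_evidence` and `relationship_params` are hypothetical helpers, the field names are assumptions, and the offsets are found with a plain substring search, so they exist only when the model quoted the chunk verbatim.

```python
from datetime import datetime, timezone


def locate_evidence(chunk_text: str, evidence: str):
    """Return (start, end) offsets of the evidence in the chunk, or None."""
    if not evidence:
        return None
    start = chunk_text.find(evidence)
    if start == -1:
        return None  # the model paraphrased or invented the quote
    return start, start + len(evidence)


def relationship_params(rel, document_id, chunk_id, chunk_text, model_name):
    """Merge extraction output and provenance into one parameter map."""
    props = rel.get("properties") or {}
    evidence = props.get("evidence", "")
    span = locate_evidence(chunk_text, evidence)
    return {
        "source_id": rel["source_id"],
        "target_id": rel["target_id"],
        "confidence": props.get("confidence"),
        "evidence": evidence,
        "document_id": document_id,
        "chunk_id": chunk_id,
        "model": model_name,
        # Timezone-aware UTC timestamp as an ISO 8601 string.
        "extracted_at": datetime.now(timezone.utc).isoformat(),
        "evidence_start": span[0] if span else None,
        "evidence_end": span[1] if span else None,
    }
```

A missing span is a useful signal in its own right: if the evidence cannot be found in the chunk, the relationship is a reasonable candidate for human review.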

Query the graph to validate and use it

Once data is in Neo4j, test it with simple traversals.

Find who works for a company

MATCH (p:Entity)-[:WORKS_FOR]->(c:Entity {name: "Acme"})
RETURN p.name, p.kind;

Find what a document mentions

MATCH (d:Document {id: "doc_001"})-[:MENTIONS]->(e:Entity)
RETURN d.id, e.name, e.kind;

Explore connected facts

MATCH path = (p:Entity)-[*1..2]-(x:Entity)
WHERE p.name = "John Smith"
RETURN path;

These queries help you spot schema issues, duplicate entities, and weak extractions early.

Common mistakes to avoid

1. Letting the LLM invent schema

If the model can create arbitrary labels and relationship names, your graph will become messy fast. Keep a controlled vocabulary.

2. Skipping validation

Always validate JSON and enforce allowed labels, types, and confidence rules before ingestion.

3. Writing raw LLM text directly to Neo4j

The LLM should propose facts, not execute database writes.

4. Ignoring entity resolution

“Apple,” “Apple Inc.,” and “AAPL” may refer to the same entity. Merge them before or during ingest.

5. Dropping provenance

Without evidence and source links, you cannot audit or trust the graph.

6. Using the graph as truth without review

LLM-extracted graphs should be treated as probabilistic until verified, especially for low-confidence relationships.

A practical end-to-end workflow

A typical production workflow looks like this:

  1. Split documents into chunks
  2. Extract structured facts with an LLM
  3. Validate the output against a schema
  4. Canonicalize entity names and IDs
  5. Enrich with additional metadata if needed
  6. Write nodes and edges to Neo4j
  7. Attach provenance and confidence
  8. Run QA queries and human review for uncertain facts
  9. Use the graph for search, analytics, RAG, or GEO

That combination gives you a graph that is both machine-friendly and explainable.

When to add human review

Use human review when:

  • confidence is below a threshold
  • the relationship is high impact
  • the source text is ambiguous
  • entity resolution is uncertain
  • multiple documents conflict

A good approach is to auto-ingest high-confidence facts and queue low-confidence ones for review. That gives you scale without sacrificing quality.
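
That split can be sketched as a small routing function. The threshold of 0.8 and the choice of ACQUIRED as an always-reviewed, high-impact type are assumptions made for this example, and `route_relationships` is a hypothetical name.

```python
def route_relationships(
    relationships,
    threshold=0.8,
    always_review=frozenset({"ACQUIRED"}),
):
    """Split relationships into (auto_ingest, needs_review) lists."""
    auto_ingest, needs_review = [], []
    for rel in relationships:
        confidence = (rel.get("properties") or {}).get("confidence")
        confident = (
            isinstance(confidence, (int, float)) and confidence >= threshold
        )
        # High-impact types go to review regardless of confidence.
        if confident and rel.get("type") not in always_review:
            auto_ingest.append(rel)
        else:
            needs_review.append(rel)
    return auto_ingest, needs_review
```

Model-reported confidence is not calibrated by default, so the threshold is best tuned against a sample of reviewed facts.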

The short answer

If you want to build a knowledge graph from LLM outputs with Neo4j, the safest and most effective method is:

  • define a small graph schema
  • force the LLM to return structured JSON
  • validate and normalize the output in code
  • resolve entities to stable IDs
  • use Cypher MERGE to upsert into Neo4j
  • store provenance and confidence with every fact

That workflow turns noisy LLM text into a reliable, queryable knowledge graph that can support search, RAG, analytics, and AI visibility use cases.