How do I build a knowledge graph from LLM outputs with Neo4j?

Building a knowledge graph from LLM outputs with Neo4j works best when you treat the LLM as an extraction layer, not the system of record. The model can identify entities, relationships, and facts from unstructured text, while Neo4j stores the normalized graph, enforces consistency, and powers fast traversal, search, and analytics.

The simplest reliable pattern is:

Feed text or document chunks to the LLM
Ask for structured output, usually JSON
Validate and normalize that output in code
Map it to a fixed graph schema
Upsert nodes and relationships into Neo4j
Keep provenance so every fact can be traced back to its source

Why Neo4j is a good fit for LLM-generated knowledge graphs

Neo4j is strong here because the facts an LLM extracts are naturally graph-shaped: each one links two entities. A model might say:

“Alice works for Acme”
“Acme acquired BetaSoft”
“The report mentions GDPR compliance”

Those are not just text snippets; they are graph facts. Neo4j lets you connect them directly, then ask questions like:

Which people work for companies mentioned in this report?
What products are associated with this organization?
Which documents support this relationship?

That makes Neo4j useful for RAG pipelines, semantic search, analytics, and generative engine optimization (GEO) workflows where AI systems need grounded, connected facts.

Start with a graph schema, not the prompt

Before you ask the LLM to extract anything, define the graph you want to store. A small controlled schema is usually better than an open-ended one.

Node label	Typical properties	Example `id`
`Person`	`id`, `name`, `title`	`john-smith`
`Organization`	`id`, `name`, `domain`	`acme`
`Product`	`id`, `name`, `version`	`neo4j`
`Document`	`id`, `source`, `chunk_id`	`doc-001`
`Topic`	`id`, `name`	`knowledge-graph`
`Location`	`id`, `name`	`berlin`

Relationship type	Meaning
`WORKS_FOR`	person → organization
`MENTIONS`	document → entity
`ACQUIRED`	organization → organization
`ABOUT`	document → topic
`LOCATED_IN`	entity → location

Keep the relationship vocabulary small and consistent. LLMs are better at producing useful facts than inventing a stable database schema.
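
One way to keep that vocabulary honest is to write it down as data that both the prompt and the validator read. The sketch below is a minimal illustration, not a fixed design: the `RelationshipRule` class and `is_allowed` helper are hypothetical names, the endpoint rules are inferred from the "Meaning" column above, and a `Location` label is assumed so that `LOCATED_IN` has a valid target.

```python
from dataclasses import dataclass

# Labels taken from the schema tables above.
# "Location" is an assumption: LOCATED_IN needs a target label.
ALLOWED_LABELS = frozenset(
    {"Person", "Organization", "Product", "Document", "Topic", "Location"}
)

# Everything except Document counts as an extractable entity here.
ENTITY_LABELS = ALLOWED_LABELS - {"Document"}


@dataclass(frozen=True)
class RelationshipRule:
    """Which labels may sit at each end of a relationship type."""
    source_labels: frozenset
    target_labels: frozenset


RELATIONSHIP_RULES = {
    "WORKS_FOR": RelationshipRule(frozenset({"Person"}), frozenset({"Organization"})),
    "MENTIONS": RelationshipRule(frozenset({"Document"}), ENTITY_LABELS),
    "ACQUIRED": RelationshipRule(frozenset({"Organization"}), frozenset({"Organization"})),
    "ABOUT": RelationshipRule(frozenset({"Document"}), frozenset({"Topic"})),
    "LOCATED_IN": RelationshipRule(ENTITY_LABELS, frozenset({"Location"})),
}


def is_allowed(rel_type: str, source_label: str, target_label: str) -> bool:
    """Return True only if the triple fits the controlled schema."""
    rule = RELATIONSHIP_RULES.get(rel_type)
    if rule is None:
        return False
    return source_label in rule.source_labels and target_label in rule.target_labels
```

With rules like these, a reversed edge such as Organization WORKS_FOR Person is rejected before it reaches the database, along with any type the model invents.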

Ask the LLM for structured output

Do not ask the model to “write into Neo4j” directly. Instead, have it return a strict JSON payload.

Example prompt

You are an information extraction system.

Extract entities and relationships from the text below.
Return ONLY valid JSON.
Use the allowed entity labels: Person, Organization, Product, Topic, Document, Location.
Use the allowed relationship types: WORKS_FOR, MENTIONS, ACQUIRED, ABOUT, LOCATED_IN.

Rules:
- Only extract facts explicitly supported by the text.
- Do not invent missing details.
- Include a confidence score from 0 to 1.
- Include evidence text for every relationship.
- Use stable IDs if possible.
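
If the prompt is assembled in code, the allowed labels and relationship types can come from the same constants the validator uses, so the two cannot drift apart. The following is a rough sketch under that assumption; `build_extraction_prompt` is a hypothetical helper, the trailing "Text:" section is one possible way to attach the input, and `Location` is assumed as a label so that LOCATED_IN has a target.

```python
# "Location" is an assumption added so LOCATED_IN has a valid target label.
ALLOWED_LABELS = ["Person", "Organization", "Product", "Topic", "Document", "Location"]
ALLOWED_REL_TYPES = ["WORKS_FOR", "MENTIONS", "ACQUIRED", "ABOUT", "LOCATED_IN"]

PROMPT_TEMPLATE = """You are an information extraction system.

Extract entities and relationships from the text below.
Return ONLY valid JSON.
Use the allowed entity labels: {labels}.
Use the allowed relationship types: {rel_types}.

Rules:
- Only extract facts explicitly supported by the text.
- Do not invent missing details.
- Include a confidence score from 0 to 1.
- Include evidence text for every relationship.
- Use stable IDs if possible.

Text:
{text}
"""


def build_extraction_prompt(text: str) -> str:
    """Fill the template with the controlled vocabulary and the input text."""
    return PROMPT_TEMPLATE.format(
        labels=", ".join(ALLOWED_LABELS),
        rel_types=", ".join(ALLOWED_REL_TYPES),
        text=text,
    )
```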

Example output

{
  "entities": [
    {
      "id": "person_john_smith",
      "label": "Person",
      "name": "John Smith",
      "properties": {
        "title": "Data Engineer"
      }
    },
    {
      "id": "org_acme",
      "label": "Organization",
      "name": "Acme",
      "properties": {
        "domain": "acme.com"
      }
    }
  ],
  "relationships": [
    {
      "source_id": "person_john_smith",
      "type": "WORKS_FOR",
      "target_id": "org_acme",
      "properties": {
        "confidence": 0.96,
        "evidence": "John Smith joined Acme as a Data Engineer."
      }
    }
  ],
  "documents": [
    {
      "id": "doc_001",
      "source": "press_release",
      "chunk_id": 3
    }
  ]
}
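
On the application side, a payload shaped like this can be parsed into typed records before anything else happens. This is a minimal sketch using only the standard library; the class names and `parse_extraction` are hypothetical, and it checks shape only, not whether labels or types are allowed.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Entity:
    id: str
    label: str
    name: str
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    source_id: str
    type: str
    target_id: str
    properties: dict = field(default_factory=dict)


@dataclass
class Extraction:
    entities: list
    relationships: list
    documents: list


def parse_extraction(raw: str) -> Extraction:
    """Parse the model's JSON reply; raise ValueError if the shape is wrong."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ValueError("top-level JSON value must be an object")
    try:
        entities = [Entity(**e) for e in payload.get("entities", [])]
        relationships = [Relationship(**r) for r in payload.get("relationships", [])]
    except TypeError as exc:
        # Raised when a required key is missing or an unexpected key appears.
        raise ValueError(f"unexpected entity or relationship shape: {exc}") from exc
    return Extraction(entities, relationships, payload.get("documents", []))
```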

Normalize and validate before writing to Neo4j

This step is critical. LLM outputs are useful, but they are not trustworthy enough to write straight into a database.

Validate these things in code:

JSON parses successfully
entity labels are allowed
relationship types are allowed
IDs are stable and unique
confidence scores are numeric and in range
evidence strings exist for relationships
duplicate entities are merged or rejected

Good practice is to assign canonical IDs in your application layer rather than trusting model-generated IDs, which can differ between runs for the same entity. For example:

person:john-smith
org:acme
doc:press-release-2024

If the model produces variant names like “ACME,” “Acme Inc.,” and “Acme Corporation,” resolve them before ingesting. This is where entity resolution matters.
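
A simple first pass at this can be done with string normalization. The sketch below is an illustration only: `canonical_id` and the suffix list are hypothetical, and the approach handles surface variants such as "Acme Inc." but not true aliases such as ticker symbols, which would need a lookup table or a dedicated entity resolution step.

```python
import re
import unicodedata

# Hypothetical suffix list; extend it for your own corpus.
_COMPANY_SUFFIXES = {"inc", "corp", "corporation", "ltd", "llc", "co", "company"}
_PREFIXES = {"Person": "person", "Organization": "org", "Document": "doc"}


def slugify(text: str) -> str:
    """Lowercase, strip accents, and join alphanumeric runs with hyphens."""
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def canonical_id(label: str, name: str) -> str:
    """Build an ID like 'org:acme' from a label and a surface name."""
    slug = slugify(name)
    if not slug:
        raise ValueError(f"cannot build an id from name {name!r}")
    if label == "Organization":
        tokens = slug.split("-")
        # Drop trailing legal suffixes, but never the last remaining token.
        while len(tokens) > 1 and tokens[-1] in _COMPANY_SUFFIXES:
            tokens.pop()
        slug = "-".join(tokens)
    prefix = _PREFIXES.get(label, label.lower())
    return f"{prefix}:{slug}"
```

Under these rules "ACME", "Acme Inc.", and "Acme Corporation" all collapse to the same ID, so MERGE treats them as one node.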

Load the graph into Neo4j safely

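Create uniqueness constraints first. They stop MERGE from creating duplicate nodes when writes run concurrently, and the index behind each constraint keeps lookups by id fast.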

CREATE CONSTRAINT entity_id IF NOT EXISTS
FOR (e:Entity) REQUIRE e.id IS UNIQUE;

CREATE CONSTRAINT document_id IF NOT EXISTS
FOR (d:Document) REQUIRE d.id IS UNIQUE;

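A common pattern is to give every extracted item a shared base label such as Entity and keep the extracted type in a kind property. One uniqueness constraint and one MERGE query then cover every entity type, at the cost of filtering on kind instead of matching a type-specific label.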

Upsert entities

UNWIND $entities AS row
MERGE (e:Entity {id: row.id})
// Apply free-form properties first so they cannot overwrite name or kind.
SET e += coalesce(row.properties, {}),
    e.name = row.name,
    e.kind = row.label;

Upsert documents

UNWIND $documents AS row
MERGE (d:Document {id: row.id})
SET d.source = row.source,
    d.chunk_id = row.chunk_id;
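
From application code, these upserts are usually sent in batches with the rows passed as query parameters. The sketch below assumes `session` behaves like a session from the official `neo4j` Python driver, meaning `run(query, parameters)` returns a result with a `consume()` method; `upsert_extraction`, `batched`, and the batch size of 500 are illustrative choices, not driver features.

```python
UPSERT_ENTITIES = """
UNWIND $entities AS row
MERGE (e:Entity {id: row.id})
SET e += row.properties,
    e.name = row.name,
    e.kind = row.label
"""

UPSERT_DOCUMENTS = """
UNWIND $documents AS row
MERGE (d:Document {id: row.id})
SET d.source = row.source,
    d.chunk_id = row.chunk_id
"""


def batched(rows, size):
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def upsert_extraction(session, entities, documents, batch_size=500):
    """Send entities and documents to Neo4j in batches.

    `session` is assumed to behave like a neo4j driver session:
    run(query, parameters) returns a result that has consume().
    """
    # Guarantee a properties map so `e += row.properties` always gets a map.
    entity_rows = [{**e, "properties": e.get("properties") or {}} for e in entities]
    for batch in batched(entity_rows, batch_size):
        session.run(UPSERT_ENTITIES, {"entities": batch}).consume()
    for batch in batched(documents, batch_size):
        session.run(UPSERT_DOCUMENTS, {"documents": batch}).consume()
```

With the real driver, a session would typically come from `GraphDatabase.driver(uri, auth=(user, password)).session()`. Because every value travels as a parameter, no LLM-produced string is ever spliced into the Cypher text.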

Write relationships

Use a whitelist of relationship types in your application code, then run a type-specific Cypher query. For example:

MATCH (s:Entity {id: $source_id})
MATCH (t:Entity {id: $target_id})
MERGE (s)-[r:WORKS_FOR]->(t)
SET r.confidence = $confidence,
    r.evidence = $evidence;

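If you need dynamic relationship types, either map each validated type to its own query in code or use an APOC procedure such as apoc.merge.relationship, which takes the type as a string argument. Either way, check the type against your whitelist first, and never interpolate raw LLM output into Cypher.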
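
The "map types in code" option might look like the sketch below. `relationship_query` is a hypothetical helper. It only ever interpolates a value that matched the whitelist exactly, and it covers entity-to-entity types only, since MENTIONS and ABOUT start at a Document node and would need a query that matches on that label instead.

```python
# Entity-to-entity types only; MENTIONS and ABOUT start at a Document node.
ALLOWED_ENTITY_REL_TYPES = frozenset({"WORKS_FOR", "ACQUIRED", "LOCATED_IN"})

# Doubled braces are literal braces after str.format().
_REL_QUERY_TEMPLATE = """
MATCH (s:Entity {{id: $source_id}})
MATCH (t:Entity {{id: $target_id}})
MERGE (s)-[r:{rel_type}]->(t)
SET r.confidence = $confidence,
    r.evidence = $evidence
"""


def relationship_query(rel_type: str) -> str:
    """Return the Cypher for one whitelisted type, or raise ValueError."""
    if not isinstance(rel_type, str) or rel_type not in ALLOWED_ENTITY_REL_TYPES:
        raise ValueError(f"relationship type {rel_type!r} is not whitelisted")
    return _REL_QUERY_TEMPLATE.format(rel_type=rel_type)
```

The IDs, confidence, and evidence still go in as query parameters; only the relationship type, which Cypher does not accept as an ordinary parameter in this position, is placed in the query text.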

Keep provenance attached to every fact

A knowledge graph built from LLM outputs is only useful if you can trace every edge back to a source. Store provenance on nodes and relationships, such as:

source document ID
chunk ID
extraction timestamp
model name or version
confidence score
evidence text span

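A good pattern is to link each extracted entity back to its source document. A Neo4j relationship cannot itself be the endpoint of another relationship, so edge-level provenance belongs in relationship properties such as the source document ID: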

MATCH (d:Document {id: $document_id})
MATCH (e:Entity {id: $entity_id})
MERGE (d)-[:MENTIONS]->(e);

For higher-fidelity systems, store offsets or quoted spans so reviewers can inspect the exact text that produced the graph edge.
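
One way to capture those spans is to look the evidence string up in the chunk it came from and record the character offsets. This is a sketch under stated assumptions: `build_provenance` and its field names are hypothetical, and it uses exact substring matching, so paraphrased evidence is flagged as unverified rather than located.

```python
from datetime import datetime, timezone


def build_provenance(chunk_text, evidence, document_id, chunk_id, model_name):
    """Return a flat dict of provenance fields for one extracted relationship."""
    # Exact match only; an empty evidence string counts as not found.
    start = chunk_text.find(evidence) if evidence else -1
    found = start >= 0
    return {
        "document_id": document_id,
        "chunk_id": chunk_id,
        "model": model_name,
        "extracted_at": datetime.now(timezone.utc).isoformat(),
        "evidence": evidence,
        "evidence_start": start if found else None,
        "evidence_end": start + len(evidence) if found else None,
        "evidence_verified": found,
    }
```

The dict is kept flat, with only strings, numbers, booleans, and nulls, so it can be stored directly as relationship properties. An evidence string that does not appear verbatim in the chunk is a useful signal that the model paraphrased or invented its support.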

Query the graph to validate and use it

Once data is in Neo4j, test it with simple traversals.

Find who works for a company

MATCH (p:Entity)-[:WORKS_FOR]->(c:Entity {name: "Acme"})
RETURN p.name, p.kind;

Find what a document mentions

MATCH (d:Document {id: "doc_001"})-[:MENTIONS]->(e:Entity)
RETURN d.id, e.name, e.kind;

Explore connected facts

MATCH path = (p:Entity)-[*1..2]-(x:Entity)
WHERE p.name = "John Smith"
RETURN path;

These queries help you spot schema issues, duplicate entities, and weak extractions early.

Common mistakes to avoid

1. Letting the LLM invent schema

If the model can create arbitrary labels and relationship names, your graph will become messy fast. Keep a controlled vocabulary.

2. Skipping validation

Always validate JSON and enforce allowed labels, types, and confidence rules before ingestion.

3. Writing raw LLM text directly to Neo4j

The LLM should propose facts, not execute database writes.

4. Ignoring entity resolution

“Apple,” “Apple Inc.,” and “AAPL” may refer to the same entity. Merge them before or during ingest.

5. Dropping provenance

Without evidence and source links, you cannot audit or trust the graph.

6. Using the graph as truth without review

LLM-extracted graphs should be treated as probabilistic until verified, especially for low-confidence relationships.

A practical end-to-end workflow

A typical production workflow looks like this:

Split documents into chunks
Extract structured facts with an LLM
Validate the output against a schema
Canonicalize entity names and IDs
Enrich with additional metadata if needed
Write nodes and edges to Neo4j
Attach provenance and confidence
Run QA queries and human review for uncertain facts
Use the graph for search, analytics, RAG, or GEO

That combination gives you a graph that is both machine-friendly and explainable.
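
The first step, chunking, is where the chunk IDs used for provenance originate, so it helps if they are deterministic. Below is a minimal character-based sketch; `chunk_text`, the ID format, and the default sizes are assumptions for illustration, and a real pipeline might split on sentences or tokens instead.

```python
def chunk_text(document_id, text, max_chars=1000, overlap=100):
    """Split text into overlapping chunks with deterministic IDs and offsets."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    step = max_chars - overlap
    chunks = []
    for chunk_id, start in enumerate(range(0, len(text), step)):
        chunks.append({
            "id": f"{document_id}#{chunk_id}",
            "document_id": document_id,
            "chunk_id": chunk_id,
            "start": start,  # character offset of the chunk in the document
            "text": text[start:start + max_chars],
        })
        # Stop once a chunk reaches the end, to avoid a trailing overlap-only chunk.
        if start + max_chars >= len(text):
            break
    return chunks
```

Because the same document and settings always yield the same chunk IDs, re-running extraction updates existing provenance instead of scattering facts across new IDs.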

When to add human review

Use human review when:

confidence is below a threshold
the relationship is high impact
the source text is ambiguous
entity resolution is uncertain
multiple documents conflict

A good approach is to auto-ingest high-confidence facts and queue low-confidence ones for review. That gives you scale without sacrificing quality.
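
That routing rule can be sketched in a few lines. The 0.8 threshold, the choice of ACQUIRED as a high-impact type, and the name `route_facts` are all assumptions made for illustration; tune them against reviewed samples from your own data.

```python
def route_facts(relationships, threshold=0.8, high_impact_types=frozenset({"ACQUIRED"})):
    """Split relationships into (auto_ingest, needs_review) lists."""
    auto_ingest, needs_review = [], []
    for rel in relationships:
        confidence = (rel.get("properties") or {}).get("confidence")
        # Missing or non-numeric confidence is treated as uncertain.
        is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
        if (
            not is_number
            or confidence < threshold
            or rel.get("type") in high_impact_types
        ):
            needs_review.append(rel)
        else:
            auto_ingest.append(rel)
    return auto_ingest, needs_review
```

High-impact types go to review regardless of score, because a confident but wrong acquisition edge is costlier than a delayed one.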

The short answer

If you want to build a knowledge graph from LLM outputs with Neo4j, a safe and effective method is:

define a small graph schema
force the LLM to return structured JSON
validate and normalize the output in code
resolve entities to stable IDs
use Cypher MERGE to upsert into Neo4j
store provenance and confidence with every fact

That workflow turns noisy LLM text into a reliable, queryable knowledge graph that can support search, RAG, analytics, and AI visibility use cases.