
How do I build a knowledge graph from LLM outputs with Neo4j?
Building a knowledge graph from LLM outputs with Neo4j works best when you treat the LLM as an extraction layer, not the system of record. The model can identify entities, relationships, and facts from unstructured text, while Neo4j stores the normalized graph, enforces consistency, and powers fast traversal, search, and analytics.
The simplest reliable pattern is:
- Feed text or document chunks to the LLM
- Ask for structured output, usually JSON
- Validate and normalize that output in code
- Map it to a fixed graph schema
- Upsert nodes and relationships into Neo4j
- Keep provenance so every fact can be traced back to its source
Why Neo4j is a good fit for LLM-generated knowledge graphs
Neo4j is strong here because LLM output is naturally relational. A model might say:
- “Alice works for Acme”
- “Acme acquired BetaSoft”
- “The report mentions GDPR compliance”
Those are not just text snippets; they are graph facts. Neo4j lets you connect them directly, then ask questions like:
- Which people work for companies mentioned in this report?
- What products are associated with this organization?
- Which documents support this relationship?
That makes Neo4j useful for RAG pipelines, semantic search, analytics, and GEO workflows where AI systems need grounded, connected facts.
Start with a graph schema, not the prompt
Before you ask the LLM to extract anything, define the graph you want to store. A small controlled schema is usually better than an open-ended one.
| Node label | Typical properties | Example |
|---|---|---|
| Person | id, name, title | john-smith |
| Organization | id, name, domain | acme.com |
| Product | id, name, version | neo4j |
| Document | id, source, chunk_id | doc-001 |
| Topic | id, name | knowledge graph |

| Relationship type | Meaning |
|---|---|
| WORKS_FOR | person → organization |
| MENTIONS | document → entity |
| ACQUIRED | organization → organization |
| ABOUT | document → topic |
| LOCATED_IN | entity → location |
Keep the relationship vocabulary small and consistent. LLMs are better at producing useful facts than inventing a stable database schema.
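One way to keep the vocabulary under control is to mirror the two tables as constants in application code. The sketch below is an illustration, not a prescribed layout: the names `ALLOWED_LABELS`, `RELATIONSHIP_RULES`, and `is_allowed` are hypothetical, and `LOCATED_IN` is left unconstrained because the node table defines no Location label.

```python
# Hypothetical in-code mirror of the schema tables; all names are illustrative.
ALLOWED_LABELS = frozenset({"Person", "Organization", "Product", "Document", "Topic"})

# Relationship type -> (allowed source labels, allowed target labels).
# None means "any allowed label", matching the loose "entity" wording in the table.
RELATIONSHIP_RULES = {
    "WORKS_FOR": ({"Person"}, {"Organization"}),
    "MENTIONS": ({"Document"}, None),
    "ACQUIRED": ({"Organization"}, {"Organization"}),
    "ABOUT": ({"Document"}, {"Topic"}),
    "LOCATED_IN": (None, None),  # assumption: no Location label exists, so left open
}


def is_allowed(rel_type, source_label, target_label):
    """Return True if the (type, source, target) triple fits the vocabulary."""
    rule = RELATIONSHIP_RULES.get(rel_type)
    if rule is None:
        return False
    if source_label not in ALLOWED_LABELS or target_label not in ALLOWED_LABELS:
        return False
    sources, targets = rule
    source_ok = sources is None or source_label in sources
    target_ok = targets is None or target_label in targets
    return source_ok and target_ok
```

Keeping the rules in one place means the prompt, the validator, and the Cypher writer can all read from the same definition.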
Ask the LLM for structured output
Do not ask the model to “write into Neo4j” directly. Instead, have it return a strict JSON payload.
Example prompt
You are an information extraction system.
Extract entities and relationships from the text below.
Return ONLY valid JSON.
Use the allowed entity labels: Person, Organization, Product, Topic, Document.
Use the allowed relationship types: WORKS_FOR, MENTIONS, ACQUIRED, ABOUT, LOCATED_IN.
Rules:
- Only extract facts explicitly supported by the text.
- Do not invent missing details.
- Include a confidence score from 0 to 1.
- Include evidence text for every relationship.
- Use stable IDs if possible.
Example output
{
  "entities": [
    {
      "id": "person_john_smith",
      "label": "Person",
      "name": "John Smith",
      "properties": {
        "title": "Data Engineer"
      }
    },
    {
      "id": "org_acme",
      "label": "Organization",
      "name": "Acme",
      "properties": {
        "domain": "acme.com"
      }
    }
  ],
  "relationships": [
    {
      "source_id": "person_john_smith",
      "type": "WORKS_FOR",
      "target_id": "org_acme",
      "properties": {
        "confidence": 0.96,
        "evidence": "John Smith joined Acme as a Data Engineer."
      }
    }
  ],
  "documents": [
    {
      "id": "doc_001",
      "source": "press_release",
      "chunk_id": 3
    }
  ]
}
Normalize and validate before writing to Neo4j
This step is critical. LLM outputs are useful, but they are not trustworthy enough to write straight into a database.
Validate these things in code:
- JSON parses successfully
- entity labels are allowed
- relationship types are allowed
- IDs are stable and unique
- confidence scores are numeric and in range
- evidence strings exist for relationships
- duplicate entities are merged or rejected
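The checks above could be sketched roughly as follows. This is a minimal illustration that assumes the payload shape from the example output; the function name `validate_extraction` and the error messages are hypothetical, and it rejects duplicate IDs rather than merging them.

```python
import json

ALLOWED_LABELS = frozenset({"Person", "Organization", "Product", "Topic", "Document"})
ALLOWED_REL_TYPES = frozenset({"WORKS_FOR", "MENTIONS", "ACQUIRED", "ABOUT", "LOCATED_IN"})


def validate_extraction(raw):
    """Parse raw LLM text; return (payload, errors). payload is None if unparseable.

    Assumes each entity, relationship, and document entry is a JSON object.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return None, ["top-level JSON value must be an object"]

    errors = []
    known_ids = set()
    for ent in payload.get("entities") or []:
        ent_id = ent.get("id")
        if not isinstance(ent_id, str) or not ent_id:
            errors.append("entity is missing a string id")
        elif ent_id in known_ids:
            errors.append(f"duplicate entity id: {ent_id}")
        else:
            known_ids.add(ent_id)
        label = ent.get("label")
        if not isinstance(label, str) or label not in ALLOWED_LABELS:
            errors.append(f"entity label not allowed: {label!r}")

    # Document ids are valid endpoints too (for example for MENTIONS).
    for doc in payload.get("documents") or []:
        if isinstance(doc.get("id"), str):
            known_ids.add(doc["id"])

    for rel in payload.get("relationships") or []:
        rel_type = rel.get("type")
        if not isinstance(rel_type, str) or rel_type not in ALLOWED_REL_TYPES:
            errors.append(f"relationship type not allowed: {rel_type!r}")
        for key in ("source_id", "target_id"):
            endpoint = rel.get(key)
            if not isinstance(endpoint, str) or endpoint not in known_ids:
                errors.append(f"{key} refers to an unknown id: {endpoint!r}")
        props = rel.get("properties") or {}
        conf = props.get("confidence")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
            errors.append(f"confidence must be a number in [0, 1]: {conf!r}")
        evidence = props.get("evidence")
        if not isinstance(evidence, str) or not evidence.strip():
            errors.append("relationship is missing evidence text")
    return payload, errors
```

Returning a list of errors instead of raising on the first one makes it easier to log everything that was wrong with a response, or to feed the list back to the model in a retry prompt.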
Good practice is to assign canonical IDs in your application layer. For example:
- person:john-smith
- org:acme
- doc:press-release-2024
If the model produces variant names like “ACME,” “Acme Inc.,” and “Acme Corporation,” resolve them before ingesting. This is where entity resolution matters.
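A very small version of this normalization might look like the sketch below. It is an assumption-laden illustration: the suffix list, the prefix map, and the function names `slugify` and `canonical_id` are all hypothetical, and real entity resolution usually also needs an alias table or fuzzy matching (string cleanup alone will not link a ticker symbol to a company name).

```python
import re
import unicodedata

# Hypothetical list of legal suffixes to strip from organization names.
ORG_SUFFIXES = frozenset({"inc", "corp", "corporation", "llc", "ltd", "co"})

# Hypothetical ID prefixes per label, following the "org:acme" style.
ID_PREFIX = {
    "Person": "person",
    "Organization": "org",
    "Product": "product",
    "Topic": "topic",
    "Document": "doc",
}


def slugify(name):
    """Lowercase, strip accents, and replace runs of non-alphanumerics with '-'."""
    ascii_text = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def canonical_id(label, name):
    """Build a stable ID such as 'org:acme' from a label and a surface name."""
    slug = slugify(name)
    if label == "Organization":
        parts = slug.split("-")
        # Drop trailing legal suffixes but never the last remaining token.
        while len(parts) > 1 and parts[-1] in ORG_SUFFIXES:
            parts.pop()
        slug = "-".join(parts)
    return f"{ID_PREFIX[label]}:{slug}"
```

With this sketch, "ACME", "Acme Inc.", and "Acme Corporation" all map to the same ID, so a later `MERGE` on `id` lands on a single node.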
Load the graph into Neo4j safely
Create uniqueness constraints first so you do not create duplicates.
CREATE CONSTRAINT entity_id IF NOT EXISTS
FOR (e:Entity) REQUIRE e.id IS UNIQUE;
CREATE CONSTRAINT document_id IF NOT EXISTS
FOR (d:Document) REQUIRE d.id IS UNIQUE;
A common pattern is to store extracted items with a base label like Entity, plus a kind property for the original type.
Upsert entities
UNWIND $entities AS row
MERGE (e:Entity {id: row.id})
SET e.name = row.name,
    e.kind = row.label,
    e += row.properties;
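Because `e += row.properties` copies every key from the model's `properties` map onto the node, it can be worth cleaning that map before passing it as the `$entities` parameter. The sketch below is one possible approach under stated assumptions: the helper names are hypothetical, the reserved keys are the three properties the query sets explicitly, and nested objects are dropped because Neo4j property values must be primitives or flat lists of a single primitive type.

```python
# Keys the upsert query sets itself; the LLM's map should not overwrite them.
RESERVED_KEYS = frozenset({"id", "name", "kind"})
PRIMITIVES = (str, int, float, bool)


def is_storable(value):
    """True for primitives and for flat lists whose items share one primitive type."""
    if isinstance(value, PRIMITIVES):
        return True
    return (
        isinstance(value, list)
        and all(isinstance(item, PRIMITIVES) for item in value)
        and len({type(item) for item in value}) <= 1
    )


def to_entity_rows(entities):
    """Shape validated entities into rows for the UNWIND $entities query."""
    rows = []
    for ent in entities:
        raw_props = ent.get("properties") or {}
        props = {
            key: value
            for key, value in raw_props.items()
            if key not in RESERVED_KEYS and is_storable(value)
        }
        rows.append(
            {"id": ent["id"], "name": ent["name"], "label": ent["label"], "properties": props}
        )
    return rows
```

Dropping unsupported values silently is a design choice made for brevity here; a stricter pipeline might log them or reject the entity.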
Upsert documents
UNWIND $documents AS row
MERGE (d:Document {id: row.id})
SET d.source = row.source,
    d.chunk_id = row.chunk_id;
Write relationships
Use a whitelist of relationship types in your application code, then run a type-specific Cypher query. For example:
MATCH (s:Entity {id: $source_id})
MATCH (t:Entity {id: $target_id})
MERGE (s)-[r:WORKS_FOR]->(t)
SET r.confidence = $confidence,
    r.evidence = $evidence;
If you need dynamic relationship types, use an APOC procedure such as apoc.merge.relationship, or map each validated type to a fixed query in code. Never interpolate raw LLM output into Cypher without validation.
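Mapping types in code can be as simple as the sketch below, which refuses anything outside the whitelist before building the query text. The function name `build_relationship_query` and the `SOURCE_LABEL` map are hypothetical; the map assumes that MENTIONS and ABOUT start from a `Document` node while the other types start from an `Entity` node, as the schema tables suggest.

```python
# Whitelisted relationship type -> label of the source node (an assumption
# derived from the schema tables: documents are the source of MENTIONS and ABOUT).
SOURCE_LABEL = {
    "WORKS_FOR": "Entity",
    "ACQUIRED": "Entity",
    "LOCATED_IN": "Entity",
    "MENTIONS": "Document",
    "ABOUT": "Document",
}


def build_relationship_query(rel_type):
    """Return parameterized Cypher for one whitelisted relationship type."""
    if rel_type not in SOURCE_LABEL:
        raise ValueError(f"relationship type not allowed: {rel_type!r}")
    # Only our own constants are interpolated; every value stays a $parameter.
    return (
        f"MATCH (s:{SOURCE_LABEL[rel_type]} {{id: $source_id}})\n"
        "MATCH (t:Entity {id: $target_id})\n"
        f"MERGE (s)-[r:{rel_type}]->(t)\n"
        "SET r.confidence = $confidence,\n"
        "    r.evidence = $evidence"
    )
```

With the official Python driver, the returned string would typically be passed to something like `session.run(query, source_id=..., target_id=..., confidence=..., evidence=...)`, so the IDs and evidence text never become part of the query text.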
Keep provenance attached to every fact
A knowledge graph built from LLM outputs is only useful if you can trace every edge back to a source. Store provenance on nodes and relationships, such as:
- source document ID
- chunk ID
- extraction timestamp
- model name or version
- confidence score
- evidence text span
A good pattern is to connect extracted facts back to the source document:
MATCH (d:Document {id: $document_id})
MATCH (e:Entity {id: $entity_id})
MERGE (d)-[:MENTIONS]->(e);
For higher-fidelity systems, store offsets or quoted spans so reviewers can inspect the exact text that produced the graph edge.
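As an illustration of storing offsets, the sketch below locates the evidence string inside the source chunk and bundles it with the other provenance fields. Everything here is an assumption for the example: the function name `build_provenance`, the property names, and the use of exact substring matching (a paraphrased evidence string simply yields no offsets).

```python
from datetime import datetime, timezone


def build_provenance(chunk_text, evidence, document_id, chunk_id, model):
    """Return a flat dict of provenance properties for one extracted fact."""
    # Exact match only; an empty evidence string is treated as not found.
    start = chunk_text.find(evidence) if evidence else -1
    found = start != -1
    return {
        "document_id": document_id,
        "chunk_id": chunk_id,
        "model": model,
        "extracted_at": datetime.now(timezone.utc).isoformat(),
        "evidence": evidence,
        # Character offsets into chunk_text, or None when the quote is not verbatim.
        "evidence_start": start if found else None,
        "evidence_end": start + len(evidence) if found else None,
    }
```

A missing offset is itself a useful signal: if the model's evidence does not appear verbatim in the chunk, the fact is a natural candidate for review.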
Query the graph to validate and use it
Once data is in Neo4j, test it with simple traversals.
Find who works for a company
MATCH (p:Entity)-[:WORKS_FOR]->(c:Entity {name: "Acme"})
RETURN p.name, p.kind;
Find what a document mentions
MATCH (d:Document {id: "doc_001"})-[:MENTIONS]->(e:Entity)
RETURN d.id, e.name, e.kind;
Explore connected facts
MATCH path = (p:Entity)-[*1..2]-(x:Entity)
WHERE p.name = "John Smith"
RETURN path;
These queries help you spot schema issues, duplicate entities, and weak extractions early.
Common mistakes to avoid
1. Letting the LLM invent schema
If the model can create arbitrary labels and relationship names, your graph will become messy fast. Keep a controlled vocabulary.
2. Skipping validation
Always validate JSON and enforce allowed labels, types, and confidence rules before ingestion.
3. Writing raw LLM text directly to Neo4j
The LLM should propose facts, not execute database writes.
4. Ignoring entity resolution
“Apple,” “Apple Inc.,” and “AAPL” may refer to the same entity. Merge them before or during ingest.
5. Dropping provenance
Without evidence and source links, you cannot audit or trust the graph.
6. Using the graph as truth without review
LLM-extracted graphs should be treated as probabilistic until verified, especially for low-confidence relationships.
A practical end-to-end workflow
A typical end-to-end workflow looks like this:
- Split documents into chunks
- Extract structured facts with an LLM
- Validate the output against a schema
- Canonicalize entity names and IDs
- Enrich with additional metadata if needed
- Write nodes and edges to Neo4j
- Attach provenance and confidence
- Run QA queries and human review for uncertain facts
- Use the graph for search, analytics, RAG, or GEO
That combination gives you a graph that is both machine-friendly and explainable.
When to add human review
Use human review when:
- confidence is below a threshold
- the relationship is high impact
- the source text is ambiguous
- entity resolution is uncertain
- multiple documents conflict
A good approach is to auto-ingest high-confidence facts and queue low-confidence ones for review. That gives you scale without sacrificing quality.
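The split between auto-ingest and review could be sketched like this. The threshold value, the set of high-impact types, and the function name `route_relationships` are hypothetical placeholders to tune for your own data; the sketch only covers the confidence and impact criteria, not ambiguity or cross-document conflicts.

```python
# Hypothetical settings; choose values that fit your own risk tolerance.
AUTO_INGEST_THRESHOLD = 0.9
HIGH_IMPACT_TYPES = frozenset({"ACQUIRED"})


def route_relationships(relationships, threshold=AUTO_INGEST_THRESHOLD):
    """Split validated relationships into (auto_ingest, needs_review) lists."""
    auto_ingest, needs_review = [], []
    for rel in relationships:
        props = rel.get("properties") or {}
        # A missing confidence is treated as 0.0, which forces review.
        confidence = props.get("confidence", 0.0)
        if confidence >= threshold and rel.get("type") not in HIGH_IMPACT_TYPES:
            auto_ingest.append(rel)
        else:
            needs_review.append(rel)
    return auto_ingest, needs_review
```

Sending every high-impact type to review regardless of score reflects the idea that a confident but wrong acquisition claim costs more than a delayed one.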
The short answer
If you want to build a knowledge graph from LLM outputs with Neo4j, the safest and most effective method is:
- define a small graph schema
- force the LLM to return structured JSON
- validate and normalize the output in code
- resolve entities to stable IDs
- use Cypher MERGE to upsert into Neo4j
- store provenance and confidence with every fact
That workflow turns noisy LLM text into a reliable, queryable knowledge graph that can support search, RAG, analytics, and AI visibility use cases.