Unstructured data and graphs
Creating knowledge graphs from unstructured data can be complex, involving multiple steps to query, cleanse, and transform the data.
You can use the text analysis capabilities of Large Language Models (LLMs) to automate the extraction of entities and relationships from your unstructured text.
An LLM generated this knowledge graph of Technologies, Concepts, and Skills from a lesson on grounding LLMs.
Extend your graph
In this challenge, you will use an LLM to extend your graph with new entities and relationships found in the unstructured text data.
Open the 1-knowledge-graphs-vectors\llm_build_graph.py starter code, which creates the graph of lesson content.
Click to view the starter code
import os
from dotenv import load_dotenv
load_dotenv()

from langchain_neo4j import Neo4jGraph
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_community.graphs.graph_document import Node, Relationship

COURSES_PATH = "1-knowledge-graphs-vectors/data/asciidoc"

loader = DirectoryLoader(COURSES_PATH, glob="**/lesson.adoc", loader_cls=TextLoader)
docs = loader.load()

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1500,
    chunk_overlap=200,
    add_start_index=True
)

chunks = text_splitter.split_documents(docs)

embedding_provider = OpenAIEmbeddings(
    openai_api_key=os.getenv('OPENAI_API_KEY'),
    model="text-embedding-ada-002"
)

def get_course_data(embedding_provider, chunk):
    filename = chunk.metadata["source"]
    path = filename.split(os.path.sep)

    data = {}
    data['course'] = path[-6]
    data['module'] = path[-4]
    data['lesson'] = path[-2]
    data['url'] = f"https://graphacademy.neo4j.com/courses/{data['course']}/{data['module']}/{data['lesson']}"
    data['id'] = f"{filename}.{chunk.metadata['start_index']}"
    data['text'] = chunk.page_content
    data['embedding'] = embedding_provider.embed_query(chunk.page_content)

    return data

graph = Neo4jGraph(
    url=os.getenv('NEO4J_URI'),
    username=os.getenv('NEO4J_USERNAME'),
    password=os.getenv('NEO4J_PASSWORD')
)

def create_chunk(graph, data):
    graph.query("""
        MERGE (c:Course {name: $course})
        MERGE (c)-[:HAS_MODULE]->(m:Module{name: $module})
        MERGE (m)-[:HAS_LESSON]->(l:Lesson{name: $lesson, url: $url})
        MERGE (l)-[:CONTAINS]->(p:Paragraph{id: $id, text: $text})
        WITH p
        CALL db.create.setNodeVectorProperty(p, "embedding", $embedding)
        """,
        data
    )

# Create an OpenAI LLM instance
# llm =

# Create an LLMGraphTransformer instance
# doc_transformer =

for chunk in chunks:
    data = get_course_data(embedding_provider, chunk)
    create_chunk(graph, data)

    # Generate the graph docs
    # graph_docs =

    # Map the entities in the graph documents to the paragraph node
    # for graph_doc in graph_docs:

    # Add the graph documents to the graph
    # graph.

    print("Processed chunk", data['id'])
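The negative indexes in get_course_data can be hard to read at a glance. This stdlib-only sketch, using a hypothetical lesson path in the shape DirectoryLoader would return for this course layout, shows how each index resolves:

```python
import os

# Hypothetical path in the shape DirectoryLoader returns for this course layout
filename = os.path.join(
    "1-knowledge-graphs-vectors", "data", "asciidoc", "courses",
    "llm-fundamentals", "modules", "1-introduction",
    "lessons", "1-neo4j-and-genai", "lesson.adoc"
)

path = filename.split(os.path.sep)

# Negative indexes walk back from lesson.adoc:
# path[-2] is the lesson folder, path[-4] the module, path[-6] the course
course = path[-6]   # "llm-fundamentals"
module = path[-4]   # "1-introduction"
lesson = path[-2]   # "1-neo4j-and-genai"

print(course, module, lesson)
```

If your data directory is laid out differently, the indexes need to change to match.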
You will need to:

- Create an LLM instance
- Create a transformer to extract entities and relationships
- Extract entities and relationships from the text
- Map the entities to the paragraphs
- Add the graph documents to the database
Create an LLM
You need an LLM instance to extract the entities and relationships:
# Create an OpenAI LLM instance
llm = ChatOpenAI(
    openai_api_key=os.getenv('OPENAI_API_KEY'),
    model_name="gpt-3.5-turbo"
)
The model_name parameter defines which OpenAI model will be used. gpt-3.5-turbo is a good choice for this task given its accuracy, speed, and cost.
Graph Transformer
To extract the entities and relationships, you will use a graph transformer. The graph transformer takes unstructured text data, passes it to the LLM, and returns the entities and relationships.
# Create an LLMGraphTransformer instance
doc_transformer = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Technology", "Concept", "Skill", "Event", "Person", "Object"],
)
The optional allowed_nodes and allowed_relationships parameters allow you to define the types of nodes and relationships you want to extract from the text.
In this example, the nodes are restricted to entities relevant to the content. The relationships are not restricted, allowing the LLM to find any relationships between the entities.
Restricting the nodes and relationships will result in a more concise knowledge graph. A more concise graph may support you in answering specific questions, but it could also be missing information.
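The effect of allowed_nodes is easiest to see in isolation. This is not the transformer's internal implementation — just a stdlib sketch, with made-up (entity, label) pairs, of how restricting labels prunes what ends up in the graph:

```python
# Hypothetical (entity, label) pairs an LLM might extract from a lesson
extracted = [
    ("Neo4j", "Technology"),
    ("Cypher", "Technology"),
    ("Hallucination", "Concept"),
    ("2024-01-01", "Date"),            # label not in allowed_nodes
    ("GraphAcademy", "Organization"),  # label not in allowed_nodes
]

allowed_nodes = ["Technology", "Concept", "Skill", "Event", "Person", "Object"]

# Keep only entities whose label is in the allowed list
kept = [(name, label) for name, label in extracted if label in allowed_nodes]

print(kept)
```

Here the date and organization are dropped, giving a smaller, more focused graph — at the cost of losing that information entirely.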
Extract entities and relationships
For each chunk of text, you will use the transformer to convert the text into a graph. The transformer returns a set of graph documents that represent the entities and relationships in the text.
for chunk in chunks:
    data = get_course_data(embedding_provider, chunk)
    create_chunk(graph, data)

    # Generate the graph docs
    graph_docs = doc_transformer.convert_to_graph_documents([chunk])
Map extracted entities to the paragraphs
The graph documents contain the extracted nodes and relationships, but they are not linked to the original paragraphs.
To understand which entities are related to which paragraphs, you will map the extracted nodes to the paragraphs.
You will create a data model with a HAS_ENTITY relationship between the paragraphs and the entities.
This code inserts the Paragraph node into the graph document and creates a HAS_ENTITY relationship between the paragraph and the extracted entities.
    # Map the entities in the graph documents to the paragraph node
    for graph_doc in graph_docs:
        paragraph_node = Node(
            id=data["id"],
            type="Paragraph",
        )

        for node in graph_doc.nodes:
            graph_doc.relationships.append(
                Relationship(
                    source=paragraph_node,
                    target=node,
                    type="HAS_ENTITY"
                )
            )
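Conceptually, the loop above links the paragraph to every entity the transformer found. This stdlib sketch shows the same pattern with stand-in dictionaries (hypothetical data) instead of the langchain Node and Relationship classes:

```python
# Stand-in for one graph document produced by the transformer (hypothetical data)
graph_doc = {
    "nodes": [
        {"id": "Neo4j", "type": "Technology"},
        {"id": "Embedding", "type": "Concept"},
    ],
    "relationships": [
        {"source": "Neo4j", "target": "Embedding", "type": "SUPPORTS"},
    ],
}

paragraph_id = "lesson.adoc.0"  # hypothetical chunk id

# Link the paragraph to every extracted entity with HAS_ENTITY
for node in graph_doc["nodes"]:
    graph_doc["relationships"].append(
        {"source": paragraph_id, "target": node["id"], "type": "HAS_ENTITY"}
    )

print(len(graph_doc["relationships"]))  # the original relationship plus one HAS_ENTITY per node
```

The entity-to-entity relationships found by the LLM are kept, and one HAS_ENTITY relationship is added per extracted node.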
Add the graph documents
Finally, you need to add the new graph documents to the Neo4j graph database.
    # Add the graph documents to the graph
    graph.add_graph_documents(graph_docs)
When you are ready, run the program to extend your graph.
Querying the knowledge graph
You can view the generated entities using the following Cypher query:
MATCH (p:Paragraph)-[:HAS_ENTITY]-(e)
RETURN p, e
Entities
The entities in the graph allow you to understand the context of the text.
You can find the most mentioned topics in the graph by counting the number of times a node label (or entity) appears in the graph:
MATCH ()-[:HAS_ENTITY]->(e)
RETURN labels(e) as labels, count(e) as nodes
ORDER BY nodes DESC
You can drill down into the entity id to gain insights into the content. For example, you can find the most mentioned Technology.
MATCH ()-[r:HAS_ENTITY]->(e:Technology)
RETURN e.id AS entityId, count(r) AS mentions
ORDER BY mentions DESC
Related lessons
The knowledge graph can also show you the connections within the content, for example, which lessons relate to each other.
This Cypher query matches one specific document and uses the entities to find related documents:
MATCH (l:Lesson {
name: "1-neo4j-and-genai"
})-[:CONTAINS]->(p:Paragraph)
MATCH (p)-[:HAS_ENTITY]->(entity)<-[:HAS_ENTITY]-(otherParagraph)
MATCH (otherParagraph)<-[:CONTAINS]-(otherLesson)
RETURN DISTINCT entity.id, otherLesson.name
Lesson entities
The knowledge graph contains the relationships between entities in all the documents.
This Cypher query restricts the output to a specific chunk or document:
MATCH (l:Lesson {
name: "1-neo4j-and-genai"
})-[:CONTAINS]->(p:Paragraph)
MATCH (p)-[:HAS_ENTITY]->(e)
MATCH path = (e)-[r]-(e2)
WHERE (p)-[:HAS_ENTITY]->(e2)
RETURN path
A path is returned representing the knowledge graph for the document.
Labels, ids, and relationships
You can obtain the node labels, ids, and relationship types by unwinding the path’s relationships:
MATCH (l:Lesson {
name: "1-neo4j-and-genai"
})-[:CONTAINS]->(p:Paragraph)
MATCH (p)-[:HAS_ENTITY]->(e)
MATCH path = (e)-[r]-(e2)
WHERE (p)-[:HAS_ENTITY]->(e2)
UNWIND relationships(path) as rels
RETURN
labels(startNode(rels))[0] as eLabel,
startNode(rels).id as eId,
type(rels) as relType,
labels(endNode(rels))[0] as e2Label,
endNode(rels).id as e2Id
Explore the graph
Take some time to explore the knowledge graph to find relationships between entities and lessons.
Continue
When you are ready, you can move on to the next task.
Summary
You used an LLM to create a knowledge graph from unstructured text.