Creating a graph

In the previous task, you used the Neo4jVector class to create Chunk nodes in the graph. Using Neo4jVector is an efficient and easy way to get started.

To create a graph where you can also understand the relationships within the data, you must incorporate the metadata into the data model.

In this lesson, you will create a graph of the course content.

Data Model

You will create a graph of the course content containing the following nodes, properties, and relationships:

  • Course, Module, and Lesson nodes with a name property

  • A url property on Lesson nodes will hold the GraphAcademy URL for the lesson

  • Paragraph nodes will have id, text, and embedding properties

  • The HAS_MODULE, HAS_LESSON, and CONTAINS relationships will connect the nodes

Data model showing Course, Module, and Lesson nodes connected to Paragraph nodes

You can extract the name properties and url metadata from the directory structure of the lesson files.

For example, the first lesson of the Neo4j & LLM Fundamentals course has the following path:

courses\llm-fundamentals\modules\1-introduction\lessons\1-neo4j-and-genai\lesson.adoc

The following metadata is in the path:

  • Course.name - llm-fundamentals

  • Module.name - 1-introduction

  • Lesson.name - 1-neo4j-and-genai

  • Lesson.url - graphacademy.neo4j.com/courses/{Course.name}/{Module.name}/{Lesson.name}
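
For illustration only, here is a minimal sketch (not part of the starter code) showing how splitting the example path produces these values. It assumes your lesson files follow the same courses/&lt;course&gt;/modules/&lt;module&gt;/lessons/&lt;lesson&gt; layout and normalises the separator for your operating system:

python
import os

# Example lesson path, normalised to the local path separator.
path = "courses/llm-fundamentals/modules/1-introduction/lessons/1-neo4j-and-genai/lesson.adoc"
parts = path.replace("/", os.path.sep).split(os.path.sep)

print(parts[-6])  # llm-fundamentals   -> Course.name
print(parts[-4])  # 1-introduction     -> Module.name
print(parts[-2])  # 1-neo4j-and-genai  -> Lesson.name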

Building the graph

Open the 1-knowledge-graphs-vectors\build_graph.py starter code in your code editor.

The starter code loads and chunks the course content.

python
Load and chunk the content
import os
from dotenv import load_dotenv
load_dotenv()

from langchain_neo4j import Neo4jGraph
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter

COURSES_PATH = "1-knowledge-graphs-vectors/data/asciidoc"

loader = DirectoryLoader(COURSES_PATH, glob="**/lesson.adoc", loader_cls=TextLoader)
docs = loader.load()

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1500,
    chunk_overlap=200,
    add_start_index=True
)

chunks = text_splitter.split_documents(docs)

# Create an OpenAI embedding provider

# Create a function to get the course data

# Connect to Neo4j

# Create a function to run the Cypher query

# Iterate through the chunks and create the graph

For each chunk, you will have to:

  1. Create an embedding of the text.

  2. Extract the metadata.

Extracting the data

Create an OpenAI embedding provider instance to generate the embeddings:

python
Create embedding_provider
embedding_provider = OpenAIEmbeddings(
    openai_api_key=os.getenv('OPENAI_API_KEY'),
    model="text-embedding-ada-002"
)
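
Optionally, you can confirm that the provider returns vectors with the expected number of dimensions. This is a quick sanity check, not part of the starter code; text-embedding-ada-002 produces 1536-dimensional vectors, which must match the vector index you create later in this lesson:

python
# Optional check: ada-002 embeddings are 1536-dimensional, matching the
# `vector.dimensions` setting of the index created later in this lesson.
vector = embedding_provider.embed_query("How does RAG help ground an LLM?")
print(len(vector))  # expected: 1536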

Create a function to extract the metadata from the chunk:

python
Get course data
def get_course_data(embedding_provider, chunk):
    filename = chunk.metadata["source"]
    path = filename.split(os.path.sep)

    data = {}
    data['course'] = path[-6]
    data['module'] = path[-4]
    data['lesson'] = path[-2]
    data['url'] = f"https://graphacademy.neo4j.com/courses/{data['course']}/{data['module']}/{data['lesson']}"
    data['id'] = f"{filename}.{chunk.metadata['start_index']}"
    data['text'] = chunk.page_content
    data['embedding'] = embedding_provider.embed_query(chunk.page_content)
    return data

The get_course_data function:

  1. Splits the document source path to extract the course, module, and lesson names

  2. Constructs the url using the extracted names

  3. Creates a unique id for the paragraph from the file name and the chunk position

  4. Extracts the text from the chunk

  5. Creates an embedding using the embedding_provider instance

  6. Returns a dictionary containing the extracted data
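
To see what the function produces, you can inspect the metadata for the first chunk. This is an optional check, not part of the starter code, and assumes the course content has been loaded; the long text and embedding values are left out of the output for readability:

python
# Optional: print the extracted metadata for the first chunk,
# omitting the text and embedding values.
sample = get_course_data(embedding_provider, chunks[0])
print({key: value for key, value in sample.items() if key not in ("text", "embedding")})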

Creating the graph

To create the graph, you will need to:

  1. Connect to the Neo4j database

  2. Iterate through the chunks

  3. Extract the course data from each chunk

  4. Create the nodes and relationships in the graph

Connect

Connect to the Neo4j sandbox:

python
graph = Neo4jGraph(
    url=os.getenv('NEO4J_URI'),
    username=os.getenv('NEO4J_USERNAME'),
    password=os.getenv('NEO4J_PASSWORD')
)

Test the connection

You could run your code now to check that you can connect to the OpenAI API and Neo4j sandbox.
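
For example, a couple of quick checks (a sketch, not part of the starter code) will confirm both connections before you process all the chunks:

python
# Optional sanity checks: a trivial Cypher query confirms the Neo4j connection,
# and a short embedding call confirms the OpenAI API key is valid.
print(graph.query("RETURN 1 AS ok"))                # expected: [{'ok': 1}]
print(len(embedding_provider.embed_query("test")))  # expected: 1536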

Create data

To create the data in the graph, you will need a function that incorporates the course data into a Cypher statement and runs it:

python
Create chunk function
def create_chunk(graph, data):
    graph.query("""
        MERGE (c:Course {name: $course})
        MERGE (c)-[:HAS_MODULE]->(m:Module{name: $module})
        MERGE (m)-[:HAS_LESSON]->(l:Lesson{name: $lesson, url: $url})
        MERGE (l)-[:CONTAINS]->(p:Paragraph{id: $id, text: $text})
        WITH p
        CALL db.create.setNodeVectorProperty(p, "embedding", $embedding)
        """, 
        data
    )

The create_chunk function accepts the data dictionary created by the get_course_data function.

You should be able to identify the following parameters in the Cypher statement:

  • $course

  • $module

  • $lesson

  • $url

  • $id

  • $text

  • $embedding

Create chunk

Iterate through the chunks and execute the create_chunk function:

python
for chunk in chunks:
    data = get_course_data(embedding_provider, chunk)
    create_chunk(graph, data)
    print("Processed chunk", data['id'])

For each chunk, the metadata is extracted and used to create a Paragraph node and connect it to its Course, Module, and Lesson nodes in the graph.

The complete code:

python
Build the graph
import os
from dotenv import load_dotenv
load_dotenv()

from langchain_neo4j import Neo4jGraph
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter

COURSES_PATH = "1-knowledge-graphs-vectors/data/asciidoc"

loader = DirectoryLoader(COURSES_PATH, glob="**/lesson.adoc", loader_cls=TextLoader)
docs = loader.load()

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1500,
    chunk_overlap=200,
    add_start_index=True
)

chunks = text_splitter.split_documents(docs)

embedding_provider = OpenAIEmbeddings(
    openai_api_key=os.getenv('OPENAI_API_KEY'),
    model="text-embedding-ada-002"
)

def get_course_data(embedding_provider, chunk):
    filename = chunk.metadata["source"]
    path = filename.split(os.path.sep)

    data = {}
    data['course'] = path[-6]
    data['module'] = path[-4]
    data['lesson'] = path[-2]
    data['url'] = f"https://graphacademy.neo4j.com/courses/{data['course']}/{data['module']}/{data['lesson']}"
    data['id'] = f"{filename}.{chunk.metadata['start_index']}"
    data['text'] = chunk.page_content
    data['embedding'] = embedding_provider.embed_query(chunk.page_content)
    return data

graph = Neo4jGraph(
    url=os.getenv('NEO4J_URI'),
    username=os.getenv('NEO4J_USERNAME'),
    password=os.getenv('NEO4J_PASSWORD')
)

def create_chunk(graph, data):
    graph.query("""
        MERGE (c:Course {name: $course})
        MERGE (c)-[:HAS_MODULE]->(m:Module{name: $module})
        MERGE (m)-[:HAS_LESSON]->(l:Lesson{name: $lesson, url: $url})
        MERGE (l)-[:CONTAINS]->(p:Paragraph{id: $id, text: $text})
        WITH p
        CALL db.create.setNodeVectorProperty(p, "embedding", $embedding)
        """, 
        data
    )

for chunk in chunks:
    data = get_course_data(embedding_provider, chunk)
    create_chunk(graph, data)
    print("Processed chunk", data['id'])

Run the code to create the graph.

The program will take a minute or two to complete as it creates the embeddings for each paragraph.

Explore the graph

View the graph by running the following Cypher:

cypher
MATCH (c:Course)-[:HAS_MODULE]->(m:Module)-[:HAS_LESSON]->(l:Lesson)-[:CONTAINS]->(p:Paragraph)
RETURN *
The result of the Cypher query showing the course content graph
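
You can also check how many Paragraph nodes were created (an optional query, not part of the lesson code):

cypher
MATCH (p:Paragraph)
RETURN count(p) AS paragraphs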

Create vector index

You will need to create a vector index to query the paragraph embeddings.

cypher
Create Vector Index
CREATE VECTOR INDEX paragraphs IF NOT EXISTS
FOR (p:Paragraph)
ON p.embedding
OPTIONS {indexConfig: {
 `vector.dimensions`: 1536,
 `vector.similarity_function`: 'cosine'
}}
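
To confirm the index exists and is online before querying it, you can list the database indexes (optional):

cypher
SHOW INDEXES YIELD name, type, state
WHERE name = 'paragraphs'
RETURN name, type, state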

Query the vector index

You can use the vector index and the graph to find a lesson to help with specific questions:

cypher
Find a lesson
WITH genai.vector.encode(
    "How does RAG help ground an LLM?",
    "OpenAI",
    { token: "sk-..." }) AS userEmbedding
CALL db.index.vector.queryNodes('paragraphs', 6, userEmbedding)
YIELD node, score
MATCH (l:Lesson)-[:CONTAINS]->(node)
RETURN l.name, l.url, score
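
If you prefer not to put your OpenAI API key in the Cypher statement, a similar query can be run from Python using the graph and embedding_provider objects defined earlier (a sketch, not part of the starter code):

python
# Embed the question in Python, then pass the vector as a query parameter
# instead of calling genai.vector.encode with an API key in the statement.
question_embedding = embedding_provider.embed_query("How does RAG help ground an LLM?")

results = graph.query("""
    CALL db.index.vector.queryNodes('paragraphs', 6, $embedding)
    YIELD node, score
    MATCH (l:Lesson)-[:CONTAINS]->(node)
    RETURN l.name AS lesson, l.url AS url, score
    """,
    {"embedding": question_embedding}
)

for row in results:
    print(row)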

Summary

You created a graph of the course content using Neo4j and LangChain.