Unstructured data
Unstructured data is often rich in information but challenging to analyze. Graphs are a powerful tool for representing unstructured data, and vectors and embeddings can help you identify similarities and search for related data.
You will use Python and LangChain to create embeddings and load the unstructured content into a Neo4j graph database.
You will load the content from the GraphAcademy course Neo4j & LLM Fundamentals.
The course content
The 1-knowledge-graphs-vectors/data directory in the workshop repository contains the course data.
Open the directory and note the following structure:
- asciidoc - the course content in asciidoc format
  - courses - the course content
    - llm-fundamentals - the course name
      - modules - numbered directories for each module
        - 01-name - the module name
          - lessons - numbered directories for each lesson
            - 01-name - the lesson name
              - lesson.adoc - the lesson content
Load the content and chunk it
You will load the content and chunk it using Python and LangChain.
More on chunking
When dealing with large amounts of data, breaking it into smaller, more manageable parts is helpful. This process is called chunking.
Smaller pieces of data are easier to work with and process. Embedding models also have size (token) limits and can only handle a certain amount of data.
Embedding large amounts of text may also be less valuable. For example, if you are trying to find a document that references a specific topic, the meaning may be lost in the whole document. Instead, you may only need the paragraph or sentence that contains the relevant information. Conversely, small amounts of data may not contain enough context to be useful.
There are countless strategies for splitting data into chunks, and the best approach depends on the data and the problem you are trying to solve.
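To make the idea concrete, here is a minimal sketch of the simplest possible strategy - fixed-size windows with an overlap. This is for illustration only and is not part of the workshop code; the window and overlap sizes are arbitrary example values:
# A naive fixed-size chunker with overlap - illustration only.
# chunk_size and overlap are arbitrary example values.
def naive_chunks(text, chunk_size=100, overlap=20):
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "Neo4j is a graph database. " * 20
for chunk in naive_chunks(sample):
    print(len(chunk), repr(chunk[:40]))
A fixed-size split like this can cut sentences in half, which is why the workshop uses a splitter that respects paragraph boundaries instead.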
Open the 1-knowledge-graphs-vectors/create_vector.py file and review the program.
import os
from dotenv import load_dotenv
load_dotenv()
from langchain_community.document_loaders import DirectoryLoader, TextLoader
COURSES_PATH = "1-knowledge-graphs-vectors/data/asciidoc"
# Load lesson documents
loader = DirectoryLoader(COURSES_PATH, glob="**/lesson.adoc", loader_cls=TextLoader)
docs = loader.load()
# Create a text splitter
# text_splitter =
# Split documents into chunks
# chunks =
# Create an embedding provider
# embedding_provider =
# Create a Neo4j vector store
# neo4j_db =
The program uses the DirectoryLoader class to load the content from the data/asciidoc directory.
Your task is to add the code to:
- Create a CharacterTextSplitter object to split the content into chunks of text.
- Use the split_documents method to split the documents into chunks of text based on the existence of \n\n and a chunk size of 1500 characters.
Create the text splitter
Create the CharacterTextSplitter object to split the content into paragraphs (\n\n).
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
separator="\n\n",
chunk_size=1500,
chunk_overlap=200,
)
The text_splitter will create chunks of text around 1500 characters long, each containing one or more paragraphs.
Split the documents
Split the documents into chunks of text.
chunks = text_splitter.split_documents(docs)
print(chunks)
More on splitting
The content isn’t split simply on a character (\n\n) or at a fixed number of characters; the process is more involved. Chunks are built up to the maximum size while always breaking at the separator.
In this example, the split_documents method does the following:
- Splits the documents into paragraphs (using the separator \n\n).
- Combines the paragraphs into chunks of text that are up to 1500 characters (chunk_size).
  - If a single paragraph is longer than 1500 characters, the method will not split the paragraph but will create a chunk larger than 1500 characters.
- Adds the last paragraph in a chunk to the start of the next chunk to create an overlap between chunks.
  - If the last paragraph in a chunk is more than 200 characters (chunk_overlap), it will not be added to the next chunk.
This process ensures that:
- Chunks are never too small.
- A paragraph is never split between chunks.
- Chunks are significantly different, and the overlap doesn’t result in a lot of repeated content.
Investigate what happens when you modify the separator, chunk_size, and chunk_overlap parameters.
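For example, a quick way to see the effect is to create a second splitter with different values and print the size of each resulting chunk. This is a minimal sketch, assuming docs has already been loaded as above; the parameter values are just examples:
# Example only: try a smaller chunk size and overlap and inspect the results.
experimental_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=500,
    chunk_overlap=50,
)
experimental_chunks = experimental_splitter.split_documents(docs)
print(len(experimental_chunks), "chunks")
for chunk in experimental_chunks[:5]:
    print(len(chunk.page_content))
Note how chunks larger than chunk_size can still appear when a single paragraph exceeds the limit.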
Create vector index
Once you have chunked the content, you can use the LangChain Neo4jVector class to create embeddings, a vector index, and store the chunks in a Neo4j graph database.
You will need to modify your Python program to:
- Connect to the Neo4j database.
- Create an embedding provider.
- Create the nodes and vector index.
Connect
Connect to the Neo4j database:
from langchain_neo4j import Neo4jGraph
graph = Neo4jGraph(
url=os.getenv('NEO4J_URI'),
username=os.getenv('NEO4J_USERNAME'),
password=os.getenv('NEO4J_PASSWORD'),
)
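If you want to confirm the connection works before going further, you can run a trivial Cypher statement through the graph object. This check is optional and not part of the workshop code:
# Optional: verify the connection by running a trivial query.
result = graph.query("RETURN 1 AS ok")
print(result)  # expect [{'ok': 1}] if the connection succeeds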
Embeddings
Create an embedding provider:
from langchain_openai import OpenAIEmbeddings
embedding_provider = OpenAIEmbeddings(
openai_api_key=os.getenv('OPENAI_API_KEY'),
model="text-embedding-ada-002"
)
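As a quick sanity check, you can ask the provider to embed a short string; the text-embedding-ada-002 model returns a 1536-dimensional vector. This snippet is illustrative and not required by the workshop:
# Optional: embed a test string and inspect the vector.
vector = embedding_provider.embed_query("What is a graph database?")
print(len(vector))  # 1536 dimensions for text-embedding-ada-002
print(vector[:5])   # the first few floats of the embedding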
Create
Create the nodes and vector index:
from langchain_neo4j import Neo4jVector
neo4j_vector = Neo4jVector.from_documents(
chunks,
embedding_provider,
graph=graph,
index_name="chunkVector",
node_label="Chunk",
text_node_property="text",
embedding_node_property="embedding",
)
The code will create Chunk nodes with text and embedding properties, and a vector index called chunkVector.
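You can confirm what was created directly from Python, for example by counting the Chunk nodes through the existing graph connection. This check is optional:
# Optional: count the Chunk nodes created by from_documents.
count = graph.query("MATCH (c:Chunk) RETURN count(c) AS chunks")
print(count)  # [{'chunks': <number of chunks created>}]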
View the complete code
import os
from dotenv import load_dotenv
load_dotenv()
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_neo4j import Neo4jGraph
from langchain_openai import OpenAIEmbeddings
from langchain_neo4j import Neo4jVector
COURSES_PATH = "1-knowledge-graphs-vectors/data/asciidoc"
loader = DirectoryLoader(COURSES_PATH, glob="**/lesson.adoc", loader_cls=TextLoader)
docs = loader.load()
text_splitter = CharacterTextSplitter(
separator="\n\n",
chunk_size=1500,
chunk_overlap=200,
)
chunks = text_splitter.split_documents(docs)
print(chunks)
graph = Neo4jGraph(
url=os.getenv('NEO4J_URI'),
username=os.getenv('NEO4J_USERNAME'),
password=os.getenv('NEO4J_PASSWORD'),
)
embedding_provider = OpenAIEmbeddings(
openai_api_key=os.getenv('OPENAI_API_KEY'),
model="text-embedding-ada-002"
)
neo4j_vector = Neo4jVector.from_documents(
chunks,
embedding_provider,
graph=graph,
index_name="chunkVector",
node_label="Chunk",
text_node_property="text",
embedding_node_property="embedding",
)
Run the program to create the chunk nodes and vector index.
It may take a minute or two to complete.
View chunks in the sandbox
You can now view the chunks in the Neo4j sandbox.
MATCH (c:Chunk) RETURN c LIMIT 25
Query the vector index
You can also query the vector index to find similar chunks. For example, you can find lesson chunks relating to a specific question, "What does Hallucination mean?":
WITH genai.vector.encode(
"What does Hallucination mean?",
"OpenAI",
{ token: "sk-..." }) AS userEmbedding
CALL db.index.vector.queryNodes('chunkVector', 6, userEmbedding)
YIELD node, score
RETURN node.text, score
Remember to replace sk-… with your OpenAI API key.
Experiment with different questions and see how the vector index can find similar chunks.
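You can also run the same kind of similarity search from Python, without writing Cypher, using the neo4j_vector store you created earlier. This sketch uses LangChain's standard similarity_search_with_score method; the number of results (k) is an example value:
# Optional: query the vector index from Python instead of Cypher.
results = neo4j_vector.similarity_search_with_score(
    "What does Hallucination mean?",
    k=6,
)
for chunk, score in results:
    print(score, chunk.page_content[:80])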
Continue
When you are ready, you can move on to the next task.
Summary
You learned to use Python and LangChain to load, chunk, and vectorize unstructured data into a Neo4j graph database.