How to Build Lightweight GraphRAG with SQLite

Stephen Collins · Oct 3, 2024 · 10 min read

What you will learn

  • What is GraphRAG? GraphRAG (Graph Retrieval-Augmented Generation) is a system for document processing, relationship extraction, and graph-based querying.
  • Why use SQLite for graph data management in a GraphRAG system? SQLite offers a portable, serverless solution that is ideal for small-to-medium datasets, providing a simple way to store entities and relationships without requiring a full graph database.
  • How are centrality measures used in the GraphRAG system? Centrality measures, such as degree and betweenness, are calculated to prioritize the most relevant entities in the graph and improve query responses.
  • What is the role of the `DocumentProcessor` class in the GraphRAG system? The `DocumentProcessor` class splits documents into manageable chunks, extracts key entities and relationships, and summarizes the results using OpenAI’s GPT models.
  • How can the graph data from the SQLite-based GraphRAG system be visualized? The graph data can be exported to a JSON file with a script and then visualized with D3.js, served over a local HTTP server.

In this tutorial, I’ll walk through an example implementation of a lightweight GraphRAG (Graph Retrieval-Augmented Generation) system using SQLite instead of Neo4j (which is what I used in the previous article). This setup offers a portable, serverless solution for document processing, relationship extraction, and graph-based querying.

We’ll be using SQLite to store entities and relationships extracted from documents using OpenAI’s GPT models. By calculating centrality measures (such as degree and betweenness), we can prioritize the most relevant entities in the graph to improve query responses. Although SQLite lacks some of the advanced features of dedicated graph databases, it is an ideal solution for small-to-medium datasets that don’t require a full graph database like Neo4j.

This guide will cover:

  • Setting up SQLite for managing graph data
  • Extracting entities and relationships using OpenAI GPT models
  • Using centrality measures to improve query relevance
  • Visualizing graph data with D3.js

You can follow along with the complete source code on GitHub.


Prerequisites

Ensure you have the following:

  • Python 3.9+
  • SQLite (comes pre-installed with Python)
  • Necessary libraries: openai, python-dotenv

You can install these dependencies with:

pip install openai python-dotenv
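
Since python-dotenv is on the list, the examples that follow assume the OpenAI key (and database path) come from a local `.env` file. Here is a minimal sketch of that setup, using the variable names shown in the `.env` example later in this post:

import os

from dotenv import load_dotenv
from openai import OpenAI

# Load OPENAI_API_KEY and DB_PATH from a local .env file
load_dotenv()

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
db_path = os.getenv("DB_PATH", "data/graph_database.sqlite")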

Project Overview

This GraphRAG system is designed using an object-oriented approach. It processes documents, extracts entities and relationships, and stores them in a SQLite database. By calculating centrality measures, we can identify the most important nodes (entities) in the graph, enabling more accurate and relevant query responses.

Project Structure

  • app.py: Orchestrates the document processing and querying workflow.
  • GraphManager: Manages SQLite operations, builds the graph, calculates centrality measures, and handles graph updates.
  • QueryHandler: Handles user queries and uses GPT models to generate responses based on centrality measures.
  • DocumentProcessor: Splits documents, extracts entities and relationships, and summarizes them.
  • GraphDatabaseConnection: Manages the SQLite database connection.
  • Logger: Provides logging utilities to track the workflow.

Using SQLite to Manage Graph Data

SQLite offers a simple yet effective way to store graph data, such as entities and relationships, extracted from documents. The GraphDatabaseConnection class manages the connection to the SQLite database:

import sqlite3

class GraphDatabaseConnection:
    def __init__(self, db_path='graph_database.db'):
        self.conn = sqlite3.connect(db_path)
        # Use WAL mode for better performance
        self.conn.execute('PRAGMA journal_mode=WAL;')
        self.initialize_schema()

    def initialize_schema(self):
        with self.conn:
            self.conn.execute('''
                CREATE TABLE IF NOT EXISTS nodes (
                    id TEXT PRIMARY KEY,
                    properties TEXT
                )
            ''')

            self.conn.execute('''
                CREATE TABLE IF NOT EXISTS edges (
                    source TEXT,
                    target TEXT,
                    relationship TEXT,
                    weight REAL,
                    PRIMARY KEY (source, target, relationship),
                    FOREIGN KEY (source) REFERENCES nodes(id),
                    FOREIGN KEY (target) REFERENCES nodes(id)
                )
            ''')

            self.conn.execute('''
                CREATE INDEX IF NOT EXISTS source_idx ON edges(source)
            ''')

            self.conn.execute('''
                CREATE INDEX IF NOT EXISTS target_idx ON edges(target)
            ''')

    def clear_database(self):
        # Remove all existing edges and nodes (used when rebuilding the graph from scratch)
        with self.conn:
            self.conn.execute('DELETE FROM edges')
            self.conn.execute('DELETE FROM nodes')

    def close(self):
        self.conn.close()

    def get_session(self):
        return self.conn

This setup ensures that we have two main tables: nodes (for storing entities) and edges (for relationships between entities). Each time we process documents, entities and their relationships are inserted into these tables. Additionally, indices are created on the source and target columns of the edges table to improve query performance.
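
To make the table layout concrete, here is a minimal sketch of how one parsed relationship might be written into these tables (illustrative only; the real insert logic lives in GraphManager, shown later):

import json

# Assumes the GraphDatabaseConnection class defined above is in scope
db = GraphDatabaseConnection('graph_database.db')
conn = db.get_session()

with conn:
    # Upsert both entities as nodes; extra attributes are stored as a JSON string
    conn.execute("INSERT OR IGNORE INTO nodes (id, properties) VALUES (?, ?)",
                 ("Entity1", json.dumps({"label": "Entity1"})))
    conn.execute("INSERT OR IGNORE INTO nodes (id, properties) VALUES (?, ?)",
                 ("Entity2", json.dumps({"label": "Entity2"})))

    # Record the relationship between them as a weighted, typed edge
    conn.execute(
        "INSERT OR REPLACE INTO edges (source, target, relationship, weight) VALUES (?, ?, ?, ?)",
        ("Entity1", "Entity2", "related to", 0.8),
    )

db.close()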


Document Processing with DocumentProcessor

The DocumentProcessor class is responsible for splitting documents, extracting key entities and relationships, and summarizing the results using OpenAI’s GPT models. Here’s how you can process a batch of documents:

Example: Document Processing

from logger import Logger


class DocumentProcessor:
    logger = Logger("DocumentProcessor").get_logger()

    def __init__(self, client, model):
        self.client = client
        self.model = model

    def split_documents(self, documents, chunk_size=600, overlap_size=100):
        chunks = []
        for document in documents:
            for i in range(0, len(document), chunk_size - overlap_size):
                chunk = document[i:i + chunk_size]
                chunks.append(chunk)
        self.logger.debug("Documents split into %d chunks", len(chunks))
        return chunks

    def extract_elements(self, chunks):
        elements = []
        for index, chunk in enumerate(chunks):
            self.logger.debug(
                f"Extracting elements and relationship strength from chunk {index + 1}")
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system",
                        "content": "Extract entities, relationships, and their strength from the following text. Use common terms such as 'related to', 'depends on', 'influences', etc., for relationships, and estimate a strength between 0.0 (very weak) and 1.0 (very strong). Format: Parsed relationship: Entity1 -> Relationship -> Entity2 [strength: X.X]. Do not include any other text in your response. Use this exact format: Parsed relationship: Entity1 -> Relationship -> Entity2 [strength: X.X]."},
                    {"role": "user", "content": chunk}
                ]
            )
            entities_and_relations = response.choices[0].message.content
            elements.append(entities_and_relations)
        self.logger.debug("Elements extracted")
        return elements

    def summarize_elements(self, elements):
        summaries = []
        for index, element in enumerate(elements):
            self.logger.debug(f"Summarizing element {index + 1}")
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": "Summarize the following entities and relationships in a structured format. Use common terms such as 'related to', 'depends on', 'influences', etc., for relationships. Use '->' to represent relationships after the 'Relationships:' word."},
                    {"role": "user", "content": element}
                ]
            )
            summary = response.choices[0].message.content
            summaries.append(summary)
        self.logger.debug("Summaries created")
        return summaries

This class will split a large document into manageable chunks and use OpenAI’s GPT model to extract and summarize the key entities and relationships.
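
To see how the pieces fit together, a typical driver for this class might look like the following sketch (the model name and sample file path are assumptions, not taken from the repository):

from openai import OpenAI

# Assumes the DocumentProcessor class above is in scope
client = OpenAI()  # reads OPENAI_API_KEY from the environment
processor = DocumentProcessor(client, model="gpt-4o-mini")  # model name is an assumption

documents = [open("data/sample.txt").read()]  # sample path is an assumption

chunks = processor.split_documents(documents, chunk_size=600, overlap_size=100)
elements = processor.extract_elements(chunks)       # raw "Parsed relationship: ..." lines
summaries = processor.summarize_elements(elements)  # normalized relationship summaries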


Graph Management with GraphManager

Once the entities and relationships have been extracted, the GraphManager class stores them in SQLite as a graph, where nodes represent entities and edges represent relationships. It also calculates centrality measures to identify the most important nodes.

Example: Building the Graph

import time

from logger import Logger


class GraphManager:
    logger = Logger("GraphManager").get_logger()

    def __init__(self, db_connection):
        self.db_connection = db_connection
        # Optionally clear the database on startup; remove this line to keep existing data
        self.db_connection.clear_database()

    def build_graph(self, summaries):
        if self.db_connection is None:
            self.logger.error("Graph database connection is not available.")
            return

        entities = {}
        conn = self.db_connection.get_session()

        with conn:
            for summary in summaries:
                # Insert entities and relationships into SQLite
                # Parse the summary to extract entities and relationships
                # ...

This method creates nodes and relationships based on the processed document summaries and stores them in the SQLite database.
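
The parsing step elided above typically amounts to matching the `Entity1 -> Relationship -> Entity2 [strength: X.X]` pattern that the prompts enforce. Here is a hedged sketch of what that parsing could look like (the regex and helper name are illustrative, not the repository's exact code):

import re

RELATIONSHIP_PATTERN = re.compile(
    r"(.+?)\s*->\s*(.+?)\s*->\s*(.+?)\s*\[strength:\s*([\d.]+)\]"
)


def parse_relationships(summary):
    """Yield (source, relationship, target, weight) tuples from a summary string."""
    for line in summary.splitlines():
        match = RELATIONSHIP_PATTERN.search(line)
        if not match:
            continue
        source, relationship, target, weight = match.groups()
        # Drop any leading prefix such as "Parsed relationship:" before the entity name
        source = source.split(":")[-1]
        yield source.strip(), relationship.strip(), target.strip(), float(weight)

Each yielded tuple can then be written to the nodes and edges tables with INSERT statements like the ones sketched earlier.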


Using Centrality Measures to Enhance Query Relevance

The GraphManager class also calculates centrality measures. In this SQLite version, only degree centrality is computed directly; betweenness and closeness are returned as empty placeholders, since SQLite has no graph-native algorithms. These metrics help prioritize the most relevant entities for answering user queries.

Example: Calculating Centrality

    def calculate_centrality_measures(self):
        conn = self.db_connection.get_session()

        # Degree centrality: count incoming/outgoing edges for each node
        degree_centrality_query = """
            SELECT id,
                   (SELECT COUNT(*) FROM edges WHERE source = nodes.id) +
                   (SELECT COUNT(*) FROM edges WHERE target = nodes.id) as degree
            FROM nodes
            ORDER BY degree DESC
            LIMIT 10
        """

        self.logger.debug("Starting degree centrality query")
        start_time = time.time()
        degree_centrality_result = conn.execute(
            degree_centrality_query).fetchall()
        end_time = time.time()
        self.logger.debug(
            f"Degree centrality query completed in {end_time - start_time:.8f} seconds")

        # We won't implement betweenness and closeness for now as SQLite does not have graph-native support
        centrality_data = {
            "degree": [{"entityName": row[0], "score": row[1]} for row in degree_centrality_result],
            "betweenness": [],
            "closeness": []
        }

        return centrality_data

This function calculates degree centrality—the number of connections each node has—which is a simple yet effective way to identify key entities in the graph. By leveraging indices on the source and target columns of the edges table, the function ensures faster lookup and improved performance.
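
The centrality data then feeds into query answering. The QueryHandler mentioned in the project structure folds the top-ranked entities into the prompt sent to the model; here is a rough sketch of that idea (prompt wording and method names are assumptions, not the repository's exact code):

class QueryHandler:
    def __init__(self, client, model, graph_manager):
        self.client = client
        self.model = model
        self.graph_manager = graph_manager

    def ask_question(self, query):
        # Pull the highest-degree entities so the model focuses on central concepts
        centrality = self.graph_manager.calculate_centrality_measures()
        top_entities = ", ".join(item["entityName"] for item in centrality["degree"])

        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system",
                 "content": f"Answer using the knowledge graph. The most central entities are: {top_entities}."},
                {"role": "user", "content": query},
            ],
        )
        return response.choices[0].message.content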


Running the Application

To get started, follow these steps:

  1. Clone the repository:

    git clone git@github.com:stephenc222/example-graphrag-with-sqlite.git
    cd example-graphrag-with-sqlite
  2. Install the dependencies:

    pip install -r requirements.txt
  3. Create a .env file with your OpenAI API key:

    OPENAI_API_KEY=<your-openai-api-key>
    DB_PATH=data/graph_database.sqlite
  4. Run the application:

    python app.py

This will:

  1. Index the initial documents.
  2. Process a user query to extract key themes.
  3. Reindex the graph with new documents and update centrality measures.
  4. Handle another query based on the updated graph.
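
Put together, app.py wires these pieces into that workflow. A condensed sketch of what the orchestration might look like (module paths, file names, and the example query are assumptions):

import os

from dotenv import load_dotenv
from openai import OpenAI

# Hypothetical module paths; adjust to match the repository layout
from graph_database import GraphDatabaseConnection
from graph_manager import GraphManager
from document_processor import DocumentProcessor
from query_handler import QueryHandler

load_dotenv()

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
db = GraphDatabaseConnection(os.getenv("DB_PATH", "data/graph_database.sqlite"))

processor = DocumentProcessor(client, model="gpt-4o-mini")  # model name is an assumption
graph_manager = GraphManager(db)
query_handler = QueryHandler(client, "gpt-4o-mini", graph_manager)

# 1. Index the initial documents
documents = [open("data/initial_documents.txt").read()]  # file name is an assumption
summaries = processor.summarize_elements(
    processor.extract_elements(processor.split_documents(documents)))
graph_manager.build_graph(summaries)

# 2. Answer a query against the graph
print(query_handler.ask_question("What are the key themes in these documents?"))

# Reindexing with new documents repeats the same split/extract/summarize/build_graph cycle
db.close()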

Visualizing the Graph with D3.js

To visualize the graph, we can use D3.js. First, we need to export the graph data from SQLite to a JSON file using the export_graph_data.py script:

python export_graph_data.py data/graph_database.sqlite
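
For reference, a minimal version of such an export script might look like this (the output path and JSON shape are assumptions; the repository's script may differ):

import json
import sqlite3
import sys

db_path = sys.argv[1]  # e.g. data/graph_database.sqlite
conn = sqlite3.connect(db_path)

nodes = [{"id": row[0], "properties": row[1]}
         for row in conn.execute("SELECT id, properties FROM nodes")]
links = [{"source": row[0], "target": row[1], "relationship": row[2], "weight": row[3]}
         for row in conn.execute("SELECT source, target, relationship, weight FROM edges")]

# D3 force layouts conventionally expect a {"nodes": [...], "links": [...]} shape
with open("public/graph_data.json", "w") as f:  # output path is an assumption
    json.dump({"nodes": nodes, "links": links}, f, indent=2)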

Next, we use D3.js to visualize the graph. From the root of the project, run:

python -m http.server --directory public 8000

Then, open your browser and navigate to http://localhost:8000 to see the graph visualization.


Conclusion

SQLite provides a lightweight and easy-to-set-up solution for managing a GraphRAG system, ideal for small-to-medium datasets, conservatively on the order of 100 to 1,000 documents. By combining it with OpenAI’s GPT models, we can efficiently process documents, extract entities, and handle queries with relevance-enhancing centrality measures. While SQLite doesn’t offer the advanced graph algorithms of Neo4j, it’s an excellent choice for lightweight, self-contained applications.

Check out the full source code on GitHub, and feel free to extend this implementation with more advanced features or optimizations tailored to your use case.