Enhancing MySQL Searches with Vector Embeddings

Imagine transforming MySQL into a search powerhouse capable of understanding the subtleties of human language.

Vector embeddings are the secret ingredient, enabling just that. Yet, traditional MySQL setups falter with the complexities of high-dimensional vector data.

This is where the newly open-sourced mysql_vss plugin that I created comes in, integrating Spotify’s Annoy library to revolutionize MySQL’s search functions.

Say goodbye to slow and tricky searches. The mysql_vss plugin is a real game-changer. Once production ready, it will be perfect for developers and companies who need quick and sharp semantic searches, compatible with any vector embedding model you choose.

This blog post will show you how mysql_vss can change the game. I’ll walk you through a simple demo to see how it can make MySQL semantic searches faster and smarter for powering AI with context of your data.

The Naive Approach to Vector Embedding Storage and Search

The conventional method of storing vector embeddings in MySQL databases involves the use of BLOBs or JSON fields. To find the most similar vectors to a query vector, we would typically calculate the Euclidean distance between vectors and sort these distances. This brute-force method is straightforward but comes with a hefty cost in computational resources, and it’s notoriously slow and inefficient for large datasets.

Introducing mysql_vss: An Advanced Solution

Enter mysql_vss, a MySQL plugin that brings the power of Spotify’s Annoy (Approximate Nearest Neighbors Oh Yeah) library to the world of relational databases. With mysql_vss, users can perform fast and efficient similarity searches on vector embeddings right within MySQL. This solution not only outperforms the naive approach but also offers a seamless experience for those already familiar with MySQL.

Schema Setup for Vector Embeddings with mysql_vss

To use mysql_vss, first set up an embeddings table in MySQL. This table should include fields for unique identifiers, the vector data in JSON format, the original text or data that the vector represents, and an Annoy index. The JSON format provides flexibility and ease of use, making it simple to store and retrieve vector data.

*For the v0.0.1 pre-production release of mysql_vss, we use this embeddings table to lazy load the Annoy Index on first loadable function (formerly, “UDF”) invocation. After the first invocation, the Annoy Index handles all vector search operations.

An minimum schema for the embeddings table:

-- our embeddings table
CREATE TABLE IF NOT EXISTS embeddings (
    ID INT PRIMARY KEY,
    vector JSON NOT NULL,
    original_text TEXT NOT NULL,
    annoy_index INT
    -- Any other related fields, such as foreign keys to other tables the vector column represents
);

Now, all we need to do is create the MySQL loadable function, after copying libmysql_vss.so over into the MySQL plugin directory:

-- create our vss_search function
CREATE FUNCTION vss_search RETURNS STRING SONAME 'libmysql_vss.so';

The return of vss_search is a comma-separated string containing the relevant embedding table primary keys, from the Annoy Index search result list.

Using the Example Application to Demonstrate mysql_vss

The example application provided with mysql_vss offers a hands-on way to understand how the plugin works. By using a Docker container, you can quickly set up a MySQL instance pre-loaded with mysql_vss and sample data. The app.py script in the example application showcases how to connect to the MySQL database, generate vector embeddings from text data, store these embeddings in the embeddings table, and perform vector similarity searches.

import mysql.connector
import json
from embedding_util import generate_embeddings

# Database connection parameters
db_config = {
    "host": "127.0.0.1",
    "user": "mysql",
    "password": "password1234",
    "database": "wordpress",
    "port": 3306
}

# Connect to the database
try:
    db_connection = mysql.connector.connect(**db_config)
    cursor = db_connection.cursor()

    # Fetch each post content from the wp_posts table
    cursor.execute("SELECT ID, post_content FROM wp_posts;")
    records = cursor.fetchall()

    for record in records:
        post_id, post_content = record
        # Generate a vector embedding for the post content
        embedding = generate_embeddings(post_content)

        # Insert the generated embedding into the embeddings table
        insert_query = """
        INSERT IGNORE INTO embeddings (ID, vector, original_text)
        VALUES (%s, %s, %s);
        """
        cursor.execute(
            insert_query, (post_id, json.dumps(embedding), post_content))

    # Commit the changes to the database
    db_connection.commit()

    query = "Do you have any content about sea creatures?"
    # Generate an embedding for the query
    query_embedding = generate_embeddings(query)

    query_embedding_str = json.dumps(query_embedding)

    # Look up the original content from the source embeddings table using the retrieved IDs
    combined_query = """
    SELECT
        e.original_text
    FROM
        embeddings AS e
    WHERE
        FIND_IN_SET(e.ID, (SELECT CAST(vss_search(%s) AS CHAR))) > 0
    ORDER BY
        FIELD(e.ID, (SELECT CAST(vss_search(%s) AS CHAR))) DESC
    """

    cursor.execute(combined_query, (query_embedding_str, query_embedding_str))
    results = cursor.fetchall()

    print(f"QUERY: {query}")
    if results:
        for idx, row in enumerate(results):
            print(f"TOP RESULT {idx + 1}: {row[0]}")
    else:
        print("No matches found.")

except mysql.connector.Error as err:
    print("Error:", err)

finally:
    # Close the cursor and connection
    cursor.close()
    db_connection.close()

This example illustrates how mysql_vss can be used, offering the ability to work with vector data in MySQL databases through an efficient, MySQL-native approach.

Advantages of mysql_vss Over the Naive Approach

The performance improvements of using mysql_vss over the naive approach are significant. By leveraging Annoy’s approximate nearest neighbor search algorithm, mysql_vss can perform similarity searches orders of magnitude faster than the traditional method. This efficiency opens the door to real-time search and analysis on large text collections (such as blog posts, ecommerce product listings or help center resources) that were previously too cumbersome to handle.

Current Limitations and Future Enhancements

As with any emerging technology, mysql_vss is not without its limitations. Currently, it supports a fixed dimensionality of vector embeddings, which may not suit all use cases, and has not been extensively stress tested to determine performance thresholds related to data scale.

🚧 It is not ready for production yet!

However, future enhancements are on the horizon, such as dynamic index reloading and improved data scaling capabilities, which will further refine and expand the plugin’s functionality. See known issues on the GitHub repository.

Conclusion and Potential Use Cases for mysql_vss

mysql_vss exemplifies the innovative spirit transforming database management, offering a sophisticated and user-friendly approach to executing vector similarity searches within MySQL. As this technology continues to evolve, it has the potential to revolutionize a variety of fields and applications:

Enhanced Search Capabilities for E-Commerce: Online retailers can integrate mysql_vss to improve their product search experience, allowing customers to find what they’re looking for with unprecedented accuracy and speed.
Content Recommendation Engines: Media platforms can utilize vector similarity search to recommend articles, videos, and products more effectively, keeping users engaged and increasing platform retention.
Advanced Search in WordPress: For WordPress sites that host vast amounts of content, mysql_vss can be used to develop advanced search features that quickly connect users with the most relevant articles, enhancing the user experience.
Chatbots and Customer Support: WordPress sites with integrated chatbots can benefit from mysql_vss by providing more accurate responses to user inquiries. By searching through vectorized knowledge bases, chatbots can deliver responses that are contextually relevant to complex queries.
Research and Data Analysis: Researchers can leverage the plugin for large-scale data analysis, particularly in fields like bioinformatics and social sciences where pattern recognition in high-dimensional data is crucial.

mysql_vss offers a MySQL-native vector embedding search and storage solution, especially suited (currently) for small to medium scale MySQL databases. For database administrators and developers alike, it offers an exciting opportunity to combine the power of vector data and Large Language Models to drive innovation in your applications.

Questions? Raise an issue or feel free to reach out!