Introduction to Vector Search and Embeddings
What you will learn
- What is vectorization in the context of NLP? Vectorization is a technique that converts textual information into a numerical format, enabling computers to understand and interpret text.
- What is an embedding in natural language processing? An embedding is a multi-dimensional numerical representation of a word or group of words that captures meaning and semantic relationships between words.
- What does vector search involve? Vector search finds the vectors in a given space that are closest or most relevant to a particular query vector, often using cosine similarity to score the matches.
- How does the Python code example use embeddings to find relevant texts? It uses the Sentence Transformers library to create embeddings of texts and calculates cosine similarity with a query vector to return the indices of the most relevant texts.
- What next steps are suggested for scaling vector search? Explore specialized vector databases like FAISS, Annoy, or Elasticsearch, and investigate more robust ways of creating embeddings using models like BERT and RoBERTa or APIs like OpenAI and Cohere.
In the age of information, extracting relevant results from massive data sets is crucial. To understand and interpret text, modern natural language processing (NLP) tools employ a technique called vectorization, where textual information is converted into numerical format. In this blog post, we’ll explore how vector search and embeddings are used to find similar or relevant texts, and we’ll examine a real-world code example.
What Are Embeddings?
Embeddings are the core concept behind converting text into numerical vectors. An embedding is essentially a multi-dimensional representation of a word or a group of words. These numerical representations capture the meaning and semantic relationships between words, allowing computers to “understand” text.
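To make this concrete, here’s a minimal sketch using the same sentence-transformers model as the example later in this post; the printed values are illustrative, since the exact numbers depend on the model:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')
# An embedding is just an array of floats; this model produces 768 dimensions
vector = model.encode(["The cat sat on the mat."])[0]
print(vector.shape)  # (768,)
print(vector[:3])    # e.g. [ 0.12 -0.48  0.91] -- illustrative values only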
What Is Vector Search?
Vector search refers to finding the closest vectors in a given space that are relevant or similar to a particular query vector. It’s like searching for items in a database, but instead of matching text, you’re matching mathematical vectors. This search is often conducted using a measure called cosine similarity, which computes the cosine of the angle between two vectors.
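Concretely, cosine similarity is the dot product of two vectors divided by the product of their magnitudes. Here’s a minimal numpy sketch (the helper name cosine_sim is our own; the example below uses sklearn’s cosine_similarity instead):
import numpy as np
def cosine_sim(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); 1 means same direction, -1 opposite
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, just twice as long
print(cosine_sim(a, b))  # ~1.0, up to floating-point rounding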
Code Example: Finding Relevant Texts with Sentence Embeddings
Let’s delve into a Python code example to see these concepts in action. We’ll use the Sentence Transformers library to create embeddings, the sklearn library to calculate cosine similarity, numpy to sort the results of the cosine_similarity function, and the openai Python package to make calls to OpenAI’s API.
import os
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import openai
from dotenv import load_dotenv
Step 1: Setting up the Environment
First, we need to install these packages:
pip install sentence-transformers scikit-learn numpy openai python-dotenv
Next, we’ll load our OpenAI API key from a local .env file, and we’ll suppress a parallelism warning raised by the sentence_transformers package (not a concern for our single-threaded example script). If you need an OpenAI API key, here are some instructions. For this blog post we use the dotenv Python package to load the .env file for simplicity.
Assuming your .env file, located right next to your Python script, looks like this:
OPENAI_API_KEY=YOUR_OPENAI_API_KEY
Then we can load our OpenAI API key:
# loads our ".env", assumes it is in the same directory as this Python script
load_dotenv()
# This script is single-threaded, so it's safe to disable tokenizer
# parallelism; setting this also suppresses the TOKENIZERS_PARALLELISM warning
os.environ["TOKENIZERS_PARALLELISM"] = "false"
# Load our OpenAI API key
openai.api_key = os.getenv('OPENAI_API_KEY')
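As an optional sanity check of our own (not part of the original flow), we can fail fast if the key wasn’t found:
if openai.api_key is None:
    raise RuntimeError("OPENAI_API_KEY not set; check your .env file")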
Step 2: Creating Text Embeddings
We use the Sentence Transformer model bert-base-nli-mean-tokens to convert example texts into numerical vectors (embeddings).
# Initialize our model
model = SentenceTransformer('bert-base-nli-mean-tokens')
# Example texts for our queries later
texts = [
    "John loves playing basketball on the weekends.",
    "Emily is a huge fan of soccer and never misses a game.",
    "Mike enjoys going golfing with his friends.",
    "Sarah's favorite sport is tennis, and she plays every Thursday.",
    "Tom and his friends are passionate about baseball and often watch games together."
]
# Create our embeddings. For this blog post, we just keep them in memory;
# better approaches are covered later in the post.
text_vectors = model.encode(texts)
Step 3: Defining the Search Function
We define a function get_relevant_texts that takes a query string and finds the most relevant texts in our corpus.
# perform the vector search to find the relevant texts
def get_relevant_texts(query: str):
    query_vector = model.encode([query])
    similarities = cosine_similarity(query_vector, text_vectors)
    indices = np.argsort(similarities[0])[::-1]
    return indices[:2]
This function calculates the cosine similarity between the query vector and the text vectors and returns the indices of the two most relevant texts.
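For example, calling the function directly (the query here is our own, and the exact indices depend on the model’s scores):
indices = get_relevant_texts("Who plays a racquet sport?")
print(indices)            # e.g. [3 1] -- index 3 is Sarah's tennis sentence
print(texts[indices[0]])  # "Sarah's favorite sport is tennis, and she plays every Thursday."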
Step 4: Using OpenAI’s GPT-3.5 Turbo for Contextual Responses
We take the retrieved texts and construct a chat prompt to send to OpenAI’s gpt-3.5-turbo model, allowing it to respond with context-aware answers.
# Encapsulate the call to OpenAI's gpt-3.5-turbo model
def get_response_with_context(text: str):
    relevant_texts_indices = get_relevant_texts(text)
    relevant_texts = [texts[i] for i in relevant_texts_indices]
    # Label each retrieved text, then append the user's question
    content = ' '.join(f'text: {t}' for t in relevant_texts) + ' user: ' + text
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a QA bot given texts to answer questions."},
            {"role": "user", "content": content}],
        # Temperature set to 0 to reduce the randomness of the response.
        # Better for applications that expect consistent responses.
        temperature=0,
        max_tokens=512)
    return response.choices[0].message.content
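For the first test query below, the content string assembled by get_response_with_context looks roughly like this (which text is retrieved second depends on the similarity scores):
text: John loves playing basketball on the weekends. text: Tom and his friends are passionate about baseball and often watch games together. user: What does John do on the weekend?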
Step 5: Testing the Code
We can then test our code with specific queries to see the relevant responses based on the context provided.
test_query = "What does John do on the weekend?"
answer = get_response_with_context(text=test_query)
# Should respond with something like:
# "John plays basketball on the weekends."
print(answer)
# Another example
test_query = "Who likes tennis?"
answer = get_response_with_context(text=test_query)
# Should respond with something like:
# "Sarah likes tennis."
print(answer)
Next Steps
While this blog post has only covered a basic introduction to using embeddings, here are some next steps to consider:
- Vector Databases: As your data grows, efficiently searching through millions of vectors becomes a challenge. Specialized vector databases like FAISS or Annoy, or Elasticsearch's vector search capabilities, can manage and search large-scale vector data; see the sketch after this list. In addition, databases like SQLite and PostgreSQL have extensions, sqlite-vss and pgvector respectively, that can store and query vector embeddings.
- More Robust Ways of Creating Embeddings: The example in this post takes a straightforward approach, but many other methods and models exist. Pre-trained models like BERT, RoBERTa, and DistilBERT offer different characteristics and performance, and APIs from OpenAI and Cohere can also create vector embeddings. Investigate these alternatives to find the best approach for your specific task.
Conclusion
Vector search and embeddings are powerful tools in modern NLP. They let us represent text in a format that machines can interpret, enabling us to find relevant information and derive insights, and they offer one way to work around the limited context window of large language models. The code example demonstrates how these concepts can be applied using popular libraries to search within a set of texts and obtain contextual answers. Whether it’s for a chatbot or a recommendation engine, these techniques can play a vital role in various applications.
Thanks for reading this far! Here’s the full example code used in this blog post:
import os
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import openai
from dotenv import load_dotenv
# loads our ".env", assumes it is in the same directory as this Python script
load_dotenv()
# This script is single-threaded, so it's safe to disable tokenizer
# parallelism; setting this also suppresses the TOKENIZERS_PARALLELISM warning
os.environ["TOKENIZERS_PARALLELISM"] = "false"
# Initialize our model
model = SentenceTransformer('bert-base-nli-mean-tokens')
texts = [
    "John loves playing basketball on the weekends.",
    "Emily is a huge fan of soccer and never misses a game.",
    "Mike enjoys going golfing with his friends.",
    "Sarah's favorite sport is tennis, and she plays every Thursday.",
    "Tom and his friends are passionate about baseball and often watch games together."
]
# Create our embeddings. For this blog post, we just keep them in memory;
# better approaches are covered later in the post.
text_vectors = model.encode(texts)
# Load our OpenAI API key
openai.api_key = os.getenv('OPENAI_API_KEY')
# perform the vector search to find the relevant texts
def get_relevant_texts(query: str):
    query_vector = model.encode([query])
    similarities = cosine_similarity(query_vector, text_vectors)
    indices = np.argsort(similarities[0])[::-1]
    return indices[:2]
# Encapsulate the call to OpenAI's gpt-3.5-turbo model
def get_response_with_context(text: str):
    relevant_texts_indices = get_relevant_texts(text)
    relevant_texts = [texts[i] for i in relevant_texts_indices]
    # Label each retrieved text, then append the user's question
    content = ' '.join(f'text: {t}' for t in relevant_texts) + ' user: ' + text
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a QA bot given texts to answer questions."},
            {"role": "user", "content": content}],
        # Temperature set to 0 to reduce the randomness of the response.
        # Better for applications that expect consistent responses.
        temperature=0,
        max_tokens=512)
    return response.choices[0].message.content
test_query = "What does John do on the weekend?"
answer = get_response_with_context(text=test_query)
# Should respond with something like:
# "John plays basketball on the weekends."
print(answer)
# Another example
test_query = "Who enjoys tennis?"
answer = get_response_with_context(
text=test_query)
# Should respond with something like:
# "Sarah enjoys tennis."
print(answer)