A Deep Dive Into Vector Databases

Required Installations

In [1]:

!pip install openai numpy pandas singlestoredb langchain==0.1.8 langchain-community==0.0.21 langchain-core==0.1.25 langchain-openai==0.0.6

Vector Embedding Example

In this example, we demonstrate a rule-based system that generates a vector embedding from a single word. The embedding that we generate contains 5 main features:

  • Length of the word (divided by a reference length of 10 characters)

  • Number of vowels in the word (normalized to the length of the word)

  • Whether the word starts with a vowel (1) or not (0)

  • Whether the word ends with a vowel (1) or not (0)

  • Fraction of consonants in the word

This simple rule-based system demonstrates the essence of what vector embedding models do. Real embedding models, however, utilize neural networks that are trained on vast datasets to learn key features and that self-correct using gradient descent.

In [2]:

def word_to_vector(word):
    # Build a 5-dimensional vector from simple, hand-written rules
    if not word:
        raise ValueError("word must be a non-empty string")
    word = word.lower()  # treat uppercase and lowercase vowels alike
    vowels = 'aeiou'
    vector = [0] * 5  # Initialize a vector of 5 dimensions

    # Rule 1: Length of the word, divided by 10 for simplicity
    # (words longer than 10 characters produce a value above 1.0)
    vector[0] = len(word) / 10

    # Rule 2: Number of vowels in the word (normalized to the length of the word)
    vector[1] = sum(1 for char in word if char in vowels) / len(word)

    # Rule 3: Whether the word starts with a vowel (1) or not (0)
    vector[2] = 1 if word[0] in vowels else 0

    # Rule 4: Whether the word ends with a vowel (1) or not (0)
    vector[3] = 1 if word[-1] in vowels else 0

    # Rule 5: Fraction of consonants in the word
    vector[4] = sum(1 for char in word if char not in vowels and char.isalpha()) / len(word)

    return vector

# Example usage
word = "example"
vector = word_to_vector(word)
print(f"Word: {word}\nVector: {vector}")

Vector Similarity Example

In this example, we demonstrate a way to determine the similarity between two vectors. There are many techniques for measuring the similarity between two vectors, but one of the most popular is cosine similarity. Cosine similarity is the dot product of the two vectors divided by the product of the vectors' norms (magnitudes).
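Written out as a formula, the definition above looks as follows. The symbols a, b, and n (the number of dimensions) are notation introduced here for illustration, and the second line is a small worked example with two 2-dimensional vectors.

```latex
\[
\cos(\theta)
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}
\]

% Worked example: a = (1, 0), b = (1, 1)
\[
\cos(\theta) = \frac{1 \cdot 1 + 0 \cdot 1}{\sqrt{1^2 + 0^2} \, \sqrt{1^2 + 1^2}}
             = \frac{1}{\sqrt{2}} \approx 0.707
\]
```

A value of 1 means the vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions.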

This is just an example to show how vector databases search for similar vectors. The fundamental limitation of a system like this is the rule-based embedding: it does not capture the semantic meaning of words, sentences, or paragraphs. Instead, it only classifies the structure of a single word.

In [3]:

import numpy as np

def cosine_similarity(vector_a, vector_b):
    # Calculate the dot product of vectors
    dot_product = np.dot(vector_a, vector_b)
    # Calculate the norm (magnitude) of each vector
    norm_a = np.linalg.norm(vector_a)
    norm_b = np.linalg.norm(vector_b)
    # Calculate cosine similarity
    similarity = dot_product / (norm_a * norm_b)
    return similarity

# Example usage
word1 = "example"
word2 = "sample"
vector1 = word_to_vector(word1)
vector2 = word_to_vector(word2)

# Calculate and print cosine similarity
similarity_score = cosine_similarity(vector1, vector2)
print(f"Cosine similarity between '{word1}' and '{word2}': {similarity_score}")
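To make the limitation of the rule-based embedding concrete, here is a small self-contained sketch. It re-implements the same five rules and the same cosine similarity in compact form (the word pairs are chosen here purely for illustration) and compares two words with unrelated meanings against two words with related meanings.

```python
import numpy as np

VOWELS = "aeiou"

def word_to_vector(word):
    # Same five structural rules as the rule-based embedding above
    return [
        len(word) / 10,
        sum(ch in VOWELS for ch in word) / len(word),
        1 if word[0] in VOWELS else 0,
        1 if word[-1] in VOWELS else 0,
        sum(ch not in VOWELS and ch.isalpha() for ch in word) / len(word),
    ]

def cosine_similarity(vector_a, vector_b):
    return float(np.dot(vector_a, vector_b)
                 / (np.linalg.norm(vector_a) * np.linalg.norm(vector_b)))

# Illustrative word pairs: unrelated meaning vs. related meaning
for first, second in [("cat", "dog"), ("cat", "kitten")]:
    score = cosine_similarity(word_to_vector(first), word_to_vector(second))
    print(f"{first!r} vs {second!r}: {score:.4f}")
```

Because "cat" and "dog" have the same length and the same vowel and consonant pattern, they map to exactly the same vector, so they score higher than "cat" and "kitten". The embedding only sees word structure, not meaning.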

Embedding Models

To capture a semantic understanding of language within vectors, embedding models are required. Embedding models are trained on a vast corpus of language data. Training starts by initializing the word embeddings with random vectors, so that each word in the vocabulary is assigned a vector of real numbers. In Word2Vec, for example, a neural network is then trained on large datasets to predict a word from its context (the Continuous Bag of Words model) or to predict the context given a word (the Skip-Gram model). During training, gradient descent adjusts the word vectors to minimize a loss function, often related to the likelihood of observing a word given its context (or vice versa).

Examples of embedding models include Word2Vec, GloVe, BERT, and OpenAI's text-embedding models.
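The training loop described above can be sketched in a few lines of NumPy. This is a toy Skip-Gram model under several assumptions that are not in the original: a made-up one-sentence corpus, a context window of one word, a full softmax instead of the sampling tricks real implementations use, and hypothetical names such as `W_in` and `W_out`. It only illustrates the idea of starting from random vectors and lowering a loss with gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus (made up for illustration)
corpus = "the dog chased the cat and the cat chased the mouse".split()
vocab = sorted(set(corpus))
index = {word: i for i, word in enumerate(vocab)}

# Build (center, context) training pairs with a window of one word
pairs = []
for pos, word in enumerate(corpus):
    for offset in (-1, 1):
        ctx = pos + offset
        if 0 <= ctx < len(corpus):
            pairs.append((index[word], index[corpus[ctx]]))

dim = 8
# Random initialization: one vector of real numbers per vocabulary word
W_in = rng.normal(scale=0.1, size=(len(vocab), dim))    # word embeddings
W_out = rng.normal(scale=0.1, size=(dim, len(vocab)))   # output weights

def average_loss():
    # Mean negative log-likelihood of the true context word
    total = 0.0
    for center, context in pairs:
        scores = W_in[center] @ W_out
        scores -= scores.max()  # numerical stability
        log_probs = scores - np.log(np.exp(scores).sum())
        total -= log_probs[context]
    return total / len(pairs)

loss_before = average_loss()

learning_rate = 0.1
for epoch in range(200):
    for center, context in pairs:
        hidden = W_in[center].copy()
        scores = hidden @ W_out
        scores -= scores.max()
        probs = np.exp(scores) / np.exp(scores).sum()

        # Gradient of softmax cross-entropy with respect to the scores
        grad_scores = probs.copy()
        grad_scores[context] -= 1.0

        # Gradient descent step on both weight matrices
        grad_hidden = W_out @ grad_scores
        W_out -= learning_rate * np.outer(hidden, grad_scores)
        W_in[center] -= learning_rate * grad_hidden

loss_after = average_loss()
print(f"Average loss before training: {loss_before:.3f}")
print(f"Average loss after training:  {loss_after:.3f}")
```

After training, each row of `W_in` is the learned embedding for one word. Production models differ in scale, architecture, and optimization details, but the principle of adjusting vectors to reduce a prediction loss is the same.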

In [4]:

from openai import OpenAI

OPENAI_KEY = "INSERT OPENAI KEY"
client = OpenAI(api_key=OPENAI_KEY)

def openAIEmbeddings(text):
    # Request an embedding for the given text
    # (the argument itself, not the literal string "input")
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

print(openAIEmbeddings("Golden Retriever"))

As you can see, this is a huge vector! By default, text-embedding-3-small returns 1,536 dimensions for this one input. This is why good dimensionality reduction techniques matter when running similarity searches over many such vectors.
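The notebook does not say which dimensionality reduction technique it has in mind. As one possible illustration, the sketch below applies principal component analysis (PCA) through NumPy's singular value decomposition. The input is random synthetic data standing in for real embeddings, and the helper name `pca_reduce` and the target size of 64 are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for 200 embeddings of 1,536 dimensions each
# (random numbers, not real model output)
embeddings = rng.normal(size=(200, 1536))

def pca_reduce(vectors, n_components):
    # Center the data so each dimension has zero mean
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # Project every vector onto the top n_components directions
    return centered @ vt[:n_components].T

reduced = pca_reduce(embeddings, n_components=64)
print(embeddings.shape, "->", reduced.shape)
```

Shorter vectors are cheaper to store and compare, at the cost of discarding some information, so the target size is a trade-off between speed and search quality.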

Creating a vector database with SingleStoreDB

In the following code, we create a vector database with SingleStoreDB. We use LangChain to split the raw text into chunks (documents) and the OpenAI embeddings model to generate the vector embeddings. We then store the raw documents and their embeddings together in a table, so that each document's text sits alongside its embedding.

To test this out, we perform a similarity search based on a query and it returns the most similar document in the vector database.
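The chunking step mentioned above can be pictured with a deliberately simplified splitter. This is not LangChain's `CharacterTextSplitter` algorithm, which splits on a separator before merging pieces; it is a hypothetical fixed-width version (`chunk_text` is a name made up here) that only shows what the `chunk_size` and `chunk_overlap` settings mean.

```python
def chunk_text(text, chunk_size=2000, chunk_overlap=0):
    # Simplified fixed-width chunking: each chunk has at most chunk_size
    # characters and repeats the last chunk_overlap characters of the
    # previous chunk.
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

# Small illustrative input: 30 characters
sample = "abcdefghij" * 3
chunks = chunk_text(sample, chunk_size=12, chunk_overlap=2)
for number, chunk in enumerate(chunks):
    print(number, repr(chunk))
```

Each chunk is later embedded and stored separately, so smaller chunks give more precise matches while larger chunks keep more surrounding context.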

In [5]:

from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores.singlestoredb import SingleStoreDB

# Load and process documents
loader = TextLoader("michael_jackson.txt") # use your own document

documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# Generate embeddings and create a document search database
embeddings = OpenAIEmbeddings(api_key=OPENAI_KEY)

# Create Vector Database
vector_database = SingleStoreDB.from_documents(docs, embeddings, table_name="mjackson") # create your own table

# Search for the chunk most similar to the query
query = "How old was Michael Jackson when he died?"
results = vector_database.similarity_search(query)
print(results[0].page_content)
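Conceptually, a similarity search ranks the stored vectors by how close they are to the query vector and returns the best matches. The sketch below does this by brute force in memory with cosine similarity. It is not how SingleStoreDB implements search, which runs inside the database and may use a different metric or an index. The function name `top_k_similar` and the tiny 3-dimensional vectors are made up for illustration.

```python
import numpy as np

def top_k_similar(query_vector, stored_vectors, k=2):
    # Score every stored vector against the query with cosine similarity
    stored = np.asarray(stored_vectors, dtype=float)
    query = np.asarray(query_vector, dtype=float)
    similarities = stored @ query / (
        np.linalg.norm(stored, axis=1) * np.linalg.norm(query)
    )
    # Indices of the k highest scores, best first
    order = np.argsort(-similarities)[:k]
    return [(int(i), float(similarities[i])) for i in order]

# Hand-written toy "embeddings" (not produced by a real model)
documents = ["doc about music", "doc about sports", "doc about cooking"]
toy_vectors = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.2, 0.9],
]
query_vector = [0.8, 0.2, 0.1]

results = top_k_similar(query_vector, toy_vectors, k=2)
for doc_index, score in results:
    print(f"{documents[doc_index]}: {score:.3f}")
```

Brute-force scoring compares the query with every stored vector, which is fine for a handful of documents but becomes expensive as the collection grows.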

Retrieval Augmented Generation System

Retrieval-augmented generation (RAG) combines large language models with a retrieval mechanism that searches a database for relevant information before generating a response. Grounding responses in real-world data from the retrieved documents enhances factual accuracy and reduces hallucinations. Documents are vectorized using embeddings and stored in a vector database, such as SingleStoreDB, for efficient retrieval. The user query is converted into a vector, and a vector search is performed in the database to find the documents most relevant to that query. The system returns the documents with the highest relevance scores, which are then fed to the chatbot to generate an informed response.

In [6]:

from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores.singlestoredb import SingleStoreDB
from openai import OpenAI

# Set up the OpenAI client
client = OpenAI(api_key=OPENAI_KEY)

# Load and process documents
loader = TextLoader("michael_jackson.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# Generate embeddings and create a document search database
# (the key must be passed as a keyword argument)
embeddings = OpenAIEmbeddings(api_key=OPENAI_KEY)
docsearch = SingleStoreDB.from_documents(docs, embeddings, table_name="mjackson")

# Chat loop
while True:
    # Get user input
    user_query = input("\nYou: ")

    # Check for exit command
    if user_query.lower() in ['quit', 'exit']:
        print("Exiting chatbot.")
        break

    # Perform similarity search
    docs = docsearch.similarity_search(user_query)
    if docs:
        # Use the most similar chunk as context for the model
        context = docs[0].page_content

        # Generate response using OpenAI GPT-4
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "Context: " + context},
                {"role": "user", "content": user_query}
            ],
            stream=True,
            max_tokens=500,
        )

        # Output the streamed response as it arrives
        print("AI: ", end="")
        for chunk in response:
            if chunk.choices[0].delta.content is not None:
                print(chunk.choices[0].delta.content, end="")
        print()  # finish the line once the stream ends

    else:
        print("AI: Sorry, I couldn't find relevant information.")
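The chat loop above passes only the single best-matching chunk to the model. One possible variation, sketched here as an assumption rather than as part of the original notebook, is to combine several retrieved chunks into the system message while capping the total context length. The helper name `build_rag_messages`, the instruction wording, and the character budget are all hypothetical.

```python
def build_rag_messages(context_chunks, user_query, max_context_chars=6000):
    # Keep chunks in relevance order until the character budget is used up
    selected = []
    used = 0
    for chunk in context_chunks:
        if used + len(chunk) > max_context_chars:
            break
        selected.append(chunk)
        used += len(chunk)

    context = "\n\n---\n\n".join(selected)
    system_prompt = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        "Context:\n" + context
    )
    # Same message structure that a chat completions request expects
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_query},
    ]

# Illustrative usage with placeholder chunks
example_chunks = ["First retrieved chunk.", "Second retrieved chunk."]
messages = build_rag_messages(example_chunks, "What do the chunks say?")
print(messages[0]["content"])
```

In the chat loop, the list passed in could be built from the `page_content` of each document returned by the similarity search, and the returned list could be used as the `messages` argument.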

Details


About this Template

Using SingleStoreDB as a vector database and vector database use cases.

This Notebook can be run in Standard and Enterprise deployments.

License

This Notebook has been released under the Apache 2.0 open source license.