Vector Search with Kai

In this notebook, we load a dataset into a collection, create a vector index, and perform vector searches using Kai in a way that is compatible with MongoDB clients and applications.

In [1]:

!pip install datasets --quiet

In [2]:

import os
import pprint
import time
import concurrent.futures
import datasets
from pymongo import MongoClient
from datasets import load_dataset
from bson import json_util

1. Initializing a pymongo client

In [3]:

current_database = %sql SELECT DATABASE() as CurrentDatabase
DB = current_database[0][0]
COLLECTION = 'wiki_embeddings'

In [4]:

# Using the environment variable that holds the Kai endpoint
client = MongoClient(connection_url_kai)
collection = client[DB][COLLECTION]

2. Create a collection and load the dataset

We recommend creating the collection with the embedding field as a top-level column, which optimizes storage utilization. The column name must match the name of the field that holds the embedding (here, emb).

In [5]:

client[DB].create_collection(
    COLLECTION,
    columns=[{'id': "emb", 'type': "VECTOR(768) NOT NULL"}],
)

In [6]:

# Using the "wikipedia-22-12-simple-embeddings" dataset from Hugging Face
dataset = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train")

In [7]:

DB_SIZE = 50000  # Currently loading 50k documents into the collection; can go to a max of 485,859 for this dataset
insert_data = []
insert_count = 0
# Iterate through the dataset and prepare the documents for insertion
# The script below ingests 1000 records into the database at a time
for item in dataset:
    if insert_count >= DB_SIZE:
        break
    # Convert the dataset item to MongoDB document format
    doc_item = json_util.loads(json_util.dumps(item))
    insert_data.append(doc_item)

    # Insert in batches of 1000 documents
    if len(insert_data) == 1000:
        collection.insert_many(insert_data)
        insert_count += 1000
        print(f"{insert_count} of {DB_SIZE} records ingested")
        insert_data = []

# Insert any remaining documents (a final batch smaller than 1000)
if len(insert_data) > 0:
    collection.insert_many(insert_data)
    insert_count += len(insert_data)
# Report completion whether or not a partial batch was left over
print("Data Ingested")
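
The batching in the loop above can also be factored into a small generator. The following is a minimal sketch, not part of the original notebook: `batched` is a hypothetical helper name, and its `limit` parameter plays the role of `DB_SIZE`. It handles the final partial batch, and a limit that is not a multiple of the batch size, in one place.

```python
from itertools import islice


def batched(iterable, batch_size, limit=None):
    """Yield lists of up to batch_size items, stopping after limit items in total.

    Hypothetical helper for illustration; limit=None means "consume everything".
    """
    iterator = iter(iterable) if limit is None else islice(iterable, limit)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Assumed usage with the notebook's objects (requires a live Kai connection):
# for batch in batched(dataset, 1000, limit=DB_SIZE):
#     docs = [json_util.loads(json_util.dumps(item)) for item in batch]
#     collection.insert_many(docs)
```

Because the cap is applied before batching, the number of inserted documents never exceeds the limit, even when the limit falls in the middle of a batch.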

A sample document from the collection

In [8]:

sample_doc = collection.find_one()
pprint.pprint(sample_doc, compact=True)

3. Create a vector index

In [9]:

client[DB].command({
    'createIndexes': COLLECTION,
    'indexes': [{
        'key': {'emb': 'vector'},
        'name': 'vector_index',
        'kaiSearchOptions': {"index_type": "AUTO", "metric_type": "EUCLIDEAN_DISTANCE", "dimensions": 768}
    }],
})
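
The index above is created with the `EUCLIDEAN_DISTANCE` metric. For reference, here is a minimal pure-Python sketch of that metric. The function name `euclidean_distance` is illustrative and is not part of Kai or pymongo.

```python
import math


def euclidean_distance(a, b):
    """Return the straight-line (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```

With this metric a smaller value means a closer match, and a vector is at distance 0 from itself. This is why a query that reuses a stored document's own embedding would be expected to rank that document first.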

Use the embedding of the sample document retrieved above as the query vector.

In [10]:

# input vector
query_vector = sample_doc['emb']

4. Perform a vector search

In [11]:

def execute_kai_search(query_vector):
    pipeline = [
        {
            '$vectorSearch': {
                "index": "vector_index",
                "path": "emb",
                "queryVector": query_vector,
                "numCandidates": 20,
                "limit": 3,
            }
        },
        {
            '$project': {
                '_id': 1,
                'text': 1,
            }
        }
    ]
    results = collection.aggregate(pipeline)
    return list(results)

In [12]:

execute_kai_search(query_vector)
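
To sanity-check index results on a small sample, it can help to compare them with an exact, brute-force search over documents held in memory. The sketch below is an illustration only and does not use Kai. `brute_force_search` and the toy `docs` list are hypothetical, and the ranking uses squared Euclidean distance, which orders results the same way as Euclidean distance.

```python
import heapq


def brute_force_search(query_vector, documents, limit=3, path="emb"):
    """Return the `limit` documents whose `path` vector is closest to query_vector."""
    def squared_distance(doc):
        return sum((q - v) ** 2 for q, v in zip(query_vector, doc[path]))

    # nsmallest scans every document once and keeps only the closest `limit`
    return heapq.nsmallest(limit, documents, key=squared_distance)


# Toy two-dimensional data standing in for the 768-dimensional embeddings
docs = [
    {"_id": 1, "text": "first", "emb": [0.0, 0.0]},
    {"_id": 2, "text": "second", "emb": [1.0, 0.0]},
    {"_id": 3, "text": "third", "emb": [5.0, 5.0]},
]
nearest = brute_force_search([0.1, 0.0], docs, limit=2)
print([doc["_id"] for doc in nearest])  # [1, 2]
```

An exact scan like this costs time proportional to the collection size for every query, which is the cost a vector index is meant to avoid. It is only practical for spot checks on a few thousand documents.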

Running concurrent vector search queries

In [13]:

num_concurrent_queries = 250
start_time = time.time()

with concurrent.futures.ThreadPoolExecutor(max_workers=num_concurrent_queries) as executor:
    futures = [executor.submit(execute_kai_search, query_vector) for _ in range(num_concurrent_queries)]
    concurrent.futures.wait(futures)

end_time = time.time()
print(f"Executed {num_concurrent_queries} concurrent queries.")
print(f"Total execution time: {end_time - start_time} seconds")

for f in futures:
    if f.exception() is not None:
        print(f.exception())
failed_count = sum(1 for f in futures if f.exception() is not None)
print(f"Failed queries: {failed_count}")
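
The cell above reports only the total wall-clock time. A per-query view can be obtained by timing each call inside its worker thread. The following is a minimal sketch under stated assumptions: `timed_call`, `run_concurrently`, and the stand-in `fake_search` are hypothetical names, and `fake_search` only sleeps so that the example runs without a database. In the notebook, `execute_kai_search` would take its place.

```python
import concurrent.futures
import statistics
import time


def timed_call(fn, *args):
    """Run fn(*args) and return the elapsed time in seconds."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


def run_concurrently(fn, args, num_queries, max_workers):
    """Submit num_queries calls of fn(*args); return (latencies, errors)."""
    latencies, errors = [], []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(timed_call, fn, *args) for _ in range(num_queries)]
        for future in concurrent.futures.as_completed(futures):
            try:
                latencies.append(future.result())
            except Exception as exc:  # collect failures instead of stopping
                errors.append(exc)
    return latencies, errors


def fake_search(query_vector):
    """Stand-in for execute_kai_search; sleeps briefly instead of querying."""
    time.sleep(0.01)
    return []


latencies, errors = run_concurrently(fake_search, ([0.0] * 768,), num_queries=20, max_workers=10)
print(f"median latency: {statistics.median(latencies):.4f}s, failures: {len(errors)}")
```

Reporting the median (or a high percentile) alongside the total time makes it easier to tell whether a slow run comes from a few outlier queries or from uniformly higher latency.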

This shows that Kai can create vector indexes instantaneously and run a large number of concurrent vector search queries, surpassing MongoDB Atlas Vector Search capabilities.

Details

About this Template

Run vector search using MongoDB clients and power GenAI use cases for your MongoDB applications.

This Notebook can be run in Standard and Enterprise deployments.

License

This Notebook has been released under the Apache 2.0 open source license.
