Semantic Search with OpenAI Embedding Creation
Notebook
Note
This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace navigate to Start using the left nav. You can also use your existing Standard or Premium workspace with this Notebook.
In this notebook, we will demonstrate an example of conducting semantic search on SingleStoreDB with SQL! Unlike traditional keyword-based search methods, semantic search algorithms take into account the relationships between words and their meanings, enabling them to deliver more accurate and relevant results – even when search terms are vague or ambiguous.
SingleStoreDB’s built-in parallelization and Intel SIMD-based vector processing takes care of the heavy lifting involved in processing vector data. This allows your to run your ML algorithms right in your database extremely efficiently with just 2 lines of SQL!
In this example, we use Open AI embeddings API to create embeddings for our dataset and run semantic_search using dot_product vector matching function!
1. Create a workspace in your workspace group
S-00 is sufficient.
Action Required
If you have a Free Starter Workspace deployed already, select the database from drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.
2. Create a Database named semantic_search
In [1]:
shared_tier_check = %sql show variables like 'is_shared_tier'if not shared_tier_check or shared_tier_check[0][1] == 'OFF':%sql DROP DATABASE IF EXISTS semantic_search;%sql CREATE DATABASE semantic_search;
Action Required
Make sure to select the semantic_search database from the drop-down menu at the top of this notebook. It updates the connection_url which is used by the %%sql magic command and SQLAlchemy to make connections to the selected database.
3. Install and import required libraries
We will use the OpenAI embeddings API and will need to import the relevant dependencies accordingly.
In [2]:
!pip3 install openai==1.3.3 requests --quietimport jsonimport osfrom openai import OpenAIimport requests
4. Create an OpenAI account and get API connection details
To vectorize and embed the employee reviews and query strings, we leverage OpenAI's embeddings API. To use this API, you will need an API key, which you can get here. You'll need to add a payment method to actually get vector embeddings using the API, though the charges are minimal for a small example like we present here.
Action Required
You will have to update your notebook's firewall settings to include *.*.openai.com in order to get embedddings from OpenAI APIS.
In [3]:
import getpassos.environ["OPENAI_API_KEY"] = getpass.getpass('OpenAI API Key: ')client = OpenAI()
5. Create a new table in your database called reviews
In [4]:
%%sqlDROP TABLE IF EXISTS reviews;CREATE TABLE reviews (date_review VARCHAR(255),job_title VARCHAR(255),location VARCHAR(255),review TEXT);
6. Import our sample data into your table
This dataset has 15 reviews left by anonymous employees of a firm.
In [5]:
url = 'https://raw.githubusercontent.com/singlestore-labs/singlestoredb-samples/main/Tutorials/ai-powered-semantic-search/hr_sample_data.sql'
Note that we are using the %sql
magic command here to run a query against the currently
selected database.
In [6]:
for query in [x for x in requests.get(url).text.split('\n') if x.strip()]:%sql {{query}}
7. Add vector embeddings for each review
To embed the reviews in our SingleStoreDB database, we iterate through each row in the table, make a call to OpenAI’s embeddings API with the text in the reviews field and update the new column called embeddings for each entry.
In [7]:
%sql ALTER TABLE reviews ADD embeddings BLOB;from typing import Listreviews = %sql SELECT review FROM reviews;reviews = [x.review for x in reviews]def get_embeddings(inputs: List[str], model: str = 'text-embedding-ada-002') -> List[str]:"""Return list of embeddings."""return [x.embedding for x in client.embeddings.create(input=inputs, model=model).data]embeddings = get_embeddings(reviews)for embedding, review in zip(embeddings, reviews):%sql UPDATE reviews SET embeddings = JSON_ARRAY_PACK('{{json.dumps(embedding)}}') WHERE review='{{review}}';
8. Run the semantic search algorithm with just one line of SQL
We will utilize SingleStoreDB's distributed architecture to efficiently compute the dot product of the input string (stored in searchstring) with each entry in the database and return the top 5 reviews with the highest dot product score. Each vector is normalized to length 1, hence the dot product function essentially computes the cosine similarity between two vectors – an appropriate nearness metric. SingleStoreDB makes this extremely fast because it compiles queries to machine code and runs dot_product using SIMD instructions.
In [8]:
searchstring = input('Please enter a search string: ')search_embedding = json.dumps(get_embeddings([searchstring])[0])results = %sql SELECT review, DOT_PRODUCT(embeddings, JSON_ARRAY_PACK('{{search_embedding}}')) AS score FROM reviews ORDER BY score DESC LIMIT 5;print()for i, res in enumerate(results):print(f'{i + 1}: {res.review} Score: {res.score}\n')
9. Clean up
Action Required
If you created a new database in your Standard or Premium Workspace, you can drop the database by running the cell below. Note: this will not drop your database for Free Starter Workspaces. To drop a Free Starter Workspace, terminate the Workspace using the UI.
In [9]:
shared_tier_check = %sql show variables like 'is_shared_tier'if not shared_tier_check or shared_tier_check[0][1] == 'OFF':%sql DROP DATABASE IF EXISTS semantic_search;
Details
About this Template
Generate embeddings and run semantic search in your database in SQL.
This Notebook can be run in Shared Tier, Standard and Enterprise deployments.
Tags
License
This Notebook has been released under the Apache 2.0 open source license.