Semantic Search with OpenAI QA
Notebook
Note
This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace navigate to Start using the left nav. You can also use your existing Standard or Premium workspace with this Notebook.
In this Notebook you will use a combination of Semantic Search and a Large Language Model (LLM) to build a basic Retrieval Augmented Generation (RAG) application. For a great introduction to RAG, please read A Beginner's Guide to Retrieval Augmented Generation (RAG).
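Before wiring up real models and a database, the RAG loop the notebook builds (retrieve relevant text, augment the prompt, then generate) can be sketched end to end. This is a toy illustration only: the two-dimensional "embeddings", corpus texts, and helper names below are all made up for the example.

```python
import numpy as np

# Toy corpus with made-up 2-D "embeddings"; in the notebook, real
# embeddings come from OpenAI's text-embedding-ada-002 model.
corpus = {
    "Sweden won gold in men's curling.": np.array([1.0, 0.0]),
    "Alpine skiing was held at Yanqing.": np.array([0.0, 1.0]),
}

def retrieve(query_embedding: np.ndarray, top_n: int = 1) -> list[str]:
    """Rank corpus texts by dot-product similarity to the query."""
    scored = sorted(corpus.items(),
                    key=lambda kv: float(np.dot(query_embedding, kv[1])),
                    reverse=True)
    return [text for text, _ in scored[:top_n]]

def build_prompt(query: str, query_embedding: np.ndarray) -> str:
    """Augment the user's question with the retrieved context."""
    context = "\n".join(retrieve(query_embedding))
    return f"Use the context below to answer.\n\n{context}\n\nQuestion: {query}"

prompt = build_prompt("Who won curling gold?", np.array([0.9, 0.1]))
print(prompt)
```

In the real application, the final prompt is what gets sent to the LLM; here it is just printed.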
Prerequisites for interacting with ChatGPT
Install OpenAI package
Let's start by installing the openai Python package.
In [1]:
!pip install openai==1.3.3 --quiet
Connect to ChatGPT and display the response
In [2]:
import openai

EMBEDDING_MODEL = "text-embedding-ada-002"
GPT_MODEL = "gpt-3.5-turbo"
You will need an OpenAI API key in order to use the openai Python library.
In [3]:
import getpass
import os

os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ')

client = openai.OpenAI()
Test the connection.
In [4]:
response = client.chat.completions.create(
    model=GPT_MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the gold medal for curling in Olympics 2022?"},
    ],
)
print(response.choices[0].message.content)
Get the data about Winter Olympics and provide the information to ChatGPT as context
1. Install and import libraries
In [5]:
!pip install tabulate tiktoken wget --quiet
In [6]:
import json
import numpy as np
import os
import pandas as pd
import wget
2. Fetch the CSV data and read it into a DataFrame
Download pre-chunked text and pre-computed embeddings. This file is ~200 MB, so it may take a minute depending on your connection speed.
In [7]:
embeddings_url = "https://cdn.openai.com/API/examples/data/winter_olympics_2022.csv"
embeddings_path = "winter_olympics_2022.csv"

if not os.path.exists(embeddings_path):
    wget.download(embeddings_url, embeddings_path)
    print("File downloaded successfully.")
else:
    print("File already exists in the local file system.")
Here we use the converters= parameter of the pd.read_csv function to convert the JSON arrays in the CSV file into numpy arrays.
In [8]:
def json_to_numpy_array(x: str | None) -> np.ndarray | None:
    """Convert JSON array string into numpy array."""
    return np.array(json.loads(x)) if x else None

df = pd.read_csv(embeddings_path, converters=dict(embedding=json_to_numpy_array))
df
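To see what the converter does in isolation, here is a minimal sketch using a two-row inline CSV with toy embedding values (not real model output): the embedding column arrives as JSON strings and leaves as numpy arrays.

```python
import io
import json

import numpy as np
import pandas as pd

# A tiny CSV mirroring the structure of the downloaded file; the
# bracketed values are JSON-encoded arrays, not real embeddings.
csv_text = 'text,embedding\n"alpha","[0.1, 0.2]"\n"beta","[0.3, 0.4]"\n'

def json_to_numpy_array(x):
    """Convert a JSON array string into a numpy array."""
    return np.array(json.loads(x)) if x else None

toy = pd.read_csv(io.StringIO(csv_text),
                  converters=dict(embedding=json_to_numpy_array))
first = toy.loc[0, "embedding"]
print(type(first).__name__)  # the column now holds numpy arrays
print(first.shape)
```

Without the converter, the embedding column would be plain strings, and the vector math later in the notebook would fail.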
In [9]:
df.info(show_counts=True)
3. Set up the database
Action Required
If you have a Free Starter Workspace deployed already, select the database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.
Create the database.
In [10]:
shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS winter_wikipedia;
    %sql CREATE DATABASE winter_wikipedia;
Action Required
Make sure to select the winter_wikipedia database from the drop-down menu at the top of this notebook. It updates the connection_url which is used by the %%sql magic command and SQLAlchemy to make connections to the selected database.
In [11]:
%%sql
CREATE TABLE IF NOT EXISTS winter_olympics_2022 /* Creating table for sample data. */ (
    id INT PRIMARY KEY,
    text TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
    embedding BLOB
);
4. Populate the table with our DataFrame
Create a SQLAlchemy connection.
In [12]:
import singlestoredb as s2

conn = s2.create_engine().connect()
Use the to_sql method of the DataFrame to upload the data to the requested table.
In [13]:
df.to_sql(
    'winter_olympics_2022',
    con=conn,
    index=True,
    index_label='id',
    if_exists='append',
    chunksize=1000,
)
5. Do a semantic search with the same question from above and use the response to send to OpenAI again
In [14]:
import sqlalchemy as sa


def get_embedding(text: str, model: str = EMBEDDING_MODEL) -> list[float]:
    """Return the embedding of a string."""
    return client.embeddings.create(input=[text], model=model).data[0].embedding


def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    table_name: str,
    top_n: int = 100,
) -> tuple:
    """Return a list of strings and relatednesses, sorted from most related to least."""
    # Get the embedding of the query.
    query_embedding_response = get_embedding(query, EMBEDDING_MODEL)

    # Create the SQL statement.
    stmt = sa.text(f"""
        SELECT
            text,
            DOT_PRODUCT_F64(JSON_ARRAY_PACK_F64(:embedding), embedding) AS score
        FROM {table_name}
        ORDER BY score DESC
        LIMIT :limit
    """)

    # Execute the SQL statement.
    results = conn.execute(
        stmt,
        dict(embedding=json.dumps(query_embedding_response), limit=top_n),
    )

    strings = []
    relatednesses = []
    for row in results:
        strings.append(row[0])
        relatednesses.append(row[1])

    # Return the results.
    return strings[:top_n], relatednesses[:top_n]
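The SQL above ranks rows by DOT_PRODUCT rather than cosine distance. For unit-length vectors the two orderings agree, since cosine similarity is the dot product divided by the product of the norms, and ada-002 embeddings are (approximately) unit-normalized. A minimal numpy check on random vectors normalized to unit length:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=8)
b = rng.normal(size=8)

# Normalize to unit length, as ada-002 embeddings (approximately) are.
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = np.dot(a, b)

# For unit vectors the denominator is 1, so the two values coincide.
print(abs(cosine_sim - dot) < 1e-12)
```

This is why the in-database dot-product search returns the same ranking a cosine-distance search would.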
In [15]:
from tabulate import tabulate

strings, relatednesses = strings_ranked_by_relatedness(
    "curling gold medal",
    df,
    "winter_olympics_2022",
    top_n=5,
)

for string, relatedness in zip(strings, relatednesses):
    print(f"{relatedness=:.3f}")
    print(tabulate([[string]], headers=['Result'], tablefmt='fancy_grid'))
    print('\n\n')
In [16]:
import tiktoken


def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int,
) -> str:
    """Return a message for GPT, with relevant source texts pulled from SingleStoreDB."""
    strings, relatednesses = strings_ranked_by_relatedness(query, df, "winter_olympics_2022")
    introduction = 'Use the below articles on the 2022 Winter Olympics to answer the subsequent question. If the answer cannot be found in the articles, write "I could not find an answer."'
    question = f"\n\nQuestion: {query}"
    message = introduction
    for string in strings:
        next_article = f'\n\nWikipedia article section:\n"""\n{string}\n"""'
        if num_tokens(message + next_article + question, model=model) > token_budget:
            break
        else:
            message += next_article
    return message + question


def ask(
    query: str,
    df: pd.DataFrame = df,
    model: str = GPT_MODEL,
    token_budget: int = 4096 - 500,
    print_message: bool = False,
) -> str:
    """Answer a query using GPT and a table of relevant texts and embeddings in SingleStoreDB."""
    message = query_message(query, df, model=model, token_budget=token_budget)
    if print_message:
        print(message)
    messages = [
        {"role": "system", "content": "You answer questions about the 2022 Winter Olympics."},
        {"role": "user", "content": message},
    ]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
    )
    return response.choices[0].message.content
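The token-budget loop in query_message appends retrieved sections in relevance order and stops as soon as the next one would overflow the budget. The same logic can be isolated with a toy whitespace "tokenizer" standing in for tiktoken (the helper names and budget below are illustrative, not the notebook's API):

```python
def toy_num_tokens(text: str) -> int:
    """Stand-in for tiktoken: count whitespace-separated words."""
    return len(text.split())

def fit_to_budget(intro: str, sections: list[str], question: str,
                  budget: int) -> str:
    """Append sections in order until adding one would exceed the budget."""
    message = intro
    for section in sections:
        if toy_num_tokens(message + " " + section + " " + question) > budget:
            break
        message += " " + section
    return message + " " + question

sections = ["first relevant section", "second relevant section",
            "third relevant section"]
msg = fit_to_budget("Context:", sections, "Q?", budget=8)
print(msg)
```

With a budget of 8 "tokens", the first two sections fit but the third is dropped, mirroring how the real function keeps the most related articles and truncates the rest.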
In [17]:
print(ask('Who won the gold medal for curling in Olympics 2022?'))
Clean up
Action Required
If you created a new database in your Standard or Premium Workspace, you can drop the database by running the cell below. Note: this will not drop your database for Free Starter Workspaces. To drop a Free Starter Workspace, terminate the Workspace using the UI.
In [18]:
shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS winter_wikipedia;
Details
About this Template
Provide context to ChatGPT using data stored in SingleStoreDB.
This Notebook can be run in Shared Tier, Standard and Enterprise deployments.
Tags
License
This Notebook has been released under the Apache 2.0 open source license.