Evaluating LLMs with UpTrain
Scoring LLM Results with UpTrain and SingleStoreDB
Welcome to this comprehensive guide on evaluating LLM applications and experimenting with different retrieval configurations using UpTrain and SingleStoreDB. It offers step-by-step instructions, code explanations, and best practices along the way.
Overview
UpTrain is an open-source LLM evaluation tool. It provides pre-built metrics to check LLM responses for qualities such as correctness, hallucination, and toxicity, and it offers an easy-to-use framework for configuring custom checks. SingleStoreDB, meanwhile, is a fast, scalable, SQL-compliant relational database. By combining UpTrain's evaluation capabilities with SingleStoreDB's efficient storage and retrieval, we can build highly performant LLM applications.
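To get a feel for the evaluation API before building anything, here is a minimal sketch of an UpTrain check. It assumes the open-source EvalLLM client shipped with uptrain 0.7 and a placeholder OpenAI key; the rest of this guide uses the hosted APIClient instead.

from uptrain import EvalLLM, Evals

# Minimal sketch: score a single hand-written row with the
# open-source EvalLLM client (placeholder OpenAI key).
eval_llm = EvalLLM(openai_api_key='sk-...')

sample = [{
    'question': 'What is SingleStoreDB?',
    'context': 'SingleStoreDB is a fast, scalable, SQL-compliant relational database.',
    'response': 'SingleStoreDB is a fast, scalable relational database.',
}]

print(eval_llm.evaluate(data=sample, checks=[Evals.CONTEXT_RELEVANCE]))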
What You'll Learn
Setting up your environment with the necessary packages and credentials.
Creating a simple RAG-based application using SingleStoreDB, OpenAI, and Langchain.
Storing and managing data efficiently using SingleStoreDB.
Leveraging the power of UpTrain to evaluate the quality of our application.
Experimenting with different chunking strategies and quantifying the results.
Utilizing UpTrain's framework for data-driven experimentation and refinement.
Prerequisites
Basic knowledge of Python programming.
An UpTrain account.
A SingleStoreDB workspace.
Let's dive in and start building!
Create a workspace in your workspace group
An S-00 workspace size is sufficient.
Create a Database named evaluate_llm
In [1]:
%%sql
DROP DATABASE IF EXISTS evaluate_llm;
CREATE DATABASE evaluate_llm;
Setting up the environment: Before we begin, make sure all the necessary packages are installed. Run the cell below to install the required libraries for this project: uptrain, openai, langchain, and tiktoken.
In [2]:
%pip install uptrain==0.7.1 openai==1.6.1 langchain==0.1.4 tiktoken==0.5.2 --quiet
Authentication: Next, set the required environment variables: the OpenAI API key (for generating responses), the SingleStoreDB connection (for context retrieval), and the UpTrain API key (for evaluating responses). You can create an UpTrain account and generate an API key for free at https://uptrain.ai/.
In [3]:
import getpass
import os

os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ')

import openai

client = openai.OpenAI()
In [4]:
UPTRAIN_API_KEY = getpass.getpass('Uptrain API Key: ')
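If you are running this notebook outside of a SingleStore managed workspace, you will also need to tell the client how to reach your database. A minimal sketch, assuming the standard SINGLESTOREDB_URL environment variable and placeholder credentials (inside a managed workspace the connection is configured for you):

import os

# Placeholder credentials; not needed inside a managed workspace.
os.environ['SINGLESTOREDB_URL'] = '<user>:<password>@<host>:3306/evaluate_llm'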
Importing Necessary Modules: With the initial setup complete, let's import the essential classes and modules we'll use throughout this project. The following cell imports the required classes from uptrain, langchain, and singlestoredb.
In [5]:
import singlestoredb
from uptrain import APIClient, Evals
from langchain.vectorstores import SingleStoreDB
from langchain.embeddings import OpenAIEmbeddings
Loading Data from the Web: Our application requires data to process and generate insights. In this step, we'll fetch content from a URL using the WebBaseLoader class. The loaded data will be stored in the data variable. You can replace the URL with any other source if needed.
In [6]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader('https://cloud.google.com/vertex-ai/docs/generative-ai/learn/generative-ai-studio')
data = loader.load()
Splitting the Data: To process the data more efficiently, we'll split the loaded content into smaller chunks. The RecursiveCharacterTextSplitter class helps in achieving this by dividing the data based on specified character limits.
In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)
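Before storing anything, it is worth sanity-checking the splitter's output. A quick, optional look at the chunk count and sizes:

# Optional sanity check on the splitter output.
print(f'Number of chunks: {len(all_splits)}')
print(f'Longest chunk: {max(len(s.page_content) for s in all_splits)} characters')
print(all_splits[0].page_content)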
Setting Up SingleStoreDB with OpenAI Embeddings: For efficient storage and retrieval of our data, we use SingleStoreDB in conjunction with OpenAI embeddings. The following cell sets up the necessary environment variables and initializes the SingleStoreDB instance with OpenAI embeddings. Ensure you have the correct SingleStoreDB URL and credentials set.
In [8]:
import os

from langchain.vectorstores import SingleStoreDB
from langchain.embeddings import OpenAIEmbeddings
from singlestoredb import create_engine

conn = create_engine().connect()

vectorstore = SingleStoreDB.from_documents(
    documents=all_splits,
    embedding=OpenAIEmbeddings(),
    table_name='vertex_ai_docs_chunk_size_200',
)
Setting Up the QA Prompt: Once our data is processed and stored, we can use it to answer queries. The following cell defines a generate_llm_response function, which finds the document closest to the given question via vector similarity search and uses OpenAI's GPT-3.5-Turbo to generate a response.
In [9]:
def generate_llm_response(question, vectorstore):
    # Retrieve the single most similar chunk for the question.
    documents = vectorstore.similarity_search(question, k=1)
    context = " , ".join([x.page_content for x in documents])

    prompt = f"""Answer the following user query using the retrieved document in less than 3 sentences:
    {question}
    The retrieved document has the following text:
    {context}
    Answer:"""

    # Generate the answer with GPT-3.5-Turbo at low temperature.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": prompt}],
        temperature=0.1,
    ).choices[0].message.content

    return [{'question': question, 'context': context, 'response': response}]
Let's try it out: Ask our QnA bot a question about Vertex AI.
In [11]:
generate_llm_response('What is Vertex AI?', vectorstore)
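The helper returns a list containing a single dictionary with the question, the retrieved context, and the generated response, roughly like the following (response text illustrative):

# [{'question': 'What is Vertex AI?',
#   'context': '...retrieved chunk text...',
#   'response': 'Vertex AI is ...'}]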
Let's define more questions: We now define a set of questions to test our bot on and evaluate the quality of its responses.
In [12]:
questions = [
    "What is the primary purpose of Generative AI Studio?",
    'What is Responsible AI?',
    'What is Prompt Designing?',
    'What is Vertex AI?',
    "What are some of the tasks you can perform in Generative AI Studio?",
    'Which method is good, Prompt Designing or fine-tuning a model?',
    'How to get good quality responses from llm?',
    'What are some of the foundation models offered by Vertex AI?',
    "How can you ensure that a designed prompt elicits the desired response from a language model?",
    'How to use Generative AI studio to convert text to speech.',
    "Where can you find sample prompts to test models in Generative AI Studio?",
    'How can I customize the foundation models offered by vertex AI?',
    "What are some code examples from vertex ai?",
]

results = []
for question in questions:
    results.extend(generate_llm_response(question, vectorstore))
Running Evaluations using UpTrain: UpTrain provides an APIClient that can be initialized with UPTRAIN_API_KEY. Its log_and_evaluate method takes the input data to be evaluated along with the list of checks to run, and it returns the scores along with explanations.
In [13]:
from uptrain import APIClient

eval_client = APIClient(uptrain_api_key=UPTRAIN_API_KEY)

eval_client.log_and_evaluate(
    project_name='VertexAI-QnA-Bot-Evals',
    data=results,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.FACTUAL_ACCURACY],
);
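Since log_and_evaluate returns the evaluated rows, you can also inspect scores locally in addition to the dashboard. A minimal sketch, re-running the call above with its return value captured and assuming each returned row carries score_<check_name> keys (the UpTrain 0.7 convention):

# Assumes log_and_evaluate returns one dict per row with
# score_<check_name> keys; flag rows with low context relevance.
evaluated = eval_client.log_and_evaluate(
    project_name='VertexAI-QnA-Bot-Evals',
    data=results,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.FACTUAL_ACCURACY],
)
low_relevance = [r for r in evaluated if r.get('score_context_relevance', 1.0) < 0.5]
print(f'{len(low_relevance)} of {len(evaluated)} responses had low context relevance')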
Access UpTrain Dashboards: The evaluation results can be viewed at https://demo.uptrain.ai/dashboard/; the same API key is used to access the dashboards.
Running Experiments using UpTrain: Let's also see how UpTrain can be used to conduct data-driven experimentation. We will increase the chunk_size from 200 to 1000 and see how that impacts the context retrieval quality.
Generate new embeddings: We will again use SingleStoreDB to store the new document embeddings, this time generated from the larger chunks.
In [14]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)

vectorstore_new = SingleStoreDB.from_documents(
    documents=all_splits,
    embedding=OpenAIEmbeddings(),
    table_name='vertex_ai_docs_chunk_size_1000',
)
Generate responses with new vectorstore: Let's generate new responses for the same set of questions.
In [15]:
results_larger_chunk = []
for question in questions:
    results_larger_chunk.extend(generate_llm_response(question, vectorstore_new))
Append chunk size information: Let's add the corresponding chunk size to each set of results. We will pass this column name to UpTrain so it can compare the two experiments.
In [16]:
for x in results:
    x.update({'chunk_size': 200})

for x in results_larger_chunk:
    x.update({'chunk_size': 1000})
Evaluating Experiments using UpTrain: UpTrain's APIClient also provides an evaluate_experiments method, which takes the input data to be evaluated along with the list of checks to run and the names of the columns associated with the experiment.
In [17]:
eval_client.evaluate_experiments(
    project_name='VertexAI-QnA-Bot-Chunk-Size-Experiments',
    data=results + results_larger_chunk,
    checks=[Evals.CONTEXT_RELEVANCE],
    exp_columns=['chunk_size'],
);
About this Template
Using UpTrain to evaluate LLM applications built with SingleStoreDB as the contextual store. This notebook uses OpenAI embedding models and Langchain as a development framework.
This Notebook can be run in Standard and Enterprise deployments.
License
This Notebook has been released under the Apache 2.0 open source license.