A Beginner's Guide to Retrieval Augmented Generation (RAG)

10 min read

Oct 27, 2023

New to the world of Retrieval Augmented Generation (RAG)? We've got you covered with this in-depth guide on what it is, advantages and real-time use cases.


Large language models (LLMs) are quickly becoming the backbone of many organizations as the world transitions toward AI. But for all their strengths, LLMs have drawbacks when used carelessly: they can produce unexpected responses, fabricate information or reflect bias. When an LLM generates plausible-sounding but false or made-up content, we call it hallucination. There are several notable approaches to mitigating LLM hallucinations, including fine-tuning, prompt engineering and Retrieval Augmented Generation (RAG). RAG has become the most talked-about of these approaches, and in this guide we'll show you how it works.

What is Retrieval Augmented Generation (RAG)?

RAG is one of the techniques used to mitigate LLM hallucinations. For a user query, RAG retrieves relevant information from a provided data source that is stored in a vector database. A vector database is a specialized database, distinct from traditional relational databases, built to store vector data. This vector data takes the form of embeddings, numerical representations that capture the context and meaning of the underlying objects.

For example, suppose you want your AI application to return custom responses grounded in your own data. First, the organization's documents are converted into embeddings with an embedding model and stored in a vector database. When a query is sent to the application, it is converted into a query embedding and matched against the vector database using vector similarity search to find the most similar objects. Because the LLM is instructed to answer from this retrieved custom data, your application is far less likely to hallucinate.

One simple use case is a customer support application: the custom data is stored in a vector database and, when a user query comes in, the application generates the most appropriate response about your products or services rather than a generic answer. RAG is transforming many other fields in the same way.

The RAG pipeline involves three critical components: retrieval, augmentation and generation.

  • Retrieval: This component fetches relevant information from an external knowledge base, such as a vector database, for any given user query. It is crucial because it is the first step in curating meaningful and contextually correct responses.

  • Augmentation: This part involves combining the retrieved context with the user query, enriching the prompt the model will see with the most relevant information.

  • Generation: Finally, the large language model (LLM) produces the output presented to the user. The LLM combines its own knowledge with the provided context to come up with an apt response to the user's query.

These three components form the basis of a RAG pipeline, helping users get the contextually rich and accurate responses they are looking for. That is why RAG is so well suited to building chatbots, question-answering systems and similar applications.
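To make these steps concrete, here is a minimal, runnable Python sketch of a RAG pipeline. Keyword overlap stands in for vector similarity search and a placeholder stands in for the LLM call; it only illustrates how retrieval, augmentation and generation fit together, and the tutorial later in this guide swaps in real embeddings, a vector database and an LLM.

# Toy RAG pipeline: keyword overlap replaces vector search, a stub replaces the LLM
KNOWLEDGE_BASE = [
    "Our support hours are 9am to 5pm, Monday through Friday.",
    "Refunds are processed within 5 business days.",
    "The premium plan includes priority support.",
]

def retrieve(query, top_k=2):
    # Retrieval: rank documents by naive keyword overlap (stand-in for vector similarity search)
    words = set(query.lower().split())
    ranked = sorted(KNOWLEDGE_BASE, key=lambda doc: len(words & set(doc.lower().split())), reverse=True)
    return ranked[:top_k]

def augment(query, contexts):
    # Augmentation: add the retrieved context to the prompt sent to the model
    return "Answer using only this context:\n" + "\n".join(contexts) + "\n\nQuestion: " + query

def generate(prompt):
    # Generation: a real LLM call would go here; we simply echo the augmented prompt
    return "[LLM would answer based on]\n" + prompt

question = "How long do refunds take?"
print(generate(augment(question, retrieve(question))))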

The key advantage of RAG is that it allows the model to pull in real-time information from external sources, making it more dynamic and adaptable to new information. It's particularly useful for tasks where the model needs to reference specific details that might not be present in its pre-trained knowledge, like fact-checking or answering questions about recent events.

Advantages of Retrieval Augmented Generation

RAG has some incredible advantages. Here are the notable ones:

  • Scalability. The RAG approach helps you scale: instead of retraining the model, you simply update or add external/custom data in your external database (a vector database), as shown in the sketch after this list.
  • Memory efficiency. Models like GPT are limited to what they memorized during training and cannot pull in fresh information on their own. RAG leverages an external database, such as a vector database, allowing it to quickly pull in fresh, updated or detailed information whenever it is needed.
  • Flexibility. By updating or expanding the external knowledge source, you can adapt the same RAG setup to a wide range of AI applications.
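To illustrate the scalability point above, here is a short, hedged sketch of adding new knowledge to an existing vector store without touching the model itself. It assumes the vector_database object created in the tutorial later in this guide (a LangChain SingleStoreDB vector store) and relies on the add_texts method that LangChain vector stores generally expose; the example facts are hypothetical.

# Assumes `vector_database` is the LangChain SingleStoreDB vector store built in the tutorial below.
# Adding documents updates the knowledge base immediately; no retraining or fine-tuning is needed.
new_facts = [
    "Our new data center in Frankfurt opened in March.",
    "The enterprise plan now includes a 99.99% uptime SLA.",
]
vector_database.add_texts(new_facts)

# Subsequent similarity searches can retrieve the new information right away.
results = vector_database.similarity_search("What uptime SLA does the enterprise plan include?")
print(results[0].page_content)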

Retrieval Augmented Generation (RAG) applications

RAG can be extremely useful in scenarios where detailed, context-aware answers are required, including:

  • Question answering systems. Providing detailed and contextually correct answers to user queries by pulling from extensive knowledge bases.
  • Content creation. Assisting writers, authors and creators by providing relevant, up-to-date information and facts to enrich the content creation process.
  • Research assistance. Instead of manually searching through countless documents and websites, researchers can use RAG to quickly access pertinent data or studies related to their query.

RAG real-time use case example

RAG has a range of potential applications, and one real-life use case is in the domain of chat applications. RAG enhances chatbot capabilities by integrating real-time data. Consider a sports league chatbot. Traditional LLMs can answer historical questions but struggle with recent events, like last night's game details.

RAG allows the chatbot to access up-to-date databases, news feeds and player bios. This means users receive timely, accurate responses about recent games or player injuries. For instance, Cohere's chatbot provides real-time details about Canary Islands vacation rentals — from beach accessibility to nearby volleyball courts. Essentially, RAG bridges the gap between static LLM knowledge and dynamic, current information.

RAG using LangChain

LangChain streamlines RAG by simplifying the interface between vast data repositories and Large Language Models (LLMs). It splits large documents into manageable chunks, embeds them as vectors and stores them for rapid retrieval. When a user submits a prompt, LangChain quickly queries its vector store to pinpoint the relevant data.

This focused data is then passed to the LLM, which crafts a precise, context-rich response. The combination of LangChain's efficient data management and the LLM's generation capabilities ensures users receive accurate, data-backed responses. As an open-source framework, LangChain makes this kind of context-aware content generation and retrieval straightforward to build.

Take a look at SingleStore’s integration with LangChain.

The RAG process for AI applications flows as follows. It starts with end users posing a query, or "ask" (step 1). This inquiry is directed to a gen AI app, which then searches and retrieves relevant information from a company data repository (step 2). Once the data is fetched, it is used to build the prompt that instructs the LLM (step 3).

The LLMs then generate an appropriate response based on the prompt and the initial query, synthesizing the retrieved data to provide a coherent and informed answer back to the end user. This RAG process combines the capabilities of information retrieval with the advanced generative capabilities of language models to offer detailed, contextually accurate answers.

Fine-tuning vs. Retrieval Augmented Generation

Fine-tuning refers to the process of adapting a pre-existing, broadly trained model to a specific task or domain. Initially, an LLM is trained on a vast corpus of data to understand language structures, patterns and nuances, a phase often referred to as "pre-training."

Once this generalized understanding is established, the model can be further refined or fine-tuned on a smaller, specialized dataset tailored for a particular application, such as medical text generation, legal document analysis or customer support responses. This fine-tuning step enables the model to leverage its broad knowledge from pre-training while specializing in the nuances and specifics of the target domain, ensuring better performance on the desired task.
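For contrast with RAG, here is a hedged sketch of what launching a fine-tuning job can look like with the OpenAI Python SDK. The dataset file and base model names are placeholders, not prescriptions; check the provider's documentation for the formats and models your account supports.

from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")  # placeholder key

# 1. Upload a small, task-specific training file (a JSONL of example conversations)
training_file = client.files.create(
    file=open("support_examples.jsonl", "rb"),  # hypothetical dataset
    purpose="fine-tune",
)

# 2. Launch the fine-tuning job; the provider trains a specialized copy of the base model
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",  # example base model; confirm which models support fine-tuning
)
print(job.id, job.status)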

RAG and fine-tuning are both techniques to adapt pre-trained language models to specific tasks or domains. Here's a comparison of the two:

Definition
  • RAG: Combines large-scale knowledge retrieval with sequence generation; retrieves relevant documents and generates an answer using them.
  • Fine-tuning: Refines a pre-trained model on specific tasks using a smaller dataset, adjusting the model's weights to specialize it for a particular task.

Advantages
  • RAG: Can leverage vast external knowledge.
  • Fine-tuning: Can achieve strong performance on specific tasks; efficient, especially when limited data is available for the task.

Challenges
  • RAG: Requires an efficient retrieval mechanism; potential to retrieve irrelevant documents.
  • Fine-tuning: Computationally more intensive; risk of overfitting if not enough data is available, or if training is too aggressive; knowledge is limited up to the last training cut-off.

Use cases
  • RAG: Open-domain question answering (QA); dynamic responses in chatbots; situations where new data emerges frequently.
  • Fine-tuning: Task-specific applications like sentiment analysis; niche domains with unique datasets.

Examples
  • RAG: OpenAI's RAG model for QA.
  • Fine-tuning: Fine-tuning GPT models for specific domains or tasks.

SingleStoreDB

Integrating SingleStore with the RAG model in an AI application can be a powerful combination. SingleStoreDB is a distributed, relational database that excels in high-performance, real-time analytics. By integrating SingleStoreDB, you can ensure that the RAG model has fast and efficient access to vast amounts of data, which can be crucial for real-time response generation.

By integrating SingleStoreDB with the RAG model, you can harness the power of real-time analytics and fast data retrieval, ensuring that your chat application provides timely and relevant responses to user queries.

For a more in-depth understanding, watch SingleStore CMO Madhukar Kumar's talk, 'Building a Generative AI App on Private Enterprise Data With Retrieval Augmented Generation (RAG)'.

Also, check out this article by Ronny Hoesada, DevRel Engineer at Unstructured.io (a SingleStore partner), on Building a Q+A Retrieval Augmented Generation (RAG) System with Slack Data Using Unstructured and SingleStoreDB.

RAG tutorial

Let’s build a simple AI application that fetches contextually relevant information from our own data for any given user query.

Sign up for SingleStore to use it as your AI database. Once you sign up, you need to create a workspace — which is easy and free.

Once you create your workspace, create a database with any name you choose.

Create the database from the 'Create Database' tab on the right side of the workspace page. Then, go to 'Develop' to use our Notebooks feature (similar to Jupyter Notebooks).

Create a new Notebook, and name it whatever you’d like.

Before doing anything, select your workspace and database from the dropdown on Notebooks.

Now, start adding the following code snippets to the Notebook you just created.

Install the required libraries

!pip install openai numpy pandas singlestoredb langchain==0.1.8 langchain-community==0.0.21 langchain-core==0.1.25 langchain-openai==0.0.6

Vector embeddings example

def word_to_vector(word):
    # Define some basic rules for our vector components
    vector = [0] * 5  # Initialize a vector of 5 dimensions

    # Rule 1: Length of the word (normalized to a max of 10 characters for simplicity)
    vector[0] = len(word) / 10

    # Rule 2: Number of vowels in the word (normalized to the length of the word)
    vowels = 'aeiou'
    vector[1] = sum(1 for char in word if char in vowels) / len(word)

    # Rule 3: Whether the word starts with a vowel (1) or not (0)
    vector[2] = 1 if word[0] in vowels else 0

    # Rule 4: Whether the word ends with a vowel (1) or not (0)
    vector[3] = 1 if word[-1] in vowels else 0

    # Rule 5: Percentage of consonants in the word
    vector[4] = sum(1 for char in word if char not in vowels and char.isalpha()) / len(word)

    return vector

# Example usage
word = "example"
vector = word_to_vector(word)
print(f"Word: {word}\nVector: {vector}")

Vector similarity example

import numpy as np

def cosine_similarity(vector_a, vector_b):
    # Calculate the dot product of the vectors
    dot_product = np.dot(vector_a, vector_b)

    # Calculate the norm (magnitude) of each vector
    norm_a = np.linalg.norm(vector_a)
    norm_b = np.linalg.norm(vector_b)

    # Calculate cosine similarity
    similarity = dot_product / (norm_a * norm_b)
    return similarity

# Example usage
word1 = "example"
word2 = "sample"
vector1 = word_to_vector(word1)
vector2 = word_to_vector(word2)

# Calculate and print cosine similarity
similarity_score = cosine_similarity(vector1, vector2)
print(f"Cosine similarity between '{word1}' and '{word2}': {similarity_score}")

Embedding models

OPENAI_KEY = "INSERT OPENAI KEY"  # Replace with your OpenAI API key

from openai import OpenAI

client = OpenAI(api_key=OPENAI_KEY)

def openAIEmbeddings(input):
    # Request an embedding for the given input text
    response = client.embeddings.create(
        input=input,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

print(openAIEmbeddings("Golden Retriever"))

Creating a vector database with SingleStore

We will use the LangChain framework, with SingleStore as the vector database to store our embeddings, and a link to a public .txt file containing one of the Sherlock Holmes stories.

Add your OpenAI API key as an environment variable.

import os
os.environ['OPENAI_API_KEY'] = 'mention your openai api key'

Then add the following code.

import os
import requests
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores.singlestoredb import SingleStoreDB

# URL of the public .txt file you want to use
file_url = "https://sherlock-holm.es/stories/plain-text/stud.txt"

# Send a GET request to the file URL
response = requests.get(file_url)

# Proceed if the file was successfully downloaded
if response.status_code == 200:
    file_content = response.text

    # Save the content to a file
    file_path = 'downloaded_example.txt'
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write(file_content)

    # Load and process documents
    loader = TextLoader(file_path)  # Use the downloaded document
    documents = loader.load()
    text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
    docs = text_splitter.split_documents(documents)

    # Generate embeddings
    OPENAI_KEY = "INSERT OPENAI KEY"  # Replace with your OpenAI API key
    embeddings = OpenAIEmbeddings(api_key=OPENAI_KEY)

    # Create the vector database ("scarlet" is the table name; use any name you like)
    vector_database = SingleStoreDB.from_documents(docs, embeddings, table_name="scarlet")

    # Run a similarity search against the stored embeddings
    query = "which university did he study?"
    docs = vector_database.similarity_search(query)
    print(docs[0].page_content)
else:
    print("Failed to download the file. Please check the URL and try again.")

Once you've run the code, you can change the query to any question you'd like to ask about the Sherlock Holmes story.

We retrieved the relevant information from the provided data and used it to guide the response generation process. By converting our file into embeddings and storing them in a SingleStore database, we created a retrievable corpus of information, ensuring the responses are not only relevant but also rich in content derived from the provided dataset.
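The code above covers the retrieval step; to complete the pipeline, the retrieved passages are handed to an LLM for generation. Here is a hedged sketch that wires the vector_database we just created into LangChain's RetrievalQA chain. The chain, model name and parameters are assumptions based on the library versions pinned earlier, so adjust them to match your environment.

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

# Wrap the SingleStore vector store as a retriever and attach an LLM for the generation step
llm = ChatOpenAI(api_key=OPENAI_KEY, model="gpt-3.5-turbo", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vector_database.as_retriever(),
)

# The chain retrieves the most relevant chunks and asks the LLM to answer using them
result = qa_chain.invoke({"query": "Which university did he study at?"})
print(result["result"])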

Conclusion

Retrieval Augmented Generation represents a significant leap in the evolution of language models. By combining the power of retrieval mechanisms with sequence-to-sequence generation, RAG models can provide richer, more detailed and contextually relevant outputs. As the field advances, we can expect to see even more sophisticated integrations of these components, paving the way for AI models that are not just knowledgeable, but also resourceful.

