Today I’m going to try using LlamaIndex to implement RAG for the AI Assistant application I’m building.
LlamaIndex seems to interoperate with LangChain, which I currently use for my RAG implementations. I want to see whether the experience I have from building chat assistants on Langchain can jumpstart my coding on LlamaIndex v0.10.
Installing LlamaIndex seemed straightforward.
pip install llama-index
I use the Poetry package manager, so:
poetry add llama-index
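Since I use Milvus as my vector store later on, the Milvus integration most likely needs its own package as well, given the split-package layout of LlamaIndex v0.10 (this line is my assumption, not from the original install docs):
poetry add llama-index-vector-stores-milvus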
Current Langchain Code.
import os

# Importing necessary classes from the langchain library
from langchain.chains import RetrievalQA
from langchain.memory import ConversationBufferMemory
from langchain.prompts.prompt import PromptTemplate

# Importing the Replicate class from langchain.llms module
from langchain.llms import Replicate

# Environment variable for setting the embedding model name, default is as specified
EMBEDDING_MODEL_NAME = os.getenv(
    "EMBEDDING_MODEL_NAME", "andersonbcdefg/bge-small-4096"
)

# Environment variables for configuring the Milvus host and port, with default values
MILVUS_HOST = os.getenv("MILVUS_HOST", "localhost")
MILVUS_PORT = os.getenv("MILVUS_PORT", "19530")


def replace_slash_with_dash(model_name):
    # Function to replace slashes in a model name with dashes
    modified_model_name = model_name.replace("/", "-")
    return modified_model_name


def use_replicate_qa(vectordb, chat_history):
    # Convert vector database to a retriever object
    retriever = vectordb.as_retriever(search_kwargs={"k": 2})

    # Create an instance of the Replicate model with specified parameters
    llm = Replicate(
        model="mistralai/mixtral-8x7b-instruct-v0.1:cf18decbf51c27fed6bbdc3492312c1c903222a56e3fe9ca02d6cbe5198afc10",
        model_kwargs={
            "temperature": 0.75,
            "max_length": 500,
            "top_p": 1,
            "stop_sequences": "\n ",
        },
    )

    # Define a template for prompting with specific instructions for the assistant behavior
    template = """
You are a helpful, respectful and honest assistant. Context is provided within <context> </context> tags. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. Respond with the answer only.
<context>
{context}
</context>
<Conversation History starts>
{history}
<Conversation History ends>
{question}
"""

    # Creating a prompt template with the above-defined template
    PROMPT = PromptTemplate(
        input_variables=["history", "context", "question"],
        template=template,
    )

    # Initialize conversation memory for managing chat history
    memory = ConversationBufferMemory(
        memory_key="history",  # Key for chat history
        input_key="question",  # Key for current question
        # Uncomment or add other parameters as needed
        # return_messages=True,
        # human_prefix=user_email,
        # ai_prefix="TohjuApp",
    )

    # Process each message in chat history and update the conversation memory accordingly
    for msg in chat_history:
        if is_ai_message(msg):
            msg_content = msg["content"]
            memory.chat_memory.add_ai_message(msg_content)
        else:
            msg_content = msg["content"]
            memory.chat_memory.add_user_message(msg_content)

    # Setting up the retrieval-based QA chain with the given large language model, prompt, and memory
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True,
        chain_type_kwargs={
            "verbose": True,
            "prompt": PROMPT,
            "memory": memory,
        },
    )

    # Returns the fully constructed QA chain
    return qa_chain


def is_ai_message(msg):
    # Function to determine if the sender of a message is an AI based on the email address
    return "[email protected]" in msg["sender"]
The code establishes a framework for a conversational AI system using the langchain library.
Specifically, it:
- Sets up environment variables to configure the embedding model and Milvus vector database connection details.
- Provides a utility function to format model names by replacing slashes with dashes.
- Defines a function use_replicate_qa that:
  - Initializes a retriever object from the vector database.
  - Creates a Replicate large language model instance with specific configuration parameters.
  - Configures a prompt template instructing how the AI should behave during conversations.
  - Initializes a conversation buffer memory to keep track of the chat history.
  - Iterates over the chat history, distinguishing and storing messages from users and the AI system.
  - Builds a retrieval-based question answering (QA) chain which integrates the retriever, language model, prompt template, and conversation memory to generate responses.
- Includes a helper function is_ai_message to check whether a message in the chat history was sent by an AI, based on the sender’s email address.
In summary, the main goal of this code is to construct a question-answering AI that can engage in conversations by utilizing the provided chat history and dynamically generating accurate and contextually appropriate responses.
Loading Documents Using LlamaIndex
Copying the previous code over to a new file, I can replace the Langchain functionality with LlamaIndex. I have noticed that a few things are done differently: similar in spirit, but different in detail. For example, loading documents in LlamaIndex is much easier than in Langchain.
Here is a Langchain code snippet. 🫣
from langchain.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = DirectoryLoader('./datasets/new_papers/', glob="./*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)
In LlamaIndex 🥰
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader('./datasets/new_papers/').load_data()
The easiest reader to use is our SimpleDirectoryReader, which creates documents out of every file in a given directory. It is built in to LlamaIndex and can read a variety of formats including Markdown, PDFs, Word documents, PowerPoint decks, images, audio and video.
https://docs.llamaindex.ai/en/stable/understanding/loading/loading.html
LlamaIndex takes care of chunking and text-overlaps under the hood. There are options you can configure for more control.
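For example, here is one way to set the chunk size and overlap globally. This is a minimal sketch assuming the v0.10 Settings object, and the values are only illustrative, not requirements:
from llama_index.core import Settings

# Applied globally whenever documents are parsed into nodes
Settings.chunk_size = 1000
Settings.chunk_overlap = 200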
Embedding Documents
Embeddings are a way of encoding your documents into a numerical form (vectors) that Langchain or LlamaIndex can work with. Embedding models transform text into a long sequence of numbers that reflects the meaning of the text. These models have learned to do this from large amounts of data, and they help with many tasks, such as search. In simple terms, if a user wants to know something about dogs, then the embedding for that query will be very close to text that mentions dogs.
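To make the dog example concrete, here is a small sketch. I'm assuming the OpenAI embedding integration here; any embedding model would behave similarly:
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding()
query_vec = embed_model.get_query_embedding("What breeds of dog are good with kids?")
text_vec = embed_model.get_text_embedding("Labradors are gentle, family-friendly dogs.")
# Cosine similarity by default; a higher score means the texts are semantically closer
print(embed_model.similarity(query_vec, text_vec))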
I want to take these documents and turn them into an embedding.
Using Langchain 👍️
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Milvus

embeddings = OpenAIEmbeddings()
Milvus.from_documents(
    docs,
    embeddings,
    connection_args={
        "host": MILVUS_HOST,
        "port": MILVUS_PORT,
        "uri": f"http://{MILVUS_HOST}:{MILVUS_PORT}",
    },
    collection_name=collection_name,
)
Using LlamaIndex 🥰.
LlamaIndex provides a default in-memory vector store that `VectorStoreIndex` uses out of the box.
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Load documents and build index
documents = SimpleDirectoryReader(
    "../../examples/data/paul_graham"
).load_data()
index = VectorStoreIndex.from_documents(documents)
For my project, I am using Milvus Vector Database. Luckily, LlamaIndex also supports Milvus as a vector store.
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.milvus import MilvusVectorStore

vector_store = MilvusVectorStore(uri=f"http://{MILVUS_HOST}:{MILVUS_PORT}", overwrite=True, collection_name=collection_name)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embed_model,
)
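The embed_model above has to be defined somewhere. As a sketch, and assuming the HuggingFace embedding integration, it could reuse the EMBEDDING_MODEL_NAME from the Langchain code earlier; any LlamaIndex embedding class would work here:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Hypothetical wiring, not from the original code
embed_model = HuggingFaceEmbedding(model_name=EMBEDDING_MODEL_NAME)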
Querying Documents
Once we save our documents in the index, we can query it for answers. The index will rely on the data it holds as the source of wisdom for our language model.
With Langchain👍️
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Milvus

def get_vector_db(collection_name="collection_1"):
    embedding = OpenAIEmbeddings()
    vectordb = Milvus(
        embedding_function=embedding,
        connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT},
        collection_name=collection_name,
    )
    return vectordb

vector_db = get_vector_db()
docs = vector_db.similarity_search(prompt.prompt, k=5)
With LlamaIndex🥰
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.milvus import MilvusVectorStore

def get_vector_index(collection_name="collection_1"):
    # overwrite=False here, so querying does not wipe the existing collection
    vector_store = MilvusVectorStore(uri=f"http://{MILVUS_HOST}:{MILVUS_PORT}", overwrite=False, collection_name=collection_name)
    # from_vector_store builds the storage context internally
    index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
    return index

query_engine = get_vector_index().as_query_engine()
response = query_engine.query("Your query here")
print(response)
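If you want the LlamaIndex counterpart of the k=5 similarity search from the Langchain snippet, the retriever exposes a similarity_top_k parameter. This is a sketch reusing get_vector_index from above:
retriever = get_vector_index().as_retriever(similarity_top_k=5)
nodes = retriever.retrieve("Your query here")
for node_with_score in nodes:
    # Each result carries the matched text and its similarity score
    print(node_with_score.score, node_with_score.node.get_content())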
Building the Chatbot: Add Memory and Context
So far, I’ve been able to get LlamaIndex to perform simple question and answer operations on documents. My next objective is to add conversational memory to the chat assistant.
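As a rough sketch of the direction I expect to take (assuming LlamaIndex’s chat engine API and ChatMemoryBuffer; not final code), something like this should give the index a memory of the conversation:
from llama_index.core.memory import ChatMemoryBuffer

# token_limit is illustrative; it bounds how much history is kept in the prompt
memory = ChatMemoryBuffer.from_defaults(token_limit=3000)
chat_engine = get_vector_index().as_chat_engine(
    chat_mode="context",
    memory=memory,
    system_prompt="You are a helpful, respectful and honest assistant.",
)
response = chat_engine.chat("Your question here")
print(response)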