GEN AI: RAG with Chat History using LangChain & Ollama for free

Introduction

In this article we will learn how to use LLMs to get more insight into a website for free. We will start by setting up Ollama on Google Colab and then dive deeper into RAG and chat history. For details on the setup, please go through the article below. By the end of this article you will be able to load data from a website and have a contextual chat with an LLM about the website's content.

Running Ollama on Google Colab and using LLM models for free

Step 1: Initialize the local model

For this POC we will use Mistral 7B, one of the most capable models of its size.

from langchain_community.llms import Ollama 
llm = Ollama(model = "mistral")

To make sure we can connect to the model and get a response, run the command below:
llm.invoke("Tell me a short joke on namit")

Here’s a playful, light-hearted joke involving the name “Namit”. Remember, humor can be subjective, so take this in good spirits!

Why did Namit bring an umbrella to the desert?

Because he heard it was going to rain-mit! (Rhymes with Namit and sounds like ‘rain’)

I hope you enjoy it! If not, I apologize for any unintentional discomfort. Humor is a delicate balancing act, and sometimes it doesn’t land as intended.

Step 2: Load data from the website

To load data from our website, we will use the WebBaseLoader class from langchain_community. Make sure this package is installed before running the code below. WebBaseLoader also depends on beautifulsoup4 for HTML parsing.

from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader(web_path="https://digitalitskills.com/")

docs = loader.load()

Step 3: Split Data into smaller chunks

Next we need to break our data into smaller chunks. For this we are going to use RecursiveCharacterTextSplitter, though you can use any other text splitter.

from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 200,
    add_start_index = True 
)
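I am using a chunk_size of 1000 characters and an overlap of 200 characters, and add_start_index=True stores each chunk's starting offset in its metadata. Feel free to experiment with these numbers and pick the values that give the best results.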
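
To build intuition for how chunk_size and chunk_overlap interact, here is a minimal sketch in plain Python. It uses a fixed-size sliding window, which is a simplification: RecursiveCharacterTextSplitter additionally tries to break on separators such as paragraphs and sentences. The function name sliding_window_chunks is hypothetical and not part of LangChain.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[tuple[int, str]]:
    """Split text into fixed-size chunks; returns (start_index, chunk) pairs."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for start, chunk in chunks:
    print(start, chunk)
```

Each chunk repeats the last 4 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.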

Step 4: Use the splitter to split the document

In the next step, we will feed the document that we loaded to our splitter to split it into chunks.

all_splits = text_splitter.split_documents(docs)

Step 5: Embeddings

For this task we will use nomic-embed-text embeddings, which are reported to perform on par with OpenAI embeddings. Embeddings are very important: they transform data into a numerical representation and can have a huge impact on the accuracy and quality of the model's responses.

from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

vectorstore = Chroma.from_documents(
    documents=all_splits,
    embedding=OllamaEmbeddings(
        base_url="http://127.0.0.1:11434",
        model="nomic-embed-text",
    ),
)

Let's break down the code: first we take the splits stored in all_splits and feed them to the embedding model, which converts the text into vectors of numbers. Next we store these vectors in a Chroma vector store so that we can easily query and retrieve them.
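
To illustrate what "converting text into numbers" means, here is a toy sketch that represents each text as word counts over a shared vocabulary. This is not how nomic-embed-text works (it is a neural model that produces dense vectors); the helper names build_vocabulary and toy_embed are hypothetical and exist only for this example.

```python
from collections import Counter


def build_vocabulary(texts: list[str]) -> list[str]:
    """Collect the sorted set of lowercase words across all texts."""
    return sorted({word for text in texts for word in text.lower().split()})


def toy_embed(text: str, vocabulary: list[str]) -> list[int]:
    """Represent text as word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


texts = ["memory forensics tools", "web design tools"]
vocabulary = build_vocabulary(texts)
vectors = [toy_embed(t, vocabulary) for t in texts]
print(vocabulary)  # ['design', 'forensics', 'memory', 'tools', 'web']
print(vectors)     # [[0, 1, 1, 1, 0], [1, 0, 0, 1, 1]]
```

The two vectors share a non-zero entry only for "tools", which is the kind of signal a vector store uses to judge how related two pieces of text are.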

Step 6: Retriever

A vector store retriever is a retriever that uses a vector store to retrieve documents. It is a lightweight wrapper around the vector store class to make it conform to the retriever interface. It uses the search methods implemented by a vector store, like similarity search and MMR, to query the texts in the vector store. In this example we will use similarity search.

retriever = vectorstore.as_retriever(
    search_type = "similarity",
    search_kwargs = {"k":6}
)
Let's test the retriever by running the command below:
retriever.invoke("What is forensics?")
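
Conceptually, a similarity search with k results ranks the stored vectors by how close they are to the query vector and keeps the top k. The sketch below shows this with cosine similarity and made-up three-dimensional vectors. Cosine similarity is one common choice and is an assumption here; the metric Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], stored: list[tuple[str, list[float]]], k: int) -> list[str]:
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(stored, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Made-up vectors standing in for real embeddings
stored = [
    ("chunk about memory forensics", [0.9, 0.1, 0.0]),
    ("chunk about web design", [0.0, 0.2, 0.9]),
    ("chunk about malware analysis", [0.7, 0.6, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]
results = top_k(query_vec, stored, k=2)
print(results)
```

With these numbers the web design chunk points in a very different direction from the query, so it is the one left out.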

Step 7: Standalone question

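A follow-up such as "What is it used for?" cannot be matched against the vector store on its own, so we first ask the LLM to rewrite it as a standalone question using the chat history. To do this, we will use the prompt below: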

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder 

contextualize_q_system_prompt = """Given a chat history and the latest user question \
which might reference context in the chat history, formulate a standalone question \
which can be understood without the chat history. Do NOT answer the question, \
just reformulate it if needed and otherwise return it as is."""

contextualize_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", contextualize_q_system_prompt),
        MessagesPlaceholder(variable_name="chat_history"),
        ("human", "{question}"),
    ]
)
from langchain_core.output_parsers import StrOutputParser

contextualize_q_chain = contextualize_q_prompt | llm | StrOutputParser()

Step 8: Invoke the contextualized chain

from langchain_core.messages import AIMessage, HumanMessage

contextualize_q_chain.invoke(
    {
        "chat_history":[
            HumanMessage(content="What does forensics stand for?"),
            AIMessage(content="Volatility"),
        ],
        "question": "What is meant by Volatile Memory forensics?",
    }
)
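
To make it concrete what the model receives when the chain above is invoked, the following sketch assembles the message list by hand: the system prompt first, then the earlier turns in place of the chat_history placeholder, then the new question. It uses plain (role, content) tuples for illustration only; build_messages is a hypothetical helper, not a LangChain function.

```python
def build_messages(system_prompt: str, chat_history: list, question: str) -> list:
    """Assemble the message list: system prompt, prior turns, then the new question."""
    return [("system", system_prompt), *chat_history, ("human", question)]


history = [
    ("human", "What does forensics stand for?"),
    ("ai", "Volatility"),
]
messages = build_messages(
    "Reformulate the latest question as a standalone question.",
    history,
    "What is meant by Volatile Memory forensics?",
)
for role, content in messages:
    print(f"{role}: {content}")
```

With an empty history the placeholder contributes nothing, and the model sees only the system prompt and the question.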

Step 9: Create a chain for chat history

qa_system_prompt = """You are an assistant for question-answering tasks. \
Use the following pieces of retrieved context to answer the question. \
If you don't know the answer, just say that you don't know. \
Use three sentences maximum and keep the answer concise.\

{context}"""

qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", qa_system_prompt),
        MessagesPlaceholder(variable_name="chat_history"),
        ("human", "{question}"),
    ]
)
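def format_docs(docs):
    # Join the text of the retrieved chunks into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

def contextualized_question(input: dict):
    # With chat history, hand off to the chain that rewrites the question;
    # without it, the question is already standalone
    if input.get("chat_history"):
        return contextualize_q_chain
    else:
        return input["question"]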

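This function returns contextualize_q_chain if chat_history is non-empty; otherwise it returns the question unchanged. When a function used inside a chain returns a runnable, LangChain invokes that runnable with the same input, so the retriever always receives a plain question string. Next we will create our chain as follows: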

from langchain_core.runnables import RunnablePassthrough

rag_chain = (
    RunnablePassthrough.assign(
        context = contextualized_question | retriever | format_docs
    )
    | qa_prompt 
    | llm
)
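
The data flow of the chain above can be traced with plain Python functions. In this sketch every component is a stub: rewrite_question stands in for contextualize_q_chain, fake_retriever for the vector store retriever, and run_pipeline for the assign step. All of these names are hypothetical, and no LLM or vector store is involved.

```python
def rewrite_question(inputs: dict) -> str:
    """Stand-in for contextualize_q_chain; a real chain would ask the LLM."""
    return f"{inputs['question']} (rewritten using {len(inputs['chat_history'])} earlier messages)"


def choose_search_query(inputs: dict) -> str:
    """Rewrite the question only when there is chat history to resolve against."""
    if inputs.get("chat_history"):
        return rewrite_question(inputs)
    return inputs["question"]


def fake_retriever(query: str) -> list[str]:
    """Stand-in for the vector store retriever; returns canned chunks."""
    return [f"chunk 1 for: {query}", f"chunk 2 for: {query}"]


def run_pipeline(inputs: dict) -> dict:
    """Mimic the assign step: add a 'context' key and keep the original keys."""
    query = choose_search_query(inputs)
    context = "\n\n".join(fake_retriever(query))
    return {**inputs, "context": context}


first = run_pipeline({"question": "What is RAM?", "chat_history": []})
follow_up = run_pipeline({"question": "What is it used for?", "chat_history": ["q1", "a1"]})
print(first["context"])
print(follow_up["context"])
```

The resulting dictionary carries question, chat_history, and context, which are exactly the variables the QA prompt expects.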

Step 10: Bringing everything together

chat_history = []

question = "What is Volatile Memory Forensics?"
ai_msg = rag_chain.invoke(
    {
        "question": question,
        "chat_history": chat_history
    }
)
Let's print the response:
 Volatile Memory Forensics refers to the process of collecting and analyzing data from a computer's volatile memory (RAM) to gather information about its current or past state. This can be used for investigating security incidents, malware analysis, or system troubleshooting. Tools like Volatility Framework are commonly used for this purpose.

Next, we will add both the question and the response to the chat_history variable:

chat_history.extend(
    [
        # The Ollama LLM returns a plain string, so wrap it as an AI message
        HumanMessage(content=question), AIMessage(content=ai_msg)
    ]
)
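
Since every turn repeats the same two steps (invoke the chain, then record the exchange), they can be wrapped in a small helper. The sketch below is one possible shape, tested against a stub instead of the real chain: EchoChain and ask are hypothetical names, and the history holds plain (role, text) tuples for illustration where the article's code appends message objects.

```python
class EchoChain:
    """Hypothetical stand-in for rag_chain: anything with an invoke(dict) method."""

    def invoke(self, inputs: dict) -> str:
        return f"answer to '{inputs['question']}' with {len(inputs['chat_history'])} prior messages"


def ask(chain, question: str, chat_history: list) -> str:
    """Invoke the chain, then record the question and answer as one turn."""
    answer = chain.invoke({"question": question, "chat_history": chat_history})
    chat_history.extend([("human", question), ("ai", answer)])
    return answer


history = []
chain = EchoChain()
first_answer = ask(chain, "What is Volatile Memory Forensics?", history)
second_answer = ask(chain, "What is it used for?", history)
print(first_answer)
print(second_answer)
print(len(history))  # 4
```

Recording the turn inside the helper makes it harder to forget updating the history between questions.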

Conclusion

In this final step we will test whether our model can answer the next question within the context of the previous question:

second_question = "What is it used for?"

rag_chain.invoke(
    {
        "question": second_question,
        "chat_history": chat_history
    }
)

Response:
 Volatile Memory Forensics is used to investigate security incidents, analyze malware, and troubleshoot systems by collecting data from a computer's volatile memory (RAM). It helps in gathering information about the current or past state of the system. Tools like the Volatility Framework are commonly used for this purpose.
As you can see from the above response, our bot used the chat history to work out that "it" refers to Volatile Memory Forensics and answered the question correctly.

third_question = "What tool can we use for it?"

rag_chain.invoke(
    {
        "question": third_question,
        "chat_history": chat_history
    }
)

Response:
 The tool commonly used for volatile memory forensics is the Volatility Framework. It's a powerful open-source platform for analyzing volatile (RAM) and non-volatile data from Windows, Linux, macOS, and Android systems.