
How to Build an LLM-Powered Customer Support Chatbot with RAG

A step-by-step guide to building a production-ready AI customer support chatbot using LangChain, OpenAI, Pinecone vector database, and FastAPI — with human escalation built in.

Softotic Engineering/25 February 2025/3 min read


Generic LLM chatbots hallucinate. They answer confidently with wrong information because they lack context about your business. The solution is RAG (Retrieval-Augmented Generation) — grounding the LLM in your own knowledge base. Here's how Softotic builds this for clients.

What Is RAG?

RAG = Retrieval-Augmented Generation.

Instead of asking the LLM to answer from memory (which leads to hallucinations), you:

  1. Retrieve relevant documents from your knowledge base.
  2. Include them in the LLM's prompt as context.
  3. Have the LLM generate its answer based only on the retrieved context.

Result: accurate, specific, verifiable answers grounded in your business data.

Architecture Overview

code
User message
    ↓
[Query Embedding] (OpenAI / sentence-transformers)
    ↓
[Vector Search in Pinecone] → Top-K relevant chunks
    ↓
[Prompt Construction] = System prompt + Context chunks + User message
    ↓
[OpenAI GPT-4o] generates response
    ↓
[Confidence check] → if low confidence: escalate to human
    ↓
Response to user
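The prompt-construction step in the middle of this pipeline is plain string assembly. A minimal sketch in plain Python (the system-prompt wording and chunk delimiter are illustrative choices, not a fixed API):

```python
def build_prompt(system_prompt: str, chunks: list[str], user_message: str) -> list[dict]:
    # Join the retrieved chunks into one clearly delimited context block
    context = "\n\n---\n\n".join(chunks)
    return [
        {
            "role": "system",
            "content": f"{system_prompt}\n\nAnswer ONLY from this context:\n{context}",
        },
        {"role": "user", "content": user_message},
    ]

messages = build_prompt(
    "You are a support assistant.",
    [
        "Refunds are processed within 5 business days.",
        "Contact billing for invoice questions.",
    ],
    "How long do refunds take?",
)
```

The resulting message list is what you would pass to the chat-completion call; LangChain assembles an equivalent prompt for you in the chain shown below.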

Step 1: Build Your Knowledge Base

Index your knowledge base into a vector database.

python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

# Load docs (could be PDF, markdown, website scrape) — load_documents is a
# placeholder for your loader of choice, e.g. DirectoryLoader from langchain_community
docs = load_documents("./knowledge_base/")

# Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = PineconeVectorStore.from_documents(chunks, embeddings, index_name="support-kb")
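To see what `chunk_size` and `chunk_overlap` actually do, here is the sliding-window idea in plain Python (simplified: it splits on raw characters, whereas `RecursiveCharacterTextSplitter` prefers paragraph and sentence boundaries):

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    # Each chunk starts (chunk_size - chunk_overlap) characters after the previous one,
    # so neighbouring chunks share chunk_overlap characters of shared context.
    step = chunk_size - chunk_overlap
    return [
        text[i : i + chunk_size]
        for i in range(0, max(len(text) - chunk_overlap, 1), step)
    ]

sample = "abcdefghij" * 250  # 2,500 characters
chunks = chunk_text(sample)  # chunks start at offsets 0, 800, 1600
```

The overlap means a sentence cut at a chunk boundary still appears whole in the neighbouring chunk, which noticeably improves retrieval quality.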

Step 2: Build the RAG Chain

python
from langchain_openai import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True,
    output_key="answer",  # required when the chain returns multiple keys
)

chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory,
    return_source_documents=True,  # so the API can cite its sources
    verbose=False,
)

Step 3: FastAPI Endpoint

python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    session_id: str
    message: str

@app.post("/chat")
async def chat(req: ChatRequest):
    # NOTE: a single global chain shares one memory across all users;
    # in production, look up a per-session chain (or history) by req.session_id
    response = chain.invoke({"question": req.message})
    
    # Escalation trigger: model returns low-confidence signal
    should_escalate = needs_human(response["answer"])
    
    return {
        "answer": response["answer"],
        "escalate": should_escalate,
        "sources": [doc.metadata for doc in response.get("source_documents", [])]
    }

Step 4: Human Escalation

A critical feature often overlooked. When the bot says "I'm not sure" or the user asks for a human, escalate:

  1. Flag the session as escalated in the database.
  2. Alert live agents via WebSocket or notification.
  3. Show the full conversation history to the agent.
  4. Agent takes over; user sees "You're now connected to a support agent."
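The `needs_human` check referenced in the endpoint above can start as a simple heuristic; a sketch (the phrase lists are assumptions you would tune per deployment, and a learned confidence score can replace them later):

```python
UNCERTAIN_PHRASES = (
    "i'm not sure",
    "i don't know",
    "i couldn't find",
    "contact support",
)

HUMAN_REQUEST_PHRASES = ("speak to a human", "talk to an agent", "real person")

def needs_human(answer: str, user_message: str = "") -> bool:
    # Escalate when the model signals uncertainty,
    # or when the user explicitly asks for a person
    answer_l, user_l = answer.lower(), user_message.lower()
    return any(p in answer_l for p in UNCERTAIN_PHRASES) or any(
        p in user_l for p in HUMAN_REQUEST_PHRASES
    )
```

Tune the thresholds against logged conversations: a false "escalate" costs an agent a minute, but a false "handled" costs a customer.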

Step 5: Multi-Channel Integration

  • Web widget: React component connects to the /chat API via WebSocket.
  • WhatsApp: WhatsApp Business API webhook → your chat API → response via WhatsApp.

Production Considerations

  • Session management: Store chat history in Redis with TTL.
  • Rate limiting: Per-IP and per-session to prevent abuse.
  • Content filtering: Validate inputs to prevent prompt injection.
  • Logging: Log all conversations for quality review and model fine-tuning.
  • Monitoring: Track average response latency, escalation rate, user satisfaction.
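Rate limiting, for instance, can start as a small in-process sliding window before you reach for Redis; a sketch (the per-session keying and the 20-requests-per-minute numbers are illustrative):

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    def __init__(self, max_requests: int = 20, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)  # key -> deque of request timestamps

    def allow(self, key: str, now=None) -> bool:
        now = time.monotonic() if now is None else now
        q = self.hits[key]
        # Drop timestamps that have aged out of the window
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.max_requests:
            return False
        q.append(now)
        return True
```

In the FastAPI endpoint you would call `limiter.allow(req.session_id)` (and a second limiter keyed by client IP) before invoking the chain, returning HTTP 429 when it refuses.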

Keeping the Knowledge Base Fresh

Set up a pipeline to re-index when content changes:

  • Webhook from your CMS triggers re-ingestion
  • Weekly full re-index as a scheduled job
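To keep either trigger cheap, hash each document's content so the pipeline only re-embeds what actually changed; a minimal sketch (the in-memory `seen` dict stands in for wherever you persist ingestion state):

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def docs_to_reindex(docs: dict[str, str], seen: dict[str, str]) -> list[str]:
    # Return IDs of new or changed docs; update `seen` so the next run skips them
    changed = []
    for doc_id, text in docs.items():
        h = content_hash(text)
        if seen.get(doc_id) != h:
            changed.append(doc_id)
            seen[doc_id] = h
    return changed
```

Only the returned IDs go through the chunk-embed-upsert path from Step 1, which keeps the weekly full re-index as a safety net rather than the routine cost.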

Conclusion

A RAG-based customer support bot, built properly, can deflect 60–80% of routine support volume while maintaining accuracy. The critical success factor is a well-structured, comprehensive knowledge base.

Ready to add AI support to your product? Softotic's AI team can build it.