Chapter 7: Building RAG-Based Chatbots
What is RAG (Retrieval-Augmented Generation)?
RAG combines the power of large language models with external knowledge retrieval to create chatbots that can answer questions based on specific documents or data sources.
The RAG Architecture
User Query → Embedding → Vector Search → Context Retrieval → LLM → Response
Components:
- Document Ingestion: Convert documents to embeddings
- Vector Database: Store and search embeddings efficiently
- Retrieval: Find relevant context for queries
- Generation: LLM generates answers using retrieved context
- Response: Return formatted answer to user
📊 RAG Architecture Flow (diagram): visual representation of how RAG systems process user queries through the embedding, retrieval, and generation stages.
📊 RAG Workflow Sequence (diagram): step-by-step sequence showing the interaction between the components of a RAG system.
Why RAG?
Advantages Over Fine-Tuning
- No retraining required for new information
- Transparent sources - know where answers come from
- Easy updates - just add new documents
- Cost-effective - no GPU training time
- Reduced hallucinations - grounded in actual documents
Use Cases
- Customer support chatbots
- Documentation assistants
- Internal knowledge bases
- Research assistants
- Educational tutors
- Legal document analysis
Vector Databases
What Are Embeddings?
Embeddings are numerical representations of text that capture semantic meaning:
# Text
"AI-driven development is transforming software engineering"
# Embedding (simplified)
[0.23, -0.45, 0.67, ..., 0.12] # 1536 dimensions for OpenAI's text-embedding-3-small
Similar texts have similar embeddings, enabling semantic search.
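You can verify this directly by comparing cosine similarities between embeddings. Below is a minimal sketch; it assumes an OPENAI_API_KEY in your environment, and the example sentences are illustrative only:
import math
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> list[float]:
    return openai_client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

base = embed("AI-driven development is transforming software engineering")
related = embed("LLMs are changing how developers build software")
unrelated = embed("My cat enjoys sleeping in the sun")

print(cosine(base, related))    # expect a noticeably higher score...
print(cosine(base, unrelated))  # ...than for the unrelated sentence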
Popular Vector Databases
- Qdrant - Fast, scalable, open-source
- Pinecone - Managed service, easy setup
- Weaviate - GraphQL interface
- Milvus - Large-scale deployments
- ChromaDB - Lightweight, embedded
Qdrant Cloud Setup
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

# Connect to Qdrant Cloud
client = QdrantClient(
    url="https://your-cluster.qdrant.io",
    api_key="your-api-key"
)

# Create collection
client.create_collection(
    collection_name="book_knowledge",
    vectors_config=VectorParams(
        size=1536,  # OpenAI embedding dimension
        distance=Distance.COSINE
    )
)
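If you plan to combine vector search with keyword filtering later (see Hybrid Search below), Qdrant also needs a full-text index on the payload field you will match against. A minimal sketch, assuming the collection created above:
from qdrant_client.models import TextIndexParams, TextIndexType, TokenizerType

# Full-text index on the "text" payload field, required for MatchText filters
client.create_payload_index(
    collection_name="book_knowledge",
    field_name="text",
    field_schema=TextIndexParams(
        type=TextIndexType.TEXT,
        tokenizer=TokenizerType.WORD,
        lowercase=True
    )
)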
Document Processing Pipeline
Step 1: Chunk Documents
Break large documents into smaller, semantically meaningful chunks:
def chunk_document(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split document into overlapping chunks"""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        # Try to break at a sentence boundary
        if end < len(text):
            last_period = chunk.rfind('.')
            if last_period > chunk_size * 0.7:  # Keep at least 70% of the chunk size
                end = start + last_period + 1
                chunk = text[start:end]
        chunks.append(chunk.strip())
        if end >= len(text):
            break  # Last chunk reached; avoid re-emitting the tail
        start = end - overlap  # Overlap for context continuity
    return chunks
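A quick way to see the overlap in action is to chunk a synthetic document (the sample text below is illustrative):
sample = "Spec-driven development starts with a written specification. " * 40
chunks = chunk_document(sample, chunk_size=500, overlap=100)

print(len(chunks), "chunks")
print(chunks[0][-80:])  # the end of one chunk...
print(chunks[1][:80])   # ...shares text with the start of the next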
Step 2: Generate Embeddings
from openai import OpenAI

# Use a distinct name so the Qdrant client above stays available as `client`
openai_client = OpenAI(api_key="your-key")

def get_embedding(text: str) -> list[float]:
    """Get embedding for text"""
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding
Step 3: Store in Vector Database
from qdrant_client.models import PointStruct
import uuid

def ingest_document(document: str, metadata: dict):
    """Process and store document in Qdrant"""
    chunks = chunk_document(document)
    points = []
    for i, chunk in enumerate(chunks):
        embedding = get_embedding(chunk)
        point = PointStruct(
            id=str(uuid.uuid4()),
            vector=embedding,
            payload={
                "text": chunk,
                "chunk_index": i,
                **metadata
            }
        )
        points.append(point)
    client.upsert(
        collection_name="book_knowledge",
        points=points
    )
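Putting the three steps together, ingesting a chapter might look like this (the file path and metadata are hypothetical, for illustration only):
# Hypothetical source file and metadata
with open("docs/chapter-07-rag-chatbots.md") as f:
    ingest_document(
        f.read(),
        metadata={"source": "book", "chapter": 7, "title": "Building RAG-Based Chatbots"}
    )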
Retrieval Strategy
Basic Semantic Search
def search_knowledge(query: str, top_k: int = 5) -> list[dict]:
    """Search for relevant context"""
    query_embedding = get_embedding(query)
    results = client.search(
        collection_name="book_knowledge",
        query_vector=query_embedding,
        limit=top_k
    )
    return [
        {
            "text": hit.payload["text"],
            "score": hit.score,
            "metadata": hit.payload
        }
        for hit in results
    ]
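Calling it with a question returns ranked chunks that will later be handed to the LLM; the query below is illustrative:
for hit in search_knowledge("What are the benefits of spec-driven development?", top_k=3):
    print(f"{hit['score']:.3f}  {hit['text'][:80]}...")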
Hybrid Search
Combine semantic vector search with keyword filtering. In Qdrant this means adding a MatchText condition on the text payload field, which requires a full-text index on that field:
from qdrant_client.models import FieldCondition, Filter, MatchText

def hybrid_search(query: str, top_k: int = 5) -> list[dict]:
    """Vector search restricted to chunks that also contain the query keywords"""
    query_embedding = get_embedding(query)
    results = client.search(
        collection_name="book_knowledge",
        query_vector=query_embedding,
        query_filter=Filter(
            must=[
                FieldCondition(
                    key="text",
                    match=MatchText(text=query)
                )
            ]
        ),
        limit=top_k
    )
    return [{"text": hit.payload["text"], "score": hit.score} for hit in results]
OpenAI Agents SDK
Creating a RAG Agent
import { Agent, run, tool } from '@openai/agents';
import { QdrantClient } from '@qdrant/js-client-rest';
import OpenAI from 'openai';
import { z } from 'zod';

const openai = new OpenAI();
const qdrant = new QdrantClient({
  url: 'https://your-cluster.qdrant.io',
  apiKey: 'your-api-key',
});

// Tool the agent can call to retrieve relevant chunks from the vector database
const searchKnowledge = tool({
  name: 'search_knowledge',
  description: 'Search the knowledge base for relevant information',
  parameters: z.object({
    query: z.string().describe('The search query'),
  }),
  execute: async ({ query }) => {
    const embedding = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: query,
    });
    const results = await qdrant.search('book_knowledge', {
      vector: embedding.data[0].embedding,
      limit: 5,
    });
    return JSON.stringify(results.map((hit) => hit.payload));
  },
});

const agent = new Agent({
  name: 'Book Assistant',
  model: 'gpt-4o',
  instructions: `You are a helpful assistant that answers questions about
AI-Driven Development. Use the provided context to answer questions accurately.
If you don't know the answer based on the context, say so.`,
  tools: [searchKnowledge],
});
Running the Agent
const result = await run(agent, 'What are the benefits of spec-driven development?');
console.log(result.finalOutput);
FastAPI Backend
Project Structure
backend/
├── app/
│ ├── __init__.py
│ ├── main.py
│ ├── models.py
│ ├── routes/
│ │ ├── __init__.py
│ │ ├── chat.py
│ │ └── documents.py
│ ├── services/
│ │ ├── __init__.py
│ │ ├── embeddings.py
│ │ ├── qdrant.py
│ │ └── rag.py
│ └── config.py
├── requirements.txt
└── Dockerfile
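The services below read their settings from app/config.py. A minimal sketch, assuming pydantic-settings for environment-based configuration (the variable names mirror how they are used later in this chapter):
# app/config.py
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    OPENAI_API_KEY: str
    QDRANT_URL: str
    QDRANT_API_KEY: str
    QDRANT_COLLECTION: str = "book_knowledge"

settings = Settings()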
Main Application
# app/main.py
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from app.routes import chat, documents
from app.config import settings

app = FastAPI(
    title="RAG Chatbot API",
    description="API for AI-Driven Development book chatbot",
    version="1.0.0"
)

# CORS configuration
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Configure for production
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Include routers
app.include_router(chat.router, prefix="/api/chat", tags=["chat"])
app.include_router(documents.router, prefix="/api/documents", tags=["documents"])

@app.get("/")
async def root():
    return {"message": "RAG Chatbot API", "status": "running"}
Chat Endpoint
# app/routes/chat.py
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel
from app.services.rag import RAGService

router = APIRouter()
rag_service = RAGService()

class ChatRequest(BaseModel):
    message: str
    context: str | None = None  # Optional selected text context
    conversation_id: str | None = None

class ChatResponse(BaseModel):
    response: str
    sources: list[dict]
    conversation_id: str

@router.post("/", response_model=ChatResponse)
async def chat(request: ChatRequest):
    """Handle chat message with RAG"""
    try:
        result = await rag_service.generate_response(
            query=request.message,
            context=request.context,
            conversation_id=request.conversation_id
        )
        return result
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
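With the server running, the endpoint can be exercised with any HTTP client. A sketch using httpx (an assumed dependency; the question text is illustrative):
import httpx

payload = {"message": "What are the benefits of spec-driven development?"}
response = httpx.post("http://localhost:8000/api/chat/", json=payload, timeout=60)
data = response.json()

print(data["response"])
for source in data["sources"]:
    print(f"- ({source['score']:.2f}) {source['text']}")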
RAG Service
# app/services/rag.py
import math

from openai import AsyncOpenAI
from qdrant_client import AsyncQdrantClient
from app.config import settings

class RAGService:
    def __init__(self):
        self.openai = AsyncOpenAI(api_key=settings.OPENAI_API_KEY)
        self.qdrant = AsyncQdrantClient(
            url=settings.QDRANT_URL,
            api_key=settings.QDRANT_API_KEY
        )

    async def generate_response(
        self,
        query: str,
        context: str | None = None,
        conversation_id: str | None = None
    ) -> dict:
        """Generate RAG response"""
        # If context provided, search within it
        if context:
            relevant_chunks = await self._search_in_context(query, context)
        else:
            # Search entire knowledge base
            relevant_chunks = await self._search_knowledge(query)

        # Build prompt with context
        context_text = "\n\n".join([
            f"[Source {i+1}]: {chunk['text']}"
            for i, chunk in enumerate(relevant_chunks)
        ])

        # Generate response
        response = await self.openai.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": f"""You are a helpful assistant for the AI-Driven
Development book. Answer questions based on the provided context.
Cite sources when appropriate.

Context:
{context_text}
"""
                },
                {
                    "role": "user",
                    "content": query
                }
            ]
        )

        return {
            "response": response.choices[0].message.content,
            "sources": [
                {
                    "text": chunk["text"][:200] + "...",
                    "score": chunk["score"]
                }
                for chunk in relevant_chunks
            ],
            "conversation_id": conversation_id or "new"
        }

    async def _search_knowledge(self, query: str, top_k: int = 5) -> list[dict]:
        """Search vector database"""
        query_embedding = await self._get_embedding(query)
        results = await self.qdrant.search(
            collection_name=settings.QDRANT_COLLECTION,
            query_vector=query_embedding,
            limit=top_k
        )
        return [
            {
                "text": hit.payload["text"],
                "score": hit.score
            }
            for hit in results
        ]

    async def _search_in_context(self, query: str, context: str) -> list[dict]:
        """Search within provided context"""
        # Chunk the context
        chunks = self._chunk_text(context)

        # Get embeddings for query and chunks
        query_embedding = await self._get_embedding(query)
        chunk_embeddings = [await self._get_embedding(chunk) for chunk in chunks]

        # Calculate similarity scores
        scores = [
            self._cosine_similarity(query_embedding, chunk_emb)
            for chunk_emb in chunk_embeddings
        ]

        # Return top chunks
        sorted_chunks = sorted(
            zip(chunks, scores),
            key=lambda x: x[1],
            reverse=True
        )[:3]
        return [
            {"text": chunk, "score": score}
            for chunk, score in sorted_chunks
        ]

    async def _get_embedding(self, text: str) -> list[float]:
        """Get text embedding"""
        response = await self.openai.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding

    def _chunk_text(self, text: str, size: int = 500) -> list[str]:
        """Split text into chunks of roughly `size` words"""
        words = text.split()
        chunks = []
        for i in range(0, len(words), size):
            chunk = " ".join(words[i:i + size])
            chunks.append(chunk)
        return chunks

    def _cosine_similarity(self, a: list[float], b: list[float]) -> float:
        """Calculate cosine similarity"""
        dot_product = sum(x * y for x, y in zip(a, b))
        magnitude_a = math.sqrt(sum(x * x for x in a))
        magnitude_b = math.sqrt(sum(y * y for y in b))
        return dot_product / (magnitude_a * magnitude_b)
Summary
In this chapter, you learned:
- RAG architecture and components
- Vector databases and embeddings
- Document processing pipelines
- OpenAI Agents SDK
- FastAPI backend implementation
- Context-aware search functionality
🎴 Test Your Knowledge
Sample flashcard:
Q: What is RAG?
A: Retrieval-Augmented Generation: a technique that combines vector search with LLM generation to provide accurate, context-aware responses based on specific documents.
📝 Chapter 7 Quiz
Test your understanding with these multiple-choice questions:
1. What does RAG stand for in AI systems?
2. How do vector embeddings enable semantic search in RAG?
3. Which component stores vector embeddings in a RAG system?
4. What is a common challenge in RAG systems?
5. Why is source citation important in RAG chatbots?
Next Chapter: We'll integrate the chatbot into the Docusaurus frontend and implement text selection features.