Chapter 7: Building RAG-Based Chatbots

What is RAG (Retrieval-Augmented Generation)?

RAG combines the power of large language models with external knowledge retrieval to create chatbots that can answer questions based on specific documents or data sources.

The RAG Architecture

User Query → Embedding → Vector Search → Context Retrieval → LLM → Response

Components:

  1. Document Ingestion: Convert documents to embeddings
  2. Vector Database: Store and search embeddings efficiently
  3. Retrieval: Find relevant context for queries
  4. Generation: LLM generates answers using retrieved context
  5. Response: Return formatted answer to user
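
In code, these five steps collapse into a short loop. A minimal sketch (search_fn stands in for the vector search built later in this chapter, and the prompt wording is illustrative):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer(query: str, search_fn) -> str:
    """Minimal RAG loop: retrieve context with search_fn, then generate an answer."""
    # Steps 1-3: embed the query and retrieve relevant chunks (search_fn is a
    # placeholder for the vector search implemented later in this chapter).
    context = "\n\n".join(hit["text"] for hit in search_fn(query))

    # Step 4: let the LLM answer, grounded in the retrieved context.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": query},
        ],
    )

    # Step 5: return the formatted answer.
    return response.choices[0].message.content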

📊 RAG Architecture Flow

[Diagram: how a RAG system processes a user query through the embedding, retrieval, and generation stages.]

📊 RAG Workflow Sequence

[Diagram: step-by-step sequence of interactions between the components of a RAG system.]

Why RAG?

Advantages Over Fine-Tuning

  • No retraining required for new information
  • Transparent sources - know where answers come from
  • Easy updates - just add new documents
  • Cost-effective - no GPU training time
  • Reduced hallucinations - grounded in actual documents

Use Cases

  • Customer support chatbots
  • Documentation assistants
  • Internal knowledge bases
  • Research assistants
  • Educational tutors
  • Legal document analysis

Vector Databases

What Are Embeddings?

Embeddings are numerical representations of text that capture semantic meaning:

# Text
"AI-driven development is transforming software engineering"

# Embedding (simplified)
[0.23, -0.45, 0.67, ..., 0.12] # 1536 dimensions for OpenAI's text-embedding-3-small

Similar texts have similar embeddings, enabling semantic search.
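
A quick way to see this: embed a few sentences and compare cosine similarities. Related sentences score noticeably higher than unrelated ones (a small sketch; the example sentences are arbitrary):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(text: str) -> list[float]:
    return client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

# The related pair should score higher than the unrelated pair.
print(cosine(embed("AI-driven development transforms software engineering"),
             embed("LLMs are changing how developers write code")))
print(cosine(embed("AI-driven development transforms software engineering"),
             embed("My favorite pasta sauce uses fresh basil")))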

Popular Vector Databases

  1. Qdrant - Fast, scalable, open-source
  2. Pinecone - Managed service, easy setup
  3. Weaviate - GraphQL interface
  4. Milvus - Large-scale deployments
  5. ChromaDB - Lightweight, embedded

Qdrant Cloud Setup

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

# Connect to Qdrant Cloud
client = QdrantClient(
    url="https://your-cluster.qdrant.io",
    api_key="your-api-key"
)

# Create collection
client.create_collection(
    collection_name="book_knowledge",
    vectors_config=VectorParams(
        size=1536,  # OpenAI embedding dimension
        distance=Distance.COSINE
    )
)
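
Re-running create_collection against an existing collection will typically error, so a small guard keeps setup scripts re-runnable (a sketch reusing the client and imports from the snippet above):

# Idempotent variant: only create the collection if it is not already there.
existing = {c.name for c in client.get_collections().collections}
if "book_knowledge" not in existing:
    client.create_collection(
        collection_name="book_knowledge",
        vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    )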

Document Processing Pipeline

Step 1: Chunk Documents

Break large documents into smaller, semantically meaningful chunks:

def chunk_document(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split document into overlapping chunks"""
    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]

        # Try to break at sentence boundary
        if end < len(text):
            last_period = chunk.rfind('.')
            if last_period > chunk_size * 0.7:  # At least 70% of chunk size
                end = start + last_period + 1
                chunk = text[start:end]

        chunks.append(chunk.strip())
        start = end - overlap  # Overlap for context continuity

    return chunks
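
For example, chunking a synthetic document (the text and sizes here are arbitrary) shows how the overlap works in practice:

# Roughly 11 KB of repeated text, chunked with a 100-character overlap.
text = "AI-driven development changes how teams ship software. " * 200
chunks = chunk_document(text, chunk_size=500, overlap=100)
print(len(chunks), len(chunks[0]))  # number of chunks and size of the first one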

Step 2: Generate Embeddings

from openai import OpenAI

client = OpenAI(api_key="your-key")

def get_embedding(text: str) -> list[float]:
    """Get embedding for text"""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding
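
For ingestion it is usually cheaper to batch chunks, since the embeddings endpoint accepts a list of inputs. A batched variant of the helper above (same client):

def get_embeddings(texts: list[str]) -> list[list[float]]:
    """Embed many chunks in a single API call"""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    # Each result carries an index field; sort to be explicit about ordering.
    return [item.embedding for item in sorted(response.data, key=lambda d: d.index)]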

Step 3: Store in Vector Database

from qdrant_client.models import PointStruct
import uuid

def ingest_document(document: str, metadata: dict):
    """Process and store document in Qdrant"""
    chunks = chunk_document(document)

    points = []
    for i, chunk in enumerate(chunks):
        embedding = get_embedding(chunk)

        point = PointStruct(
            id=str(uuid.uuid4()),
            vector=embedding,
            payload={
                "text": chunk,
                "chunk_index": i,
                **metadata
            }
        )
        points.append(point)

    # `client` here is the QdrantClient from the Qdrant Cloud Setup section;
    # get_embedding above uses the OpenAI client.
    client.upsert(
        collection_name="book_knowledge",
        points=points
    )
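
Usage might look like this (the file name and metadata keys are illustrative):

# Illustrative file name and metadata; any document source works.
with open("chapter_07.md") as f:
    ingest_document(f.read(), metadata={"source": "chapter_07.md", "chapter": 7})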

Retrieval Strategy

Basic semantic search embeds the query and returns the top-k most similar chunks from the vector database:

def search_knowledge(query: str, top_k: int = 5) -> list[dict]:
    """Search for relevant context"""
    query_embedding = get_embedding(query)

    results = client.search(
        collection_name="book_knowledge",
        query_vector=query_embedding,
        limit=top_k
    )

    return [
        {
            "text": hit.payload["text"],
            "score": hit.score,
            "metadata": hit.payload
        }
        for hit in results
    ]

Hybrid Search

Combine semantic search with keyword matching:

def hybrid_search(query: str, top_k: int = 5) -> list[dict]:
    """Combine vector and keyword search"""
    query_embedding = get_embedding(query)

    results = client.search(
        collection_name="book_knowledge",
        query_vector=query_embedding,
        query_filter={
            "must": [
                {
                    "key": "text",
                    "match": {
                        "text": query
                    }
                }
            ]
        },
        limit=top_k
    )

    return [{"text": hit.payload["text"], "score": hit.score} for hit in results]
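
Note that the full-text match filter above only works if the text payload field has a full-text index. Creating one looks roughly like this (qdrant-client API; the tokenizer settings are illustrative, and `client` is the QdrantClient from the setup section):

from qdrant_client import models

client.create_payload_index(
    collection_name="book_knowledge",
    field_name="text",
    field_schema=models.TextIndexParams(
        type="text",
        tokenizer=models.TokenizerType.WORD,
        min_token_len=2,
        max_token_len=20,
        lowercase=True,
    ),
)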

OpenAI Agents SDK

Creating a RAG Agent

import { Agent } from '@openai/agents-sdk';
import { QdrantClient } from '@qdrant/js-client-rest';

const agent = new Agent({
  model: 'gpt-4o',
  instructions: `You are a helpful assistant that answers questions about
    AI-Driven Development. Use the provided context to answer questions accurately.
    If you don't know the answer based on the context, say so.`,
  tools: [
    {
      type: 'function',
      function: {
        name: 'search_knowledge',
        description: 'Search the knowledge base for relevant information',
        parameters: {
          type: 'object',
          properties: {
            query: {
              type: 'string',
              description: 'The search query'
            }
          },
          required: ['query']
        }
      }
    }
  ]
});

// Handle tool calls
agent.on('tool_call', async (toolCall) => {
  if (toolCall.function.name === 'search_knowledge') {
    const { query } = JSON.parse(toolCall.function.arguments);
    const results = await searchKnowledge(query);

    return {
      tool_call_id: toolCall.id,
      output: JSON.stringify(results)
    };
  }
});

Running the Agent

const response = await agent.run({
  messages: [
    { role: 'user', content: 'What are the benefits of spec-driven development?' }
  ]
});

console.log(response.content);

FastAPI Backend

Project Structure

backend/
├── app/
│   ├── __init__.py
│   ├── main.py
│   ├── models.py
│   ├── routes/
│   │   ├── __init__.py
│   │   ├── chat.py
│   │   └── documents.py
│   ├── services/
│   │   ├── __init__.py
│   │   ├── embeddings.py
│   │   ├── qdrant.py
│   │   └── rag.py
│   └── config.py
├── requirements.txt
└── Dockerfile
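
The code below reads its credentials from a settings object in app/config.py, which is not shown in this chapter. A minimal sketch using pydantic-settings (an assumed dependency), with field names matching how the services use it:

# app/config.py (minimal sketch; field names follow the usage in the services below)
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    OPENAI_API_KEY: str
    QDRANT_URL: str
    QDRANT_API_KEY: str
    QDRANT_COLLECTION: str = "book_knowledge"

settings = Settings()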

Main Application

# app/main.py
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from app.routes import chat, documents
from app.config import settings

app = FastAPI(
    title="RAG Chatbot API",
    description="API for AI-Driven Development book chatbot",
    version="1.0.0"
)

# CORS configuration
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Configure for production
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Include routers
app.include_router(chat.router, prefix="/api/chat", tags=["chat"])
app.include_router(documents.router, prefix="/api/documents", tags=["documents"])

@app.get("/")
async def root():
    return {"message": "RAG Chatbot API", "status": "running"}

Chat Endpoint

# app/routes/chat.py
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel
from app.services.rag import RAGService

router = APIRouter()
rag_service = RAGService()

class ChatRequest(BaseModel):
    message: str
    context: str | None = None  # Optional selected text context
    conversation_id: str | None = None

class ChatResponse(BaseModel):
    response: str
    sources: list[dict]
    conversation_id: str

@router.post("/", response_model=ChatResponse)
async def chat(request: ChatRequest):
    """Handle chat message with RAG"""
    try:
        result = await rag_service.generate_response(
            query=request.message,
            context=request.context,
            conversation_id=request.conversation_id
        )
        return result
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
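
Once the server is running, the endpoint can be exercised directly. A quick check with requests (an assumed extra dependency; the URL reflects the /api/chat prefix above and uvicorn's default port):

import requests

resp = requests.post(
    "http://localhost:8000/api/chat/",
    json={"message": "What is RAG?"},
)
print(resp.json()["response"])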

RAG Service

# app/services/rag.py
import math

from openai import AsyncOpenAI
from qdrant_client import AsyncQdrantClient
from app.config import settings

class RAGService:
    def __init__(self):
        self.openai = AsyncOpenAI(api_key=settings.OPENAI_API_KEY)
        self.qdrant = AsyncQdrantClient(
            url=settings.QDRANT_URL,
            api_key=settings.QDRANT_API_KEY
        )

    async def generate_response(
        self,
        query: str,
        context: str | None = None,
        conversation_id: str | None = None
    ) -> dict:
        """Generate RAG response"""

        # If context provided, search within it
        if context:
            relevant_chunks = await self._search_in_context(query, context)
        else:
            # Search entire knowledge base
            relevant_chunks = await self._search_knowledge(query)

        # Build prompt with context
        context_text = "\n\n".join([
            f"[Source {i+1}]: {chunk['text']}"
            for i, chunk in enumerate(relevant_chunks)
        ])

        # Generate response
        response = await self.openai.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": f"""You are a helpful assistant for the AI-Driven
Development book. Answer questions based on the provided context.
Cite sources when appropriate.

Context:
{context_text}
"""
                },
                {
                    "role": "user",
                    "content": query
                }
            ]
        )

        return {
            "response": response.choices[0].message.content,
            "sources": [
                {
                    "text": chunk["text"][:200] + "...",
                    "score": chunk["score"]
                }
                for chunk in relevant_chunks
            ],
            "conversation_id": conversation_id or "new"
        }

    async def _search_knowledge(self, query: str, top_k: int = 5) -> list[dict]:
        """Search vector database"""
        query_embedding = await self._get_embedding(query)

        results = await self.qdrant.search(
            collection_name=settings.QDRANT_COLLECTION,
            query_vector=query_embedding,
            limit=top_k
        )

        return [
            {
                "text": hit.payload["text"],
                "score": hit.score
            }
            for hit in results
        ]

    async def _search_in_context(self, query: str, context: str) -> list[dict]:
        """Search within provided context"""
        # Chunk the context
        chunks = self._chunk_text(context)

        # Get embeddings for query and chunks
        query_embedding = await self._get_embedding(query)
        chunk_embeddings = [await self._get_embedding(chunk) for chunk in chunks]

        # Calculate similarity scores
        scores = [
            self._cosine_similarity(query_embedding, chunk_emb)
            for chunk_emb in chunk_embeddings
        ]

        # Return top chunks
        sorted_chunks = sorted(
            zip(chunks, scores),
            key=lambda x: x[1],
            reverse=True
        )[:3]

        return [
            {"text": chunk, "score": score}
            for chunk, score in sorted_chunks
        ]

    async def _get_embedding(self, text: str) -> list[float]:
        """Get text embedding"""
        response = await self.openai.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding

    def _chunk_text(self, text: str, size: int = 500) -> list[str]:
        """Split text into chunks"""
        words = text.split()
        chunks = []
        for i in range(0, len(words), size):
            chunk = " ".join(words[i:i + size])
            chunks.append(chunk)
        return chunks

    def _cosine_similarity(self, a: list[float], b: list[float]) -> float:
        """Calculate cosine similarity"""
        dot_product = sum(x * y for x, y in zip(a, b))
        magnitude_a = math.sqrt(sum(x * x for x in a))
        magnitude_b = math.sqrt(sum(y * y for y in b))
        return dot_product / (magnitude_a * magnitude_b)
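
The project structure also lists app/routes/documents.py, which main.py includes but the chapter does not show. A minimal illustrative sketch that reuses the ingestion pipeline from earlier (the import path and endpoint shape are assumptions):

# app/routes/documents.py (illustrative sketch)
from fastapi import APIRouter
from pydantic import BaseModel

router = APIRouter()

class IngestRequest(BaseModel):
    text: str
    metadata: dict = {}

@router.post("/", status_code=201)
async def ingest(request: IngestRequest):
    """Chunk, embed, and upsert a document into the vector store."""
    # Hypothetical import: assumes the chunk/embed/upsert pipeline from the
    # Document Processing Pipeline section lives in app/services/qdrant.py.
    from app.services.qdrant import ingest_document
    ingest_document(request.text, request.metadata)
    return {"status": "ingested"}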

Summary

In this chapter, you learned:

  • RAG architecture and components
  • Vector databases and embeddings
  • Document processing pipelines
  • OpenAI Agents SDK
  • FastAPI backend implementation
  • Context-aware search functionality

🎴 Test Your Knowledge

🎴 Chapter 7: RAG Chatbots Flashcards

Card 1 of 10

Question: What is RAG?

Answer: Retrieval-Augmented Generation - a technique that combines vector search with LLM generation to provide accurate, context-aware responses based on specific documents.

📝 Chapter Quiz

📝 Chapter 7 Quiz

Test your understanding with these multiple choice questions:

  1. What does RAG stand for in AI systems?
  2. How do vector embeddings enable semantic search in RAG?
  3. Which component stores vector embeddings in a RAG system?
  4. What is a common challenge in RAG systems?
  5. Why is source citation important in RAG chatbots?

Next Chapter: We'll integrate the chatbot into the Docusaurus frontend and implement text selection features.