
R2: RAG Architecture

RAG (Retrieval-Augmented Generation) grounds LLM responses in your data instead of relying on training-time knowledge. It solves three fundamental LLM limitations. For prompting fundamentals, see R1: Prompt Engineering.

Why RAG Exists

| LLM Limitation | What Happens | RAG Solution |
|---|---|---|
| Knowledge cutoff | Model doesn't know events after training date | Retrieve current docs at query time |
| Hallucination | Model fabricates plausible-sounding answers | Ground responses in retrieved evidence |
| No private data | Model can't access your internal docs, APIs, databases | Index private data into a search service |

RAG vs Fine-Tuning vs Prompt Engineering

| Dimension | Prompt Engineering | RAG | Fine-Tuning |
|---|---|---|---|
| Adds new knowledge | ❌ | ✅ Real-time | ✅ Baked in |
| Cites sources | ❌ | ✅ With retrieval metadata | ❌ |
| Setup cost | Free | Medium (search infra + embeddings) | High (GPU + labeled data) |
| Data freshness | Training cutoff | Minutes (re-index) | Weeks (re-train) |
| Best for | Format, tone, style | Private/current knowledge Q&A | Domain language, specialized behavior |

Decision rule: Start with prompt engineering. Add RAG when you need private or current data. Fine-tune only when RAG + prompting can't achieve the required style or accuracy.

The Two Pipelines

RAG has two distinct pipelines: Offline (Ingestion) runs ahead of time; Online (Query) runs per request.

Offline: Ingestion Pipeline

Documents → Load → Extract → Clean → Chunk → Enrich → Embed → Index
| Step | What It Does | Azure Service |
|---|---|---|
| Document Loading | Fetch from blob, SharePoint, SQL, APIs | Azure Blob Storage, Data Factory |
| Extraction | OCR, table extraction, layout analysis | Azure Document Intelligence |
| Cleaning | Remove headers, footers, boilerplate, PII | Custom code + Presidio |
| Chunking | Split into retrieval-friendly segments | Custom (see strategies below) |
| Enrichment | Add metadata: title, source, date, entities | Azure AI Language, custom |
| Embedding | Convert text → dense vectors (1536–3072 dims) | Azure OpenAI text-embedding-3-large |
| Indexing | Store vectors + metadata for fast retrieval | Azure AI Search |
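Each chunk that leaves this pipeline becomes one search document combining text, its embedding, and metadata. A sketch of the shape (all field names and values are illustrative and must match your Azure AI Search index schema; text-embedding-3-large produces 3072-dimensional vectors):

```python
# Illustrative search document for one chunk; field names and values are
# hypothetical and must match the fields defined in your index.
chunk_doc = {
    "id": "handbook-0007",                     # unique key per chunk
    "title": "Employee Handbook",              # surfaced in citations
    "content": "Employees accrue 20 days of paid leave per year...",
    "content_vector": [0.013, -0.027, 0.004],  # truncated; real vector has 3072 dims
    "source_url": "https://contoso.example/docs/handbook.pdf",
    "last_modified": "2025-01-15",
}
```

The metadata fields (title, source, date) are what make citations possible at query time, so carry them through from the Enrichment step.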

Online: Query Pipeline

User Query → Process → Embed → Retrieve → Rerank → Assemble Context → Generate
| Step | Typical Latency | What Happens |
|---|---|---|
| Query Processing | 5–20 ms | Rewrite, expand, decompose multi-part questions |
| Query Embedding | 20–50 ms | Convert query to same vector space as documents |
| Retrieval | 30–80 ms | Hybrid search (keyword + vector) returns top-50 |
| Reranking | 50–150 ms | Cross-encoder reranks top-50 → top-5 |
| Context Assembly | 5–10 ms | Format retrieved chunks + metadata into prompt |
| LLM Generation | 500–3000 ms | Generate grounded response with citations |
| Total | ~700–3300 ms | End-to-end latency for a single RAG query |
ℹ️

Latency Budget

Generation dominates total latency (60–80%). Use streaming to improve perceived performance: first tokens arrive in ~200 ms even if the full response takes 3 seconds. FrootAI Play 01 implements streaming by default.
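The consuming side of a stream can be sketched as a small generator. This assumes the openai SDK's streaming chunk shape (each chunk carries `choices[0].delta.content`), which is what `chat.completions.create(..., stream=True)` yields:

```python
# Sketch: drain a streamed chat completion so tokens reach the user as they
# arrive. Works with any iterable of chunks shaped like the openai SDK's
# streaming response, where choices[0].delta.content holds the next text
# fragment (and is None for bookkeeping chunks such as the final one).

def stream_tokens(chunk_stream):
    """Yield text deltas from a streaming chat-completion response."""
    for chunk in chunk_stream:
        if chunk.choices and chunk.choices[0].delta.content is not None:
            yield chunk.choices[0].delta.content
```

Usage: `for token in stream_tokens(client.chat.completions.create(..., stream=True)): print(token, end="", flush=True)` renders tokens as they arrive rather than after the full 3-second generation.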

Chunking Strategies

Chunking quality is the single biggest factor in RAG accuracy.

| Strategy | Chunk Size | Overlap | Best For | Trade-off |
|---|---|---|---|---|
| Fixed-size | 512 tokens | 128 tokens (25%) | General-purpose, fast | May split mid-sentence |
| Recursive | 256–1024 tokens | Paragraph boundaries | Structured docs (markdown, HTML) | Slower, better quality |
| Semantic | Variable | Embedding similarity threshold | Complex, varied documents | Expensive (requires embedding each segment) |
| Document-aware | Per section/page | None | PDFs, slides, legal docs | Requires layout understanding |
💡

Start with Fixed-Size

Fixed-size chunking with 512 tokens and 128-token overlap works for 80% of use cases. Only invest in semantic chunking when evaluation shows retrieval quality issues. See R3: Deterministic AI for evaluation methods.
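A minimal fixed-size chunker with overlap can be sketched as follows. Whitespace splitting stands in for real tokenization here; in practice, count tokens with the embedding model's tokenizer (e.g. tiktoken) so chunk sizes match the model's limits:

```python
# Minimal fixed-size chunker with overlap. Consecutive chunks share
# `overlap` tokens so that a fact split at a boundary still appears
# whole in at least one chunk.

def chunk_fixed(text: str, chunk_size: int = 512, overlap: int = 128) -> list[str]:
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    tokens = text.split()          # stand-in for a real tokenizer
    step = chunk_size - overlap    # stride between chunk starts
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break                  # last chunk reached the end of the text
    return chunks
```

With the defaults, a 1000-token document yields three chunks whose starts are 384 tokens apart, each sharing its last 128 tokens with the next.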

Hybrid Search

Modern RAG systems combine keyword search (BM25) and vector search for best results:

| Search Type | Strengths | Weaknesses |
|---|---|---|
| Keyword (BM25) | Exact matches, names, codes, acronyms | Misses synonyms and paraphrases |
| Vector | Semantic similarity, handles paraphrasing | Can miss exact terms, numbers |
| Hybrid | Best of both: precision + recall | Slightly higher latency |

Typical weight split: 50–70% vector, 30–50% keyword. Azure AI Search runs hybrid search natively when a single request supplies both search_text and vector_queries.
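The weighting intuition can be sketched as a linear fusion over normalized scores. Note this is only an illustration: Azure AI Search's built-in hybrid mode merges result lists with Reciprocal Rank Fusion rather than a linear blend, and the function below is hypothetical:

```python
# Illustrative fusion of keyword and vector scores with a 60/40
# vector/keyword split. Scores are min-max normalized per result list
# first so the two scales (BM25 vs cosine similarity) are comparable.

def fuse_scores(bm25: dict[str, float], vector: dict[str, float],
                vector_weight: float = 0.6) -> dict[str, float]:
    def normalize(scores: dict[str, float]) -> dict[str, float]:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero for flat lists
        return {doc: (s - lo) / span for doc, s in scores.items()}

    nb, nv = normalize(bm25), normalize(vector)
    return {
        doc: vector_weight * nv.get(doc, 0.0) + (1 - vector_weight) * nb.get(doc, 0.0)
        for doc in set(nb) | set(nv)
    }
```

A document ranked highly by both retrievers ends up above one favored by only a single retriever, which is the behavior hybrid search is after.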

A minimal end-to-end query function, using Managed Identity for both services per the security guidance below:

```python
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from azure.search.documents import SearchClient
from openai import AzureOpenAI

credential = DefaultAzureCredential()
search_client = SearchClient(
    endpoint="https://my-search.search.windows.net",
    index_name="knowledge-base",
    credential=credential,
)
oai_client = AzureOpenAI(
    azure_endpoint="https://my-oai.openai.azure.com/",
    api_version="2024-12-01-preview",
    azure_deployment="gpt-4o",
    # Managed Identity auth instead of an API key
    azure_ad_token_provider=get_bearer_token_provider(
        credential, "https://cognitiveservices.azure.com/.default"
    ),
)


def rag_query(question: str, top_k: int = 5) -> str:
    # 1. Hybrid search: supplying both search_text and vector_queries
    #    in one request triggers keyword + vector retrieval
    results = search_client.search(
        search_text=question,
        vector_queries=[{
            "kind": "text",  # server-side vectorization of the query
            "text": question,
            "fields": "content_vector",
            "k": top_k,
        }],
        query_type="semantic",
        semantic_configuration_name="default",
        top=top_k,
    )

    # 2. Assemble context with numbered source citations
    context_parts = []
    for i, r in enumerate(results, 1):
        context_parts.append(f"[{i}] {r['title']}: {r['content']}")
    context = "\n\n".join(context_parts)

    # 3. Generate a grounded response
    response = oai_client.chat.completions.create(
        model="gpt-4o",
        temperature=0.2,
        max_tokens=800,
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer using ONLY the provided context. "
                    "Cite sources as [1], [2], etc. "
                    "If the context doesn't contain the answer, say so."
                ),
            },
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```

Azure RAG Architecture

| Component | Azure Service | SKU (Dev / Prod) |
|---|---|---|
| Vector Store + Search | Azure AI Search | Free / Standard S2 |
| Embeddings | Azure OpenAI text-embedding-3-large | PAYG |
| Generation | Azure OpenAI gpt-4o | PAYG / PTU |
| Document Processing | Azure Document Intelligence | S0 |
| Storage | Azure Blob Storage | LRS / GRS |
| Orchestration | Azure Functions or Container Apps | Consumption / Dedicated |
⚠️

Always use Managed Identity for service-to-service auth; never embed API keys in application code. Use Azure Key Vault for any secrets that can't use Managed Identity. See FrootAI's security instructions for more.

Key Takeaways

  1. RAG = Ingestion (offline) + Query (online): optimize both independently
  2. Chunking quality drives retrieval quality: start with 512 tokens / 128 overlap
  3. Hybrid search (keyword + vector) beats either alone: use 50–70% vector weight
  4. Streaming hides latency: first tokens arrive in ~200 ms
  5. Cite sources: every RAG response must include retrievable references

FrootAI Play 01 (Enterprise RAG) and Play 21 (Agentic RAG) implement production-grade versions of these patterns with evaluation pipelines. Use O1: Semantic Kernel for orchestration.
