T3: Production Patterns
The gap between an AI demo and a production AI system is enormous. Demos ignore latency, cost, reliability, security, and quality monitoring. This module covers the architecture patterns that bridge that gap. For the security layer, see T2: Responsible AI. For infrastructure deployment, see O5: Infrastructure.
The AI Application Architecture Stack
Every production AI application follows this layered architecture:
┌──────────────────────────────────────┐
│  CLIENT (Web, Mobile, Teams, API)    │
├──────────────────────────────────────┤
│  API GATEWAY (APIM, rate limits,     │
│  auth, semantic caching, metering)   │
├──────────────────────────────────────┤
│  ORCHESTRATION (Semantic Kernel,     │
│  LangChain, custom; routing,         │
│  prompt construction, tool calls)    │
├──────────────────────────────────────┤
│  AI SERVICES (Azure OpenAI,          │
│  AI Search, Content Safety,          │
│  Document Intelligence)              │
├──────────────────────────────────────┤
│  DATA (Cosmos DB, Blob Storage,      │
│  SQL, vector indexes)                │
├──────────────────────────────────────┤
│  PLATFORM (Entra ID, Key Vault,      │
│  Monitor, Private Endpoints)         │
└──────────────────────────────────────┘
Hosting Patterns Decision Matrix
:::info Container Apps Is the Sweet Spot
For most AI applications, Azure Container Apps offers the best balance: built-in scaling, no cluster management, WebSocket support, and simple deployment. Only choose AKS when you need GPU node pools or fine-grained control, and Functions when you have pure event-driven workloads.
:::
| Aspect | Container Apps | AKS | App Service | Functions | Copilot Studio |
|---|---|---|---|---|---|
| Complexity | Medium | High | Low | Low | Very Low |
| Auto-scaling | ✅ KEDA-based | ✅ HPA/KEDA | ✅ Rule-based | ✅ Event-driven | ✅ Managed |
| GPU Support | ❌ | ✅ | ❌ | ❌ | ❌ |
| WebSocket | ✅ | ✅ | ✅ | ❌ | ❌ |
| Cold Start | ~2-5s | None | ~5-10s | ~1-10s | None |
| Min Cost (dev) | ~$15/mo | ~$200/mo | ~$13/mo | Pay-per-use | Per-message |
| Best For | Most AI apps | GPU/ML serving | Simple web apps | Event processing | No-code bots |
| Max Concurrency | 300/instance | Unlimited | 100/instance | 200/instance | Managed |
Decision flow: Need GPU? → AKS. Event-driven/stateless? → Functions. Low-code bot? → Copilot Studio. Everything else? → Container Apps.
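The decision flow above can be written as a small function. This is a sketch only: the `Workload` fields and the `choose_hosting` name are illustrative assumptions, not part of any Azure SDK, and a real decision would also weigh cost, team skills, and compliance.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical workload description; the field names are assumptions."""
    needs_gpu: bool = False
    event_driven: bool = False
    low_code_bot: bool = False


def choose_hosting(workload: Workload) -> str:
    """Apply the decision flow in order: GPU, events, low-code, then the default."""
    if workload.needs_gpu:
        return "AKS"
    if workload.event_driven:
        return "Functions"
    if workload.low_code_bot:
        return "Copilot Studio"
    return "Container Apps"


print(choose_hosting(Workload(needs_gpu=True)))  # AKS
print(choose_hosting(Workload()))                # Container Apps
```

The checks run in the same order as the flow, so a GPU requirement wins even when the workload is also event-driven.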
API Gateway for AI
Azure API Management (APIM) provides critical AI-specific capabilities:
Client ──▶ APIM ──▶ Backend Pool (multiple Azure OpenAI instances)
            │
            ├─ Rate Limiting (per user/tenant/IP)
            ├─ Semantic Caching (similarity > 0.95 → return cached)
            ├─ Token Metering (track usage per consumer)
            ├─ Load Balancing (round-robin across regions)
            ├─ Circuit Breaker (failover on 429/503)
            └─ Authentication (Entra ID token validation)
Key policies for AI workloads:
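<!-- Rate limiting: 100 calls per 60 seconds per subscription.
     This counts calls, not tokens; for token-based limits, see the azure-openai-token-limit policy. -->
<rate-limit-by-key
    calls="100" renewal-period="60"
    counter-key="@(context.Subscription.Id)" />

<!-- Semantic caching: store responses for 300 seconds (outbound section);
     pair with azure-openai-semantic-cache-lookup in the inbound section -->
<azure-openai-semantic-cache-store duration="300" />

<!-- Token metering: emit token-usage metrics to Application Insights -->
<azure-openai-emit-token-metric>
    <dimension name="Subscription" value="@(context.Subscription.Id)" />
</azure-openai-emit-token-metric>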
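To make the semantic caching idea concrete, here is a minimal in-memory sketch of the "similarity > 0.95 → return cached" rule. It is not how APIM implements its cache: the `SemanticCache` class is hypothetical, embeddings are assumed to be computed elsewhere and passed in as plain float lists, and expiry (the `duration` in the policy) is left out for brevity.

```python
import math
from typing import List, Optional, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Hypothetical cache keyed by embedding similarity instead of exact text."""

    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self._entries: List[Tuple[List[float], str]] = []

    def store(self, embedding: List[float], response: str) -> None:
        self._entries.append((embedding, response))

    def lookup(self, embedding: List[float]) -> Optional[str]:
        """Return the best-matching response if it clears the threshold."""
        best_score, best_response = 0.0, None
        for cached_embedding, response in self._entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None


cache = SemanticCache()
cache.store([1.0, 0.0], "Paris is the capital of France.")
print(cache.lookup([0.99, 0.05]))  # near-duplicate query: cache hit
print(cache.lookup([0.0, 1.0]))    # unrelated query: None
```

The linear scan is only suitable for illustration; a production cache would normally use a vector index and expire entries.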
Resilience Patterns
Retry with Exponential Backoff
All external API calls need retry logic. Default: 3 retries at 1s / 2s / 4s with jitter:
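from openai import AsyncAzureOpenAI
from tenacity import retry, stop_after_attempt, wait_exponential, wait_random

# Reads AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, and OPENAI_API_VERSION from the environment
client = AsyncAzureOpenAI()

@retry(
    stop=stop_after_attempt(4),  # 1 initial call + 3 retries
    # Waits of 1s / 2s / 4s, plus up to 1s of random jitter
    wait=wait_exponential(multiplier=1, min=1, max=10) + wait_random(0, 1),
    reraise=True,  # surface the last error to the caller's fallback logic
)
async def call_openai(prompt: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4o",  # Azure OpenAI deployment name
        messages=[{"role": "user", "content": prompt}],
        timeout=30,
    )
    return response.choices[0].message.content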
Circuit Breaker
When a downstream service fails repeatedly, stop calling it temporarily:
| State | Behavior | Transition |
|---|---|---|
| Closed | Requests pass through normally | → Open after N failures |
| Open | Requests fail immediately (return fallback) | → Half-Open after timeout |
| Half-Open | Allow one test request | → Closed if success, Open if failure |
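The state table above maps onto a small state machine. The following is a single-threaded sketch under stated assumptions: the `CircuitBreaker` class, its default threshold and timeout, and the injectable `clock` are all illustrative choices, not a library API.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Hypothetical breaker: closed -> open after N failures -> half-open after a timeout."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None
        self._probe_in_flight = False

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def allow_request(self) -> bool:
        """Closed: always allow. Half-open: allow a single test request. Open: reject."""
        state = self.state
        if state == "closed":
            return True
        if state == "half-open" and not self._probe_in_flight:
            self._probe_in_flight = True
            return True
        return False

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None
        self._probe_in_flight = False

    def record_failure(self) -> None:
        self._probe_in_flight = False
        if self._opened_at is not None:
            # The half-open test request failed: reopen and restart the timeout.
            self._opened_at = self._clock()
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

A caller checks `allow_request()` before each downstream call, returns the fallback when it is `False`, and reports the outcome with `record_success()` or `record_failure()`. Concurrent use would need a lock around the state changes.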
Fallback Strategy
Primary: Azure OpenAI (East US)
    ↓ 429/503?
Fallback 1: Azure OpenAI (West US)
    ↓ 429/503?
Fallback 2: Cached response (semantic match)
    ↓ No cache hit?
Fallback 3: "I'm currently experiencing high demand. Please try again."
:::warning Never Return Raw Errors
Never expose raw model errors or stack traces to users. Always return a user-friendly fallback message. Log the full error internally with a correlation ID for debugging.
:::
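The fallback chain and the logging advice can be combined in one function. This is a synchronous sketch for readability: `ServiceUnavailable`, `answer_with_fallback`, and the callable-based backends are hypothetical names standing in for real regional clients and a real cache.

```python
import logging
import uuid
from typing import Callable, Optional, Sequence

logger = logging.getLogger("ai.fallback")

FALLBACK_MESSAGE = "I'm currently experiencing high demand. Please try again."


class ServiceUnavailable(Exception):
    """Hypothetical stand-in for a 429/503 response from a model endpoint."""


def answer_with_fallback(
    prompt: str,
    backends: Sequence[Callable[[str], str]],
    cache_lookup: Callable[[str], Optional[str]],
) -> str:
    """Try each backend in order, then the cache, then a friendly message."""
    correlation_id = str(uuid.uuid4())
    for backend in backends:
        try:
            return backend(prompt)
        except ServiceUnavailable:
            # Full details stay in the logs; the user never sees them.
            logger.exception("Backend unavailable (correlation_id=%s)", correlation_id)
    cached = cache_lookup(prompt)
    if cached is not None:
        return cached
    logger.error("All fallbacks exhausted (correlation_id=%s)", correlation_id)
    return FALLBACK_MESSAGE
```

Only the stand-in `ServiceUnavailable` error moves on to the next backend here. Which other exceptions should trigger failover is an application decision.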
Cost Control Patterns
| Pattern | Savings | Implementation |
|---|---|---|
| Model routing | 60-80% | GPT-4o-mini for simple queries, GPT-4o for complex |
| Semantic caching | 30-50% | Cache responses for similar queries (cosine > 0.95) |
| Token budgets | 20-40% | Set max_tokens per request type |
| Prompt compression | 10-30% | Remove redundant instructions from prompts |
| Off-peak scheduling | 10-20% | Batch non-urgent work to low-traffic hours |
Model routing example: classify query complexity, then route:
async def route_model(query: str) -> str:
    # is_simple_query is an application-defined classifier (not shown here)
    # Simple: factual lookup, formatting, classification
    if is_simple_query(query):
        return "gpt-4o-mini"  # ~$0.15/1M input tokens
    # Complex: reasoning, analysis, multi-step
    return "gpt-4o"  # ~$2.50/1M input tokens
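A quick calculation shows where the routing savings come from. The sketch below uses only the approximate input-token prices quoted in the routing example, and assumes that simple and complex queries consume similar token volumes and that output-token costs are ignored. The function names are illustrative.

```python
# Approximate USD prices per 1M input tokens, as quoted in the routing example.
PRICE_PER_1M_INPUT = {"gpt-4o-mini": 0.15, "gpt-4o": 2.50}


def blended_input_price(simple_share: float) -> float:
    """Average price per 1M input tokens when `simple_share` of traffic goes to the small model."""
    if not 0.0 <= simple_share <= 1.0:
        raise ValueError("simple_share must be between 0 and 1")
    return (simple_share * PRICE_PER_1M_INPUT["gpt-4o-mini"]
            + (1.0 - simple_share) * PRICE_PER_1M_INPUT["gpt-4o"])


def savings_vs_gpt4o_only(simple_share: float) -> float:
    """Fractional saving compared with sending every query to gpt-4o."""
    return 1.0 - blended_input_price(simple_share) / PRICE_PER_1M_INPUT["gpt-4o"]


print(f"{blended_input_price(0.7):.3f}")    # 0.855
print(f"{savings_vs_gpt4o_only(0.7):.1%}")  # 65.8%
```

With 70% of traffic routed to the smaller model, the blended input price is about $0.855 per 1M tokens, a saving of roughly 66% under these assumptions.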
Monitoring & Observability
Custom AI Metrics
Track these in Application Insights with correlation IDs:
| Metric | What to Track | Alert Threshold |
|---|---|---|
| Latency | P50, P95, P99 per model | P95 > 5s |
| Token usage | Input + output tokens per request | > budget by 20% |
| Groundedness | Evaluation score per response | < 4.0 average |
| Error rate | 429s, 503s, content filter blocks | > 5% |
| Cost | Daily/weekly/monthly spend | > budget by 10% |
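As an illustration of how two of these thresholds might be evaluated, the sketch below computes a nearest-rank percentile over a window of latencies and checks the P95 and error-rate rules from the table. It covers only the threshold logic: the function names are assumptions, and in practice the values would be emitted as custom metrics and alerted on in Application Insights.

```python
import math
from typing import List, Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate_alerts(latencies_s: Sequence[float], error_count: int,
                    request_count: int) -> List[str]:
    """Return the names of the alert rules that fire for this window."""
    alerts = []
    if latencies_s and percentile(latencies_s, 95) > 5.0:        # P95 > 5s
        alerts.append("latency_p95")
    if request_count and error_count / request_count > 0.05:     # error rate > 5%
        alerts.append("error_rate")
    return alerts


window = [1.0] * 90 + [6.0] * 10
print(percentile(window, 50))            # 1.0
print(percentile(window, 95))            # 6.0
print(evaluate_alerts(window, 6, 100))   # ['latency_p95', 'error_rate']
```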
Production Readiness Checklist
| Category | ✅ Required |
|---|---|
| Security | Managed Identity, Key Vault, RBAC, private endpoints, content filtering |
| Reliability | Retry (3x exponential), circuit breaker, fallback responses, health endpoint |
| Performance | Streaming responses, semantic caching, model routing, async I/O |
| Cost | Token budgets, model routing, rate limiting, spending alerts |
| Monitoring | Application Insights, custom AI metrics, correlation IDs, dashboards |
| Compliance | Content safety enabled, audit logging, PII redaction, data residency |
| Operations | CI/CD pipeline, IaC (Bicep), blue-green deploy, runbooks |
| Evaluation | Automated eval pipeline, groundedness ≥ 4.0, A/B testing |
Key Takeaways
- Layer your architecture: gateway, orchestration, services, data, platform
- Container Apps for most workloads: AKS only for GPU, Functions only for events
- Retry everything: 3 retries, exponential backoff, always have a fallback
- Route by complexity: GPT-4o-mini handles 70%+ of queries at 10% of the cost
- Monitor AI-specific metrics: latency, tokens, groundedness, not just HTTP status codes
Previous: T2: Responsible AI. Explore the Solution Plays Overview for production-ready implementations of these patterns.