T3: Production Patterns

The gap between an AI demo and a production AI system is enormous. Demos ignore latency, cost, reliability, security, and quality monitoring. This module covers the architecture patterns that bridge that gap. For the security layer, see T2: Responsible AI. For infrastructure deployment, see O5: Infrastructure.

The AI Application Architecture Stack

Most production AI applications follow a layered architecture like this one:

┌──────────────────────────────────────┐
│ CLIENT (Web, Mobile, Teams, API)     │
├──────────────────────────────────────┤
│ API GATEWAY (APIM, rate limits,      │
│ auth, semantic caching, metering)    │
├──────────────────────────────────────┤
│ ORCHESTRATION (Semantic Kernel,      │
│ LangChain, custom: routing,          │
│ prompt construction, tool calls)     │
├──────────────────────────────────────┤
│ AI SERVICES (Azure OpenAI,           │
│ AI Search, Content Safety,           │
│ Document Intelligence)               │
├──────────────────────────────────────┤
│ DATA (Cosmos DB, Blob Storage,       │
│ SQL, vector indexes)                 │
├──────────────────────────────────────┤
│ PLATFORM (Entra ID, Key Vault,       │
│ Monitor, Private Endpoints)          │
└──────────────────────────────────────┘

Hosting Patterns Decision Matrix

:::info Container Apps Is the Sweet Spot
For most AI applications, Azure Container Apps offers the best balance: built-in scaling, no cluster management, WebSocket support, and simple deployment. Choose AKS only when you need GPU node pools or fine-grained control, and Functions only for purely event-driven workloads.
:::

| Aspect | Container Apps | AKS | App Service | Functions | Copilot Studio |
|---|---|---|---|---|---|
| Complexity | Medium | High | Low | Low | Very Low |
| Auto-scaling | ✅ KEDA-based | ✅ HPA/KEDA | ✅ Rule-based | ✅ Event-driven | ✅ Managed |
| GPU Support | ❌ | ✅ | ❌ | ❌ | ❌ |
| WebSocket | ✅ | ✅ | ✅ | ❌ | ❌ |
| Cold Start | ~2-5s | None | ~5-10s | ~1-10s | None |
| Min Cost (dev) | ~$15/mo | ~$200/mo | ~$13/mo | Pay-per-use | Per-message |
| Best For | Most AI apps | GPU/ML serving | Simple web apps | Event processing | No-code bots |
| Max Concurrency | 300/instance | Unlimited | 100/instance | 200/instance | Managed |

Decision flow: Need GPU? → AKS. Event-driven/stateless? → Functions. Low-code bot? → Copilot Studio. Everything else? → Container Apps.
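The decision flow above can be written down as a small function. This is only a sketch: the `Workload` fields and the `choose_hosting` name are hypothetical, and real decisions usually weigh more factors (cost, team skills, networking) than these three flags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical workload description; the field names are illustrative."""
    needs_gpu: bool = False
    event_driven: bool = False
    low_code_bot: bool = False


def choose_hosting(workload: Workload) -> str:
    """Apply the decision flow in order; the first matching rule wins."""
    if workload.needs_gpu:
        return "AKS"
    if workload.event_driven:
        return "Functions"
    if workload.low_code_bot:
        return "Copilot Studio"
    return "Container Apps"
```

Because the rules are checked in order, a workload that is both GPU-bound and event-driven resolves to AKS, matching the order of the questions in the flow.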

API Gateway for AI

Azure API Management (APIM) provides critical AI-specific capabilities:

Client ──▶ APIM ──▶ Backend Pool (multiple Azure OpenAI instances)
            │
            ├─ Rate Limiting (per user/tenant/IP)
            ├─ Semantic Caching (similarity > 0.95 → return cached)
            ├─ Token Metering (track usage per consumer)
            ├─ Load Balancing (round-robin across regions)
            ├─ Circuit Breaker (failover on 429/503)
            └─ Authentication (Entra ID token validation)

Key policies for AI workloads:

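<!-- Rate limiting by call count per subscription (100 calls per 60 seconds).
     For limits based on token consumption, use the azure-openai-token-limit policy. -->
<rate-limit-by-key
    calls="100" renewal-period="60"
    counter-key="@(context.Subscription.Id)" />

<!-- Semantic caching (outbound): store responses for 300 seconds.
     Pair with azure-openai-semantic-cache-lookup in the inbound section. -->
<azure-openai-semantic-cache-store duration="300" />

<!-- Token metering (emits custom metrics to Application Insights) -->
<azure-openai-emit-token-metric>
    <dimension name="Subscription" value="@(context.Subscription.Id)" />
</azure-openai-emit-token-metric>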
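To make the semantic caching step concrete, here is a minimal in-process sketch of the idea: compare the embedding of a new prompt with cached embeddings and return the cached response when cosine similarity exceeds 0.95. It is not how APIM implements the policy. The `SemanticCache` class, its linear scan, and the assumption that the caller supplies embeddings are all illustrative; the 300-second TTL mirrors the `duration` value in the policy above.

```python
import math
import time
from typing import Callable, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Hypothetical in-memory cache keyed by prompt embedding."""

    def __init__(
        self,
        threshold: float = 0.95,
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        # Each entry is (embedding, response, expiry time).
        self._entries: List[Tuple[List[float], str, float]] = []

    def store(self, embedding: Sequence[float], response: str) -> None:
        expires_at = self._clock() + self.ttl_seconds
        self._entries.append((list(embedding), response, expires_at))

    def lookup(self, embedding: Sequence[float]) -> Optional[str]:
        now = self._clock()
        # Drop expired entries before searching.
        self._entries = [e for e in self._entries if e[2] > now]
        best_score, best_response = 0.0, None
        for cached_embedding, response, _ in self._entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        # Only a match above the similarity threshold counts as a hit.
        return best_response if best_score > self.threshold else None
```

A linear scan is fine for a sketch; at production scale the lookup would normally go through a vector index instead.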

Resilience Patterns

Retry with Exponential Backoff

All external API calls need retry logic. Default: 3 retries at 1s / 2s / 4s with jitter:

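from openai import APIConnectionError, AsyncAzureOpenAI, InternalServerError, RateLimitError
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
    wait_random,
)

# Reads AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY and OPENAI_API_VERSION from the environment.
# max_retries=0 turns off the SDK's built-in retries so tenacity is the only retry layer.
client = AsyncAzureOpenAI(max_retries=0)

@retry(
    # Retry only transient failures: 429s, 5xx responses, and connection/timeout errors.
    retry=retry_if_exception_type((RateLimitError, InternalServerError, APIConnectionError)),
    stop=stop_after_attempt(4),  # 1 initial call + 3 retries
    wait=wait_exponential(multiplier=1, min=1, max=10) + wait_random(0, 1),  # ~1s / 2s / 4s plus jitter
    reraise=True,  # surface the original exception once retries are exhausted
)
async def call_openai(prompt: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4o",  # on Azure OpenAI this is the deployment name
        messages=[{"role": "user", "content": prompt}],
        timeout=30,
    )
    return response.choices[0].message.content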

Circuit Breaker

When a downstream service fails repeatedly, stop calling it temporarily:

| State | Behavior | Transition |
|---|---|---|
| Closed | Requests pass through normally | → Open after N failures |
| Open | Requests fail immediately (return fallback) | → Half-Open after timeout |
| Half-Open | Allow one test request | → Closed if success, Open if failure |
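The three states can be sketched as a small class. This is a single-threaded illustration under stated assumptions, not a production implementation: the `CircuitBreaker` name, its default thresholds, and the injectable `clock` are all hypothetical, and a real service would add locking or use an existing library.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Hypothetical circuit breaker with closed, open, and half_open states."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                return fallback()  # Open: fail fast without calling downstream
            self.state = "half_open"  # Timeout elapsed: allow one test request
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed test request, or N consecutive failures, opens the circuit.
            if self.state == "half_open" or self._failures >= self.failure_threshold:
                self.state = "open"
                self._opened_at = self._clock()
            return fallback()
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self.state = "closed"
        return result
```

Passing the clock in as a parameter keeps the timeout logic testable without sleeping.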

Fallback Strategy

Primary: Azure OpenAI (East US)
    ↓ 429/503?
Fallback 1: Azure OpenAI (West US)
    ↓ 429/503?
Fallback 2: Cached response (semantic match)
    ↓ No cache hit?
Fallback 3: "I'm currently experiencing high demand. Please try again."

:::warning Never Return Raw Errors
Never expose raw model errors or stack traces to users. Always return a user-friendly fallback message. Log the full error internally with a correlation ID for debugging.
:::
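The fallback chain and the error-handling rule can be combined in one function. The sketch below makes several assumptions: providers are plain callables, `UpstreamUnavailable` is a hypothetical exception that a provider wrapper raises on HTTP 429 or 503, and `cache_lookup` stands in for whatever semantic cache is in use. Unexpected errors are also logged and treated as a reason to move to the next step, which is a design choice rather than something the flow above prescribes.

```python
import logging
import uuid
from typing import Callable, Optional, Sequence, Tuple

logger = logging.getLogger("ai.fallback")

FALLBACK_MESSAGE = "I'm currently experiencing high demand. Please try again."


class UpstreamUnavailable(Exception):
    """Hypothetical error raised by a provider wrapper on HTTP 429 or 503."""


def answer_with_fallback(
    prompt: str,
    providers: Sequence[Tuple[str, Callable[[str], str]]],
    cache_lookup: Callable[[str], Optional[str]],
) -> str:
    """Try each provider in order, then the cache, then a friendly message."""
    correlation_id = str(uuid.uuid4())
    for name, call in providers:
        try:
            return call(prompt)
        except UpstreamUnavailable:
            # Expected throttling or outage: note it and try the next provider.
            logger.warning(
                "provider=%s unavailable correlation_id=%s",
                name, correlation_id, exc_info=True,
            )
        except Exception:
            # Unexpected failure: keep the full traceback in the logs only.
            logger.exception(
                "provider=%s failed correlation_id=%s", name, correlation_id
            )
    cached = cache_lookup(prompt)
    if cached is not None:
        return cached
    logger.error("all fallbacks exhausted correlation_id=%s", correlation_id)
    return FALLBACK_MESSAGE  # Never the raw error or stack trace
```

The caller only ever sees a model response, a cached response, or the fixed message; the correlation ID ties the log entries for one request together.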

Cost Control Patterns

| Pattern | Savings | Implementation |
|---|---|---|
| Model routing | 60-80% | GPT-4o-mini for simple queries, GPT-4o for complex |
| Semantic caching | 30-50% | Cache responses for similar queries (cosine > 0.95) |
| Token budgets | 20-40% | Set max_tokens per request type |
| Prompt compression | 10-30% | Remove redundant instructions from prompts |
| Off-peak scheduling | 10-20% | Batch non-urgent work to low-traffic hours |

Model routing example (classify query complexity, then route):

async def route_model(query: str) -> str:
    """Return the model (deployment) name to use for this query."""
    # is_simple_query is a placeholder for your own complexity classifier.
    # Simple: factual lookup, formatting, classification
    if is_simple_query(query):
        return "gpt-4o-mini"  # ~$0.15/1M input tokens
    # Complex: reasoning, analysis, multi-step
    return "gpt-4o"  # ~$2.50/1M input tokens
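As a rough illustration, `is_simple_query` could start as a keyword and length heuristic, and the savings from routing can be estimated from the two input prices quoted above. Everything here is an assumption for the sake of the example: the hint words, the 20-word cutoff, and the helper names are hypothetical, and the estimate covers input tokens only.

```python
# Approximate USD prices per 1M input tokens, taken from the routing example.
PRICE_PER_1M_INPUT = {"gpt-4o-mini": 0.15, "gpt-4o": 2.50}

# Hypothetical phrases that suggest reasoning or multi-step work.
COMPLEX_HINTS = ("why", "compare", "analyze", "explain", "step by step", "trade-off")


def is_simple_query(query: str, max_words: int = 20) -> bool:
    """Heuristic only: short queries without reasoning keywords count as simple."""
    text = query.lower()
    if len(text.split()) > max_words:
        return False
    return not any(hint in text for hint in COMPLEX_HINTS)


def blended_input_price(simple_share: float) -> float:
    """Average price per 1M input tokens when simple_share goes to gpt-4o-mini."""
    if not 0.0 <= simple_share <= 1.0:
        raise ValueError("simple_share must be between 0 and 1")
    return (
        simple_share * PRICE_PER_1M_INPUT["gpt-4o-mini"]
        + (1.0 - simple_share) * PRICE_PER_1M_INPUT["gpt-4o"]
    )


def savings_vs_all_gpt4o(simple_share: float) -> float:
    """Fractional saving compared with sending every query to gpt-4o."""
    return 1.0 - blended_input_price(simple_share) / PRICE_PER_1M_INPUT["gpt-4o"]
```

With 70% of traffic routed to gpt-4o-mini, the blended input price works out to about $0.855 per 1M tokens, roughly 66% below sending everything to gpt-4o, which sits inside the 60-80% range in the table. In practice a small classifier model often replaces the keyword heuristic.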

Monitoring & Observability

Custom AI Metrics

Track these in Application Insights with correlation IDs:

| Metric | What to Track | Alert Threshold |
|---|---|---|
| Latency | P50, P95, P99 per model | P95 > 5s |
| Token usage | Input + output tokens per request | > 20% over budget |
| Groundedness | Evaluation score per response | < 4.0 average |
| Error rate | 429s, 503s, content filter blocks | > 5% |
| Cost | Daily/weekly/monthly spend | > 10% over budget |
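The alert thresholds in the table can be checked with a few lines of code. The sketch below assumes the metrics for one time window have already been collected; `MetricsWindow`, `fired_alerts`, and the nearest-rank `percentile` helper are hypothetical names, and in practice these rules would more likely live in Azure Monitor alert definitions than in application code.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


@dataclass
class MetricsWindow:
    """Hypothetical snapshot of the metrics for one monitoring window."""
    latencies_s: Sequence[float]
    tokens_used: int
    token_budget: int
    groundedness_scores: Sequence[float]
    requests: int
    errors: int
    spend_usd: float
    spend_budget_usd: float


def fired_alerts(w: MetricsWindow) -> List[str]:
    """Return the names of the alerts whose thresholds are exceeded."""
    alerts: List[str] = []
    if w.latencies_s and percentile(w.latencies_s, 95) > 5.0:
        alerts.append("latency")  # P95 > 5s
    if w.tokens_used > 1.20 * w.token_budget:
        alerts.append("token_usage")  # more than 20% over budget
    if w.groundedness_scores:
        average = sum(w.groundedness_scores) / len(w.groundedness_scores)
        if average < 4.0:
            alerts.append("groundedness")  # average below 4.0
    if w.requests and w.errors / w.requests > 0.05:
        alerts.append("error_rate")  # more than 5% of requests failed
    if w.spend_usd > 1.10 * w.spend_budget_usd:
        alerts.append("cost")  # more than 10% over budget
    return alerts
```

Percentiles are used for latency because a handful of slow requests can hide behind a healthy-looking average.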

Production Readiness Checklist

| Category | ✅ Required |
|---|---|
| Security | Managed Identity, Key Vault, RBAC, private endpoints, content filtering |
| Reliability | Retry (3x exponential), circuit breaker, fallback responses, health endpoint |
| Performance | Streaming responses, semantic caching, model routing, async I/O |
| Cost | Token budgets, model routing, rate limiting, spending alerts |
| Monitoring | Application Insights, custom AI metrics, correlation IDs, dashboards |
| Compliance | Content safety enabled, audit logging, PII redaction, data residency |
| Operations | CI/CD pipeline, IaC (Bicep), blue-green deploy, runbooks |
| Evaluation | Automated eval pipeline, groundedness ≥ 4.0, A/B testing |

Key Takeaways

  1. Layer your architecture: gateway, orchestration, services, data, platform
  2. Container Apps for most workloads: AKS only for GPU, Functions only for events
  3. Retry everything: 3 retries, exponential backoff, always have a fallback
  4. Route by complexity: GPT-4o-mini handles 70%+ of queries at under 10% of the input-token price
  5. Monitor AI-specific metrics: latency, tokens, groundedness, not just HTTP status codes

Previous: T2: Responsible AI. Explore the Solution Plays Overview for production-ready implementations of these patterns.