# Error Handling & Recovery Patterns

Production-grade error handling for MCP servers, FAI Engine, Azure SDK calls, and LLM API interactions.
## Error Sources in AI Systems
| Source | Example | Frequency |
|---|---|---|
| LLM API | Rate limits, timeout, content filter | High |
| Azure SDK | Transient network, auth expiry | Medium |
| MCP transport | Connection drop, malformed JSON | Medium |
| User input | Prompt injection, invalid queries | High |
| Infrastructure | Cold start, memory pressure | Low |
## Pattern 1: Retry with Exponential Backoff
### Python

```python
import httpx
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

class TransientError(Exception):
    """Raised for errors worth retrying (rate limits, 5xx)."""

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=30),
    retry=retry_if_exception_type(TransientError),
)
async def call_openai(client, messages, max_tokens=500):
    try:
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            max_tokens=max_tokens,
            timeout=30,
        )
        return response.choices[0].message.content
    except httpx.HTTPStatusError as e:
        if e.response.status_code == 429:
            raise TransientError(f"Rate limited: {e}")
        if e.response.status_code >= 500:
            raise TransientError(f"Server error: {e}")
        raise  # Non-retryable
```
### Node.js

```javascript
async function callOpenAI(client, messages, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const response = await client.chat.completions.create({
        model: 'gpt-4o',
        messages,
        max_tokens: 500,
      });
      return response.choices[0].message.content;
    } catch (error) {
      const status = error?.status;
      if ((status === 429 || status >= 500) && attempt < maxRetries) {
        const delay = Math.min(1000 * Math.pow(2, attempt), 30000);
        await new Promise(r => setTimeout(r, delay));
        continue;
      }
      throw error;
    }
  }
}
```
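Fixed exponential backoff can synchronize retries across many clients, producing a thundering herd against a recovering service. A common refinement is to randomize the delay ("full jitter"); a minimal sketch (the `backoff_delay` helper is illustrative, not part of either snippet above):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full jitter: pick a random delay between 0 and the exponential
    ceiling for this attempt, capped at `cap` seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

# Delays for attempts 1..3 land somewhere in [0, 2], [0, 4], [0, 8] seconds.
```

tenacity users can get the same behavior by swapping `wait_exponential` for `wait_random_exponential`.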
## Pattern 2: Circuit Breaker

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.last_failure_time = 0
        self.state = "closed"  # closed | open | half-open

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = "half-open"
            else:
                raise Exception("Circuit breaker OPEN")
        try:
            result = func(*args, **kwargs)
            if self.state == "half-open":
                self.state = "closed"
                self.failure_count = 0
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = "open"
            raise
```
## Pattern 3: MCP Server Error Handling

```python
import json
import logging

logger = logging.getLogger(__name__)

@mcp.tool()
async def search_knowledge(query: str, max_results: int = 5) -> str:
    """Search FROOT knowledge modules."""
    if not query or len(query) > 500:
        return json.dumps({"error": "Query must be 1-500 characters"})
    try:
        results = perform_search(query, max_results)
        return json.dumps({"results": results})
    except FileNotFoundError:
        return json.dumps({"error": "Knowledge base not found"})
    except Exception as e:
        logger.error(f"Search failed: {e}", exc_info=True)
        return json.dumps({"error": "Search temporarily unavailable"})
```
:::warning Never Raise in MCP Tools

MCP tools must return JSON errors, never propagate exceptions. The AI model can't recover from a crashed tool.

:::
## Pattern 4: Timeout Wrapper

```javascript
function withTimeout(promise, ms, label = 'Operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

const result = await withTimeout(callOpenAI(client, messages), 30000, 'Azure OpenAI');
```
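The same guard in Python can lean on `asyncio.wait_for`; a minimal sketch (the `with_timeout` helper name is illustrative):

```python
import asyncio

async def with_timeout(coro, seconds: float, label: str = "Operation"):
    """Raise TimeoutError if `coro` does not finish within `seconds`."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        raise TimeoutError(f"{label} timed out after {seconds}s") from None
```

Unlike the `Promise.race` version, where the losing promise keeps running in the background, `asyncio.wait_for` also cancels the underlying task on timeout.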
## Decision Matrix
| Error Type | Retry? | Fallback | User Message |
|---|---|---|---|
| 429 Rate Limit | ✅ backoff | Queue request | "Please wait a moment" |
| 500 Server Error | ✅ 3 attempts | Cached response | "Temporarily unavailable" |
| 401 Auth Expired | ❌ | Refresh token | "Please re-authenticate" |
| 400 Bad Request | ❌ | Fix request | "Invalid input: [details]" |
| Timeout | ✅ 1 retry | Cached response | "Request took too long" |
| Content Filter | ❌ | Rephrase | "Content could not be processed" |
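The retry column of the matrix reduces to a small predicate that both Pattern 1 snippets could share; a sketch (names assumed):

```python
# Transient HTTP statuses worth retrying: rate limits and server errors.
RETRYABLE_STATUSES = {429} | set(range(500, 600))

def should_retry(status_code: int, attempt: int, max_attempts: int = 3) -> bool:
    """True only for transient statuses with retry attempts remaining."""
    return status_code in RETRYABLE_STATUSES and attempt < max_attempts
```

Centralizing the decision keeps the retry policy consistent when new call sites are added.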
## Best Practices
- Always set `max_tokens` → prevent token budget overruns
- Always set timeouts → no call should wait forever
- Retry only transient errors → 429, 500+, network timeouts
- Never retry 400/401/403 → these are permanent failures
- Log structured JSON → not `console.log` strings
- Include correlation IDs → trace errors across services
- Validate at boundaries → MCP tool inputs, API params, user queries
- Degrade gracefully → cached response > simpler model > error message
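The last bullet, "degrade gracefully", can be made concrete as an ordered fallback chain; a minimal sketch, with illustrative function and parameter names:

```python
def answer_with_fallbacks(query, primary, cache,
                          fallback_message="Service temporarily unavailable"):
    """Try the live model first, fall back to a cached answer,
    and only then surface a static error message."""
    try:
        return primary(query)
    except Exception:
        cached = cache.get(query)
        if cached is not None:
            return cached
        return fallback_message

def broken_model(query):
    """Stand-in for a live call whose upstream is down."""
    raise ConnectionError("upstream down")
```

Each rung returns something usable, so the user sees a degraded answer rather than a stack trace.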
## See Also
- Build an MCP Server → MCP error patterns
- Reliability WAF → reliability pillar