Why Model Routing Matters in Production AI Agents
When you build an AI agent that needs to handle thousands of requests, you quickly discover that relying on a single LLM provider or model creates fragility. One provider hiccup means your entire system stalls. Costs spike unpredictably. And a model that works brilliantly for complex reasoning might be overkill and expensive for simple classification tasks.
Model routing and fallback strategies solve these problems by letting you define policies that automatically select the right model for each task, switch to alternatives when something fails, and optimize spending without compromising quality. In Microsoft Agent Framework, this happens at the IChatClient level, which means you can layer this logic cleanly using dependency injection and middleware patterns.
Setting Up Microsoft Foundry as Your LLM Backend
Microsoft Foundry provides a unified interface to multiple LLM providers. Instead of hardcoding calls to Azure OpenAI, OpenAI, or other services directly, you configure them through Foundry, which gives you flexibility and consistency.
Start with a basic IChatClient configuration in your dependency injection container:
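// Requires the Azure.AI.OpenAI, Microsoft.Extensions.AI, and Microsoft.Extensions.AI.OpenAI packages.
// Extension names have changed across Microsoft.Extensions.AI releases; this follows the GA naming.
services.AddChatClient(_ =>
    new AzureOpenAIClient(
            new Uri("https://your-resource.openai.azure.com"),
            new ApiKeyCredential("your-api-key")) // placeholder: load from configuration, not source
        .GetChatClient("gpt-4")                   // Azure OpenAI deployment name
        .AsIChatClient());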
This creates a single chat client. But for production, you need more. You need to define which models are available, when to use them, and what to do when they fail.
Implementing Fallback Chains
A fallback chain is an ordered list of models to try in sequence. For example: attempt GPT-4 first (best quality, highest cost), then GPT-4o (good quality, moderate cost), then GPT-3.5 Turbo (lower quality, lowest cost).
You implement this by wrapping your chat client with fallback logic. Here’s a pattern that works well:
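// Requires: System.Runtime.CompilerServices, Microsoft.Extensions.AI, Microsoft.Extensions.Logging
public class FallbackChatClient : IChatClient
{
    private readonly IReadOnlyList<IChatClient> _clients;
    private readonly ILogger<FallbackChatClient> _logger;

    public FallbackChatClient(IReadOnlyList<IChatClient> clients,
        ILogger<FallbackChatClient> logger)
    {
        _clients = clients;
        _logger = logger;
    }

    public async Task<ChatResponse> GetResponseAsync(
        IEnumerable<ChatMessage> messages,
        ChatOptions? options = null,
        CancellationToken cancellationToken = default)
    {
        foreach (var client in _clients)
        {
            try
            {
                return await client.GetResponseAsync(messages, options, cancellationToken);
            }
            catch (Exception ex) when (ex is not OperationCanceledException)
            {
                // Cancellation is not a provider failure, so it is never retried.
                _logger.LogWarning(ex, "Client failed, trying next");
            }
        }
        throw new InvalidOperationException("All fallback clients exhausted.");
    }

    public async IAsyncEnumerable<ChatResponseUpdate> GetStreamingResponseAsync(
        IEnumerable<ChatMessage> messages,
        ChatOptions? options = null,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        foreach (var client in _clients)
        {
            await using var stream = client
                .GetStreamingResponseAsync(messages, options, cancellationToken)
                .GetAsyncEnumerator(cancellationToken);
            var started = false;
            while (true)
            {
                bool hasNext;
                try
                {
                    // C# forbids yield return inside try/catch, so only MoveNextAsync is guarded.
                    hasNext = await stream.MoveNextAsync();
                }
                catch (Exception ex) when (!started && ex is not OperationCanceledException)
                {
                    // Fall back only if nothing has been sent to the caller yet.
                    _logger.LogWarning(ex, "Client failed, trying next");
                    break;
                }
                if (!hasNext) yield break;
                started = true;
                yield return stream.Current;
            }
        }
        throw new InvalidOperationException("All fallback clients exhausted.");
    }

    public object? GetService(Type serviceType, object? serviceKey = null) =>
        serviceKey is null && serviceType.IsInstanceOfType(this) ? this : null;

    // Inner clients are owned by the DI container, so nothing is disposed here.
    public void Dispose() { }
}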
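This wrapper tries each client in order. If one throws an exception, it logs the failure and moves to the next. If all fail, it throws. Two details matter for streaming. First, C# does not permit yield return inside a try block that has a catch clause, so the enumerator has to be advanced inside the try and the update yielded outside it. Second, falling back is only safe before the first update reaches the caller, because switching models mid-stream would duplicate output. You can enhance this with backoff, circuit breakers, or health checks depending on your needs.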
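The backoff enhancement mentioned above can be sketched without touching any chat client. This is a minimal sketch under stated assumptions, not part of Microsoft Agent Framework: BackoffPolicy and RetryAsync are hypothetical names, and the base delay, cap, and attempt count are whatever the caller chooses.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: exponential backoff with a cap and optional jitter.
public sealed class BackoffPolicy
{
    private readonly TimeSpan _baseDelay;
    private readonly TimeSpan _maxDelay;
    private readonly Random? _jitter;

    // Pass Random.Shared for jitter in production; pass null for deterministic delays.
    public BackoffPolicy(TimeSpan baseDelay, TimeSpan maxDelay, Random? jitter = null)
    {
        _baseDelay = baseDelay;
        _maxDelay = maxDelay;
        _jitter = jitter;
    }

    // attempt is zero-based: 0 gives the base delay, 1 doubles it, and so on up to the cap.
    public TimeSpan GetDelay(int attempt)
    {
        var factor = Math.Pow(2, Math.Min(attempt, 30));
        var ms = Math.Min(_baseDelay.TotalMilliseconds * factor, _maxDelay.TotalMilliseconds);
        // Full jitter: scale by a random factor in [0, 1) so clients do not retry in lockstep.
        if (_jitter is not null) ms *= _jitter.NextDouble();
        return TimeSpan.FromMilliseconds(ms);
    }

    // Runs the operation at least once and at most maxAttempts times,
    // waiting between failed attempts. The last failure is rethrown.
    public async Task<T> RetryAsync<T>(
        Func<CancellationToken, Task<T>> operation,
        int maxAttempts,
        CancellationToken cancellationToken = default)
    {
        for (var attempt = 0; ; attempt++)
        {
            try
            {
                return await operation(cancellationToken);
            }
            catch (Exception ex) when (ex is not OperationCanceledException
                                       && attempt < maxAttempts - 1)
            {
                await Task.Delay(GetDelay(attempt), cancellationToken);
            }
        }
    }
}
```

One way to combine the two ideas is to retry each client a small number of times with this policy before moving to the next client in the chain, so that a transient rate limit does not immediately push traffic to a different model.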
Cost-Aware Model Selection
Not every request needs your most capable model. A simple question might be answered perfectly by GPT-3.5 Turbo at a fraction of the cost. You can implement cost-aware routing by analyzing the request before selecting a model.
One approach is to classify the request complexity and route accordingly. This example uses a simplified heuristic based on input characteristics:
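public interface IModelSelector
{
    string SelectModel(string userInput);
}

public class ComplexityBasedModelSelector : IModelSelector
{
    public string SelectModel(string userInput)
    {
        var complexity = EstimateComplexity(userInput);
        return complexity switch
        {
            Complexity.Simple => "gpt-35-turbo",
            Complexity.Moderate => "gpt-4o",
            Complexity.Complex => "gpt-4",
            _ => "gpt-4o"
        };
    }

    private static Complexity EstimateComplexity(string input)
    {
        // Whitespace-separated words are only a rough proxy for tokens.
        var wordCount = input
            .Split((char[]?)null, StringSplitOptions.RemoveEmptyEntries).Length;
        var hasCode = Has(input, "code") || Has(input, "function");
        var hasReasoning = Has(input, "why") || Has(input, "explain");

        // Code-related or very long prompts go to the most capable model.
        if (hasCode || wordCount > 100)
            return Complexity.Complex;
        // Reasoning keywords or medium-length prompts go to the mid tier.
        if (hasReasoning || wordCount > 50)
            return Complexity.Moderate;
        return Complexity.Simple;
    }

    // Case-insensitive substring match, so "Why" and "why" are treated alike.
    private static bool Has(string input, string keyword) =>
        input.Contains(keyword, StringComparison.OrdinalIgnoreCase);
}

enum Complexity { Simple, Moderate, Complex }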
In production, you might replace this heuristic with a trained model or leverage Microsoft Foundry’s built-in Model Router, which uses machine learning to analyze prompts and select the best model automatically. Model Router supports three routing modes: Balanced (optimizes cost and quality), Cost (favors cheaper models), and Quality (always selects the highest-capability model).
Then in your agent or chat handler, use the selector to pick a model before invoking the chat client:
public class AgentChatHandler
{
    private readonly IChatClient _chatClient;
    private readonly IModelSelector _modelSelector;

    public AgentChatHandler(IChatClient chatClient,
        IModelSelector modelSelector)
    {
        _chatClient = chatClient;
        _modelSelector = modelSelector;
    }

    public async Task<string> HandleUserInputAsync(
        string userInput, CancellationToken cancellationToken = default)
    {
        var selectedModel = _modelSelector.SelectModel(userInput);
        // ModelId is a per-request hint. Whether it overrides the deployment a client
        // was created with depends on the provider adapter, so verify this for your setup.
        var options = new ChatOptions { ModelId = selectedModel };
        var messages = new List<ChatMessage>
        {
            new ChatMessage(ChatRole.User, userInput)
        };
        var response = await _chatClient.GetResponseAsync(
            messages, options, cancellationToken);
        return response.Text;
    }
}
This approach reduces costs by matching model capability to task difficulty. A classification task that GPT-3.5 Turbo handles well costs significantly less to run there than on GPT-4.
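To make the cost difference concrete, the arithmetic can be sketched as follows. This is an illustrative sketch only: CostEstimator and ModelPrice are hypothetical names, and any prices you pass in are your own assumptions, since the article does not quote real pricing.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical price entry: USD per 1,000 input and output tokens.
public sealed record ModelPrice(decimal InputPer1K, decimal OutputPer1K);

public sealed class CostEstimator
{
    private readonly IReadOnlyDictionary<string, ModelPrice> _prices;

    public CostEstimator(IReadOnlyDictionary<string, ModelPrice> prices) =>
        _prices = prices;

    // Estimated request cost = input tokens * input rate + output tokens * output rate.
    public decimal Estimate(string model, int inputTokens, int outputTokens)
    {
        if (!_prices.TryGetValue(model, out var price))
            throw new ArgumentException($"No price configured for '{model}'.", nameof(model));

        return inputTokens / 1000m * price.InputPer1K
             + outputTokens / 1000m * price.OutputPer1K;
    }
}
```

With placeholder prices of 0.02 and 0.04 per 1,000 tokens for a large model and 0.001 and 0.002 for a small one, a request with 500 input tokens and 100 output tokens comes to 0.014 on the large model and 0.0007 on the small one. Feeding real token counts from your logs into a table like this is one way to decide whether routing is worth the added complexity.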
Handling Provider Outages
When your primary LLM provider experiences an outage, you need a way to continue serving requests. Fallback chains help, but you also want to detect and respond to outages proactively.
A simple health check pattern can monitor provider availability:
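public class ProviderHealthCheck
{
    private readonly IChatClient _chatClient;
    private readonly TimeSpan _checkInterval = TimeSpan.FromMinutes(5);
    // This sketch is not synchronized; add a lock or SemaphoreSlim if many threads call it.
    private bool _isHealthy = true;
    private DateTime _lastCheckTime = DateTime.MinValue;

    public ProviderHealthCheck(IChatClient chatClient)
    {
        _chatClient = chatClient;
    }

    public async Task<bool> IsHealthyAsync(
        CancellationToken cancellationToken = default)
    {
        // Reuse the cached result until the check interval has elapsed.
        if (DateTime.UtcNow - _lastCheckTime < _checkInterval)
            return _isHealthy;
        try
        {
            var testMessages = new List<ChatMessage>
            {
                new ChatMessage(ChatRole.User, "Respond with 'ok'")
            };
            // Keep the probe cheap: a short prompt and a small output budget.
            await _chatClient.GetResponseAsync(
                testMessages,
                new ChatOptions { MaxOutputTokens = 5 },
                cancellationToken);
            _isHealthy = true;
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // A cancelled probe says nothing about the provider, so it is not counted.
            _isHealthy = false;
        }
        _lastCheckTime = DateTime.UtcNow;
        return _isHealthy;
    }
}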
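You can run this check periodically and update a status flag that your fallback logic consults before trying a client. This avoids spending time and timeouts on requests that are very likely to fail. Keep in mind that each probe is itself a billable model call, so choose the interval accordingly.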
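The status flag described above could look like the following sketch. ProviderStatusBoard is a hypothetical name, providers are identified by plain strings for illustration, and the injectable clock is an assumption made only to keep the class testable.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

// Hypothetical status board: remembers which providers are in a cooldown period.
public sealed class ProviderStatusBoard
{
    private readonly ConcurrentDictionary<string, DateTimeOffset> _unhealthyUntil = new();
    private readonly TimeSpan _cooldown;
    private readonly Func<DateTimeOffset> _clock;

    public ProviderStatusBoard(TimeSpan cooldown, Func<DateTimeOffset>? clock = null)
    {
        _cooldown = cooldown;
        _clock = clock ?? (() => DateTimeOffset.UtcNow);
    }

    // Called by a health check or by fallback logic after a failure.
    public void MarkUnhealthy(string provider) =>
        _unhealthyUntil[provider] = _clock() + _cooldown;

    public void MarkHealthy(string provider) =>
        _unhealthyUntil.TryRemove(provider, out _);

    // A provider is available if it was never marked, or its cooldown has expired.
    public bool IsAvailable(string provider) =>
        !_unhealthyUntil.TryGetValue(provider, out var until) || _clock() >= until;

    // Keeps chain order and skips providers in cooldown. If every provider is
    // in cooldown, the full chain is returned so requests are still attempted.
    public IReadOnlyList<string> SelectCandidates(IReadOnlyList<string> chain)
    {
        var available = chain.Where(IsAvailable).ToList();
        return available.Count > 0 ? available : chain;
    }
}
```

Returning the full chain when everything is marked unhealthy is a deliberate choice in this sketch: a stale flag should degrade to normal fallback behavior rather than reject every request.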
Dependency Injection Setup
Bringing this all together in your service configuration ensures clean, testable code:
public static class AgentFrameworkExtensions
{
    public static IServiceCollection AddMultiModelAgent(
        this IServiceCollection services)
    {
        // Register one keyed chat client per deployment.
        // Endpoints and keys are placeholders; load real values from configuration.
        services.AddKeyedSingleton<IChatClient>("gpt4", (_, _) => CreateClient(
            "https://gpt4.openai.azure.com", "key1", "gpt-4"));
        services.AddKeyedSingleton<IChatClient>("gpt4o", (_, _) => CreateClient(
            "https://gpt4o.openai.azure.com", "key2", "gpt-4o"));
        services.AddKeyedSingleton<IChatClient>("gpt35", (_, _) => CreateClient(
            "https://gpt35.openai.azure.com", "key3", "gpt-35-turbo"));

        // Register model selector
        services.AddSingleton<IModelSelector, ComplexityBasedModelSelector>();

        // Register the health check against the primary client only,
        // so the probe does not run through the fallback wrapper.
        services.AddSingleton(sp => new ProviderHealthCheck(
            sp.GetRequiredKeyedService<IChatClient>("gpt4")));

        // Register the fallback wrapper as the default (unkeyed) IChatClient
        services.AddSingleton<IChatClient>(sp =>
        {
            var clients = new List<IChatClient>
            {
                sp.GetRequiredKeyedService<IChatClient>("gpt4"),
                sp.GetRequiredKeyedService<IChatClient>("gpt4o"),
                sp.GetRequiredKeyedService<IChatClient>("gpt35")
            };
            var logger = sp.GetRequiredService<ILogger<FallbackChatClient>>();
            return new FallbackChatClient(clients, logger);
        });
        return services;
    }

    // Builds an IChatClient for one Azure OpenAI deployment.
    private static IChatClient CreateClient(
        string endpoint, string apiKey, string deployment) =>
        new AzureOpenAIClient(new Uri(endpoint), new ApiKeyCredential(apiKey))
            .GetChatClient(deployment)
            .AsIChatClient();
}
Then in your startup:
builder.Services.AddMultiModelAgent();
Now anywhere you inject IChatClient, you get the multi-model fallback behavior automatically.
Patterns in Practice
These routing and fallback patterns apply across different agent scenarios. Whether you’re building interview coaches, customer support agents, or document analysis tools, the same principles hold: define your model hierarchy, implement fallback logic, and monitor which models are actually being used.
The key is starting simple. Begin with a basic fallback chain (primary model, secondary model, tertiary model), then add complexity as you learn what your agents actually need. Cost-aware routing can wait until you have baseline metrics. Health checks become valuable once you’re running at scale.
Monitoring and Observability
Once you have routing and fallback in place, you need visibility into what’s happening. Log which model was selected, whether fallbacks were triggered, and how long each request took:
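// Requires: System.Diagnostics, System.Runtime.CompilerServices,
// Microsoft.Extensions.AI, Microsoft.Extensions.Logging
public class ObservableFallbackChatClient : IChatClient
{
    private readonly List<(IChatClient client, string name)> _clients;
    private readonly ILogger<ObservableFallbackChatClient> _logger;

    public ObservableFallbackChatClient(
        List<(IChatClient, string)> clients,
        ILogger<ObservableFallbackChatClient> logger)
    {
        _clients = clients;
        _logger = logger;
    }

    public async Task<ChatResponse> GetResponseAsync(
        IEnumerable<ChatMessage> messages,
        ChatOptions? options = null,
        CancellationToken cancellationToken = default)
    {
        var sw = Stopwatch.StartNew();
        foreach (var (client, name) in _clients)
        {
            try
            {
                _logger.LogInformation("Attempting model: {Model}", name);
                var response = await client.GetResponseAsync(
                    messages, options, cancellationToken);
                _logger.LogInformation(
                    "Success with {Model} in {ElapsedMs}ms", name, sw.ElapsedMilliseconds);
                return response;
            }
            catch (Exception ex) when (ex is not OperationCanceledException)
            {
                _logger.LogWarning(ex,
                    "Model {Model} failed. Elapsed: {ElapsedMs}ms", name, sw.ElapsedMilliseconds);
            }
        }
        throw new InvalidOperationException("All fallback models exhausted.");
    }

    public async IAsyncEnumerable<ChatResponseUpdate> GetStreamingResponseAsync(
        IEnumerable<ChatMessage> messages,
        ChatOptions? options = null,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        var sw = Stopwatch.StartNew();
        foreach (var (client, name) in _clients)
        {
            _logger.LogInformation("Attempting model: {Model}", name);
            await using var stream = client
                .GetStreamingResponseAsync(messages, options, cancellationToken)
                .GetAsyncEnumerator(cancellationToken);
            var started = false;
            while (true)
            {
                bool hasNext;
                try
                {
                    // yield return is not allowed inside try/catch, so only MoveNextAsync is guarded.
                    hasNext = await stream.MoveNextAsync();
                }
                catch (Exception ex) when (!started && ex is not OperationCanceledException)
                {
                    _logger.LogWarning(ex,
                        "Model {Model} failed. Elapsed: {ElapsedMs}ms", name, sw.ElapsedMilliseconds);
                    break;
                }
                if (!hasNext)
                {
                    _logger.LogInformation(
                        "Success with {Model} in {ElapsedMs}ms", name, sw.ElapsedMilliseconds);
                    yield break;
                }
                started = true;
                yield return stream.Current;
            }
        }
        throw new InvalidOperationException("All fallback models exhausted.");
    }

    public object? GetService(Type serviceType, object? serviceKey = null) =>
        serviceKey is null && serviceType.IsInstanceOfType(this) ? this : null;

    // Inner clients are owned by the DI container, so nothing is disposed here.
    public void Dispose() { }
}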
This logging gives you visibility into which models are being used, when fallbacks activate, and performance characteristics. Over time, this data helps you refine your routing policies.
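Log lines are easier to act on when they are also aggregated into a few counters. The following is a minimal sketch of such an aggregator; RoutingMetrics is a hypothetical name, and in a real service you would more likely export these numbers through your existing metrics pipeline.

```csharp
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical in-memory counters for routing and fallback behavior.
public sealed class RoutingMetrics
{
    private readonly ConcurrentDictionary<string, long> _successes = new();
    private long _requests;
    private long _fallbacks;

    // Call once per completed request with the model that answered and
    // its zero-based position in the fallback chain.
    public void RecordSuccess(string model, int chainIndex)
    {
        Interlocked.Increment(ref _requests);
        if (chainIndex > 0) Interlocked.Increment(ref _fallbacks);
        _successes.AddOrUpdate(model, 1L, (_, count) => count + 1);
    }

    public long SuccessesFor(string model) =>
        _successes.TryGetValue(model, out var count) ? count : 0;

    // Fraction of requests answered by something other than the first model.
    public double FallbackRate
    {
        get
        {
            var total = Interlocked.Read(ref _requests);
            return total == 0 ? 0 : (double)Interlocked.Read(ref _fallbacks) / total;
        }
    }
}
```

A rising fallback rate is an early signal that the primary model is degraded or rate limited, and the per-model counts show whether cost-aware routing is actually sending traffic to the cheaper tiers.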
Key Takeaways
Model routing and fallback strategies transform your AI agents from fragile single-model systems into resilient, cost-efficient production services. By configuring Microsoft Foundry as your LLM backend and layering routing logic on top of IChatClient, you gain:
- Automatic fallback when a model or provider fails, keeping your service running
- Cost optimization by matching model capability to task complexity
- Flexibility to swap providers or models without changing application code
- Observable, loggable behavior that helps you understand and improve your agent’s performance
Start simple with a basic fallback chain, then add complexity as you learn what your agents actually need. The dependency injection patterns make it easy to evolve your strategy over time.
What is the difference between model routing and fallback strategies?
Model routing actively selects the best model for each request based on factors like task complexity or cost constraints. Fallback strategies define what to do when a model fails, trying alternatives in sequence. They work together: routing chooses your first attempt, fallback kicks in if that attempt fails.
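How the two cooperate can be sketched as a small ordering step: the routed model goes first, and the remaining chain follows in its original order. AttemptPlanner and BuildAttemptOrder are hypothetical names used only for illustration, and models are represented as plain strings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AttemptPlanner
{
    // Puts the routed model first, then the rest of the chain in its original order.
    public static IReadOnlyList<string> BuildAttemptOrder(
        string routedModel, IReadOnlyList<string> fallbackChain)
    {
        var order = new List<string> { routedModel };
        order.AddRange(fallbackChain.Where(model =>
            !string.Equals(model, routedModel, StringComparison.OrdinalIgnoreCase)));
        return order;
    }
}
```

With a chain of gpt-4, gpt-4o, gpt-35-turbo and a routing decision of gpt-35-turbo, the attempt order becomes gpt-35-turbo, gpt-4, gpt-4o. Whether a cheap model should fall back to a more expensive one is a policy decision; this sketch simply keeps the chain's existing order.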
Can I use Microsoft Agent Framework with non-Azure LLM providers?
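Yes. Microsoft Agent Framework builds on the IChatClient abstraction, so any provider that offers an IChatClient implementation can participate, and Microsoft Foundry itself supports multiple providers. You can configure OpenAI, Anthropic, or other services alongside Azure OpenAI in the same agent, giving you flexibility in your fallback chains and routing policies.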
How do I avoid wasting requests on a provider that is down?
Implement a simple health check that periodically pings your primary provider with a lightweight test request. If it fails, skip that provider in your fallback chain for a period of time. This avoids burning through timeouts and quota on requests that will fail anyway.
What happens if all my fallback models fail?
Your application should decide how to handle this gracefully. Options include returning a cached response, queuing the request for retry, returning a user-friendly error message, or escalating to a human. The pattern shown in this article throws an exception, which you can catch and handle according to your needs.
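Two of those options, a cached response and a user-friendly message, can be combined in a small wrapper. This is a sketch under stated assumptions: DegradingResponder is a hypothetical name, the model call is abstracted as a delegate, and the cache is an unbounded in-memory dictionary that a real service would replace with something size-limited.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical wrapper that degrades gracefully when every model has failed.
public sealed class DegradingResponder
{
    private readonly Func<string, Task<string>> _askModel;
    private readonly ConcurrentDictionary<string, string> _lastGood = new();
    private readonly string _apology;

    public DegradingResponder(Func<string, Task<string>> askModel, string apology)
    {
        _askModel = askModel;
        _apology = apology;
    }

    public async Task<string> AskAsync(string prompt)
    {
        try
        {
            var answer = await _askModel(prompt);
            _lastGood[prompt] = answer; // remember the latest good answer per prompt
            return answer;
        }
        catch (InvalidOperationException)
        {
            // The fallback pattern in this article signals exhaustion with
            // InvalidOperationException: serve a cached answer if one exists,
            // otherwise a friendly message.
            return _lastGood.TryGetValue(prompt, out var cached) ? cached : _apology;
        }
    }
}
```

Serving a cached answer is only appropriate when a slightly stale response is acceptable for the prompt in question; for anything time-sensitive, the friendly message or a retry queue is the safer choice.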
Does model routing add latency to my requests?
The complexity-based selector shown here runs a few string checks on the input text, so its overhead is negligible compared with a model call. If you implement more sophisticated selection logic, you might add a few milliseconds. The benefit of cost savings and improved resource utilization typically outweighs this small latency cost.
What is Microsoft Foundry’s Model Router?
Model Router is a built-in feature in Microsoft Foundry that uses machine learning to automatically select the best model for each prompt. It analyzes the full request (system message, user message, tools, conversation history) and routes to the most suitable model based on your chosen routing mode: Balanced (default, optimizes cost and quality), Cost (favors cheaper models), or Quality (always uses the highest-capability model). This is a higher-level alternative to manual routing logic.