Why Voice Changes the Agent Game
Text-based agents have proven their value across customer service, support, and automation. Voice agents add a new dimension: they capture intent faster, handle natural interruptions, and feel more human. When someone calls a concierge or service desk, they don’t type word by word. They speak naturally, with context and tone. Voice agents that understand this create a fundamentally different experience.
The challenge is that voice adds real complexity. You need to stream audio efficiently, maintain session state across multiple turns, route requests to specialist agents based on intent, and do all of this with low latency. Azure AI Voice Live, combined with Microsoft Agent Framework running on .NET, makes this practical. You’re building something that can handle real conversations in production, not a prototype.
The Architecture: Voice to Agent to Action
Here’s how the pieces fit together:
- User speaks to a WebSocket client (web, mobile, or desktop)
- Audio streams to your .NET backend via WebSocket
- Your backend forwards audio frames to Azure AI Voice Live
- The realtime model processes audio and returns transcriptions and intent reasoning
- Your agent framework evaluates what to do next, routing to specialist agents if needed
- Results stream back as voice and/or structured data
The key insight: voice agents aren’t transcription plus text processing. The realtime model is reasoning about intent and context in real time. Your agent framework orchestrates what happens next based on that reasoning. This is fundamentally different from batch speech-to-text pipelines.
Setting Up Azure AI Voice Live with .NET
First, you need an Azure AI Services resource with a model deployed to the realtime inference endpoint. This is different from batch inference. You’re deploying to a live endpoint that maintains connection state and streams responses incrementally.
Authentication uses Microsoft Entra ID (formerly Azure Active Directory). Here’s how to create the client and start a session using the Azure SDK:
using Azure.AI.VoiceLive;
using Azure.Identity;
using System;
using System.Threading.Tasks;

public class VoiceAgentSetup
{
    public async Task<VoiceLiveClient> CreateVoiceLiveClientAsync(
        string resourceName,
        string region)
    {
        // Use DefaultAzureCredential for Entra ID authentication
        // This works with az login, managed identities, or environment variables
        var credential = new DefaultAzureCredential();
        var endpoint = new Uri(
            $"https://{resourceName}.cognitiveservices.azure.com/"
        );
        var client = new VoiceLiveClient(endpoint, credential);
        return client;
    }

    public async Task<VoiceLiveSession> StartSessionWithAgentAsync(
        VoiceLiveClient client,
        string agentName,
        string projectName,
        string agentId)
    {
        // Configure the session to use a hosted agent
        var sessionConfig = new AgentSessionConfig
        {
            AgentName = agentName,
            ProjectName = projectName,
            AgentVersion = null, // Use latest version
            ConversationId = Guid.NewGuid().ToString()
        };

        // Start the session
        var session = await client.StartSessionAsync(
            SessionTarget.FromAgent(sessionConfig)
        );
        return session;
    }
}
The Azure SDK handles the WebSocket connection, authentication, and event marshaling for you. You work with a managed session object rather than raw WebSocket frames. This is cleaner and more reliable than building your own WebSocket layer.
Streaming Audio and Handling Real-Time Responses
Audio arrives in chunks, typically 20ms to 100ms at a time. You stream these chunks to the session and listen for events. The realtime model processes audio continuously and emits events as it reasons about intent and generates responses.
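To make those chunk sizes concrete, here is a minimal sketch of the arithmetic, assuming 16-bit mono PCM at 16 kHz (the format this article uses throughout). `PcmChunker` and its method names are hypothetical helpers for illustration, not part of the Azure SDK.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of any SDK): sizes and splits raw PCM16 mono
// audio into fixed-duration chunks for streaming.
public static class PcmChunker
{
    public const int BytesPerSample = 2; // 16-bit samples

    // Number of bytes in one chunk of the given duration.
    public static int ChunkSizeBytes(int sampleRateHz, int chunkMilliseconds)
    {
        if (sampleRateHz <= 0) throw new ArgumentOutOfRangeException(nameof(sampleRateHz));
        if (chunkMilliseconds <= 0) throw new ArgumentOutOfRangeException(nameof(chunkMilliseconds));
        return sampleRateHz * chunkMilliseconds / 1000 * BytesPerSample;
    }

    // Yields consecutive chunks in order; the final chunk may be shorter.
    public static IEnumerable<byte[]> Split(byte[] pcm, int chunkSizeBytes)
    {
        if (pcm == null) throw new ArgumentNullException(nameof(pcm));
        if (chunkSizeBytes <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSizeBytes));

        for (int offset = 0; offset < pcm.Length; offset += chunkSizeBytes)
        {
            int length = Math.Min(chunkSizeBytes, pcm.Length - offset);
            var chunk = new byte[length];
            Buffer.BlockCopy(pcm, offset, chunk, 0, length);
            yield return chunk;
        }
    }
}
```

Under these assumptions a 20ms chunk is 640 bytes and a 100ms chunk is 3,200 bytes. Smaller chunks reduce the delay before the model hears the user, at the cost of more messages per second.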
Here’s how to handle the event loop:
public class RealtimeVoiceSession
{
    private readonly VoiceLiveSession _session;
    private readonly AgentOrchestrator _agentOrchestrator;
    private readonly VoiceSessionContext _context;

    public RealtimeVoiceSession(
        VoiceLiveSession session,
        AgentOrchestrator agentOrchestrator,
        VoiceSessionContext context)
    {
        _session = session;
        _agentOrchestrator = agentOrchestrator;
        _context = context;
    }

    public async Task SendAudioChunkAsync(byte[] audioChunk)
    {
        // Send PCM16 audio at 16kHz to the session
        await _session.SendAudioAsync(audioChunk);
    }

    public async Task ProcessEventsAsync(CancellationToken ct)
    {
        // Listen for events from the realtime model
        await foreach (var serverEvent in _session.GetEventsAsync(ct))
        {
            switch (serverEvent)
            {
                case SessionUpdateResponseOutputItemAdded itemAdded:
                    // Agent started generating a response
                    _context.IsAgentSpeaking = true;
                    break;
                case SessionUpdateResponseOutputItemDone itemDone:
                    // Agent finished this response item
                    _context.IsAgentSpeaking = false;
                    break;
                case SessionUpdateConversationItemInputAudioTranscriptionCompleted transcription:
                    // User's speech has been transcribed
                    await HandleUserTranscriptionAsync(
                        transcription.Transcript
                    );
                    break;
                case SessionUpdateResponseAudioDelta audioDelta:
                    // Incremental audio from the agent
                    // Queue this for playback
                    await QueueAudioForPlaybackAsync(audioDelta.AudioData);
                    break;
                case SessionUpdateResponseAudioDone audioDone:
                    // Agent finished speaking
                    break;
            }
        }
    }

    private async Task HandleUserTranscriptionAsync(string transcript)
    {
        // User has spoken something
        _context.AddTurn("user", transcript);

        // Route through the agent orchestrator
        var response = await _agentOrchestrator.ProcessAsync(
            transcript,
            _context
        );
        _context.AddTurn("assistant", response.Text, response.Intent);
    }

    private async Task QueueAudioForPlaybackAsync(byte[] audioData)
    {
        // Send audio to the client for playback
        // Implementation depends on your transport (SignalR, WebSocket, etc.)
        await Task.CompletedTask;
    }
}
The key pattern: you have an async event loop listening for server events. When the user speaks, you get a transcription event. When the agent generates audio, you get audio delta events. This decoupling keeps latency low and lets you handle multiple concurrent operations smoothly.
Managing Conversation State Across Voice Sessions
One of the trickier parts of voice agents is maintaining context. If a user asks “What time does the ski lift open?” and then says “Can I book a lesson there?”, your agent needs to remember which lift they were asking about. This context doesn’t live in a single request. It spans the entire session.
Here’s how to structure session state:
public class VoiceSessionContext
{
    public string SessionId { get; set; }
    public DateTime StartTime { get; set; }
    public List<ConversationTurn> Turns { get; set; } = new();
    public Dictionary<string, object> EntityMemory { get; set; } = new();
    public string CurrentAgent { get; set; }
    public bool IsAgentSpeaking { get; set; }

    public void AddTurn(string role, string content, string intent = null)
    {
        Turns.Add(new ConversationTurn
        {
            Role = role,
            Content = content,
            Intent = intent,
            Timestamp = DateTime.UtcNow
        });
    }

    public void RememberEntity(string key, object value)
    {
        EntityMemory[key] = value;
    }

    public object GetEntity(string key)
    {
        return EntityMemory.TryGetValue(key, out var value) ? value : null;
    }
}

public class ConversationTurn
{
    public string Role { get; set; }
    public string Content { get; set; }
    public string Intent { get; set; }
    public DateTime Timestamp { get; set; }
}
Store this in Redis or a distributed cache if you have multiple backend instances. When a connection drops and reconnects, retrieve the context by session ID and restore the conversation. This way, users can pick up where they left off without repeating themselves.
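The save-and-restore step can be sketched as follows, assuming System.Text.Json for serialization. `ISessionStore` and `InMemorySessionStore` are hypothetical stand-ins for a Redis-backed store, and `SessionSnapshot` is a trimmed-down stand-in for `VoiceSessionContext` so the example compiles on its own.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Text.Json;
using System.Threading.Tasks;

// Trimmed-down stand-in for VoiceSessionContext (hypothetical, for illustration).
public class SessionSnapshot
{
    public string SessionId { get; set; } = "";
    public List<string> Turns { get; set; } = new();
    public Dictionary<string, string> EntityMemory { get; set; } = new();
}

// Hypothetical abstraction; a production version would wrap Redis
// or another distributed cache behind the same two methods.
public interface ISessionStore
{
    Task SaveAsync(SessionSnapshot snapshot);
    Task<SessionSnapshot> LoadAsync(string sessionId);
}

// In-memory implementation intended for local runs and tests only.
public class InMemorySessionStore : ISessionStore
{
    private readonly ConcurrentDictionary<string, string> _json = new();

    public Task SaveAsync(SessionSnapshot snapshot)
    {
        // Store serialized JSON, as a remote cache would, so that
        // serialization problems show up before deployment.
        _json[snapshot.SessionId] = JsonSerializer.Serialize(snapshot);
        return Task.CompletedTask;
    }

    public Task<SessionSnapshot> LoadAsync(string sessionId)
    {
        // Returns null when no snapshot exists for this session ID.
        bool found = _json.TryGetValue(sessionId, out var json);
        return Task.FromResult(
            found ? JsonSerializer.Deserialize<SessionSnapshot>(json) : null);
    }
}
```

The snapshot uses string values for entity memory on purpose. With System.Text.Json, values in a `Dictionary<string, object>` come back as `JsonElement` after a round trip, so a check such as `is string` would no longer match on a restored session.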
Orchestrating Specialist Agents for Voice
The real power of agent frameworks comes from routing. A voice agent for a ski resort might have specialists for: lift status, lesson booking, equipment rental, weather, and emergency services. Your main agent’s job is to understand what the user is asking about and route to the right specialist.
Here’s the orchestrator pattern:
public class AgentOrchestrator
{
    private readonly Dictionary<string, ISpecialistAgent> _agents;
    private readonly IIntentClassifier _intentClassifier;

    public AgentOrchestrator(
        Dictionary<string, ISpecialistAgent> agents,
        IIntentClassifier intentClassifier)
    {
        _agents = agents;
        _intentClassifier = intentClassifier;
    }

    public async Task<AgentResponse> ProcessAsync(
        string userInput,
        VoiceSessionContext context)
    {
        // Classify the intent from user input and conversation history
        var intent = await _intentClassifier.ClassifyAsync(
            userInput,
            context
        );

        // Route to the appropriate specialist agent
        if (!_agents.TryGetValue(intent.AgentType, out var agent))
        {
            return new AgentResponse
            {
                Text = "I'm not sure how to help with that.",
                Intent = "unknown"
            };
        }

        // Execute the specialist agent with full context
        var response = await agent.ProcessAsync(userInput, context);

        // The caller (HandleUserTranscriptionAsync) records the assistant turn,
        // so only entity memory is updated here; adding the turn in both places
        // would log every response twice.
        if (response.RememberedEntities != null)
        {
            foreach (var entity in response.RememberedEntities)
            {
                context.RememberEntity(entity.Key, entity.Value);
            }
        }
        return response;
    }
}

public interface ISpecialistAgent
{
    Task<AgentResponse> ProcessAsync(
        string input,
        VoiceSessionContext context
    );
}

public class AgentResponse
{
    public string Text { get; set; }
    public string Intent { get; set; }
    public Dictionary<string, object> RememberedEntities { get; set; }
}

public class IntentClassification
{
    public string AgentType { get; set; }
    public string Intent { get; set; }
    public double Confidence { get; set; }
}

public interface IIntentClassifier
{
    Task<IntentClassification> ClassifyAsync(
        string userInput,
        VoiceSessionContext context
    );
}
This design lets you add new specialist agents without changing the orchestrator. Each specialist knows its domain. The orchestrator routes and keeps context synchronized. This separation makes your system easier to test and maintain.
Building a Specialist Agent: Lift Status Example
Let’s make this concrete with a real specialist agent that handles lift status queries. This agent queries live telemetry to check which lifts are open and their wait times.
public class LiftStatusAgent : ISpecialistAgent
{
    private readonly ILiftTelemetryService _telemetry;

    public LiftStatusAgent(ILiftTelemetryService telemetry)
    {
        _telemetry = telemetry;
    }

    public async Task<AgentResponse> ProcessAsync(
        string input,
        VoiceSessionContext context)
    {
        // Extract lift name from input or context
        var liftName = ExtractLiftName(input, context);
        if (string.IsNullOrEmpty(liftName))
        {
            return new AgentResponse
            {
                Text = "Which lift are you interested in? We have Alpine, Summit, and Village.",
                Intent = "lift_status_clarification"
            };
        }

        // Query the telemetry service
        var liftStatus = await _telemetry.GetLiftStatusAsync(liftName);
        if (liftStatus == null)
        {
            return new AgentResponse
            {
                Text = $"I couldn't find a lift called {liftName}.",
                Intent = "lift_not_found"
            };
        }

        // Build a natural response
        var response = BuildResponse(liftStatus);

        // Remember this lift for future context
        var rememberedEntities = new Dictionary<string, object>
        {
            { "last_lift_queried", liftName },
            { "last_lift_status", liftStatus }
        };
        return new AgentResponse
        {
            Text = response,
            Intent = "lift_status_provided",
            RememberedEntities = rememberedEntities
        };
    }

    private string ExtractLiftName(string input, VoiceSessionContext context)
    {
        var lifts = new[] { "Alpine", "Summit", "Village" };
        foreach (var lift in lifts)
        {
            if (input.Contains(lift, StringComparison.OrdinalIgnoreCase))
                return lift;
        }

        // Check context from a previous query
        if (context.GetEntity("last_lift_queried") is string lastLift)
            return lastLift;
        return null;
    }

    private string BuildResponse(LiftStatus status)
    {
        if (status.IsOpen)
        {
            return $"The {status.Name} lift is open. " +
                $"Current wait time is {status.WaitMinutes} minutes. " +
                $"Last updated at {status.LastUpdated:h:mm tt}.";
        }
        else
        {
            return $"The {status.Name} lift is currently closed. " +
                $"It's scheduled to open at {status.ScheduledOpenTime:h:mm tt}.";
        }
    }
}

public class LiftStatus
{
    public string Name { get; set; }
    public bool IsOpen { get; set; }
    public int WaitMinutes { get; set; }
    public DateTime LastUpdated { get; set; }
    public DateTime ScheduledOpenTime { get; set; }
}

public interface ILiftTelemetryService
{
    Task<LiftStatus> GetLiftStatusAsync(string liftName);
}
This agent does something real: it queries live data and builds a conversational response. If the user’s next message is “Can I book a lesson there?”, your context still remembers which lift they asked about. The next specialist agent has that context without the user repeating themselves.
Handling Interruptions and Barge-In
Real conversations have interruptions. The user might start speaking before the agent finishes. Azure AI Voice Live handles some of this natively through voice activity detection, but your application should handle it gracefully too.
When you detect that the user has started speaking while the agent is still generating a response, you have options:
- Stop the current response and start processing the new input
- Queue the new input and process it after the current response completes
- Let the realtime model handle it automatically (it often does)
Here’s how to implement barge-in detection:
public class BargeInHandler
{
    public async Task<bool> DetectBargeInAsync(
        byte[] audioChunk,
        bool isAgentSpeaking)
    {
        if (!isAgentSpeaking)
            return false;

        // Detect if the user has started speaking
        var hasUserSpeech = await DetectSpeechActivityAsync(audioChunk);
        return hasUserSpeech;
    }

    private Task<bool> DetectSpeechActivityAsync(byte[] audioChunk)
    {
        // Simple volume-based detection; a dedicated VAD is more robust.
        // The 0.05 threshold applies to RMS normalized to the 0.0-1.0 range.
        var rmsLevel = CalculateRMS(audioChunk);
        return Task.FromResult(rmsLevel > 0.05);
    }

    private double CalculateRMS(byte[] audioChunk)
    {
        // Interpret the buffer as PCM16 samples; a trailing odd byte is ignored.
        int sampleCount = audioChunk.Length / 2;
        if (sampleCount == 0)
            return 0;

        double sum = 0;
        for (int i = 0; i < sampleCount; i++)
        {
            // Normalize each sample to [-1.0, 1.0) so the threshold is meaningful.
            double sample = BitConverter.ToInt16(audioChunk, i * 2) / 32768.0;
            sum += sample * sample;
        }
        return Math.Sqrt(sum / sampleCount);
    }
}
Barge-in is subtle but important. If your agent sounds like it’s ignoring the user’s attempt to speak, the experience feels broken. Handle it well and conversations feel genuinely interactive.
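One way to avoid cancelling a response because of a single noisy frame is to require several consecutive speech frames before declaring a barge-in. The sketch below assumes the per-frame speech flag comes from whatever detector you already use. `BargeInDebouncer` and its frame threshold are hypothetical, not part of Voice Live.

```csharp
using System;

// Hypothetical helper: reports a barge-in only after N consecutive speech
// frames arrive while the agent is speaking.
public class BargeInDebouncer
{
    private readonly int _requiredFrames;
    private int _consecutiveSpeechFrames;

    public BargeInDebouncer(int requiredFrames)
    {
        if (requiredFrames <= 0)
            throw new ArgumentOutOfRangeException(nameof(requiredFrames));
        _requiredFrames = requiredFrames;
    }

    // Call once per audio frame. Returns true only on the frame where the
    // streak first reaches the threshold, so cancellation fires once per streak.
    public bool OnFrame(bool frameHasSpeech, bool isAgentSpeaking)
    {
        if (!isAgentSpeaking || !frameHasSpeech)
        {
            // Silence, or the agent is not speaking: reset the streak.
            _consecutiveSpeechFrames = 0;
            return false;
        }

        _consecutiveSpeechFrames++;
        return _consecutiveSpeechFrames == _requiredFrames;
    }
}
```

With 20ms frames, a threshold of 5 means the user must speak for roughly 100ms before the agent is interrupted. A lower threshold reacts faster but is more likely to trigger on coughs or background noise.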
Putting It Together: A Complete Voice Session Handler
Here’s how these pieces work together in a real request handler:
using Microsoft.AspNetCore.SignalR;
using System;
using System.Threading.Tasks;

public class VoiceAgentHub : Hub
{
    private readonly VoiceAgentFactory _agentFactory;
    private readonly SessionManager _sessionManager;
    private readonly BargeInHandler _bargeInHandler;

    public VoiceAgentHub(
        VoiceAgentFactory agentFactory,
        SessionManager sessionManager,
        BargeInHandler bargeInHandler)
    {
        _agentFactory = agentFactory;
        _sessionManager = sessionManager;
        _bargeInHandler = bargeInHandler;
    }

    public async Task StartVoiceSession()
    {
        var sessionId = Guid.NewGuid().ToString();
        var context = new VoiceSessionContext { SessionId = sessionId };
        var voiceAgent = await _agentFactory.CreateAsync(context);
        await _sessionManager.StoreAsync(sessionId, voiceAgent);
        await Clients.Caller.SendAsync("sessionStarted", sessionId);
    }

    public async Task ReceiveAudioChunk(string sessionId, byte[] audioData)
    {
        var voiceAgent = await _sessionManager.GetAsync(sessionId);
        if (voiceAgent == null)
            return;

        // Detect barge-in if agent is speaking
        var isBargeIn = await _bargeInHandler.DetectBargeInAsync(
            audioData,
            voiceAgent.IsAgentSpeaking
        );
        if (isBargeIn)
        {
            voiceAgent.CancelCurrentResponse();
        }
        await voiceAgent.SendAudioChunkAsync(audioData);
    }

    public async Task EndSession(string sessionId)
    {
        var voiceAgent = await _sessionManager.GetAsync(sessionId);
        if (voiceAgent != null)
        {
            await voiceAgent.EndSessionAsync();
            await _sessionManager.RemoveAsync(sessionId);
        }
    }
}
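This is a SignalR hub that clients connect to. The client sends audio chunks in real time, the hub routes them to the voice agent, and responses come back as they’re generated. This is the glue between your web or mobile frontend and your voice agent backend. VoiceAgentFactory and SessionManager are application-level types not shown here, and the voice agent object they return is assumed to expose IsAgentSpeaking, CancelCurrentResponse, and EndSessionAsync on top of the RealtimeVoiceSession shown earlier.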
Deployment Considerations
Running voice agents in production requires attention to a few specific areas:
Latency. Keep the WebSocket round trip to Azure AI Voice Live short by deploying your backend in the same region as your Voice Live resource. Choose the region closest to your users; if you’re serving users in the UAE, for example, pick the nearest region where the service is available. Monitor connection latency and audio processing time.
Session persistence. If you’re running multiple backend instances, store session state in Redis or a distributed cache, not in memory. When a connection drops and reconnects, retrieve the context and restore conversation state. This is non-negotiable for production reliability.
Audio quality. Ensure your clients send audio at 16-bit PCM, 16kHz mono. If audio quality is poor, the model’s transcriptions and reasoning suffer. Test with real network conditions and various audio sources.
Monitoring and logging. Log every conversation turn, every intent classification, and every specialist agent invocation. Include timing information: how long did intent classification take, how long to route to an agent, how long to generate a response. When something goes wrong, you need to understand what happened and why.
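A small wrapper can capture the per-stage timings described above. This is a sketch under the assumption that you pass in your own reporting callback (for example, one that forwards to your logger). `StageTimer` is a hypothetical helper name.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical helper: times one pipeline stage and reports the duration.
public static class StageTimer
{
    // Runs the stage, measures elapsed wall-clock time, and invokes the
    // report callback even if the stage throws.
    public static async Task<T> MeasureAsync<T>(
        string stageName,
        Func<Task<T>> stage,
        Action<string, TimeSpan> report)
    {
        var stopwatch = Stopwatch.StartNew();
        try
        {
            return await stage();
        }
        finally
        {
            stopwatch.Stop();
            report(stageName, stopwatch.Elapsed);
        }
    }
}
```

Wrapping intent classification, agent routing, and response generation separately gives you one duration per stage for each conversation turn, which makes slow turns much easier to diagnose.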
Real-World Tradeoffs
Voice agents are powerful but not magic. Here are the tradeoffs you’ll face:
Accuracy versus speed. You can improve intent classification by using more context and more complex models, but that takes time. Users expect responses in under a second. Find the balance for your use case. Sometimes a fast, good-enough answer beats a slow, perfect one.
Specialist agents versus general purpose. Many small specialist agents are easier to reason about and test, but add routing overhead. One large general-purpose agent is simpler but harder to maintain. Most teams find a middle ground: a handful of well-scoped specialists that handle 80% of requests.
Audio processing locally versus server-side. Processing audio on the client (voice activity detection, echo cancellation) reduces server load but increases client complexity. In web browsers, you are largely limited to what the platform exposes, such as the echoCancellation and noiseSuppression capture constraints, and custom processing is harder. For mobile apps, it’s more practical. Choose based on your deployment target.
Conversation length. The longer a conversation, the more context you carry forward. At some point, the context gets too large to send with every request. Implement summarization or sliding window strategies for long sessions. This is especially important for voice, where users might have extended conversations.
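A sliding window can be as simple as keeping the newest turns that fit a budget. The sketch below uses character count as a crude stand-in for a token budget, which is an assumption; swap in a real tokenizer count if you have one. `ConversationWindow` is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: keeps only the most recent turns that fit a budget.
public static class ConversationWindow
{
    // Returns the newest turns, in chronological order, whose combined length
    // fits within maxCharacters. The newest turn is always kept, even if it
    // exceeds the budget on its own.
    public static List<string> Trim(IReadOnlyList<string> turns, int maxCharacters)
    {
        if (turns == null) throw new ArgumentNullException(nameof(turns));

        var kept = new List<string>();
        int used = 0;

        // Walk backwards from the newest turn until the budget is exhausted.
        for (int i = turns.Count - 1; i >= 0; i--)
        {
            int cost = turns[i].Length;
            if (kept.Count > 0 && used + cost > maxCharacters)
                break;
            kept.Add(turns[i]);
            used += cost;
        }

        kept.Reverse(); // restore chronological order
        return kept;
    }
}
```

A window drops old turns entirely, so facts the user stated early on can be lost. Entity memory, or a running summary of the dropped turns, is one way to compensate.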
Conclusion
Voice-enabled agents in .NET are becoming practical for real applications. Azure AI Voice Live removes the complexity of building your own speech recognition and reasoning pipeline. Microsoft Agent Framework provides the orchestration layer. What’s left for you is the application logic: routing decisions, specialist agents, context management, and the conversational design that makes your agent useful.
Start small. Pick one specialist agent and get it working well. Then add more. Test with real users and iterate on the conversation flows. Voice is more natural than text, but it’s also less forgiving of mistakes. Get it right, and your users will prefer talking to your agent over typing.
The architecture is straightforward: WebSocket connection, streaming audio, real-time responses, session state, specialist routing. Build these pieces carefully, and you’ll have a voice agent that feels responsive and intelligent.
What audio format does Azure AI Voice Live expect?
Azure AI Voice Live expects 16-bit PCM audio at 16kHz mono. Your client is responsible for capturing and encoding audio in that format before sending it over the WebSocket. The realtime model doesn’t do the encoding for you, but it does handle the speech recognition and reasoning on the server side.
How do I authenticate with Azure AI Voice Live from .NET?
Use Microsoft Entra ID authentication with DefaultAzureCredential. This works with az login, managed identities, or environment variables. The Azure SDK handles the authentication handshake. Key-based authentication is not supported for agent mode in Voice Live.
How do I handle cases where the user’s intent is ambiguous?
When intent classification is uncertain, ask for clarification rather than guessing. Store the ambiguous input in the session context, and let the user’s next response refine the intent. For example, if the user says “book something”, ask “Would you like to book a ski lesson or equipment rental?” Their response gives you the clarity you need.
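A minimal sketch of that decision, assuming the classifier exposes a confidence score like the `Confidence` property on `IntentClassification`. `ClarificationPolicy` and the threshold value you pass in are hypothetical, and would need tuning against real conversations.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: decides when to ask a clarifying question and builds it.
public static class ClarificationPolicy
{
    // True when the classifier's confidence is below the chosen threshold.
    public static bool NeedsClarification(double confidence, double threshold)
    {
        return confidence < threshold;
    }

    // Builds a question of the form "Would you like a, b or c?".
    public static string BuildQuestion(IReadOnlyList<string> options)
    {
        if (options == null || options.Count < 2)
            throw new ArgumentException("Need at least two options.", nameof(options));

        string head = string.Join(", ", options.Take(options.Count - 1));
        string last = options[options.Count - 1];
        return $"Would you like {head} or {last}?";
    }
}
```

For the example above, the options "to book a ski lesson" and "to rent equipment" produce "Would you like to book a ski lesson or to rent equipment?".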
What happens if the WebSocket connection drops mid-conversation?
Store your session context in a persistent store like Redis keyed by session ID. When the client reconnects, retrieve the context and restore the conversation state. You can then continue where you left off. Make sure your frontend has reconnection logic with exponential backoff.
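The backoff schedule itself is a few lines of arithmetic. The sketch below is written in C# for consistency with the rest of the article; a browser client would port the same calculation. `ReconnectBackoff` and the example delays are hypothetical.

```csharp
using System;

// Hypothetical helper: computes reconnect delays with exponential backoff.
public static class ReconnectBackoff
{
    // Delay before the given retry attempt (0-based):
    // baseDelay * 2^attempt, capped at maxDelay.
    public static TimeSpan DelayFor(int attempt, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        if (attempt < 0) throw new ArgumentOutOfRangeException(nameof(attempt));

        // Clamp the exponent so very large attempt counts stay bounded.
        double factor = Math.Pow(2, Math.Min(attempt, 30));
        double milliseconds = Math.Min(
            baseDelay.TotalMilliseconds * factor,
            maxDelay.TotalMilliseconds);
        return TimeSpan.FromMilliseconds(milliseconds);
    }

    // Full jitter: a random delay from zero up to (but not including) the
    // capped exponential delay, which spreads out simultaneous reconnects.
    public static TimeSpan DelayWithJitter(
        int attempt, TimeSpan baseDelay, TimeSpan maxDelay, Random random)
    {
        TimeSpan cap = DelayFor(attempt, baseDelay, maxDelay);
        return TimeSpan.FromMilliseconds(random.NextDouble() * cap.TotalMilliseconds);
    }
}
```

With a 500ms base and a 30-second cap, the un-jittered delays run 0.5s, 1s, 2s, 4s and so on until they reach the cap. Jitter matters when many clients lose their connection at the same moment, such as during a backend restart.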
Can I use the same session context for both voice and text interactions?
Yes. Both voice and text sessions can share the same underlying session context and specialist agents. The difference is how the input arrives (audio stream versus text) and how the output is delivered (voice synthesis versus text). The business logic in your agents stays the same.
How do I test voice agents locally before deploying to Azure?
Mock the Azure AI Voice Live WebSocket responses in your unit tests. For integration tests, use the Azure SDK’s testing utilities or record real conversations and replay them. For end-to-end testing, deploy to a staging environment with the real Azure endpoint and test with a small set of users first.