Streaming AI Agent Responses in .NET with Server-Sent Events

Learn to stream AI agent responses in real-time using SSE and .NET. Reduce latency, handle backpressure, and deliver tokens as they’re generated.

0

Why Streaming Matters for AI Agents

When you call an LLM from your .NET application, the model doesn’t return the complete response instantly. It generates tokens one at a time, often over several seconds. If your ASP.NET Core endpoint waits for the entire response before sending anything to the client, users see nothing until the full response arrives. Then suddenly, a wall of text appears.

Streaming changes this entirely. Instead of waiting for completion, you send each token to the client as it arrives. The user sees the response building in real-time, which makes the application feel responsive even if the total time is identical. It’s a perception shift that significantly improves user experience in enterprise AI systems.

This article walks through implementing streaming responses using Server-Sent Events (SSE), handling token buffering, managing backpressure, and consuming streams on the client side.

Understanding Server-Sent Events for Streaming

Server-Sent Events is a simple HTTP-based protocol designed for one-way server-to-client streaming. Unlike WebSockets, SSE requires no upgrade handshake. The client opens a connection, and the server pushes data whenever it’s ready. This simplicity makes it ideal for streaming LLM tokens.

The protocol is straightforward. The server sends data with this format:

data: {"token": "Hello", "index": 0}

data: {"token": " world", "index": 1}

Each message ends with a blank line. The client listens using the native EventSource API in JavaScript or equivalent libraries in other platforms.

SSE is stateless from the HTTP perspective. The server doesn’t maintain connection state in memory. This makes it suitable for load-balanced scenarios where requests might move between instances.

Building a Streaming Endpoint in ASP.NET Core

Let’s build a practical endpoint that streams responses. This example uses Microsoft.Extensions.AI, which provides a provider-agnostic abstraction for LLM interactions with built-in streaming support.

[HttpPost("api/chat/stream")]
public async Task StreamChatResponse(
    [FromBody] ChatRequest request,
    IChatClient chatClient,
    CancellationToken cancellationToken)
{
    Response.ContentType = "text/event-stream";
    Response.Headers.Add("Cache-Control", "no-cache");
    Response.Headers.Add("Connection", "keep-alive");
    
    try
    {
        var messages = new List<ChatMessage>
        {
            new ChatMessage(ChatRole.System, "You are a helpful assistant."),
            new ChatMessage(ChatRole.User, request.Question)
        };
        
        var tokenStream = chatClient.GetStreamingResponseAsync(
            messages,
            cancellationToken);
        
        int tokenIndex = 0;
        await foreach (var update in tokenStream)
        {
            if (!string.IsNullOrEmpty(update.Text))
            {
                var eventData = new
                {
                    token = update.Text,
                    index = tokenIndex,
                    timestamp = DateTime.UtcNow
                };
                
                var json = JsonSerializer.Serialize(eventData);
                await Response.WriteAsync($"data: {json}

", cancellationToken);
                await Response.Body.FlushAsync(cancellationToken);
                
                tokenIndex++;
            }
        }
        
        await Response.WriteAsync("data: {"done":true}

", cancellationToken);
    }
    catch (OperationCanceledException)
    {
        await Response.WriteAsync(
            "data: {"error": "Stream cancelled"}

",
            cancellationToken);
    }
    catch (Exception ex)
    {
        await Response.WriteAsync(
            $"data: {{"error": "{ex.Message}"}}

",
            cancellationToken);
    }
}

The key points here:

  • Set the response content type to text/event-stream
  • Add Cache-Control and Connection headers to keep the connection alive
  • Use GetStreamingResponseAsync from IChatClient to get an async enumerable of response updates
  • Write each token update as a data line followed by a blank line
  • Flush after each write to ensure immediate delivery to the client
  • Handle cancellation gracefully

The IChatClient.GetStreamingResponseAsync method is an async enumerable that yields response updates as tokens arrive from the LLM. This abstraction works with Azure OpenAI, OpenAI, and other providers.

Token Buffering and Batching

Sending every single token immediately can create network overhead. In many cases, batching a few tokens together before sending reduces the total number of writes while keeping latency low.

[HttpPost("api/chat/stream-buffered")]
public async Task StreamChatResponseBuffered(
    [FromBody] ChatRequest request,
    IChatClient chatClient,
    CancellationToken cancellationToken)
{
    Response.ContentType = "text/event-stream";
    Response.Headers.Add("Cache-Control", "no-cache");
    Response.Headers.Add("Connection", "keep-alive");
    
    var buffer = new StringBuilder();
    const int bufferThreshold = 5;
    const int maxWaitMs = 100;
    int tokenCount = 0;
    
    try
    {
        var messages = new List<ChatMessage>
        {
            new ChatMessage(ChatRole.System, "You are a helpful assistant."),
            new ChatMessage(ChatRole.User, request.Question)
        };
        
        var tokenStream = chatClient.GetStreamingResponseAsync(
            messages,
            cancellationToken);
        
        var lastFlush = DateTime.UtcNow;
        
        await foreach (var update in tokenStream)
        {
            if (!string.IsNullOrEmpty(update.Text))
            {
                buffer.Append(update.Text);
                tokenCount++;
                
                bool shouldFlush = tokenCount >= bufferThreshold ||
                    (DateTime.UtcNow - lastFlush).TotalMilliseconds >= maxWaitMs;
                
                if (shouldFlush && buffer.Length > 0)
                {
                    var eventData = new
                    {
                        chunk = buffer.ToString(),
                        tokens = tokenCount
                    };
                    
                    var json = JsonSerializer.Serialize(eventData);
                    await Response.WriteAsync(
                        $"data: {json}

",
                        cancellationToken);
                    await Response.Body.FlushAsync(cancellationToken);
                    
                    buffer.Clear();
                    tokenCount = 0;
                    lastFlush = DateTime.UtcNow;
                }
            }
        }
        
        if (buffer.Length > 0)
        {
            var eventData = new { chunk = buffer.ToString(), tokens = tokenCount };
            var json = JsonSerializer.Serialize(eventData);
            await Response.WriteAsync($"data: {json}

", cancellationToken);
            await Response.Body.FlushAsync(cancellationToken);
        }
        
        await Response.WriteAsync("data: {"done":true}

", cancellationToken);
    }
    catch (Exception ex)
    {
        await Response.WriteAsync(
            $"data: {{"error": "{ex.Message}"}}

",
            cancellationToken);
    }
}

This approach accumulates tokens in a buffer and flushes when either a token threshold is reached or a time window expires. This reduces network calls while maintaining perceived responsiveness.

Handling Backpressure

Backpressure occurs when the client can’t consume data as fast as the server produces it. With streaming, this means tokens pile up in memory while waiting to be written. You need to handle this gracefully.

ASP.NET Core’s Response.Body.FlushAsync already applies backpressure by waiting if the TCP send buffer is full. However, you can add explicit monitoring:

private async Task WriteStreamDataWithBackpressure(
    HttpResponse response,
    string data,
    CancellationToken cancellationToken)
{
    try
    {
        await response.WriteAsync(data, cancellationToken);
        
        if (cancellationToken.IsCancellationRequested)
        {
            throw new OperationCanceledException("Client disconnected");
        }
        
        var flushTask = response.Body.FlushAsync(cancellationToken);
        var completedTask = await Task.WhenAny(
            flushTask,
            Task.Delay(5000, cancellationToken));
        
        if (completedTask != flushTask)
        {
            throw new TimeoutException(
                "Response flush exceeded timeout. Client may be disconnected.");
        }
    }
    catch (IOException ex)
    {
        throw new OperationCanceledException(
            "Unable to write to response stream.", ex);
    }
}

In production, also consider setting TCP_NODELAY on the socket to reduce latency and implementing monitoring for slow clients that might be consuming tokens too slowly.

Client-Side Consumption

On the frontend, consuming the stream is straightforward with the EventSource API. Here’s a practical example:

async function streamChatResponse(message) {
    const responseContainer = document.getElementById('response');
    responseContainer.innerHTML = '';
    
    const eventSource = new EventSource(
        `/api/chat/stream?message=${encodeURIComponent(message)}`
    );
    
    eventSource.addEventListener('message', (event) => {
        try {
            const data = JSON.parse(event.data);
            
            if (data.done) {
                eventSource.close();
                return;
            }
            
            if (data.error) {
                responseContainer.innerHTML += 
                    `<p class="error">Error: ${data.error}</p>`;
                eventSource.close();
                return;
            }
            
            responseContainer.innerHTML += escapeHtml(data.chunk || data.token);
        } catch (e) {
            console.error('Failed to parse event data:', e);
        }
    });
    
    eventSource.addEventListener('error', () => {
        responseContainer.innerHTML += 
            '<p class="error">Connection lost.</p>';
        eventSource.close();
    });
}

function escapeHtml(text) {
    const div = document.createElement('div');
    div.textContent = text;
    return div.innerHTML;
}

The EventSource API automatically handles reconnection if the connection drops. For more control over buffering and to support POST requests, you can use fetch with a ReadableStream instead:

async function streamWithFetch(message) {
    const response = await fetch('/api/chat/stream', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ message })
    });
    
    const reader = response.body.getReader();
    const decoder = new TextDecoder();
    let buffer = '';
    
    try {
        while (true) {
            const { done, value } = await reader.read();
            if (done) break;
            
            buffer += decoder.decode(value, { stream: true });
            
            const lines = buffer.split('

');
            buffer = lines[lines.length - 1];
            
            for (let i = 0; i < lines.length - 1; i++) {
                const line = lines[i].replace(/^data: /, '').trim();
                if (line) {
                    processStreamEvent(JSON.parse(line));
                }
            }
        }
    } finally {
        reader.releaseLock();
    }
}

Fetch with ReadableStream gives you more granular control over buffering and allows POST requests, which is cleaner for passing complex request bodies.

Performance Considerations

Streaming adds several layers of I/O. Here are practical considerations:

Memory Usage: With streaming, you’re not holding the entire response in memory. However, if tokens arrive faster than they’re sent, they queue in the TCP send buffer. Set reasonable buffer sizes and monitor memory under load.

CPU Impact: JSON serialization happens per token or per batch. For high-throughput scenarios, consider sending raw tokens and parsing less frequently on the client.

Network Efficiency: SSE adds overhead compared to a single large response because of repeated headers. Batching tokens as shown earlier helps significantly.

Connection Limits: SSE keeps connections open. Ensure your server has adequate connection pool capacity. Kestrel handles this well, but be aware in containerized environments where connection limits may be tighter.

Timeout Configuration: Long-lived streaming connections can hit timeouts. Configure your load balancer, reverse proxy, and application to allow sufficiently long timeouts. In ASP.NET Core, set RequestTimeout appropriately:

builder.Services.Configure<KestrelServerOptions>(options =>
{
    options.Limits.RequestHeadersTimeout = TimeSpan.FromSeconds(60);
    options.Limits.KeepAliveTimeout = TimeSpan.FromSeconds(120);
});

Error Handling and Resilience

Streaming introduces failure modes that don’t exist with traditional request-response. Here’s a robust pattern:

public async Task StreamWithResilience(
    [FromBody] ChatRequest request,
    IChatClient chatClient,
    CancellationToken cancellationToken)
{
    Response.ContentType = "text/event-stream";
    Response.Headers.Add("Cache-Control", "no-cache");
    
    var messages = new List<ChatMessage>
    {
        new ChatMessage(ChatRole.System, "You are a helpful assistant."),
        new ChatMessage(ChatRole.User, request.Question)
    };
    
    var maxRetries = 3;
    var retryCount = 0;
    
    while (retryCount < maxRetries)
    {
        try
        {
            var tokenStream = chatClient.GetStreamingResponseAsync(
                messages,
                cancellationToken);
            
            await foreach (var update in tokenStream)
            {
                if (!string.IsNullOrEmpty(update.Text))
                {
                    try
                    {
                        var json = JsonSerializer.Serialize(
                            new { token = update.Text });
                        await Response.WriteAsync(
                            $"data: {json}

",
                            cancellationToken);
                        await Response.Body.FlushAsync(cancellationToken);
                    }
                    catch (IOException)
                    {
                        return;
                    }
                }
            }
            
            await Response.WriteAsync("data: {"done":true}

", cancellationToken);
            return;
        }
        catch (HttpRequestException ex) when (retryCount < maxRetries - 1)
        {
            retryCount++;
            await Task.Delay(1000 * retryCount, cancellationToken);
        }
        catch (Exception ex)
        {
            await Response.WriteAsync(
                $"data: {{"error": "{ex.Message}"}}

",
                cancellationToken);
            return;
        }
    }
}

The pattern here distinguishes between transient failures (which can be retried) and permanent failures (which should terminate the stream). Client disconnections are detected via IOException and handled gracefully.

Monitoring and Observability

Streaming responses require different monitoring than traditional endpoints because completion doesn’t correspond to a single response. Add structured logging to track stream lifecycle:

var streamId = Guid.NewGuid().ToString();
_logger.LogInformation(
    "Stream started: {StreamId}",
    streamId);

var sw = Stopwatch.StartNew();
int tokensSent = 0;

await foreach (var update in tokenStream)
{
    if (!string.IsNullOrEmpty(update.Text))
    {
        tokensSent++;
        await Response.WriteAsync($"data: {update.Text}

", cancellationToken);
    }
}

sw.Stop();
_logger.LogInformation(
    "Stream completed: {StreamId}, tokens: {TokenCount}, duration: {Duration}ms",
    streamId,
    tokensSent,
    sw.ElapsedMilliseconds);

Track metrics like tokens per second, total stream duration, and error rates. This helps identify performance bottlenecks and client issues in production.

Putting It Together

Streaming AI responses in .NET transforms how users experience your applications. The pattern is straightforward: stream tokens via SSE, batch for efficiency, handle backpressure, and consume on the client with EventSource or fetch. The complexity is manageable, and the perceived performance gain is significant.

Start with simple token-by-token streaming, then optimize with batching based on your actual latency requirements. Monitor real-world performance, adjust buffer sizes, and add resilience as needed. The investment pays dividends in user satisfaction and application responsiveness.

What’s the difference between SSE and WebSockets for streaming?

SSE is simpler and stateless. The server sends data without maintaining connection state, making it easier to scale across multiple instances. WebSockets provide bidirectional communication, which you don’t need for one-way token streaming. SSE is the better choice for streaming LLM responses unless you need real-time client-to-server communication.

How do I handle a client disconnect mid-stream?

The client disconnect is detected when you try to write to the response stream, which throws an IOException. Catch this exception and return gracefully. The token generation on the server continues until the next write attempt fails, so there’s minimal wasted work.

Should I batch all tokens or send them individually?

It depends on your latency requirements. Sending individual tokens maximizes perceived responsiveness but increases network overhead. Batching 5 to 10 tokens or waiting up to 100ms reduces overhead while keeping the response feeling snappy. Test with your actual LLM and network conditions to find the sweet spot.

Can I use SSE with authentication?

Yes. Authenticate the initial HTTP request as normal using cookies, bearer tokens, or other mechanisms. The authentication happens before the stream starts, and the connection inherits the authenticated context. The long-lived connection itself isn’t re-authenticated.

What happens if the LLM stops generating tokens?

Your async enumerable stops yielding, and the await foreach loop exits. You send a final completion message to signal completion to the client. If the LLM times out or errors, catch the exception and send an error message before closing the stream.

Leave a Reply

Your email address will not be published. Required fields are marked *