Frontend Architecture · AI · SSE

Streaming Agent UI Without the Chatbot Clipart

Animesh Pandey · May 31, 2026 · 13 min read

Teams add a chat bubble to their product and call it an AI feature. Users click it once, get a spinner for seven seconds, then go back to whatever they were doing. The problem isn't the model - it's the gap between what the model is doing and what the UI is showing. An opaque wait is indistinguishable from a broken request.

I spent the better part of the past year building streaming agent UI on a multi-tenant analytics platform - the kind where users have real work to do and real deadlines, and the AI is supposed to help them do it faster, not give them something new to babysit. What I learned is that the UI contract for a streaming agent is almost nothing like a standard request-response interaction, and reaching for a generic chat SDK without thinking about that contract produces exactly the spinner-and-wait experience that erodes trust.

This article covers what the contract actually looks like, why you almost certainly want a backend proxy between your frontend and the model, and how to implement the streaming hook and UX state machine that makes the AI feel like part of the product rather than a bolt-on.

Why the browser shouldn't call the model directly

The short answer: API keys, timeouts, and chunked responses. Browser-to-model calls require you to either expose a credential in client-side code or build a token-exchange flow, and both are more overhead than a backend proxy. The more interesting reason is connection lifetime. Completions for complex planning tasks often take 15–90 seconds, and a browser-held stream that long is fragile: it can be severed by network transitions, tab backgrounding, or an idle timeout at any hop between the client and the model.

A thin backend proxy - a Node.js Express route is about 40 lines - solves all of this. It holds the upstream connection to the model, relays the SSE stream to the browser as events arrive, handles retry on upstream error, and enforces a sane timeout on the model call without letting it propagate as a broken pipe to the browser. The proxy is also where you add request authentication, context injection, and model versioning without shipping those details to every client.
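To make the last point concrete, here is a minimal sketch of the request-shaping step inside such a proxy. Everything named here is a hypothetical illustration rather than any provider's real API: the `ClientBody` and `Session` shapes, the `MODEL_VERSION` constant, the placeholder URL, and the request body layout.

```typescript
// Hypothetical shapes: what the browser sends, and what the proxy knows server-side.
interface ClientBody { prompt: string }
interface Session { userId: string; tenantId: string }

interface UpstreamRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Assumption: the model version is pinned in one place on the server.
const MODEL_VERSION = 'example-model-v1';

function buildUpstreamRequest(
  client: ClientBody,
  session: Session,
  apiKey: string,
): UpstreamRequest {
  return {
    url: 'https://model.example.com/completions', // placeholder, not a real endpoint
    headers: {
      'Content-Type': 'application/json',
      // The credential is attached here and never shipped to the browser.
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model: MODEL_VERSION,
      stream: true,
      // Context injection: tenant scoping comes from the session, not the client.
      context: { tenantId: session.tenantId, userId: session.userId },
      prompt: client.prompt,
    }),
  };
}
```

The browser only ever sends `{ prompt }`. Changing the model version or the injected context becomes a server deploy, with no client release involved.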

Browser (React)          Node BFF Proxy           Model API
     │                        │                       │
     │  POST /api/stream       │                       │
     │ ──────────────────────► │                       │
     │                        │  POST /completions     │
     │                        │ ──────────────────────►│
     │                        │                       │
     │   event: status        │   chunk: token...      │
     │ ◄────────────────────── │ ◄──────────────────── │
     │   event: delta         │   chunk: token...      │
     │ ◄────────────────────── │ ◄──────────────────── │
     │   event: tool_call     │   chunk: [tool use]    │
     │ ◄────────────────────── │ ◄──────────────────── │
     │   event: delta         │   chunk: token...      │
     │ ◄────────────────────── │ ◄──────────────────── │
     │   event: done          │   [DONE]               │
     │ ◄────────────────────── │ ◄──────────────────── │

Figure 1: The BFF proxy holds the upstream model connection and forwards typed SSE events to the browser. The browser never holds a direct socket to the model API.

Designing the SSE event contract

Server-sent events are a plain-text streaming protocol. Events are separated by a blank line (two consecutive newlines); in this contract, each event's data: field carries a JSON payload. The format is simple enough that you don't need a library to parse it. The browser's built-in EventSource handles reconnection automatically, but it only issues GET requests, so for POST-based streams (required for sending a prompt body) you need fetch with ReadableStream parsing. More on that in a moment.

The event schema I've converged on after multiple iterations looks like this:

type StreamEvent =
  | { type: 'status';    phase: 'thinking' | 'tool' | 'typing' }
  | { type: 'delta';     text: string }
  | { type: 'tool_call'; name: string; input?: Record<string, unknown> }
  | { type: 'done' }
  | { type: 'error';     message: string; retryable: boolean };

The status event drives the thinking indicator. delta events append text to the response buffer. tool_call events render inline tool-call panels - the name and summarised input are safe to show users; the full payload stays server-side. done finalises the message and clears the in-progress state. error tells the UI whether to offer a retry or surface a terminal failure.

The key constraint: every event must be safe to render immediately and independently. You can't design an event schema that requires the UI to buffer and reorder - that defeats the purpose of streaming. If your model returns tool results interleaved with completion tokens, normalise them in the proxy before forwarding.
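A minimal sketch of that normalisation step, assuming a made-up upstream chunk format. `UpstreamChunk`, `normalise`, `summarise`, and `toFrame` are hypothetical names, and real model APIs shape their chunks differently. The point is only that each upstream chunk maps to zero or more events that are safe to render on arrival.

```typescript
type StreamEvent =
  | { type: 'status';    phase: 'thinking' | 'tool' | 'typing' }
  | { type: 'delta';     text: string }
  | { type: 'tool_call'; name: string; input?: Record<string, unknown> }
  | { type: 'done' }
  | { type: 'error';     message: string; retryable: boolean };

// Assumed upstream shape, for illustration only.
type UpstreamChunk =
  | { kind: 'token'; text: string }
  | { kind: 'tool_use'; tool: string; args: Record<string, unknown> }
  | { kind: 'tool_result'; tool: string; result: unknown }
  | { kind: 'end' };

// Keeps primitive argument values and hides everything else from the browser.
function summarise(args: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(args)) {
    const primitive =
      typeof value === 'string' || typeof value === 'number' || typeof value === 'boolean';
    out[key] = primitive ? value : '[omitted]';
  }
  return out;
}

function normalise(chunk: UpstreamChunk): StreamEvent[] {
  switch (chunk.kind) {
    case 'token':
      return [{ type: 'delta', text: chunk.text }];
    case 'tool_use':
      return [
        { type: 'status', phase: 'tool' },
        { type: 'tool_call', name: chunk.tool, input: summarise(chunk.args) },
      ];
    case 'tool_result':
      // The result itself stays server-side; the browser only learns text is resuming.
      return [{ type: 'status', phase: 'typing' }];
    case 'end':
      return [{ type: 'done' }];
  }
}

// Serialises one event into the SSE wire format: a data line plus a blank line.
function toFrame(evt: StreamEvent): string {
  return `data: ${JSON.stringify(evt)}\n\n`;
}
```

Because `normalise` returns events in the order they should be rendered, the browser never has to buffer or reorder anything.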

The streaming hook

The hook's job is to manage the fetch lifecycle, parse the SSE byte stream, dispatch typed events to React state, and expose an AbortController signal the UI can use to cancel mid-stream.

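// Reads an SSE response body and dispatches each parsed event to onEvent.
async function consumeSSE(
  response: Response,
  onEvent: (e: StreamEvent) => void,
  signal: AbortSignal
): Promise<void> {
  if (!response.body) throw new Error('Response has no body to stream');
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  try {
    while (true) {
      const { value, done } = await reader.read();
      if (done || signal.aborted) break;

      // { stream: true } keeps multi-byte characters that span chunks intact.
      buffer += decoder.decode(value, { stream: true });
      // Events end with a blank line; the last element may be an incomplete event.
      const parts = buffer.split('\n\n');
      buffer = parts.pop() ?? '';

      for (const part of parts) {
        // Stop dispatching as soon as the caller aborts, even mid-chunk.
        if (signal.aborted) return;
        const dataLine = part.split('\n').find(l => l.startsWith('data: '));
        if (!dataLine) continue;
        let evt: StreamEvent;
        try {
          evt = JSON.parse(dataLine.slice(6)) as StreamEvent;
        } catch { continue; /* malformed chunk - skip */ }
        onEvent(evt);
      }
    }
  } finally {
    // Release the underlying stream whether we finished, aborted, or threw.
    reader.cancel().catch(() => {});
  }
}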

This is deliberately minimal. It doesn't retry (that's the proxy's job), it doesn't reconnect (the UI owns that decision), and it doesn't buffer partial tokens into larger chunks (the consumer does what it wants with each delta). The byte-accumulation pattern - accumulate into a buffer, split on \n\n, leave the trailing incomplete event in the buffer - is the right way to handle chunked TCP delivery where event boundaries don't align with chunk boundaries.
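The accumulate, split, keep-the-tail pattern is easier to reason about when pulled out of the read loop. Here is a minimal sketch that works on already-decoded text; `SSEFrameBuffer` is a hypothetical name for illustration, not part of the hook above.

```typescript
// Collects decoded text and yields the data payload of every complete event.
class SSEFrameBuffer {
  private buffer = '';

  push(chunk: string): string[] {
    this.buffer += chunk;
    const parts = this.buffer.split('\n\n');
    // Whatever follows the last blank line is an incomplete event: keep it for later.
    this.buffer = parts.pop() ?? '';

    const payloads: string[] = [];
    for (const part of parts) {
      const dataLine = part.split('\n').find((l) => l.startsWith('data: '));
      if (dataLine) payloads.push(dataLine.slice(6));
    }
    return payloads;
  }

  // The trailing text that has not yet formed a complete event.
  get pending(): string {
    return this.buffer;
  }
}

// Chunk boundaries that cut through an event and through its terminator:
//   push('data: {"type":"del')                                  -> []
//   push('ta","text":"Hi"}\n\ndata: {"type":"done"}\n')         -> ['{"type":"delta","text":"Hi"}']
//   push('\n')                                                  -> ['{"type":"done"}']
```

The third call is the case that naive per-chunk parsing gets wrong: the event body arrived in one chunk and half of its terminator in the next.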

The hook in React wraps consumeSSE with state management:

import { useRef, useState } from 'react';

type Phase = 'idle' | 'thinking' | 'tool' | 'typing' | 'done' | 'error';
// Tool calls are stored exactly as they arrive on the wire.
type ToolCall = Extract<StreamEvent, { type: 'tool_call' }>;

function useAgentStream() {
  const [phase, setPhase] = useState<Phase>('idle');
  const [text, setText] = useState('');
  const [toolCalls, setToolCalls] = useState<ToolCall[]>([]);
  const abortRef = useRef<AbortController | null>(null);

  async function send(prompt: string) {
    // Cancel any in-flight request so its events cannot reach state.
    abortRef.current?.abort();
    abortRef.current = new AbortController();
    const signal = abortRef.current.signal;

    setPhase('thinking'); setText(''); setToolCalls([]);

    try {
      const response = await fetch('/api/stream', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ prompt }),
        signal,
      });

      if (!response.ok) { setPhase('error'); return; }

      await consumeSSE(response, (evt) => {
        switch (evt.type) {
          case 'status':    setPhase(evt.phase); break;
          case 'delta':     { const chunk = evt.text; setText(t => t + chunk); break; }
          case 'tool_call': { const call = evt; setToolCalls(tc => [...tc, call]); break; }
          case 'done':      setPhase('done'); break;
          case 'error':     setPhase('error'); break;
        }
      }, signal);
    } catch {
      // An aborted fetch or read rejects. That is expected when a newer
      // send() or cancel() took over, so only report other failures.
      if (!signal.aborted) setPhase('error');
    }
  }

  function cancel() { abortRef.current?.abort(); setPhase('idle'); }

  return { phase, text, toolCalls, send, cancel };
}

Notice the abortRef.current?.abort() at the top of send. If the user sends a second message while the first is still streaming, we cancel the in-flight request before issuing the new one. This is the pattern that prevents ghost updates - half-rendered responses from stale requests arriving after a new one starts.

UX states and why each matters

The state machine has six states, and each one requires deliberate UI treatment. Collapsing them - showing one spinner for thinking, tool execution, and slow network alike - produces an interface that users can't read or trust.

  ┌─────────┐  send()   ┌──────────┐  status:tool   ┌──────────┐
  │  idle   │ ─────────►│ thinking │ ──────────────► │   tool   │
  └─────────┘           └──────────┘                 └──────────┘
       ▲                     │                            │
       │                     │ status:typing              │ delta / status:typing
  cancel()                   ▼                            ▼
       │                ┌──────────┐                 ┌──────────┐
       └─── cancel() ── │  typing  │ ◄───────────────┤  typing  │
                        └──────────┘                 └──────────┘
                             │ done                       │ done
                             ▼                            ▼
                        ┌──────────┐                 ┌──────────┐
                        │   done   │                 │   done   │
                        └──────────┘                 └──────────┘
                             │ error (any state)
                             ▼
                        ┌──────────┐
                        │  error   │
                        └──────────┘

Figure 2: The phase state machine. Tool execution interrupts the typing phase; both paths converge at done or error. Cancel is valid from any non-idle state.
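The transitions in Figure 2 can also be written as a pure function, which makes them testable without React. This is a sketch under two assumptions of mine: events that arrive outside an in-flight request are ignored, and a tool_call event leaves the phase alone because status events drive it. `Action` and `nextPhase` are hypothetical names.

```typescript
type Phase = 'idle' | 'thinking' | 'tool' | 'typing' | 'done' | 'error';

type StreamEvent =
  | { type: 'status';    phase: 'thinking' | 'tool' | 'typing' }
  | { type: 'delta';     text: string }
  | { type: 'tool_call'; name: string; input?: Record<string, unknown> }
  | { type: 'done' }
  | { type: 'error';     message: string; retryable: boolean };

// User actions plus everything the stream can deliver.
type Action = { type: 'send' } | { type: 'cancel' } | StreamEvent;

function nextPhase(current: Phase, action: Action): Phase {
  // Stream events only matter while a request is in flight.
  const inFlight = current === 'thinking' || current === 'tool' || current === 'typing';

  switch (action.type) {
    case 'send':      return 'thinking';
    case 'cancel':    return 'idle';
    case 'status':    return inFlight ? action.phase : current;
    case 'delta':     return inFlight ? 'typing' : current; // Figure 2: delta moves tool to typing
    case 'tool_call': return current;                       // assumption: status drives the phase
    case 'done':      return inFlight ? 'done' : current;
    case 'error':     return inFlight ? 'error' : current;
  }
}
```

Ignoring stream events in idle, done, and error is what keeps a late delta from a cancelled request from pulling the UI back into typing.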

thinking: The model has received the prompt and is deciding what to do. Show a pulsing indicator with copy that acknowledges the request. Don't show a spinner - a spinner implies the system is busy in a way the user can't influence. A pulsing dot with "Analysing your question…" is honest: the system is thinking, not broken.

tool: The model is calling a tool (querying a database, running a calculation, fetching context). This is the most important state to make visible, because it explains a gap in the stream that would otherwise look like a hang. Render an inline tool-call panel showing the tool name and a summarised input. "Querying spend data for Q1 2026" is far more trustworthy than a frozen spinner. The panel is dismissible after the tool resolves.

typing: Text is actively arriving. Append tokens to the response buffer. The cursor blink (a simple CSS animation on a ::after pseudo-element) should appear only when new text is arriving - stop it on done or error.

done: The model has finished. Stop the cursor. Optionally run a subtle fade-in on any figures or code blocks that were partially constructed. Do not reflow the page - reflows after a response lands train users to look away during streaming, which defeats the purpose.

error: Distinguish between retryable errors (network interruption, upstream timeout) and terminal ones (bad prompt, quota exceeded). For retryable errors, show an inline retry button with the original prompt already populated. For terminal errors, show a plain message and a "start over" path. Never surface raw error messages from the model or proxy.

idle: The default. The entry point has a text input and a send button. Nothing else. Adding "suggested prompts" or a conversation history header before the user has sent anything increases cognitive load for zero benefit.
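The retryable/terminal split described under error can be kept in one table on the proxy, so raw upstream messages never reach the browser. This is a sketch only: the `FailureKind` categories come from the examples in the text, while the message copy and all function names are hypothetical.

```typescript
type StreamErrorEvent = { type: 'error'; message: string; retryable: boolean };

// Categories taken from the examples in the text; a real proxy may need more.
type FailureKind = 'network' | 'upstream_timeout' | 'bad_prompt' | 'quota_exceeded';

// Placeholder copy. The raw upstream error belongs in server logs, not here.
const USER_FACING: Record<FailureKind, { message: string; retryable: boolean }> = {
  network:          { retryable: true,  message: 'The connection dropped before the answer finished.' },
  upstream_timeout: { retryable: true,  message: 'That took longer than expected.' },
  bad_prompt:       { retryable: false, message: 'That request could not be processed.' },
  quota_exceeded:   { retryable: false, message: 'The usage limit has been reached.' },
};

// Proxy side: turn a classified failure into the wire event.
function toErrorEvent(kind: FailureKind): StreamErrorEvent {
  const { message, retryable } = USER_FACING[kind];
  return { type: 'error', message, retryable };
}

type ErrorAffordance =
  | { kind: 'retry'; prefill: string }
  | { kind: 'start_over' };

// UI side: decide which recovery path to render for an error event.
function affordanceFor(evt: StreamErrorEvent, originalPrompt: string): ErrorAffordance {
  return evt.retryable
    ? { kind: 'retry', prefill: originalPrompt }
    : { kind: 'start_over' };
}
```

Keeping the `retryable` decision on the server means the UI never has to interpret status codes or error strings.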

┌─────────────────────────────────────────────────────────────┐
│  Agent panel (fixed width drawer or inline card)            │
│                                                             │
│  ┌── Prompt input ─────────────────────────── [Send] ──┐   │
│  │  What drove the CPM spike last Thursday?             │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                             │
│  ┌── Response area ─────────────────────────────────────┐   │
│  │  [thinking]  ●  Analysing your question…            │   │
│  │              ↓                                       │   │
│  │  [tool]  ▶  querying: campaign performance (Q2)     │   │
│  │              ↓                                       │   │
│  │  [typing]  The CPM spike on Thursday correlates     │   │
│  │            with a budget exhaustion event on your   │   │
│  │            top-spend channel at 14:32 UTC. When     │   │
│  │            the primary campaign paused, secondary▌  │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                    [✕ Stop] │
└─────────────────────────────────────────────────────────────┘

Figure 3: UI wireframe during the typing phase, following a tool call. The tool panel renders inline, above the streaming text. The stop button is only visible while streaming is active.

Persistence tradeoffs

Where do you store the conversation? The honest answer is: it depends on how much you trust the user's browser and how much the conversation is worth keeping.

sessionStorage is the right default. It survives page refreshes within a tab, disappears on tab close, and doesn't bloat localStorage with conversation history the user will never revisit. It also sidesteps the multi-tab problem: two tabs with the same agent shouldn't share state.

localStorage is appropriate when conversation continuity is part of the product promise - a user who returns the next day and expects context to be retained. The cost is quota pressure: localStorage is limited to a few megabytes per origin in most browsers, and a long session with streaming text and tool responses can easily reach several hundred kilobytes. Implement size-based eviction (drop the oldest turns until the stored byte size is back under your threshold) before you ship to production.
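A minimal sketch of dropping the oldest turns once the stored size passes a threshold. The `Turn` shape and the function names are hypothetical, and the size used here is the UTF-8 length of the JSON, which only approximates how a browser accounts for storage.

```typescript
// Hypothetical stored shape: one prompt and its completed response.
interface Turn { prompt: string; response: string }

const encoder = new TextEncoder();

// Approximate stored size: UTF-8 bytes of the serialised turns.
function byteSize(turns: Turn[]): number {
  return encoder.encode(JSON.stringify(turns)).length;
}

// Drops turns from the front (oldest first) until the rest fits in maxBytes.
// The newest turn is always kept, even if it alone exceeds the budget.
function evictToFit(turns: Turn[], maxBytes: number): Turn[] {
  let start = 0;
  while (start < turns.length - 1 && byteSize(turns.slice(start)) > maxBytes) {
    start++;
  }
  return turns.slice(start);
}
```

Re-serialising on every step is quadratic in the number of turns. That is fine for a sketch and for short histories, but tracking a per-turn size would be cheaper for long ones.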

Server-side persistence is the right answer for anything that needs to be shared, audited, or restored across devices. Implement it as a hydration endpoint: on mount, the client fetches the last N turns (or the session ID from a URL param) and populates local state. The streaming hook then appends to that hydrated state. This avoids the complexity of real-time sync and keeps the hook's state model simple.

What I'd do differently: start with sessionStorage, define a serialisation schema with a version field on day one, and add server persistence only when users actually ask for it. Server persistence bolted on too early tends to be naive, and a naive implementation adds a write on every token - that's a lot of round-trips for something most users won't need.
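A minimal sketch of what a versioned serialisation schema could look like. The `Turn` shape, the function names, and the storage key in the usage comment are hypothetical. The one firm idea is that anything unreadable or from an unknown version is treated as an empty conversation and never as a crash.

```typescript
// Hypothetical stored shape: one prompt and its completed response.
interface Turn { prompt: string; response: string }

const SCHEMA_VERSION = 1;

interface StoredConversation {
  version: typeof SCHEMA_VERSION;
  turns: Turn[];
}

function serialise(turns: Turn[]): string {
  const payload: StoredConversation = { version: SCHEMA_VERSION, turns };
  return JSON.stringify(payload);
}

// Returns the stored turns, or an empty list for missing, corrupt,
// or unknown-version data.
function deserialise(raw: string | null): Turn[] {
  if (raw === null) return [];
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed === 'object' && parsed !== null) {
      const candidate = parsed as { version?: unknown; turns?: unknown };
      if (candidate.version === SCHEMA_VERSION && Array.isArray(candidate.turns)) {
        // Sketch-level trust: individual turns are not validated here.
        return candidate.turns as Turn[];
      }
    }
  } catch {
    // Corrupt JSON falls through to the empty default.
  }
  return [];
}

// Usage in the browser (key name is a placeholder):
//   sessionStorage.setItem('agent:conversation', serialise(turns));
//   const restored = deserialise(sessionStorage.getItem('agent:conversation'));
```

When the shape changes, bump `SCHEMA_VERSION` and either add a migration branch in `deserialise` or let old data fall through to the empty default.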

Testing streaming hooks

The hardest part of testing streaming UI is producing deterministic SSE fixtures. The approach I use: mock fetch to return a Response with a ReadableStream whose underlying source emits pre-scripted events with controllable timing.

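// Resolves after ms milliseconds; used to space out scripted events.
const sleep = (ms: number) =>
  new Promise<void>(resolve => setTimeout(resolve, ms));

// Builds a fetch-style Response whose body emits the given events as SSE frames.
function mockStream(events: StreamEvent[], delayMs = 0): Response {
  const encoder = new TextEncoder();
  const stream = new ReadableStream<Uint8Array>({
    async start(controller) {
      for (const evt of events) {
        if (delayMs) await sleep(delayMs);
        const chunk = `data: ${JSON.stringify(evt)}\n\n`;
        controller.enqueue(encoder.encode(chunk));
      }
      controller.close();
    }
  });
  return new Response(stream, { status: 200,
    headers: { 'Content-Type': 'text/event-stream' } });
}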

With this fixture factory you can write deterministic tests for every phase transition: assert that the phase moves from thinking to tool after the right event, assert that the text buffer accumulates correctly across multiple deltas, assert that abort clears state before the next send. You don't need Vitest fake timers for the happy path - the delayMs parameter lets you write timing-sensitive tests when you actually need them.

One thing worth testing explicitly: the ghost-update case. Send a first prompt, don't wait for it to complete, then send a second prompt. Assert that the text state contains only the second response's text, not an interleaving of both.

Lessons from production

Partial renders are less scary than you think. Users adapted quickly to seeing incomplete sentences stream in. What they don't adapt to is an interface that looks frozen. A visible cursor on an empty response area is better than a spinner.
Tool-call panels are the highest-trust feature. Showing "I checked X before answering" consistently outperformed opaque responses in user feedback, even when the tool call was trivial. Transparency about how the answer was derived matters more than the answer itself.
Fifteen-second waits feel like failures. Even with a visible thinking indicator, users start to doubt the system after about 10 seconds. Consider emitting a progress event from the proxy ("still thinking - this one takes a moment") at the 8-second mark if the upstream hasn't sent a token yet.
Cancel is underused. Make the stop button prominent. Users who know they can cancel are more willing to try complex queries - they don't feel locked in.
The mistake I'd undo: I initially made the tool-call panel collapsible by default, reasoning that users didn't need to see it after the response arrived. They did. Several users specifically told us the tool trace was the part they trusted most. Now it stays expanded.
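The 8-second progress event from the third lesson could be sketched as a small decision function that the proxy calls on a timer. Note that `progress` is not part of the StreamEvent contract shown earlier. It is a hypothetical extension, as are `WaitState`, `ProgressNudge`, and `maybeNudge`.

```typescript
// Hypothetical extra event type; not part of the StreamEvent union above.
type ProgressNudge = { type: 'progress'; message: string };

// Per-request bookkeeping the proxy would hold while waiting on the model.
interface WaitState {
  startedAtMs: number;
  firstTokenSeen: boolean;
  nudgeSent: boolean;
}

const NUDGE_AFTER_MS = 8_000;

// Returns a nudge at most once per request, and only while no token has arrived.
function maybeNudge(state: WaitState, nowMs: number): ProgressNudge | null {
  if (state.firstTokenSeen || state.nudgeSent) return null;
  if (nowMs - state.startedAtMs < NUDGE_AFTER_MS) return null;
  state.nudgeSent = true;
  return { type: 'progress', message: 'Still thinking - this one takes a moment' };
}
```

A proxy might call `maybeNudge(state, Date.now())` once a second and set `firstTokenSeen` when the first upstream chunk arrives. Passing the clock in as an argument keeps the function testable without fake timers.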