AI • January 6, 2026 • 8 min read

Streaming AI Responses: Building Real-Time Chat UIs with Vercel AI SDK

Token-by-token streaming makes AI products feel instant, alive, and trustworthy. Here's how we use the Vercel AI SDK to ship real-time chat experiences that feel as fast as native apps.

Why Streaming Matters for AI UX

Most AI demos you see online share the same flaw: you type a prompt, hit enter, and stare at a blank screen for a few seconds. Then—suddenly—a full answer appears.

Technically, that works. Experientially, it feels slow, brittle, and “offline”.

When we build AI products for clients, we want them to feel:

  • Instant: something happens immediately after the user takes an action.
  • Alive: the model types back in real time, like a human.
  • Trustworthy: users can see the reasoning unfold instead of waiting for a mysterious wall of text.

Token-level streaming gives us all three. In this post, I’ll walk through how we use the Vercel AI SDK to build real-time chat UIs on top of Next.js.

What “Streaming” Actually Means

Under the hood, streaming is just sending data in chunks over a single HTTP response, instead of waiting to generate the full payload first.

For AI chat:

  1. The client sends a prompt and conversation history.
  2. The server calls the model with streaming enabled.
  3. As tokens arrive from the model, the server forwards them to the client.
  4. The client renders those tokens as they arrive—like a live typing effect.

Done well, this:

  • Reduces perceived latency dramatically.
  • Lets users interrupt the model mid-response.
  • Enables progressive enhancement (e.g. show citations only once the answer is done).

The Vercel AI SDK wraps all of this in a high-level API so you don’t need to hand-roll Server-Sent Events or WebSockets.
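
To make that concrete, here’s roughly what consuming a chunked HTTP response looks like by hand—the part the SDK abstracts away. This is a sketch only: it treats chunks as plain text for illustration, while the SDK’s real wire format carries structured message parts.

// Illustrative only: reading a chunked response with fetch and a stream reader.
async function readStream(prompt: string, onChunk: (text: string) => void) {
  const res = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages: [{ role: 'user', content: prompt }] }),
  });

  const reader = res.body!.getReader();
  const decoder = new TextDecoder();

  // Each read() resolves as soon as the next chunk of bytes arrives.
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    onChunk(decoder.decode(value, { stream: true }));
  }
}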

The Stack We’ll Use

For a modern AI chat experience, this is our go-to stack:

  • Framework: Next.js App Router (Edge runtime where possible)
  • AI SDK: Vercel AI SDK (ai) with @ai-sdk/openai (or other providers)
  • Frontend: React + useChat hook from ai/react
  • Hosting: Vercel (or any platform that supports streaming responses)

You can adapt the same patterns to other React frameworks, but App Router + Vercel AI SDK gives you the cleanest developer experience today.

Server: Streaming Responses with Vercel AI SDK

Let’s start on the server. We’ll build a minimal /api/chat endpoint that:

  • Accepts a list of messages from the client.
  • Streams back tokens from the model as they’re generated.

// app/api/chat/route.ts
import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';

export const runtime = 'edge';

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = streamText({
    model: openai('gpt-4o-mini'),
    messages,
    temperature: 0.7,
    maxTokens: 512,
  });

  // This returns a streamed HTTP response compatible with `useChat`
  return result.toDataStreamResponse();
}

A few important details here:

  • runtime = 'edge': this runs your route on the Edge, reducing latency to your users.
  • streamText: this is the Vercel AI SDK primitive that handles streaming tokens from the provider.
  • toDataStreamResponse(): converts the stream into a response the frontend can consume.

You don’t have to think about SSE headers, reconnection, or chunk parsing—the SDK handles it.

Client: Building a Real-Time Chat UI

On the client, we’ll use the useChat hook from ai/react. It manages:

  • Sending messages to your /api/chat route.
  • Maintaining the full message history.
  • Streaming partial responses as they arrive.

// app/chat/page.tsx
'use client';

import { useChat } from 'ai/react';

export default function ChatPage() {
  const {
    messages,
    input,
    handleInputChange,
    handleSubmit,
    isLoading,
    stop,
  } = useChat({
    api: '/api/chat',
  });

  return (
    <div className="flex flex-col h-[calc(100vh-80px)] max-w-2xl mx-auto">
      <div className="flex-1 overflow-y-auto space-y-4 p-4">
        {messages.map((message) => (
          <div
            key={message.id}
            className={
              message.role === 'user'
                ? 'flex justify-end'
                : 'flex justify-start'
            }
          >
            <div
              className={
                message.role === 'user'
                  ? 'rounded-2xl bg-zinc-900 text-white px-4 py-2 max-w-[80%]'
                  : 'rounded-2xl bg-zinc-100 text-zinc-900 px-4 py-2 max-w-[80%]'
              }
            >
              {message.content}
            </div>
          </div>
        ))}
        {isLoading && (
          <div className="text-xs text-zinc-500 px-4">Model is thinking…</div>
        )}
      </div>

      <form
        onSubmit={handleSubmit}
        className="border-t border-zinc-200 p-3 flex gap-2"
      >
        <input
          className="flex-1 rounded-full border border-zinc-200 px-4 py-2 text-sm focus:outline-none focus:ring-2 focus:ring-zinc-900"
          value={input}
          onChange={handleInputChange}
          placeholder="Ask me anything about your data..."
        />
        <button
          type="submit"
          disabled={isLoading || !input.trim()}
          className="rounded-full bg-black text-white px-4 py-2 text-sm disabled:opacity-40"
        >
          Send
        </button>
        {isLoading && (
          <button
            type="button"
            onClick={stop}
            className="text-xs text-zinc-500"
          >
            Stop
          </button>
        )}
      </form>
    </div>
  );
}

The magic is that messages updates live as tokens stream in. You don’t need to manually append partial chunks—the hook does that for you.

UX Patterns That Make Streaming Feel Premium

Once you have basic streaming working, the real value comes from polishing the experience:

  • Immediate feedback on submit

    • Disable the input and show a subtle “thinking…” state.
    • Echo the user’s message into the chat instantly (don’t wait for the server).
  • Typing effect without jank

    • Render streamed content as plain text; avoid heavy re-layouts (e.g. don’t re-measure heights on each token).
    • Use a CSS caret or subtle pulse instead of expensive JS animations.
  • Auto-scroll with respect for the user

    • Auto-scroll to bottom only if the user is already near the bottom (a sketch of this pattern follows the list).
    • If they scroll up to inspect an earlier message, stop auto-scrolling.
  • Interruptibility

    • Expose a clear “Stop generating” action (stop() from useChat).
    • Keep partial responses; users often find them useful even if incomplete.
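
Here’s a minimal sketch of that respectful auto-scroll behavior as a small React hook. The hook name, file path, and threshold are illustrative, not part of the Vercel AI SDK:

// hooks/use-auto-scroll.ts (illustrative; not part of the Vercel AI SDK)
import { useEffect, useRef } from 'react';

export function useAutoScroll<T extends HTMLElement>(deps: unknown[], threshold = 80) {
  const containerRef = useRef<T>(null);

  useEffect(() => {
    const el = containerRef.current;
    if (!el) return;

    // How far the user currently is from the bottom of the scroll area.
    const distanceFromBottom = el.scrollHeight - el.scrollTop - el.clientHeight;

    // Follow the stream only if the user is already near the bottom;
    // if they scrolled up to read something, leave their position alone.
    if (distanceFromBottom < threshold) {
      el.scrollTop = el.scrollHeight;
    }
    // Re-run whenever the streamed messages change.
    // eslint-disable-next-line react-hooks/exhaustive-deps
  }, deps);

  return containerRef;
}

In the chat page above, you’d call const containerRef = useAutoScroll<HTMLDivElement>([messages]) and attach the ref to the scrollable message container.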

These touches make the model feel less like a black box and more like a collaborative partner.

Handling Errors and Edge Cases

Real users will hit flaky networks, rate limits, and provider hiccups. Plan for it up front.

Timeouts & Cancellations

On the client, useChat exposes a stop function. On the server, the request comes with an AbortSignal you can pass to the model call so it stops cleanly when the user cancels.

import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = streamText({
    model: openai('gpt-4o-mini'),
    messages,
    // Forward the client's abort signal so the provider call is cancelled too.
    abortSignal: req.signal,
  });

  return result.toDataStreamResponse();
}

If the user navigates away or closes the tab, the request is aborted and the model call is cancelled—saving you tokens and money.

Provider Errors

Wrap your handler in basic error handling and return structured error messages that your UI can represent nicely:

import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';

export async function POST(req: Request) {
  try {
    const { messages } = await req.json();

    const result = streamText({
      model: openai('gpt-4o-mini'),
      messages,
    });

    return result.toDataStreamResponse();
  } catch (error) {
    console.error('Chat error', error);
    return new Response(
      JSON.stringify({ error: 'Something went wrong. Please try again.' }),
      { status: 500, headers: { 'Content-Type': 'application/json' } }
    );
  }
}

On the client, you can show a non-blocking toast or inline error bubble instead of crashing the whole chat.
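
One way to do that, sketched against the useChat hook from earlier (error and reload come from ai/react; the layout and styling here are illustrative):

// app/chat/page.tsx (error-handling variant of the earlier page)
'use client';

import { useChat } from 'ai/react';

export default function ChatPage() {
  const { messages, input, handleInputChange, handleSubmit, error, reload } =
    useChat({ api: '/api/chat' });

  return (
    <div className="max-w-2xl mx-auto p-4 space-y-4">
      {messages.map((message) => (
        <div key={message.id}>{message.content}</div>
      ))}

      {/* Non-blocking inline error bubble with a retry action */}
      {error && (
        <div className="rounded-2xl bg-red-50 text-red-700 px-4 py-2 text-sm">
          Something went wrong.{' '}
          <button type="button" onClick={() => reload()} className="underline">
            Retry
          </button>
        </div>
      )}

      <form onSubmit={handleSubmit}>
        <input
          value={input}
          onChange={handleInputChange}
          placeholder="Ask me anything about your data..."
        />
      </form>
    </div>
  );
}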

When to Stream vs. When Not To

Streaming is powerful, but it’s not always the right default.

You should stream when:

  • Responses are long and exploratory (chatbots, copilots, assistants).
  • You care about perceived speed and continuous engagement.
  • Users might want to interrupt or steer the answer mid-way.

You can skip streaming and use simple JSON responses when:

  • The output is short and binary (e.g. “is this email spam?”).
  • You’re building background jobs or webhooks, not UI.
  • You need a fully-formed object (e.g. structured JSON) before rendering anything.

The good news: Vercel AI SDK supports both with nearly identical APIs, so you can start simple and upgrade to streaming where it moves the needle.
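
For the non-streaming case, the server code looks almost the same—just generateText instead of streamText. A minimal sketch (the route path and prompt are illustrative):

// app/api/classify/route.ts (illustrative non-streaming endpoint)
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';

export async function POST(req: Request) {
  const { email } = await req.json();

  // Wait for the full answer; a short yes/no output doesn't benefit from streaming.
  const { text } = await generateText({
    model: openai('gpt-4o-mini'),
    prompt: `Answer "spam" or "not spam" for this email:\n\n${email}`,
  });

  return Response.json({ label: text.trim() });
}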

Bringing It All Together

Streaming isn’t just a technical upgrade—it’s a product decision. It changes how fast your app feels, how much users trust it, and how likely they are to stick around.

With Next.js App Router and the Vercel AI SDK, we can:

  • Implement streaming in a few lines of server code.
  • Wire up a real-time chat UI with a single hook.
  • Focus our energy on UX details instead of protocol plumbing.

If you’re building an AI product today and your UI still blocks on full responses, you’re leaving a lot of perceived performance on the table. Start small: stream one endpoint, ship a simple chat UI, and feel the difference immediately.
