Streaming LLM responses to the browser in Go (Server-Sent Events)
Ayi NEDJIMI · 2026-05-26 · via DEV Community

The biggest UX mistake in LLM-powered web apps is waiting for the complete response before sending anything. On a 400-token answer at typical generation speeds, that's 4–8 seconds of staring at a spinner. With streaming, the user sees the first word in under a second and reads along as the model generates. This tutorial shows you exactly how to implement token-by-token streaming from an LLM API to the browser using Server-Sent Events (SSE) in Go Fiber.

Why SSE and not WebSockets?

WebSockets are bidirectional. For LLM streaming, you don't need that — you send one request, the server pushes tokens back. SSE is:

  • Unidirectional (server → client), which fits the problem exactly
  • A plain HTTP/1.1 connection with text/event-stream content type
  • Automatically reconnectable by the browser's EventSource API
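  • Proxied by Nginx as ordinary HTTP, with no Upgrade handshake to configure (unlike WebSockets), though proxy buffering must be turned off, as covered in the Nginx section below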

The wire format is dead simple:

data: {"token": "Hello"}\n\n
data: {"token": " world"}\n\n
data: [DONE]\n\n


Each event is data: <payload>\n\n. The double newline is the event terminator.
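The framing rule can be sketched as a small Go helper. This is an illustrative sketch, not part of the article's handler: the name formatSSE is hypothetical, and it assumes payloads use \n line endings. It also shows the one wrinkle the format has: a payload containing a newline must be split across several data: lines, which the client joins back together. JSON-encoding each token, as this tutorial does, sidesteps that because json.Marshal escapes newlines.

```go
package main

import (
	"fmt"
	"strings"
)

// formatSSE renders payload as one SSE event. Each line of the payload gets
// its own "data:" field, and a blank line terminates the event.
// (Hypothetical helper for illustration; assumes "\n" line endings.)
func formatSSE(payload string) string {
	var b strings.Builder
	for _, line := range strings.Split(payload, "\n") {
		b.WriteString("data: ")
		b.WriteString(line)
		b.WriteString("\n")
	}
	b.WriteString("\n") // the empty line that ends the event
	return b.String()
}

func main() {
	// A single-line JSON payload, as used throughout this article.
	fmt.Printf("%q\n", formatSSE(`{"token": "Hello"}`))
	// A raw multi-line payload becomes two data: lines in one event.
	fmt.Printf("%q\n", formatSSE("line one\nline two"))
}
```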

The common mistake: buffering the full response

Here's what not to do:

// BAD: collects full LLM response then sends it
func badHandler(c *fiber.Ctx) error {
    fullResponse := callLLMAndWaitForCompletion(c.Query("q"))
    return c.JSON(fiber.Map{"response": fullResponse})
    // User waits 6 seconds. Sees response instantly. Still worse UX.
}


Even if you send it "instantly" after receiving it, the user waited the full generation time. Buffering eliminates the perceived speed advantage of fast models.

Setup

go get github.com/gofiber/fiber/v2
go get github.com/openai/openai-go  # or any OpenAI-compatible SDK


The SSE handler

// handlers/stream.go
package handlers

import (
    "bufio"
    "context"
    "encoding/json"
    "fmt"
    "log"
    "strings"
    "time"

    "github.com/gofiber/fiber/v2"
    openai "github.com/openai/openai-go"
    "github.com/openai/openai-go/option"
)

type StreamHandler struct {
    llmClient *openai.Client
    model     string
}

// Note: openai.F and a pointer-returning NewClient match early openai-go
// releases; later versions changed this API, so pin the version you build against.
func NewStreamHandler(apiKey, baseURL, model string) *StreamHandler {
    client := openai.NewClient(
        option.WithAPIKey(apiKey),
        option.WithBaseURL(baseURL),
    )
    return &StreamHandler{llmClient: client, model: model}
}

// sseEvent writes a single SSE event and flushes it to the client immediately.
func sseEvent(w *bufio.Writer, data string) error {
    if _, err := fmt.Fprintf(w, "data: %s\n\n", data); err != nil {
        return err
    }
    return w.Flush()
}

func (h *StreamHandler) StreamCompletion(c *fiber.Ctx) error {
    // Clone the query: Fiber strings are only valid until the handler returns,
    // and the stream writer below runs after that point.
    query := strings.Clone(strings.TrimSpace(c.Query("q", "")))
    if query == "" {
        return c.Status(fiber.StatusBadRequest).JSON(fiber.Map{
            "error": "query parameter 'q' is required",
        })
    }
    if len([]rune(query)) > 1000 {
        return c.Status(fiber.StatusBadRequest).JSON(fiber.Map{
            "error": "query too long (max 1000 characters)",
        })
    }

    // Set SSE headers before writing any body
    c.Set("Content-Type", "text/event-stream")
    c.Set("Cache-Control", "no-cache")
    c.Set("Connection", "keep-alive")
    c.Set("X-Accel-Buffering", "no") // Critical for Nginx: disables proxy buffering

    // Writes to the regular response body are buffered until the handler returns,
    // so stream through fasthttp's body stream writer and flush after every event.
    // The callback must not touch c; it only uses values captured above.
    c.Context().SetBodyStreamWriter(func(w *bufio.Writer) {
        // fasthttp does not cancel the request context when the client disconnects,
        // so use a standalone timeout; a disconnect surfaces as a write error instead.
        ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
        defer cancel()

        stream := h.llmClient.Chat.Completions.NewStreaming(ctx,
            openai.ChatCompletionNewParams{
                Model: openai.F(h.model),
                Messages: openai.F([]openai.ChatCompletionMessageParamUnion{
                    openai.SystemMessage("You are a helpful technical assistant. Be concise and accurate."),
                    openai.UserMessage(query),
                }),
                MaxTokens:   openai.Int(800),
                Temperature: openai.Float(0.3),
            },
        )
        defer stream.Close()

        tokenCount := 0
        for stream.Next() {
            chunk := stream.Current()
            if len(chunk.Choices) == 0 {
                continue
            }

            token := chunk.Choices[0].Delta.Content
            if token == "" {
                continue
            }

            tokenCount++
            payload, err := json.Marshal(map[string]string{"token": token})
            if err != nil {
                continue
            }

            if err := sseEvent(w, string(payload)); err != nil {
                // Client disconnected: returning cancels the upstream request
                log.Printf("Client disconnected after %d tokens", tokenCount)
                return
            }
        }

        if err := stream.Err(); err != nil {
            // Send error event so the client knows what happened
            errPayload, _ := json.Marshal(map[string]string{
                "error": "stream interrupted: " + err.Error(),
            })
            _ = sseEvent(w, string(errPayload))
            log.Printf("Stream error after %d tokens: %v", tokenCount, err)
            return
        }

        // Signal clean completion
        _ = sseEvent(w, "[DONE]")
        log.Printf("Stream complete: %d tokens for query: %q", tokenCount, query)
    })

    return nil
}


Main server

// main.go
package main

import (
    "log"
    "os"
    "time"

    "github.com/gofiber/fiber/v2"
    "github.com/gofiber/fiber/v2/middleware/cors"
    "github.com/gofiber/fiber/v2/middleware/limiter"
    "stream-api/handlers"
)

func main() {
    apiKey := os.Getenv("LLM_API_KEY")
    baseURL := os.Getenv("LLM_BASE_URL") // e.g. "https://api.openai.com/v1"
    model := os.Getenv("LLM_MODEL")      // e.g. "gpt-4o-mini"

    streamHandler := handlers.NewStreamHandler(apiKey, baseURL, model)

    // Fiber's StreamRequestBody option only changes how request bodies are read;
    // it has no effect on response buffering, so the default config is enough here.
    app := fiber.New()

    app.Use(cors.New())

    // Rate limit: 10 requests per minute per IP
    app.Use("/api/stream", limiter.New(limiter.Config{
        Max:        10,
        Expiration: 1 * time.Minute, // a time.Duration: a bare 60 would mean 60ns
    }))

    app.Get("/api/stream", streamHandler.StreamCompletion)

    log.Fatal(app.Listen(":4001"))
}


JavaScript client

This is the complete frontend implementation. No libraries needed — the browser's native EventSource API handles reconnection automatically.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>LLM Stream Demo</title>
    <style>
        body { font-family: monospace; max-width: 800px; margin: 40px auto; padding: 0 20px; }
        #output { white-space: pre-wrap; background: #f5f5f5; padding: 16px;
                  border-radius: 4px; min-height: 60px; }
        #status { color: #888; font-size: 0.85em; margin-top: 8px; }
        button { margin-top: 12px; padding: 8px 16px; cursor: pointer; }
        button:disabled { opacity: 0.5; cursor: not-allowed; }
    </style>
</head>
<body>
    <h2>LLM Streaming Demo</h2>
    <input type="text" id="query" placeholder="Ask something..." style="width:100%;padding:8px">
    <button id="btn" onclick="startStream()">Ask</button>
    <button id="stop-btn" onclick="stopStream()" disabled>Stop</button>
    <div id="output"></div>
    <div id="status"></div>

<script>
let currentSource = null;

function startStream() {
    const query = document.getElementById('query').value.trim();
    if (!query) return;

    // Clean up any existing stream
    stopStream();

    const output = document.getElementById('output');
    const status = document.getElementById('status');
    const btn = document.getElementById('btn');
    const stopBtn = document.getElementById('stop-btn');

    output.textContent = '';
    status.textContent = 'Connecting...';
    btn.disabled = true;
    stopBtn.disabled = false;

    const url = `/api/stream?q=${encodeURIComponent(query)}`;
    currentSource = new EventSource(url);

    let tokenCount = 0;
    const startTime = Date.now();

    currentSource.onmessage = function(event) {
        if (event.data === '[DONE]') {
            const elapsed = ((Date.now() - startTime) / 1000).toFixed(1);
            status.textContent = `Done — ${tokenCount} tokens in ${elapsed}s`;
            cleanup();
            return;
        }

        try {
            const parsed = JSON.parse(event.data);

            if (parsed.error) {
                status.textContent = `Error: ${parsed.error}`;
                cleanup();
                return;
            }

            if (parsed.token) {
                output.textContent += parsed.token;
                tokenCount++;
                status.textContent = `Generating... (${tokenCount} tokens)`;
                // Auto-scroll
                output.scrollTop = output.scrollHeight;
            }
        } catch (e) {
            console.error('Parse error:', e, 'Raw:', event.data);
        }
    };

    currentSource.onerror = function() {
        // cleanup() may already have closed and cleared the source
        if (!currentSource || currentSource.readyState === EventSource.CLOSED) {
            return;
        }
        status.textContent = 'Connection error. Retrying...';
        // EventSource reconnects automatically after ~3s, which re-sends the query
        // and restarts generation. If you don't want auto-retry, call cleanup() here
    };

    currentSource.onopen = function() {
        status.textContent = 'Connected, waiting for first token...';
    };
}

function stopStream() {
    cleanup();
}

function cleanup() {
    // Close the connection explicitly: otherwise EventSource treats the server
    // ending the response (even after [DONE]) as a drop and reconnects.
    if (currentSource) {
        currentSource.close();
        currentSource = null;
    }
    document.getElementById('btn').disabled = false;
    document.getElementById('stop-btn').disabled = true;
}
</script>
</body>
</html>


Nginx configuration

Add this to your Nginx server block. With proxy buffering left on, Nginx collects the upstream response in its buffers before forwarding it, so tokens arrive in large bursts or, for short answers, only once the response ends.

location /api/stream {
    proxy_pass         http://127.0.0.1:4001;
    proxy_http_version 1.1;
    proxy_set_header   Connection "";        # clear the default "Connection: close" sent upstream
    proxy_buffering    off;                  # CRITICAL for SSE
    proxy_cache        off;
    proxy_read_timeout 90s;                  # longer than your max stream duration
    proxy_set_header   X-Real-IP $remote_addr;
}


The X-Accel-Buffering: no header in the Go handler achieves the same effect when Nginx honors it, but setting proxy_buffering off in Nginx config is the belt-and-suspenders approach.

Handling errors mid-stream

This is where SSE gets subtle. Once you've started writing the response body with text/event-stream, you cannot send an HTTP 500 status — the status line is already sent. Your error handling must happen in-band via a data event:

// In the Go handler — if LLM call fails after stream starts:
errPayload, _ := json.Marshal(map[string]string{
    "error": "rate_limit_exceeded",
    "message": "Please try again in a moment.",
})
// Emit errPayload as an ordinary data event with the same sseEvent helper used
// for tokens, then return normally: the HTTP layer doesn't know an error occurred


On the client side, check every event for an error field and handle it in onmessage, not just onerror. The onerror handler fires for connection errors (network drop, server restart), not application-level errors embedded in the stream.
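
The same in-band check can be sketched in Go, for example for an integration test or a non-browser consumer. This is a minimal sketch under assumptions: streamEvent and classify are hypothetical names, and the field set simply mirrors the token, error, and message keys used in this article's payloads.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// streamEvent mirrors the JSON payloads used in this article: a token on the
// happy path, or error/message when something fails mid-stream.
type streamEvent struct {
	Token   string `json:"token,omitempty"`
	Error   string `json:"error,omitempty"`
	Message string `json:"message,omitempty"`
}

// classify reports what one data payload means: "done", "error", "token",
// or "invalid" when it is not JSON or carries neither a token nor an error.
func classify(data string) (string, streamEvent) {
	if data == "[DONE]" {
		return "done", streamEvent{}
	}
	var ev streamEvent
	if err := json.Unmarshal([]byte(data), &ev); err != nil {
		return "invalid", streamEvent{}
	}
	switch {
	case ev.Error != "":
		return "error", ev // application-level failure sent in-band
	case ev.Token != "":
		return "token", ev
	default:
		return "invalid", streamEvent{}
	}
}

func main() {
	payloads := []string{
		`{"token":"Hello"}`,
		`{"error":"rate_limit_exceeded","message":"Please try again in a moment."}`,
		`[DONE]`,
	}
	for _, p := range payloads {
		kind, ev := classify(p)
		fmt.Println(kind, ev.Token, ev.Error)
	}
}
```

Checking the error field before the token field means an error event is never mistaken for output, even if a payload happened to carry both.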

Performance notes

At 1,000 concurrent users each holding an SSE connection, you're holding 1,000 goroutines open. Go goroutines are cheap (each starts with a stack of only a few kilobytes that grows on demand), so this is fine up to tens of thousands of connections on a modest server. The bottleneck will be your LLM API rate limits, not the SSE infrastructure.

The context.WithTimeout deadline keeps a hung LLM API call from holding a goroutine indefinitely, and defer cancel() releases the upstream request as soon as the streaming code returns, including when a write fails because the client disconnected before [DONE].

This pattern — SSE in Fiber, EventSource in the browser, no-buffer Nginx config — is production-ready and requires zero additional dependencies beyond what a standard Go web API already uses.