Building AI APIs With Node.js
Nazar Boyko · 2026-06-16 · via DEV Community

Here's an endpoint that looks completely fine:

routes/chat.ts

import OpenAI from "openai";

const openai = new OpenAI();

app.post("/api/chat", async (req, res) => {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: req.body.messages,
  });
  res.json(completion.choices[0].message);
});

It compiles. It works in the demo. Your PM clicks the button, the answer shows up a few seconds later, everyone claps.

Then it ships, and the cracks show up one at a time. Users stare at a spinner for eight seconds because nothing streams. A rate-limit blip on OpenAI's side turns into a 500 on your side. Finance asks how much the feature costs per user and nobody can answer. And one day a request quietly hangs for ten minutes because that's the SDK's default timeout and nobody changed it.

None of those are AI problems. They're backend problems wearing an AI costume. The model call is the easy 10%. The other 90% is the same work you'd do wrapping any flaky, expensive, slow upstream service. You've just never had an upstream that bills you per word and answers one token at a time.

This is about that 90%: streaming the response to the browser through your own server, making retries actually trustworthy, and tracking tokens so the numbers mean something. Code's in TypeScript with the official openai SDK, but the ideas port to any runtime.

The Model Call Is An Upstream Service, Treat It Like One

Before any of the fancy stuff, internalize one thing: openai.chat.completions.create() is an HTTP call to a server you don't control. It can be slow. It can rate-limit you. It can return a 500. It can hang. Every instinct you've built wrapping payment gateways and third-party APIs applies here.

The SDK gives you two surfaces: the older chat.completions.create(), the one everybody knows, and the newer responses.create(), the Responses API. OpenAI now recommends the Responses API for new work because it was designed around streaming and tool calls from the start and gives you typed, semantic events instead of raw deltas. I'll show both where they differ, because most existing code is still on Chat Completions and you'll meet it in the wild.

Start the client once, not per request:

lib/openai.ts

import OpenAI from "openai";

// Reads OPENAI_API_KEY from the environment by default.
export const openai = new OpenAI({
  timeout: 30_000,   // 30s, not the 10-minute default — more on that below
  maxRetries: 2,     // this is also the default; being explicit documents intent
});

Two options on that constructor quietly decide how your API behaves under stress. Let's earn the right to set them by understanding what they do.

Stream The Answer, Don't Make People Wait For It

The single biggest perceived-quality win for an AI feature isn't a better model. It's streaming. A response that starts appearing in 300ms feels faster than one that lands complete in 3 seconds, even though the streamed one finishes later. You're trading total time for time-to-first-token, and humans care far more about the second number.

Under the hood, when you ask for a stream the API doesn't hand you JSON. It opens a text/event-stream and pushes server-sent events: data-only SSE frames, one small chunk at a time, until it sends a terminal marker. The SDK wraps that raw stream in an async iterable so you can just loop over it.
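
To make the wire format concrete, here is a minimal sketch of what parsing data-only SSE frames looks like once the SDK's async iterable is taken away. parseSseFrames is a hypothetical helper, not part of the SDK, and it assumes frames are separated by a blank line with plain \n line endings.

```typescript
// Hypothetical helper (not an SDK function): split a text buffer into SSE payloads.
// Assumes "\n" line endings and data-only frames separated by a blank line.
function parseSseFrames(buffer: string): { payloads: string[]; rest: string } {
  const parts = buffer.split("\n\n");
  // The last part may be a frame that has not fully arrived yet; keep it for later.
  const rest = parts.pop() ?? "";
  const payloads: string[] = [];
  for (const frame of parts) {
    const data = frame
      .split("\n")
      .filter((line) => line.startsWith("data:"))
      .map((line) => line.slice(5).replace(/^ /, "")) // drop "data:" and one optional space
      .join("\n");
    if (data !== "") payloads.push(data);
  }
  return { payloads, rest };
}
```

The caller appends each network chunk to rest and parses again, which is why an incomplete trailing frame is handed back instead of being dropped.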

With Chat Completions:

Streaming chat completions

const stream = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages,
  stream: true,
});

for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content;
  if (delta) process.stdout.write(delta);
}

The Responses API does the same thing with named events instead of you fishing through choices[0].delta:

Streaming the Responses API

const stream = await openai.responses.create({
  model: "gpt-4o-mini",
  input: question,
  stream: true,
});

for await (const event of stream) {
  if (event.type === "response.output_text.delta") {
    process.stdout.write(event.delta);
  }
}

That event.type switch is the whole pitch for the Responses API: you get response.output_text.delta for text, separate events for tool calls, and a terminal response.completed event, instead of one undifferentiated firehose you have to pattern-match by hand.

Relaying The Stream Through Your Own Server

Here's the part the tutorials skip. You almost never want the browser talking to OpenAI directly: your API key would be sitting in client code, and you'd have no place to enforce auth, rate limits, or logging. So your Node server sits in the middle: it consumes the OpenAI stream and re-emits it to the browser as its own SSE stream. A relay race, where your server is the runner in the middle who never gets to stop.

Architecture diagram: the Node API sits between the browser and OpenAI, consuming the upstream token stream and re-emitting SSE frames while enforcing auth, logging, and token accounting, never buffering the whole response.

routes/chat.ts - SSE relay

app.post("/api/chat", async (req, res) => {
  // 1. Open an SSE response to the browser.
  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
  });

  try {
    // 2. Consume the upstream stream. The call sits inside the try block because
    //    the 200 is already sent: a failure here must also be reported in-stream.
    const stream = await openai.responses.create({
      model: "gpt-4o-mini",
      input: req.body.messages,
      stream: true,
    });

    for await (const event of stream) {
      if (event.type === "response.output_text.delta") {
        // 3. Re-emit each chunk as our own SSE frame.
        res.write(`data: ${JSON.stringify({ text: event.delta })}\n\n`);
      }
    }
    res.write("data: [DONE]\n\n");
  } catch (err) {
    res.write(`data: ${JSON.stringify({ error: "stream_failed" })}\n\n`);
  } finally {
    res.end();
  }
});

Three things in there are non-negotiable in production and easy to forget:

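The client-disconnect case. If the user closes the tab halfway through a long answer, your for await loop keeps pulling tokens from OpenAI, and you keep paying for them. Listen for the connection closing and abort the upstream request (the SDK supports an AbortController via the signal option) so a bored user doesn't run up your bill. req.on("close", ...) is the classic hook, but on Node 16 and later that event can fire as soon as the request has been fully read, so res.on("close", ...) is the safer place to listen.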

The error mid-stream case. Once you've sent 200 OK and started writing frames, you can't suddenly send a 500: the headers are already gone. So errors that happen after the first byte have to be communicated inside the stream, as a data: frame your client knows how to interpret. That catch block isn't optional politeness; it's the only way to tell the browser something broke.

The flush case. Behind a reverse proxy or some compression middleware, your tiny SSE frames can get buffered until they're "worth" sending, which defeats the entire point of streaming. Disable compression on this route and make sure nothing between you and the user is holding chunks hostage.
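
The disconnect case is the easiest of the three to get wrong, so here is a minimal sketch of the wiring. It uses a fake upstream so it runs on its own: fakeUpstream, relay, and onClientClose are hypothetical names. In a real handler the signal would go to the SDK call as a request option and onClientClose would register a close listener on the response.

```typescript
// Hypothetical stand-in for an upstream token stream that honors an AbortSignal.
async function* fakeUpstream(tokens: string[], signal: AbortSignal): AsyncGenerator<string> {
  for (const token of tokens) {
    if (signal.aborted) throw new Error("aborted");
    await new Promise((resolve) => setTimeout(resolve, 1));
    yield token;
  }
}

// Relay tokens to `write` until the stream ends or the client goes away.
// Returns how many tokens were actually relayed.
async function relay(
  tokens: string[],
  write: (frame: string) => void,
  onClientClose: (cb: () => void) => void,
): Promise<number> {
  const controller = new AbortController();
  onClientClose(() => controller.abort()); // stop pulling tokens nobody will read
  let relayed = 0;
  try {
    for await (const token of fakeUpstream(tokens, controller.signal)) {
      write(`data: ${JSON.stringify({ text: token })}\n\n`);
      relayed++;
    }
  } catch {
    // After an abort the client is gone, so there is nobody left to notify.
    if (!controller.signal.aborted) {
      write(`data: ${JSON.stringify({ error: "stream_failed" })}\n\n`);
    }
  }
  return relayed;
}
```

The point of the sketch is the ordering: the abort is registered before the loop starts, and the error frame is skipped when the failure was caused by our own abort.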

Retries: The SDK Already Does More Than You Think (And Less)

Now the unglamorous reliability work. Good news first: the SDK retries for you. By default it retries failed requests 2 times, with a short exponential backoff, on exactly the errors that are worth retrying: connection errors, 408 Request Timeout, 409 Conflict, 429 Rate Limit, and any 5xx. It reads the Retry-After header when one's present instead of guessing. You can tune or kill that behavior:

Tuning retries

// Globally, on the client:
const openai = new OpenAI({ maxRetries: 3 });

// Or per request, when one call deserves different treatment:
await openai.responses.create(
  { model: "gpt-4o-mini", input },
  { maxRetries: 5 },
);

That covers transient failures better than the hand-rolled try/catch most people would write. So where's the catch?

The catch is that retries and streaming don't mix the way you'd hope. The automatic retry only covers a request that fails before the response starts: a connection error, a timeout, or a retryable status code. Once a stream has started flowing and dies in the middle, the SDK can't transparently retry it, because it would have to replay the half-delivered response. If resilience mid-stream matters to you, you own that: catch the error, and either restart the whole generation or accept the partial answer. There's no free lunch on a connection that's already talking.
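
If you pick "restart the whole generation", the shape is roughly the following. This is a sketch under assumptions, not something the SDK provides: streamWithRestart is a hypothetical helper, and startStream is an assumed factory that opens a fresh stream on every call.

```typescript
// Hypothetical helper: consume a stream of text deltas, restarting the whole
// generation from scratch if the stream dies midway.
async function streamWithRestart(
  startStream: () => AsyncIterable<string>,
  maxRestarts: number,
): Promise<{ text: string; restarts: number }> {
  let restarts = 0;
  for (;;) {
    let text = ""; // partial output from a failed attempt is discarded
    try {
      for await (const delta of startStream()) text += delta;
      return { text, restarts };
    } catch (err) {
      if (restarts >= maxRestarts) throw err; // give up and surface the error
      restarts++;
    }
  }
}
```

Note that this buffers the answer, which is fine for a background job but awkward for a live relay. If the browser has already rendered half a response, a restart also needs some way to tell the client to throw that half away.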

The second catch is subtler and more dangerous. Retries are only safe on idempotent operations, and an LLM call usually isn't one, especially once it can call tools. If your model invocation triggers a tool that charges a card or sends an email, an automatic retry on a 409 or a timeout can fire that side effect twice. The request "failed" from the SDK's point of view, but the tool already ran. And the SDK won't save you here: it doesn't send an idempotency key automatically. There's an optional idempotencyKey request option, but you have to set it yourself, so nothing dedupes your retries unless you wire it up. The rule from regular backend work holds exactly: make the side effects idempotent, or don't let them auto-retry. Streaming and tools make this easy to forget; the bill and the duplicate emails will remind you.
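
"Make the side effects idempotent" can be as small as a keyed lookup in front of the tool. Here is a minimal sketch, assuming you can derive a stable key per tool invocation (for example from the tool call's id). runToolOnce and the in-memory Map are hypothetical; a real system would want a durable store.

```typescript
// Hypothetical in-memory dedupe for tool side effects, keyed by an idempotency key.
// A Map only survives one process; this is a sketch, not a production store.
const toolResults = new Map<string, Promise<unknown>>();

async function runToolOnce<T>(key: string, effect: () => Promise<T>): Promise<T> {
  const existing = toolResults.get(key);
  if (existing) return existing as Promise<T>; // same key: reuse the earlier result
  const pending = effect();
  toolResults.set(key, pending); // registered before awaiting, so in-flight calls dedupe too
  try {
    return await pending;
  } catch (err) {
    toolResults.delete(key); // a failed effect may be attempted again
    throw err;
  }
}
```

With this in front of the email or payment call, a retried request that replays the same tool invocation gets the first result back instead of a second side effect.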

Warning
The SDK's default request timeout is 10 minutes (600,000 ms). That's a sane default for a batch job and a terrible one for a user-facing endpoint. A wedged request will hold a connection, a worker slot, and the user's patience for ten full minutes before giving up. Set timeout to something humane (20-60s for interactive calls) on day one. When a request does time out, the SDK throws APIConnectionTimeoutError and, yes, retries it twice by default.
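
Because a timed-out attempt gets retried, the worst case for one call is roughly (maxRetries + 1) times the timeout, plus backoff. If you want a hard ceiling on the whole thing, one option is an overall deadline around the operation. A minimal sketch, assuming the wrapped operation honors an AbortSignal; withDeadline is a hypothetical helper:

```typescript
// Hypothetical wrapper: give any abortable operation one overall deadline,
// regardless of how many attempts happen inside it.
async function withDeadline<T>(
  ms: number,
  operation: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    // The operation is expected to reject once the signal aborts.
    return await operation(controller.signal);
  } finally {
    clearTimeout(timer); // never leave the timer running after we finish
  }
}
```

The wrapper only works if the operation actually listens to the signal, which is the same signal option mentioned above for client disconnects.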

Token Tracking: The Number That's Null Until It Isn't

Every call costs money measured in tokens, split into input (your prompt) and output (the model's answer). If you want per-user cost, per-feature cost, or just an alert before someone's runaway loop spends your quarterly budget, you have to capture token usage on every call. This is where streaming sets a trap.

On a normal, non-streamed call, usage is right there in the response:

Usage on a non-streamed call

const res = await openai.responses.create({ model: "gpt-4o-mini", input });
console.log(res.usage);
// { input_tokens, output_tokens, total_tokens }

Easy. Now stream the same call and reach for chunk.usage, and you'll find it's null. On every chunk. The counterintuitive bit that bites everyone exactly once: when you stream Chat Completions, usage isn't reported by default at all, and even when you turn it on, it lives only on a final extra chunk sent after the content is done, a chunk whose choices array is empty. You have to opt in:

Getting usage out of a Chat Completions stream

const stream = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages,
  stream: true,
  stream_options: { include_usage: true }, // <- the line everyone forgets
});

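let usage;
for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content;
  if (delta) res.write(`data: ${JSON.stringify({ text: delta })}\n\n`);

  // usage is null on every chunk EXCEPT the final one.
  if (chunk.usage) usage = chunk.usage;
}

// Now you can log it. Chat Completions names the fields prompt_tokens and
// completion_tokens, so map them to the input/output shape recordUsage expects.
// usage stays undefined if the stream died before the final chunk.
if (usage) {
  await recordUsage({
    userId,
    model: "gpt-4o-mini",
    usage: {
      input_tokens: usage.prompt_tokens,
      output_tokens: usage.completion_tokens,
      total_tokens: usage.total_tokens,
    },
  });
}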

Timeline of streamed chunks: usage reads null on every content chunk until a final chunk carries input, output, and total token counts alongside an empty choices array, shown only with stream_options.include_usage.

The Responses API is friendlier here: its terminal response.completed event carries the finished response object, usage block included. But the underlying truth is the same: usage is a property of the completed generation, not of any individual token, so it can't show up until the stream is over. Once you've got that mental model, the nulls stop being surprising.
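
For the Responses API side, here is a minimal sketch of pulling usage off the terminal event. The event types are simplified local declarations modeled on the event names used in this post, not the SDK's own types, and collectUsage is a hypothetical helper.

```typescript
// Simplified local shapes (assumptions), covering only the two events used here.
type ResponseUsage = { input_tokens: number; output_tokens: number; total_tokens: number };
type ResponseEvent =
  | { type: "response.output_text.delta"; delta: string }
  | { type: "response.completed"; response: { usage?: ResponseUsage } };

// Forward text deltas as they arrive and return usage once the stream completes.
async function collectUsage(
  events: AsyncIterable<ResponseEvent>,
  onText: (delta: string) => void,
): Promise<ResponseUsage | undefined> {
  let usage: ResponseUsage | undefined;
  for await (const event of events) {
    if (event.type === "response.output_text.delta") {
      onText(event.delta);
    } else if (event.type === "response.completed") {
      usage = event.response.usage;
    }
  }
  return usage; // undefined if the stream ended before response.completed
}
```

The undefined case matters for accounting: a stream that dies early still cost something, so you may want to log it as "usage unknown" and not skip the record entirely.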

What you do with the numbers is the actual feature:

lib/usage.ts - turning tokens into money and limits

// Prices are per 1M tokens and change often — keep them in config, not code.
const PRICING = {
  "gpt-4o-mini": { input: 0.15, output: 0.60 }, // example shape, not live rates
};

export async function recordUsage({ userId, model, usage }) {
  const p = PRICING[model];
  const cost =
    (usage.input_tokens / 1_000_000) * p.input +
    (usage.output_tokens / 1_000_000) * p.output;

  await db.usage.insert({ userId, model, ...usage, cost, at: new Date() });

  // Cheap guardrail: stop a runaway user before the invoice does.
  const spentToday = await db.usage.sumCostSince(userId, startOfDay());
  if (spentToday > DAILY_LIMIT) {
    throw new SpendLimitError(userId);
  }
}

Don't hardcode prices in the middle of business logic: they change, and you don't want a deploy every time they do. And log the raw token counts, not just the dollar figure: when you switch models or renegotiate pricing, you'll want to re-cost history, and you can only do that if you kept the tokens.
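
Re-costing history is just the same arithmetic run again with a different rate card. A minimal sketch, with made-up rows and the same example rates as above (not live prices); costUsd and recost are hypothetical helpers:

```typescript
type TokenCounts = { input_tokens: number; output_tokens: number };
type Rates = { input: number; output: number }; // USD per 1M tokens

// Cost of one call: tokens / 1M * rate, summed over input and output.
function costUsd(tokens: TokenCounts, rates: Rates): number {
  return (
    (tokens.input_tokens / 1_000_000) * rates.input +
    (tokens.output_tokens / 1_000_000) * rates.output
  );
}

// Re-cost stored rows under any rate card; only possible because raw tokens were kept.
function recost(rows: TokenCounts[], rates: Rates): number {
  return rows.reduce((sum, row) => sum + costUsd(row, rates), 0);
}
```

As a worked example, a call with 1,200 input tokens and 350 output tokens at 0.15 and 0.60 per 1M comes to 0.00018 + 0.00021, about $0.00039. Floating-point sums drift slightly, so compare with a tolerance or store integer micro-dollars if exact totals matter.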

Putting It Together

Strip away the AI and what's left is a checklist you already know how to read. Treat the model as a flaky upstream and give it a real timeout. Stream through your own server so you control auth, logging, and the bill, then handle disconnects and mid-stream errors, because those will happen. Lean on the SDK's built-in retries, but remember they stop at the first byte and that retrying a tool-calling request can double a side effect. Capture token usage on every call, knowing it only shows up at the end of a stream and only if you ask for it.

The demo endpoint at the top of this post isn't wrong, exactly. It's just unfinished. It's the 10%. The gap between that and something you'd put your name on is ordinary, careful backend engineering. The model is the new part. Everything that makes it survive contact with real users is work you've done a hundred times before.


Originally published at nazarboyko.com.