Engineering Resilience: Two Lessons from Building Under Pressure
Afeh · 2026-06-13 · via DEV Community

A reflection on performance optimization at scale and building reliability mechanisms, the two tasks that defined my internship.


Every engineering internship has its share of "aha" moments: those late-night debugging sessions where a breakthrough finally clicks, or the PR that takes seven commits to get right. As I wrap up my time as an intern with HNG, I want to write about two tasks that stuck with me. Not because they were the hardest, but because they taught me something real about building systems that have to work.

One was an individual task: optimizing a demographic intelligence API to handle millions of records with sub-second query times. The other was a team effort: building reliability mechanisms into an AI-powered interview platform so that when things break (and they will), the system degrades gracefully instead of falling apart.

Let's walk through both.


Part I: Working on Insighta (Individual Task — Stage 4B)

What it was

Insighta IQ is a demographic intelligence API: think "find me Nigerian females aged 20 to 45." I nicknamed it Stereo API (Stereo being short for stereotyping, of course). Users query a PostgreSQL database of millions of demographic profiles through a FastAPI backend, via both CLI and web clients.

Stage 4 asked us to make it hold up under some demanding assumptions:

  • Data growth: from millions toward tens of millions of profiles
  • Traffic: hundreds to low thousands of queries per minute, multiple teams using it daily
  • Performance targets: P50 latency under 500ms, P95 under 2 seconds
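To make the latency targets concrete, here is a small sketch of how P50 and P95 could be checked from a list of measured latencies. It uses the nearest-rank percentile definition. The function names and the sample numbers are illustrative assumptions, not part of the Insighta codebase.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_targets(
    latencies_ms: Sequence[float],
    p50_target_ms: float = 500.0,    # target from the list above
    p95_target_ms: float = 2000.0,   # target from the list above
) -> bool:
    """True when both the median and the 95th percentile are under target."""
    return (
        percentile(latencies_ms, 50) < p50_target_ms
        and percentile(latencies_ms, 95) < p95_target_ms
    )
```

P95 is the number to watch here: a handful of slow queries can leave the median untouched while still breaking the 2-second target.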

The task had three parts: query optimization, query normalization, and large-scale CSV ingestion. The CSV ingestion piece was the one that kept me up at night.

The problem

Users needed to upload CSV files with up to 500,000 rows of profile data. These weren't trivial constraints:

  • You cannot insert rows one by one (500k individual inserts = death by a thousand round trips)
  • You cannot load the entire file into memory (500k rows of profile data can easily exceed available RAM)
  • Uploads must not block ongoing query traffic (the system can't go dark during ingestion)
  • The system must handle concurrent uploads
  • A single bad row must never fail the entire upload

On top of that, the database is hosted remotely, so every query, including every INSERT, incurs network latency. And we were already under read pressure from the query workload.

How I approached it

I broke the problem into layers:

  1. Streaming, not loading: Read the file in 256KB chunks, never hold the whole thing in memory
  2. Batch processing, not row-by-row: Validate rows, accumulate 10,000 valid ones, then bulk-insert
  3. Dedicated thread pool: Run the CPU+DB-bound CSV work off the event loop so the API stays responsive
  4. Off-loading the upload to a background task: Return a success response right away so the user knows the upload is being processed in the background, rather than making them wait for the whole ingestion to finish
  5. Partial failure tolerance: Each batch commits independently; failures don't roll back valid work
  6. Comprehensive skip reporting: Return a structured summary of exactly what went wrong and why
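Points 3 and 4 can be sketched with nothing but the standard library. This is a minimal illustration, assuming a blocking ingestion function; the names INGEST_POOL, ingest_csv_blocking, and accept_upload are hypothetical, and the real service would wire this into FastAPI rather than a bare asyncio loop.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Tuple

# Dedicated pool so ingestion never competes with the default executor.
# The pool size of 2 is an assumption for illustration.
INGEST_POOL = ThreadPoolExecutor(max_workers=2, thread_name_prefix="csv-ingest")


def ingest_csv_blocking(path: str) -> Dict[str, object]:
    """Stand-in for the real parse + validate + insert work (blocking)."""
    time.sleep(0.05)  # simulates CPU and DB time
    return {"path": path, "status": "done"}


async def accept_upload(path: str) -> Tuple[Dict[str, str], "asyncio.Future"]:
    """Schedule ingestion on the pool and answer the caller immediately."""
    loop = asyncio.get_running_loop()
    job = loop.run_in_executor(INGEST_POOL, ingest_csv_blocking, path)
    return {"status": "accepted"}, job


async def demo() -> Tuple[Dict[str, str], int, Dict[str, object]]:
    response, job = await accept_upload("profiles.csv")
    ticks = 0
    # The event loop keeps running other work while the job is in flight.
    while not job.done():
        await asyncio.sleep(0.005)
        ticks += 1
    return response, ticks, await job
```

The caller gets the "accepted" response before any row is parsed, and the loop keeps ticking while the worker thread is busy, which is the property that keeps query traffic responsive during an upload.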

Here's what the ingestion pipeline looked like:

File upload → Chunked read (256KB) → Decode (UTF-8 → Latin-1 fallback)
  → CSV DictReader (streaming) → Validate row → Batch (10k rows)
    → Deduplicate in-batch → Single DB query for existing names
      → Bulk INSERT (ON CONFLICT DO NOTHING) → Report summary
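The first half of that pipeline (chunked read, decode, streaming DictReader, batching) might look roughly like the sketch below. It is an approximation under stated assumptions: it decodes UTF-8 only and leaves out the Latin-1 fallback and the validation step, and the helper names iter_lines and iter_batches are hypothetical.

```python
import codecs
import csv
import io
from typing import BinaryIO, Dict, Iterator, List

CHUNK_SIZE = 256 * 1024  # bytes per read, as described in the text
BATCH_SIZE = 10_000      # rows per batch, as described in the text


def iter_lines(stream: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[str]:
    """Yield decoded lines while holding at most one chunk plus one partial line."""
    # The incremental decoder copes with multi-byte characters split across chunks.
    decoder = codecs.getincrementaldecoder("utf-8")()
    pending = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += decoder.decode(chunk)
        *complete, pending = pending.split("\n")
        for line in complete:
            yield line + "\n"
    pending += decoder.decode(b"", final=True)
    if pending:
        yield pending  # last line without a trailing newline


def iter_batches(
    stream: BinaryIO,
    batch_size: int = BATCH_SIZE,
    chunk_size: int = CHUNK_SIZE,
) -> Iterator[List[Dict[str, str]]]:
    """Group streamed CSV rows into lists of at most batch_size rows."""
    reader = csv.DictReader(iter_lines(stream, chunk_size))
    batch: List[Dict[str, str]] = []
    for row in reader:
        batch.append(row)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


if __name__ == "__main__":
    sample = io.BytesIO(b"name,age\nada,30\nbob,41\ncy,22\n")
    for rows in iter_batches(sample, batch_size=2):
        print(len(rows), rows)
```

Because both functions are generators, memory use is bounded by the chunk size and the batch size rather than by the size of the uploaded file.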

What broke and how I fixed it

The first approach was naive. I read the entire CSV into memory, parsed every row, and then tried to insert everything at once. For a 100KB test file, it worked beautifully. For the 500,000-row file? Memory exploded.

Fix: I switched to streaming with csv.DictReader over a StringIO buffer. But I still needed to validate rows against each other (duplicate names within the same upload) and against the database. That meant holding state across batches of 10,000 rows.

The second problem was intra-batch deduplication. Without it, two rows with the same name in the same upload would both pass validation, only to conflict during insert. The fix was global_seen: Set[str], a set of all names already inserted in this upload session, passed between batch flushes.
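A rough sketch of that idea, using an in-memory SQLite database as a stand-in for PostgreSQL so the example runs anywhere. The profiles table, its columns, and the flush_batch helper are assumptions for illustration. Unlike the pipeline described above, this version skips the pre-query for existing names and relies on ON CONFLICT DO NOTHING alone.

```python
import sqlite3
from typing import Dict, List, Set, Tuple


def flush_batch(
    conn: sqlite3.Connection,
    batch: List[Dict[str, object]],
    global_seen: Set[str],
) -> Tuple[int, int]:
    """Insert one batch; return (inserted, skipped) counts for the summary."""
    fresh = []
    skipped = 0
    for row in batch:
        name = str(row["name"])
        if name in global_seen:  # duplicate within this upload session
            skipped += 1
            continue
        global_seen.add(name)
        fresh.append((name, row["age"]))

    before = conn.total_changes
    with conn:  # each batch commits on its own
        conn.executemany(
            "INSERT INTO profiles (name, age) VALUES (?, ?) "
            "ON CONFLICT(name) DO NOTHING",
            fresh,
        )
    inserted = conn.total_changes - before
    skipped += len(fresh) - inserted  # rows that already existed in the table
    return inserted, skipped
```

Passing the same global_seen set into every flush is what carries the deduplication state across batch boundaries, while the per-batch commit keeps a later failure from rolling back rows that were already accepted.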

The third problem was edge cases in CSV data I never anticipated:

  • Rows with extra commas producing wrong column counts
  • UTF-16 encoded files that failed UTF-8 decoding
  • Age values like "twenty-five" instead of "25"
  • Country codes in varying cases ("ng", "NG", "Ng")

Each edge case got its own validator function. The _parse_row function became a gauntlet of pure functions — no database calls, just deterministic validation. If a row failed any check, it was counted and skipped, but processing continued.
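As an illustration of that gauntlet, here is a minimal sketch of pure validators feeding a skip summary. The column names, the skip-reason strings, the 0 to 120 age range, and the helper names other than the general shape of parse_row are assumptions; the real _parse_row is not shown in this post.

```python
from collections import Counter
from typing import Dict, Iterable, List, Mapping, Optional, Tuple


def parse_age(raw: str) -> Optional[int]:
    """Accept plain digits only, so "twenty-five" is rejected."""
    text = raw.strip()
    if not (text.isascii() and text.isdigit()):
        return None
    age = int(text)
    return age if 0 <= age <= 120 else None  # range is an assumption


def parse_country(raw: str) -> Optional[str]:
    """Normalize "ng", "Ng", "NG" to "NG"; reject anything else."""
    code = raw.strip().upper()
    return code if len(code) == 2 and code.isascii() and code.isalpha() else None


def parse_row(row: Mapping) -> Tuple[Optional[Dict[str, object]], Optional[str]]:
    """Return (clean_row, None) or (None, skip_reason). No I/O, no DB calls."""
    # csv.DictReader stores surplus fields under the key None
    # and fills missing fields with the value None.
    if None in row or any(value is None for value in row.values()):
        return None, "wrong_column_count"
    name = str(row.get("name", "")).strip()
    if not name:
        return None, "missing_name"
    age = parse_age(str(row.get("age", "")))
    if age is None:
        return None, "invalid_age"
    country = parse_country(str(row.get("country_id", "")))
    if country is None:
        return None, "invalid_country"
    return {"name": name, "age": age, "country_id": country}, None


def validate_rows(rows: Iterable[Mapping]) -> Tuple[List[Dict[str, object]], Dict[str, int]]:
    """Split rows into valid ones and a count of skips per reason."""
    valid: List[Dict[str, object]] = []
    skipped: Counter = Counter()
    for row in rows:
        clean, reason = parse_row(row)
        if clean is None:
            skipped[reason] += 1
        else:
            valid.append(clean)
    return valid, dict(skipped)
```

Since every validator is a pure function, each edge case can be unit-tested in isolation, and the per-reason counts are the raw material for the structured skip report.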

What I took away

Batch everything. Validate early. Never trust user input.

The batch size of 10,000 wasn't arbitrary. Below 5,000, the overhead of DB round trips dominated. Above 20,000, memory pressure started climbing without significant throughput gains. 10,000 was the sweet spot.

The pattern that emerged (stream, validate, batch, insert, report) is something I now see everywhere: ETL pipelines, message queues, log processors. It's a universal pattern for handling lots of data with limited resources.

The caching and query optimization work (indexes on gender, age, country_id, composite indexes, connection pooling, result caching with 5-minute TTL) brought the query side in line with our P50/P95 targets. But the CSV ingestion piece — that's what I remember most vividly, because it forced me to think about resource constraints, failure modes, and graceful degradation all at once.
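For the result-caching piece, a minimal in-process sketch of a 5-minute TTL cache keyed on normalized filters could look like this. The TTLCache class and the cache_key normalization rule are assumptions for illustration; the post does not say how the real cache or the query normalization is implemented.

```python
import time
from typing import Any, Callable, Dict, Hashable, Mapping, Tuple


def cache_key(filters: Mapping[str, Any]) -> Tuple[Tuple[str, str], ...]:
    """Normalize filters so equivalent queries share one cache entry."""
    return tuple(
        sorted((key.lower(), str(value).strip().lower()) for key, value in filters.items())
    )


class TTLCache:
    """Tiny cache whose entries expire ttl_seconds after they are stored."""

    def __init__(
        self,
        ttl_seconds: float = 300.0,  # 5 minutes, as in the text
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]          # still fresh: skip the database
        value = compute()          # expired or missing: run the query
        self._entries[key] = (now + self._ttl, value)
        return value
```

Normalizing before building the key is what lets {"gender": "Female"} and {"gender": "female"} hit the same entry instead of running the same query twice.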

Why I picked this

The CSV ingestion task wasn't just about writing code. It was about engineering for constraints: memory, concurrency, consistency, partial failure. Every decision (batch size, decoding fallback strategy, ON CONFLICT DO NOTHING vs pre-query) was a trade-off. I had to justify each one. That process of articulating why a decision is right, not just that it works, is what I think separates engineering from coding.


Part II: Building Reliability Into an AI Interview Platform (Team Task)

What it was

MeetMind is an AI-powered interview platform. Candidates join live audio interviews with an AI interviewer, and the system records transcripts, generates assessments, sends emails, and provides a chat interface for recruiters to query interview data. I really had a lot of fun building this with my teammates (though I nearly lost my mind a couple of times).

The catch? It depends on several third-party APIs:

  • Gemini (Google) for generating interview questions, assessments, and extracting resume information
  • Resend for transactional email delivery (welcome emails, password resets, interview invites)
  • LiveKit for real-time audio/video sessions

Any of these could fail at any moment from network blips, rate limits, service outages, or transient errors. Before this task, a failed API call meant a 500 error and a frustrated user.

My PR to address that issue added two reliability mechanisms:

  1. Retry with exponential backoff — a centralized retry_async utility that wraps any async function with automatic retries, logging, and backoff
  2. Transcript fallback — when individual transcript turns are missing in the database, fall back to the session's stored transcript JSON

The problem

Two concrete failure scenarios:

Scenario A: A candidate finishes an interview, and the system fires off a background task to generate an assessment summary via Gemini. Halfway through, Gemini returns a 429 (rate limit). The assessment fails. The summary is stuck in "generating" status. The recruiter sees a blank report.

Scenario B: Real-time interview transcript turns are stored one by one as the AI interviewer and candidate speak. If the persistence pipeline drops a few turns — or fails entirely — the transcript is incomplete. The recruiter can't review the interview. The AI can't generate an assessment.

Both scenarios had the same root cause: no mechanism for transient failure recovery.

How I approached it

For the retry mechanism, I wanted something that:

  • Works with any async function (generic, reusable)
  • Uses exponential backoff (don't hammer the failing service)
  • Logs warnings on intermediate failures, errors on exhaustion
  • Accepts configurable retry count, delay, backoff factor, and exception types
  • Integrates with existing code without massive refactoring

The signature was clean:

async def retry_async(
    func: Callable[..., Awaitable[T]],
    *args,
    max_retries: int = 3,
    initial_delay: float = 2.0,
    backoff_factor: float = 2.0,
    exceptions: tuple[type[BaseException], ...] = (Exception,),
    task_name: str = "Task",
    **kwargs,
) -> T:
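The post only shows the signature, so the body below is a sketch of how such a utility could be written, not the team's actual implementation. It assumes max_retries counts total attempts (which matches an "Attempt 1/3" style log) and guesses at the log message format.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


async def retry_async(
    func: Callable[..., Awaitable[T]],
    *args,
    max_retries: int = 3,
    initial_delay: float = 2.0,
    backoff_factor: float = 2.0,
    exceptions: Tuple[Type[BaseException], ...] = (Exception,),
    task_name: str = "Task",
    **kwargs,
) -> T:
    """Await func(*args, **kwargs), retrying on the given exception types."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    delay = initial_delay
    attempt = 1
    while True:
        try:
            return await func(*args, **kwargs)
        except exceptions as exc:
            if attempt >= max_retries:
                # Out of attempts: log at error level and re-raise.
                logger.error("%s failed after %d attempts: %s", task_name, attempt, exc)
                raise
            logger.warning(
                "Attempt %d/%d failed for %s: %s. Retrying in %.2fs...",
                attempt, max_retries, task_name, exc, delay,
            )
        await asyncio.sleep(delay)  # wait first...
        delay *= backoff_factor     # ...then grow the delay for the next round
        attempt += 1
```

A call might then look like `await retry_async(send_email, message, exceptions=(ConnectionError,), task_name="Send invite")`, where send_email is a hypothetical coroutine. Exceptions outside the tuple propagate on the first attempt without any retry.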

For the transcript fallback, the pattern was: try the primary data source first, and if unavailable, reconstruct from the session's transcript_json field. The fallback had to be transparent to callers — get_chat_history, get_transcript, and get_transcript_export all work the same way regardless of which data source backs them.

What broke and how I fixed it

The first bug was an off-by-one in the backoff calculation. I had delay *= backoff_factor happening before the sleep, so the first retry waited 2x the initial delay instead of the initial delay itself. It felt trivial, but it meant retries took longer than necessary: with a 1s initial delay, for example, the waits were 2s, 4s, 8s instead of 1s, 2s, 4s.

Fix: Moved the delay *= backoff_factor to after the sleep.
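The effect of that ordering is easy to see by computing the sleep schedule both ways. The helper below is purely illustrative (backoff_delays is a hypothetical name) and does no sleeping; it only lists the delays each variant would use.

```python
from typing import List


def backoff_delays(
    initial_delay: float,
    backoff_factor: float,
    retries: int,
    multiply_before_sleep: bool,
) -> List[float]:
    """Return the sequence of sleep durations for the given ordering."""
    delays = []
    delay = initial_delay
    for _ in range(retries):
        if multiply_before_sleep:
            delay *= backoff_factor  # buggy order: grow first, then sleep
            delays.append(delay)
        else:
            delays.append(delay)     # fixed order: sleep first, then grow
            delay *= backoff_factor
    return delays


print(backoff_delays(1.0, 2.0, 3, multiply_before_sleep=True))   # [2.0, 4.0, 8.0]
print(backoff_delays(1.0, 2.0, 3, multiply_before_sleep=False))  # [1.0, 2.0, 4.0]
```

Over three retries the buggy order waits 14 seconds in total versus 7 for the fixed one, so the cost of the mistake doubles the time a user spends waiting on a transient failure.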

The second issue was exception type granularity. Initially, the retry caught all Exception subclasses. But some exceptions shouldn't be retried — like ValueError from bad user input, or KeyError from a missing dictionary key. A retry won't fix a programming error.

Fix: Made the exceptions tuple a parameter so callers can specify which exceptions are retryable. For Gemini calls, we retry (Exception,) since the API can fail for many transient reasons. For email delivery, we retry the Resend-specific exception.

The third issue was the transcript fallback format mismatch. The session's transcript_json stored timestamps as Unix seconds, but the transcript endpoints expected them formatted as "HH:MM:SS" elapsed time. The fallback function needed to reproduce the same relative timestamp calculation that the primary path used.

Fix: Created _format_elapsed_timestamp as a shared utility and used it in both the primary and fallback paths. The fallback also had to generate deterministic UUIDs for each fallback turn (using uuid.uuid5 with a namespace) since there were no database IDs available.

What I took away

Graceful degradation > perfect failure. The goal isn't to never fail — it's to fail in a way that doesn't cascade into user-facing errors. The retry mechanism handles transient failures silently. The fallback mechanism ensures that even if the real-time pipeline drops data, users can still review interview transcripts. The user never needs to know something went wrong.

I also learned that reliability is visible in the logs. After the retry mechanism was deployed, we stopped seeing "Assessment generation failed" errors in Sentry. Instead, we saw "Attempt 1/3 failed for Generate assessment... Retrying in 2.00s..." — which is an entirely different class of log. It means the system handled the failure rather than succumbing to it.

Why I picked this

This task taught me that reliability isn't a feature you add at the end; it's a design philosophy that affects how you structure every external dependency call. The retry_async utility now wraps every Gemini generation, every email send, every document embedding. It's invisible infrastructure. But it's the difference between a system that occasionally returns 500s and one that absorbs transient failures and moves on. It is important to build fault-tolerant systems.

And the transcript fallback taught me something subtler: data can come from multiple sources, and that's okay. The real-time turns are the ideal data source. But the session JSON is always there as a safety net. Engineering for multiple data paths — with clear fallback semantics — makes the system more robust without adding complexity to the API surface.


Bringing It Together

Looking back, both tasks taught me the same lesson from different angles:

Design for failure, optimize for reality.

The CSV ingestion system assumes every row could be bad and handles it gracefully without stopping. The retry mechanism assumes every API call could fail and recovers transparently. The transcript fallback assumes the primary data path could be incomplete and provides an alternative.

Systems that work under pressure aren't the ones that never break. They're the ones that break gracefully, recover quickly, and leave useful evidence behind when they can't.

That's what I'll carry forward from this internship.