惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

N
News and Events Feed by Topic
Malwarebytes
Malwarebytes
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
C
Cybersecurity and Infrastructure Security Agency CISA
F
Future of Privacy Forum
C
Cisco Blogs
T
The Exploit Database - CXSecurity.com
A
Arctic Wolf
S
Securelist
K
Kaspersky official blog
S
Schneier on Security
T
ThreatConnect
T
Tenable Blog
Spread Privacy
Spread Privacy
T
True Tiger Recordings
AWS News Blog
AWS News Blog
F
Fox-IT International blog
量子位
T
Threatpost
V
Vulnerabilities – Threatpost
C
CERT Recently Published Vulnerability Notes
Cisco Talos Blog
Cisco Talos Blog
GbyAI
GbyAI
宝玉的分享
宝玉的分享
腾讯CDC
G
Google Developers Blog
aimingoo的专栏
aimingoo的专栏
Cyberwarzone
Cyberwarzone
有赞技术团队
有赞技术团队
S
SegmentFault 最新的问题
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
V
Visual Studio Blog
U
Unit 42
雷峰网
雷峰网
cs.CV updates on arXiv.org
cs.CV updates on arXiv.org
Simon Willison's Weblog
Simon Willison's Weblog
O
OpenAI News
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
The GitHub Blog
The GitHub Blog
The Register - Security
The Register - Security
MyScale Blog
MyScale Blog
小众软件
小众软件
A
About on SuperTechFans
Last Week in AI
Last Week in AI
Y
Y Combinator Blog
博客园 - 三生石上(FineUI控件)
美团技术团队
Google Online Security Blog
Google Online Security Blog
P
Proofpoint News Feed
MongoDB | Blog
MongoDB | Blog

DEV Community

Build a Secure API with Rails 8 - Part-3: Auth Controllers Google I/O 2026’s Smartest Developer Release Wasn’t a Model, It Was the Runtime - Managed Agents in Gemini API OSS Monthly Recap: What My Daily Commit Challenge Taught Me About Open Source “Culture” GemmaNotes Cognitive Debt: AI Is Building Your Systems. Do You Actually Understand Them? GeekNews Frontend Weekly Deep Dive - 2026-05-25 I Built a Universal Silicon Loader That Runs on Any SOC (No Bootrom Exploit) Docker容器化部署Node.js应用最佳实践 I Put a Neural Network in a Thermometer — Then It Got Out of Hand Building MGZon: Developer Portfolio + AI Bot + Social Network (9 min demo) Bearing Life (L10): What the Catalog Number Really Tells You Longhorn Volume Health: The Gap Between 'Healthy' and Actually Working Stop Prompting. Start Specifying: How Spec-Driven Development Fixes AI Coding TIL a PowerPoint file is just a zip — so I converted .pptx to Word entirely in the browser 로컬 LLM 셋업 가이드 (v18) Cx Dev Log — 2026-04-24 github's agent audit api is the boring feature that matters # From Teaching Code to Building Real-World Applications Vivado 2026.1 and Linux: why this decision matters beyond the headline Vivado 2026.1 y Linux: por qué la decisión importa más allá del titular ORA-00206 오류 원인과 해결 방법 완벽 가이드 Entidades finas e composição: o design que escolhi para a nova plataforma 10 Open Source Tools Every Developer Should Know 🔥 SSH Config File Mastery: Turning `~/.ssh/config` Into a Productivity Tool I tried to create a programming language... in python I Replaced 70MB Node.js Log Viewer with a 172KB Zig Binary I Turned npm outdated into a CI Gate — Here's How Don't fall for the Claude Mythos hype Vestige: A Gemma 4 Brain Tracker That Won't Blow Smoke Up Your Ass Gemminate: Transforming Static Textbooks into Interactive Learning Journeys with Gemma 4 Where Did All the Code Playgrounds Go? I built PROOFER - Privacy first Chrome extension that proofreads your texts using Gemma 4 I Automated My Entire Digital Product Business on a $13/Month GCP VM. Here's the Architecture. Beginner's Mind in Engineering and AI How I use AI agents to turn ideas into public demos I Built a Quotation Generator for Kenyan Street Welders Using Gemma 4's Vision The Math Behind Neural Networks — Explained Like Nobody Did for Me 🧨 Understanding TPC with IEEE802.11h What I’m Starting to Look for in Engineers An npm Downloads Comparison Chart in 300 Lines of Vanilla JS — Nice-Tick Math and API-Direct Fetch Vitreus: Local-First Spreadsheet Intelligence with Gemma 4 Transfer Fees, Metadata, and Soulbound Tokens: A Tour of Solana Token Extensions I got tired of re-explaining my codebase to ChatGPT — so I built a VS Code extension Revisiting My Phone AI After Gemma 4: The Upgrade I Didn't Know I Needed I built a privacy-first PDF merger in 7 hours — here's the stack and the lessons Google I/O 2026 made me ask an uncomfortable question: are we still coding, or are we managing builders? SSR with JavaScript: Escaping Node.js Clunkiness with AxonASP My CKA Exam-Day Experience: What Went Right, What Went Wrong, and Lessons Learned Gemma 4 Soft Tokens: The Rise and Fall of 16x16 Words ⚡👀 Two weeks ago, I built a private AI brain on my phone using Gemma 4. Yesterday, Google dropped a new variant that made everything I built feel like a beta test. 256M parameters. MoE architecture. Apache 2.0 license. I broke down what changed and why it mat I got tired of clicking through the Stripe dashboard, so I built a CLI Getting Data from Multiple Sources in Power BI: A Practical Guide to Modern Data Integration Google Is No Longer Just a Search Engine I built GemmaPod - A truly composable and portable AI agent solution powered by your local LLM Gemma 4 E4B caught three planted fabrications in 50 seconds — on a laptop, no cloud How to build an AI-powered content moderation pipeline for user comments Running Gemma 4 on a Modest Machine: Unsloth vs LM Studio vs llama.cpp vs Ollama AI Makes Building Cheap. Our Product Architectures Still Assume It’s Expensive. I built an in-browser Roku TV remote with ~80 lines of TypeScript. Here's how Roku's ECP API actually works The Direction of Blame babbled notes: a sound-to-music agent for people who could not make music before How I Built a Live SQL Workshop Where Students Can't Break Anything Rescuing a Stranded Protocol: Re-Skinning Legacy Code for the Trestle DeFi Flywheel SOLID Heuristics Reveal Incomplete Domain Knowledge — Nothing More AllasCode Intitute / FullAgenticStack: The Intent-Based Router Introducing LogicGrid — Multi-Agent AI Orchestration for .NET AI Prompt Injection, Drupal SQLi Exploitation, and Nmap for Hardening AI Agents & Python Workflows: Anthropic Skills, Jupyter Challenges, and Edge Deployment SQLite Optimization, PostgreSQL Async Queries, & DuckLake Dataframe Spec RTX 5080 Undervolt Benchmarks, CGO-Free CUDA API Binding, & AMD GPU Compatibility Fix Microsoft Burned Its 2026 AI Budget on Claude Code in Six Months. That's the Real Story. Why I Started Learning FastAPI in 2026 I Abandoned Ghost for Months — Then Came Back and Finally Finished It Building an Open MIT-Licensed Ephemeris Engine in C — JPL Moshier Ephemeris 4 Smart Ways to Manage Retries in Side Projects Securing Web APIs: A Practical Guide to Authentication & Authorization Methods Google I/O 2026: AI Built an OS in 12 Hours. I Spent Mine Sorting Screenshots. 🤦 Half a Day, Not a Week: One Nix Flake for Three Machines 🌱 Keep Feeding Your CI/CD — Or Watch It Die Gemma 4 vs GPT-4o vs Llama 3: What Actually Works Locally? Vessel Ops SSH in 2026: Why Every Developer Should Know It Cold Audit AI-Generated PRs Before You Merge Them (Swarm Orchestrator 10.3.0) App Store Optimization (ASO) I built a tool to visualize Django REST Framework architecture (URLs, Serializers, Models, and more) How I made my React site agent-ready in 100 lines AI Can Generate Interfaces on the Fly. But Users Still Need Orientation. AI-Assisted Content Workflow How We Learned That Most Resume Rejections Happen Before Humans See Your CV How I Prepared for CKA: Resources, Labs, and Strategy That Worked for Me Remix Mini PC: Moving the Whole Operating System Onto the eMMC Stop Flying Blind: We Built an LLM Evaluation Framework That Works Across 17+ Agent Frameworks The Misleading "User is not authorized to access connection" Error in AWS CodeBuild — and Why Your IAM Policy Looks Fine I Resurrected a Dead F1 Project and Accidentally Built a Race Intelligence OS Remix Mini PC: After a Year of Dead Ends, the eMMC Finally Talks Not All Games Are Equal: The Real Difference Between a Trap and a Tool How to add Peppol e-invoicing to your SaaS without making it your team's problem I Built a Hermes Agent to Tell Me Which Hackathons to Enter. It Told Me to Enter This One. The Five Hooks That Change How You Ship With Claude Code Powering Your Progress: Building Robust Solutions with Laravel
I A/B tested 4 LLMs on the same 500 queries. The results surprised me.
NovaStack · 2026-05-25 · via DEV Community

I see a lot of claims about which model is "best." Best at what? For whom? At what cost?

I got tired of guessing. So I ran my own comparison.

The setup
I took 500 real queries from my production logs – a mix of:

Code generation (120 queries)

Document summarization (150 queries)

Question answering (180 queries)

Creative writing (50 queries)

I ran each query through four models using the same prompt, same temperature (0.7), same everything.

The models:

DeepSeek-V4 Pro

Kimi 2.6

MiniMax 2.7

Qwen3 235B

I used NovaStack as the gateway – one API endpoint that let me switch models by changing one parameter. Saved me from writing integration code for four different providers.

What I measured
Response time (end-to-end latency)

Cost per query

Accuracy (human-rated on a 1-5 scale, two reviewers)

The surprising results
Fastest model: DeepSeek-V4 Pro (avg 1.8s). Qwen3 was slowest (avg 4.2s) – not surprising given its size.

Cheapest model: MiniMax 2.7 (40% cheaper than DeepSeek on similar tasks).

Most accurate overall: Qwen3 235B (4.3/5). But here's the catch – it wasn't best at everything.

Task type Best model Runner-up
Code generation DeepSeek-V4 Pro (4.6) Qwen3 (4.2)
Long doc summarization Kimi 2.6 (4.7) Qwen3 (4.1)
QA (short context) DeepSeek (4.4) MiniMax (4.2)
Creative writing Qwen3 (4.5) Kimi (4.0)
The biggest surprise: No single model won more than 45% of the task categories. The "best" model depends entirely on what you're doing.

What this means for real-world use
If you're building a production system, picking one model leaves performance on the table.

I now route based on task type:

text
Code task → DeepSeek-V4 Pro
Long document → Kimi 2.6

Image-related → MiniMax 2.7
Complex reasoning → Qwen3 235B
Everything else → DeepSeek (fast + cheap)
What broke during testing
Rate limits were inconsistent – Some models throttled me after 50 requests/minute, others after 200. I had to add per-model rate limiters.

Streaming latency hid real performance – One model sent the first token in 200ms but took 5 seconds to finish. Another took 1s to start but finished in 2s total. Measure end-to-end, not time-to-first-token.

Model responses vary in length – Even with the same prompt, Qwen3 wrote 30% longer responses than MiniMax. This affects cost and user experience.

Human rating is expensive – Two reviewers spent 6 hours rating 500 responses. Worth doing once, but not weekly.

If you want to run your own test
NovaStack (the gateway I used) offers new users credits at novapai.ai/en-US/. Enough to run a few hundred queries through all four models.

The script I used is simple:

python
models = ["deepseek-v4-pro", "kimi-2.6", "minimax-2.7", "qwen3-235b"]
results = []

for model in models:
start = time.time()
response = requests.post(
"https://api.novapai.ai/v1/chat/completions",
headers={"Authorization": f"Bearer {KEY}"},
json={"model": model, "messages": messages}
)
latency = time.time() - start
results.append({"model": model, "latency": latency, "response": response.text})
Questions for the community
What task types have you found surprising differences between models? I want to expand my benchmark.

How do you handle per-model rate limits in production? My simple retry-with-backoff feels inadequate.

Has anyone tried dynamic routing based on real-time cost/latency? Curious if that's worth the complexity.

I'll share the full benchmark dataset and rating rubric if there's interest. Just comment or DM.