The Microsecond Lie: Why Your Go Timers Are Lying About the GPU
Eitamos Ring · 2026-05-24 · via DEV Community

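TL;DR: I thought my CUDA kernel ran in 160 microseconds. I was wrong. Here is how I used CUDA events from pure Go to find the real hardware time, and why a CPU-side timer is the wrong tool for GPU forensics.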


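In my previous post, I described how I eliminated an 8.4GB Python sidecar by building direct CUDA bindings for Go without cgo.

Once I got the "impossible build" working, I started bragging about performance. I wrapped the kernel launch in a standard Go time.Since(start) block and saw 162 microseconds.

I thought I had built a speed demon. Then I implemented real GPU events and found out the truth.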

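The Lying Metric

When you launch a CUDA kernel, the launch is fully asynchronous. The CPU does not wait for the GPU to finish; it just places the task in a queue (a stream) and immediately returns control to your Go program.

My 162-microsecond measurement was not measuring the math. It only measured how long the Go runtime took to talk to the NVIDIA driver and enqueue the job.

My timer had stopped before the GPU had even finished the first row of the matrix.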

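The Hardware Truth (RTX 4070 Ti)

To find the real numbers, I had to implement CUDA events. These are markers you place directly in the hardware stream. The GPU records a timestamp when it reaches a marker, bypassing the CPU clock entirely.

I ran a 10M-element vector addition on an RTX 4070 Ti. Here is what the hardware actually reported: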

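| Measurement method | Reported time | What it actually measures |
| --- | --- | --- |
| CPU time.Since (async) | ~160 µs | Time to submit the job |
| GPU cuda.Event (real) | ~434 µs | Actual compute time on the silicon |
| CPU time.Since (with sync) | ~404 µs | Enqueue + execution + runtime overhead |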

The hardware compute time was 2.7x slower than my CPU timer had led me to believe.

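The Pure-Go Implementation

Measuring this accurately required adding NewEvent, Record, and ElapsedTime to the gocudrv package. Since we don't use cgo, I had to bind the cuEventElapsedTime symbol by hand and handle the C-to-Go float32 conversion.

Here is what the "truth-telling" code looks like now: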

// 1. Create the hardware stopwatches
start, _ := ctx.NewEvent()
stop, _ := ctx.NewEvent()

// 2. Place markers in the stream
start.Record(stream)
fn.LaunchOn(ctx, stream, cfg, args...)
stop.Record(stream)

// 3. Wait for the STOP marker to be reached
stop.Synchronize(ctx)

// 4. Get the hardware duration
duration, _ := start.Elapsed(stop)
fmt.Printf("Actual GPU time: %v\n", duration)


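Lessons for AI Infrastructure

As we move toward Go-based AI infrastructure, we have to be careful about "measurement drift". If you build an inference gateway or a real-time image processor in Go, CPU timers will make your P99s look great on paper while your users experience mysterious latency.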

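You can't optimize what you can't measure. If you aren't using hardware events, you are only measuring the speed of your request queue, not the speed of your product.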

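What's Next?

Now that I have a microsecond-accurate stopwatch, I can finally start optimizing the data path. I'm currently working on CUDA Graphs to reduce the 160µs enqueue overhead by bundling complex task topologies into a single hardware command.

If you're interested in low-level Go forensics, or want to help build the cgo-free bridge, check out the progress on GitHub.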

https://github.com/eitamring/gocudrv