The Microsecond Lie: Why your Go timers are lying about the GPU

TL;DR: I thought my CUDA kernel was running in 160 microseconds. I was wrong. Here is how I used CUDA Events in pure Go to find the real hardware time, and why CPU-side timers are the wrong tool for GPU forensics.

In my last post, I talked about how I deleted an 8.4GB Python sidecar by building direct CUDA bindings for Go without cgo.

Once I had the "Impossible Build" working, I started bragging about the performance. I wrapped my kernel launch in a standard Go time.Since(start) block and saw 162 microseconds.

I thought I had built a speed demon. Then I implemented real GPU Events and found the truth.

The Misleading Metric

A CUDA kernel launch is asynchronous with respect to the CPU. The CPU doesn't wait for the GPU to finish; it just puts the task in a queue (a Stream) and returns control to your Go program immediately.

My 162-microsecond measurement wasn't measuring the math. It was only measuring how long it took the Go runtime to talk to the NVIDIA driver and enqueue the job.

My timer had already stopped while the GPU was still working through the data.

The Hardware Truth (RTX 4070 Ti)

To find the real numbers, I had to implement CUDA Events. These are markers you place directly into the hardware stream. The GPU itself records a timestamp when it reaches the marker, bypassing the CPU clock entirely.

I ran a 10M element vector addition on an RTX 4070 Ti. Here is what the hardware actually said:

Measurement Method	Reported Time	What it actually measured
CPU `time.Since` (Async)	~160 µs	Time to enqueue the work
GPU `cuda.Event` (Actual)	~434 µs	Actual compute time on silicon
CPU `time.Since` (with Sync)	~404 µs	Enqueue + Execution + Runtime overhead

The real compute time was about 2.7x longer than the asynchronous CPU timer led me to believe (~434 µs vs. ~160 µs).
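A back-of-the-envelope bandwidth check would have flagged the 160 µs figure as suspicious. The sketch below assumes float32 elements (the element type is my assumption here), so a vector add reads two arrays and writes one, moving 3 x 10M x 4 bytes. It also assumes the RTX 4070 Ti's published peak memory bandwidth of roughly 504 GB/s. The effectiveGBps helper is a hypothetical name for illustration.

```go
package main

import "fmt"

// effectiveGBps returns the memory throughput (in GB/s) implied by moving
// `bytes` bytes in `micros` microseconds.
func effectiveGBps(bytes, micros float64) float64 {
	return bytes / (micros * 1e-6) / 1e9
}

func main() {
	const n = 10_000_000
	// Assumption: float32 elements, two input reads plus one output write.
	const bytesMoved = 3 * n * 4

	fmt.Printf("implied by 160 µs: %.0f GB/s\n", effectiveGBps(bytesMoved, 160))
	fmt.Printf("implied by 434 µs: %.0f GB/s\n", effectiveGBps(bytesMoved, 434))
}
```

Under those assumptions, 160 µs implies about 750 GB/s, which is more than the card's memory system can deliver, while 434 µs implies about 276 GB/s, which is physically plausible. A number that beats the spec sheet is usually a measurement bug.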

Implementation in Pure Go

Measuring this accurately required adding NewEvent, Record, and ElapsedTime to the gocudrv package. Since we aren't using cgo, I had to bind the cuEventElapsedTime symbol manually and convert the C float it writes (the elapsed time in milliseconds) into a Go value.

Here is what the "truth-telling" code looks like now:

// 1. Create the hardware stopwatches
start, _ := ctx.NewEvent()
stop, _ := ctx.NewEvent()

// 2. Place markers in the stream
start.Record(stream)
fn.LaunchOn(ctx, stream, cfg, args...)
stop.Record(stream)

// 3. Wait for the STOP marker to be reached
stop.Synchronize(ctx)

// 4. Get the hardware duration
duration, _ := start.Elapsed(stop)
fmt.Printf("Actual GPU time: %v\n", duration)

The Lesson for AI Infrastructure

As we move toward Go-based AI infrastructure, we have to be careful about "Measurement Drift."

If you are building an inference gateway or a real-time image processor in Go, wrapping asynchronous GPU launches in CPU timers will make your P99s look incredible on paper while your users experience mysterious latency.

You can't optimize what you can't measure. If you aren't using hardware events, you are just measuring the speed of your request queue, not the speed of your product.

What's Next?

Now that I have a microsecond-accurate stopwatch, I can finally start optimizing the data path. I'm currently working on CUDA Graphs to reduce that ~160 µs enqueue overhead by bundling complex task topologies into a single hardware command.

If you're interested in the forensics of low-level Go or want to help build the cgo-free bridge, check out the progress on GitHub.

https://github.com/eitamring/gocudrv
