PyTorch AOTInductor Hybrid Lowering
2026-05-28 · via Lei Mao's Log Book
from pathlib import Path

import torch
import torch.nn as nn
import torch.profiler
from torch._inductor import aoti_compile_and_package, aoti_load_package
from torch._subclasses.fake_tensor import FakeTensorMode


class MLP(nn.Module):
    """MLP configurable across CPU, GPU, or a CPU-GPU hybrid split.

    fc1 (+ GELU) is placed on *fc1_device*; fc2 is placed on *fc2_device*.
    When the two devices differ the forward pass inserts an explicit device
    transfer, preserved as an aten._to_copy node in the exported graph.
    When they are the same the transfer is a no-op.
    """

    def __init__(
        self,
        in_features: int,
        hidden_features: int,
        out_features: int,
        fc1_device: torch.device = torch.device("cpu"),
        fc2_device: torch.device = torch.device("cpu")
    ) -> None:
        super().__init__()
        with torch.device(fc1_device):
            self.fc1 = nn.Linear(in_features, hidden_features)
            self.act = nn.GELU()
        with torch.device(fc2_device):
            self.fc2 = nn.Linear(hidden_features, out_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.act(self.fc1(x))

        h = h.to(self.fc2.weight.device)
        return self.fc2(h)


def aoti_compile(model: nn.Module, x: torch.Tensor,
                 package_path: str) -> object:
    """Export and AOTInductor-compile a model.

    A fake input with the same shape/dtype/device as x is used so that
    torch.export can trace the graph without allocating real activation memory.
    Works for any device (cpu, cuda) and any model topology.
    """
    with FakeTensorMode():
        fake_input = torch.empty(x.shape, dtype=x.dtype, device=x.device)
    ep = torch.export.export(model, (fake_input, ))
    compiled_package = aoti_compile_and_package(ep, package_path=package_path)
    return aoti_load_package(compiled_package, run_single_threaded=True)


def profile_runner(runner,
                   x: torch.Tensor,
                   trace_path: str,
                   label: str,
                   warmup: int = 3,
                   steps: int = 5) -> None:
    """Profile an AOTI runner and export a Chrome trace to *trace_path*.

    Note: AOTI runners call compiled C++ directly, bypassing the ATen
    dispatcher's profiling hooks. As a result, no cpu_op events (e.g.
    aten::mm, aten::gelu) appear in the trace; the runner executes as an
    opaque native call from the profiler's perspective. What the trace does
    capture are CUDA runtime events (cudaLaunchKernel) and, when CUPTI is
    available, actual GPU kernel execution on the device timeline.
    """
    activities = [
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
    schedule = torch.profiler.schedule(wait=0,
                                       warmup=warmup,
                                       active=steps,
                                       repeat=1)
    with torch.profiler.profile(
            activities=activities,
            schedule=schedule,
            record_shapes=True,
            with_flops=True,
    ) as prof:
        for step in range(warmup + steps):
            with torch.profiler.record_function(f"step_{step}"):
                runner(x)
            prof.step()
    prof.export_chrome_trace(trace_path)
    print(f"{label} trace written to {trace_path}")
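
The profile_runner docstring above states that AOTI runs leave no cpu_op events in the trace. One way to check this on an exported file is to count events per category. The following is a minimal sketch using only the standard library. `summarize_trace` is a hypothetical helper that is not part of the original script, and it assumes the file follows the Chrome Trace Event layout (a top-level `traceEvents` list whose entries carry `ph` and `cat` fields). The category names the profiler emits (for example `cpu_op`, `cuda_runtime`, `kernel`) are an assumption here and may differ between PyTorch versions.

```python
import json
from collections import Counter


def summarize_trace(trace_path: str) -> Counter:
    """Count complete ("ph" == "X") events per category in a Chrome trace.

    Hypothetical helper for illustration; not part of the original script.
    """
    with open(trace_path, "r", encoding="utf-8") as f:
        trace = json.load(f)
    # A Chrome trace is either a bare list of events or an object that
    # stores the list under the "traceEvents" key.
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    counts: Counter = Counter()
    for event in events:
        # Skip metadata and other non-duration events.
        if event.get("ph") == "X":
            counts[event.get("cat", "<none>")] += 1
    return counts
```

After the script has run, calling `summarize_trace` on one of the exported files (for example the hybrid trace) and printing the result should show whether any `cpu_op` entries are present next to the CUDA runtime and kernel entries. A missing category simply reads as zero, because `Counter` returns 0 for absent keys.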


if __name__ == "__main__":

    cpu_device = torch.device("cpu")
    gpu_device = torch.device("cuda")

    artifacts_dir = Path(__file__).parent / "aoti_artifacts"
    artifacts_dir.mkdir(exist_ok=True)

    model_cpu = MLP(128, 256, 10, fc1_device=cpu_device,
                    fc2_device=cpu_device).eval()
    x_cpu = torch.randn(4, 128, device=cpu_device)
    runner_cpu = aoti_compile(model_cpu, x_cpu, str(artifacts_dir / "cpu.pt2"))
    torch.testing.assert_close(runner_cpu(x_cpu), model_cpu(x_cpu))
    print("AOTInductor compile (CPU) succeeded.")
    profile_runner(runner_cpu, x_cpu, str(artifacts_dir / "cpu_trace.json"),
                   "AOTInductor (CPU)")

    model_gpu = MLP(128, 256, 10, fc1_device=gpu_device,
                    fc2_device=gpu_device).eval()
    x_cuda = torch.randn(4, 128, device=gpu_device)
    runner_gpu = aoti_compile(model_gpu, x_cuda,
                              str(artifacts_dir / "cuda.pt2"))
    torch.testing.assert_close(runner_gpu(x_cuda), model_gpu(x_cuda))
    print("AOTInductor compile (GPU) succeeded.")
    profile_runner(runner_gpu, x_cuda, str(artifacts_dir / "cuda_trace.json"),
                   "AOTInductor (GPU)")

    model_hybrid = MLP(128,
                       256,
                       10,
                       fc1_device=cpu_device,
                       fc2_device=gpu_device).eval()
    x_hybrid = torch.randn(4, 128, device=cpu_device)
    runner_hybrid = aoti_compile(model_hybrid, x_hybrid,
                                 str(artifacts_dir / "hybrid.pt2"))
    torch.testing.assert_close(runner_hybrid(x_hybrid), model_hybrid(x_hybrid))
    print("AOTInductor compile (CPU-GPU hybrid) succeeded.")
    profile_runner(runner_hybrid, x_hybrid,
                   str(artifacts_dir / "hybrid_trace.json"),
                   "AOTInductor (CPU-GPU hybrid)")