Amazon primed to fuse Nvidia's NVLink into 4th-gen Trainium accelerators
2025-12-03 · via The Register - Special Features: AWS Re:invent

Re:Invent Amazon says that its next generation of homegrown silicon will deliver 6x higher performance thanks to a little help from its buddy Nvidia.

At its Re:Invent convention in Las Vegas on Tuesday, Amazon Web Services (AWS) teased its Trainium4 accelerators, which will be among the first to embrace Nvidia's NVLink Fusion interconnect tech for chip-to-chip communications.

NVLink is a high-speed interconnect that allows multiple GPUs spanning multiple systems to pool resources and behave like a single accelerator. Previously, this technology had been limited to Nvidia CPUs and GPUs, but back in May, the AI infrastructure giant announced at Computex that it was opening the tech to others with the introduction of NVLink Fusion.

Amazon claims that the technology will allow its Trainium4 accelerators, Graviton CPUs, and EFA networking tech to communicate seamlessly across Nvidia's MGX racks.

In its current form, Nvidia's fifth-gen NVLink fabrics support up to 1.8 TB/s of bandwidth (900 GB/s in each direction) per GPU, but the company is on track to double that to 3.6 TB/s by next year.

Beyond Nvidia's interconnect tech, details are somewhat vague. We're told that the new chips will deliver 3x more FLOPS at FP8, 6x the performance at FP4, and 4x the memory bandwidth. Whether those claims pertain to the individual chips or its UltraServer rack systems, Amazon hasn't said.

Assuming it's the rack systems, as was the case with Trainium3, that suggests AWS's Trainium4 UltraServers could deliver upwards of 2 exaFLOPS of dense FP4 performance and 2.8 petabytes a second of memory bandwidth.

That latter point is likely to be a major boon for bandwidth-bound inference workloads. Despite a rather confusing naming convention, AWS actually employs Trainium for both internal and external training and inference.
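Here's a back-of-the-envelope sketch of that projection in Python. Like the estimate above, it assumes AWS's multipliers apply at the rack level. It also assumes Trainium3's dense FP4 throughput is no higher than its 363 petaFLOPS of dense FP8, which the 2-exaFLOPS figure implies but AWS hasn't spelled out. The dictionary and function names are ours, purely for illustration.

```python
"""Back-of-the-envelope projection of Trainium4 UltraServer figures.

Inputs are AWS's Trainium3 UltraServer numbers and its quoted Trainium4
multipliers; all names here are illustrative only.
"""

# Trainium3 UltraServer (144 chips) figures quoted by AWS.
TRN3_RACK = {
    "dense_fp8_pflops": 363.0,
    # ASSUMPTION: dense FP4 runs no faster than dense FP8 on Trainium3.
    "dense_fp4_pflops": 363.0,
    "mem_bw_tbps": 706.0,
}

# ASSUMPTION: the Trainium4 multipliers apply per rack, not per chip.
TRN4_UPLIFT = {
    "dense_fp8_pflops": 3.0,
    "dense_fp4_pflops": 6.0,
    "mem_bw_tbps": 4.0,
}


def project_rack(base: dict, uplift: dict) -> dict:
    """Scale each baseline metric by its generational multiplier."""
    return {metric: base[metric] * uplift[metric] for metric in base}


if __name__ == "__main__":
    trn4 = project_rack(TRN3_RACK, TRN4_UPLIFT)
    print(f"Dense FP8: {trn4['dense_fp8_pflops'] / 1000:.2f} exaFLOPS")
    print(f"Dense FP4: {trn4['dense_fp4_pflops'] / 1000:.2f} exaFLOPS")
    print(f"Memory BW: {trn4['mem_bw_tbps'] / 1000:.2f} PB/s")
```

Run as-is, it prints about 1.09 exaFLOPS of dense FP8, 2.18 exaFLOPS of dense FP4, and 2.82 PB/s. If the multipliers turn out to be per chip instead, you'd also need the Trainium4 rack's chip count, which we don't know.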

Of course, the devil is in the details, and we simply don't have all of them yet. Amazon made similar claims about its Trainium3 UltraServers this time last year, boasting a 4.4x uplift in compute over its Trainium2 racks. But while technically true, what we didn't know at the time was that roughly half of that uplift would come from more than doubling the number of chips from 64 to 144.
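To see how that 4.4x splits, treat the rack-level uplift as the product of two factors: how many more chips there are, and how much faster each chip is. The sketch below (function name ours) does that arithmetic. "Half" is meant in the multiplicative sense here, with each factor landing at roughly 2x.

```python
"""Split a rack-level speedup into a chip-count factor and a per-chip factor."""

TRN2_CHIPS = 64    # chips per Trainium2 UltraServer
TRN3_CHIPS = 144   # chips per Trainium3 UltraServer
RACK_UPLIFT = 4.4  # AWS's claimed Trainium3-over-Trainium2 compute uplift


def split_uplift(rack_uplift: float, old_chips: int, new_chips: int) -> tuple[float, float]:
    """Return (chip_count_factor, per_chip_factor).

    The two factors multiply back to rack_uplift.
    """
    chip_count_factor = new_chips / old_chips
    per_chip_factor = rack_uplift / chip_count_factor
    return chip_count_factor, per_chip_factor


if __name__ == "__main__":
    more_chips, faster_chips = split_uplift(RACK_UPLIFT, TRN2_CHIPS, TRN3_CHIPS)
    print(f"More chips: {more_chips:.2f}x, faster chips: {faster_chips:.2f}x")
```

That works out to 2.25x from the extra chips and about 1.96x from each chip getting faster.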

Trainium3 arrives on EC2

Speaking of Trainium3, a year after first teasing the chips, Amazon is finally ready to bring its third generation of Trainium accelerators to the general market.

According to AWS, each chip is equipped with 144 GB of HBM3E memory, good for 4.9 TB/s of memory bandwidth, and is capable of churning out just over 2.5 petaFLOPS of dense FP8 performance.

However, for jobs that benefit from sparsity, like training, the chips are even more potent. Trainium3 features 16:4 structured sparsity, which effectively quadruples the chip's performance to 10 petaFLOPS for supported workloads.
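Amazon hasn't detailed how Trainium3 implements this, but the general idea of structured sparsity is easy to sketch. Reading "16:4" as "keep four non-zero weights in every group of 16" (our interpretation, consistent with the 4x figure), the toy Python below prunes a weight vector by magnitude. The names `prune_16_4` and `ideal_sparse_pflops` are hypothetical, and this is not AWS's implementation.

```python
"""Toy illustration of 16:4 structured sparsity (not AWS's implementation)."""

GROUP = 16  # weights per group
KEEP = 4    # non-zero weights kept per group


def prune_16_4(weights: list[float]) -> list[float]:
    """Zero all but the KEEP largest-magnitude weights in each GROUP-sized block.

    A real accelerator would also need to record which positions survived
    so it can skip the zeros; that bookkeeping is omitted here.
    """
    if len(weights) % GROUP != 0:
        raise ValueError(f"length must be a multiple of {GROUP}")
    pruned: list[float] = []
    for start in range(0, len(weights), GROUP):
        block = weights[start:start + GROUP]
        # Positions of the KEEP entries with the largest absolute value.
        ranked = sorted(range(GROUP), key=lambda i: abs(block[i]), reverse=True)
        keep = set(ranked[:KEEP])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(block))
    return pruned


def ideal_sparse_pflops(dense_pflops: float) -> float:
    """Upper bound if only KEEP of every GROUP multiply-accumulates run."""
    return dense_pflops * GROUP / KEEP


if __name__ == "__main__":
    weights = [0.1, -2.0, 0.3, 0.05, 1.5, -0.2, 0.0, 0.7,
               -3.0, 0.4, 0.25, 0.6, -0.9, 0.15, 2.5, 0.35]
    print(prune_16_4(weights))
    # ASSUMPTION: ~2.52 PFLOPS dense FP8 per chip ("just over 2.5").
    print(f"{ideal_sparse_pflops(2.52):.2f} PFLOPS")
```

The 4x from `ideal_sparse_pflops` is a ceiling: it only holds for workloads whose weights can actually be pruned this way without hurting accuracy.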

Amazon's Trainium3 UltraServers cram 144 of these chips connected in an all-to-all fabric using its NeuronSwitch-v1 interconnect tech, which Amazon says offers twice the chip-to-chip bandwidth of the previous generation.

This is a marked change from Amazon's Trainium2 UltraServers, which featured 64 accelerators arranged in a 4x4x4 3D torus topology.
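For a feel of what a 4x4x4 3D torus means, the Python sketch below (helper names ours) models each Trainium2 chip as a coordinate with direct links to its six neighbours, wrapping around at the edges, and works out the worst-case number of hops between two chips. In an idealized single-tier switched fabric, by contrast, every pair of chips is a single trip through a switch apart.

```python
"""Hop counts in a 4x4x4 3D torus, like the 64 chips in a Trainium2 UltraServer."""

from itertools import product

DIMS = (4, 4, 4)

Node = tuple[int, int, int]


def torus_neighbors(node: Node, dims: tuple[int, ...] = DIMS) -> set[Node]:
    """Direct links: one step either way along each axis, wrapping at edges."""
    neighbors = set()
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            coord = list(node)
            coord[axis] = (coord[axis] + step) % size
            neighbors.add(tuple(coord))
    return neighbors


def torus_hops(a: Node, b: Node, dims: tuple[int, ...] = DIMS) -> int:
    """Minimum hops between two chips; each axis can wrap the short way round."""
    return sum(min(abs(x - y), size - abs(x - y)) for x, y, size in zip(a, b, dims))


def torus_diameter(dims: tuple[int, ...] = DIMS) -> int:
    """Worst-case hop count. A torus looks the same from every node,
    so measuring from the origin is enough."""
    origin = (0,) * len(dims)
    nodes = product(*(range(size) for size in dims))
    return max(torus_hops(origin, node, dims) for node in nodes)


if __name__ == "__main__":
    chips = list(product(*(range(size) for size in DIMS)))
    print(f"{len(chips)} chips, {len(torus_neighbors((0, 0, 0)))} links each")
    print(f"Worst case: {torus_diameter()} hops")
```

In the worst case, traffic between two Trainium2 chips in this model crosses six links, against a single switch traversal in the idealized switched case.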

Amazon declined to comment on exactly how the 144 Trainium3 accelerators are wired to one another, but if we had to guess, the topology likely resembles the flat switched designs used in Nvidia's NVL72 or AMD's Helios rack systems.

Such a move should ease the transition to NVLink Fusion in the next generation, but leaves Google as one of the few chip designers still using mesh topologies in large-scale AI training and inference clusters.

In any case, Amazon seems confident that its new interconnect tech and EFA networking will enable it to support production deployments containing up to a million accelerators, compared to the 500,000 Trainium2 chips found in Project Rainier.

Combined, each Trainium3 UltraServer features 20.7 TB of HBM3E, 706 TB/s of memory bandwidth, and between 363 and 1,452 petaFLOPS of FP8 compute, depending on whether your workload actually benefits from sparsity.
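Those rack-level numbers are simply the per-chip specs multiplied by 144, as the quick Python check below shows. We've assumed roughly 2.52 petaFLOPS of dense FP8 per chip (AWS's "just over 2.5", backed out from 363 / 144) and decimal terabytes; the constant and function names are ours.

```python
"""Sanity-check Trainium3 UltraServer totals from per-chip figures."""

CHIPS = 144                 # Trainium3 chips per UltraServer
HBM_GB_PER_CHIP = 144       # HBM3E capacity per chip
MEM_BW_TBPS_PER_CHIP = 4.9  # memory bandwidth per chip
# ASSUMPTION: "just over 2.5" PFLOPS dense FP8, taken as 363 / 144 ~= 2.52.
DENSE_FP8_PFLOPS_PER_CHIP = 2.52
SPARSITY_FACTOR = 4         # 16:4 structured sparsity


def ultraserver_totals(chips: int = CHIPS) -> dict[str, float]:
    """Aggregate per-chip figures across a rack (decimal units: 1 TB = 1,000 GB)."""
    dense = chips * DENSE_FP8_PFLOPS_PER_CHIP
    return {
        "hbm_tb": chips * HBM_GB_PER_CHIP / 1000,
        "mem_bw_tbps": chips * MEM_BW_TBPS_PER_CHIP,
        "dense_fp8_pflops": dense,
        "sparse_fp8_pflops": dense * SPARSITY_FACTOR,
    }


if __name__ == "__main__":
    for name, value in ultraserver_totals().items():
        print(f"{name}: {value:,.1f}")
```

The results (about 20.7 TB, 705.6 TB/s, 362.9 and 1,451.5 petaFLOPS) round to the figures AWS quotes.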

This puts the systems roughly on par with Nvidia's latest Blackwell Ultra-based GB300 NVL72 systems, at least at FP8. At FP4, the gap widens considerably, with the Nvidia system delivering more than 3x the performance.

With that said, FP4 is still primarily used in inference, while higher-precision datatypes like BF16 and FP8 are preferred for training.

Despite Trainium's advancements in performance, some customers aren't ready to abandon Nvidia just yet, so Amazon has also announced the availability of new compute offerings based on Nvidia's GB300 NVL72, which join the company's existing GB200 instances. ®