HPE to build Discovery exascale successor for Oak Ridge

HPE's Discovery to succeed Frontier supercomputer with next-gen Cray tech

Oak Ridge's $500M system due in 2028, paired with a separate Lux AI cluster arriving two years earlier

HPE is set to build a successor to the Frontier exascale system for America's Oak Ridge National Laboratory, based on the next generation of its Cray supercomputer platform. It will also supply a separate AI cluster, run as a multi-tenant, cloud-like platform, to advance machine learning work.

The Discovery system will "bolster productivity up to 10x," according to HPE, and like many other supercomputers will be used for scientific research into various areas including medicine, cancer research, nuclear energy, and aerospace.

Mock-up of HPE's forthcoming Discovery GX5000 system

Oak Ridge issued a request for proposals (RFP) for a successor to Frontier last year, with an expected delivery date of late 2027 to early 2028 and anticipated budget of $500 million.

HPE now says delivery of Discovery is expected in 2028, with user operations set to begin in 2029.

The national laboratory will also receive a second HPE-built system, Lux, the AI cluster intended to support both training and inference work at the site. This is expected to be installed early in 2026.

Discovery will be based on HPE's Cray Supercomputing GX5000, the next iteration of its supercomputing architecture, and will also feature a new Cray Storage Systems K3000 running the DAOS object storage platform, plus the next generation of Cray's Slingshot high-performance networking.

HPE says the Discovery nodes will be built with AMD's server processors code-named "Venice," which are not due to launch until next year, plus Instinct MI430X GPUs, also due next year, to deliver the level of performance required for modeling, simulation, and AI projects.

However, HPE did not disclose how many nodes or CPUs and GPUs will go into building Discovery, or how much memory the system will have.

For the interconnect, Discovery will use the next generation of the Slingshot networking that HPE gained when it acquired Cray, although this has yet to launch and the company did not say when it will. The current Slingshot 11 supports 200 Gbps per port, and can be regarded as a superset of Ethernet.

Discovery will be supported by Cray Storage Systems K3000, which HPE claims will deliver up to 75 million input/output operations per second per storage rack. The firm says this is 4x the performance of the next 30 storage systems on the IO500 list.

This will be based on the open source DAOS (Distributed Asynchronous Object Storage) platform, but will complement rather than replace the Lustre file system-based Cray Storage Systems E2000, which will also be included in Discovery.

DAOS was developed by Intel, but farmed out to an independent foundation after the chipmaker canceled its Optane memory technology in 2022 and lost interest. HPE then hired Intel's DAOS engineers and brought them into its own storage team.

Lux, meanwhile, is set to be an all-AMD affair, based on liquid-cooled HPE ProLiant Compute XD685 nodes with Epyc CPUs and Instinct MI355X GPUs, linked together using AMD's Pensando SmartNIC networking.

Liquid cooling innovations

Trish Damkroger, HPE's senior VP for HPC and AI Infrastructure Solutions, told The Register that the GX5000 had been in the works for years, but the company had "made some pivots over the last year and a half, as we've seen the growth of TDPs (thermal design points), the growth of different silicon coming out from all the vendors, and the need to be able to support all of these different workloads."

She said the racks will be able to accommodate up to 25 kilowatts per compute slot, 127 percent higher than before. But she seemed prouder of the liquid cooling for the GX5000 infrastructure, which now supports 40°C (104°F) water to meet new energy requirements for a lot of customers in Europe.
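For readers who want to check the figures, here is a rough back-of-the-envelope sketch. It assumes "127 percent higher" means the new per-slot limit is 2.27 times the old one; HPE did not state the earlier figure, so the result is an inference rather than a quoted spec. The function names are illustrative only.

```python
def baseline_from_increase(new_value: float, percent_increase: float) -> float:
    """Return the earlier value implied by an 'X percent higher than before' claim.

    Assumption: new_value = old_value * (1 + percent_increase / 100).
    """
    return new_value / (1.0 + percent_increase / 100.0)


def celsius_to_fahrenheit(celsius: float) -> float:
    """Standard Celsius-to-Fahrenheit conversion."""
    return celsius * 9.0 / 5.0 + 32.0


# 25 kW per compute slot, said to be 127 percent higher than before.
previous_kw = baseline_from_increase(25.0, 127.0)

# 40 degrees C cooling water, as quoted in the article.
inlet_f = celsius_to_fahrenheit(40.0)

print(f"Implied previous per-slot power: {previous_kw:.1f} kW")
print(f"40 C cooling water: {inlet_f:.0f} F")
```

Under that reading, the earlier limit works out to roughly 11 kW per compute slot, and the 40°C water figure matches the 104°F quoted above.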

Because the system can run on warmer water, additional chillers and refrigeration are not needed. That cuts power consumption and makes for a more energy-efficient system in upcoming deployments.

"It is a bookend design," she said. "So basically, the cooling pump is designed to be more compact. And can be placed on the side of the system instead of in the middle. And each pump is going to have redundancy to ensure that there's always-on operation."

Damkroger added that users can now control the water flow rate, so instead of every blade receiving the same flow, it can be optimized for each blade and its workloads.

HPE said there will be an opportunity to see the new GX5000 infrastructure at the SC25 high-performance computing conference in St. Louis, Missouri, next month, though the platform is not expected to be available to customers until early 2027. ®
