Intel Details Ponte Vecchio GPU & Sapphire Rapids HBM Performance, Up To 2.5x Faster Than NVIDIA A100

During Hot Chips 34, Intel once again described its Ponte Vecchio GPUs running on a Sapphire Rapids HBM server platform.

Intel Shows Off Ponte Vecchio 2-Stack GPU & Sapphire Rapids HBM CPU Performance Against NVIDIA’s A100

In the presentation by Intel Fellow & Chief GPU Compute Architect Hong Jiang, we get more details about the upcoming server powerhouses from the blue team. The Ponte Vecchio GPU comes in three configurations, starting with a single OAM and scaling up to a four-GPU (x4) subsystem connected via Xe Link, deployed either standalone or alongside a dual-socket Sapphire Rapids platform.

The OAM supports all-to-all topologies for both 4-GPU and 8-GPU platforms. Complementing the hardware is Intel’s oneAPI software stack, whose Level Zero API provides a low-level hardware interface to support cross-architecture programming (a minimal sketch follows the list below). Some of the key features of Level Zero are:

  • Interface for oneAPI and other tools to access accelerator devices
  • Fine-grain, low-latency control over accelerator capabilities
  • Multi-threaded design
  • For GPUs, ships as part of the driver
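For those curious what that low-level interface actually looks like, here’s a minimal, illustrative sketch of Level Zero device enumeration. This is a generic example built against the public Level Zero headers, not code from Intel’s presentation, and error handling is omitted for brevity:

```cpp
// Minimal Level Zero sketch: enumerate GPU drivers and devices, then
// print each device's name. Error handling trimmed for brevity.
#include <level_zero/ze_api.h>
#include <cstdio>
#include <vector>

int main() {
    // Initialize the driver stack for GPU devices only.
    zeInit(ZE_INIT_FLAG_GPU_ONLY);

    // Discover available drivers (typically one per vendor stack).
    uint32_t driverCount = 0;
    zeDriverGet(&driverCount, nullptr);
    std::vector<ze_driver_handle_t> drivers(driverCount);
    zeDriverGet(&driverCount, drivers.data());

    for (auto driver : drivers) {
        // Discover the devices exposed by this driver.
        uint32_t deviceCount = 0;
        zeDeviceGet(driver, &deviceCount, nullptr);
        std::vector<ze_device_handle_t> devices(deviceCount);
        zeDeviceGet(driver, &deviceCount, devices.data());

        for (auto device : devices) {
            ze_device_properties_t props{};
            props.stype = ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES;
            zeDeviceGetProperties(device, &props);
            std::printf("Device: %s\n", props.name);
        }
    }
    return 0;
}
```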

In terms of performance stats, a 2-Stack Ponte Vecchio GPU configuration like the one on a single OAM delivers up to 52 TFLOPs of FP64/FP32 compute, 419 TFLOPs of TF32 (XMX Float 32), 839 TFLOPs of BF16/FP16, and 1,678 TOPs of INT8.
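That ladder is easy to sanity-check: going by the quoted figures, the XMX TF32 rate is roughly 8x the vector FP32 rate, and each further halving of precision doubles throughput again. A quick back-of-the-envelope check (the 8x multiplier is inferred from the figures above, not an Intel-stated constant):

```cpp
// Back-of-the-envelope check of the quoted Ponte Vecchio peak-throughput
// ladder: XMX TF32 at roughly 8x the vector FP32 rate, then 2x per
// precision halving (TF32 -> BF16/FP16 -> INT8).
#include <cstdio>

int main() {
    const double fp64_fp32 = 52.0;        // TFLOPs, vector pipeline
    const double tf32      = fp64_fp32 * 8.0;  // ~416; Intel quotes 419
    const double bf16_fp16 = tf32 * 2.0;       // ~832; Intel quotes 839
    const double int8      = bf16_fp16 * 2.0;  // ~1664 TOPs; Intel quotes 1678
    std::printf("TF32 %.0f | BF16/FP16 %.0f | INT8 %.0f\n",
                tf32, bf16_fp16, int8);
    return 0;
}
```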

Intel also detailed the maximum cache sizes and the peak bandwidth each level offers; the ratios in parentheses are the bandwidth step-down relative to the level above. The register file on the Ponte Vecchio GPU totals 64 MB at 419 TB/s of bandwidth, the L1 cache also comes in at 64 MB at 105 TB/s (4:1), and the L2 cache comes in at 408 MB at 13 TB/s (8:1), while the HBM memory scales up to 128 GB at 4.2 TB/s (4:1). There is a range of computational efficiency techniques within Ponte Vecchio, such as:

Register file:

  • Register cache
  • Accumulators

L1/L2 cache:

  • Write through
  • Write back
  • Write streaming
  • Uncached

Prefetch:

  • Software (instruction) prefetch to L1 and/or L2 (illustrated generically below)
  • Command Streamer prefetch to L2 for instructions and data
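Software prefetch on a GPU follows the same idea CPUs have used for years: hint upcoming data into a cache before it is needed, so compute doesn’t stall on memory. Ponte Vecchio’s actual prefetch instructions live in its GPU ISA, but the concept can be illustrated with a generic CPU-side sketch using a compiler builtin (purely illustrative; this is not Intel GPU code):

```cpp
// Generic illustration of software prefetching: hint the next block of
// data toward the cache while the current block is being processed.
// __builtin_prefetch (GCC/Clang) stands in for a GPU prefetch instruction.
#include <cstddef>

double sum_with_prefetch(const double* data, std::size_t n) {
    constexpr std::size_t kAhead = 64;  // elements to prefetch ahead
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + kAhead < n)
            __builtin_prefetch(&data[i + kAhead], /*rw=*/0, /*locality=*/1);
        sum += data[i];
    }
    return sum;
}
```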

Intel explains that the larger L2 cache can deliver huge gains in workloads such as 2D FFT and DNN cases. Performance comparisons are shown between a full Ponte Vecchio GPU (408 MB of L2) and the same module with its L2 capped at 80 MB and 32 MB.

But that’s not all: Intel also showed performance comparisons between NVIDIA’s Ampere A100 running CUDA and SYCL and its own Ponte Vecchio GPUs running SYCL. In miniBUDE, a computational workload that predicts the binding energy of a ligand to a protein target, the Ponte Vecchio GPU runs the test up to 2 times faster than the Ampere A100. Another result comes from ExaSMR (small modular reactor simulation for large nuclear reactor designs), where the Intel GPU shows a performance advantage of 1.5x over the NVIDIA GPU.
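Both of those benchmarks are SYCL codes, which is what lets one source tree target both Ponte Vecchio (via oneAPI) and the A100 (via a CUDA-backed SYCL compiler). For readers unfamiliar with SYCL’s single-source model, a minimal kernel looks like the following (a generic vector add, not the actual miniBUDE or ExaSMR kernels):

```cpp
// Minimal SYCL sketch of the single-source model used by miniBUDE and
// ExaSMR: the same C++ compiles for Intel or NVIDIA GPUs depending on
// the SYCL backend.
#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
    constexpr std::size_t N = 1 << 20;
    sycl::queue q{sycl::gpu_selector_v};

    // Unified shared memory, visible to both host and device.
    float* a = sycl::malloc_shared<float>(N, q);
    float* b = sycl::malloc_shared<float>(N, q);
    for (std::size_t i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Offload an element-wise add to whichever GPU the queue selected.
    q.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) {
        a[i] += b[i];
    }).wait();

    std::printf("a[0] = %f on %s\n", a[0],
                q.get_device().get_info<sycl::info::device::name>().c_str());
    sycl::free(a, q);
    sycl::free(b, q);
    return 0;
}
```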

It’s interesting that Intel is still comparing its Ponte Vecchio GPUs to the Ampere A100, as the green team has since launched its next-gen Hopper H100 and has already shipped it to customers. Then again, if Chipzilla is that confident in its 2-2.5x performance figures, it shouldn’t have much trouble competing with Hopper either.

Here’s everything we know about the Intel 7 powered Ponte Vecchio GPUs

Moving on to the Ponte Vecchio specs, Intel outlined some key features of its flagship data center GPU: 128 Xe cores, 128 RT units, HBM2e memory, and Xe Link connectivity that allows up to 8 Xe-HPC GPUs to be interconnected. The chip holds up to 408MB of L2 cache across two stacks connected via the EMIB interconnect, and it features multiple dies built on Intel’s own ‘Intel 7’ process and TSMC’s N7/N5 process nodes.

Intel also previously described the package and die sizes of its flagship Ponte Vecchio GPU based on the Xe-HPC architecture. The chip consists of 2 stacks carrying 16 active compute dies between them. The largest compute die measures 41mm2, while the base die, the ‘Base Tile’, comes in at 650mm2. We’ve listed all the chiplets and process nodes that the Ponte Vecchio GPUs will use below:

  • Intel 7nm
  • TSMC 7nm
  • Foveros 3D packaging
  • EMIB
  • 10nm Enhanced SuperFin
  • Rambo cache
  • HBM2

Here’s how Intel reaches the 47 tiles on the Ponte Vecchio chip (16 + 8 + 2 + 11 + 2 + 8 = 47):

  • 16 Xe-HPC compute tiles (internal/external)
  • 8 Rambo cache tiles (internal)
  • 2 base tiles (internal)
  • 11 EMIB tiles (internal)
  • 2 Xe Link tiles (external)
  • 8 HBM tiles (external)

The Ponte Vecchio GPU uses 8 HBM 8-Hi stacks and contains a total of 11 EMIB interconnects, with the entire package measuring 4843.75mm2. Intel also mentioned that the bump pitch for Meteor Lake CPUs using high-density 3D Foveros packaging will be 36 microns.

The Ponte Vecchio GPU is not one chip but a combination of several. It’s a chiplet powerhouse with more chiplets than any other GPU or CPU out there, 47 to be exact. And those chiplets aren’t built on a single process node but on several, as we described a few days ago.

Although the Aurora supercomputer that will use Ponte Vecchio GPUs and Sapphire Rapids CPUs has been pushed back by repeated delays, it’s still good to see the company offering more details. Intel has since teased its next-generation Rialto Bridge GPU as the successor to Ponte Vecchio, which will reportedly begin sampling in 2023. You can read more details about this here.

Next Generation GPU Accelerators for Data Centers

| GPU Name | AMD Instinct MI250X | NVIDIA Hopper GH100 | Intel Ponte Vecchio | Intel Rialto Bridge |
|---|---|---|---|---|
| Packaging Design | MCM (Infinity Fabric) | Monolithic | MCM (EMIB + Foveros) | MCM (EMIB + Foveros) |
| GPU Architecture | Aldebaran (CDNA 2) | Hopper GH100 | Xe-HPC | Xe-HPC |
| GPU Process Node | 6nm | 4N | 7nm (Intel 4) | 5nm (Intel 3)? |
| GPU Cores | 14,080 | 16,896 | 16,384 ALUs (128 Xe cores) | 20,480 ALUs (160 Xe cores) |
| GPU Clock Speed | 1700 MHz | ~1780 MHz | Not yet known | Not yet known |
| L2/L3 Cache | 2 x 8 MB | 50 MB | 2 x 204 MB | Not yet known |
| FP16 Compute | 383 TOPs | 2000 TFLOPs | Not yet known | Not yet known |
| FP32 Compute | 95.7 TFLOPs | 1000 TFLOPs | ~45 TFLOPs (A0 silicon) | Not yet known |
| FP64 Compute | 47.9 TFLOPs | 60 TFLOPs | Not yet known | Not yet known |
| Memory Capacity | 128 GB HBM2e | 80 GB HBM3 | 128 GB HBM2e | 128 GB HBM3? |
| Memory Clock | 3.2 Gbps | 3.2 Gbps | Not yet known | Not yet known |
| Memory Bus | 8192-bit | 5120-bit | 8192-bit | 8192-bit |
| Memory Bandwidth | 3.2 TB/s | 3.0 TB/s | ~3 TB/s | ~3 TB/s |
| Form Factor | OAM | OAM | OAM | OAM v2 |
| Cooling | Passive, Liquid | Passive, Liquid | Passive, Liquid | Passive, Liquid |
| TDP | 560W | 700W | 600W | 800W |
| Launch | Q4 2021 | 2H 2022 | 2022? | 2024? |

