Tuesday, March 8, 2022

Apple M1 Ultra

Apple (Hacker News):

Featuring UltraFusion — Apple’s innovative packaging architecture that interconnects the die of two M1 Max chips to create a system on a chip (SoC) with unprecedented levels of performance and capabilities — M1 Ultra delivers breathtaking computing power to the new Mac Studio while maintaining industry-leading performance per watt. The new SoC consists of 114 billion transistors, the most ever in a personal computer chip. M1 Ultra can be configured with up to 128GB of high-bandwidth, low-latency unified memory that can be accessed by the 20-core CPU, 64-core GPU, and 32-core Neural Engine, providing astonishing performance for developers compiling code, artists working in huge 3D environments that were previously impossible to render, and video professionals who can transcode video to ProRes up to 5.6x faster than with a 28-core Mac Pro with Afterburner.

[…]

For the most graphics-intensive needs, like 3D rendering and complex image processing, M1 Ultra has a 64-core GPU — 8x the size of M1 — delivering faster performance than even the highest-end PC GPU available while using 200 fewer watts of power.

Apple:

  • Up to 3.8x faster CPU performance than the fastest 27-inch iMac with 10-core processor.
  • Up to 90 percent faster CPU performance than Mac Pro with 16-core Xeon processor.
  • Up to 60 percent faster CPU performance than 28-core Mac Pro.
  • Up to 4.5x faster graphics performance than the 27-inch iMac, and up to 80 percent faster than the fastest Mac graphics card available today.

Ken Shirriff:

Here are the two dies at the same scale. The M1 Ultra is much, much larger physically [than the ARM1]; I estimate it is 20x47mm. Its transistors are much smaller (5 nm vs 3000 nm) giving it 114 billion transistors instead of 25,000. If built with modern transistors, the ARM1 would be a tiny dot.
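The numbers in that comparison can be checked with a little arithmetic. This is only a rough sketch using the figures quoted above (the die dimensions are Shirriff's estimate). It treats "5 nm" as if it were a literal feature size, which is a simplification, since modern node names are marketing labels rather than measured dimensions. The variable names are my own.

```python
# Figures taken from the quotes above; the 20x47 mm die size is an estimate.
M1_ULTRA_TRANSISTORS = 114_000_000_000
ARM1_TRANSISTORS = 25_000
M1_ULTRA_FEATURE_NM = 5      # node name, treated here as a literal size (assumption)
ARM1_FEATURE_NM = 3000
DIE_WIDTH_MM, DIE_HEIGHT_MM = 20, 47

# How many times more transistors the M1 Ultra has than the ARM1.
transistor_ratio = M1_ULTRA_TRANSISTORS / ARM1_TRANSISTORS

# Linear shrink, and the corresponding area shrink per transistor.
linear_shrink = ARM1_FEATURE_NM / M1_ULTRA_FEATURE_NM
area_shrink = linear_shrink ** 2

# Estimated die area and the average transistor density that implies.
die_area_mm2 = DIE_WIDTH_MM * DIE_HEIGHT_MM
density_per_mm2 = M1_ULTRA_TRANSISTORS / die_area_mm2

print(f"{transistor_ratio:,.0f}x the transistors")
print(f"{linear_shrink:,.0f}x linear shrink, {area_shrink:,.0f}x by area")
print(f"{die_area_mm2} mm^2, about {density_per_mm2 / 1e6:.0f} million transistors per mm^2")
```

Under those assumptions, an ARM1 rebuilt at the smaller size would occupy roughly 1/360,000 of its original area, which is consistent with the "tiny dot" remark.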

Update (2022-03-16): Ryan Smith:

The net result is a chip that, without a doubt, manages to be one of the most interesting designs I’ve ever seen for a consumer SoC. As we’ll touch upon in our analysis, the M1 Ultra is not quite like any other consumer chip currently on the market. And while double die strategy benefits sprawling multi-threaded CPU and GPU workloads far more than it does more single-threaded tasks – an area where Apple is already starting to fall behind – in the process they’re breaking new ground on the GPU front. By enabling the M1 Ultra’s two dies to transparently present themselves as a single GPU, Apple has kicked off a new technology race for placing multi-die GPUs in high-end consumer and workstation hardware.

Jean-Louis Gassée (Hacker News):

First, benchmarks will reveal that, for a single thread, a single sequence of operations, the M1 Ultra isn’t faster than an entry-level M1 chip. This is because the clock speed associated with the 5nm process common to all M1 chips hasn’t changed for the M1 Ultra. The newer chip will particularly shine in multithreaded applications generally associated with media development (audio, video, animation…) and some software development. All of which constitute a juicy and traditional enough market for Apple whose control of its macOS system software helps maximize multithreading performance.

Second, the recourse to two M1 Max chips fused into an M1 Ultra means TSMC’s 5 nm process has reached its upper limit.

See also: Ben Sandofsky.

Update (2022-04-14): Vadim Yuryev (Hacker News):

Problem: Apple shows that the M1 Ultra GPU can use up to 105W of power. However, the highest we could ever get it to reach was around 86W.

[…]

Culprit: Each cluster of GPU cores within an M1/M1 Pro/M1 Max/M1 Ultra chip comes with a 32MB TLB or [Translation] Lookaside Buffer[…]

[…]

“If an application has not been optimized for the M1 GPU architecture’s tile memory, (not just Metal optimized) then every read/write needs to go all the way out to system memory. If the GPU compute task is issuing MANY little reads, then this will saturate the TLB. The issue is if GPU data hits the TLB and the page table being read/written to is not loaded, then that entire thread group on the GPU needs to pause while the page table is loaded into the TLB. […] The effort needed to optimize for tile memory is MASSIVE.”
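To get a feel for the mechanism described in that quote, here is a toy model of a TLB as a small cache of page translations. It is a sketch under loud assumptions: the 64-entry size, the 16 KiB page size, the LRU eviction policy, and the `tlb_misses` helper are all illustrative choices of mine and say nothing about how the M1 GPU's TLB is actually built. It only shows why many small scattered reads miss far more often than sequential ones.

```python
from collections import OrderedDict
import random

PAGE_SIZE = 16384   # assumed page size, for illustration only
TLB_ENTRIES = 64    # assumed TLB capacity, for illustration only

def tlb_misses(addresses, entries=TLB_ENTRIES, page_size=PAGE_SIZE):
    """Count misses in a toy fully associative TLB with LRU eviction."""
    tlb = OrderedDict()          # page number -> placeholder, ordered by recency
    misses = 0
    for addr in addresses:
        page = addr // page_size
        if page in tlb:
            tlb.move_to_end(page)        # hit: mark as most recently used
        else:
            misses += 1                  # miss: the translation must be loaded
            tlb[page] = True
            if len(tlb) > entries:
                tlb.popitem(last=False)  # evict the least recently used page
    return misses

N = 100_000
# Sequential 4-byte reads: many accesses share each page.
sequential = [i * 4 for i in range(N)]
# Scattered reads across a 1 GiB range: almost every access lands on a new page.
rng = random.Random(0)
scattered = [rng.randrange(0, 1 << 30) for _ in range(N)]

print("sequential misses:", tlb_misses(sequential))
print("scattered misses: ", tlb_misses(scattered))
```

In this model the sequential pattern misses only once per page it touches, while the scattered pattern misses on nearly every access. In the quote's account, each such miss stalls a whole thread group while the page table entry is loaded.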

Update (2022-04-19): Sami Fathi:

In a rare media interview, Apple’s senior vice president of hardware technologies, Johny Srouji, discussed Apple’s transition to Apple silicon for the Mac, the challenges of developing chips for the Mac amid a global health crisis, and more.
