Wednesday, October 30, 2024

Apple M4 Pro and M4 Max

Apple (Hacker News, MacRumors):

All three chips are built using industry-leading, second-generation 3-nanometer technology, which improves performance and power efficiency. The CPUs across the M4 family feature the world’s fastest CPU core, delivering the industry’s best single-threaded performance, and dramatically faster multithreaded performance. The GPUs build on the breakthrough graphics architecture introduced in the previous generation, with faster cores and a 2x faster ray-tracing engine. M4 Pro and M4 Max enable Thunderbolt 5 for the Mac for the first time, and unified memory bandwidth is greatly increased — up to 75 percent. Combined with a Neural Engine that’s up to 2x faster than the previous generation and enhanced machine learning (ML) accelerators in the CPUs, the M4 family of chips brings incredible performance for pro and AI workloads.

[…]

M4 Pro features an up to 14-core CPU consisting of up to 10 performance cores and four efficiency cores. It’s up to 1.9x faster than the CPU of M1 Pro, and up to 2.1x faster than the latest AI PC chip. The GPU features up to 20 cores for graphics performance that is 2x that of M4, and up to 2.4x faster than the latest AI PC chip. […] M4 Pro supports up to 64GB of fast unified memory and 273GB/s of memory bandwidth, which is a massive 75 percent increase over M3 Pro and 2x the bandwidth of any AI PC chip.

[…]

M4 Max is the ultimate choice for data scientists, 3D artists, and composers who push pro workflows to the limit. It has an up to 16-core CPU, with up to 12 performance cores and four efficiency cores. It’s up to 2.2x faster than the CPU in M1 Max and up to 2.5x faster than the latest AI PC chip. The GPU has up to 40 cores for performance that is up to 1.9x faster than M1 Max and up to an astounding 4x faster than the latest AI PC chip. […] M4 Max supports up to 128GB of fast unified memory and up to 546GB/s of memory bandwidth, which is 4x the bandwidth of the latest AI PC chip.

The RAM ceiling for the Pro chip has increased from 36 GB to 64 GB, but for the Max it’s unchanged at 128 GB.

Here’s a summary of the cores situation:

      Regular   Pro      Max
M1    4p/4e     8p/2e    8p/2e
M2    4p/4e     8p/4e    8p/4e
M3    4p/4e     6p/6e    12p/4e
M4    4p/6e     10p/4e   12p/4e

So this seems like a bit of a return to form, where the Pro is closer to the Max, and the Max is mostly attractive for GPU performance and RAM capacity, rather than the CPU. (And you need a $900 BTO option to get those extra 2 cores on the M4 Max.)

Update (2024-10-31): Andrew Cunningham:

Because Apple staggered its product and chip announcements, we’ve gathered some basic specs from all versions of the M4, M4 Pro, and M4 Max to help compare them to the outgoing M2 and M3 chip families, including the slightly cut-down versions that Apple sells in the cheaper new Macs. We’ve also rounded up some of Apple’s performance claims, so people with older Macs can see exactly what they’re getting if they upgrade (Apple still likes to use the M1 as a baseline, acknowledging that the year-over-year gains are sometimes minor and that many people are still getting by just fine with some version of the M1 chip).

Hartley Charlton:

So how do the three latest-generation Apple silicon chips compare and which should you choose?

Update (2024-11-04): AppleLeaker (via Hacker News):

Apple’s M4 Max is the first production CPU to pass 4000 Single-Core score in Geekbench 6. The M4 Max is faster than the M2 Ultra in almost every way (90% as powerful as M2 Ultra in GPU). Simply incredible.

Joe Rossignol (Hacker News):

Impressively, the results that are available so far show that the highest-end M4 Pro chip is faster than the highest-end M2 Ultra chip in terms of peak multi-core CPU performance.

I’m not sure how the benchmark is constructed, but that’s surprising given that the M2 Ultra has 10 more cores.

Joe Rossignol:

Based on the Metal scores that are available so far, the M4 Pro and M4 Max are up to around 40% and 25% faster for graphics than the M3 Pro and M3 Max chips, respectively.

Joe Rossignol:

The first Geekbench 6 benchmark results for the high-end M4 Max chip with a 16-core CPU surfaced today, and they show that the chip is up to 25% faster than the high-end M2 Ultra chip with a 24-core CPU in terms of peak multi-core CPU performance.

[…]

As we mentioned in our previous reporting, you can now purchase a Mac mini with a 14-core M4 Pro for $1,599 in the U.S. and get similar to faster peak performance than a Mac Studio with the 24-core M2 Ultra, a configuration that starts at $3,999.

Howard Oakley:

As these are complicated by sub-variants and binned versions, I have brought the details together in a table.

[…]

I’ve been looking to replace my original Mac Studio M1 Max. As it looks likely that an M4 version of the Studio won’t be announced until well into next year, I’m taking the opportunity to shrink its already modest size to that of a new Mac mini. What better choice than an M4 Pro with 10 P and 4 E cores and a 20-core GPU, and the optional 10 Gb Ethernet?

Update (2024-11-11): Howard Oakley (Hacker News):

All CPU cores are arranged in clusters of up to 6. All cores within any given cluster share L2 cache, and are run at the same frequency (clock speed). The Base M4 has a single cluster of 4 P cores, while the Pro and Max have two clusters of 5 and 6 cores respectively.

[…]

Threads are normally allocated by macOS to an available P core when their designated Quality of Service (QoS) is higher than 9 (Background), for example when using Dispatch, formerly branded Grand Central Dispatch (GCD). Running threads may also be moved periodically between P cores in the same cluster, and between clusters. Previous M-series chips appear to move threads less frequently, and may leave them to run to completion after several seconds on the same core, but threads appear to be considerably more mobile when running on M4 P cores.
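
As a rough illustration of the QoS threshold described above, here is a minimal Python sketch. The quote only gives the value 9 for Background; the other numeric values are what I understand the macOS QoS classes to be, and `preferred_core_type` is a hypothetical helper, not anything macOS exposes. It models only the stated preference, not availability, cluster choice, or thread mobility.

```python
from enum import IntEnum


class QoS(IntEnum):
    """Numeric QoS classes; only BACKGROUND = 9 comes from the quoted text."""
    BACKGROUND = 9
    UTILITY = 17
    DEFAULT = 21
    USER_INITIATED = 25
    USER_INTERACTIVE = 33


def preferred_core_type(qos: int) -> str:
    """Toy rule: a QoS higher than 9 (Background) prefers a P core, else an E core."""
    return "P" if qos > QoS.BACKGROUND else "E"


for level in QoS:
    print(f"{level.name:<17} ({level.value:>2}) -> {preferred_core_type(level)} core")
```

In practice a higher-QoS thread can still land on an E core when no P core is available, which this sketch deliberately ignores.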

Andreas Hegenberg:

The only benchmark that matters to me: Clean build of BetterTouchTool on M1 Max: 182s, on M4 Max: 99s (in general building seems significantly slower on Xcode 16 than on older Xcode versions)

Update (2024-11-13): Sherief, FYI:

Apple has the best CPU scheduler and the only one that takes efficiency (perf/watt) and thermal headroom into account that I know of. Incredible work really.

Howard Oakley:

All virtualised threads are treated by the host as if they are running at high Quality of Service (QoS), so are preferentially allocated to P cores, even though their original thread may be running at the lowest QoS. This has the side-effect of running virtual background processes considerably quicker than real background threads on the host.

Andreas Osthoff:

The M3 generation already offered extremely good single-core performance, leaving the competition from AMD, Intel and Qualcomm in the dust. Only Intel’s new desktop processor, the Core Ultra 9 285K, was on a comparable level to the M3 SoCs. Apple has stepped up its game even more with its new M4 processors, further widening the gap massively. The P-cores’ maximum clock rate, which is about 500 MHz higher, results in a performance boost of over 20% compared to the M3 models.

[…]

Depending on the test, its lead over the old M3 Pro with 12 cores was between 47-57% and, as a result, the new M4 Pro is on par with the old full M3 Max. This is a considerable increase in performance and, especially within the 14-inch field, the new M4 Pro faces no competition—neither from AMD, Qualcomm nor Intel. The only exception in this case is the Ryzen AI 9 HX370 inside the Asus TUF A14, which can permanently consume 80 watts and performed slightly better in the CB-R23 test.

Update (2024-11-18): Howard Oakley (Hacker News):

E cores running low QoS threads at close to minimum frequency take about four times as long, 38.5 seconds, but use less than 45 mW power per thread. Total energy used to complete one thread is therefore over 23 J when run on P cores, and less than 1.7 J when run on E cores. E cores therefore use only 7% of the energy that P cores do performing the same task.
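
The arithmetic behind those figures is just energy = power × time. A quick sketch using only the quoted numbers, which are bounds ("less than", "over") rather than exact measurements; the implied P-core time and power are my own rough derivation from "about four times as long", not values from the article.

```python
# Figures as quoted; treat them as bounds, not exact measurements.
E_CORE_TIME_S = 38.5      # time for one thread on an E core at low QoS
E_CORE_POWER_W = 0.045    # "less than 45 mW" per thread (upper bound)
E_CORE_ENERGY_J = 1.7     # "less than 1.7 J" per thread
P_CORE_ENERGY_J = 23.0    # "over 23 J" per thread (lower bound)


def energy_joules(power_w: float, time_s: float) -> float:
    """Energy in joules from average power in watts and duration in seconds."""
    return power_w * time_s


# Upper bound on E-core energy from the quoted power and time.
e_energy_bound = energy_joules(E_CORE_POWER_W, E_CORE_TIME_S)

# Ratio of the quoted E-core and P-core energies.
energy_ratio = E_CORE_ENERGY_J / P_CORE_ENERGY_J

# Rough derivation (assumption): E cores take about 4x as long as P cores.
p_time_s = E_CORE_TIME_S / 4
p_power_w = P_CORE_ENERGY_J / p_time_s

print(f"E-core energy bound: {e_energy_bound:.2f} J")
print(f"E/P energy ratio:    {energy_ratio:.1%}")
print(f"Implied P-core time: {p_time_s:.1f} s at about {p_power_w:.1f} W")
```

The ratio comes out a little over 7%, in line with the quoted figure, and the implied P-core power is on the order of a couple of watts per thread.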

Update (2024-11-22): Howard Oakley:

In this series I concentrate on much narrower concepts of performance in CPU cores, to provide deeper insight into topics such as core types and energy efficiency. This article examines the in-core performance of P and E cores, and how they differ.

[…]

P core frequencies have increased substantially since the M1. If we set that as 100%, M3 P cores run at around 112-126% of that frequency, and those in the M4 at 140%.
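
To put those percentages in absolute terms, here is a small sketch. The M1 baseline of about 3.2 GHz is my assumption and is not stated in the quote, so the resulting figures are only approximate.

```python
# Assumption: approximate M1 P-core maximum frequency; not given in the quote.
M1_P_MAX_GHZ = 3.2

# Relative P-core frequencies from the quote, as (low, high) fractions of the M1.
relative = {
    "M1": (1.00, 1.00),
    "M3": (1.12, 1.26),
    "M4": (1.40, 1.40),
}


def to_ghz(rel_range, base=M1_P_MAX_GHZ):
    """Convert a (low, high) relative range into GHz, rounded to two decimals."""
    low, high = rel_range
    return (round(low * base, 2), round(high * base, 2))


for chip, rel in relative.items():
    low, high = to_ghz(rel)
    if low == high:
        print(f"{chip}: ~{low} GHz")
    else:
        print(f"{chip}: ~{low} to {high} GHz")
```

Under that assumption the M4 lands at roughly 4.5 GHz, a few hundred MHz above the top of the M3 range.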

Update (2024-11-25): Howard Oakley:

This article tries to estimate the cost in terms of power and energy of running identical tests on M4 P and E cores, and thereby provide insight into some of the most distinctive features of Apple silicon, and their benefits.

Update (2024-11-27): Howard Oakley:

From early work by Dougall Johnson on the M1, it has been known that some of the functions in Apple’s vast Accelerate maths libraries can run code on the AMX. Thanks to the guidance of Maynard Handley, a year ago I concluded that one of those is the vDSP_mmul function in the vDSP sub-library. This article reports tests of that function in a Mac mini M4 Pro running Sequoia 15.1.1, leads on to an explanation of previous results using floating point and NEON tests, and considers the effects of Power Modes.

Update (2024-12-03): Howard Oakley:

When running on Apple silicon Macs, macOS modulates ‘cluster HW active frequency’ of P cores, limiting frequency to below maximum when cluster total active residency exceeds 100%.

Although small in M1 variants, this is most prominent in M4 variants, where a total active residency of 300% may reduce cluster frequency to 87% of maximum.

Frequency limitation is most probably part of a pre-emptive strategy in thermal management.

Update (2024-12-06): Howard Oakley:

For the purposes of this article, I’ll consider a single thread that macOS is ready to load onto a CPU core for execution. For that to happen, five decisions are to be made:

  • which type of core, P or E,
  • which cluster to run it in,
  • which core within that cluster,
  • what frequency to run that cluster at,
  • the mobility of that thread between cores in the same cluster, and between clusters (when available).

1 Comment

I'm hoping that Apple is finally getting their chip development spun around in the 'correct' (IMHO) orientation, where the fastest chips come first, with the advancements trickling down to the base chips. That's what best matches demand-to-yield supply, pricing, and customer loyalty. It also better matches with Apple's narrative that they make the BEST computers. At… all… times. Shipping M1 MacBook Airs before Pros was not a good look, in my view, because it communicated that Apple was prioritizing profit over loyalty/engineering. It was a stunning PR move/victory, however, I can admit. But we're now 4 cycles into this and the Mac mini is besting the Mac Studio and the Mac Pro, which are both so stale as to be laughable. I'm hoping to see this finally ironed out by the next generation, since it -seems- it is the direction they're headed.

I'd also like to see Apple 'commoditize' their offerings a bit better. There's little reason that the iMac M4 and Mac mini M4 aren't using the same basic motherboard… in fact, throw the MacBook Air M4 into the mix too. Then they could all get updated at the same time, and Apple could manage 'supply constraint' on demand. For that matter, throw the iPhone Pro and the iPad mini and iPad into that group, though following 6 months behind. The Mac Studio and Mac Pro (and perhaps Mac mini Pro, at the low end) should share a base motherboard design. We're at the point where so much is on the SoC, I really can't see too much differentiation beyond the taillight fins, color, and chromed grille parts.

(And, once again, I'm going to beat the drum on my concept of the iMac as a dual-duty PC *and* external screen. The Studio Display could bump to $2K if it included an M4 or M4 Pro, and would sell like hotcakes as an "iMac Studio". Then you'd effectively have a 24-inch display and a 27- or 30-inch "smart" display, with Thunderbolt, available, AND when you plugged two together with a Mac Studio Max or Mac Pro Ultra would effectively be a 3 or 4 core MONSTER LLM cluster. This is a no-brainer, folks… Tim. This is the future. Apple could have done it 2 years ago. Why they haven't is a complete mystery to me and the failure to 'innovate' with what's obvious makes their engineering team look pretty dumb in my eyes. I'm assuming "marketing" is the problem… as it has always been at Apple.)
