Monday, November 23, 2020

M1 Memory and Performance

Marcel Weiher (Hacker News):

The M1 is apparently a multi-die package that contains both the actual processor die and the DRAM. As such, it has a very high-speed interface between the DRAM and the processors. This high-speed interface, in addition to the absolutely humongous caches, is key to keeping the various functional units fed. Memory bandwidth and latency are probably the determining factors for many of today’s workloads, with a single access to main memory taking easily hundreds of clock cycles and the CPU capable of doing a good number of operations in each of these clock cycles. As Andrew Black wrote: “[..] computation is essentially free, because it happens ‘in the cracks’ between data fetch and data store; ..”.

[…]

The benefit of sticking to RC is much-reduced memory consumption. It turns out that for a tracing GC to achieve performance comparable with manual allocation, it needs several times the memory (different studies find different overheads, but at least 4x is a conservative lower bound). While I haven’t seen a study comparing RC, my personal experience is that the overhead is much lower, much more predictable, and can usually be driven down with little additional effort if needed.

So Apple can afford to live with more “limited” total memory because they need much less memory for the system to be fast. And so they can do a system design that imposes this limitation, but allows them to make that memory wicked fast. Nice.
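Weiher's "hundreds of clock cycles" figure is easy to reproduce. A pointer-chasing loop, where every load depends on the previous one, leaves the CPU nothing to do "in the cracks," so the time per iteration approximates raw memory latency. Here's a minimal sketch in C (the buffer size and seed are arbitrary):

```c
/* Pointer-chasing latency sketch: each load depends on the previous one,
 * so the CPU cannot hide DRAM latency behind other work. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)   /* 16M entries * 8 bytes = 128MB, far bigger than any cache */

int main(void) {
    size_t *next = malloc(N * sizeof *next);
    if (!next) return 1;

    /* Build a single random cycle (Sattolo's algorithm) so the
     * hardware prefetcher cannot predict the next address. */
    for (size_t i = 0; i < N; i++) next[i] = i;
    srand(42);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;            /* j < i keeps it one cycle */
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    clock_t start = clock();
    size_t p = 0;
    for (size_t i = 0; i < N; i++) p = next[p];   /* serial dependent loads */
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

    printf("%.1f ns per dependent load (p=%zu)\n", secs * 1e9 / N, p);
    return 0;
}
```

On typical hardware this prints something on the order of 100ns per load, i.e. hundreds of cycles at 3GHz; shrink N until the buffer fits in cache and the figure drops by an order of magnitude.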

Mike:

The memory bandwidth on the new Macs is impressive. Benchmarks peg it at around 60GB/sec–about 3x faster than a 16” MBP. Since the M1 CPU only has 16GB of RAM, it can replace the entire contents of RAM 4 times every second.

[…]

Some say we’re moving into a phase where we don’t need as much RAM, simply because as SSDs get faster there is less of a bottleneck for swap. […] However, with the huge jump in performance on the M1, the SSD is back to being an order of magnitude slower than main memory.

So we’re left with the question: will SSD performance increase faster than memory bandwidth? And at what point does the SSD to RAM speed ratio become irrelevant?

Graham Lee:

And that makes me think that a Mac would either not go full NUMA, or would not have public API for it. Maybe Apple would let the kernel and some OS processes have exclusive access to the on-package RAM, but even that seems overly complex (particularly where you have more than one M1 in a computer, so you need to specify core affinity for your memory allocations in addition to memory type). My guess is that an early workstation Mac with 16GB of M1 RAM and 64GB of DDR4 RAM would look like it has 64GB of RAM, with the on-package memory used for the GPU and as cache. NUMA APIs, if they come at all, would come later.
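macOS has no public NUMA API, so for a sense of what one even looks like, here is a sketch using Linux's libnuma; the two-node split standing in for fast on-package RAM versus DDR4 is purely hypothetical:

```c
/* Illustration only: Linux's libnuma, showing the kind of per-node
 * allocation a NUMA-aware Mac API would need to expose.
 * Compile with: cc numa_demo.c -lnuma */
#include <stdio.h>
#include <numa.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }
    printf("nodes available: %d\n", numa_max_node() + 1);

    size_t len = 1UL << 30;                  /* 1GB per allocation */
    void *fast = numa_alloc_onnode(len, 0);  /* node 0: hypothetical on-package RAM */
    void *slow = numa_alloc_onnode(len, 1);  /* node 1: standing in for DDR4 */
    if (!fast || !slow) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    numa_free(fast, len);
    numa_free(slow, len);
    return 0;
}
```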

Previously:

Update (2020-11-25): David Smith:

this further improvement is because uncontended acquire-release atomics are about the same speed as regular load/store on A14
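Uncontended acquire-release atomics are exactly what reference counting leans on. As an illustration, here is the textbook retain/release pattern in C11 atomics (the generic pattern, not Apple's actual objc_retain/objc_release):

```c
/* Toy refcount using C11 atomics with the usual memory orderings:
 * relaxed increments, a release decrement, and an acquire fence
 * before deallocation. */
#include <stdatomic.h>
#include <stdlib.h>

typedef struct {
    atomic_size_t refcount;
    /* ...object payload... */
} object_t;

static void retain(object_t *obj) {
    /* Increments only need atomicity, not ordering. */
    atomic_fetch_add_explicit(&obj->refcount, 1, memory_order_relaxed);
}

static void release(object_t *obj) {
    /* The release ordering publishes this thread's writes before the
     * count drops... */
    if (atomic_fetch_sub_explicit(&obj->refcount, 1,
                                  memory_order_release) == 1) {
        /* ...and the acquire fence makes all writes visible before free. */
        atomic_thread_fence(memory_order_acquire);
        free(obj);
    }
}

int main(void) {
    object_t *obj = calloc(1, sizeof *obj);
    if (!obj) return 1;
    atomic_init(&obj->refcount, 1);
    retain(obj);    /* count: 2 */
    release(obj);   /* count: 1 */
    release(obj);   /* count: 0, freed */
    return 0;
}
```

If such operations cost about the same as plain loads and stores, the constant per-object bookkeeping that RC does becomes nearly free.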

Juli Clover:

The video includes a series of benchmark tests, ranging from Geekbench and Cinebench to RAW exporting tests. Geekbench and Cinebench benchmarks didn’t demonstrate a difference in performance between the 8GB and 16GB models, but other tests designed to maximize RAM usage did show some differences.

A Max Tech Xcode benchmark that mimics compiling code saw the 16GB model score 122 compared to the 136 scored by the 8GB model, with the lower score being better.

Populus:

Beware of the swap disk space!

In most of the benchmarks performed on 8GB M1 machines, when Activity Monitor is shown, swap usage is consistently between 2.5GB and 4GB, or even more. In my 10 years of being a Mac user, I've never seen such a large swap space in use unless I was stressing my machine heavily, and that usage may be aging your SSD.
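The swap numbers Activity Monitor shows are also available programmatically via the vm.swapusage sysctl (or from Terminal with `sysctl vm.swapusage`). A minimal C sketch:

```c
/* Reads macOS swap usage, the same figures Activity Monitor displays. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/sysctl.h>

int main(void) {
    struct xsw_usage swap;
    size_t len = sizeof swap;
    if (sysctlbyname("vm.swapusage", &swap, &len, NULL, 0) != 0) {
        perror("sysctlbyname");
        return 1;
    }
    printf("swap used: %.2f GB of %.2f GB\n",
           swap.xsu_used / 1e9, swap.xsu_total / 1e9);
    return 0;
}
```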

8 Comments

I'm finding the argument that RC means Macs need less memory not to be particularly credible.

In the mobile space, sure, Android uses Java, which uses GC.

But on desktops one has native apps and non-native apps. Non-native apps run on both platforms, so they'll be using a similar amount of memory on both. That leaves native apps. Windows presumably ships more than zero C# and .NET apps, but most apps are, AFAIK, still written in C/C++, which don't use garbage collection. So RC provides no advantage there. Most native apps on Linux are also C or C++. Again, no advantage.

That leaves binary sizes... but ARM64 is slightly larger than x86_64...

RAM being an order of magnitude faster than SSDs makes RAM vital to Macs' performance. Instead of inventing arguments as to why 16GB is suddenly fine, it seems more likely to me that the reason is much more prosaic: Apple can't buy 16GB RAM parts from third-party manufacturers, only 8GB parts, and 2 x 8 = 16.

@Old: I'm not sure what you're talking about.

Windows has at least a couple of apps written in *JavaScript*. Besides, "more than zero" GC runtimes are enough to add memory pressure to the entire system. You don't need "most apps" to be written in a GC language. The Titanic didn't need *most* pieces of the hull to have holes in them to add water pressure to the inside of the ship. "More than zero" holes was enough.

Linux is even easier to analyze, since we can just look at the source code. Many popular desktop apps are written in C# (Mono) or even Python (PyGTK).

As for binary sizes, ARM64 is smaller than x86-64 for every executable I've ever seen -- from 5% to 20% smaller. iMovie's binary, for example, is 7% smaller for ARM64 than x86-64. (That's not unique to the Mac, either. Every Debian package I've checked is also smaller for ARM64.) What do you mean by ARM64 is "slightly larger"? Do you have an example?

The previous MacBook (LPDDR3 or DDR4) had ~33GB/s of memory bandwidth, since they are configured dual-channel, not 20GB/s. So it would be something less than 2x.

And my gut feeling and experience is that most CPU workloads aren't memory-bandwidth intensive. CPUs are more latency sensitive, hence the use of SRAM caches. You will still need more memory bandwidth as core counts increase, but it's not as though an x86 8-core with 16 threads (i.e., 16 cores in Apple's terms) is starved of bandwidth. That bandwidth goes to the GPU and NPU, which are *very* memory intensive.

The next generation LPDDR5 (or more like current leading edge, since it is already being used in Samsung's phones) goes up to 50GB/s, or 100GB/s dual-channel. Impressive at first sight, but still not quite good enough if Apple doubles the M1's GPU config again for its 16" Pro devices. I keep wondering whether Apple will do quad memory channel.

From Paul Haddad of Tapbots fame:

> There’s also no magic memory: if you needed > 16GB before, you’ll need it now. This machine does swap under memory-intensive operations, and when it does, performance takes a dump. Everything, including the UI, will start dragging.

https://twitter.com/tapbot_paul/status/1329829093731348489?s=20

@S

Besides, "more than zero" GC runtimes are enough to add memory pressure to the entire system

That statement is simply wrong. Each app uses its own memory pool. If I have 2 Haskell programs running with small heaps, they do not "hole the Titanic" unless they have a space leak. If they use less memory-intensive algorithms, they'll use less memory than a more poorly written C++ program (space/time tradeoff). Your argument would imply that simply running a single web browser would be sufficient to kill all performance. If the M1 suffered from that, it would be a useless chip.

In all my years of using Linux, I've never run, downloaded, or otherwise touched a Mono app. Just because Miguel de Icaza is good at getting attention doesn't mean Mono is widely used. Python uses ... reference counting, which is also known as "RC" and not garbage collection; RC is the very thing being claimed to be "so efficient".

As to ARM64's size: http://web.eece.maine.edu/~vweaver/papers/iccd09/ll_document.pdf Thumb was closer to x86, but we're not doing Thumb anymore. (Instruction density is also kind of the entire point of CISC.) Obviously it also depends on what the compilers do for each architecture... I haven't compared those directly either on Mac or Linux, and the trade-offs made for optimisation will make a difference. If you actually go deeper and figure out what the differences are (binaries contain code, but also data), I'd be curious to hear what you find out.

Btw, what Windows apps shipped with Windows are written in JS? I haven't dug deeply into Windows internals since XP.

@Chris

Thanks for the confirmation.

Ksec: I'm sure you know this, but there is a large difference between memory bandwidth and latency.

Memory bandwidth is almost marketing gibberish; it is only really achievable in perfect, lab-type conditions. Latency is really where it's at.

@Old: There's an ocean of difference between "add memory pressure" and "kill all performance" / "a useless chip". One does not imply the other. That's a perfect example of an Appeal to Extremes.

I'm not sure why you're making jabs at Miguel de Icaza for "getting attention" for software that's not "widely used", simply because you have not used it. We could say the same about Steve Jobs: most people don't use Macs, either. But many do. GC'd desktop apps are not uncommon on Linux systems.

Python uses both RC and GC. The GC is optional ("gc.disable()") if you're certain you have no cycles, but it's on by default. I've never seen any Python program disable it except in tests or benchmarks.
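The failure mode the cycle collector exists for is easy to sketch with a toy refcount in C (not Python's actual implementation): two objects that reference each other never reach a count of zero, so plain RC alone leaks them.

```c
/* Why pure reference counting needs a backup cycle collector:
 * mutually-referencing nodes keep each other's count above zero. */
#include <stdlib.h>

typedef struct node {
    int refcount;
    struct node *other;       /* strong reference to the peer */
} node_t;

static node_t *new_node(void) {
    node_t *n = malloc(sizeof *n);
    if (!n) abort();
    n->refcount = 1;          /* one reference: the caller */
    n->other = NULL;
    return n;
}

static void release(node_t *n) {
    if (n && --n->refcount == 0) {
        release(n->other);
        free(n);
    }
}

int main(void) {
    node_t *a = new_node();
    node_t *b = new_node();
    a->other = b; b->refcount++;   /* a -> b */
    b->other = a; a->refcount++;   /* b -> a: a cycle */

    /* Drop the callers' references: each count falls from 2 to 1 and
     * never reaches 0, so both nodes leak. A tracing pass (like
     * Python's gc) is what catches this. */
    release(a);
    release(b);
    return 0;
}
```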

For code size, I'm looking at actual Mac applications I downloaded this week. You're citing an old paper that looks at the size of some Linux benchmarks, which it says were intentionally picked to be small enough to compare against hand-written assembly. The total size of all code is around 1000 *bytes*. The functions they're using (like LZSS) look nothing like typical application code (one malloc/memset of a fixed-size buffer, looping with bit-twiddling, only static functions, and none with more than 2 params). The updated paper still claims to use gcc 4.1/4.2 (2006/2007), though that version doesn't support ARM64 (which didn't exist at the time), so I'm not convinced they're even comparing the same compiler across architectures.

Obviously, if your test code only uses a couple registers, the benefits of having 31 GPRs won't be visible. Personally, I write big (>1000 byte) programs, so I think having lots of registers is great!

I believe Raymond Chen said that he knew at least Weather and News were JS apps as of Windows 8. It looks like the Start Menu also uses JS for some dynamic content, but I'm not sure. Also, Visual Studio is at least partly written in C#.

P.S., your claim (i.e., that RC doesn't mean Macs need less memory) is orthogonal to Chris's/Paul's statement. Paul's comparison is ARM-v-x86 (hardware), not RC-v-GC (software).

@S

Well... you were the one that brought up the Titanic. A single hole would be enough to sink it, and all that. I think that was the original Appeal To Extremes in this thread.

Seriously, what Mono software do you use? I'm not aware of any widely used Linux software based on Mono. On their vanity list, the only thing I even recognise is Unity 3D https://www.mono-project.com/docs/about-mono/showcase/software/ which doesn't fall under my definition of "desktop software".

RC way dominates GC in every program I've written in Python... and yes, I've written complex 50k-line programs in it. I've basically never even noticed its GC, and had kind of forgotten about it entirely. That's totally different from Haskell, where GC is very noticeable. (Haskell is a lazy language, so all sorts of things end up on the heap. It's been a long time since I last measured it, but I don't remember its overhead going over 2x. Even so, that's annoying, and I used static allocation for long-lived data.) Based on the following, https://instagram-engineering.com/dismissing-python-garbage-collection-at-instagram-4dca40b29172 it seems that GC costs Python approximately a 25% memory overhead... which is not good, but nowhere near as bad as the claimed 4x.

As to ARM's code density, we'll just have to differ. A tight codec loop will happily use every register it can get, fit into 1024 bytes, and use most of the CPU time (been there... done that). Of course, I very much agree with you that having more registers makes life easier. Back on the 68000, I sometimes had to use a7 (the stack pointer) in user mode, because 15 registers (d0-d7/a0-a6) weren't enough. Luckily, interrupts used a7 in supervisor mode, which was a different register (ssp versus usp), so they didn't hose the machine. Moving to 32-bit x86 and its 8 puny registers was a real pain. ARM32 was much more pleasant. I have not yet needed to write any ARM64 myself. Compilers make this kind of measurement rather arbitrary, which is why comparing hand-written assembly makes more sense to me. But even so, even a 5% or 20% size advantage for ARM wouldn't make it likely that "16GB is enough on M1 but not on x86".

Thanks for the Windows information. It would make sense for Visual Studio to now use C#. I wonder which is less annoying at this point: Xcode or Visual Studio. I do not have particularly fond memories of either of them.

At this point I still don't find the argument that Macs need less memory because of RC particularly credible... Whatever effects it has will be drowned out by other considerations (what the windowing library does internally, how the program uses memory for processing, etc.). It's hard to write code that minimises memory usage, so it's rarely done.
