Monday, November 16, 2020

Performance of Rosetta 2 on Apple M1

Frank McShan (tweet, Hacker News):

The new Rosetta 2 Geekbench results uploaded show that the M1 chip running on a MacBook Air with 8GB of RAM has single-core and multi-core scores of 1,313 and 5,888 respectively. Since this version of Geekbench is running through Apple’s translation layer Rosetta 2, an impact on performance is to be expected. Rosetta 2 running x86 code appears to be achieving 78%-79% of the performance of native Apple Silicon code.

Despite the impact on performance, the single-core Rosetta 2 score still outperforms every Intel Mac, including the 2020 27-inch iMac with Intel Core i9-10910 @ 3.6GHz.

Chris Randall:

On the whole, our general opinion is that as a producer you won’t really notice (or even be able to tell) whether a plugin or host is running native ARM or in Rosetta 2. The CPU load should be more or less the same; the ARM version may be slightly lower, but this is difficult to measure consistently.


Update (2020-11-19): Joe Rossignol:

Microsoft this week indicated that when launching any of its Mac apps for the first time on Apple Silicon Macs, the apps will bounce in the dock for approximately 20 seconds while the Rosetta 2 translation process is completed, with all subsequent launches being fast.

Brendan Shanks:

We’re making it official: @codeweavers CrossOver/Wine runs 32- and 64-bit Windows apps/games on Apple Silicon Macs! And it works today!

Big thanks to the Rosetta folks at Apple and everyone at CodeWeavers for their hard work on this.

Colin Cornaby:

Stuff like this makes me hope that Rosetta sticks around in some form for a very long time. PowerPC wasn’t a big industry force that required long term compatibility. But the x86 platform will be with us for a long while, even if Apple leaves it.

Update (2020-11-27): Robert Graham:

So Apple simply cheated. They added Intel’s memory-ordering to their CPU. When running translated x86 code, they switch the mode of the CPU to conform to Intel’s memory ordering.

With underlying architectural issues ironed out, running x86 code simply means translating those instructions to the Arm equivalent. This is very efficient and results in code that often runs at the same speed.

10 Comments

Does M1 come with a complimentary copy of Geekbench, or why is everyone only running that benchmark? /s

Old style emulators emulate every feature of a CPU, interpreting every instruction one at a time. Most of the time, that's a waste of time because the instruction that follows only cares about some of the results of its preceding instruction (condition codes, and the like are often ignored). Rosetta on the other hand converts instructions into a sequence of abstract operations specifying what each instruction does, and then discards any operation whose result is not used. The innovation was finding a representation that allowed this to be done very quickly. Thus Rosetta produces tighter assembly, but it's not as tight as what a good compiler produces, let alone handcrafted assembly.
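A toy sketch of the idea described above, in Python. Each guest instruction is expanded into abstract operations (including condition-flag updates), and a backward liveness pass drops any operation whose result is never read. The instruction set, the `Op` format, and `eliminate_dead_ops` are invented for illustration, and the sketch assumes flags are dead at the end of the block; it is not Rosetta's actual representation.

```python
from typing import NamedTuple


class Op(NamedTuple):
    dest: str     # location written (register, flag, or "PC")
    srcs: tuple   # locations read
    text: str     # human-readable description


def expand(instr):
    """Expand one toy guest instruction into abstract operations."""
    kind, dst, src = instr
    if kind == "add":
        return [
            Op(dst, (dst, src), f"{dst} = {dst} + {src}"),
            # Simplified: the flag updates are modelled as reading the result.
            Op("ZF", (dst,), f"ZF = ({dst} == 0)"),
            Op("CF", (dst,), f"CF = carry({dst})"),
        ]
    if kind == "jz":
        return [Op("PC", ("ZF",), f"branch to {dst} if ZF")]
    raise ValueError(f"unknown instruction: {kind}")


def eliminate_dead_ops(ops, live_out):
    """Walk backwards, keeping only ops whose result is read later."""
    live = set(live_out)
    kept = []
    for op in reversed(ops):
        # Branches are side effects and are always kept.
        if op.dest == "PC" or op.dest in live:
            kept.append(op)
            live.discard(op.dest)
            live.update(op.srcs)
    kept.reverse()
    return kept


# Hypothetical guest block: two adds followed by a branch on the zero flag.
program = [("add", "rax", "rbx"), ("add", "rcx", "rdx"), ("jz", "done", None)]
ops = [op for instr in program for op in expand(instr)]
# Assumption: all general registers are live at block exit, flags are not.
kept = eliminate_dead_ops(ops, live_out={"rax", "rbx", "rcx", "rdx"})

for op in kept:
    print(op.text)
```

In this example only the second add's zero flag survives, because it is the only flag the branch reads; both carry-flag updates and the first zero-flag update are discarded.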

Obviously we are supposed to conclude that M1 is wicked fast. But there are other explanations: perhaps the older CPUs were rather slow, or perhaps Geekbench is not particularly well optimized for its target CPUs, or perhaps Geekbench is more memory/cache/io limited than CPU limited.

Looking at this: https://www.geekbench.com/doc/geekbench5-cpu-workloads.pdf it seems like cache misses are quite a big component of many of their tests, which suggests most of Geekbench is indeed memory/cache limited. Something like their ray tracing test or their n-body problem would probably give us a better idea of M1's CPU power.

My tentative conclusion so far is that M1 has a great memory subsystem, partly due to having RAM in the same package as the CPU, but it's unclear how fast a CPU it is if everything fits in cache. And it seems to me everyone is using Geekbench because it shines the best light on M1. For most users these distinctions won't matter that much, but it will matter for certain tasks.

@old unix geek.

Geekbench anonymously uploads your results to a public database for the world to see, unlike some of the other standardized benchmarks. Also, it’s a short test that doesn’t take long to run, which makes it attractive to journalists writing reviews under a deadline. The embargo on reviews should lift tonight or tomorrow, and we’ll start seeing more thorough test results.

@Old: We're starting to see the first Cinebench R23 ("thermal torture test") results, and it's still holding up quite well to contemporary Intel and AMD chips.

Do you have reason to be pessimistic about the performance of the M1? It's got 16 billion transistors in a 5nm process. It's not like they're just making up wild performance numbers here.

@Glaurung: thanks!

@T:

I'm not sure why you think I am being pessimistic. In fact what I'm saying would be optimistic in the long term.

First, a correction: not all 16 billion transistors are dedicated to the M1 CPU. Most of the transistors will be used by the caches, the GPU, the Neural Engine and other peripherals. I don't know how many transistors are dedicated to the CPU. Apple gave the impression that 16 billion transistors is amazing, but for instance an AMD EPYC chip contains 39.54 billion transistors.

Memory latency is a key architectural problem. That's why we spend so many transistors on caches. Say you have a linked list. Each time you dereference the next pointer, its target cannot be predicted. So if the target is not in the cache, the CPU must wait until the memory is loaded. In most PCs, where memory is a separate component, this takes on the order of 300 cycles... which is an eternity on CPUs which execute multiple instructions per cycle. The same thing happens with indirect jumps. Even an out-of-order CPU cannot hide that latency. Intel's hyperthreading was an attempt to keep the CPU busy while memory loads stall. This is also one of the fundamental issues that cause GPU architectures to differ significantly from CPU architectures. If the memory is on chip however, the number of cycles falls dramatically. Now your CPU isn't stalling as much, which means even if it's doing less per cycle, it will come out ahead in most cases. Since memory is on chip, I expect this to be the case. In general that will produce better results, except for workloads that don't access memory much.
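The argument above can be made concrete with a back-of-envelope model, sketched here in Python. The instruction count, miss count, IPC values, and the 100-cycle "closer memory" latency are made-up inputs chosen only to show the shape of the trade-off (the 300-cycle figure is the one quoted above). This is not a measurement of any real CPU.

```python
def total_cycles(instructions, ipc, misses, miss_latency):
    """Cycles spent executing plus cycles spent stalled on cache misses.

    Assumes every miss fully stalls the core, as with dependent
    pointer-chasing loads. That is a deliberate simplification.
    """
    return instructions / ipc + misses * miss_latency


# Hypothetical workload: 1,000,000 instructions with 10,000 cache misses.
INSTRUCTIONS = 1_000_000
MISSES = 10_000

# Core A: wider (4 instructions/cycle) but each miss costs 300 cycles.
wide_far = total_cycles(INSTRUCTIONS, ipc=4, misses=MISSES, miss_latency=300)
# Core B: narrower (2 instructions/cycle) but each miss costs 100 cycles.
narrow_near = total_cycles(INSTRUCTIONS, ipc=2, misses=MISSES, miss_latency=100)

# Same two cores on a workload that never misses the cache.
wide_far_cached = total_cycles(INSTRUCTIONS, ipc=4, misses=0, miss_latency=300)
narrow_near_cached = total_cycles(INSTRUCTIONS, ipc=2, misses=0, miss_latency=100)

print(wide_far, narrow_near)
print(wide_far_cached, narrow_near_cached)
```

Under these assumed numbers the narrower core with cheaper misses finishes the miss-heavy workload in fewer cycles, while the wider core wins when everything fits in cache, which is the exception noted above.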

Another datapoint is that Apple silicon's performance has been increasing much more steadily and rapidly than x86 performance. One could either attribute that to the 5nm shrink and higher frequency, to their engineers being so much more skilled than the engineers of AMD or Intel, or to some superiority of ARM's architecture. My understanding is that M1 isn't clocked very high. I also very much doubt that Apple has engineers who are so much better than AMD or Intel: the reason x86 performance has been growing slowly is that all the low hanging fruit has been taken. And although some aspects of ARM's architecture such as fixed length instructions do simplify things, these differences do not strike me as sufficient to explain the difference in performance increase. The fact memory is on chip for M1, but not for Ryzen, seems a much more plausible explanation to me.

Finally I mentioned this is actually an optimistic viewpoint. Why? Because it means that there's still some low hanging fruit Apple can exploit before they encounter the same difficulties as AMD/Intel have.

This reminds me of the position Intel was in less than a decade ago: It's easy to have a performance advantage if you're a process node ahead of everyone else. But it's looking like jettisoning legacy code and backwards compatibility (32-bit) is helping too.

Then again, SoC design has been the best-functioning part of Apple for many years, so it's not too surprising that they've created such a blazing chip once they put their minds to it. Though it's a shame all that power is locked up by software that feels increasingly bug-ridden. Here's hoping the M1 provides some incentive for Intel and AMD to start making high-performance ARM designs, and for Microsoft to get their x86 translator up to speed.

"Also, it’s a short test that doesn’t take long to run"

This also means it will give a huge comparative advantage to systems with poor cooling solutions.

"Do you have reason to be pessimistic about the performance of the M1?"

Not pessimistic, it's more that some of Apple's claims looked a bit hyperbolic. MacRumors posted Cinebench scores that show the M1 about on par with a 2019 16-inch MacBook Pro with a 2.3GHz Core i9 chip, probably a bit faster in single-core, but slower in multi-core. That is impressive in itself, but not a huge step forward.

MacRumors didn't say how they got the numbers: whether this is a first-run result or one taken after the chip has reached thermal stability. That would be interesting to know, since it would indicate whether these numbers are at the high end of what we can expect, or whether better cooling will produce much better results.

So this came out: https://www.anandtech.com/show/16252/mac-mini-apple-m1-tested/4

I was really surprised by 456.hmmer because it does not stress the memory system... until I read that they are comparing a binary compiled with enable-loop-distribute (ARM) with one compiled without (x86). Basically the x86 is being made to do a lot more work than the ARM chip, because the latter has been given code that is better vectorised and does not include unnecessary register spills. They admit Ryzen does better than M1 if they enable the same optimisation.
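For readers unfamiliar with the optimisation: loop distribution splits one loop into several, so that work with no cross-iteration dependence can be vectorised separately from work that has one. The Python sketch below only illustrates the shape of that transformation; the function names and data are hypothetical, and it says nothing about what the compiler actually emits for 456.hmmer.

```python
def fused(a, b):
    """Original loop: a loop-carried running sum mixed with independent work."""
    n = len(a)
    prefix = [0] * n
    scaled = [0] * n
    running = 0
    for i in range(n):
        running += a[i]        # depends on the previous iteration
        prefix[i] = running
        scaled[i] = 2 * b[i]   # independent of every other iteration
    return prefix, scaled


def distributed(a, b):
    """Same computation after distribution: two loops, one per concern."""
    n = len(a)
    prefix = [0] * n
    running = 0
    for i in range(n):         # serial loop keeps the dependence
        running += a[i]
        prefix[i] = running
    # The independent part now stands alone and is easy to vectorise.
    scaled = [2 * x for x in b]
    return prefix, scaled


# Both versions produce identical results on equal-length inputs.
print(fused([1, 2, 3], [4, 5, 6]))
print(distributed([1, 2, 3], [4, 5, 6]))
```

A compiler that performs this split on one target but not the other is effectively handing the two CPUs different programs.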

As it is, it's tough to compare two different architectures since one is comparing not just two CPUs but also two compilers. But if on top of that one is being given the best possible compiler optimisations and the other isn't... well, that doesn't tell us very much. AnandTech should have done better. I'm also curious why 470.lbm is so much better on M1.

Nevertheless, particularly given the power consumption, this does actually look impressive. If Apple's team manages to continue improving performance at the rate they have been so far (and that's quite a big if, because things get a lot harder once the low hanging fruit is used up), AMD will have a real battle on their hands. Non-upgradeable parts with fixed amounts of RAM may be in their future too.

>Nevertheless, particularly given the power consumption,
>this does actually look impressive

Yeah, even with all the caveats, those numbers are fantastic. The fact that Apple is even vaguely in the same ballpark as AMD is better than I expected.

It's pretty good to see competition heating up again, after over a decade of Intel basically coasting along and doing just the bare necessities to get people to update every few years.

>Non-upgradeable parts with fixed amounts of RAM may be
>in their future too.

Ugh.

I don't yet see "competition heating up again". Intel and AMD have already been going at it since the 1980's. x86 isn't getting any simpler. Every architecture has natural limits. There's a reason the M1 isn't just a superfast 6502 or 68k.

- The Wintel side isn't well positioned to switch to a new architecture. Backwards compatibility is their bread and butter, and by all reports Microsoft's x86 emulator isn't great. Intel has tried to switch PC architectures many times in their history. Itanium was actually their most successful attempt.

- AMD could make ARM chips, but they've shown little interest in doing so. Their ARM webpage still says "planned availability is expected ... 2016". And without an OS and apps, who's the market? Linux-only PCs have not exactly seen great sales.

- Linux *is* in a great position with respect to software (there's mostly no need for emulation), but there's nobody offering high-performance chips with any other architecture yet. They can't just ride the desktop-PC performance wave this time. No major PC makers (besides Apple) have offered non-x86 personal computers since the 1990's. It'll be great if/when they get Linux running on the M1, but again, given the number of desktop Linux users, it's not clear how that will change anything.

I love the idea of PC CPU competition, but I just don't see where it's going to come from. I think it's more likely Chromebooks or video game consoles will grow into this space, as they don't have the strong backwards-compatibility requirements that Intel/AMD/Microsoft do.

"Intel and AMD have already been going at it since the 1980's."

The last time AMD was competitive with Intel was in the early to mid 2000s. When Intel introduced the Core 2 CPUs, they basically put AMD out of business, forcing them to sell their fabs. For almost a decade after that, AMD barely made any progress, and as a result, Intel wasn't forced to improve much, either.

One obvious problem here is that AMD's recent advances are in large part a result of TSMC's efforts (and similarly for Apple, as well). This is bad. We're basically down to three companies still heavily investing in fabs: TSMC, Samsung, and Intel (and I'm not sure how much longer Intel will be part of this group).

We could very well end up in a similar situation to the AMD-Intel one here, where we basically have one company fabricating all high-end CPUs, without any real competition, and no pressure to improve.

Perhaps it would be nice if Apple decided to invest in its own fabrication capabilities.

"Every architecture has natural limits"

x86 is a CPU instruction set. It doesn't dictate the CPU design. x86 CPU design has changed a lot over the past four decades. In fact, no Intel CPU sold in the past decade has executed x86 instructions natively. Many of the things Apple ostensibly does to improve M1 performance would work perfectly well on an Intel system.

Obviously, it would be nice to start out fresh, and performance gains could be made. We wouldn't need the x86 decoder block, for one, which would improve performance by a few percentage points. But x86 is not the primary reason Intel-compatible CPUs haven't made a lot of progress in the last decade, and it's also not the reason they're starting to make progress again now.

"There's a reason the M1 isn't just a superfast 6502 or 68k."

It's because low-power, fast ARM chips were available from a bunch of manufacturers when Apple introduced the iPhone, and because the ARM IP can be licensed easily.

"I love the idea of PC CPU competition, but I just don't see where it's going to come from."

There has been PC CPU competition starting with AMD's introduction of the Zen 2 CPUs, and Apple's M1 is only going to help with this.

Zen 2 CPUs launched in July 2019, Zen 3 in November 2020, with about a 20% performance gain. This is roughly similar to the recent year-over-year performance gains Apple has shown. It's only one data point (two if you include Zen 1), of course, and we'll have to see how the trend continues, but so far, it's not looking that bad for x86 CPUs.

Which is good. Apple making faster CPUs is great. AMD making faster CPUs is great. Intel getting kicked in its butt is great. It's all competition, and it's all good for us.

Now start working on these fabs, Apple.
