Why Rosetta 2 Is Fast
Dougall Johnson (Hacker News):
Generally translating each instruction only once has significant instruction-cache benefits – other emulators typically cannot reuse code when branching to a new target.
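Translating once and reusing the result is what distinguishes this approach from emulators that re-generate code per execution path. As a rough illustration only (the table, names, and stub translator below are invented for this sketch, not Rosetta's actual data structures), reuse amounts to keying the generated ARM code by the x86 address it was translated from:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative direct-mapped cache from guest (x86) addresses to
 * already-translated host (ARM) code. Because each instruction is
 * translated only once, any later branch to the same guest address
 * reuses the existing host code instead of re-translating it. */
#define CACHE_SLOTS 4096

typedef struct {
    uint64_t guest_pc;   /* x86 address the translation starts at */
    void    *host_code;  /* entry point of the generated ARM code */
} cache_entry;

static cache_entry cache[CACHE_SLOTS];

/* Stand-in for the real translator; generating ARM code is out of scope. */
static void *translate_block(uint64_t guest_pc)
{
    (void)guest_pc;
    return NULL;
}

void *lookup_or_translate(uint64_t guest_pc)
{
    cache_entry *e = &cache[guest_pc % CACHE_SLOTS];
    if (e->guest_pc != guest_pc || e->host_code == NULL) {
        e->guest_pc  = guest_pc;
        e->host_code = translate_block(guest_pc);  /* translate only once */
    }
    return e->host_code;  /* every later branch to guest_pc reuses this */
}
```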
[…]
Given these constraints, the goal is generally to get as close to one-ARM-instruction-per-x86-instruction as possible, and the tricks described in the following sections allow Rosetta to achieve this surprisingly often. This keeps the expansion-factor as low as possible. For example, the instruction size expansion factor for an sqlite3 binary is ~1.64x (1.05MB of x86 instructions vs 1.72MB of ARM instructions).
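For concreteness, the quoted expansion factor is just the ratio of translated ARM instruction bytes to original x86 instruction bytes; checking the sqlite3 numbers:

```c
#include <stdio.h>

int main(void)
{
    /* Instruction bytes from the sqlite3 example above, in MB. */
    double x86_mb = 1.05, arm_mb = 1.72;
    printf("expansion factor: %.2fx\n", arm_mb / x86_mb);  /* prints 1.64x */
    return 0;
}
```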
[…]
All performant processors have a return-address-stack to allow branch prediction to correctly predict return instructions.
Rosetta 2 takes advantage of this by rewriting x86 CALL and RET instructions to ARM BL and RET instructions (as well as the architectural loads/stores and stack-pointer adjustments). This also requires some extra book-keeping, saving the expected x86 return-address and the corresponding translated jump target on a special stack when calling, and validating them when returning, but it allows for correct return prediction.
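The bookkeeping in the quote can be pictured as a shadow stack of (x86 return address, translated return target) pairs: push on a translated CALL, pop and validate on a translated RET, and fall back to an ordinary lookup when they disagree (for example, code that manipulates its own return address). This is a minimal sketch under those assumptions; the names and fixed depth are invented, not Rosetta's real mechanism:

```c
#include <stddef.h>
#include <stdint.h>

/* Generic lookup, as sketched earlier: find (or create) the translated
 * ARM code for an arbitrary x86 address. Used here as the slow path. */
void *lookup_or_translate(uint64_t guest_pc);

typedef struct {
    uint64_t x86_return;   /* return address the x86 CALL pushed          */
    void    *host_return;  /* translated ARM code for that return address */
} shadow_entry;

#define SHADOW_DEPTH 64
static shadow_entry shadow[SHADOW_DEPTH];
static size_t shadow_top;

/* On a translated CALL: record where the x86 code expects to return and
 * where the corresponding ARM translation lives. */
void shadow_push(uint64_t x86_return, void *host_return)
{
    if (shadow_top < SHADOW_DEPTH)
        shadow[shadow_top++] = (shadow_entry){ x86_return, host_return };
}

/* On a translated RET: if the return address the program actually popped
 * matches what we recorded, jump straight to the predicted ARM target;
 * otherwise fall back to the generic lookup. */
void *shadow_pop(uint64_t actual_x86_return)
{
    if (shadow_top > 0) {
        shadow_entry e = shadow[--shadow_top];
        if (e.x86_return == actual_x86_return)
            return e.host_return;              /* validated fast path */
    }
    return lookup_or_translate(actual_x86_return);
}
```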
[…]
The Apple M1 has an undocumented extension that, when enabled, ensures instructions like ADDS, SUBS and CMP compute PF and AF and store them as bits 26 and 27 of NZCV respectively, providing accurate emulation with no performance penalty.
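PF and AF are the two x86 flags that ARM's NZCV does not model: PF is set when the low byte of a result has an even number of one bits, and AF records the carry out of bit 3 (used by the BCD-adjust instructions). Without hardware help, an emulator has to derive them in software on every flag read, roughly as below; the bit-26/27 accessors only mirror the layout described in the quote and are purely illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

/* x86 PF: set when the low 8 bits of the result contain an even number of
 * 1 bits. This is what must be recomputed in software without hardware help. */
bool parity_flag(uint64_t result)
{
    uint8_t b = (uint8_t)result;
    b ^= b >> 4;
    b ^= b >> 2;
    b ^= b >> 1;
    return (b & 1) == 0;   /* even parity -> PF set */
}

/* x86 AF for an addition: carry out of bit 3, reconstructed from the
 * operands and the result. */
bool adjust_flag_add(uint64_t a, uint64_t b, uint64_t result)
{
    return ((a ^ b ^ result) & 0x10) != 0;
}

/* With the extension described above enabled, flag-setting instructions
 * would leave PF and AF directly in bits 26 and 27 of NZCV, so reading them
 * back is a pair of bit tests instead of the recomputation above. */
bool pf_from_nzcv(uint32_t nzcv) { return (nzcv >> 26) & 1; }
bool af_from_nzcv(uint32_t nzcv) { return (nzcv >> 27) & 1; }
```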
Update (2022-12-14): Howard Oakley:
Whenever possible, Rosetta completes its translation well before the code is required to be run. For some apps, this may occur when they’re installed on the Mac, but it can also be delayed until launch time.