{"id":37749,"date":"2022-11-29T16:11:50","date_gmt":"2022-11-29T21:11:50","guid":{"rendered":"https:\/\/mjtsai.com\/blog\/?p=37749"},"modified":"2022-12-14T15:02:02","modified_gmt":"2022-12-14T20:02:02","slug":"why-rosetta-2-is-fast","status":"publish","type":"post","link":"https:\/\/mjtsai.com\/blog\/2022\/11\/29\/why-rosetta-2-is-fast\/","title":{"rendered":"Why Rosetta 2 Is Fast"},"content":{"rendered":"<p><a href=\"https:\/\/dougallj.wordpress.com\/2022\/11\/09\/why-is-rosetta-2-fast\/\">Dougall Johnson<\/a> (<a href=\"https:\/\/news.ycombinator.com\/item?id=33533132\">Hacker News<\/a>):<\/p>\n<blockquote cite=\"https:\/\/dougallj.wordpress.com\/2022\/11\/09\/why-is-rosetta-2-fast\/\">\n<p>Generally translating each instruction only once has significant instruction-cache benefits &#x2013; other emulators typically cannot reuse code when branching to a new target.<\/p>\n<p>[&#8230;]<\/p>\n<p>Given these constraints, the goal is generally to get as close to one-ARM-instruction-per-x86-instruction as possible, and the tricks described in the following sections allow Rosetta to achieve this surprisingly often. This keeps the expansion-factor as low as possible. For example, the instruction size expansion factor for an sqlite3 binary is ~1.64x (1.05MB of x86 instructions vs 1.72MB of ARM instructions).<\/p>\n<p>[&#8230;]<\/p>\n<p>All performant processors have a return-address-stack to allow branch prediction to correctly predict return instructions.<\/p>\n<p>Rosetta 2 takes advantage of this by rewriting x86 <strong>CALL<\/strong> and <strong>RET<\/strong> instructions to ARM <strong>BL<\/strong> and <strong>RET<\/strong> instructions (as well as the architectural loads\/stores and stack-pointer adjustments). This also requires some extra book-keeping, saving the expected x86 return-address and the corresponding translated jump target on a special stack when calling, and validating them when returning, but it allows for correct return prediction.<\/p>\n<p>[&#8230;]<\/p>\n<p>The Apple M1 has an undocumented extension that, when enabled, ensures instructions like <strong>ADDS<\/strong>, <strong>SUBS<\/strong> and <strong>CMP<\/strong> compute PF and AF and store them as bits 26 and 27 of NZCV respectively, providing accurate emulation with no performance penalty.<\/p>\n<\/blockquote>\n\n<p>Previously:<\/p>\n<ul>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2021\/03\/04\/reverse-engineering-rosetta-2\/\">Reverse-Engineering Rosetta 2<\/a><\/li>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2020\/11\/16\/performance-of-rosetta-2-on-apple-m1\/\">Performance of Rosetta 2 on Apple M1<\/a><\/li>\n<\/ul>\n\n<p id=\"why-rosetta-2-is-fast-update-2022-12-14\">Update (2022-12-14): <a href=\"https:\/\/eclecticlight.co\/2022\/12\/10\/explainer-rosetta-2\/\">Howard Oakley<\/a>:<\/p>\n<blockquote cite=\"https:\/\/eclecticlight.co\/2022\/12\/10\/explainer-rosetta-2\/\">\n<p>Whenever possible Rosetta completes its translation well before the code is required to be run. For some apps, this may occur when they&rsquo;re installed on the Mac, but it can also be delayed until launch time.<\/p>\n<\/blockquote>","protected":false},"excerpt":{"rendered":"<p>Dougall Johnson (Hacker News): Generally translating each instruction only once has significant instruction-cache benefits &#x2013; other emulators typically cannot reuse code when branching to a new target. [&#8230;] Given these constraints, the goal is generally to get as close to one-ARM-instruction-per-x86-instruction as possible, and the tricks described in the following sections allow Rosetta to achieve [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"apple_news_api_created_at":"2022-11-29T21:11:52Z","apple_news_api_id":"5ec98177-aae6-441d-8e64-934b2a594ec4","apple_news_api_modified_at":"2022-12-14T20:02:06Z","apple_news_api_revision":"AAAAAAAAAAAAAAAAAAAAAg==","apple_news_api_share_url":"https:\/\/apple.news\/AXsmBd6rmRB2OZJNLKllOxA","apple_news_coverimage":0,"apple_news_coverimage_caption":"","apple_news_is_hidden":false,"apple_news_is_paid":false,"apple_news_is_preview":false,"apple_news_is_sponsored":false,"apple_news_maturity_rating":"","apple_news_metadata":"\"\"","apple_news_pullquote":"","apple_news_pullquote_position":"","apple_news_slug":"","apple_news_sections":"\"\"","apple_news_suppress_video_url":false,"apple_news_use_image_component":false,"footnotes":""},"categories":[2],"tags":[2014,1941,770,733,30,2223,138,1025],"class_list":["post-37749","post","type-post","status-publish","format-standard","hentry","category-technology","tag-apple-m1","tag-arm-macs","tag-assembly-language","tag-emulator","tag-mac","tag-macos-13-ventura","tag-optimization","tag-rosetta"],"apple_news_notices":[],"_links":{"self":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/37749","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/comments?post=37749"}],"version-history":[{"count":3,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/37749\/revisions"}],"predecessor-version":[{"id":37903,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/37749\/revisions\/37903"}],"wp:attachment":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/media?parent=37749"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/categories?post=37749"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/tags?post=37749"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}