{"id":47400,"date":"2025-04-14T13:38:04","date_gmt":"2025-04-14T17:38:04","guid":{"rendered":"https:\/\/mjtsai.com\/blog\/?p=47400"},"modified":"2025-07-24T11:50:26","modified_gmt":"2025-07-24T15:50:26","slug":"llama-gaming-ai-benchmarks","status":"publish","type":"post","link":"https:\/\/mjtsai.com\/blog\/2025\/04\/14\/llama-gaming-ai-benchmarks\/","title":{"rendered":"LLaMA Gaming AI Benchmarks"},"content":{"rendered":"<p><a href=\"https:\/\/www.theverge.com\/meta\/645012\/meta-llama-4-maverick-benchmarks-gaming\">Kylie Robison<\/a> (via <a href=\"https:\/\/news.ycombinator.com\/item?id=43620452\">Hacker News<\/a>, <a href=\"https:\/\/tech.slashdot.org\/story\/25\/04\/08\/133257\/meta-got-caught-gaming-ai-benchmarks\">Slashdot<\/a>):<\/p>\n<blockquote cite=\"https:\/\/www.theverge.com\/meta\/645012\/meta-llama-4-maverick-benchmarks-gaming\"><p>Over the weekend, Meta dropped two new <a href=\"https:\/\/ai.meta.com\/blog\/llama-4-multimodal-intelligence\/\">Llama 4 models<\/a>: a smaller model named Scout, and Maverick, a mid-size model that the company claims can beat GPT-4o and Gemini 2.0 Flash &ldquo;across a broad range of widely reported benchmarks.&rdquo;<\/p><p>[&#8230;]<\/p><p>The achievement seemed to position Meta&rsquo;s open-weight Llama 4 as a serious challenger to the state-of-the-art, closed models from OpenAI, Anthropic, and Google. Then, AI researchers digging through Meta&rsquo;s documentation discovered something unusual.<\/p><p>In fine print, Meta acknowledges that the version of Maverick tested on LMArena isn&rsquo;t the same as what&rsquo;s available to the public. According to Meta&rsquo;s own materials, it deployed an <a href=\"https:\/\/x.com\/natolambert\/status\/1908913635373842655\">&ldquo;experimental chat version&rdquo;<\/a> of Maverick to LMArena that was specifically &ldquo;optimized for conversationality,&rdquo; <em>TechCrunch<\/em> first <a href=\"https:\/\/techcrunch.com\/2025\/04\/06\/metas-benchmarks-for-its-new-ai-models-are-a-bit-misleading\/\">reported<\/a>.<\/p><\/blockquote>\n<p>First <a href=\"https:\/\/mjtsai.com\/blog\/2023\/12\/11\/googles-gemini\/\">Google<\/a> demo shenanigans, then <a href=\"https:\/\/mjtsai.com\/blog\/2025\/03\/13\/rotten\/\">Apple<\/a>, and now Meta.<\/p>\n\n<p>Previously:<\/p>\n<ul>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2025\/04\/10\/how-apple-fumbled-siris-ai-makeover\/\">How Apple Fumbled Siri&rsquo;s AI Makeover<\/a><\/li>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2025\/03\/25\/please-stop-externalizing-your-costs-directly-into-my-face\/\">Please Stop Externalizing Your Costs Directly Into My Face<\/a><\/li>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2025\/03\/13\/rotten\/\">Rotten<\/a><\/li>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2025\/01\/28\/deepseek\/\">DeepSeek<\/a><\/li>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2024\/12\/16\/gemini-2-0\/\">Gemini 2.0<\/a><\/li>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2024\/10\/15\/understanding-the-limitations-of-mathematical-reasoning-in-large-language-models\/\">Understanding the Limitations of Mathematical Reasoning in Large Language Models<\/a><\/li>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2023\/12\/11\/googles-gemini\/\">Google&rsquo;s Gemini<\/a><\/li>\n<\/ul>\n\n<p id=\"llama-gaming-ai-benchmarks-update-2025-04-22\">Update (<a href=\"#llama-gaming-ai-benchmarks-update-2025-04-22\">2025-04-22<\/a>): <a href=\"https:\/\/www.theverge.com\/news\/608188\/google-fake-gemini-ai-output-super-bowl\">Emma Roth<\/a>:<\/p>\n<blockquote cite=\"https:\/\/www.theverge.com\/news\/608188\/google-fake-gemini-ai-output-super-bowl\"><p>Google appears to have faked AI output in a commercial set to run during the Super Bowl. The ad shows a business owner using Gemini to write a website description, but the text portrayed as generated by AI has been available on the business&rsquo;s website since at least August 2020, as shown <a href=\"https:\/\/web.archive.org\/web\/20200807133049\/https:\/\/www.wisconsincheesemart.com\/products\/gouda-cheese-smoked\">on this archived webpage<\/a>.<\/p><\/blockquote>","protected":false},"excerpt":{"rendered":"<p>Kylie Robison (via Hacker News, Slashdot): Over the weekend, Meta dropped two new Llama 4 models: a smaller model named Scout, and Maverick, a mid-size model that the company claims can beat GPT-4o and Gemini 2.0 Flash &ldquo;across a broad range of widely reported benchmarks.&rdquo;[&#8230;]The achievement seemed to position Meta&rsquo;s open-weight Llama 4 as a [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"apple_news_api_created_at":"2025-04-14T17:38:07Z","apple_news_api_id":"4eedf513-2c44-4202-9ff4-fb1da9bae357","apple_news_api_modified_at":"2025-04-22T17:17:35Z","apple_news_api_revision":"AAAAAAAAAAAAAAAAAAAAAA==","apple_news_api_share_url":"https:\/\/apple.news\/ATu31EyxEQgKf9PsdqbrjVw","apple_news_coverimage":0,"apple_news_coverimage_caption":"","apple_news_is_hidden":false,"apple_news_is_paid":false,"apple_news_is_preview":false,"apple_news_is_sponsored":false,"apple_news_maturity_rating":"","apple_news_metadata":"\"\"","apple_news_pullquote":"","apple_news_pullquote_position":"","apple_news_slug":"","apple_news_sections":"\"\"","apple_news_suppress_video_url":false,"apple_news_use_image_component":false,"footnotes":""},"categories":[2],"tags":[1351,263,2427,2137],"class_list":["post-47400","post","type-post","status-publish","format-standard","hentry","category-technology","tag-artificial-intelligence","tag-theory","tag-llama","tag-meta"],"apple_news_notices":[],"_links":{"self":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/47400","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/comments?post=47400"}],"version-history":[{"count":2,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/47400\/revisions"}],"predecessor-version":[{"id":47451,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/47400\/revisions\/47451"}],"wp:attachment":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/media?parent=47400"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/categories?post=47400"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/tags?post=47400"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}