Monday, April 14, 2025

Llama Gaming AI Benchmarks

Kylie Robison (via Hacker News, Slashdot):

Over the weekend, Meta dropped two new Llama 4 models: a smaller model named Scout, and Maverick, a mid-size model that the company claims can beat GPT-4o and Gemini 2.0 Flash “across a broad range of widely reported benchmarks.”

[…]

The achievement seemed to position Meta’s open-weight Llama 4 as a serious challenger to the state-of-the-art, closed models from OpenAI, Anthropic, and Google. Then, AI researchers digging through Meta’s documentation discovered something unusual.

In fine print, Meta acknowledges that the version of Maverick tested on LMArena isn’t the same as what’s available to the public. According to Meta’s own materials, it deployed an “experimental chat version” of Maverick to LMArena that was specifically “optimized for conversationality,” TechCrunch first reported.

First Google demo shenanigans, then Apple, and now Meta.


3 Comments


Laurent Giroud

This is interesting on two counts: first, it highlights how these firms are essentially clueless about how to improve their models in general so that they can be controlled in a more programmatic manner (which, by construction, they cannot be); and second, they are willing to risk being caught with their pants down, lying about it in public.

So we have essentially two levels of incompetence: one semi-technical and one purely executive/communications. The latter raises even more moral questions, but keep in mind that companies are amoral (alas).

I think these are essentially symptoms of the same underlying issue, though: executives are so far from having a technical understanding of the issues at hand that they seem to be essentially wishing for their teams to make the problems inherent to LLMs' construction disappear, and are willing to fabricate evidence in the meantime.
This would already be an important problem, but there could be a second one at play here: one may wonder whether those executives are getting clear feedback from engineers that LLMs' lack of controllability is inherent to their very structure. Absent that feedback, they would see no choice but to march forward (into the wall, alas).
Given how much AI hype there is, even within engineering groups, it is not clear that many even dare express the thought that other approaches are needed for the expected feature set (natural language, symbolic reasoning, safety, absence of hallucinations).

I am with John Siracusa here: a step back is needed, but it's not clear that there is any awareness of that necessity at any decision-making level in these companies. And the fact that they are willing to risk being caught lying about it makes one think they don't really have a way out at the moment.


The unmodified release version of Maverick (Llama-4-Maverick-17B-128E-Instruct) ranks 32nd on LMArena.

This is interesting because Meta tried to make the model less “biased” toward the left and to refuse fewer questions. If the hypothesis that LLMs are biased toward the left is correct, removing that bias should improve the model, particularly on this type of benchmark. The results don't seem to bear that out.


I found this blog post about how recent AI gains feel like bullshit rather spot-on. I get this weird feeling that I'm being gaslit all the time when people wax lyrical about how fantastic AI is. It's nice and useful; I think the 20 dollars a month my company pays for my ChatGPT Plus subscription is money well spent.

But at the same time, I honestly don't see that much of a difference over the last two years. It's more fine-grained and more tightly locked into the large hump of the bell curve. Results look more like the thing they're copying. Anyway... worth a read:

https://www.lesswrong.com/posts/4mvphwx5pdsZLMmpY/recent-ai-model-progress-feels-mostly-like-bullshit
