Thursday, January 16, 2025

Putnam-AXIOM Variation

Aryan Gulati et al. (PDF, via Hacker News):

As large language models (LLMs) continue to advance, many existing benchmarks designed to evaluate their reasoning capabilities are becoming saturated. Therefore, we present the Putnam-AXIOM Original benchmark consisting of 236 mathematical problems from the William Lowell Putnam Mathematical Competition, along with detailed step-by-step solutions. To preserve the Putnam-AXIOM benchmark’s validity and mitigate potential data contamination, we created the Putnam-AXIOM Variation benchmark with functional variations of 52 problems. By programmatically altering problem elements like variables and constants, we can generate unlimited novel, equally challenging problems not found online. We see that almost all models have significantly lower accuracy in the variations than the original problems. Our results reveal that OpenAI’s o1-preview, the best performing model, achieves merely 41.95% accuracy on the Putnam-AXIOM Original but experiences around a 30% reduction in accuracy on the variations’ dataset when compared to corresponding original problems.

So it didn’t “understand” the original problems as well as had been thought.
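The paper's variation pipeline isn't reproduced here, but a minimal sketch of the idea — seeding a problem template so that the statement and its ground-truth answer are regenerated together, producing novel but equally difficult instances — might look like the following. The sample problem and names are illustrative, not taken from the paper.

```python
import random

# Toy sketch of a "functional variation" in the Putnam-AXIOM sense: the
# statement and its ground-truth answer are both functions of a seed, so
# fresh but equally difficult instances can be generated that won't appear
# verbatim in any training set. The sample problem is made up, not from
# the paper.

def variation(seed: int) -> dict:
    rng = random.Random(seed)
    n = rng.randint(3, 20)  # the constant that gets altered per variation
    statement = f"Compute the sum of the first {n} positive odd integers."
    answer = n * n          # answer recomputed for each variant (sum = n^2)
    return {"statement": statement, "answer": answer}

if __name__ == "__main__":
    for seed in range(3):
        v = variation(seed)
        print(v["statement"], "->", v["answer"])
```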


4 Comments


Old Unix Geek

This is clearly a better test. If a kid can "solve" a standard set of questions but can't solve a new, similar one, we doubt he understood the problem and assume he was either given the answer or is applying some strange pattern-matching rule that just happened to work... which is exactly what LLMs learn.


Can it even be said that LLMs "understand" math and logic problems? I am hardly an expert on machine learning, but it strikes me that the general way LLMs work isn't really in alignment with what's required to solve a math problem. Math problems require a specific solution procedure that depends on the exact kind of problem, or, more broadly, logical deduction.

I feel like people keep expecting LLMs to be the mythical Enterprise computer when they're not actually operating in a manner that's necessarily precise or logical.


Old Unix Geek

It's not just logical deduction. There are an infinite number of logical deductions you can make from a set of axioms, most of which are not useful. LLMs learn the patterns that the proofs and arguments they've seen tend to follow. As such, they can be useful to guide the proof search, if the relevant proof follows existing patterns.

Some would claim that guiding the search is understanding. I don't, in the same way I don't think a theorem prover checking a proof understands the proof. However just as theorem provers can be useful, so can this.

Essentially, what the OpenAI people call "thinking" is not considering just one probable path through the proof but many... and then trying to figure out which one has the highest probability... like the good old tree-search algorithms. The LLM just serves as a better heuristic.
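To make that concrete, here is a toy sketch (not how o1 or any real prover actually works): a best-first search over partial proofs in which a scoring function, standing in for the probabilities the LLM assigns to each candidate next step, decides which branch to expand. All names and the demo "proof" are made up for illustration.

```python
import heapq
from typing import Callable, List, Optional, Tuple

Proof = Tuple[str, ...]  # a partial proof as a sequence of steps

def best_first_search(
    start: Proof,
    expand: Callable[[Proof], List[str]],
    score: Callable[[Proof, str], float],
    is_goal: Callable[[Proof], bool],
    max_nodes: int = 1000,
) -> Optional[Proof]:
    """Expand the most promising partial proof first, as ranked by `score`
    (a stand-in for the probabilities an LLM would assign to each candidate
    next step). Gives up after examining max_nodes partial proofs."""
    frontier: List[Tuple[float, Proof]] = [(0.0, start)]  # min-heap on -score
    nodes = 0
    while frontier and nodes < max_nodes:
        nodes += 1
        _, proof = heapq.heappop(frontier)
        if is_goal(proof):
            return proof
        for step in expand(proof):
            # The "heuristic" is the score the model assigns to this step.
            heapq.heappush(frontier, (-score(proof, step), proof + (step,)))
    return None

if __name__ == "__main__":
    # Trivial demo: "derive" the string "abc" one letter at a time, scoring
    # each candidate step by whether it stays on track toward the target.
    target = "abc"
    found = best_first_search(
        start=(),
        expand=lambda proof: list("abc"),
        score=lambda proof, step: 1.0 if target.startswith("".join(proof) + step) else 0.0,
        is_goal=lambda proof: "".join(proof) == target,
    )
    print(found)  # ('a', 'b', 'c')
```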


Anybody who "thinks" that "AI" understands anything is either deluding themselves, ignorant, or a huckster. It has been abundantly clear all along, as it has been in the actual research on LLMs, that they don't "understand" anything. They basically just use a really, really complex form of pattern recognition to 1) find correlations in datasets and, more recently, 2) "generate" "new" things like text and images, but it's all based on learning patterns from a dataset and then outputting something from those patterns based on pre-provided rules. It's a bot.

As previous commenters have stated, this model of 1) finding correlations in a dataset and then 2) outputting "new" things based on that dataset is nowhere near logical deduction. Every single bot company is hoping that people (who buy their stuff) and software developers (who will still, by and large, be necessary to program computers) ignore this basic fact about bots. And the only way they propose to solve these problems is to add more data to their training dataset, or to add more rules that prevent weird things from showing up in the output. There is no escape hatch from this escalation. Bot companies are going to be stuck in an endless cycle of adding new training data, seeing weird and inaccurate output, adding rules to prevent the weird things, and then adding more training data to make it slightly more accurate, at the expense of eating up all the power and water on our planet.

Now, you could philosophically argue that humans, at their very core, are just a very, very, very, very, very, very advanced bot that does the same thing of pattern recognition and outputting things based on those patterns. That may very well be, and I'm probably not qualified to say whether that's true or not (is anybody?). But there's something about spontaneous creativity that current bots just can't emulate, and they won't be able to do that for a long, long time. Whether humanity accepts 29.365% accuracy as good enough, I guess we'll see.
