Understanding the Limitations of Mathematical Reasoning in Large Language Models
Hartley Charlton (Hacker News):
The study, published on arXiv, outlines Apple’s evaluation of a range of leading language models, including those from OpenAI, Meta, and other prominent developers, to determine how well these models could handle mathematical reasoning tasks. The findings reveal that even slight changes in the phrasing of questions can cause major discrepancies in model performance that can undermine their reliability in scenarios requiring logical consistency.
Apple draws attention to a persistent problem in language models: their reliance on pattern matching rather than genuine logical reasoning. In several tests, the researchers demonstrated that adding irrelevant information to a question—details that should not affect the mathematical outcome—can lead to vastly different answers from the models.
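The perturbation the researchers describe is easy to reproduce. Below is a minimal sketch in the spirit of the paper's examples; the problem text and numbers are illustrative, not copied from the paper. The same grade-school problem is posed with and without an irrelevant clause, and the ground truth is identical for both phrasings, so any change in the model's answer is attributable to the wording alone.

```python
# A minimal sketch of the perturbation described above: the same
# grade-school problem with and without an irrelevant clause.
# The ground truth does not change; only the surface wording does.
# (Problem text is illustrative; it is not copied from the paper.)

base = (
    "Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
    "On Sunday he picks double the number he picked on Friday. "
    "How many kiwis does Oliver have?"
)

# Add a detail that should not affect the count.
perturbed = base.replace(
    "double the number he picked on Friday.",
    "double the number he picked on Friday, "
    "though five of them are a bit smaller than average.",
)

ground_truth = 44 + 58 + 2 * 44  # 190 for both phrasings

for prompt in (base, perturbed):
    # Send each prompt to the model under test and compare its numeric
    # answer against ground_truth; the paper's finding is that the
    # perturbed phrasing often changes the answer.
    print(prompt)
    print("expected answer:", ground_truth)
```

Running a batch of such paired prompts through any chat API and diffing the answers is enough to see the effect the paper quantifies.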
Everyone actively working with AI should read it, or at least this terrific X thread by senior author Mehrdad Farajtabar that summarizes what they observed.
[…]
Another manifestation of the lack of sufficiently abstract, formal reasoning in LLMs is the way in which performance often falls apart as problems are made bigger.
[…]
What I argued in 2001, in The Algebraic Mind, still holds: symbol manipulation, in which some knowledge is represented truly abstractly in terms of variables and operations over those variables, much as we see in algebra and traditional computer programming, must be part of the mix. Neurosymbolic AI — combining such machinery with neural networks – is likely a necessary condition for going forward.
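As a toy illustration of the split Marcus is arguing for, here is a sketch in which the "neural" step is faked by a hard-coded translation of a word problem into equations, and a symbolic engine does the actual manipulation of variables. It assumes sympy is available; nothing here comes from the paper.

```python
# Toy sketch of a neurosymbolic split: a language model would translate
# the prose into variables and equations; a symbolic engine then solves
# them exactly, so the arithmetic cannot drift with the phrasing.
# (The "translation" below is hard-coded for illustration.)
from sympy import Eq, solve, symbols

# Prose: "Pat is 4 years older than twice Sam's age; together they are 31."
pat, sam = symbols("pat sam")
equations = [Eq(pat, 2 * sam + 4), Eq(pat + sam, 31)]

# Deterministic symbol manipulation does the reasoning.
print(solve(equations, (pat, sam)))  # {pat: 22, sam: 9}
```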
This is a problem for anyone who believes they can build autonomous AI agents on this foundation, since it means that anytime the “agent” sees a pattern it doesn’t recognize, it will fail hilariously or even catastrophically.
The most surprising part of the news that Apple researchers have discovered that LLMs can’t reason is that anybody who had even a layman’s understanding of LLMs thought they could in the first place.
I think that what [LLMs] do is similar to our human so called “intuition”: they recognize “patterns they’ve seen before and intuitively go to the answer that worked then.”
This is an important aspect of how I think, and a lot of the creative process I have at work is a back and forth between “intuition” and verifying that it stands up to a more rigorous model.
[…]
LLMs have a role in an actual form of AI. They just can’t be it on their own.
LLMs can’t do math because they don’t actually understand concepts; they are just really fancy autocomplete engines.
We knew that already, but this paper quantifies it. The math performance is really pretty dismal, even with training that tries to optimize for math. The best performance was by OpenAI’s GPT-4o, which scored around 95% on the most basic grade-school word problems. That still means it got 1 in 20 questions wrong, which makes it unusable for anything in production.
[…]
IMO the biggest problem with LLMs is not that performance is poor, but that there is no way to tell when they get it wrong. The models may make one mistake in a million, but *which output is the wrong one*?
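For context on why the 95% figure quoted above is still a problem in practice, here is some back-of-the-envelope arithmetic, not from the paper: per-question accuracy compounds badly across chains of questions, and without a way to tell which answer is the wrong one, every output has to be treated as suspect.

```python
# Back-of-the-envelope arithmetic (illustrative, not from the paper):
# if each answer is independently correct 95% of the time, the chance
# that a whole chain of answers is correct drops off quickly.
per_question_accuracy = 0.95

for n in (1, 5, 10, 20):
    all_correct = per_question_accuracy ** n
    print(f"{n:>2} questions, all correct: {all_correct:.1%}")

# A run of 20 questions is entirely correct only about 36% of the time,
# and, as the comment above notes, nothing in the output marks which of
# the wrong answers are wrong.
```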
In December, NARA plans to launch a public-facing AI-powered chatbot called “Archie AI,” 404 Media has learned. “The National Archives has big plans for AI,” a NARA spokesperson told 404 Media. “It’s going to be essential to how we conduct our work, how we scale our services for Americans who want to be able to access our records from anywhere, anytime, and how we ensure that we are ready to care for the records being created today and in the future.”
Employee chat logs given during the presentation show that National Archives employees are concerned about the idea that AI tools will be used in archiving, a practice that is inherently concerned with accurately recording history.
Previously:
3 Comments
Anyone who is surprised by these findings is not a person I would ever take seriously. Companies have to know this is a billion-dollar parlor trick because it's been so obvious for so long from so far away.
I have a hard time really getting it into people's heads that "AIs" can't actually reason and they're not the Enterprise computer, no matter how personable or confident they may be or how much their simulated voice may sound like Majel Barrett.
@Bri: Me too, even to technical people, even explaining how they work. The need to anthropomorphize seems extraordinarily high.
To give them their due, they are useful as a sort of thesaurus, though, when you're looking for technical terms in a field that you then want to look up. It's just important to make sure you don't believe them, however reasonable they seem, and to check everything they say. It's the "All Cretans lie" paradox in real life, which we all must resolve!
If we continue trusting them, though, we'll end up in a world where robots use them and mistakenly put the baby into the oven and wash the beef for tonight's dinner in the bathtub. Stanislaw Lem's surreal worlds will come to life!