What did Apple see? Nothing good.
A group of Apple researchers has published a paper claiming that large language models (LLMs), the backbone of some of AI’s most popular products today, like ChatGPT or Llama, can’t genuinely reason, meaning their intelligence claims are highly overstated (or from a cynical perspective, that we are being lied to).
Through a series of tests, the researchers show that these models' apparent reasoning is largely, and sometimes entirely, a product of memorization rather than real intelligence.
This adds to the growing disillusionment around LLMs, which could cause a massive shift in investment and directly impact the future of many multi-billion-dollar start-ups. Naturally, it also calls into serious question Big Tech's billion-dollar AI expenditures and the future of the frontier AI labs, which depend on this precise vision being true.
So, what is the basis of these strong adversarial claims against LLMs?
A Growing Disenchantment
If you lift your head above the media funnel of AI outlets and influencers that simply echo Sam Altman's thoughts every time he speaks, you will realize that, despite the recent emergence of o1 models, sentiment against Large Language Models (LLMs) is at an all-time high.
The reason?
Despite the alleged increase in ‘intelligence’ that o1 models represent, they still suffer from the same issues previous generations had. In crucial aspects, we have made no progress in the last six years, despite all the hype.
A Mountain of Evidence
Over the last few weeks, especially since the release of the o1 models, which are considered a new type of frontier AI model known as Large Reasoner Models (LRMs), an overwhelming amount of evidence has appeared suggesting that this new paradigm, while an improvement in some respects, still retains many of the issues the very first Transformer presented back in 2017.
- As shown by Valmeekam et al., they are still terrible at planning (breaking a complex task into a plan of simpler steps), underperforming search-based planning algorithms like Fast Downward, released more than ten years ago.
- As shown by MIT researchers, they underperform ARIMA, a statistical method from the 1970s, at time-series anomaly detection.
- Another group of researchers has also shown that, in the absence of prior experience or knowledge of a subject, LLM performance degrades considerably, even for o1 models and even when all the relevant information is provided in the prompt. Long story short, LLMs/LRMs can't seem to follow basic instructions, especially as instruction length increases.
- As evidenced by University of Pennsylvania researchers, they are extremely susceptible to seemingly irrelevant token variations in the sequence. For instance, the example below shows how a simple switch between ‘Linda’ and ‘Bob,’ utterly irrelevant to the reasoning process required to solve the problem, confuses the LLM and leads to failure.
However, none have been harsher on LLMs than Apple, coming out of the gate swinging with a lapidary statement: “LLMs do not perform genuine reasoning.”
And how are they exposing this alleged farce?
Token Biased & Easily Fooled
GSM8K is a very popular math-focused benchmark that tests LLMs' capacity to solve grade-school word problems. Today, this benchmark is considered solved because most frontier LLMs saturate its scores.
But Apple's researchers asked a pointed question: how much of this performance is due to memorization and superficial pattern matching rather than actual reasoning?
And the results are concerning, to say the least.
Heavily token biased
For starters, it's becoming clear that these models' 'reasoned' outputs are based more on sequence familiarity than on real reasoning.
As we saw in the previous image of the ‘Linda’ and ‘Bob’ switch, a simple name change is enough to make the model fail. The reason for this is that the model, far from having internalized the reasoning process, has simply memorized the training sequence.
But why does 'Linda' work when 'Bob' doesn't?
The example above is the famous conjunction fallacy, in which people judge a specific set of conditions to be more likely than a single, more general one, even though that can never be logically true.
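In Kahneman and Tversky's original formulation, the options are "Linda is a bank teller" and "Linda is a bank teller and is active in the feminist movement"; the second can never be more probable, because a conjunction of events is never more probable than either event alone:

```latex
% For any two events A and B:
P(A \cap B) \le P(A)
% Instantiated on the Linda problem:
P(\text{bank teller} \wedge \text{active feminist}) \le P(\text{bank teller})
```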
The LLM gets it right when the name used is 'Linda' because that's the name Kahneman and Tversky used in their work to illustrate this fallacy, which means LLMs have seen this problem many times during training with 'Linda' as the name. Thus, the failure to adapt to new names suggests that LLMs simply memorize the entire sequence instead of internalizing the fallacy itself.
In other words, the model has literally memorized the sequence “Linda is 30…” continuing with “This question is a classic example of the conjunction fallacy…”, a sequence most definitely seen during training.
As it's pure memorization, a simple change to 'Bob' breaks the superficial pattern, showing that LLMs are largely devoid of the higher-level abstractions that characterize deep human reasoning (that minor change wouldn't fool us).
But Apple wanted to test this further. Thus, they created an alternative dataset, GSM-Symbolic, which turns the original questions into templates whose specific tokens (such as names and quantities) can be swapped out, generating problems that are identical reasoning-wise but differ in small surface details:
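A minimal sketch of the idea in Python; the template, names, and numbers below are made up for illustration and are not taken from the paper's actual templates:

```python
import random

# Illustrative GSM-Symbolic-style template: the reasoning stays fixed,
# only the surface tokens (name, quantities) change between instances.
TEMPLATE = (
    "{name} picks {a} apples on Monday and {b} apples on Tuesday. "
    "On Wednesday, {name} picks twice as many as on Monday. "
    "How many apples does {name} have in total?"
)

def sample_instance(rng: random.Random) -> tuple[str, int]:
    """Sample one surface variant and compute its ground-truth answer."""
    name = rng.choice(["Linda", "Bob", "Sofia", "Ahmed"])
    a, b = rng.randint(2, 20), rng.randint(2, 20)
    question = TEMPLATE.format(name=name, a=a, b=b)
    answer = a + b + 2 * a  # Monday + Tuesday + Wednesday (twice Monday)
    return question, answer

rng = random.Random(0)
for _ in range(3):
    question, answer = sample_instance(rng)
    print(question, "->", answer)
```

A model that has genuinely internalized the arithmetic should be indifferent to which variant it sees; a model that has memorized one surface form will not be.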
The results show varying decreases in performance across all evaluated LLMs, even frontier models, although scale seems to work in their favor: the larger the model, the less prone it is to such issues.
But Apple didn’t stop here.
Difficulty degrades performance
Next, they wanted to test the models' capabilities on harder questions, building on the original dataset but adding progressively harder parts to each question:
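For intuition, the progression could look something like this (my own illustration, not one of the paper's actual items): the base problem stays the same, and each harder variant chains one more condition onto the same calculation.

```python
# Illustrative only: each variant adds one more clause the model must
# fold into the same underlying calculation.
base = "Liam buys 12 pencils at $2 each. How much does he spend?"

one_more_clause = (
    "Liam buys 12 pencils at $2 each. "
    "He also buys 3 notebooks at $4 each. How much does he spend?"
)

two_more_clauses = (
    "Liam buys 12 pencils at $2 each. "
    "He also buys 3 notebooks at $4 each, "
    "and a coupon takes $5 off the total. How much does he spend?"
)

# Ground-truth answers: 24, 36, and 31, respectively.
```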
As expected, performance degrades consistently across all models, including o1-mini, and variance increases. In other words, not only is their supposed intelligence exaggerated, but robustness also decreases as complexity grows (although this, too, is expected).
But the most interesting results came with the next testing round.
Easily Fooled
They decided to test the models' capacity to recognize inconsequential clauses, which Apple describes as having no "operational significance" (i.e., they are utterly irrelevant to solving the problem), thereby creating GSM-NoOp.
In layman’s terms, these are clauses added to the problem statement that appear to be relevant but aren’t, in an effort to show how superficial these models’ pattern-matching capabilities are.
As you can see in the example below, they add a statement that appears relevant (it still refers to the kiwis) but is absolutely irrelevant to the problem (we are counting kiwis; size does not matter in this case).
Still, o1-mini and Llama3-8B subtract those smaller kiwis for no particular reason.
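To make the failure concrete, here is the arithmetic side by side. The figures follow the paper's kiwi example as I recall it (44 kiwis on Friday, 58 on Saturday, double Friday's count on Sunday, five of them "a bit smaller"); treat the exact numbers as illustrative.

```python
# Correct reasoning: the remark about five smaller kiwis has no
# operational significance; it changes nothing about the count.
friday, saturday = 44, 58
sunday = 2 * friday                          # "double the number he picked on Friday"
correct_total = friday + saturday + sunday   # 44 + 58 + 88 = 190

# Observed failure mode: the model treats the irrelevant clause as an
# operation and subtracts the five "smaller" kiwis from the total.
distracted_total = correct_total - 5         # 185, the wrong answer

print(correct_total, distracted_total)       # 190 185
```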
To explore further, they also created two additional variations:
- NoOp-Symb: In addition to the modified target question, they also provide the model with up to eight instances of that same question from the original dataset, unmodified and with the reasoning process included. In other words, while the target question contains an irrelevant clause, the model sees eight copies of the very same question solved with the correct reasoning steps.
- NoOp-NoOp: Here, the eight in-context examples come from GSM-NoOp itself: they are variations of different questions, each containing an inconsequential clause, shown together with a reasoning process that correctly ignores it (a rough sketch of how such prompts are assembled follows this list).
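Concretely, that prompt assembly might look like the following sketch in Python; the function and field names are mine, not the paper's:

```python
def build_noop_prompt(shots: list[dict], target_question: str) -> str:
    """Concatenate up to eight solved examples, then append the target question.

    For NoOp-Symb, `shots` would hold unmodified copies of the target question
    with their reasoning chains; for NoOp-NoOp, eight different GSM-NoOp
    questions whose reasoning correctly ignores the inconsequential clause.
    """
    parts = [f"Q: {s['question']}\nA: {s['reasoning']}" for s in shots[:8]]
    parts.append(f"Q: {target_question}\nA:")
    return "\n\n".join(parts)
```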
Long story short, we are testing the LLM's capacity to detect and ignore noise. And the results are quite dramatic.
Across all model sizes, performance degrades markedly, with some models showing a performance drop of almost 70%. We also observe 'weird' behaviors, such as Gemma2B's surprisingly good performance on NoOp-Symb compared to its other results, suggesting that its in-context learning capabilities (its ability to use information in the prompt as context to improve its predictions) might be better than expected.
Again, o1-preview shows the least performance degradation, which raises the question: while the results are terrible and offer clear evidence that LLMs/LRMs do not reason, is scale (training larger models on ever more data and compute) the solution, as frontier AI labs like OpenAI or Anthropic expect?
Scale Is All You Need, Or A Farce?
How you digest this research inevitably depends on your biases, i.e., on whether or not you have internalized LLMs/LRMs as the path to AGI.
- If you’re Gary Marcus, a famous LLM skeptic, you take this research by Apple as unequivocal truth that LLMs do not reason.
- If you're Jack Clark, Anthropic's cofounder, you see the glass half full and center your entire argument on the idea that smaller models are overfitted to benchmarks and, thus, more prone to issues like token bias, pointing to o1's less-degraded results to suggest that LLM scaling will do the trick.
While there's no definitive answer as to which side is correct, here are the reasons why I align more with Apple and Gary Marcus than with Jack Clark.
Why Shouldn’t I?
Clark argues that smaller models overfit benchmarks more, meaning that the model memorizes the solutions to the problems, and suggests that if we increase model size, it will be harder to overfit the dataset.
On the flip side, larger models, while undoubtedly less overfitted to any single dataset, have seen far larger amounts of data, quite possibly including benchmark variations deliberately added by OpenAI/Anthropic researchers to make their models more robust to these minor changes.
In other words, the question is whether, with scale, the LLM's problem-solving process becomes genuinely more intelligent, or whether it all boils down to memorization: we simply feed the model more data, it memorizes a larger set of information, and it appears smarter than it really is.
From what I've seen, the latter is much more probable. Despite the growing complexity of the problems AI can solve, these models are still very easily fooled in unfamiliar settings, even when the task is much simpler than other problems LLMs can solve.
But why is this?
The Task Familiarity/Complexity Conundrum
The secret might be in what François Chollet describes as the task complexity/familiarity conundrum; we shouldn’t evaluate model intelligence based on task complexity because researchers can always feed the reasoning process as part of the training data. That way, the model already knows how to solve the problem; it just needs to retrieve it from its core knowledge.
Instead, we should evaluate models based on task familiarity, or whether models can reason and solve problems they have not seen before.
This is definitely not the process we follow today, where we simply look at the model’s output and, based on the complexity of the task being solved, ascertain the model’s intelligence.
But here's precisely where the issue lies: it's almost impossible to tell whether a model is intelligent from its outputs alone, because those outputs could be memorized. Instead, we should look at the model's process to see if real reasoning is taking place.
While this is not simple to do, one option is to develop benchmarks explicitly designed to prevent memorization, i.e., to test models in situations we know for a fact they haven't seen before. In those scenarios, well-reasoned outputs could be seen as genuine evidence of System 2 reasoning (the deliberate mode humans use when consciously working through a problem) taking place inside the AI model.
And when we apply this seemingly harmless change of framing, LLM/LRM performance collapses, as shown by the ARC-AGI benchmark (which tests precisely this: how efficiently models can acquire new skills and solve problems they have not seen before). Concerningly, as you can see below, despite o1-preview having search capabilities, it reaches the same accuracy as a standard LLM.
In other words, adding Reinforcement Learning-based search on top of an LLM, a succinct description of what o1 models are, leads to absolutely no gains on memorization-resistant benchmarks.
Or, to speak plainly, o1 models aren’t better System 2 reasoners than standard LLMs in scenarios where they can’t retrieve the original solution from their past experiences (training).
The Great Humbling
I take issue with the fact that, instead of the LLM proponents who claim to be building AGI having to prove that this is the case, it is AI academia that seems forced to prove otherwise.
Shouldn't it be the other way around?
Shouldn’t we be of the opinion that, unless proven otherwise, LLMs are not the solution?
Clearly, the reason for this is none other than the obscene amounts of capital riding on this vision being true. However, as we've seen today, there are many more reasons to be skeptical about LLMs' reasoning capabilities than there are to simply believe that scale will do the trick.
Thus, to me, as long as LLMs do not demonstrate some capability to deal with unfamiliar tasks, they should not be treated as intelligent machines but, rather, as amplifiers of human intelligence that simulate it rather than embody it.
And you, what do you think?
Do you believe that the capabilities of frontier AI models are vastly overstated, whether because a lot of money is on the line or because of Silicon Valley group fever, or do you think scale really is the obvious, straight path to human-level intelligence?
Written By: Ignacio de Gregorio.
This article was originally published on Medium.