When billion-dollar AIs break down over puzzles a child can do, it’s time to rethink the hype | Gary Marcus

A research paper from Apple has taken the tech world by storm, all but demolishing the popular idea that large language models (LLMs, and their newest variant, LRMs, large reasoning models) are able to reason reliably. Some are shocked by this, others not. The well-known venture capitalist Josh Wolfe went so far as to post on X that Apple had “just GaryMarcus’d” LLMs – coining a new verb (and paying me a compliment) that refers to “the act of exposing the overhyped capabilities of artificial intelligence … by highlighting their limitations in reasoning, understanding, or general intelligence”.
Apple did this by showing that leading models such as ChatGPT, Claude and DeepSeek “may look smart – but when complexity rises, they collapse”. In short, these models are very good at a kind of pattern recognition, but they often fail when confronted with novelty that forces them beyond the limits of their training, even though, as the paper notes, they are “explicitly designed for reasoning tasks”.
As I will discuss later, there is one loose end that the paper does not tie up, but on the whole its force is hard to deny. So much so that advocates of LLMs have already partly conceded the blow, while hinting at, or at least hoping for, a happier future.
In many ways the paper echoes and amplifies an argument that I have been making since 1998: neural networks of various kinds can generalise within the distribution of the data they are exposed to, but their generalisations tend to break down beyond that distribution. A simple example: I once trained an older model to solve a very basic mathematical equation using only even-number training data. The model was able to generalise a little, solving for even numbers it had not seen before, but it was unable to do so for problems where the answer was an odd number.
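To make the point concrete, here is a minimal, hypothetical sketch of that kind of experiment – not my original 1998 setup. A small off-the-shelf network is trained to copy numbers encoded in binary, but it only ever sees even numbers, so the lowest bit is zero throughout training; the architecture and parameters here are illustrative assumptions.

```python
# A rough, hypothetical sketch (not the original 1998 experiment) of the
# even/odd generalisation failure described above: a small network learns
# to copy its input (the identity function) on binary-encoded even numbers,
# then is tested on odd numbers, whose lowest bit it has never seen set.
import numpy as np
from sklearn.neural_network import MLPRegressor

BITS = 8  # encode numbers 0..255 as 8-bit vectors

def to_bits(n: int) -> np.ndarray:
    return np.array([(n >> i) & 1 for i in range(BITS)], dtype=float)

# Training data: even numbers only, so bit 0 is always 0.
X_train = np.array([to_bits(n) for n in range(0, 256, 2)])
net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0)
net.fit(X_train, X_train)  # learn to reproduce the input exactly

def exact_match_rate(numbers) -> float:
    X = np.array([to_bits(n) for n in numbers])
    preds = (net.predict(X) > 0.5).astype(float)
    return float((preds == X).all(axis=1).mean())

print("even, in-distribution:   ", exact_match_rate(range(0, 256, 2)))  # usually high
print("odd, out-of-distribution:", exact_match_rate(range(1, 256, 2)))  # typically much lower
```

In a run like this, the network typically copies even inputs almost perfectly but mangles the lowest bit of odd inputs – the one pattern it never saw in training – which is exactly the within-distribution versus out-of-distribution gap described above.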
More than a quarter of a century later, when a task is close to the training data, these systems work well. But as they stray further from that data, they often break down, as they did in the Apple paper’s more demanding tests. Such limits arguably remain the single most serious weakness of LLMs.
The hope, as always, was that “scaling” the models by making them bigger would solve these problems. The new Apple paper dashes those hopes. The researchers challenged some of the latest, largest and most expensive models with classic puzzles, such as the Tower of Hanoi – and found that deep problems remain. Combined with many hugely expensive failed efforts to build GPT-5-level systems, this is very bad news.
The Tower of Hanoi is a classic game with three pegs and multiple discs, in which you must move all the discs from the left peg to the right peg, never stacking a larger disc on top of a smaller one. With practice, though, a bright (and patient) seven-year-old can do it.
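The puzzle also has a famously short, well-specified conventional solution – the kind of thing this piece contrasts with LLM “reasoning”. A minimal sketch of the standard recursive algorithm, in Python, with peg names chosen here purely for illustration:

```python
# The classic recursive Tower of Hanoi solution: to move n discs,
# park the top n-1 on the spare peg, move the largest disc to the
# target, then move the n-1 discs back on top of it.
def hanoi(n, source="left", spare="middle", target="right", moves=None):
    """Return the list of moves that transfers n discs from source to target."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, target, spare, moves)   # park n-1 discs on the spare peg
    moves.append((source, target))               # move the largest remaining disc
    hanoi(n - 1, spare, source, target, moves)   # bring the n-1 discs back on top
    return moves

if __name__ == "__main__":
    print(len(hanoi(8)))  # 255 moves, i.e. 2**8 - 1
```

Eight discs take 2⁸ − 1 = 255 moves: tedious for a person, trivial for a program that simply follows the recursion.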
What Apple found is that leading generative models can barely do seven discs, getting less than 80% correct, and can hardly get scenarios with eight discs right at all. It is genuinely embarrassing that LLMs cannot reliably solve Hanoi.
As Iman Mirzadeh, a co-author of the paper, told me via DM: “It’s not just about ‘solving’ the puzzle. We have an experiment where we give the solution algorithm to the model, and [the model still failed] … based on what we observe from their thoughts, their process is not logical and intelligent.”
The new paper also echoes and amplifies several arguments that the Arizona State University computer scientist Subbarao Kambhampati has been making about the newly popular LRMs. He has observed that people tend to anthropomorphise these systems, assuming that they use something like the “steps a human might take when solving a challenging problem”. And he has already shown that they in fact have the same kind of problem that Apple documents.
If you cannot use a billion-dollar AI system to solve a problem that Herb Simon (one of the founding figures of AI) solved with classical (but out of fashion) AI techniques back in 1957, the chances that models such as Claude or o3 will reach AGI seem truly remote.
So what is the loose thread I warned you about? Well, humans are not perfect either. On a puzzle like Hanoi, ordinary humans have a set of (well-known) limits that to some extent parallel what Apple discovered. Many (not all) humans stumble on versions of the Tower of Hanoi with eight discs.
But look, this is why we invented computers – and, for that matter, calculators: to reliably compute solutions to large, tedious problems. AGI should not be about perfectly replicating a human; it should be about combining the best of both worlds – human adaptiveness with computational brute force and reliability. We do not want an AGI that fails to “carry the one” in basic arithmetic just because humans sometimes do.
Whenever people ask me why I actually like AI (contrary to the widespread myth that I am against it), and why I believe that future forms of AI (though not necessarily generative systems such as LLMs) could be of great benefit to humanity, I point to the advances in science and technology we might make if we could combine the causal reasoning abilities of our best scientists with the sheer compute power of modern machines.
What the Apple paper shows, most fundamentally, regardless of how you define AGI, is that these LLMs, which have generated so much hype, are no substitute for good, well-specified conventional algorithms. (They also cannot play chess as well as conventional algorithms, cannot fold proteins like special-purpose neurosymbolic hybrids, cannot run databases as well as conventional databases, and so on.)
What this means for business is that you cannot simply drop o3 or Claude into some complex problem and expect it to work reliably. What it means for society is that we can never fully trust generative AI; its outputs are just too hit-or-miss.
One of the most striking findings in the new paper is that an LLM may well work on an easy test set (such as Hanoi with four discs) and tempt you into thinking it has built a proper, generalisable solution when it has not.
To be sure, LLMs will continue to have their uses, especially for coding, brainstorming and writing, with humans in the loop.
But anybody who thinks LLMs are a direct route to the sort of AGI that could fundamentally transform society for the good is kidding themselves.
-
Gary Marcus is a professor emeritus at New York University, the founder of two AI companies, and the author of six books, including Taming Silicon Valley