A classic brain test exposed AI's biggest weakness
Researchers gave top AI models a classic attention test used in psychology and found a major flaw. While the models could correctly name colors in short lists, their performance deteriorated sharply as the task became longer and more complex. Some leading systems fell from over 90% accuracy to nearly complete failure.
Artificial intelligence systems can write essays, answer questions, and solve complex problems. But new research suggests they may struggle with something humans do every day: staying focused on the task at hand when distractions get in the way.
Researchers led by Suketu Patel put several leading AI models through a well-known psychology experiment called the Stroop task. The results revealed a significant difference between how AI systems process information and how the human brain manages attention.
The Stroop task is a classic psychological test that has been used for decades to study attention, concentration, and self-control.
In the test, color words such as "red," "blue," or "green" are displayed in colored ink. Sometimes the word and the ink color match. For example, the word "red" might appear in red ink. Other times they conflict, such as the word "red" printed in blue ink.
Participants are asked to name the color of the ink rather than read the word itself.
That sounds simple, but it creates a challenge because reading words is an automatic habit for most people. The brain must suppress the urge to read the word and instead focus on identifying the ink color.
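For readers who think in code, the setup described above can be sketched in a few lines of Python. This is only an illustration, not the researchers' materials: the `StroopTrial` class, the `make_trial` helper, and the three-color palette are assumptions made for this example.

```python
import random
from dataclasses import dataclass

# Illustrative palette only; the article does not list the colors used in the study.
COLORS = ["red", "blue", "green"]


@dataclass(frozen=True)
class StroopTrial:
    word: str  # the color word that is printed
    ink: str   # the ink color the word is printed in

    @property
    def congruent(self) -> bool:
        # True when the word and the ink color match.
        return self.word == self.ink

    @property
    def correct_answer(self) -> str:
        # The task is to name the ink color, not to read the word.
        return self.ink


def make_trial(congruent: bool, rng: random.Random) -> StroopTrial:
    """Build one matching or conflicting trial (hypothetical helper)."""
    word = rng.choice(COLORS)
    if congruent:
        return StroopTrial(word, word)
    # For a conflicting trial, pick any ink color other than the word itself.
    ink = rng.choice([c for c in COLORS if c != word])
    return StroopTrial(word, ink)
```

In this sketch, the word "red" printed in blue ink is `StroopTrial("red", "blue")`, and its correct answer is "blue".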
Psychologists often use the task to measure what is known as executive control, a set of mental processes that helps people regulate attention, resist distractions, and stay focused on goals.
The researchers wanted to see whether modern large language models (LLMs) handle this challenge in the same way humans do.
LLMs are the AI systems behind tools such as ChatGPT, Claude, and Gemini. They are trained on enormous amounts of text and learn patterns in language, allowing them to generate responses that often appear remarkably human.
When given short lists containing five color words, the AI systems generally performed well, even when the words and colors did not match.
However, accuracy fell sharply as the lists became longer.
GPT-4o achieved 91% accuracy when working with five words. At ten words, its accuracy fell to 57%. When the list expanded to forty words, accuracy dropped to just 15%.
Claude 3.5 Sonnet maintained stable performance through lists of twenty words but then experienced a sharp decline, falling to 24% accuracy with forty-word lists.
The researchers observed similar patterns in GPT-5, Claude Opus 4.1, and Gemini 2.5.
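The accuracy figures above are the share of items for which a model named the right ink color. The article does not describe the study's exact scoring procedure, so the following is a minimal sketch under that assumption; the `accuracy` function and the sample lists are hypothetical.

```python
def accuracy(expected_inks, responses):
    """Fraction of items where the response names the expected ink color.

    Assumption: a response counts as correct only if it matches the ink
    color exactly, ignoring letter case and surrounding whitespace.
    """
    if len(expected_inks) != len(responses):
        raise ValueError("expected_inks and responses must have the same length")
    if not expected_inks:
        raise ValueError("cannot score an empty list")
    hits = sum(
        expected == response.strip().lower()
        for expected, response in zip(expected_inks, responses)
    )
    return hits / len(expected_inks)


# Made-up five-item list with one wrong answer (the fourth item).
inks = ["blue", "red", "green", "red", "blue"]
answers = ["blue", "red", "green", "green", "blue"]
print(accuracy(inks, answers))  # 0.8
```

Scoring the same way at each list length (five, ten, twenty, forty items) would produce the kind of accuracy-by-length comparison reported for the models.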
The challenge became even more difficult when matching and mismatched color words appeared together in the same list.
Under those conditions, performance deteriorated further. Accuracy for the mismatched items dropped to nearly zero i
Source
This summary was compiled automatically from ScienceDaily Tech. See the original article for the full text.