A Test So Hard No AI System Can Pass It — Yet

If you’re looking for a new reason to worry about AI, try this: Some of the world’s smartest humans are struggling to create tests that AI systems can’t pass.

For years, AI systems have been measured by giving new models a variety of standardized tests. Many of these tests consist of challenging SAT-caliber problems in areas such as mathematics, science, and logic. Comparing model results over time has been a rough measure of AI progress.

But AI systems eventually became very good at those tests, so new, more difficult tests were created, often with the types of questions graduate students might face on their exams.

Those tests haven’t held up well, either. New models from companies like OpenAI, Google, and Anthropic have scored highly on many PhD-level challenges, limiting those tests’ usefulness and leading to a scary question: Are AI systems getting too smart for us to measure?

This week, researchers at the Center for AI Safety and Scale AI unveiled a possible answer to that question: a new evaluation, called “Humanity’s Last Exam,” which they claim is the hardest test ever administered to AI systems.

Humanity’s Last Exam is the brainchild of Dan Hendrycks, a well-known AI safety researcher and the director of the Center for AI Safety. (The test’s original name, “Humanity’s Last Stand,” was discarded for being too dramatic.)

Mr. Hendrycks worked with Scale AI, an artificial intelligence company where he serves as an adviser, to put together the test, which consists of roughly 3,000 multiple-choice and short-answer questions designed to probe AI systems’ capabilities in areas ranging from analytic philosophy to rocket engineering.

The questions were submitted by experts in these fields, including college professors and prize-winning mathematicians, who were asked to come up with extremely difficult questions to which they knew the answers.

Here, try your hand at a question about hummingbird anatomy from the exam:

Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.

Or, if physics is more your speed, try this:

A block is placed on a horizontal rail, along which it can slide frictionlessly. It is attached to the end of a massless, rigid rod of length R. A mass is attached at the other end. Both objects have weight W. The system is initially stationary, with the mass directly above the block. The block is given an infinitesimal push, parallel to the rail. Assume the system is designed so that the rod can rotate a full 360 degrees without interruption. When the rod is horizontal, it carries tension T1. When the rod is vertical again, with the mass directly below the block, it carries tension T2. (Both quantities can be negative, which would indicate that the rod is in compression.) What is the value of (T1−T2)/W?

(I’d print the answers here, but that would spoil the test for any AI systems being trained on this column. Also, I’m far too dumb to verify the answers myself.)
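If you’d rather make a computer do the algebra than trust either of us, the rod problem can be checked symbolically. Below is a minimal sketch, assuming the standard idealizations (a frictionless rail and a rigid, massless rod, so that horizontal momentum and mechanical energy are both conserved); it derives the tensions at run time rather than printing the answer on the page.

    import sympy as sp

    th = sp.symbols('theta')  # rod's angle from the upward vertical
    g, R, m = sp.symbols('g R m', positive=True)

    # With no friction, conservation of horizontal momentum pins the center
    # of mass in place, and conservation of energy then gives the angular
    # speed squared as a function of the angle (here W = m*g).
    w2 = 2*g*(1 - sp.cos(th)) / (R*(1 - sp.cos(th)**2/2))  # theta_dot**2
    alpha = sp.diff(w2, th) / 2                            # theta_ddot

    # Coordinates of the swinging mass, measured from the fixed center of mass.
    x = R*sp.sin(th)/2
    y = R*sp.cos(th)

    # Chain rule: acceleration along a path parametrized by theta(t).
    xdd = sp.diff(x, th, 2)*w2 + sp.diff(x, th)*alpha
    ydd = sp.diff(y, th, 2)*w2 + sp.diff(y, th)*alpha

    # Newton's second law for the mass: positive T means the rod pulls the
    # mass toward the block (tension), negative means compression.
    T = -(m*xdd*sp.sin(th) + m*(ydd + g)*sp.cos(th))

    T1 = T.subs(th, sp.pi / 2)             # rod horizontal
    T2 = T.subs(th, sp.pi)                 # rod vertical, mass below the block
    print(sp.simplify((T1 - T2) / (m*g)))  # the exam's (T1 - T2)/W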

The questions for Humanity’s Last Exam went through a two-step filtering process. First, submitted questions were given to leading AI models to solve.

If the models could not answer them (or if, on multiple-choice questions, the models did worse than random guessing), the questions were passed to a group of human reviewers, who refined them and verified the correct answers. Experts who wrote the highest-rated questions were paid between $500 and $5,000 per question and received credit for contributing to the exam.
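In code, that gate is simple to sketch. Nothing below is Scale AI’s actual pipeline; the class, the function names, the retry count, and the better-than-chance test are all hypothetical, meant only to illustrate the two-stage logic the researchers describe.

    from dataclasses import dataclass, field
    from typing import Callable, Optional

    @dataclass
    class Question:
        text: str
        answer: str
        choices: list[str] = field(default_factory=list)  # empty -> short answer

    def stumps_the_models(q: Question,
                          models: list[Callable[[str], str]],
                          trials: int = 5) -> bool:
        """Stage 1: keep a question only if every leading model fails it or,
        on multiple choice, does no better than random guessing."""
        for model in models:
            hits = sum(model(q.text) == q.answer for _ in range(trials))
            if q.choices:
                if hits > trials / len(q.choices):  # beats chance accuracy
                    return False
            elif hits > 0:  # solved a short-answer question outright
                return False
        return True

    def two_stage_filter(submissions: list[Question],
                         models: list[Callable[[str], str]],
                         review: Callable[[Question], Optional[Question]],
                         ) -> list[Question]:
        """Stage 2: surviving questions go to human reviewers, who refine
        the wording and verify the answer (returning None rejects one)."""
        survivors = [q for q in submissions if stumps_the_models(q, models)]
        return [r for q in survivors if (r := review(q)) is not None]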

Kevin Zhou, a postdoctoral researcher in theoretical particle physics at the University of California, Berkeley, submitted a handful of questions to the exam. Three of his questions were chosen, all of which he told me were “along the upper range of what one might see on a graduate exam.”

Mr. Hendrycks, who helped create a widely used AI test known as Massive Multitask Language Understanding, or MMLU, said he was inspired to create harder AI tests by a conversation with Elon Musk. (Mr. Hendrycks also works as a safety adviser to Mr. Musk’s artificial intelligence company, xAI.) He said Mr. Musk had raised concerns about the existing tests given to AI models, which he thought were too easy.

“Elon looked at the MMLU questions and said, ‘These are university level. I want things that a world-class expert can do,’” Mr. Hendrycks said.

There are other tests that attempt to measure advanced AI capabilities in specific fields, such as FrontierMath, a test developed by Epoch AI, and ARC-AGI, a test developed by the AI researcher François Chollet.

But Humanity’s Last Exam aims to determine how good AI systems are at answering complex questions across a wide range of academic subjects, giving us what might be thought of as a general intelligence score.

“We’re trying to estimate the extent to which AI can automate a lot of really difficult intellectual work,” Mr. Hendrycks said.

Once the list of questions was compiled, the researchers gave Humanity’s Last Exam to six leading AI models, including Google’s Gemini 1.5 Pro and Anthropic’s Claude 3.5 Sonnet. All of them failed miserably. OpenAI’s o1 system scored the highest of the group, at 8.3 percent.

(The New York Times has sued OpenAI and its partner, Microsoft, accusing them of copyright infringement of news content related to AI systems. OpenAI and Microsoft have denied those claims.)

Mr. Hendrycks said he expected those scores to rise quickly, potentially surpassing 50 percent by the end of the year. At that point, he said, AI systems might be considered “world-class oracles,” capable of answering questions on any topic more accurately than human experts. And we might have to look to other methods of measuring AI’s effects, such as examining economic data or judging whether it can make new discoveries in fields like mathematics and science.

“You can imagine a better version of this where we can give questions that we don’t know the answers to yet, and we’re able to verify whether the model is able to help solve them for us,” said Summer Yue, Scale AI’s director of research and an organizer of the exam.

Part of what’s so confusing about AI progress these days is how jagged it is. We have AI models that can diagnose diseases more effectively than human doctors, win silver medals at the International Mathematical Olympiad, and beat top human programmers on competitive coding challenges.

But those same models sometimes struggle with basic tasks, like arithmetic or writing metered poetry. That has given them a reputation for being amazingly good at some things and completely useless at others, and it has created very different impressions of how fast AI is improving, depending on whether you look at its best or worst outputs.

That jaggedness has also made these models hard to measure. I wrote last year that we need better evaluations for AI systems. I still believe that. But I also believe we need more creative methods of tracking AI progress that don’t rely on standardized tests, because most of what humans do (and what we fear AI will do better than us) can’t be captured on a written exam.

Mr. Zhou, the theoretical particle physicist who submitted questions to Humanity’s Last Exam, told me that while AI models were often impressive at answering complex questions, he did not consider them a threat to him and his colleagues, because their jobs involve much more than spitting out correct answers.

“There’s a big gulf between what it means to take an exam and what it means to be a practicing physicist and researcher,” he said. “Even an AI that can answer these questions might not be ready to help in research, which is inherently less structured.”
