LLM Benchmarking: Surprising Task Complexity Gains

The main purpose of many large language models (LLMs) is to produce compelling text that is as close as possible to being indistinguishable from human writing. That is a major reason why it is so hard to gauge the relative performance of LLMs using traditional benchmarks: the quality of writing does not necessarily correlate with the metrics traditionally used to measure processor performance, such as instruction execution rate.
But researchers at METR (Model Evaluation & Threat Research), in Berkeley, Calif., have come up with an ingenious idea. First, identify a series of tasks of varying complexity and record the average time it takes a group of humans to complete each task. Then have various versions of LLMs complete the same tasks, noting the cases in which a given LLM version completes the task successfully with some level of reliability, say, 50 percent of the time. Plots of the resulting data confirm that as time goes on, successive generations of LLMs can reliably complete longer and longer (more and more complex) tasks.
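To make the 50 percent threshold concrete, here is a minimal Python sketch of one way such a threshold could be estimated from pass/fail records. It assumes a logistic curve relating success probability to the logarithm of the human completion time; that modeling choice is an assumption of this sketch, not something the article specifies. The function name fit_time_horizon and the sample data are hypothetical and purely illustrative.

```python
import math

def fit_time_horizon(records, steps=5000, lr=0.1):
    """Estimate the task length (minutes) at which success probability is 50%.

    records: list of (human_minutes, succeeded) pairs for one model.
    Fits p = sigmoid(a * (x - mean_x) + b) with x = log2(minutes),
    using plain gradient descent on the log loss (an assumed method).
    """
    xs = [math.log2(minutes) for minutes, _ in records]
    ys = [1.0 if ok else 0.0 for _, ok in records]
    mean_x = sum(xs) / len(xs)  # centering keeps gradient descent well behaved
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a * (x - mean_x) + b)))
            grad_a += (p - y) * (x - mean_x)
            grad_b += p - y
        a -= lr * grad_a / len(xs)
        b -= lr * grad_b / len(xs)
    if a >= 0:
        raise ValueError("success rate does not fall as tasks get longer")
    # p equals 0.5 where a * (x - mean_x) + b = 0; convert back from log2.
    return 2 ** (mean_x - b / a)

# Hypothetical data: four attempts per task length, value = successes out of 4.
SUCCESSES = {1: 4, 2: 4, 4: 3, 8: 3, 16: 2, 32: 1, 64: 1, 128: 0, 256: 0}
RECORDS = [(minutes, i < wins)
           for minutes, wins in SUCCESSES.items()
           for i in range(4)]

horizon = fit_time_horizon(RECORDS)
print(f"50% time horizon: about {horizon:.1f} minutes")
```

Repeating the fit for each model version and plotting the resulting horizon against release date would give the kind of trend plot described above.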
No surprise there. But the shock was that this improvement in the ability of LLMs to complete harder tasks has been exponential, with a doubling period of about seven months.
IEEE Spectrum reached out to Megan Kinniment, one of the authors of a METR research paper describing this work and its surprising implications.
LLM performance evaluation
Did you suspect that you would get these results?
Megan Kinniment: I, at least personally, didn’t expect us to find a trend as clear as the one we did. Models have definitely been getting better quickly, though. So a fast rate of progress wasn’t entirely unexpected.
As you point out in the paper, it is always dangerous to look into the future and extrapolate. However, you suggest that there is a likelihood of this trend continuing, which means that by 2030 we could be looking at monthlong tasks being within the capability of the most advanced large language models.
Kinniment: Let’s have a look at that. By one month, we mean around 167 working hours, so the number of [human] working hours in one month. And that is at 50 percent reliability. But longer tasks typically seem to require higher reliability to actually be useful. So that is something that could make the real-world economic impact less intense than what is predicted.
There are a number of things that would have to continue for this prediction to come true. Hardware would have to keep improving at roughly the current rate; software would have to keep improving. You would need sufficient training data, and that training data would have to remain available, to continue training at the rapid pace seen in recent years.
Kinniment: The forecasts and the dates that we found are just an extrapolation of the trend that we see on our task suite. [The trends are] not taking into account real-world factors or changes in scaling.
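The arithmetic behind this kind of extrapolation can be sketched in a few lines of Python. The seven-month doubling period and the 167-hour target come from the article; the one-hour starting horizon is an assumption chosen only for illustration, and the function names are hypothetical. Like the forecasts discussed in the interview, the sketch simply extends a constant doubling rate and ignores real-world factors.

```python
import math

DOUBLING_MONTHS = 7.0   # doubling period reported in the article
TARGET_HOURS = 167.0    # one work-month, as defined in the interview
START_HOURS = 1.0       # ASSUMPTION: illustrative starting horizon, not from the article

def horizon_after(months, start_hours=START_HOURS, doubling_months=DOUBLING_MONTHS):
    """Task horizon (hours) after `months` of steady exponential growth."""
    return start_hours * 2 ** (months / doubling_months)

def months_until(target_hours, start_hours=START_HOURS, doubling_months=DOUBLING_MONTHS):
    """Months needed for the horizon to grow from start_hours to target_hours."""
    doublings = math.log2(target_hours / start_hours)
    return doublings * doubling_months

months = months_until(TARGET_HOURS)
print(f"Doublings needed: {math.log2(TARGET_HOURS / START_HOURS):.2f}")
print(f"Months needed:    {months:.1f}")
```

Under these assumptions the answer is roughly 52 months, a little over four years. A different starting horizon shifts the date by seven months for every factor of two.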
If a large language model could somehow achieve the ability to complete 167-hour tasks with 50 percent reliability, what kinds of things would that put within the realm of a large language model’s capabilities?
Kinniment: Well, the big one that we often think about is accelerating AI R&D research itself. To the extent that you can make models that accelerate your company’s ability to make better models, you could end up in a situation where AI capabilities develop very rapidly.
What exponential growth in artificial intelligence means for humanity
What you are describing is reminiscent of the idea of the singularity, where you have AIs creating other AIs on their own, unassisted by humans.
Kinniment: I think that you could get acceleration that is quite intense and makes things meaningfully more difficult to control without it necessarily resulting in this massively explosive growth. There are reasons to think that you might have various bottlenecks that slow things down in practice. Even then, the pace of progress could still end up limited by things like hardware and robotics. But yes, the singularity is certainly an idea that is relevant to this whole sector of things.
Things could go quite quickly, but it is not as if it is the singularity or nothing. [AI-development rates] that were mild compared to a singularity could still be quite intense in terms of how the world would need to adapt.
You indicated in the paper that some large language models seem to be improving in their ability to adapt and to correct their mistakes.
Kinniment: I think it has actually been a relatively gradual thing since ChatGPT, and perhaps before that. They are less likely to get stuck. They are a bit better at changing strategies when things are not working, but that is a bit hit or miss. And they are definitely a lot better at doing things than they used to be, and better at using tools. But it does seem like there are some fundamental aspects that have not changed a great deal. One thing that I like to look at when I get a new model is this: On each task, we give the model a number of tokens, a number of words that it can say. And if you imagine giving them more and more time, or more and more tokens, to do a task, how does that affect how likely they are to succeed? Basically, what we see is that they plateau quite strongly. There is a point at which you give them more tokens and it does not really help. And for each new model, that plateau gets a bit higher.
Megan Kinniment was on the team at METR that published the results of a study of LLM performance.
Humans, I imagine, also have diminishing returns. But if you give a human lots and lots of time to do something, they will probably do a better job, especially if you have multiple humans. And I think I would be pretty impressed by a large language model that, even if its absolute score were lower, seemed like it could just keep doing things and improving. That could be a big deal.
You found that models performed worse on tasks that had higher “messiness” scores. Was there any signal in the data that this state of affairs might be changing? In other words, that models might be gaining greater ability to handle tasks with higher messiness?
Kinniment: Messiness was a measure that I made to try to get a somewhat quantitative measure of how unrealistic our tasks were compared with the real world. Most of our tasks are not that messy. It is a 16-point scale. The average is about 3, and the messiest tasks are about 8 out of 16.
So what would a task rated 16 look like in terms of messiness?
Kinniment: Something like espionage, where you have a lot of resource limitations. It is very punishing. You have agents that are actively optimizing against you. It is easy to mess up. It is novel.
Are you all planning to follow up on this study?
Kinniment: OpenAI published o3, and o3 was a little more capable than the trend would have suggested. So we are doing some amount of follow-up in terms of measuring other models. We want to stay focused on informing the world about AI development and the catastrophic risks from AI systems.
Catastrophic risks from advanced artificial intelligence
What are the most likely catastrophic risks from AI? I mean, the ones that come to my mind are massive dislocations in employment if and when AI becomes supremely capable.
Kinniment: When we talk about catastrophic risks, we are not just talking about mass unemployment. We are talking about things more like this: If everybody became unemployed, or you just did not need human workers for the vast majority of things, you might not need human workers to maintain your military, or you would need far fewer of them. That could make it easier for somebody to carry out a coup, essentially. Or, if you have a vast quantity of geniuses in a data center, that would make you a very powerful person. If you use that to produce military hardware, it is possible we could get a concentration of power, and you might not have a democratic state anymore.
All this would happen, obviously, without any form of consciousness. These would be machines with the ability to scheme and plot and plan, but without the kind of consciousness that characterizes the human ability to do so. Consciousness is not necessary for this.
Kinniment: Consciousness is a hard problem. I am not sure whether consciousness is necessary for any particular behavior. It feels a bit above my pay grade. I also think it is not crazy that they could be conscious at that point. They would be very intelligent.
So you think it is possible that they may be conscious at some point in the future?
Kinniment: I mean, if they are as intelligent as you and I, then it does not seem quite crazy. It does not seem crazy for them not to be, and it does not seem crazy for them to be.