
Meta gets caught gaming AI benchmarks with Llama 4

Over the weekend, Meta dropped two new Llama 4 models: Scout, a smaller model, and Maverick, a mid-size model that the company claims can beat GPT-4o and Gemini 2.0 Flash "across a broad range of widely reported benchmarks."

Maverick quickly secured the number-two spot on LMArena, the AI benchmark site where humans compare outputs from different systems and vote on the best one. In its press release, Meta highlighted Maverick's Elo score of 1417, which placed it above OpenAI's GPT-4o and just under Gemini 2.5 Pro. (A higher Elo score means the model wins more often in head-to-head matchups against competitors.)
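To make the parenthetical concrete: under the standard Elo formula, a rating gap translates directly into an expected head-to-head win rate. A minimal sketch (the 1417 figure is from Meta's release; the second rating is an illustrative number, not a real leaderboard score):

```python
def expected_win_rate(elo_a: float, elo_b: float) -> float:
    """Probability that model A beats model B under the standard Elo formula."""
    return 1 / (1 + 10 ** ((elo_b - elo_a) / 400))

# Equal ratings: a coin flip.
print(round(expected_win_rate(1417, 1417), 2))  # 0.5

# A 30-point gap (1387 is a made-up comparison rating) is only a modest edge.
print(round(expected_win_rate(1417, 1387), 3))  # 0.543
```

This is why small Elo differences near the top of a leaderboard correspond to only slightly-better-than-even odds in individual matchups.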

The achievement seemed to position Llama 4 as a serious challenger to the state-of-the-art closed models from OpenAI, Anthropic, and Google. Then AI researchers digging through Meta's documentation discovered something unusual.

In fine print, Meta acknowledges that the version of Maverick tested on LMArena isn't the same one that's available to the public. According to Meta's own materials, it deployed an "experimental chat version" of Maverick to LMArena that was specifically "optimized for conversationality," TechCrunch first reported.

"Meta's interpretation of our policy did not match what we expect from model providers," LMArena posted on X two days after the model's release. "Meta should have made it clearer that 'Llama-4-Maverick-03-26-Experimental' was a customized model to optimize for human preference. As a result, we are updating our leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion doesn't occur in the future."

A Meta spokesperson did not respond to LMArena's statement in time for publication.

While what Meta did with Maverick isn't explicitly against LMArena's rules, the site has shared concerns about gaming the system and has taken steps "to prevent overfitting and benchmark leakage." When companies can submit specially tuned versions of their models for testing while releasing different versions to the public, benchmark rankings like LMArena's become less meaningful as indicators of real-world performance.

"It's the most widely respected general benchmark because all of the other ones suck," independent AI researcher Simon Willison says. "When Llama 4 came out, the fact that it came second in the arena, just after Gemini 2.5 Pro, really impressed me, and I'm kicking myself for not reading the small print."

Shortly after Maverick and Scout launched, the AI community began discussing a rumor that Meta had also trained its Llama 4 models to perform better on benchmarks while hiding their real limitations. Meta's VP of generative AI, Ahmad Al-Dahle, addressed the accusations in a post on X: "We've also heard claims that we trained on test sets -- that's simply not true and we would never do that. Our best understanding is that the variable quality people are seeing is due to needing to stabilize implementations."

"It's a very confusing release in general."

Some also noticed that Llama 4 was released at an odd time. Saturday isn't usually when big AI news drops. When someone on Threads asked why Llama 4 was launched over the weekend, Meta CEO Mark Zuckerberg replied: "That's when it was ready."

"It's a very confusing release in general," says Willison, who closely follows and documents AI models. "The model score we got there is completely worthless to me. I can't even use the model that got the high score."

Meta's road to releasing Llama 4 wasn't entirely smooth. According to a recent report from The Information, the company repeatedly pushed back the launch because the model failed to meet internal expectations. Those expectations were especially high after DeepSeek, an open-source AI startup from China, released an open-weight model that generated a ton of buzz.

Ultimately, deploying an optimized model on LMArena puts developers in a difficult position. When selecting models like Llama 4 for their applications, they naturally look to benchmarks for guidance. But as Maverick shows, those benchmarks can reflect capabilities that aren't actually present in the models the public can access.

As AI development accelerates, this episode shows how benchmarks are becoming battlegrounds. It also shows how eager Meta is to be seen as an AI leader, even if that means gaming the system.
