Nvidia Blackwell Leads AI Inference, AMD Challenges

In the latest round of machine learning benchmark results from MLCommons, computers built around Nvidia's new Blackwell GPU architecture outperformed all others. But the latest spin on AMD's Instinct GPUs, the MI325X, proved a match for the Nvidia H200, the product it was designed to take on. The comparable results came mostly on tests of one of the smaller-scale large language models, Llama2 70B (named for its 70 billion parameters). And in an effort to keep pace with a rapidly changing AI landscape, MLPerf added three new benchmarks to better reflect where machine learning is heading.

MLPerf runs benchmarking for machine learning systems in an effort to provide an apples-to-apples comparison between computer systems. Submitters use their own software and hardware, but the underlying neural networks must be the same. There are now a total of 11 benchmarks for servers, with three added this year.

It has been hard to keep up with the field, says Miro Hodak, cochair of MLPerf Inference. ChatGPT appeared only in late 2022, OpenAI unveiled its first large language model (LLM) that can reason through tasks last September, and LLMs have grown exponentially: GPT-3 had 175 billion parameters, while GPT-4 is believed to have nearly 2 trillion. Because of this breakneck innovation, "we have increased the cadence of getting new benchmarks into the field," says Hodak.

Two of the new benchmarks involve LLMs. The popular and relatively compact Llama2 70B is already a well-established MLPerf benchmark, but the consortium wanted something that mimics how people interact with chatbots today. So the new "Llama2-70B Interactive" benchmark tightens the requirements: computers must produce at least 25 tokens per second under any circumstances and cannot take more than 450 milliseconds to begin an answer.
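As a minimal sketch of the two constraints just described (the constants below are the benchmark limits from the article; the function name and example timings are made up for illustration):

```python
# Llama2-70B Interactive limits, per the article: at least 25 tokens/s
# and no more than 450 ms before the first token of the answer appears.
MIN_TOKENS_PER_SECOND = 25.0
MAX_TIME_TO_FIRST_TOKEN_S = 0.450

def meets_interactive_limits(time_to_first_token_s: float,
                             tokens_generated: int,
                             generation_time_s: float) -> bool:
    """Return True if a single response satisfies both latency constraints."""
    throughput = tokens_generated / generation_time_s
    return (time_to_first_token_s <= MAX_TIME_TO_FIRST_TOKEN_S
            and throughput >= MIN_TOKENS_PER_SECOND)

# First token after 400 ms, then 500 tokens in 18 s (about 27.8 tokens/s):
print(meets_interactive_limits(0.400, 500, 18.0))  # True
# Same start, but 500 tokens in 25 s is only 20 tokens/s:
print(meets_interactive_limits(0.400, 500, 25.0))  # False
```

Both conditions must hold at once: a system that streams tokens quickly but stalls before the first one still fails the benchmark.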

Seeing the rise of agentic AI, networks that can reason through complex tasks, MLPerf sought an LLM test with some of the characteristics needed for that. It chose Llama3.1 405B for the job. This LLM has what's called a wide context window, a measure of how much information it can take in at once: 128,000 tokens, more than 30 times as much as Llama2 70B.

The final new benchmark, called RGAT, is what's known as a graph attention network. Its job is to classify information in a network. For example, the dataset used to test RGAT consists of scientific papers, which all have relationships among authors, institutions, and fields of study, adding up to 2 terabytes of data. RGAT must classify the papers into just under 3,000 topics.
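To make the mechanism concrete, here is a toy, single-head graph-attention layer in the spirit of the benchmark described above (this is an illustrative sketch, not the RGAT reference implementation; all shapes, data, and the tiny "citation" graph are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Each node (think "paper") updates its features by attending to its
# neighbors; a final linear layer then scores topic classes per node.
n_nodes, in_dim, out_dim, n_classes = 5, 8, 16, 3
X = rng.normal(size=(n_nodes, in_dim))        # node features
A = np.eye(n_nodes)                           # adjacency with self-loops
A[0, 1] = A[1, 0] = A[1, 2] = A[2, 1] = 1     # a small chain of "citations"

W = rng.normal(size=(in_dim, out_dim))        # shared projection
a = rng.normal(size=(2 * out_dim,))           # attention vector

H = X @ W                                     # projected features
# Raw attention logits e[i, j] for every connected pair (i, j)
logits = np.full((n_nodes, n_nodes), -np.inf)
for i in range(n_nodes):
    for j in range(n_nodes):
        if A[i, j]:
            s = a @ np.concatenate([H[i], H[j]])
            logits[i, j] = np.maximum(0.2 * s, s)   # LeakyReLU

att = np.exp(logits - logits.max(axis=1, keepdims=True))
att /= att.sum(axis=1, keepdims=True)         # softmax over neighbors only
H_out = att @ H                               # attention-weighted aggregation

W_cls = rng.normal(size=(out_dim, n_classes))
topics = (H_out @ W_cls).argmax(axis=1)       # one predicted topic per node
print(topics.shape)                           # (5,)
```

The benchmark's real graph is vastly larger (terabytes of paper metadata, nearly 3,000 classes), but the per-node attention-then-classify pattern is the same.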

Blackwell and Instinct results

Nvidia continued to dominate the MLPerf benchmarks, both with its own submissions and with those from some 15 partners, such as Dell, Google, and Supermicro. Both its first- and second-generation Hopper architecture GPUs, the H100 and the memory-enhanced H200, made strong showings. "We managed to get another 60 percent performance over the past year" out of Hopper, which went into production in 2022, says Dave Salvator, director of accelerated computing products at Nvidia. "It still has some headroom in terms of performance."

But it was Nvidia's Blackwell GPU architecture, in the form of the B200, that truly dominated. "The only thing faster than Hopper is Blackwell," says Salvator. The B200 packs in 36 percent more high-bandwidth memory than the H200, but more important, it can perform key machine learning math using numbers with a precision as low as 4 bits instead of 8 bits. Lower-precision compute units are smaller, so more of them fit on the GPU, which makes for faster AI computing.
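A back-of-the-envelope sketch of why the drop from 8-bit to 4-bit numbers matters (this is illustrative arithmetic, not Nvidia's implementation): halving the bits per value halves the memory and bandwidth a model's weights consume.

```python
# Weights stored at lower precision take proportionally less memory,
# so more values fit in the same memory and data paths per cycle.
def model_weight_bytes(num_parameters: float, bits_per_parameter: int) -> float:
    return num_parameters * bits_per_parameter / 8

params = 70e9  # a Llama2-70B-sized model, for scale
fp8_bytes = model_weight_bytes(params, 8)
fp4_bytes = model_weight_bytes(params, 4)
print(fp8_bytes / 1e9, fp4_bytes / 1e9)  # 70.0 35.0 (gigabytes)
```

The same halving applies to the silicon itself: 4-bit arithmetic units are smaller than 8-bit ones, so a given die area holds more of them.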

On the Llama3.1 405B benchmark, a Supermicro system with eight B200s delivered nearly four times the tokens per second of an eight-H200 system from Cisco. And the same Supermicro system was three times as fast as an H200 computer on the interactive version of Llama2-70B.

Nvidia used the combination of its Blackwell GPUs and Grace CPUs, called GB200, to demonstrate how well its NVL72 data links can bind multiple servers in a rack together, so that they perform as if they were one giant GPU. In an unverified result the company shared with reporters, a full rack of GB200-based computers delivers 869,200 tokens per second on Llama2 70B. The fastest system reported in this round of MLPerf was an Nvidia B200 server that delivered 98,443 tokens per second.
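A quick comparison of those two figures (the rack number is the company's own, unverified by MLPerf; the 72- and 8-GPU counts are assumptions from the NVL72 name and the eight-GPU server submissions described above):

```python
# Rack-scale vs. single-server Llama2 70B throughput, from the article.
rack_tokens_per_s = 869_200    # full GB200 NVL72 rack (unverified figure)
server_tokens_per_s = 98_443   # fastest submitted B200 server

print(round(rack_tokens_per_s / server_tokens_per_s, 1))  # 8.8x the server
print(72 / 8)                                             # 9.0x the GPU count
```

An 8.8x throughput gain from 9x the GPUs suggests the NVL72 interconnect scales nearly linearly, which is the point Nvidia was making.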

AMD is positioning its latest Instinct GPU, the MI325X, as offering performance competitive with Nvidia's H200. The MI325X has the same architecture as its predecessor, the MI300, but adds even more high-bandwidth memory and memory bandwidth: 256 gigabytes and 6 terabytes per second (boosts of 33 percent and 13 percent, respectively).

Adding more memory is a play to handle larger and larger LLMs. "The larger models can take advantage of these GPUs because the model can fit in a single GPU or a single server," says AMD's director of data-center GPU marketing. "So you don't have to have that communication of going from one GPU to another GPU, or one server to another server. When you take out those communications, your latency improves quite a bit." AMD was able to take advantage of the additional memory through software optimization to boost the inference speed of DeepSeek-R1 eightfold.

On the Llama2-70B test, MI325X computers came within 3 to 7 percent of similarly equipped H200 systems. On image generation, the MI325X system was within 10 percent of the Nvidia H200 computer.

The other notable AMD result this round came from its partner Mangoboost, which showed nearly fourfold performance on the Llama2-70B test by running the computation across four computers.

Intel has historically entered CPU-only systems in the inference competition to show that for some workloads you don't really need a GPU. This round saw the first data from Intel's Xeon 6 chips, formerly known as Granite Rapids and made using Intel's 3-nanometer process. At 40,285 samples per second, the best image-recognition result from a dual-Xeon 6 computer was about one-third the performance of a Cisco computer with two Nvidia H100s.

Compared with Xeon 5 results from October 2024, the new CPU delivers about an 80 percent improvement on that benchmark, and an even bigger boost on object detection and medical imaging. Since it first began submitting Xeon results in 2021 (with the Xeon 3), the company has achieved an 11-fold performance gain on ResNet.

For now, Intel appears to have left the field in the AI accelerator chip battle. Its alternative to the Nvidia H100, Gaudi 3, appeared neither in the new MLPerf results nor in version 4.1, released last October. Gaudi 3's release came later than planned because its software was not ready. In opening remarks at Intel Vision 2025, the company's invite-only customer conference, CEO Lip-Bu Tan seemed to apologize for Intel's AI efforts. "I'm not happy with our current position," he told attendees. "You're not happy either. I hear you loud and clear. We are working on a competitive system. It won't happen overnight, but we will get there for you."

Google's TPU v6e chip also made a showing, though the results were limited to the image-generation task. At 5.48 queries per second, the 4-TPU system saw a 2.5x boost over a similar computer using its predecessor, the TPU v5e, in the October 2024 results. Even so, 5.48 queries per second was roughly in line with a similarly sized Lenovo computer using Nvidia H100s.

This post was corrected on 2 April 2025 to give the correct value for the MI325X's high-bandwidth memory.
