Elon Musk Said Grok 4 Was the “Smartest AI in the World,” But Its Leaderboard Scores Just Came Out and They Tell a Different Story

Elon Musk has been boasting about what he says are the incredible capabilities of xAI’s new Grok 4 AI chatbot.

“Grok 4 is smarter than almost all graduate students in all disciplines, simultaneously,” Musk bragged, adding that Grok 4 was “the smartest AI in the world.”

Is it really? Intelligence was a hard thing to measure even before AI hit the scene, but certain tests can provide something of a clue.

One prominent platform for doing so is the UC Berkeley-developed LMArena leaderboard, which crowdsources rankings of AI models by having users rate their responses in categories ranging from creative writing and coding to math and vision.

In its latest scores, Grok 4 ranked third place overall and on text generation. Make no mistake, that is impressive, but it’s still trailing behind advanced models from Google and OpenAI. (Specifically, Google’s Gemini 2.5 placed first and OpenAI’s o3 and 4o reasoning models tied for second, with GPT-4.5 tied with Grok 4 for third.)

While Grok is clearly a fearsome competitor in the arenas of racism and antisemitism, in other words, even its latest release clearly falls short of being the “smartest AI in the world.” (This isn't entirely surprising; Musk has a long history of fibbing in his professional life, political activities, and even his hobbies.)

Perhaps the one saving grace for Grok is the suggestion, per expert criticism, that Berkeley’s chatbot arena may be more vibes-based than strictly scientific.

According to a recent study, conducted by a consortium of AI researchers and led by the machine learning firm Cohere, the leaderboard allegedly has a bunch of “systematic issues that have resulted in a distorted playing field.” Among the serious allegations raised by the researchers is the claim that the arena conducts “undisclosed private testing” before publicly releasing scores, and that rankings can be retracted at will.

Soon after the paper’s release, it was revealed that the version of Meta’s LLaMA 4 that had been used by the leaderboard wasn’t the same one that had been released publicly, a bait-and-switch ploy on Meta’s part to charm the human voters behind the arena.

Though an apology was issued and Meta was thrown under the bus for its sketchy attempts to rig the game, it was still a really bad look that marred the chatbot arena’s credibility. What does that mean for Grok, though? We’ll have to ask the smartest AI in the world.
