On Tuesday, Elon Musk’s xAI launched its latest LLM, Grok 3. During the live-streamed event, the company showcased Grok 3’s “impressive” performance and suggested a future where AI not only understands the universe but also helps us understand it.
“If all goes well, SpaceX will send Starship rockets to Mars in two years with Optimus robots and Grok,” Musk said.
The name Grok, borrowed from Robert Heinlein’s Stranger in a Strange Land, means to understand something deeply and intuitively. Independent benchmarks showed that Grok 3 outperformed Google Gemini 2.0 Pro, DeepSeek V3, Claude 3.5 Sonnet, and GPT-4 in tests such as AIME, GPQA, and LCB.
The Truth Behind Grok’s Success
xAI increased its compute capacity to boost Grok 3’s performance. The model was developed in two stages: first, 122 days of synchronous training on 100,000 GPUs, followed by 92 days of scaling up to 200,000 GPUs.
“It took us 122 days to get the first 100K GPUs up and running, which was a monumental effort. We believe it’s the largest fully connected H100 cluster of its kind. But we didn’t stop there. We decided to double the cluster size to 200K,” said Igor Babuschkin, co-founder of xAI.
Like OpenAI’s o3-mini and DeepSeek-R1, Grok 3 has advanced reasoning capabilities. An xAI representative said that by taking the best pre-trained model and continuing its training with reinforcement learning, the model develops additional reasoning capabilities, resulting in significant improvements in both training and testing performance.
The reasoning models are available through the Grok app, where users can prompt Grok 3 to “Think” or, for more complex inquiries, activate “Big Brain” mode, which utilises extra computational power for deeper reasoning. According to xAI, these models are particularly effective for tackling questions in mathematics, science, and programming.
The model beats OpenAI’s o3-mini (high), DeepSeek-R1, and Google’s Gemini 2.0 Flash Thinking. However, some in the industry feel it is not exactly a breakthrough.
Dharmesh Shah, founder and CTO of HubSpot, noted that it felt more like DeepSeek but with much more compute. He said he was looking forward to experimenting with the API, which would be launched in the following weeks.
Meanwhile, former OpenAI researcher and Eureka Labs founder Andrej Karpathy, who had early access to Grok 3, tested it and shared his insights. According to him, the model’s capabilities are somewhere around the state-of-the-art territory of OpenAI’s strongest models (o1-pro, $200/month) and slightly better than DeepSeek-R1 and Gemini 2.0 Flash Thinking.
He further added that it is quite an incredible feat, considering that the team started from scratch just about a year ago. “This timescale to reach state-of-the-art territory is unprecedented,” Karpathy said in a post on X.
Consulting firm SemiAnalysis reported that DeepSeek had access to around 50,000 NVIDIA GPUs, consisting of 10,000 H800s, 10,000 H100s, and a substantial number of H20s. It will be interesting to see what DeepSeek can accomplish if it scales up to 200,000 GPUs.
Before the release of DeepSeek-R1, the AI research lab launched DeepSeek V3, which, according to the company, was trained on a cluster of 2,048 NVIDIA H800 GPUs with a budget of only $5.576 million. Dylan Patel, founder of semiconductor analysis firm SemiAnalysis, said that DeepSeek is likely “bleeding out money”. “DeepSeek doesn’t have any capacity to actually serve the model,” he rued.
The Grok 3 model, along with chat capabilities, deep search, and advanced reasoning, will be available first to Premium Plus subscribers on X. For users seeking the most advanced capabilities and early access to new features, xAI will offer these through the dedicated Grok app and website, grok.com.
xAI shared that Grok 3 completed pre-training in early January and that an early version (codename ‘Chocolate’) had taken the top spot in the LMSYS Arena, becoming the first model to break the 1400-score barrier.
“Grok 3 has already reached 1400 (score); no other model has reached an Elo score that high,” said Musk, adding that the score is aggregated across all categories, including chatbot capabilities, instruction following, and coding.
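The Arena score Musk cites is an Elo-style rating built from pairwise model battles. As a rough illustration (not the Arena’s exact method, which uses a Bradley–Terry-style fit), the classic Elo update after one comparison looks like this, with an assumed K-factor of 32:

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """One Elo update. score_a is 1.0 if model A wins, 0.5 for a tie, 0.0 if it loses."""
    # Expected score of A against B from the rating gap (400-point logistic scale).
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Two equally rated models; A wins the head-to-head vote.
print(elo_update(1000.0, 1000.0, 1.0))  # → (1016.0, 984.0)
```

A 1400 score therefore reflects many such wins accumulated against strong opponents, not a single benchmark number.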
The live demonstration showcased Grok’s reasoning and creative problem-solving prowess. One of the challenges involved generating code for an animated 3D plot of a Mars mission. Grok 3 also created a new game by mixing two existing ones.
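Getting a Mars-mission plot right hinges on the underlying orbital arithmetic. As a hedged sketch (not xAI’s demo code), the time of flight for the standard Hohmann transfer between Earth and Mars can be computed from Kepler’s third law, assuming circular, coplanar orbits and working in astronomical units and years:

```python
import math

R_EARTH = 1.0    # Earth's orbital radius (AU)
R_MARS = 1.524   # Mars's orbital radius (AU)

def hohmann_transfer_days(r1: float, r2: float) -> float:
    """Time of flight for a Hohmann transfer between two circular orbits."""
    a = (r1 + r2) / 2.0        # semi-major axis of the transfer ellipse
    period_years = a ** 1.5    # Kepler's third law in AU/years (mu = 4*pi^2)
    return (period_years / 2.0) * 365.25  # half the ellipse, converted to days

print(round(hohmann_transfer_days(R_EARTH, R_MARS)))  # → 259
```

The roughly 259-day figure matches the familiar eight-to-nine-month Earth-to-Mars cruise, the kind of sanity check a generated animation would need to pass.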
“We’re seeing the beginnings of creativity with Grok 3,” said Musk. “If you ask an AI to create a game like Tetris or Bejeweled, there are many examples on the internet for it to copy,” he added, saying that it is interesting that it achieved a creative solution combining the two games—that actually works and is a good game.
“Grok 3 might be the best base LLM for real-world physics!” said Yuchen Jin, co-founder & CTO of Hyperbolic Labs, who used it to create a Python script of a ball bouncing inside a spinning tesseract.
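Jin’s script is 4D, but the core mechanic can be sketched in 2D: a ball bouncing inside a spinning square. The version below (a simplification with assumed initial conditions and parameter names, not Jin’s code) rotates the ball into the square’s co-rotating frame, reflects the velocity off any crossed wall, and rotates back, ignoring the wall’s own tangential motion for simplicity:

```python
import math

def simulate(steps: int = 2000, dt: float = 0.01, omega: float = 0.5, half: float = 1.0):
    """Ball bouncing inside a square spinning at angular rate omega (rad/s).

    Returns the final position in the square's frame plus the ball's speed.
    """
    px, py = 0.2, 0.1  # ball position in the world frame (assumed start)
    vx, vy = 0.9, 0.7  # ball velocity in the world frame (assumed start)
    for i in range(steps):
        px += vx * dt
        py += vy * dt
        t = (i + 1) * dt
        c, s = math.cos(omega * t), math.sin(omega * t)
        # Rotate into the co-rotating frame, where the walls sit at +/- half.
        qx, qy = c * px + s * py, -s * px + c * py
        ux, uy = c * vx + s * vy, -s * vx + c * vy
        if abs(qx) > half:  # crossed a vertical wall: clamp and reflect
            qx, ux = math.copysign(half, qx), -ux
        if abs(qy) > half:  # crossed a horizontal wall: clamp and reflect
            qy, uy = math.copysign(half, qy), -uy
        # Rotate back to the world frame.
        px, py = c * qx - s * qy, s * qx + c * qy
        vx, vy = c * ux - s * uy, s * ux + c * uy
    return qx, qy, math.hypot(vx, vy)
```

Because each bounce only flips a velocity component in the rotating frame, the ball’s speed is conserved and it stays inside the square, two invariants that make this kind of physics prompt easy to verify by eye.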
The Deep(Re)Search Feature
The company also introduced the DeepSearch feature, which allows users to ask complex questions and receive comprehensive answers, saving countless hours of research.
“It not only helps engineers and research scientists with coding but also assists everyone in answering questions they have day to day. It’s like a next-generation search engine that really helps you understand utilities,” the team said.
Interestingly, this appears to be inspired by OpenAI, Google, and Perplexity AI’s latest capability, Deep Research, a name all three have adopted. Its demonstration included queries about Starship launches, popular builds in Path of Exile, and even predictions for March Madness.
“The impression I get of DeepSearch is that it’s approximately around Perplexity’s Deep Research offering (which is great!) but not yet at the level of OpenAI’s recently released Deep Research, which still feels more thorough and reliable,” said Karpathy.
Will OpenAI Strike Back?
Musk also shared that the Grok app will introduce a new “voice mode” in about a week, giving Grok models a synthesised voice. A few weeks later, Grok 3 models will be accessible through xAI’s enterprise API, alongside the DeepSearch feature.
The Grok iOS update was released with Grok 3, which features new assets like “SuperGrok” and more. Grok Pro costs $30 per month or $300 per year and includes new Voice and Thinking mode assets.
Besides, xAI plans to open-source Grok 2 in the coming months. “Our general approach is that we will open-source the last version [of Grok] when the next version is fully out,” Musk said. “When Grok 3 is mature and stable, which is probably within a few months, we’ll open-source Grok 2.”
Notably, OpenAI is also considering some open-source projects. OpenAI CEO Sam Altman asked users on X: “For our next open-source project, would it be more useful to create an o3-mini-level model that is small but still requires GPUs, or the best phone-sized model we can develop?”
He also announced the roadmap for the upcoming GPT-4.5 and GPT-5 models. “Trying GPT-4.5 has been much more of a ‘feel the AGI’ moment among high-taste testers than I expected!” he posted on X.
Meanwhile, Anthropic is preparing to launch its next reasoning model, a hybrid AI that will allocate more computational power to complex queries while efficiently handling simpler tasks.