Grok-3: The first LLM scoring over 1,400 on Chatbot Arena

Grok Business Overview

Grok's Name Origin and Conceptual Meaning

  • Grok is a word from Robert Heinlein's novel "Stranger in a Strange Land," used by a character raised on Mars. It means to fully and profoundly understand something, with empathy being an important aspect.
  • The mission of XAI and Grok3 is to understand the universe, including fundamental questions about aliens, the meaning of life, the universe's end, and its beginning. The interviewee stressed that to understand the universe, one must rigorously pursue truth to avoid delusion or error.
  • Grok3, an order of magnitude more capable than Grok2, was developed in a short period. The team has worked diligently over the past few months to improve it and make it accessible to users, and believes it will be both engaging and entertaining to interact with. The company is also eager to attract some of the smartest individuals to join the team.

XAI Rapidly Advances AI Technology

  • XAI started with their first model 17 months ago, featuring 314 billion parameters. They swiftly progressed from version 1 to version 1.5 (released in November 2023) and then to version 2. This rapid advancement is attributed to a strong engineering team, top AI talent, and significant computational power. XAI believes obtaining the best pre-trained model is insufficient for building the best AI. They focus on additional capabilities like contemplating all possible solutions, self-critique, verifying solutions, backtracking, and thinking from first principles.
  • XAI continues training the best pre-trained model with reinforcement learning to elicit additional reasoning capabilities, which lets performance improve and scale at both training time and test time. The company takes a hands-on approach to AI development, including physically working with GPU clusters. This level of involvement, such as unplugging cables to test system stability, is believed to set them apart from other AI teams and to improve the reliability of their training setup.

XAI Expands Computational Power

  • XAI initially faced challenges with their GPU cluster, starting with around 8,000 training chips and facing cooling and power issues. They have since expanded to over 100,000 GPUs. In April 2024, the company decided to build its own data center, completing the first 100,000 GPU cluster in 122 days. They then doubled the capacity, completing the expansion in 92 days. This expansion has led to Grok 3, which was trained with 10-15 times more compute than the previous generation.
  • The team faced numerous technical issues during cluster setup, including BIOS mismatches, networking problems, and maintaining cluster health throughout the training process. Achieving Grok 3's capabilities required mastery of deep learning science and engineering at every level. XAI envisions a future where they might have a computer utilizing their entire cluster for a single, very important problem during test time, similar to the concept of Deep Thought.

Grok Represents XAI's Flagship AI

  • Grok is XAI's flagship AI. Over the last few months the team has worked hard to improve its capabilities, achieving a substantial jump with gains of more than 10x in some areas, and the company now aims to provide access to all users. As an example of a creative solution, Grok combined two games into a good, functional game.

Grok Demonstrates Versatile Problem-Solving

  • Grok demonstrated creative problem-solving by combining two games, Tetris and Bejeweled, into a new game that actually works well. Despite being trained on specialized tasks like mathematics and coding, Grok has shown the ability to work on various other tasks, including game creation. The AI seems to have developed generalized abilities to detect and correct its own mistakes, persist on problems, and select the best solutions.

Product Aspects

XAI Develops DeepSearch

  • XAI is introducing a new product called DeepSearch, described as the first generation of Grok agents. DeepSearch is positioned as a next-generation search engine designed to help users understand the universe by answering day-to-day questions. It operates by conducting a single search using the current rack system, providing a high-level progress bar and bullet summaries of the model's actions, including which websites it's browsing and what sources it's verifying.
  • The system cross-validates different sources to ensure the accuracy of its final answer. This approach is expected to save users hundreds of hours of Google search time when researching specific topics. XAI is focusing on developing the best reasoning model, allowing it to think harder, longer, and more broadly. They are excited about providing more tools to the model, similar to how humans use various resources to solve problems.

Grok3 Capabilities and Performance

  • Grok 3 finished pre-training in early January and is still in training. It has been evaluated across three categories: general mathematical reasoning, general STEM and science knowledge, and computer science coding. The model, along with its smaller counterpart Grok 3 Mini, performs exceptionally well, reaching the frontier among competitors on benchmarks such as the American Invitational Mathematics Examination (AIME). In the blind test on Chatbot Arena, an early version of Grok 3 achieved an Elo score of 1400, ranking number one across all categories, including chatbot capabilities, instruction following, and coding. The score is still climbing, and a newer version is believed to be even better.
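
To give a feel for what an Arena-style rating means, the sketch below shows a classic Elo calculation over pairwise votes. It is an illustration only: the source does not describe how Chatbot Arena computes its scores, so the 400-point scale, the K-factor of 32, and the function names `expected_score` and `update` are assumptions taken from the textbook Elo formulation.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after a single user vote.

    k is an assumed step size; the total rating is conserved.
    """
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # A 1400-rated model against a 1300-rated one (illustrative numbers).
    p = expected_score(1400, 1300)
    print(f"expected preference rate: {p:.2f}")  # prints 0.64
    print(update(1000.0, 1000.0, a_won=True))    # prints (1016.0, 984.0)
```

Under this model, a 100-point gap corresponds to the higher-rated model being preferred in roughly 64% of head-to-head votes.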

Grok3 Features and Functionality

  • Grok3's deep search feature allows users to view subtasks and scroll through the AI's thought process, making the information retrieval process transparent. This functionality is more powerful than traditional search engines, as users can specify source restrictions and steer the search intelligently. Grok3 can complete tasks that might take a human 30 minutes to an hour in just 10 minutes, potentially producing better results than manual research. Users can initiate multiple tasks simultaneously and review results shortly after.

R&D Aspects

XAI Implements Advanced Training Methodology

  • XAI is utilizing multiple chains of thought simultaneously, which is a powerful technique allowing them to continue scaling model capabilities after training. This approach helps address concerns about overfitting to benchmarks and improves generalization.
  • To address concerns about AI models merely memorizing textbooks or GitHub repositories, a blind test of Grok 3 (code-named "Chocolate") was initiated on the Chatbot Arena platform. This platform allows for raw comparison of AI engines by stripping away product surfaces. Users submit queries and receive two anonymous responses, then vote on their preference.
  • The question of whether AI assistants like Grok3 have gender or relationship status was raised. The response indicated that such AI can be perceived as whatever the user wants it to be, suggesting a flexible approach to AI personification. It was noted that people developing emotional attachments to AI assistants is highly probable.

Grok's Training and Development Focus

  • Grok's reasoning abilities were primarily trained on math problems and competitive coding problems. This focused training approach has yielded interesting results in terms of the AI's problem-solving capabilities. Grok models can spend more time reasoning about a problem before providing an answer, which often leads to improved performance. This is represented by shaded bars in performance graphs, indicating the model's ability to think longer and potentially solve problems more accurately.
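
One common way to spend extra test-time compute, in the spirit of the multiple simultaneous chains of thought mentioned earlier, is to sample several independent answers and keep the most frequent one. The sketch below illustrates that majority-vote idea only; it is not XAI's method, and `majority_vote`, `solve_with_consensus`, and the `sample_answer` callable are hypothetical names standing in for a real model call.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]


def solve_with_consensus(sample_answer: Callable[[], str], n_samples: int = 8) -> str:
    """Draw n_samples independent answers and return the consensus.

    sample_answer is a placeholder for one reasoning attempt by a model.
    """
    answers = [sample_answer() for _ in range(n_samples)]
    return majority_vote(answers)


if __name__ == "__main__":
    # Five hypothetical final answers from five separate reasoning chains.
    print(majority_vote(["7", "42", "42", "13", "42"]))  # prints 42
```

Raising `n_samples` trades more compute for a more stable answer, which is one way a model can "think longer" on a hard problem.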

Grok's Rapid Progress and Future Potential

  • Seventeen months ago, Grok 0 and Grok 1 could barely solve high school problems. Now, Grok has progressed to a level where it's ready for college-level challenges. The company anticipates that human exams may soon become too easy for Grok. To test Grok's generalization capabilities, the company had their models compete on a fresh AIME 2025 exam. Grok 3 reasoning, the larger model, performed better on this new exam than the smaller model, demonstrating stronger generalization capabilities.

Infrastructure Development for Grok

  • The company built its own data center in 122 days, a fraction of the 18-24 months quoted by providers. They repurposed an abandoned factory, leased generators and cooling systems, and implemented a liquid-cooled system for the GPUs. The project required innovative solutions, including reprogramming Tesla Megapacks to handle dramatic power fluctuations. One of the most difficult aspects of the project was getting the entire model training coherently on the H100 GPUs. This involved managing potential hardware issues like cosmic rays affecting transistors, orchestrating hundreds of thousands of GPUs, and dealing with potential failures at any time.

Channel Aspects

Grok3 Platforms and Accessibility

  • Grok3 is accessible via the grok.com website and the iOS App Store. The web version on grok.com will be the most advanced and up to date, since App Store approval processes can delay app updates and the web has fewer limitations than phone formats. Grok3 is being released with features including the base model with chat capabilities, deep search, and advanced reasoning modes.

Subscription Models Introduced

  • The company is introducing a separate subscription called "Super Grok" for enthusiasts seeking the most advanced capabilities and earliest access to new features. This subscription is available for the dedicated Grok app and website. Initial access to Grok3 is granted to Premium+ subscribers on X. Users are advised to update their X app to access advanced capabilities.

Recent Strategy

Grok3 Release and Improvement Strategy

  • The company emphasizes that the initial release is a beta version, with rapid improvements expected almost daily. A more polished version may be available within a week. Voice interaction, which will offer a conversational experience with Grok3, is nearly ready pending some final polishing and is expected to be released in about a week. The Grok3 API, including reasoning models and deep search capabilities, will be available in the coming weeks.

Open-Sourcing and Future Development Plans

  • The company's general approach is to open-source the previous version when the next version is fully released and stable. For example, Grok 2 will be open-sourced when Grok 3 is mature and stable, which is expected to be within a few months. The company has already started work on the next cluster, which will be approximately 5 times more powerful, using about 1.2 gigawatts of power. This new GB200/300 cluster is expected to be the most powerful training cluster in the world.
