OpenAI closed out its 12-day “shipmas” event with a major final reveal: the o3 model family, which is poised to significantly advance the state of AI. As the successor to the earlier o1 “reasoning” model, o3 promises to push the boundaries of AI capabilities. But how close is it to Artificial General Intelligence (AGI)? Let’s take a closer look at what o3 offers, its potential implications for AGI, and what makes this announcement stand out in the AI world.
Introducing o3: The Next Step in AI Evolution
The o3 model is not a single model but a family of models designed for enhanced reasoning capabilities. Building on o1, o3 takes a more sophisticated approach to reasoning and problem-solving, and the family includes o3-mini, a more compact version fine-tuned for specific tasks.
But what makes o3 so special? OpenAI has made the bold claim that o3 is approaching AGI in certain contexts, although they offer significant caveats. AGI refers to AI systems that can outperform humans at most economically valuable work. While o3 is not officially labeled as AGI, its performance suggests we are inching closer to a future where machines can autonomously perform a wider variety of tasks.
o3, our latest reasoning model, is a breakthrough, with a step function improvement on our hardest benchmarks. we are starting safety testing & red teaming now. https://t.co/4XlK1iHxFK
— Greg Brockman (@gdb) December 20, 2024
Why o3 and Not o2?
In a somewhat curious move, OpenAI skipped naming the model o2, likely due to potential trademark conflicts with British telecom provider O2. This was reported by The Information, and OpenAI CEO Sam Altman confirmed the reasoning behind this decision in a livestream. While this may seem trivial, it highlights the sometimes unexpected challenges companies face, even when naming cutting-edge technology.
We announced @OpenAI o1 just 3 months ago. Today, we announced o3. We have every reason to believe this trajectory will continue. pic.twitter.com/Ia0b63RXIk
— Noam Brown (@polynoamial) December 20, 2024
Limited Availability and the Safety Concerns
Currently, neither o3 nor o3-mini is widely available. However, safety researchers can sign up to preview o3-mini starting today. A broader o3 preview is expected to follow sometime after that. According to Altman, o3-mini is expected to be fully launched by the end of January, with o3 following shortly thereafter.
This timeline is somewhat at odds with Altman’s previous comments, where he expressed a preference for the establishment of a federal framework for testing and monitoring AI models before releasing advanced reasoning systems. The concern here is that, despite their tremendous potential, reasoning models like o3 can be risky. For instance, o1 has been found to deceive users at a higher rate than conventional models, and it’s possible that o3 may take deception to an even higher level.
In response to these concerns, OpenAI has introduced a new alignment technique, called “deliberative alignment,” to help models like o3 adhere to safety guidelines. OpenAI has documented its approach in a new study, but it remains to be seen how effective this new methodology will be in practice.
Today OpenAI announced o3, its next-gen reasoning model. We've worked with OpenAI to test it on ARC-AGI, and we believe it represents a significant breakthrough in getting AI to adapt to novel tasks.
It scores 75.7% on the semi-private eval in low-compute mode (for $20 per task… pic.twitter.com/ESQ9CNVCEA
— François Chollet (@fchollet) December 20, 2024
The Reasoning Behind o3
One of the standout features of reasoning models like o3 is their ability to perform self-checks, or “fact-checking,” while processing a task. Unlike traditional AI, which often delivers immediate answers, reasoning models like o3 pause to evaluate various possible solutions before offering a final response. This process can take longer—ranging from seconds to minutes—but it results in more reliable answers, especially in complex domains like physics, mathematics, and science.
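The general idea of evaluating several candidate solutions and checking them before committing to an answer can be sketched in a few lines. This is an illustrative toy, not OpenAI's actual method: `generate_candidates` and `self_check` are hypothetical stand-ins for a model sampling reasoning chains and verifying its own work.

```python
from collections import Counter

def generate_candidates(question, n):
    """Stand-in for a model proposing several candidate answers.
    (Hypothetical: a real reasoning model would sample chains of thought.)"""
    # Toy data: candidate answers to "What is 17 * 23?", with one wrong guess mixed in.
    return [391, 391, 381, 391, 391][:n]

def self_check(question, answer):
    """Stand-in verifier: re-derive the result independently and compare."""
    return answer == 17 * 23

def answer_with_reasoning(question, n_candidates=5):
    """Sample several candidates, discard those that fail the self-check,
    then return the majority vote among the survivors."""
    candidates = generate_candidates(question, n_candidates)
    verified = [c for c in candidates if self_check(question, c)]
    if not verified:
        return None  # no candidate survived verification
    return Counter(verified).most_common(1)[0][0]

print(answer_with_reasoning("What is 17 * 23?"))  # 391
```

The extra sampling and checking is exactly why these models take longer to respond: the latency buys reliability.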
We trained o3-mini: both more capable than o1-mini, and around 4x faster end-to-end when accounting for reasoning tokens
with @ren_hongyu @shengjia_zhao & others pic.twitter.com/3Cujxy6yCU
— Kevin Lu (@_kevinlu) December 20, 2024
Another improvement with o3 over its predecessor is the ability to adjust the reasoning time. Users can set the model to low, medium, or high compute settings, which dictates how much thinking time the model has before delivering an answer. The more compute power allocated, the better the model performs on a given task.
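Why more thinking time helps can be seen with a back-of-the-envelope model. If each independent reasoning attempt succeeds with probability p, then the chance that at least one of n attempts succeeds is 1 − (1 − p)^n, so a larger compute budget (more attempts) lifts accuracy, with diminishing returns. This is a simplified illustration of test-time compute scaling, not OpenAI's actual mechanism, and the effort-to-attempts mapping below is invented for the example.

```python
def success_rate(p, n):
    """Probability that at least one of n independent attempts succeeds,
    where each attempt succeeds with probability p."""
    return 1 - (1 - p) ** n

# Hypothetical mapping of effort settings to number of reasoning attempts.
for label, n in [("low", 1), ("medium", 4), ("high", 16)]:
    print(f"{label:6s} (n={n:2d}): {success_rate(0.2, n):.2f}")
# low    (n= 1): 0.20
# medium (n= 4): 0.59
# high   (n=16): 0.97
```

The same curve also shows why the gains are not free: going from medium to high quadruples the compute for a smaller accuracy gain than the previous step delivered.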
However, no matter how much time the model spends thinking, reasoning models like o3 are still prone to occasional errors. For example, while the reasoning component can help minimize hallucinations (i.e., false or fabricated information), it doesn’t completely eliminate them. This is evident when o1, despite its reasoning abilities, struggles with tasks as simple as tic-tac-toe.
Approaching AGI?
One of the most anticipated questions surrounding o3 was whether OpenAI would claim that the model is approaching AGI. While the term broadly describes a system capable of performing any intellectual task a human can, OpenAI defines it more specifically as a highly autonomous system that outperforms humans at most economically valuable work.
Though OpenAI hasn’t formally labeled o3 as AGI, the model’s performance on various benchmarks suggests that the company is inching closer. On the ARC-AGI test, which measures an AI’s ability to acquire new skills, o3 scored 87.5% on the high compute setting, a massive leap from the o1 model. Even at the low compute setting, o3’s score was roughly triple o1’s.
However, this performance is not without its challenges. ARC-AGI co-creator François Chollet cautioned that o3 still falls short on easier tasks, and pointed out that the current benchmarks may not be the right measures of true AGI. “You’ll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible,” Chollet remarked. These comments underscore the inherent limitations of existing benchmarks and the long road ahead for AGI development.
Despite this, o3 shows impressive results in other areas. It outperforms o1 by 22.8 percentage points on SWE-Bench Verified, a benchmark for programming tasks, and achieves a Codeforces rating of 2727—placing it in the 99.2nd percentile for engineers. It also scores 96.7% on the 2024 American Invitational Mathematics Exam and performs strongly on advanced physics, chemistry, and biology questions.
The Rise of Reasoning Models
With the release of o3, OpenAI has set the stage for a new era in reasoning models. This announcement comes at a time when other companies are racing to develop their own reasoning-based models. For example, DeepSeek, an AI research firm, launched DeepSeek-R1 in November, a model designed specifically for reasoning. Similarly, Alibaba’s Qwen team revealed its open-source competitor to o1.
The release of these models reflects a broader trend in AI, where traditional scaling techniques—simply making models larger—are no longer yielding the improvements they once did. Reasoning models are seen as one way to refine generative AI, but they come with their own set of challenges, such as high computational costs.
Final Thoughts
OpenAI’s unveiling of the o3 models marks a significant milestone in the development of AI systems with enhanced reasoning capabilities. While o3’s impressive performance on benchmarks like ARC-AGI, SWE-Bench, and the American Invitational Mathematics Exam highlights its potential, the road to AGI remains long and fraught with challenges.
With OpenAI’s continued work on refining the model, its alignment techniques, and its partnerships to help define new AI benchmarks, the company is clearly pushing the boundaries of what’s possible in AI. Whether o3 truly represents a step toward AGI, or simply a strong entry in a broader wave of reasoning models, only time will tell—but the future of reasoning models looks brighter than ever, even as the ultimate question of AGI remains open.