OpenAI unveils a model that can fact-check itself

Sep 12, 2024

ChatGPT maker OpenAI has announced its next major product release: a generative AI model code-named Strawberry, officially called OpenAI o1.

To be more precise, o1 is actually a family of models. Two are available today, o1-preview and o1-mini, both in chatbot form and via OpenAI’s API.
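Since both models are exposed through the API, a request targets them simply by model name. Here is a minimal sketch of what such a request might look like, assuming the "o1-preview" and "o1-mini" identifiers OpenAI announced at launch; the `build_o1_request` helper is our own illustration, not part of any SDK:

```python
def build_o1_request(prompt: str, model: str = "o1-preview") -> dict:
    """Build a chat-completions payload targeting an o1 model.

    The payload shape follows OpenAI's chat completions API; the
    model names are the identifiers announced for the o1 family.
    """
    return {
        "model": model,
        # o1 does its "thinking" internally before responding, so a
        # single user message is typically all that gets sent.
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_o1_request("How many prime numbers are there below 100?")
# With the official `openai` Python SDK, this payload would be sent as:
#   client.chat.completions.create(**payload)
```

Swapping `model` to `"o1-mini"` targets the smaller, presumably cheaper variant instead.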

o1 avoids some of the reasoning pitfalls that normally trip up generative AI models, at least according to OpenAI. That’s because o1 can effectively fact-check itself by spending more time considering all parts of a command or question.

OpenAI says that o1, originating from an internal company project known as Q*, is particularly adept at solving math and programming-related challenges. But what makes the text-only o1 “feel” qualitatively different from other generative AI models is its ability to “think” before responding to queries.

When given additional time to “think,” o1 can reason through a task holistically — planning ahead and performing a series of actions over an extended period of time that help it arrive at answers. This makes o1 well-suited for tasks that require synthesizing the results of multiple subtasks, like detecting privileged emails in an attorney’s inbox or brainstorming a product marketing strategy.

Here’s how OpenAI describes it in a blog post: “o1 [was] trained with reinforcement learning to perform complex reasoning. o1 thinks before it answers — it can produce a long internal chain of thought before responding to the user.”

OpenAI’s internal benchmarks show o1 achieving a score of 94.8 on MATH-500, a mathematics benchmark — compared to the 60.3 score GPT-4o achieves (higher is better).

TechCrunch wasn’t offered the opportunity to test o1 before its debut; we aim to get our hands on it as soon as possible. But according to a person who did have access — Pablo Arredondo, VP at Thomson Reuters — o1 is better than OpenAI’s previous models (e.g. GPT-4o) at things like analyzing legal briefs and identifying solutions to problems in LSAT logic games.

“We saw it tackling more substantive, multi-faceted, analysis,” Arredondo told TechCrunch. “Our automated testing also showed gains against a wide range of simple tasks.”

Now, there is a downside. o1 can be slower than other models, depending on the query; Arredondo tells us the model can take over ten seconds to answer some questions. (Helpfully, the chatbot version of o1 shows its progress by displaying a label for the current subtask it’s performing.)

Given the unpredictable nature of generative AI models, o1 likely has other flaws and limitations. We’ll no doubt learn about these in time — and once we get a chance to test the model ourselves.

We’d be remiss if we didn’t point out that OpenAI is far from the only AI vendor investigating these types of reasoning methods to improve model factuality. Google DeepMind researchers recently published a study showing that, by essentially giving models more compute time and guidance to fulfill requests as they’re made, the performance of those models can be significantly improved without any additional tweaks.

OpenAI might be first out of the gate with o1. But assuming rivals soon follow suit with comparable models, the company’s real test will be making o1 widely available at a reasonable price.