Text: Jinlu | Editor: Zhou Xiaoyan

Tencent Technology News, September 13: According to foreign media reports, on Thursday local time in the United States, OpenAI launched a new artificial intelligence model called OpenAI o1. It is also the company's first large model with "reasoning" capabilities: it can analyze a problem step by step, following a human-like reasoning process, until it reaches the correct conclusion.

OpenAI o1 comes in two versions, o1-preview and o1-mini, both of which currently support only text. It has been rolled out to all ChatGPT Plus and Team users, and to Tier 5 developers through the API. According to the evaluation on OpenAI's official website, the model is particularly strong on mathematics and coding problems, and it even exceeds human PhD-level accuracy on benchmark tests of physics, biology and chemistry questions.
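For developers with Tier 5 API access, calling the model looks much like any other chat completion. Below is a minimal sketch, assuming the official openai Python SDK and an OPENAI_API_KEY set in the environment; the model names are the ones listed above, and the prompt is just an example:

```python
# Minimal sketch: sending a single text prompt to o1-preview via the API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o1-preview",  # or "o1-mini" for the smaller, faster variant
    messages=[
        {"role": "user", "content": "Prove that the square root of 2 is irrational."}
    ],
)
print(response.choices[0].message.content)
```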

In addition, OpenAI o1 has surpassed GPT-4o on multi-dimensional benchmark tests spanning physics, chemistry, mathematics, logic and more:

(Comparison of GPT-4o and o1 benchmark results; source: OpenAI)

Smarter than a PhD? OpenAI o1 has aroused the curiosity of AI figures around the world. In addition to several OpenAI executives, NVIDIA senior research scientist Jim Fan, New York University professor and well-known American AI scholar Gary Marcus, Carnegie Mellon University computer science doctoral student James Campbell and others tried it out and shared their opinions on X.

We have summarized the opinions of 11 well-known entrepreneurs and scientists around the world who follow AI. Interestingly, their overall assessments fall broadly into two camps:

One camp, represented by OpenAI executives and researchers, mostly gave "good reviews", arguing that the new model opens a new round of the AI technology paradigm and moves large models toward an era of more complex reasoning. Most "outsiders" not on OpenAI's staff gave relatively restrained evaluations: while they did not deny the innovation in OpenAI o1, they believe the new model's capabilities have not been fully tested and that it is still far from AGI.

"Praise" camp: openai o1 opens a new technology paradigm

After OpenAI released the preview version of OpenAI o1 and its faster variant, OpenAI o1-mini, many of the company's executives and researchers posted that the new model will push AI into an era of more complex reasoning.

OpenAI CEO Sam Altman: OpenAI o1 is our most powerful artificial intelligence model to date. Although it is not perfect and still has certain flaws and limitations, the first experience is impressive enough. More importantly, it heralds the birth of a new paradigm: artificial intelligence has entered an era in which it is capable of broad and complex reasoning.

OpenAI President Greg Brockman: OpenAI o1 is our first model trained with reinforcement learning, and it thinks deeply before answering a question. This is a new model with huge opportunities, improving significantly both quantitatively (inference metrics have risen markedly) and qualitatively (by "reading the model's mind" in plain English, a faithful chain of thought makes the model more interpretable).

This technology is still in its early stages and brings new safety opportunities that we are actively exploring, including reliability, hallucinations, and robustness to adversarial attacks.

Chain-of-thought author and OpenAI researcher Jason Wei: OpenAI o1 is a model that thinks before giving its final answer. It does not just elicit a chain of thought through prompting; it uses reinforcement learning to train the model to carry out that thinking process better.

Throughout the history of deep learning we have been scaling up training compute, but chain of thought is a form of adaptive compute that can also be scaled at inference time.

Although OpenAI o1 looks very powerful on AIME and GPQA tests, that may not translate directly into effects users can feel. Even for scientists, it is not easy to find prompts where GPT-4 falls short of OpenAI o1, but once you find one you will be very surprised. We all need to find more challenging prompts.

Artificial intelligence uses human language to simulate a chain of thought, and it excels in many respects. The model can work through problems the way humans do, for example by breaking complex steps into simpler ones, identifying and correcting errors, and trying different approaches.

This field has been completely redefined.
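Wei's framing of chain of thought as adaptive inference-time compute can be seen even at the prompt level. The sketch below is a generic illustration of prompted chain of thought, not OpenAI's training method (o1 learns this behavior through reinforcement learning rather than from the prompt), and the question text is just an example:

```python
# Comparing a direct answer with a prompted chain of thought on the same question.
from openai import OpenAI

client = OpenAI()
question = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
            "than the ball. How much does the ball cost?")

direct = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": question + "\nAnswer with just the number."}],
)

with_cot = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": question +
               "\nThink step by step, check your work, then state the final answer."}],
)

print("Direct:", direct.choices[0].message.content)
print("Chain of thought:", with_cot.choices[0].message.content)
```

Spending more output tokens on intermediate reasoning is the "adaptive compute at inference time" Wei describes; o1 bakes that behavior into the model itself rather than relying on the prompt.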

OpenAI researcher Max Schwarzer: I have always believed that you do not need a GPT-6-level base model to achieve human-level reasoning capabilities. Reinforcement learning is the key to AGI. Today, we have the proof: OpenAI o1.

The comments from Andrej Karpathy, one of OpenAI's founding members and former senior director of AI at Tesla, were rather different. He complained about the model's "laziness": OpenAI o1-mini keeps refusing to tackle the Riemann hypothesis for me; model "laziness" remains a major problem.

NVIDIA senior research scientist Jim Fan and Carnegie Mellon University computer science doctoral student James Campbell, although not OpenAI staff, also offered praise.

NVIDIA senior research scientist Jim Fan: This may be the most important development in large language model research since the original Chinchilla scaling laws in 2022. The key is the synergy of two curves, not a single curve. People predicted that large language model capabilities would stagnate based on extrapolating the training scaling laws, but they did not foresee that inference-time scaling is the key to truly breaking through diminishing returns.

I mentioned in February that no self-improving large language model algorithm had achieved significant progress beyond three rounds. No one had replicated AlphaGo's success, where more compute scales to superhuman levels, in the field of large language models. However, we have now turned a new page.

PhD student Campbell vs. Altman: James Campbell, a PhD student in computer science at Carnegie Mellon University, posted the OpenAI o1 preview version's results on the American Invitational Mathematics Examination (AIME), showing that it solved 83% of the questions; by comparison, GPT-4o answered only 13%. Campbell wrote: "It's all over!" To which OpenAI CEO Sam Altman replied: "We will be back!"

The "restrained" camp: OpenAI o1 is not that smart

Hugging Face CEO and co-founder Clément Delangue: Once again, an artificial intelligence system is not "thinking", it is "processing" and "running predictions", just like Google or a computer. This kind of technology talk gives the false impression that these systems are as intelligent as humans, but it is just cheap publicity and a marketing ploy to make you think they are smarter than they actually are.

New York University professor and well-known American AI scholar Gary Marcus: OpenAI's new model is indeed impressive, but:

. It is not AGI (artificial general intelligence), and it is still a long way from that goal.

. Read carefully and pay attention to the details. There are not many details on how it works, and what has been tested is not fully disclosed. It is not fully integrated with the rest of GPT-4. (Why?)

. The full new model has not been released to paying subscribers; only a mini version and a preview version were launched. As a result, the industry has not been able to test it fully.

. The report shows that OpenAI o1 performs well in many areas, but in some respects the old model performs better. It is not a magical across-the-board improvement over the old model.

. We do not know exactly what it was trained on, but it still has problems with even basic tasks such as tic-tac-toe.

. OpenAI has exaggerated its success on the law exam before, and on closer examination those claims did not stand up to scrutiny. Scientific review takes time, and these results have not yet been peer-reviewed.

. What it claims to accomplish in seconds might still be surprising even if you gave it a month. But give it a highly specialized task, such as writing complex software code, and it can disappoint, because OpenAI wants you to think it can do anything.

. Buyer beware.

Wharton management professor Ethan Mollick: I have been using "Strawberry" (OpenAI o1) for a month, and it is amazing in many ways, yet also somewhat limited. Perhaps most importantly, it is a signal of where things are headed.

The new artificial intelligence model is called "o1-preview" (why are AI companies always so bad at naming things?), and it "thinks" about a problem before solving it. That makes it possible to tackle complex problems that require planning and iteration, such as novel mathematics or science puzzles. In fact, it can now even outperform human PhD experts at solving extremely difficult physics problems.

To be clear, "o1-preview" does not perform better in every way; it is not stronger than GPT-4o across the board. But on tasks that require planning, the performance is very good. For example, I gave it this instruction: referring to the paper below, consider the teacher and student perspectives and figure out how to build a teaching simulator using multiple agents and generative AI; write the code and explain your approach in detail. I then pasted the full text of our paper, with the only other instruction being to build the complete code. You can see the system's output below.

Evaluating these complex outputs is genuinely difficult, so the easiest way to demonstrate the benefits (and limitations) of the Strawberry model is with a game: a crossword puzzle. I took eight clues from a very hard crossword and transcribed them as text (since it cannot work from images yet). Why not try the puzzle yourself? I bet you will find it challenging.

Crossword puzzles are particularly tricky for large language models because they require trial and error: trying and eliminating many interrelated answers. Large language models cannot do this natively, because they can only add one token at a time to their answer. For example, when I gave this puzzle to Claude, it first gave the answer "star" (wrong), then used that wrong answer while trying to solve the rest of the puzzle, and in the end could not even guess at an answer. Without a planning process, it is just guessing.
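Mollick's point about interrelated answers can be made concrete with a toy backtracking solver. The slot layout and candidate words below are hypothetical (only "star" and "apps" come from his example); what matters is the try-check-undo loop, which one-token-at-a-time decoding has no native way to perform:

```python
# Toy crossword filler: place a candidate word, check the letters it shares with
# crossing slots, and backtrack when a guess makes the grid inconsistent.
CANDIDATES = {
    "1across": ["apex", "ante"],   # hypothetical crossing entries
    "1down":   ["star", "apps"],   # "star" is the tempting wrong guess; "apps" is correct
}
# (slot_a, index_a, slot_b, index_b): letter index_a of slot_a must equal letter index_b of slot_b
CROSSINGS = [("1across", 0, "1down", 0)]

def consistent(assignment):
    for a, i, b, j in CROSSINGS:
        if a in assignment and b in assignment and assignment[a][i] != assignment[b][j]:
            return False
    return True

def solve(slots, assignment=None):
    assignment = {} if assignment is None else assignment
    if len(assignment) == len(slots):
        return dict(assignment)
    slot = next(s for s in slots if s not in assignment)
    for word in CANDIDATES[slot]:
        assignment[slot] = word          # try a guess
        if consistent(assignment):
            result = solve(slots, assignment)
            if result:
                return result
        del assignment[slot]             # eliminate it and move on to the next candidate
    return None

print(solve(list(CANDIDATES)))           # {'1across': 'apex', '1down': 'apps'}
```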

But what happens if I give this puzzle to Strawberry? The model first "thought" for a full 108 seconds (most problems take less time). You can see its train of thought here, with a sample below (there is much more I have not included); the chain of thought is very illuminating and well worth reading.

The large language model iterates repeatedly, generating and eliminating ideas, and the results are usually quite good. However, "o1-preview" still seems to be built on GPT-4o, which is a bit too literal-minded to crack this harder puzzle. For example, the "galaxy clusters" clue does not refer to actual galaxies but to Samsung Galaxy phones (which confused me too); the answer is "apps". It kept trying names of actual galaxies before settling on "Coma" (a real galaxy cluster). As a result, the remaining answers, while creative, were not entirely correct and did not follow the rules.

To push it a step further, I decided to give it a hint: "1 down is apps." The AI took another minute. Again, in the sample of its thinking (see left), you can see how it iterates over ideas. In the end, it produced exactly the right answer, solving all of the puzzle's clues, although it did invent a new clue that was not in the puzzle I gave it.

So, what "o1-preview" does is not possible without "strawberry", but it's still not perfect: bugs and illusions still exist, and it's still limited by gpt-4o as the underlying model "intelligent". Since getting the new model, I haven't stopped using the Claude to comment on my posts, the Claude still performs better stylistically, but I have definitely stopped using it for any complex planning or problem-solving tasks. "o1-Preview" represents a huge leap forward in these areas.

Using "o1-preview" means facing a paradigm change in artificial intelligence. Planning becomes a form of agency, with AI coming up with solutions on its own without our help. It can be seen that the AI ​​did a lot of thinking and produced complete results, and the role of being a human companion feels diminished. Artificial intelligence will complete the task autonomously and then give the answer. Sure, I could find errors by analyzing its reasoning, but I no longer felt connected to the AI's output or played an important role in shaping the solution. That's not necessarily a bad thing, but it's certainly a change.

As these systems keep improving and move toward truly autonomous agents, we need to figure out how to stay in the loop, both to catch errors and to keep hold of the heart of the problem we are trying to solve. "o1-preview" showcases AI capabilities we may never have seen before, even if it still has limitations. It leaves us with a key question: as artificial intelligence develops, how can we work better with it? That is a problem "o1-preview" cannot solve yet.

Unlike the many people who offered outright praise or criticism, Aravind Srinivas, CEO of the conversational search engine Perplexity, tried to "guess" how the OpenAI o1 model works. He doubts that "Strawberry's" answers are the product of careful refinement through repeated rounds of critical feedback, and believes that relying on a large language model alone to arrive at answers is not reliable enough. He suggested that incorporating code execution into the mix, combined with facts extracted directly from a knowledge graph, might be more effective in practical applications.
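A minimal sketch of the pattern Srinivas describes: have the model return both an answer and an executable check, then run the check instead of trusting the model's prose alone. The generate() function below is a stand-in for a real LLM call (hard-coded here so the sketch runs end to end); in practice the prompt could also be augmented with facts retrieved from a knowledge graph:

```python
# Sketch of "answer plus executable check": trust the result of running code,
# not the model's text.
import json

def generate(prompt: str) -> str:
    # Stand-in for a real LLM call (e.g. via the openai SDK). Hard-coded so the
    # sketch runs without an API key.
    return json.dumps({"answer": 143, "check": "answer == 11 * 13"})

def answer_with_check(question: str):
    reply = generate(
        question + "\nRespond as JSON with keys 'answer' and 'check', where 'check' "
        "is a Python expression that is True iff 'answer' is correct."
    )
    proposal = json.loads(reply)
    # Evaluate the check with no builtins; a production system would sandbox this properly.
    verified = bool(eval(proposal["check"], {"__builtins__": {}}, {"answer": proposal["answer"]}))
    return proposal["answer"], verified

print(answer_with_check("What is 11 * 13?"))   # (143, True)
```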

Overall, OpenAI o1 "thinks" for longer and is better at more standardized "hard science" problems, which is good news for scientific work. But beyond logical reasoning problems with clear right and wrong answers, the world is full of questions that have no standard answer; there are a thousand Hamlets in the eyes of a thousand people. If large models can also help humanity with those open-ended problems, perhaps we will get closer to real AGI. (Compiled by Jinlu; edited by Ke Jun and Zhou Xiaoyan)
