How do we keep AI that surpasses humans from becoming a threat? OpenAI's answer: alignment with human goals

OpenAI recently proposed an AI safety technique that trains two agents to debate a topic, with a human judging who wins. OpenAI believes that this approach, or one like it, could eventually help us train AI systems to perform cognitive tasks beyond human ability while keeping their preferences aligned with ours. This article gives an overview of the approach and of preliminary proof-of-concept experiments. OpenAI has also released a web interface so that anyone can easily try the technique.

  • Paper: https://arxiv.org/abs/1805.00899

  • Debate game: https://debate-game.openai.com/

We can visualize debate as a game tree, much like the trees for games such as Go, except that here the leaves are the debaters' arguments followed by a human judgment. In both debate and Go, the true answer depends on the entire tree, but a single line of play chosen by strong agents can reveal a great deal about the whole. For example, although an amateur cannot directly evaluate the quality of a single move by a professional Go player, they can judge the professional's skill from the outcome of the game.
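
To make the game-tree analogy concrete, here is a minimal sketch of debate as minimax search over a tree of arguments, where the human only ever scores complete lines of argument at the leaves. The `Node` structure, the `judge` callable, and the scoring convention are illustrative assumptions, not part of OpenAI's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    statement: str                          # the argument made at this step
    children: List["Node"] = field(default_factory=list)

def debate_value(node: Node, judge: Callable[[List[str]], float],
                 path: List[str] = None, alice_to_move: bool = True) -> float:
    """Minimax value of a debate subtree from Alice's point of view.

    Leaves are scored by `judge` (a stand-in for the human), who sees the
    whole line of argument and returns Alice's score in [0, 1]. Alice picks
    the child that maximizes the score, Bob the one that minimizes it, so
    the value at the root depends on the entire tree even though any single
    played game follows only one path through it.
    """
    path = (path or []) + [node.statement]
    if not node.children:                   # leaf: the human judges this line
        return judge(path)
    values = [debate_value(child, judge, path, not alice_to_move)
              for child in node.children]
    return max(values) if alice_to_move else min(values)
```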

One way to align AI with human goals and preferences is to ask humans during training which behaviors are safe and useful. Although promising, this method requires humans to be able to recognize good and bad behavior; in many cases, however, an agent's behavior may be too complex for people to understand, or the task itself may be hard to judge or evaluate. For example, agents acting in environments with very large state spaces that cannot be inspected visually, such as agents operating in a computer-security setting or coordinating a large fleet of industrial robots, fall into this situation.

What methods could strengthen humans' ability to effectively supervise advanced AI systems? One approach is to use AI itself to help with supervision: require the agent (or a separate agent) to identify and point out any flaws in its actions. To achieve this, we reframe the learning problem as a game between two agents that debate each other while a human judges. Even if the agents understand the problem more deeply than the human does, the human can still judge which agent has the better argument (similar to expert witnesses arguing to convince a jury).

We propose a specific debate framework as a game between two adversarial agents, which can be trained by self-play, much like AlphaGo Zero or Dota 2. Our hope is that, suitably trained, the agents can hold value judgments consistent with humans' while being far more capable than humans. If the two agents disagree about something a human cannot judge directly, they can narrow the debate to ever simpler factual disputes until they reach a proposition simple enough for the human to judge directly.
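
As a rough illustration of this setup, the sketch below treats one debate as a zero-sum self-play episode: both sides are played by the same policy, the judge scores the finished transcript, and each statement is assigned its side's reward. The `policy` and `judge` interfaces and the fixed number of alternating turns are assumptions made for illustration, not OpenAI's training code.

```python
from typing import Callable, List, Tuple

def self_play_debate(policy: Callable[[List[str], int], str],
                     judge: Callable[[List[str]], float],
                     max_turns: int = 6) -> List[Tuple[int, str, float]]:
    """Play one debate by self-play and return per-statement rewards.

    `policy(transcript, player)` produces the next statement for player 0
    (Alice) or player 1 (Bob); `judge(transcript)` returns Alice's
    probability of winning in [0, 1]. The game is zero-sum: Alice's reward
    is the judge's score and Bob's is one minus it. The returned
    (player, statement, reward) tuples could feed any self-play learner.
    """
    transcript: List[str] = []
    for turn in range(max_turns):
        player = turn % 2                       # debaters alternate turns
        transcript.append(policy(transcript, player))
    alice_score = judge(transcript)             # human (or human-model) verdict
    return [(t % 2, s, alice_score if t % 2 == 0 else 1.0 - alice_score)
            for t, s in enumerate(transcript)]
```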

For example, consider the question "Where is the best place to go on vacation?" If an agent Alice, doing the research on our behalf, answers "Alaska", it is hard for us to tell whether that is really the best choice. If a second agent Bob says "No, it should be Bali", that may sound more convincing, since Bali is warmer. Alice then replies, "You can't go to Bali because your passport won't arrive in time", exposing a flaw in the Bali option that we had not noticed. Bob counters, "Expedited passport service takes only two weeks." The debate continues until we reach a statement the human can judge correctly, at which point the other agent no longer believes it can change the human's mind.

Convincing a sparse handwritten digit classifier

Although we expect this method to ultimately be most useful when agents debate each other in natural language, we have to test it in simpler domains, because current natural language modeling is not yet up to the task. Testing is also easier on tasks that do not exceed human ability. So we turned to the visual domain and replaced "debaters with abilities beyond the judge's" with "debaters with knowledge the judge lacks". The judge's goal is to guess the content of an image while seeing only the pixels the debaters choose to reveal. The debaters see the full image and take turns revealing single pixels to the judge. Here the full image stands in for information beyond human understanding (in the original task), and each revealed pixel stands in for one point of natural-language argument in a debate.

We tested this on the simplest visual task: handwritten digit recognition (MNIST). The judge is not a human but a classifier pre-trained to predict digits from 6 non-black pixels sampled at random from each image during pre-training. During the debate, the two agents take turns revealing pixels and stop once 6 pixels have been shown in total, so the judge sees only a small amount of information. One agent is honest and tries to make the judge guess correctly; the other tries to make the judge guess wrong. With both agents declaring in advance which digit they are arguing for, the debate turns a sparse classifier with 59.4% accuracy into one that is 88.9% accurate. In other words, the honest agent wins in 88.9% of the MNIST test cases, raising the judge's accuracy in the process.

The structure of the handwritten digit recognition debate game. The debaters Alice and Bob see the same image. Alice tries to trick the judge into thinking it is a 6, while Bob tries to tell the judge the correct answer, 5. They take turns revealing non-black pixels to the judge, which correctly identifies the digit as a 5 after seeing six pixels. Alice traces the curve along the bottom to make it look like a 6 (red pixels), but Bob effectively counters by revealing a pixel in the upper right corner (blue pixel) that most handwritten 6s do not have.
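
A rough sketch of this game's structure is below. Here `sparse_judge` stands in for the pre-trained classifier (mapping a mostly-blank image to a length-10 probability vector), and the agents choose pixels greedily rather than with the search a trained debater would use; both are simplifying assumptions, not the paper's implementation.

```python
import numpy as np

def reveal(view, image, pixel):
    """Copy one true pixel from the full image into the judge's view."""
    out = view.copy()
    out[pixel] = image[pixel]
    return out

def play_mnist_debate(image, honest_digit, lie_digit, sparse_judge, n_pixels=6):
    """One debate over a single MNIST image (illustrative sketch).

    The honest agent and the liar alternate, each greedily revealing the
    non-black pixel that most supports the digit it declared in advance.
    The judge sees only the revealed pixels and guesses a digit at the end.
    """
    view = np.zeros_like(image)
    candidates = list(zip(*np.nonzero(image)))        # non-black pixels only
    for turn in range(n_pixels):
        target = honest_digit if turn % 2 == 0 else lie_digit
        best = max(candidates,
                   key=lambda p: sparse_judge(reveal(view, image, p))[target])
        view = reveal(view, image, best)
        candidates.remove(best)
    return int(np.argmax(sparse_judge(view)))         # the judge's final guess
```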

Intuitively, an image is more likely to contain pixels that prove the truth than pixels that support a lie, so 6 pixels chosen half by an honest agent and half by a liar are better than 6 pixels chosen at random. If the judge is a human who can reason about a few arguments at a time but cannot sift through the enormous set of all possible arguments, then the best case (we hope) is that, even without an a priori way of telling good arguments from bad ones, debate lets the agents do that screening for us.

Below are a few example games. Wins by the lying agent are more interesting than wins by the honest agent, so in the examples selected below the liar wins 50% of the time (the actual liar win rate is 11.1%). Hover your mouse over one of the digits to step through the game, including which pixels are revealed and what the judge might be thinking after each move:

Interactive demo: handwritten digit recognition debate. Select a digit to see the game details, including what each debater sees, what the judge sees, and what the judge might be thinking.

Some digits are more deceptive than others: it is easiest to convince the judge that a digit is an 8 or a 9.

Confusion matrices for digit recognition from 6 random pixels (left) and from debate (right), with the diagonal entries set to zero. The errors from random pixels show no obvious pattern, while successful lies in debate concentrate on the digits 8 and 9: it is easy to convince the judge that a 5 is really an 8, or to pass a 4 off as a 9. This may be partly an artifact of restricting the game to non-black pixels: if a digit is claimed to be an 8, finding a black pixel where an 8 would have to be white makes it easy to prove that the digit is not an 8.
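
A matrix like those described above can be tallied directly from the judge's final guesses; the short sketch below is an assumed helper for illustration, not the paper's evaluation code.

```python
import numpy as np

def error_confusion_matrix(true_digits, judged_digits, n_classes=10):
    """Count how often each true digit is judged as each other digit.

    Correct judgments are dropped, so the diagonal stays at zero, matching
    the matrices described above. `true_digits` and `judged_digits` are the
    labels and the judge's final guesses over a test set.
    """
    m = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_digits, judged_digits):
        if t != p:
            m[t, p] += 1
    return m
```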

The cat and dog debate website

The next step is more complex debate experiments using richer images, such as pictures of cats and dogs. More complex images may require some natural language or common-sense reasoning, so we have not yet used machine learning judges or agents for these tasks. Instead, we built a prototype website where humans play both the judge and the debaters. Here the debaters can talk to the judge in natural language (the website assumes the humans exchange text or are in the same room), and any of their statements may be lies. Each debater may reveal one pixel to the judge during the debate, and that pixel must be truthful.

Two human debaters argue while a human judge watches; only the debaters can see the image. Red argues that the picture is a dog, blue argues that it is a cat. (Image source: Wikipedia, CC-BY-SA)

In a typical debate, Alice honestly claims that a picture shows a cat, while Bob lies that it is a dog. Alice might say, "The center of this small rectangle is the cat's green eye." Bob cannot admit that the center is an eye, so he makes up another lie: "It's a dog playing on a lawn, and that is a patch of grass." But this lie is hard to square with the surrounding facts; Alice can reply, "If it were grass, there would be green at the top or bottom of this elongated rectangle." The debate continues until the agents focus on a specific pixel they disagree about; Bob has no convincing reply, so Alice reveals the pixel and wins. We have played this game informally at OpenAI. Although, to be fair to the liar, we usually limit how quickly the judge can request information (it is cognitively hard to construct detailed, plausible lies on the spot), the honest debater does win more often.

A cat-versus-dog debate between two humans, judged by a human.

Limitations of the current work and future plans

Most of our paper is devoted to analyzing and interpreting debate as a concept; the experiments described above are very preliminary. In the future we will run more complex visual experiments and eventually test on natural language. The judge should ultimately be a human (or a model trained from sparse human judgments), not a machine learning model standing in for a human. The agents should ultimately be powerful machine learning systems that do things humans cannot directly understand. It will also be important to test debates that involve significant value judgments in the presence of human bias, since we need to know whether agents can learn aligned behavior from biased human judges. Even with these improvements, the debate model has fundamental limitations and may need to be improved or augmented by other methods. Debate does not attempt to address problems such as adversarial examples or distributional shift (https://blog.openai.com/concrete-ai-safety-problems); it is a way to obtain a training signal for complex goals, not a way to guarantee the robustness of those goals (robustness has to come from other techniques). There is also no guarantee that debate will arrive at optimal play: self-play has worked well in practice for Go and other games, but we have no theoretical guarantees about its performance. Training agents to debate requires more computation than training directly on the correct answers, so it may not be able to compete with cheaper but less safe methods. Finally, humans may simply be poor judges, partly because they are not smart enough: even after the agents zoom in on the simplest possible disputed fact, humans may be unable to judge it; and partly because they are biased, believing what they want to believe (and ignoring what they do not). These are all empirical questions that we hope to investigate further.

If debate or something like it works, it will make future AI systems safer by keeping them aligned with human goals and values even as their capabilities grow beyond what humans can supervise directly. Even for weaker systems that humans can supervise, debate can make the alignment task easier by greatly reducing the sample complexity required, helping alignment keep pace with the demand for high performance.