Editors: Du Yu, Song Xinyue
On May 21, film star Scarlett Johansson accused OpenAI's ChatGPT of using her voice without authorization and demanded that it be taken down, saying she was "shocked" and "angered" by OpenAI's behavior. In response, OpenAI said it would suspend use of the "Sky" voice and described the voice-selection process in detail, emphasizing that all of the voices come from professional actors.
Previously, OpenAI had launched its new flagship model, GPT-4o. While retaining the previous five voice modes, the model greatly improves its ability to understand images and audio: it supports real-time voice conversation, can recognize tone, speaking patterns, different speakers and background noise, and can even output laughter, singing and emotional expression.
Image source: X
Scarlett angrily calls out ChatGPT's voice mode: copycat!
On May 21, Scarlett Johansson posted a lengthy statement on social platforms accusing OpenAI of using her voice without permission and demanding that the "Sky" voice be taken down. She said the behavior not only violated her rights but also raised public concerns about the misuse of AI technology.

Scarlett revealed in the statement that as early as September 2023, OpenAI had approached her about voicing ChatGPT's speech mode, but she declined the invitation for personal reasons. Just two days before the product launch, OpenAI tried once more to persuade her, and she again refused. Yet Scarlett found that the "Sky" voice in the released product sounded remarkably like her own. She said that after hearing the demo she was shocked, angered and in disbelief that OpenAI would use a voice so similar to hers; even her close friends and news outlets could not tell the difference between Sky and her own voice.
The episode is reminiscent of Scarlett's role voicing the AI heroine in the 2013 science-fiction film "Her". Is this a hint that the similarity was intentional?
Image source: X
Scarlett emphasized, "In a time when we are all grappling with deepfakes and the protection of our own likeness, our own work and our own identities, I believe these are questions that deserve absolute clarity. I look forward to resolution in the form of transparency and appropriate legislation to help ensure that individual rights are protected."
Faced with the accusations, OpenAI responded quickly and announced that it would suspend use of the "Sky" voice. On its official website, the company described in detail how ChatGPT's voices were created, emphasizing that all of them were selected from more than 400 professional voice actors and went through a strict review process.

OpenAI's official statement said: "We have received some questions about how ChatGPT's voices were chosen, particularly 'Sky'. We are actively taking steps to pause the use of 'Sky' while we address these questions."
GPT-4o: A huge leap forward in conversational AI
Previously, ChatGPT's voice mode offered five voices: Breeze, Cove, Ember, Juniper and Sky. The voices were carefully selected to meet users' diverse needs, each with its own emotional character and sonic qualities, giving users a richer interactive experience.
The latest GPT-4o not only retains the previous five voice modes but further improves the naturalness and emotional expressiveness of voice interaction. GPT-4o brings many improvements to speech recognition and generation, making the AI assistant more intelligent and more human.

According to reports, the new model enables ChatGPT to handle 50 different languages while improving speed and quality.
GPT-4o is a step toward more natural human-computer interaction. It accepts any combination of text, audio and images as input and can generate any combination of text, audio and images as output. "Compared with existing models, GPT-4o is especially good at image and audio understanding."
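For readers who want to try the multimodal input described above, here is a minimal sketch using OpenAI's publicly documented Python SDK and Chat Completions API. The model name "gpt-4o" matches the release; the image URL and the prompt are illustrative placeholders, not taken from the article or the demos.

```python
# Minimal sketch: sending text plus an image to GPT-4o via OpenAI's
# Chat Completions API. Requires the `openai` package and an
# OPENAI_API_KEY environment variable. The image URL and prompt are
# hypothetical placeholders for illustration only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                # Text and image parts are combined in a single message,
                # mirroring the mixed text/audio/image inputs described above.
                {"type": "text",
                 "text": "Describe what is shown in this image."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/sample.png"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

Note that this sketch covers only text and image input; real-time audio conversation of the kind shown in OpenAI's demos went beyond this basic API at launch.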
Before GPT-4o, when users talked to ChatGPT in voice mode, the average latency was 2.8 seconds with GPT-3.5 and 5.4 seconds with GPT-4. The processing pipeline also lost a great deal of information from the audio input, so GPT-4 could not directly perceive pitch, multiple speakers or background noise, and could not output laughter, singing or emotional expression. In contrast, GPT-4o can respond to audio input in as little as 232 milliseconds, similar to human response time in a conversation. In a recorded video, two executives demonstrated that the model could infer nervousness from rapid breathing, guide the user to take deep breaths, and change its tone of voice at the user's request.
Image source: Screenshot from YouTube
On the image-input side, the demonstration video shows an OpenAI executive turning on the camera and asking the model to solve a single-variable equation in real time, a task ChatGPT completed with ease; the executives also had the desktop version of ChatGPT interpret code and a temperature chart on the computer screen in real time.
Image source: Screenshot from YouTube
OpenAI said, "We trained a single new model end-to-end across text, vision and audio, meaning that all inputs and outputs are processed by the same neural network. Because GPT-4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations."
Daily Economic News