
Author: Wu Daonanlai (senior media person)

Source: "Young Reporter Magazine" WeChat public account


Introduction:

The New York Times's defense of its rights against generative AI offers us a profound lesson: news media and large model companies must both fight and cooperate, and the fight is for the sake of better cooperation.

At the end of 2023, the New York Times took OpenAI and its partner Microsoft to court, accusing the two companies of using millions of the newspaper's articles, without authorization, to train generative AI systems such as ChatGPT. The Times demands that the companies stop using its content to train AI models, destroy the relevant training data, and pay damages.

The lawsuit has been accepted by the local court in the United States. Although this is not the first case in which a large model company, at home or abroad, has been sued, it is the first time an internationally renowned media outlet has sued a large model company. This may be the most representative and globally influential rights-protection case involving generative AI to date, and its verdict may shape the direction of both the AI industry and the news and publishing industry.

As a media person, the author can’t help but applaud this.

In January 2023, the stock photo website Getty Images filed a legal complaint against Stability AI, the developer of an AI image generator, accusing it of illegally copying and processing copyrighted images as model training data. In April, Universal Music Group sent letters asking music streaming platforms such as Spotify to cut off AI companies' access, to prevent its copyrighted songs from being used to train models and generate music. In June, the Chinese writing service Bishen Zuowen issued a statement accusing the Xueersi AI large model of infringement. In December, multiple artists sued Xiaohongshu, alleging that its AI model had been trained on their works. According to incomplete statistics, in 2023 there were dozens of lawsuits in California alone against large model developers over the illegal use of data.

The author believes that the New York Times's lawsuit is not as "baseless" as OpenAI claimed in its response.

First, the New York Times's lawsuit has a sufficient legal basis.

The United States is one of the countries with the strictest intellectual property protection in the world. The New York Times holds indisputable copyright in the graphics, text, video, and other content it produces. This content is a high-quality asset; used to train generative AI, it would undoubtedly constitute a relatively scarce, high-quality training corpus.

OpenAI repeatedly emphasized in its statement that because its models learn from an enormous collection of human knowledge, any single domain, including news, is only a small part of the training data, and no single source, including the New York Times, is especially significant to the model's learning. Yet the reason OpenAI values data from outlets such as the New York Times is precisely that such media own credible sources of training data. It is therefore not hard to understand why OpenAI had been negotiating with the New York Times and others. According to Tom Rubin, OpenAI's head of intellectual property and content, the company recently opened licensing negotiations with dozens of publishers. According to two media company executives who have recently been in talks with it, OpenAI is willing to pay some media companies between $1 million and $5 million a year for permission to use their news articles to train its large models. In short, data is the cornerstone of large model training; without credible, reliable data, large model training is water without a source, a tree without roots. The rapid development of large models has created a "data hunger."

Under current U.S. copyright law and the international copyright treaties the United States has joined, such as the Universal Copyright Convention, the Berne Convention, and the Geneva Convention, the unauthorized copying and dissemination of copyrighted works is prohibited except under fair use or compulsory license. As the New York Times put it in its complaint: "If Microsoft and OpenAI want to use our work for commercial purposes, the law requires them to first obtain our permission. But they have not done so."

OpenAI argued that training large models on publicly available Internet materials is fair use, a principle that it says is fair to creators, necessary for innovators, and critical to America's competitiveness.

The author believes that under current U.S. copyright law, it is difficult to classify the use of copyrighted works to train large models as fair use.

U.S. law limits copyright through doctrines including fair use and compulsory licensing. The fair use provisions are concentrated in Section 107 of the Copyright Act. That section not only lists traditional categories of fair use, such as criticism and comment, news reporting, teaching, and scholarly research, but also sets out four factors for determining whether a use is fair: (1) the purpose and character of the use; (2) the nature of the copyrighted work; (3) the amount and substantiality of the portion used in relation to the work as a whole; and (4) the effect of the use upon the potential market for or value of the copyrighted work. This is known as the "four-factor test" for fair use.

Measured against these four factors, it is difficult for the training use of copyrighted works by large models to qualify as "fair use," because the purpose of the use is ultimately commercial. If OpenAI had strong legal support, it would not be negotiating with publishers.

Second, the New York Times's lawsuit has a sufficient factual basis.

The New York Times contends that OpenAI's and Microsoft's generative AI systems have absorbed millions of its original articles: they can not only "recite" original reports verbatim to users who ask, but also imitate the paper's writing style to distill and summarize its articles, and are even treated as a reliable source. The Times collected as many as 100 pieces of evidence showing that content output by ChatGPT is highly similar to its news content, and that OpenAI's GPT-4 appears to have directly reproduced original New York Times text. The Times stated that the companies involved should bear "billions of dollars in statutory and actual damages."

A recent study likewise shows that generative AI developers are training their systems on copyrighted materials, and that generative AI systems may frequently produce plagiarized text and visual output.

In the face of conclusive evidence, OpenAI admitted as much. It calls this phenomenon of plagiarized output "regurgitation": memorization, it says, is a rare glitch in the learning process that it is continually working to address, but one that becomes relatively common when specific content appears more than once in the training data, for example when snippets of it appear on many different public websites. OpenAI says it has therefore taken steps to limit inadvertent memorization and to prevent duplicated content from appearing in model output.

In the lawsuit, the New York Times also pointed to another common problem with generative AI: it can generate and spread false, meaningless, or objectionable content. For example, a chatbot on Microsoft's Bing once listed "15 heart-healthy foods" and attributed the list to the New York Times, yet 12 of the 15 foods were not mentioned in the original report. The author believes this is not only suspected copyright infringement but also a possible injury to the New York Times's reputation.

Judging from reports at home and abroad, copyright lawsuits against large model companies mainly target infringement at the model training and output stages. The contest between large model companies and the news media is already under way, with each side emphasizing the importance of its own development.

In May 2023, at a hearing on "Interactive Artificial Intelligence and Copyright Law" held by the U.S. Congress, Sy Damle, former general counsel of the U.S. Copyright Office, said: "Any attempt to force models to pay for licenses to training content will either bankrupt the U.S. AI industry and eliminate our competitiveness on the international stage, or drive these leading AI companies out of the country."

The New York Times stated that if news organizations cannot protect their independent reporting, original news reporting will disappear with it, and by then "there will be a vacuum in society that cannot be filled by computers and AI."

The author believes that if both sides are willing to cooperate, they can achieve mutual success and create opportunities for mutual benefit and shared development; the key is to find a balance between the interests of the two parties. For example, large model companies could provide technical support for the intelligent production, distribution, and operations of news media and help the media build a healthy news ecosystem, while the media could authorize large models to use their copyrighted content in exchange for a fee.

In short, the New York Times's defense of its rights against generative AI offers us a profound lesson: news media and large model companies must both fight and cooperate. The fight is for the sake of better cooperation.
