Would ByteDance, which attaches so much importance to AI and large models, really take such a big risk for the sake of "overtaking on the curve"?

Image source: @Visual China

Text | Metaverse New Voice; Editor | Sun Haonan

As we all know, in the field of large AI models, OpenAI's ChatGPT is like a particularly difficult problem the teacher assigned at school: while everyone else is still working out how to approach it, the top student in the class has already handed in a finished answer, so most people would rather compare notes with that top student, or simply copy the homework outright.

The recent turmoil seems to confirm that many seemingly complex things are essentially the same. Earlier, Musk's Grok AI was suspected of plagiarizing ChatGPT, or even being little more than a wrapper around it, because of data set contamination; now ByteDance has been banned by OpenAI for allegedly violating its terms of service.

ByteDance, caught in a large-model public opinion storm

Recently, the foreign media outlet The Verge reported that ByteDance had been using its OpenAI API account, obtained through Microsoft, to generate data for training its own artificial intelligence model, a practice that violates both Microsoft's and OpenAI's terms of use. Shortly after the news broke, The Verge further reported that OpenAI had suspended ByteDance's account.

So which specific terms did ByteDance violate? OpenAI's terms of service contain a clear provision: the model capabilities OpenAI provides may not be used to "develop any AI model that competes with" its own products and services.

According to The Verge, the evidence is an internal ByteDance document: chat records on Lark, the overseas version of Feishu. The document shows that in the foundational large language model project code-named "Project Seed", ByteDance relied on OpenAI's API at almost every stage of development, including training and evaluating models.

Project Seed was launched about a year ago and is currently building two main products: Doubao, a chatbot already launched in China, and a chatbot platform for business users that is still under development.

Employees working on Project Seed were well aware of the consequences of over-relying on the OpenAI API, and began discussing how to whitewash the evidence through "data desensitization". The reliance was so heavy that employees frequently hit the OpenAI API's maximum access limits.

Citing internal documents, The Verge said that ByteDance issued an order a few months ago to "stop using GPT-generated text at any stage of model development".

But it was around this time that ByteDance released its own large language model, Doubao. According to Doubao AI's official Weibo account, Doubao offers features such as a chatbot, a writing assistant, and an English learning assistant; it can answer a wide range of questions, hold conversations, and help people find information, and it supports web, iOS, and Android platforms. Doubao can provide help with natural language processing, knowledge understanding, dialogue, information retrieval, sentiment analysis, machine learning, and more.

However, ByteDance has continued to use the API in ways that violate OpenAI's and Microsoft's terms of service, including evaluating the performance of the models behind Doubao. "They say they want to make sure everything is legal, but really they just don't want to get caught," said one person with first-hand knowledge of what is going on inside ByteDance.

Three parties spoke out one after another, and only ByteDance was anxious

ByteDance

After The Verge published the report, ByteDance spokesperson Jodi Seth responded as follows: GPT-generated data was used to annotate models in the early development of Project Seed and was removed from ByteDance's training data around the middle of this year. ByteDance is authorized by Microsoft to use the GPT API. We use GPT to support our products in non-Chinese markets; in the Chinese market, Doubao is powered by our self-developed model.

Yesterday afternoon, a ByteDance representative responded again, emphasizing that the company must abide by OpenAI's terms of use when using its services and that it is in contact with OpenAI to clear up possible misunderstandings caused by outside reports.

ByteDance's account of its use of OpenAI services:

1. At the beginning of this year, when the technical team began its initial exploration of large models, some engineers used GPT's API services in experimental research on smaller models. That model was used only for testing, was never planned for launch, and was never used externally. The practice was discontinued after the company introduced GPT API call compliance checks in April.

2. As early as April this year, the ByteDance large-model team set a clear internal requirement that data generated by GPT models must not be added to the training data sets of ByteDance's large models, and it trained the engineering team to abide by the terms of service when using GPT.

3. In September, the company conducted another round of internal checks and took further measures to ensure that its GPT API calls meet the relevant requirements, for example batch-sampling model training data for similarity to GPT output, to prevent data annotators from using GPT privately.

4. In the coming days, we will conduct another comprehensive check to ensure strict compliance with the terms of use of the relevant services.

OpenAI

OpenAI spokesperson Niko Felix issued a statement confirming that ByteDance's account has been suspended. "All API customers must adhere to our usage policies to ensure that our technology is used for good. While ByteDance's use of our API was minimal, we have suspended their account while we investigate further. If we find that their usage does not follow these policies, we will ask them to make the necessary changes or terminate their account," Felix said.

Microsoft

Microsoft spokesman Frank Shaw said in a statement: "Microsoft AI solutions such as the Azure OpenAI Service are part of our limited access framework, which means all customers must apply for and be approved by Microsoft before gaining access. We also set standards and provide resources to help our customers use these technologies responsibly and in compliance with our terms of service. And we have processes in place to detect misuse; when companies violate our code of conduct, we stop their access."

Judging from the three statements in this incident, OpenAI has been relatively restrained: it only suspended ByteDance's account and said it would investigate before deciding whether further measures are needed. Microsoft has taken a "nothing to do with me" attitude, as if to say, "I'm just the middleman; we have our own rules, and we will ban any violation." ByteDance seems the most anxious; after all, the fire is already burning at its feet. It first issued a clarification and explanation, and then immediately contacted OpenAI to put out the fire as quickly as possible.

ByteDance's AI layout

Public information shows that ByteDance established an AI lab as early as 2016, focusing on research in natural language processing, machine learning, data mining, and other areas. ByteDance products such as Douyin and Toutiao have also frequently added AIGC (generative AI) features to keep attracting traffic.

In 2023, ByteDance significantly accelerated its moves in the AI field. In June, ByteDance's Volcano Engine launched the large-model service platform "Volcano Ark", providing enterprises with a full range of platform services such as model fine-tuning, evaluation, and inference.

In August, ByteDance's self-developed general-purpose large model "Skylark" was among the first batch of large models registered under the "Interim Measures for the Management of Generative Artificial Intelligence Services".

On August 17, ByteDance launched a public beta of "Doubao", an AI chatbot built on the Skylark model, aiming to develop AI applications for the consumer (C-end) market.

Recently, while scaling back its gaming and XR businesses, ByteDance set up a new AI department, Flow. Recruitment listings show that Flow is an AI innovation business team under ByteDance; it has launched two products, "Doubao" domestically and "Cici" overseas, and has several more AI-related innovative products in incubation.

At the same time, ByteDance ordered more than US$1 billion worth of GPUs from Nvidia this year; its orders alone matched the total revenue of Nvidia's commercial GPU sales in China last year. In terms of hiring, ByteDance also ranks first among the top 10 companies by number of new AIGC positions, accounting for 3.24% of all new AIGC positions.

All of this shows how much importance ByteDance attaches to AI and large models. Returning to the incident itself: would a company that cares this much really take such a big risk in order to "overtake on the curve"?

Metaverse New Voice has something to say

After the emergence of ChatGPT, ByteDance, like many major domestic companies, has been working hard to keep pace with AI, but it has clearly fallen somewhat further behind. Many people tried Doubao after it launched, and its performance did not reach the first tier. If Doubao had really been trained with ChatGPT's help, it would be hard to explain why the results are only at this level; if ChatGPT was not used to train Doubao, then this level of performance is about what one would expect.

When Musk's Grok AI was suspected of plagiarizing ChatGPT, artificial intelligence researcher Simon Willison said in an interview with Ars Technica: "Many large models have already been fine-tuned on data sets generated with the OpenAI API, or on data scraped from ChatGPT itself."

But those operations clearly stayed within a reasonable range, and the same may well be true of ByteDance. As for whether ByteDance, eager for quick results, chose to go beyond that reasonable range, presumably an internet company of its scale would not stoop to such plagiarism.