Picture source @Visual China
text | New knowledge of science and technology
The "Battle of a Hundred Models," fiercely fought for a year, is not over yet, and two overseas AI giants have now set domestic technology companies a new challenge.
At the end of the Spring Festival holiday, Google and OpenAI each released a new AI "nuclear weapon" without warning.
Google's new-generation multi-modal large model, Gemini 1.5 Pro, raises its context window to the million-token level, far outstripping OpenAI's GPT-4 Turbo, and is currently the strongest model on paper.
Sora, the first text-to-video model from the latter, is even more impressive thanks to its stunning visual aesthetics, and it quickly became a hot topic in global technology circles. In the fidelity, length, stability, consistency, resolution, and text understanding of its generated video, it has surpassed mainstream products such as Gen-2, SVD-XT, and Pika and reached the current state of the art. It can be said that it became king the moment it made its move.
Last year, domestic Internet companies such as Baidu, Alibaba, and iFlytek launched self-developed large models to compete for a ticket to the era of intelligent transformation; mobile phone manufacturers such as Huawei, Xiaomi, Oppo, and Vivo also deployed large models, hoping the new technology would bring fresh vitality to a market that has peaked; and many start-ups entered the race as well, trying to travel light and overtake on the bend.
However, real gaps remain: domestic large-model products still trail ChatGPT considerably in performance, ecosystem, and other respects. And the arrival of the video generation model Sora will, unsurprisingly, set off another wave of trend-following.
Yet disruptive results often come from disruptive ideas. At this juncture, how large is the gap between domestic companies and the global frontier in large AI models? Where exactly do the differences lie? And which contenders are likely to stand out?
Panic
On the emergence of Sora, Musk's comment, "gg humans" (humanity has conceded), is regarded as the mainstream view.
Before this, although plenty of text-to-video technologies already existed, the field had not yet converged on one approach. The main implementation path was to use various means to make single-frame pictures "move," much like stop-motion animation. From the standpoint of what users actually need, however, the coherence and naturalness between frames are where the value lies; that is, the seamless connection of semantic information from frame to frame is the core.
In other words, Sora, a product that builds its technical solution around user needs, turns out far better than products built from the standpoint of what is technically achievable. According to the introduction on the OpenAI official website, Sora differs from earlier text-to-video approaches: it lets the model predict multiple frames at a time and keeps the subject of the video unchanged. This is where it is clever: by making a breakthrough at the level of video frames, it raises the upper limit of generated video.
Zhou Hongyi, the founder of 360, also gave high praise. He believes that the birth of Sora means the timeline for realizing AGI (artificial general intelligence) may shrink from ten years to one or two.
Cristóbal Valenzuela, co-founder and CEO of the AI text-to-video startup Runway, a forerunner of Sora, lamented that progress that used to take a year now takes months, and then days and hours.
Before Sora's release, a good deal of smokescreen information was circulating. For example, OpenAI was said to have formed a new team to study child safety, or to be preparing to launch gpt-4.5-turbo, while the real "killer update" stayed well hidden. This also left star startups like Pika and Runway caught off guard when Sora appeared.
In fact, the attitude of major manufacturers at home and abroad towards AI video generation has always been ambiguous. The fundamental reason is that, for now, human-made videos are better in quality and effect, at an acceptable cost; AI video generation had not been as disruptive as everyone imagined, so overall strategy leaned towards defense rather than offense.
It is worth mentioning that ByteDance and Baidu, among domestic companies, have shown a keener sense of smell. Baidu unveiled an AI text-to-video function as early as March last year at the Wenxin Yiyan press conference: Baidu AI automatically finds suitable video material based on the text content, generates the video, and publishes it automatically. This is the TTV (text content emotional analysis) function of Wenxin Yiyan's AIGC.
ByteDance released PixelDance in November last year. It uses the last frame of the previous video clip to guide the first frame of the next clip, a breakthrough in video length, but it is still not open for user testing, so its actual results remain unknown.
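The clip-chaining idea described above, where the last frame of one clip guides the first frame of the next, can be illustrated with a toy sketch. This is only an assumption-laden illustration, not ByteDance's implementation: `generate_clip` and `chain_clips` are hypothetical names, and frames are plain text labels rather than images.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Clip:
    prompt: str
    frames: List[str]  # placeholder labels; a real model would return images


def generate_clip(prompt: str, first_frame: Optional[str], n_frames: int = 4) -> Clip:
    """Hypothetical stand-in for a video model: it only labels frames."""
    # If a guiding frame is supplied, the clip starts from it; otherwise
    # the clip starts from a fresh frame derived from the prompt.
    start = first_frame if first_frame is not None else f"{prompt}#0"
    frames = [start] + [f"{prompt}#{i}" for i in range(1, n_frames)]
    return Clip(prompt=prompt, frames=frames)


def chain_clips(prompts: List[str], n_frames: int = 4) -> List[Clip]:
    """Generate clips in sequence, passing each clip's last frame forward."""
    clips: List[Clip] = []
    guide: Optional[str] = None
    for prompt in prompts:
        clip = generate_clip(prompt, guide, n_frames)
        clips.append(clip)
        guide = clip.frames[-1]  # last frame guides the next clip's first frame
    return clips
```

Calling `chain_clips(["a cat walks", "the cat jumps"])` yields two clips in which the second begins on the frame where the first ended, which is how chaining extends total length while keeping the transition continuous.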
Judging by the development path of GPT, every company working on AI video generation, or even on large models in general, will face a new wave of crises. As Zhou Hongyi said, although domestic large models appear to be close to GPT-3.5, they are in fact still a year and a half behind 4.0. OpenAI probably still has some secret weapons in hand, whether GPT-5 or machine self-learning that generates content automatically.
But within danger also lies opportunity. OpenAI has proved that making video with a large-model approach is feasible. Text-to-video could become the focus of a new round of global AIGC competition and raise the ceiling for live-stream e-commerce and content creation on short video platforms. What every other Internet company and content platform needs to do is prove that it, too, can make video with large models.
From a technical point of view, Sora is a multi-modal hybrid model composed of a large language model and a text and image generator. This also means that the pace of multi-modal model iteration is accelerating. Barring surprises, the first AI wave of 2024 is about to begin.
and
Since ChatGPT burst onto the scene in late 2022, its powerful influence has spread through the domestic technology world like wildfire. Major Internet companies such as Baidu, Alibaba, and Tencent, along with smart hardware companies represented by Xiaomi, Oppo, and Vivo, seemed to catch the scent of a new era and announced large models of their own, intending to claim a place in this AI wave.
At the same time, multi-modal AIGC products such as text-to-image and text-to-video are also advancing steadily. In practice, AI applications for generating text and images have emerged one after another, and the underlying technology changes by the day. AI text-to-video, by contrast, is a position that has long gone unconquered. It is as difficult as it is valuable.
Public information shows that technology companies including ByteDance, Baidu, Alibaba, Hikvision, Wondershare Technology, Torsi, and Danghong Technology are actively deploying text-to-video, but the gap between them and Sora is not small.
Put simply, earlier AI text-to-video tools stayed at the level of "simulating reality," while Sora has jumped to a new level of "constructing reality." The fundamental difference is that the former only imitate the surface of the real world and struggle to capture its physical rules and dynamic changes, while the latter reconstructs, in the virtual world, an existence parallel to the real one.
Sora has not only learned how pixels and images are presented but has also gained a deeper understanding of the "physical laws" of the real world. For example, in the real world every bite of food leaves a bite mark, a natural phenomenon that follows the rules of physics. In videos generated by Sora this detail can also be accurately reproduced, so that "a bite leaves a mark," bringing real-world realism into the virtual world. This is something other text-to-video products cannot do.
Take Baidu's Wenxin Yiyan as an example. Although it can generate videos from input text, it still falls short in handling complex scenes and rendering detail. Moreover, Baidu's AI text-to-video works more like searching existing material libraries for footage close to the meaning of the text and splicing it together; it struggles to generate new video content with AI alone.
Earlier this year, ByteDance released MagicVideo-V2, an ultra-high-definition text-to-video model. The video it outputs is reported to outperform current mainstream text-to-video models such as Gen-2, Stable Video Diffusion, and Pika 1.0 in definition, smoothness, coherence, and fidelity to the text's semantics.
Zhang Nan of Douyin stepped down as CEO in early February to focus on the editing business. This means Douyin will step up its push into AI image and video products, with text-to-video naturally the top priority.
However, the qualities that AI video should have in Zhang Nan's plan, such as higher-fidelity generation, clearer images, and smoother, more natural logical understanding, have already been outdone by Sora.
Compared with the low-key performance of Internet giants, some listed companies have actively spoken out recently and disclosed their business conditions in the field of video generation models.
According to incomplete statistics, more than 10 A-share listed companies, including Wondershare Technology, Bohui Technology, Yidiantianxia, Digital Video, Hanwang Technology, Danghong Technology, Oriental Guoxin, Shensi Electronics, Yinsai Group, Torsi, Guomai Culture, and Jiadu Technology, have disclosed their business in video generation models on investor interaction platforms over the past three months.
But it cannot be denied that there are very few companies that have truly reached the cutting-edge level. Many companies are just following the trend and hyping up, lacking real technical reserves and R&D capabilities.
Oriental Guoxin stated bluntly that it does not yet have mature technical reserves in AI video generation, while Shensi Electronics responded that the company has carried out in-depth research on conversion and fusion across multi-modal data, such as text-to-image, image-to-text, and video. The implication is that its technology in this area is still at the exploratory stage.
The disruptive power of AI text-to-video shows in practical applications. Image and video generation can serve companies' commercial needs, for example by lowering advertisers' costs and making video production more convenient. Take ByteDance as an example: video production accounts for one to two percent of its advertisers' total advertising costs, and since last year ByteDance has used related products to help advertisers reduce that spending.
As with the earlier ChatGPT wave, domestic companies will inevitably lag in launching comparable AI text-to-video products, but this can also be seen as a chance to cross the river by feeling for Sora.
Surging
From the perspective of the global market, AI still sets the direction of the entire technology business, and multi-modality has become mainstream. The path from large language models to multi-modality to artificial general intelligence has gradually become clearer; the disagreement lies in judging the pace.
OpenAI previously spent about half a year testing the large language model GPT-4. If testing Sora takes a similar amount of time, this powerful video generation tool may become available around August this year. That half year is a window for other companies to build up strength.
After all, ChatGPT has been around for more than a year, yet a large number of users still have not used any chatbot product, which also gives other companies a chance to catch up.
The biggest problem domestic companies currently face is that the share prices of first-tier AI companies such as Baidu and iFlytek have, for various reasons, been beaten down to rock-bottom levels, while those of top foreign companies such as Nvidia and Microsoft have hit record highs and OpenAI's valuation keeps rising. This means there are inherent gaps between domestic and foreign AI companies in capital, talent, technology, market appeal, and more.
Zhou Hongyi believes that technological competition ultimately comes down to talent density and deep accumulation. That rings true here: Sora uses transformer + diffusion. In terms of model architecture, if the transformer remains the foundation, leading technology companies will keep the advantage in text-to-video; but if generative video architectures continue to revolve around diffusion, start-ups will have greater opportunities.
However, no technology arrives in a single step; there is only industrial prosperity that advances in a spiral.
Although Sora can generate videos of tens of seconds up to a minute at a time, in actual use, if the product does not give users enough room for fine-grained control so that they can fit it into their own workflows, it will most likely win applause without winning adoption.
Fortunately, the diffusion of the technology has only just begun, and no company will "suddenly die" because a new technology has emerged. OpenAI is more like a pioneer whose strength lies in paving the way; popularization and application still require the power of an ecosystem.
For example, with modular combinations built on text-to-text models, will dedicated smart devices appear, along the lines of mobile phones and smart speakers? Getting more users to run models on the device side, with an approach of open source + small-parameter models + mobile terminals, and using it to renew current products: this is what domestic manufacturers are good at, but it is also where future cut-throat competition will play out.
As a single breakthrough, Sora is a milestone; but in terms of commercial needs and efficiency gains in mixed editing workflows, its value and real-world results remain to be seen.
It is unrealistic to expect to become the next Dong Yuhui or Li Jiaqi on the strength of one-minute AI-generated videos alone, let alone to make long videos, films, or TV series. Even for a short video, which is more efficient: revising the prompt over and over, or adjusting the material in video editing software according to the creator's ideas? Clearly, rather than waiting for Sora to grow stronger, it is better to hope that AI modules are added to video editing software as soon as possible, to genuinely improve work efficiency.
Even if Sora eventually opens registration to everyone, ordinary users will find it hard to produce videos like the current demonstration cases. So in the end, what will decide the contest among the major manufacturers is how to popularize multi-modal applications, how to add AI functions to tools, and how to optimize workflows more directly.
Emerging technologies are universal and are not exclusive to any one company. In exploring multi-modality, domestic companies may wish to refer to how GPT was developed and deployed, find their own advantages at the application level in specific vertical fields, and use that as a direction for rapid development.
It is just that, in this process, what counts is the density of talent, the degree of implementation, and the amount of trial and error.