Video generation may be the hottest track for large models in 2024.
At WAIC (the World Artificial Intelligence Conference), the Kuaishou booth sat at the edge of the exhibition hall. It was almost closing time when this reporter visited the booth for Kuaishou's video generation model Keling, and was squeezed out several times by enthusiastic questioners. Interested visitors surrounded the staff and peppered them with questions; even when the closing music played, their enthusiasm did not flag, until the staff shut down the equipment and began ushering people out.
The fire was first lit by Sora. In February this year, the video model Sora released by OpenAI caused a sensation, opening a "battle of a hundred models" in the video field. Since the start of the year, Runway, Pika, and Luma AI have emerged abroad, while at home there are Aishi Technology's PixVerse, Shengshu Technology's Vidu, Kuaishou's Keling, and others. The focus of large-model competition has shifted from text and images to video.
However, video generation is still at an early stage: there is no consensus on the technical route, the generation process is hard to control, and the results remain far from commercial standards. Many in the industry draw analogies to the early stages of language and image models.
Liu Ziwei, an assistant professor at Nanyang Technological University in Singapore, believes video generation is at roughly the GPT-3 stage of large language models, which at the time was still about half a year from the inflection point of GPT-3.5 and ChatGPT. Sophon Engine CEO Gao Yizhao believes current video generation resembles image generation on the eve of its 2022 boom, before Stable Diffusion was open-sourced, because no particularly powerful open-source "Sora" has yet been released in the video generation field.
Many entrepreneurs have already begun exploring and deploying the technology; after all, waiting until it matures would be too late. With every new wave of technology in the past, "you moved first, when no one could yet see it clearly."
Still in the "GPT-3" era
"The past year has been a historic moment for AI video generation. A year ago, there were few Wensheng video models on the market for the public. In just a few months, we We have witnessed the emergence of dozens of video generation models," Chen Weihua, head of video generation at Alibaba Damo Academy, mentioned at a forum not long ago.
After Sora's release in February, product launches came in quick succession: in April, Shengshu Technology released the video model Vidu; in June, Kuaishou released its AI video generation model Keling, and a week later Luma AI released the text-to-video model Dream Machine; in early July, Runway announced that its text-to-video model Gen-3 Alpha was open to all users.
Alongside the dense product releases, leading video generation companies have also been raising money. In March, Aishi Technology completed an A1 round at the hundred-million-yuan level, with Dachen Caizhi as sole investor. Shengshu Technology then announced a financing round of several hundred million yuan led by Qiming Venture Partners. In June, Pika closed a US$80 million Series B. In July, Runway was reported to be raising US$450 million at a valuation of roughly US$4 billion.
Yet for all the bustle of financing and product launches, from the user's perspective current video generation results fall far short of expectations. "Video generation today is like drawing cards: only by drawing a hundred times do you get one decent result," Liu Ziwei said.
This reporter from China Business News tried multiple video models. The generated footage sometimes showed a walking person's legs vanishing as their feet alternated, a face appearing on the back of the head of a figure turned away from the camera, or a dancing couple swapping faces mid-spin. Generation wait times ranged from one or two minutes to more than an hour.
Such cases are not isolated. OpenAI invited several video production teams to trial Sora, and one of them used it to produce the short film "Air Head", to striking effect. But in a May interview the production team said that "Sora's generation process is difficult to control." The film is stitched together from multiple clips, and across clips it was hard to keep the protagonist, a figure with a yellow balloon for a head, consistent: sometimes a face appeared on the balloon, and sometimes the balloon wasn't even yellow. The finished film is therefore not Sora's direct output; a great deal of manual post-editing went into the final result.
At a WAIC forum, Chen Jianyi, senior vice president of Meitu, also "complained" about AI video generation: the marketing is impressive, but in practice it is hard to use. Many influencers on social media, he noted, do a great deal of work behind the scenes, perhaps generating hundreds of videos before one turns out well, then heavily post-processing that one before publishing it. Viewers come away feeling that AI video technology is now very mature, when in fact the state of the art is still a year or two behind what people imagine.
Current video, image, and 3D generation algorithms run into many structural and detail problems: the output often contains one object too many or one too few, or a hand clips through a body. Refined videos, especially ones that obey physical rules, remain difficult to generate.
Ni Bingbing, a professor and doctoral supervisor in the Department of Electronics at Shanghai Jiao Tong University, traces the cause to the nature of generation itself: all generative intelligence is essentially a sampling process, and video is a much higher-dimensional space than images. With more training data and finer-grained sampling, better content can be produced, but there is a ceiling. "Because the dimensional space is so high, being consistently flawless and true is quite difficult under the current technical framework." Behind this, computing power is a major constraint, and the problem cannot be solved by throwing unlimited compute at sampling.
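To make the dimensionality gap concrete, here is a back-of-the-envelope comparison; the resolution, frame rate, and clip length are illustrative assumptions rather than figures from the article:

```latex
% Raw dimensionality: one RGB image vs. a short RGB clip
% (assumed: 1024x1024 resolution, 24 fps, 5-second clip)
\[
\underbrace{1024 \times 1024 \times 3}_{\text{single image}} \approx 3.1 \times 10^{6} \ \text{values},
\qquad
\underbrace{(24 \times 5) \times 1024 \times 1024 \times 3}_{\text{5-second clip}} \approx 3.8 \times 10^{8} \ \text{values}.
\]
```

A generator must place a plausible sample in this far larger space while also keeping the roughly 120 frames mutually consistent, which is why sampling errors tolerable in a single image compound into visible artifacts in video.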
Chen Jianyi compared the current stage of video generation to the history of film: "The original film was a series of continuous photographs, 24 frames passing in one second; thousands of photos ultimately yielded a one-minute black-and-white movie. Today's AI video generation is still in its early days, essentially at that same one-minute black-and-white starting point." He predicted that video generation may undergo a rapid evolution from primitive to advanced in the short term, compressing the century-long development of film technology into a few years.
Gao Yizhao believes current video generation is a bit like the eve of the image generation boom in 2022. "After Stable Diffusion was open-sourced in August 2022, AIGC image generation began to explode, but in video generation no particularly powerful open-source Sora has been released yet."
Liu Ziwei likened the progress of video generation to the stage of large language models: "It is a bit like the era around GPT-3, still about half a year from the inflection point of GPT-3.5 and ChatGPT, but it should not be far off." By analogy with text-to-image, it took only about a year and a half from the first generation of models to large-scale, explosive applications. With so much capital entering the video field, Liu Ziwei believes, given sufficient data and computing power that explosion will come soon.
Qiming Venture Partners recently released its "Top Ten Outlooks for Generative AI in 2024", one of which is that video generation will explode within three years. The report argues that controllable video generation, combined with 3D capability, will transform production models for film, television, animation, and short videos, and that the compression rate of latent-space representations for images and video will improve by more than five times, making generation more than five times faster.
Sora is not necessarily a perfect solution
Unlike large language models, whose technical routes have largely converged, video generation faces an important problem: no consensus on a technical route has yet formed. Among current teams, several different technical roadmaps are being pursued in parallel. The industry believes Sora is not necessarily the best solution, and new teams may well come up with different answers in the future.
"Last year, everyone generally used SD (stable diffusion) for image and video generation, but when Sora appeared this year, everyone felt that it would be changed to a Sora-like dit (diffusion transformer) architecture." Gao Yizhao told China Business News, It can be seen from this incident that the field of video generation is not as mature as the field of text, nor is it such a solid technical direction. It needs to continue to innovate.
On the technical route itself, Gao Yizhao believes Sora is not necessarily a perfect solution, merely better than the previous generation of solutions and holding certain advantages. "But we can't tell whether a new architecture will appear by the end of the year, or next year."
Video generation currently follows several different paths. The first is the original diffusion model, which follows the text-to-image approach and extends it into the time dimension. The second follows Sora, building a DiT architecture on top of the transformer. The third redoes video and visual content with a large language model, that is, with the autoregressive architecture of LLMs; VideoPoet, the video generation model released by a Google team at the end of last year, generates video on this LLM basis.
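As a rough illustration of the second path, the sketch below shows the core idea of a DiT-style video denoiser in PyTorch: chop a noisy clip into spatio-temporal patch tokens, process them with a transformer conditioned on the diffusion timestep, and predict the noise per patch. All layer sizes and the class name VideoDiTSketch are illustrative assumptions; Sora's actual architecture has not been published at this level of detail, and real DiT conditions via adaptive layer norm rather than the simple additive conditioning used here.

```python
import torch
import torch.nn as nn

class VideoDiTSketch(nn.Module):
    def __init__(self, patch=16, frames=16, size=64, dim=512, depth=8, heads=8):
        super().__init__()
        # Patchify: each (1 frame) x (16x16 pixel) block becomes one token.
        self.embed = nn.Conv3d(3, dim, kernel_size=(1, patch, patch),
                               stride=(1, patch, patch))
        n_tokens = frames * (size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))  # learned positions
        # Map the scalar diffusion timestep to an embedding added to every token.
        self.t_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(),
                                     nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, 3 * patch * patch)  # per-patch noise estimate

    def forward(self, x, t):
        # x: noisy clip (B, 3, T, H, W); t: diffusion timestep (B, 1), float
        tokens = self.embed(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        tokens = tokens + self.pos + self.t_embed(t).unsqueeze(1)
        return self.head(self.blocks(tokens))               # (B, N, 3*patch*patch)

# Example: one denoising prediction for a 16-frame, 64x64 clip.
model = VideoDiTSketch()
noisy = torch.randn(1, 3, 16, 64, 64)
timestep = torch.tensor([[500.0]])
noise_pred = model(noisy, timestep)  # shape (1, 256, 768)
```

Because all tokens from all frames attend to one another, consistency across time falls out of the same attention mechanism that handles consistency within a frame, which is the usual argument for this architecture on longer clips.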
Liu Ziwei believes that for a short video, say 3-4 seconds of making a picture move, diffusion-model technology is sufficient; but for longer videos of 10-20 seconds, the DiT architecture still holds the greater advantage, with stronger understanding of long texts or long videos and better generation capability. Yet even Sora's DiT architecture does not understand physics and world models well enough, so some teams are also trying to use knowledge learned by language models to help generate a visual world.
"This path (autoregressive architecture) currently seems to have less visual effect than the other two, but I personally think its upward trajectory will be very fast. Maybe by the end of the year, I will find that it is better to use language models for generation. At that time, we will truly integrate all modalities." Liu Ziwei found that in terms of training cost, diffusion is relatively low, while autoregression is high, but once autoregression is trained, the cost advantage in reasoning will be. Very big.
At present, computing power remains a severe constraint on large models. Ni Bingbing believes that new architectures, new computing methods, or new underlying technologies may be needed in the future to support a more efficient way of generating.
The black-box nature of neural networks is at the core of why generation currently consumes so much data and compute. "For a generative network, we have no idea which node relates to the content we want to generate and control; we don't know which units a given input word corresponds to, nor which units in the network relate to the shape of a given part of the face we output," Ni Bingbing said. What is needed now is white-box generation technology: if the content in a video can be mapped onto network parameters, the generated content can be controlled precisely. Behind this lie the problems of parameter alignment and the representation of data content.
Sora is currently the king of the video field, and since its release it has been the target of domestic catch-up. Gao Yizhao believes that on the underlying technology alone, China is not far behind Sora; what is more decisive is the gap in resource investment and in thinking about product direction.
"In fact, some new domestic entrepreneurial teams are no different from the world's top large-scale model teams in terms of underlying technology. They all have the same set of architectures," Gao Yizhao believes, but if we want to talk about products and applications, then There will be a lot of details, "For example, how to use these technologies to make good applications, and what technologies should be used to make good applications. These are all very difficult things."
Runway released its new text-to-video model Gen-3 Alpha last month. One of its demo videos shows the silhouette of a woman beside the window of a high-speed train: as the train speeds along, neon lights outside the window fall across her face with varying intensity on her cheeks and nose, and the rapidly changing light and shadow shift across her features naturally and realistically.
Gao Yizhao guessed that Runway achieved this effect mainly through targeted data training. "Runway prepared a great deal of data specifically to train light and shadow from the beginning. That is really a product decision: the team believed that for the product to truly meet demand, light and shadow had to look natural, so they trained in many targeted directions." The product layer and the technical layer, he believes, are two different modes of thinking.
In video generation, Liu Ziwei hopes the field will eventually find a "Newton's first law of video generation". For language models, he noted, one can calculate how much gain a given investment of computing power and data will yield, and this predictable input-output ratio is a welcome anchor for capital, industry, and applications. For video generation and multimodality there is no such clear standard yet, and how much improvement a given amount of computing power buys is an essential open question. On architecture, whether autoregression or DiT is truly the endgame, and whether training costs can be brought down, also remain to be explored.
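The "calculable input-output ratio" for language models usually refers to empirical scaling laws. One widely cited form, from the Chinchilla paper (Hoffmann et al., 2022) rather than from this article, expresses expected loss as a function of model size and data:

```latex
% Chinchilla-style scaling law: loss as a function of
% parameter count N and training tokens D
\[
L(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}},
\]
% where E is the irreducible loss and A, B, alpha, beta are fitted constants.
```

Liu's point is that no analogous fitted law yet exists for video generation, so investment in compute and data cannot be mapped to an expected quality gain.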
"Go first when everyone can't understand"
At a forum, when the business model of video generation came up, Shi Yunfeng, vice president of Wuyuan Capital, was more cautious. Given the current quality of video generation, he judged, "building a castle on quicksand is very challenging": the technical foundation has not yet stabilized, and finding PMF (product-market fit) at this stage is very hard.
"A video generation tool that ordinary people can use. It will be observed that users are very disloyal and move very quickly between different apps. luma gained 1 million users within 4 days after it was released. These 1 million users I have used piika more or less before, so it doesn’t make much sense.” Shi Yunfeng believes that there are creators for video generation today, but the problem is that more mature content consumption has not yet emerged.
Compared with investors who wait and see, many entrepreneurs are "doers" of another kind.
"In the past, every time a new round of things came out, we definitely didn't wait until it was mature before doing it. Then it would be too late. It was done when everyone couldn't understand it." said Kong Jie (Huana), founder of fancytech.
FancyTech currently develops its own video and image models and focuses on B2B, generating basic materials for merchants to replace basic shooting, such as shots of products, items, and models. Kong Jie mentioned at the forum that FancyTech's revenue was close to US$10 million last year and is expected to reach US$20-30 million this year.
"We feel that now is a good time for applications," Kong Jie said of launching products. "Making money keeps you at the table and secures that income, so that as new technologies keep emerging, we can layer them on top and gain our own distinctiveness at the same time."
Xu Huaizhe, founder of Morph AI, believes uncertainty is precisely the great opportunity and meaning of entrepreneurship. "Every big company grew from nothing. The opportunity left to startups is that each new technological wave brings huge uncertainty to business models. If the answer were known and you just advanced step by step, that would surely be an opportunity for the big companies."
"It's one thing to catch up with the hot spots, but more importantly, it has to generate actual value." Regarding the popularity of the video generation track this year, Gao Yizhao believes that following the top trends in the field is the key. Products and investment are inevitable, but China also needs to form its own set of playing methods and logic. We may be temporarily behind in technology and resources, but in terms of landing applications, we still have scenario advantages.
"Once the technology in the AI field is opened up, it will not be as difficult for everyone to copy it as imagined. Therefore, the core competitive point is still in the application. When the technology is similar, how can we delve into a certain field and solve the real needs of users." Gao Yizhao believes that application implementation is a question that AI practitioners around the world must answer.
For Sophon Engine, the first landing scenario of choice is urban inspection. "Equipment such as drones captures visual content and sends it back to our large model for analysis." In such scenarios, Gao Yizhao said, the generality of a large model is an advantage: it can handle the various contingencies of complex real environments, such as rainy and windy weather or off-angle cameras. Compared with earlier small-parameter AI software, large models are far more widely applicable.
On the consumer side, Chen Jianyi judged that no AI video platform will be born in the short term: "An AI version of Douyin is unlikely for now." Where opportunities do exist in the industry, he judged, they lie not in traditional film and television content but in generating B-roll scenery footage, MVs, story picture books, online short dramas, and the like.
"For example, if a company wants to shoot a promotional video, it needs to insert two or three pieces of natural scenery. At this time, there is no need to do some actual shooting of the content. It can be generated quickly using AI video generation." Chen Jianyi believes that in the short term, it will be obvious that AI video generation will be of great help to various aerial mirror materials. In addition, during teaching, students only need to enter "I want to see the process of melting icebergs" in the prompt word At this time, video generation can display complex physical knowledge through intuitive videos. Zhu Jiang, the founder of
Jingying Technology, has an interesting analogy. He believes that the current AI generative era is a bit like the Cambrian explosion of life. "Many animal categories today suddenly appeared during the Cambrian explosion. At that time, it was actually very difficult for any species to consider whether it could survive in the future. "He mentioned that the big change at that time was that a type of creature suddenly evolved eyes, and they gained a phased advantage.
(This article is from China Business News)