Silicon Valley 101
The wind of AIGC has finally reached the music world. In March this year, the "ChatGPT of the music industry," Suno v3, made its debut. Users only need to enter a prompt on Suno, and in a few seconds it generates two complete two-minute songs, with lyrics, composition, instrumentation, and vocals produced in one pass, greatly lowering the barrier for ordinary people to create music.
Audiences already accustomed to various "AI singer covers" quickly embraced Suno. From a "Kung Pao Chicken Aria" to a heavy-metal "Let's Swing the Oars," in English, Japanese, Russian, Mandarin, and even Cantonese, netizens spontaneously uploaded a wide range of works. Platforms such as NetEase Cloud Music and QQ Music quickly launched Suno AI music sections and even regularly updated official recommended playlists.
This episode of "Silicon Valley 101" invites two guests from the fields of music and AI music generation: well-known music blogger "Daodao Feng" (Feng Jianpeng), a percussion lecturer at the Hartt School of Music at the University of Hartford and a full-time Broadway performer, and Roger Chen, music tech lead at Meta. They share their views on text-to-music models represented by Suno, and on how AI will affect the future of the music industry.
Highlights of this episode:
⭕️ Why is Suno the most popular? Because it dared to be the first to release a text-to-music model to the public
⭕️ [Test 1] "Sad rock about not finding a job": the result isn't sad
⭕️ Listenable but without attitude: AI can only write rock that isn't angry
⭕️ AI-written songs can reach the industry's average level, but cannot stand out as top works
⭕️ Tempo is one of the most important parameters in music. Why can't AI write 80 bpm music?
⭕️ AI songwriting follows a different logic from human composition: it can only write from left to right, with no overall view
⭕️ With training material comprehensive and rich enough, could AI write songs at Taylor Swift's level?
⭕️ The same music sounds different when performed by musicians of different levels
⭕️ [Test 2] A hero-themed symphony scores 7 for listening experience, but as a contractor it failed
⭕️ Suno cannot generate specified instruments from the prompt; it only pursues a roughly similar sound
⭕️ AI-generated music is an irresistible tide, but for now it cannot write songs the way a musician does
⭕️ [Test 3] Will Suno perform better on the fugue, with its strict rules?
⭕️ Fugue research has a 20-year history in AI music, but results still fall far short of Bach's originals
⭕️ Music technology and psychology: how was MP3 invented?
⭕️ The essence of music is "organized sound," the underlying logic of text-to-music models
⭕️ What music fears most is boredom: art needs to jump beyond patterns humans have already summarized
⭕️ Creators' imaginations run wild: adding a random-number mechanism to music
01 Letting AI write "sad rock about unemployment": the result isn't sad
"Silicon Valley 101": Besides Suno, several other music-generation tools have appeared recently. Why do you think Suno is the most popular?
Roger: Because Suno is the boldest, daring to be the first to release its AI music-generation model. Other large companies, such as Facebook and Google, are actually ahead in technology, but they have many considerations. Beyond releasing the technology, they also need to weigh its potential impact on society.
Music in particular, unlike text or images, raises very sensitive copyright issues. If you have a huge amount of data, say all the songs in the world, to train a model, it will certainly produce a very good result. But it may face many legal problems, and may even change the structure of the entire music industry. The pie is that big; how will it be divided in the end? Share revenue with record companies and publishers? If these issues are not thought through, the consequences could be disastrous.
"Silicon Valley 101": Sounds dangerous. So it is really not a technical issue but a copyright issue.
Let us first hear how capable Suno is. We asked Mr. Feng to run a field test and challenge it with some professional, difficult music generation. We also suggested some topics, such as writing a sad rock song about "unemployment" or "a failed job interview."
Daodao Feng: Okay. We'll let Suno write the lyrics itself, and we'll set some constraints on the attributes of the music: a sad story about not finding a job, classic rock, 80 bpm, guitar, bass, drums, keyboard.
"Silicon Valley 101": The title it chose for itself is quite artistic.
Daodao Feng: I think the lyrics it generated are consistent with my theme. But musically, we specified a "sad story," and I didn't hear much attitude. The music itself is of average quality and meets our requirements; at least it has a rock-and-roll feel.
I have tested many Chinese songs before. By comparison, AI-generated English songs seem more mature; AI's ability to understand English lyrics and convert them into music seems to go further. But consider the structure of the music itself: rock music usually contains two verses and a chorus. In the music Suno generated, the transition from verse to chorus lacks a sense of progression, a driving force. In other words, you can hear the verse switching directly into the chorus, without the instrumental "build-up" that accumulates and pushes toward the climax. There should be a process of accumulation and then release, and the AI-generated music lacks this final push.
However, the AI did a pretty good job of distinguishing the two verses and generated a decent interlude. In human-composed music, the emotional change between verses is usually small, while the change from verse to chorus is more pronounced.
When AI creates music, its biggest problem compared with human composers is that it lacks an "attitude," in other words, a creative motive. If I were a human songwriter, there would be specific reasons behind the song, such as frustration over not finding a job, or anger about something. Those emotions make the music more emotionally resonant.
Although AI-generated music can now meet the basic needs of a text description, my tests show that AI cannot yet reflect human emotion in composition and arrangement. Music becomes classic because the humanistic spirit and attitude it carries resonate with people. There are thousands of rock songs, but only a few become masterpieces passed down through generations. AI-generated music can be produced, but it is hard for it to stand out in the industry because it lacks that resonant attitude. At this level, AI cannot yet replace the emotions of human composers.
"Silicon Valley 101": Human composition needs to express emotion and resonance, and sometimes also requires some luck. Compared with the average level of the entire music industry, do you think AI has reached it?
Daodao Feng: I think AI's music-generation ability is close to the average human level. If we ranked 10,000 songs, AI's music might sit in the middle, say between the 4,000th and the 6,000th.
But the problem is that being average in the music industry may not be enough to stand out. Of all the classic rock we can think of, each person may only name 100 or 200 works they truly remember and are willing to pay to hear. The remaining works, though perhaps above average, are not enough to reach the top of the industry or to support a professional musician. Whether such music could survive in the industry is still a question.
However, there are situations where the demands on music are not so high. For example, I may need a piece of rock-style music as background for a short video, and it doesn't need to be outstanding; in that case, AI-generated music is good enough. Another advantage of AI music is better customization, especially in low-budget productions such as film and television scoring. Although existing royalty-free music libraries are vast, it is not easy to find music that perfectly matches a specific theme. AI can generate more relevant music from specific prompts and solve this problem. But that is about the extent of it.
"Silicon Valley 101": You just specified 80 bpm, but the AI seemed unable to honor this parameter. What does it represent?
Daodao Feng: bpm means beats per minute, a measure of tempo. In music, tempo is probably one of the most important elements. Take the same song: slow it down two or three times and a cheerful song may sound sad; speed it up and a sad song may sound happy. There's a scene in the movie Big Shot's Funeral that illustrates this, where a dirge is sped up until it sounds like holiday music. I believe this is technically achievable, but current tests show that AI's control here is not yet mature.
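To make the 80 bpm constraint concrete, here is a small illustrative sketch (not tied to any actual Suno parameter) converting bpm into beat and bar durations:

```python
def beat_seconds(bpm: float) -> float:
    """Seconds per beat: at 80 bpm, one beat lasts 60/80 = 0.75 s."""
    return 60.0 / bpm

def bar_seconds(bpm: float, beats_per_bar: int = 4) -> float:
    """Duration of one bar in common (4/4) time."""
    return beats_per_bar * beat_seconds(bpm)

# Halving the tempo doubles every duration, which is part of why
# tempo alone can flip a song's perceived mood.
assert beat_seconds(80) == 0.75   # one beat every 0.75 s
assert bar_seconds(80) == 3.0     # a 4/4 bar lasts 3 s
assert beat_seconds(40) == 2 * beat_seconds(80)
```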
"Silicon Valley 101": Let's also ask Roger: why does the generated music not feel sad enough? Is it that the model cannot understand the concept of "sadness," or that its way of generating cannot express it?
Roger: Mr. Feng just talked about sorting 10,000 songs. AI-generated music may rank down around the 7,000th to 8,000th, far from the top. This is related to the large models and training data AI uses.
The music industry has a pronounced head effect: a large number of works can only sit near the bottom.
The databases currently used in the industry are mainly royalty-free music libraries, such as Shutterstock Music. These libraries provide not only audio files but also rich metadata.
This training data is usually not top-tier music. If the generated music resembles the music in the royalty-free library, then from the model's perspective the goal has been achieved, which is why AI-generated music may not be outstanding.
Another problem is that when we listen to AI-generated music, transitions between sections can feel abrupt, for instance from the first verse to the chorus. Human composition usually follows top-down logic: first determine the overall structure, such as an AABA form, then gradually work out each section's chord progression and orchestration.
In contrast, the AI model generates from left to right. It has no global perspective and produces the music step by step, so changes can sometimes feel sudden. For example, when generating eight lines of lyrics that should be spread evenly across sections, the AI may squeeze two lines into one section, leaving a line missing later. To patch this, it may forcibly add a lyric, or bridge to the next part with drum fills. These are all problems that can arise as the music builds up.
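This left-to-right generation can be illustrated with a toy sketch (purely illustrative, not how Suno is implemented): each step samples the next section from a distribution conditioned only on what came before, so nothing enforces a global form like AABA.

```python
import random

# Toy next-section model: probabilities depend only on the previous token,
# illustrating left-to-right generation with no global plan.
TRANSITIONS = {
    "verse":  {"verse": 0.5, "chorus": 0.4, "bridge": 0.1},
    "chorus": {"verse": 0.6, "chorus": 0.2, "bridge": 0.2},
    "bridge": {"verse": 0.7, "chorus": 0.3},
}

def generate(start="verse", length=8, seed=0):
    """Sample one section at a time; each choice sees only the previous one."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        options = TRANSITIONS[out[-1]]
        sections, probs = zip(*options.items())
        out.append(rng.choices(sections, weights=probs)[0])
    return out

song = generate()
# A human composer would fix the form up front (e.g. AABA) and fill it in;
# this sampler commits to each section before "seeing" the rest of the song.
```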
Another issue is the "soul" of the lyrics. This cannot be blamed entirely on Suno, since it uses a text-generation model. Most AI-generated text is an abstraction over a large number of internet articles, and most of that content has no "soul" itself. How to inject emotion and soul into AI-generated content is a key challenge, and an advantage human creators hold over AI.
As for why the AI cannot honor bpm, this genuinely surprises me, because in the training data the bpm of each song is clearly labeled. The AI may simply not be using this information, or in the current version of the model bpm is not an important consideration. Technically, this is an easy problem to solve.
"Silicon Valley 101": You just mentioned training data. The data used to train AI comes from royalty-free music libraries. If classic works by Taylor Swift, Queen, Coldplay, and so on were used as training data, could AI generate similar works?
Roger: Yes, in theory, as long as the training data is good enough, AI can do it. But training data is more than the audio itself; it also needs proper descriptions. If you just download a song from Spotify without describing it, the AI doesn't know what to learn. You must tell the AI, for example, what kind of song Coldplay's "Yellow" is, so that the next time it sees a similar description, it knows to generate a song like "Yellow."
"Silicon Valley 101": But if AI generates a song very similar to "Yellow," even imitating Coldplay's sound, does that constitute infringement?
Roger: Yes, unless some kind of settlement can be reached with the musicians in the future. Musicians may realize that once Pandora's box is opened it cannot be closed; they may have to accept the reality of AI-generated music, as long as they are properly compensated.
"Silicon Valley 101": But at least for now, using musicians' copyrighted works as training data is still prohibited.
Roger: Yes. There is now an organization called Fairly Trained that keeps an eye on Suno, constantly looking for works that may be too similar to copyrighted music. If such a work is discovered, they may take legal action.
02 Writing a hero-themed symphony: as a contractor, AI failed
"Silicon Valley 101": What about the copyright status of the classic symphonies in history? I have the concept of the public domain in mind; works have a copyright term.
Roger: Yes. Generally, 70 years after the composer's death, the work enters the public domain.
Daodao Feng: Once a work enters the public domain, the score itself has no copyright and anyone can perform it. But if a performance is recorded, say by the New York Philharmonic, the recording itself is protected by copyright. So using those recordings to train AI may still raise copyright issues. Unless AI could train its sounds from the scores themselves, which might avoid the copyright problem.
"Silicon Valley 101": That means you could use synthetic data: have the computer automatically generate sounds from the scores, then use these synthesized sounds to train the AI model. That is possible.
Daodao Feng: It is feasible from a copyright perspective. But I worry it may not be ideal compositionally, because even the notation software currently used in the music industry is not fully satisfactory at simulating sounds. The best film scores still need to be recorded by real players. The software's handling of timbre details and playing techniques, such as the violin's different bowing techniques, is not yet perfect. Adjusting every detail of every instrument would be very time-consuming.
"Silicon Valley 101": We just discussed those classic pieces of music that can be used freely 70 years after the author's death. Is that music database large?
Roger: For the record industry, real development started in the 1950s. By that timeline, the works of artists like Elvis Presley, or the earlier jazz pioneers, would only begin entering the public domain around 2020. Some early recordings do exist, but their sound quality is poor. So it may be another 70 years before a large body of this music becomes freely available.
"Silicon Valley 101": We just tested rock music; now let's try classical.
Daodao Feng: Okay, no problem. This time we'll test instrumental music, and I'll try to specify some instruments. We'll ask it to generate a symphony with a "heroic" theme, specifying strings, woodwinds, brass, and timpani in the percussion, a fairly standard configuration.
Daodao Feng: Let's listen to the second song, because Suno generates two songs at once, and the difference between them can be quite large.
Daodao Feng: I think the second one sounds more heroic and closer to a symphonic style than the first. However, both feel a bit like film soundtracks to me, and still fall short of real symphonic music.
I want to try again, specifying the classical style, and this time I'll add a more specific time range, the nineteenth century. Let's generate a new piece and see.
Daodao Feng: After specifying the period this time, the generated music was much better than before. However, there is no obvious percussion part, such as timpani; it is dominated by low strings. The woodwind and brass parts seem mixed together, and the timbres are hard to distinguish.
This piece is closer to classical music in melodic writing and rhythm than the previous one. It is not very repetitive overall, has a certain momentum, and develops gradually. But there is still a gap between it and the form of a true symphony.
Another problem is that some parts of the generated music are fine, but it feels like winning a lottery; there is an element of chance. Although the writing in places is good, the AI did not honor some of the instruments I explicitly specified. If I were the client (Party A) and the AI the contractor (Party B), I would say the contractor did not fully meet my requirements.
"Silicon Valley 101": Could you take the AI-generated music, add some instruments, and revise it to a level acceptable to the client?
Daodao Feng: It is possible, but the workload would be heavy. People now joke that AI-generated music is better used for finding inspiration: let the AI write a piece, grab a few bars from it as a theme (motive), then expand it into a full symphony. Works generated directly by AI, however, currently seem far from the standard of a symphony.
"Silicon Valley 101": On a 10-point scale, how many points would you give the AI-generated music?
Daodao Feng: For writing and listening experience, I can give it 6 to 7; at least it sounds like the real thing. By stricter criteria, such as meeting the instrument requirements, I might only give it a 5.
"Silicon Valley 101": What do you think, Roger? It missed some of the instruments we asked it to use.
Roger: When we added the "nineteenth century" tag on the second try, the results improved. This comes down to the training data. The training set carries different genre tags for string music, and the AI needs to understand and match the corresponding tags to generate the music. For classical music there is a dedicated genre label, something like "masterworks," and the AI must understand these terms to generate correctly. If we want better music, it is worth studying the labels in the training data set; they can provide inspiration.
As for why the AI cannot accurately reproduce the specified woodwind and brass sounds: when AI generates music, it is not built from models of individual instruments. It analyzes a large number of recordings, abstracts the basic elements of music, and then combines those elements. The AI doesn't really understand what brass or a woodwind is; it just uses the provided characteristics to generate music that sounds like them. A future direction may be advances in source-separation technology, which would let us split existing recordings into separate tracks (stems) and train on each instrument individually, giving the AI a deeper understanding of each instrument.
On AI's potential to provide inspiration for musicians: current tools mainly support text input, but technically the same AI architecture is fully capable of supporting audio input. For example, if a user could input a classical piece, say one by Mozart, then instruct the AI in text to add elements such as electronic drums, and hear how the AI fuses them into new music, that could be a very useful tool for music creators.
However, current AI music-generation tools are consumer-oriented: they assume users know nothing about music and can only describe the style or elements they want in text. This design may succeed commercially. I believe more companies will enter this segment and develop AI music-generation tools that are more professional and better suited to musicians.
"Silicon Valley 101": Mr. Feng, I would like to know the general attitude of musicians toward AI music products like Suno. Is it welcoming or somewhat resistant?
Daodao Feng: I cannot speak for all musicians, only for myself. I know some musicians, such as more than 200 artists in New York, have publicly expressed resistance to AI technology, and AI has indeed had an impact on our industry. My own attitude is cautious optimism.
First of all, we cannot resist the trend of technological development, and for low-cost music production AI is a great help. But I'm not particularly panicked, because human music has some unique qualities that current AI cannot fully reproduce.
AI is mainly based on statistics, while music creation requires deeper logic and cultural accumulation. Unless AI can transcend its statistical limitations and develop real intelligence and creativity, I don't think it poses a threat to the entire music industry.
I think AI can be a powerful tool for musicians and improve creative efficiency, but it cannot completely replace human creativity and emotional expression.
03 Facing the strictly ruled fugue, will Suno perform better?
"Silicon Valley 101": Before we test AI-generated fugues, can you first explain what a fugue is? Then we can play a standard historical fugue and compare it with the one the AI generates.
Daodao Feng: The fugue is a complex compositional form built on counterpoint. Unlike modern pop, where a melody is written first and chords are set under it, the fugue focuses on the relationship between each note or group of notes, moving from consonance to dissonance and resolving back to consonance. Fugue writing has many strict rules, such as avoiding parallel fifths and octaves.
In a fugue there is usually a subject, to which the other voices respond. In this way, along with various techniques of variation, the entire piece is constructed. Fugue writing has many systematic rules, and these strict rules ultimately limited its further development. Musicians felt the need to break out of these frameworks and explore more innovative possibilities, which is why the fugue did not continue from the Baroque era into the 20th century.
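One of the strict rules mentioned above, the ban on parallel fifths and octaves, is mechanical enough to check in code. Here is a simplified toy checker over two voices given as MIDI note numbers (an illustration of the rule, not a tool used in the episode; a full counterpoint check would also verify the direction of motion):

```python
def interval_class(low: int, high: int) -> int:
    """Interval between two MIDI notes, reduced to within an octave."""
    return (high - low) % 12

def parallel_perfects(upper, lower):
    """Indices where two voices land on consecutive perfect fifths/octaves.

    Simplified: flags repeated perfect intervals (0 = unison/octave,
    7 = fifth) whenever both voices actually move. Voices are
    equal-length lists of MIDI note numbers.
    """
    flagged = []
    for i in range(1, len(upper)):
        prev = interval_class(lower[i - 1], upper[i - 1])
        curr = interval_class(lower[i], upper[i])
        both_move = upper[i] != upper[i - 1] and lower[i] != lower[i - 1]
        if both_move and prev == curr and curr in (0, 7):
            flagged.append(i)
    return flagged

# C4-G4 moving to D4-A4: two perfect fifths in a row, a textbook violation.
soprano = [67, 69]  # G4, A4
bass = [60, 62]     # C4, D4
assert parallel_perfects(soprano, bass) == [1]
```

Rules this regular are exactly what symbolic AI systems have long been able to enforce, which is why the question below about fugues suiting AI is a natural one.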
Roger: Let me share a prompt: Bach's Toccata and Fugue. This prompt is taken from the training data set, and I want to see whether, given this prompt, the AI can generate music that sounds a lot like Bach, or very close to the original.
Daodao Feng: Okay. This prompt describes a Toccata and Fugue in D minor that should sound dark and dramatic, with an organ solo, serious and powerful overall. It describes a very famous piece by Bach, probably one of his most familiar works.
The AI's results are out, and the cover image is a church, which is quite fitting.
Daodao Feng: The AI-generated music feels superficially similar to the original, but the actual difference is still quite obvious. If you have heard Bach's original, you know the opening is stunning, and that sense of impact is strongest in a church or a large open space.
Daodao Feng: Let's compare just the openings. The sheer impact of Bach's opening, and the clarity of each voice that follows, are hard to achieve with AI's current training methods. What I want to emphasize most is that the first impression of the AI-generated music differs greatly from the original: Bach's part-writing is very clear, while the AI's handling of the voices seems blurry.
In fugue music, two voices echoing each other is a defining feature: the first voice states the subject, and the second repeats it, forming a dialogue. In fugue writing, the same melody recurs in different voices with variations, but even during the variations the listener can still recognize that they stem from the same subject.
However, in the music the AI just generated, the echo between voices and the consistency of the subject are not obvious; they are quite vague. To me, the AI-generated music sounds muddled, with the voices stuck together. Although you can recognize the organ sound and the existence of two voices, they lack a clear subject and rigorous logic, something AI cannot yet achieve.
"Silicon Valley 101": The logic of fugue music is very rigorous. Does that mean it is better suited to AI generation? After all, AI is good at logical, formulaic tasks.
Roger: Research in AI music has indeed gone on for many years, including on fugues. Bach's scores are easy to find online, and AI can model this highly logical music data. At the symbolic level, AI can already simulate fugue writing quite well, including subjects and variations.
However, current end-to-end generation systems such as Suno are not specifically designed to generate fugues. What the AI produces depends on its training data: if it has only ever heard one fugue recording, it may not learn the form well. In AI music generation, systems biased toward logical reasoning may perform better on music like the fugue.
04 Creators' imaginations run wild: adding a random-number mechanism to music
"Silicon Valley 101": Mr. Feng, as a professional musician, do you have any particular questions about using AI to create music?
Daodao Feng: We usually think great composers create by inspiration, but I think music may have more to do with cognitive science. All emotions and thoughts ultimately come down to electrical signals or chemicals.
Why does some music make people happy and other music make them sad? AI intersects with musicology in many ways, and many interesting studies can be done.
"Silicon Valley 101": Mr. Feng once said that what music fears most is boredom. Can AI overcome this in the future and create music that is both logical and unexpected?
Daodao Feng: Whether AI can create something from nothing is the key question. AI can do a good job based on existing knowledge, but the development of music requires innovation, such as rock music growing out of jazz. AI today still works on statistics: it summarizes existing human music to generate new works. Art needs breakthroughs beyond the scope of existing human knowledge, and AI cannot currently do this.
It would be amazing if AI could surpass existing models and develop real innovations. Although such development is still a long way off, I would be very excited if AI could calculate and develop entirely new musical forms and styles, even if this may bring some moral and ethical challenges.
"Silicon Valley 101": From a technical perspective, Roger, do you think AI can overcome the monotony and boredom in generated music?
Roger: I think it is possible to a certain extent. Music is organized sound, and some musical genres are really rearrangements of existing elements. Different rhythm patterns can make music sound completely different even when the instruments are similar. Many current genres innovate exactly this way; some branches of hip-hop, for example, innovate in rhythm.
If AI is given enough data and computing power, it can theoretically generate unprecedented music combinations that are in line with human aesthetics. However, AI may not be able to automatically screen out these innovative combinations, which requires human aesthetic participation for selection and guidance. In the long term, many people may try various music fusions, combining African, Latin and other ethnic elements with electronic music to create novel music genres. The key is whether someone can capture these innovations and promote them in human society.
As humans create more excellent musical works, AI will obtain higher-quality training data, forming a feedback loop in which humans and AI develop together. AI will push human musicians to create better works, and AI itself will keep improving as it absorbs those works. I believe that in 20 years, both human music and AI music will reach a higher level, coexisting and progressing together.
"Silicon Valley 101": This process of AI music creation sounds a bit like a development toward artificial general intelligence (AGI).
Daodao Feng: Let me add a suggestion that may sound naive. Current AI largely follows the labels and logic we set for it. Could a random-number mechanism be introduced so that AI generates real randomness, such as new timbres or rhythm patterns? That might lead to more novel and interesting results, rather than mere recombination of existing elements.
It's a bit like the "God rolls the dice" concept. While human composers have always experimented with different musical combinations, adding randomness could lead to real innovation. I don't know if AI can achieve this.
Roger: There is already a certain degree of randomness in today's AI. Even with the same prompt, the model can output two different songs. This randomness is introduced at every step of generation: the AI makes a partly random selection each time it generates a short stretch of audio.
In addition, AI models have a parameter called "temperature" that adjusts the degree of randomness. Set low, the model strictly picks the next step with the highest probability; set high, it is more willing to explore lower-probability options, which may produce surprises.
Today's randomness is mainly introduced during the generation process, but in the future we may try more diverse kinds of randomness, such as controlling it at a semantic level humans can understand. That could lead to richer and more interesting musical results.
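The temperature mechanism Roger describes can be sketched in a few lines. This is a generic next-token sampler (illustrative only, not Suno's actual code): dividing the scores by the temperature before the softmax sharpens or flattens the distribution.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Temperature-scaled softmax sampling over next-token scores.

    Temperature near 0 approaches greedy argmax; temperature above 1
    flattens the distribution, encouraging exploration.
    """
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(logits)), weights=probs)[0]

# The same scores sampled at different temperatures:
logits = [2.0, 1.0, 0.2]
cold = [sample_with_temperature(logits, 0.1) for _ in range(100)]
hot = [sample_with_temperature(logits, 5.0) for _ in range(100)]
# Low temperature nearly always picks index 0; high temperature spreads out.
```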
[Related supplementary information]
bpm: beats per minute, the unit for measuring musical tempo. It counts how many times a specified note value, such as a quarter note, occurs in one minute; the larger the bpm value, the faster the tempo.
Fairly Trained: a non-profit organization founded by former executives of technology companies such as Stability AI, together with well-known Hollywood law firms and music-industry figures. It certifies AI models covering image, music, and song generation, attesting that they obtained permission to use copyrighted training data.
Fugue: a transliteration of the Latin "fuga" ("flight"). A polyphonic genre popular in the Baroque period, it is the most complex and rigorous form of polyphonic music, with relatively standardized structures and writing methods. At the opening, the main musical material that runs through the whole piece, stated in a single voice, is called the "subject"; the line written in counterpoint against the subject is called the "countersubject." Afterward, the subject and countersubject appear in turn in different voices, often with transitional episodes between subject entries for musical contrast.
Masking effect: the phenomenon in which, when multiple stimuli of the same kind appear (sounds, images, etc.), a person cannot fully perceive all of them. It divides into visual and auditory masking. Auditory masking means the human ear is sensitive mainly to the most prominent sounds and much less sensitive to the rest: if the sound in one frequency band is strong, listeners become insensitive to sounds in other bands. This principle underlies perceptual audio coding such as MP3, which discards the masked components.