
Contributed by Tencent PCG ARC Lab

QbitAI | WeChat Official Account QbitAI

A large multi-modal model that can process music has finally appeared!

It can accurately analyze a piece's melody, rhythm, and instrumentation, and even its mood and imagery:

[Listen to the audio on the official account]


And it doesn't just listen: give it a text prompt and a picture, and it will grasp the mood of the image and compose music according to the text requirements:


[Listen to the audio on the official account]

It can even add a soundtrack to silent videos:


[Listen to the audio on the official account]

It can also edit existing music, for example removing the drums from a track:

[Listen to the audio on the official account]

All of these demos come from M2UGen, a multi-modal music understanding and generation framework newly released by Tencent's PCG ARC Lab.

M2UGen covers music understanding, music editing, and multi-modal music generation (text-, image-, and video-to-music).

The research team compared the model's five capabilities against existing models one by one, and ran subjective evaluation experiments on the three multi-modal music generation subtasks (text/image/video to music). M2UGen outperformed the existing models.


In addition, since there are few datasets suitable for training such a model, the team also developed a data generation pipeline and produced and released four datasets: MuCaps, MuEdit, MuImage, and MuVideo.

The code has been open sourced on GitHub, and the model weights and training datasets are available on Hugging Face (application required).


So how is M2UGen implemented?

The model is divided into four modules

The M2UGen model consists of four modules: a multi-modal feature encoder, multi-modal understanding adapters, a bridging LLM, and a music understanding and generation module.

The following figure shows the overall framework of the M2UGen model:

[Figure: overall framework of the M2UGen model]

Multi-modal feature encoder

To support multi-modal music understanding and generation, the model needs to process multi-modal input.

To that end, the research team adopted several existing modality encoders: the music encoder MERT, the image encoder ViT, and the video encoder ViViT.

ViT and ViViT are two Transformer-based encoders widely used in the vision field and frequently adopted in existing LLM-related work, which is why they were chosen as the image and video encoders.

For music input, the team's earlier work MU-LLaMA showed that MERT clearly outperforms other audio/music encoders, so MERT was selected as M2UGen's music encoder.
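For illustration, here is a minimal sketch of how the three encoders might be loaded and applied with the Transformers library. The checkpoint names, preprocessing, and feature shapes are assumptions for the sake of the example and may differ from the authors' actual setup.

```python
# Sketch: extracting per-modality features with off-the-shelf encoders.
# The checkpoint names below are public Hugging Face models assumed for
# illustration; the paper's exact preprocessing may differ.
import torch
from transformers import AutoModel, ViTModel, VivitModel

music_encoder = AutoModel.from_pretrained("m-a-p/MERT-v1-330M", trust_remote_code=True)
image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
video_encoder = VivitModel.from_pretrained("google/vivit-b-16x2-kinetics400")

@torch.no_grad()
def encode(music_wave, image_pixels, video_pixels):
    # music_wave:   (B, samples)              raw audio resampled for MERT
    # image_pixels: (B, 3, 224, 224)          preprocessed image tensor
    # video_pixels: (B, frames, 3, 224, 224)  preprocessed video clip
    music_feat = music_encoder(music_wave).last_hidden_state
    image_feat = image_encoder(pixel_values=image_pixels).last_hidden_state
    video_feat = video_encoder(pixel_values=video_pixels).last_hidden_state
    return music_feat, image_feat, video_feat
```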

Multi-modal understanding adapter

The main function of this module is to fuse the feature vectors produced by the encoders and feed them into the downstream LLM, so that they steer the LLM's output together with the text input.

As shown in the figure below, the module mainly consists of a 1D convolution layer, a linear projection layer, and a dense network module.

[Figure: structure of the multi-modal understanding adapter]

The final dense network module is as shown below:

[Figure: the dense network module]

The dense network is made up of three sub-blocks, each containing a normalization layer, linear layers, and a SiLU activation function.

This process can be expressed by the following formula:

x_i = x_{i-1} + L_{2,i}\big(\mathrm{SiLU}(L_{1,i}(N_i(x_{i-1}))) \cdot L_{3,i}(N_i(x_{i-1}))\big)

Here x_i denotes the output embedding after the i-th sub-block, L_{j,i} denotes the j-th linear layer of the i-th sub-block, N_i denotes the normalization layer within the i-th sub-block, and SiLU is the activation function.

This dense network design is carried over from the team's earlier work MU-LLaMA. After the dense network, the adapter outputs a 4096-dimensional embedding vector that is passed to the downstream LLM.
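As a concrete illustration, below is a minimal PyTorch sketch of an adapter built from the components named above (1D convolution, linear projection, and dense blocks following the gated SiLU formula). Layer sizes, kernel settings, and the number of blocks are assumptions, not the released configuration.

```python
# Sketch of the understanding adapter: 1D convolution -> linear projection ->
# stack of dense (gated, SiLU-activated) blocks, following the formula above.
# All sizes are illustrative.
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)       # N_i
        self.l1 = nn.Linear(dim, hidden)    # L_{1,i}
        self.l2 = nn.Linear(hidden, dim)    # L_{2,i}
        self.l3 = nn.Linear(dim, hidden)    # L_{3,i}
        self.act = nn.SiLU()

    def forward(self, x):
        h = self.norm(x)
        # residual, gated update: x + L2( SiLU(L1(h)) * L3(h) )
        return x + self.l2(self.act(self.l1(h)) * self.l3(h))

class UnderstandingAdapter(nn.Module):
    def __init__(self, in_dim: int = 1024, out_dim: int = 4096, n_blocks: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, in_dim, kernel_size=3, stride=2, padding=1)
        self.proj = nn.Linear(in_dim, out_dim)
        self.blocks = nn.ModuleList(DenseBlock(out_dim, out_dim * 2) for _ in range(n_blocks))

    def forward(self, feats):               # feats: (B, T, in_dim) encoder output
        x = self.conv(feats.transpose(1, 2)).transpose(1, 2)
        x = self.proj(x)
        for blk in self.blocks:
            x = blk(x)
        return x                            # (B, T', 4096), fed to the LLM
```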

Bridging LLM

To introduce multi-modal context into the LLM, the researchers connect the output of the corresponding upstream understanding adapter to designated layers of the LLM.

Meta's LLaMA 2 model is used as the base LLM, as shown in the figure below.

[Figure: bridging LLaMA 2 with multi-modal information]

The version used here is LLaMA 2 7B, which has N = 32 hidden layers.

Counting from the top of the model, modality information is injected every L layers (L = 6): music, image, and video information are introduced from top to bottom through zero-initialized attention modules, while the bottom (N − 3L − 1) layers keep the original attention modules.

Text instructions are fed into the LLM from the bottom, i.e., the first layer. With this technique, the other modalities are given the ability to steer the LLM's output.
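The zero-initialized injection idea can be sketched as follows. This is a simplified stand-in in the spirit of LLaMA-Adapter-style gating, not the exact attention module used in the paper, and the dimensions and head count are illustrative.

```python
# Simplified stand-in for zero-initialised attention injection into an upper
# LLM layer; the paper's exact formulation may differ.
import torch
import torch.nn as nn

class ZeroInitInjection(nn.Module):
    def __init__(self, dim: int = 4096, num_heads: int = 32):
        super().__init__()
        # the gate starts at zero, so early in training the layer behaves
        # exactly like the original frozen LLaMA 2 layer
        self.gate = nn.Parameter(torch.zeros(1))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, hidden, modality_tokens):
        # hidden:          (B, T, dim)  activations of the frozen LLM layer
        # modality_tokens: (B, K, dim)  adapter output for music/image/video
        injected, _ = self.attn(hidden, modality_tokens, modality_tokens)
        return hidden + torch.tanh(self.gate) * injected

# With N = 32 layers and L = 6, the top 3L = 18 layers receive modality
# information (six layers each for music, image and video, from top to
# bottom), while the lower layers keep the original attention.
```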

Music understanding and generation module

Inspired by NExT-GPT, the model introduces special audio tokens [AUD] to distinguish music question answering from music generation tasks.

During training, for sample pairs whose output is music (i.e., the music generation task, such as text-instruction-music pairs), these audio tokens are appended to the end of the LLM output to signal the downstream music output.

At inference time, if the user's instruction is about music generation, such as "Generate a piece of music using the flute", the LLM's output will contain the audio tokens, and the downstream music decoder will pick up the instruction and generate flute music;

if the LLM's output contains no audio tokens, the user is asking a music understanding question, and the LLM answers it directly.

The researchers tried two music decoders, AudioLDM 2 and MusicGen; MusicGen delivered better music generation than AudioLDM 2.
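A rough sketch of this dispatch logic at inference time is given below; the function and method names here are hypothetical placeholders, not the repository's actual API.

```python
# Sketch of how the special audio tokens could steer inference: if the LLM's
# answer contains [AUD] tokens, the conditional embedding is handed to the
# music decoder (e.g. MusicGen); otherwise the text answer is returned as-is.
# Function and method names are hypothetical.
def respond(llm, music_decoder, prompt, multimodal_context):
    text, cond_embedding = llm.generate(prompt, multimodal_context)
    if "[AUD]" in text:
        # generation / editing request: condition the music decoder on the
        # embedding produced by the output projection module
        return text, music_decoder.generate(cond_embedding)
    # plain music understanding / question answering: no audio is produced
    return text, None
```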

New datasets, and training in three stages

Training datasets

As listed among the contributions, this work constructed four datasets: MuCaps, MuEdit, MuImage, and MuVideo. Example samples are shown in the figure below.

[Figure: example samples from the four datasets]

MuCaps dataset:

Contains about 1,200 hours of public music from AudioSet and several websites;

The MU-LLaMA model is used to caption the collected music files, forming music-text pairs.

MuEdit dataset:

A music pool is built from AudioSet (disjoint from MuCaps), and about 60 hours of similar music-music pairs are filtered out;

The filtering criteria include tempo, beats, and so on, yielding pairs that are broadly similar but differ in certain respects, such as the instruments used (a rough sketch of this kind of filtering appears after the script link below);

Each music-music pair is treated as a source-target pair: the caption of the source music is fed into the MPT-7B model to produce the human side of the dialogue, and the caption of the target music is fed into MPT-7B to produce the model side, so that both the source and the target music are matched with corresponding instructions for training.

MuImage/MuVideo datasets:

Additional image/video-music pairs are sampled from AudioSet (with music distinct from MuCaps/MuEdit to minimize duplication), and the BLIP/VideoMAE models are used to caption the images/videos;

The image/video captions plus the music captions are fed into the MPT-7B model to obtain the human-side and model-side dialogue respectively.

The dataset construction scripts are available at:

https://github.com/shansongliu/m2ugen/tree/main/datasets
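As a rough illustration of the MuEdit-style tempo/beat filtering mentioned above, the sketch below uses librosa; the thresholds and similarity criterion are assumptions, and the released scripts linked above define the actual procedure.

```python
# Rough sketch of tempo/beat-based pair filtering for a MuEdit-style dataset.
# Thresholds are illustrative assumptions.
import librosa

def roughly_matching(path_a: str, path_b: str, tempo_tol: float = 5.0) -> bool:
    y_a, sr_a = librosa.load(path_a, sr=None)
    y_b, sr_b = librosa.load(path_b, sr=None)
    tempo_a, beats_a = librosa.beat.beat_track(y=y_a, sr=sr_a)
    tempo_b, beats_b = librosa.beat.beat_track(y=y_b, sr=sr_b)
    # keep pairs whose tempo and beat count roughly agree, so the pair differs
    # mainly in instrumentation or timbre rather than rhythm
    return abs(tempo_a - tempo_b) <= tempo_tol and abs(len(beats_a) - len(beats_b)) <= 4
```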

Training of the M2UGen model follows the NExT-GPT recipe and is divided into three stages: encoder-side training, decoder-side training, and joint encoder-decoder training.

Stage 1: Encoder-side training


In this stage the multi-modal encoders and the LLM are frozen, and only the multi-modal understanding adapters are trained;

Music/image/video-text pairs from MuCaps/COCO/MuVideo are used for Stage 1 training;

The training loss is cross-entropy, comparing the LLM's output with the target caption text.
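A minimal sketch of the Stage-1 freezing scheme follows; the argument names and learning rate are illustrative placeholders.

```python
# Sketch of Stage-1 parameter freezing: only the multi-modal understanding
# adapters receive gradients; the encoders and the LLM stay fixed.
import torch

def configure_stage1(encoders, adapters, llm):
    for module in (*encoders, llm):
        module.requires_grad_(False)           # frozen in Stage 1
    trainable = []
    for adapter in adapters:
        adapter.requires_grad_(True)           # the only trainable part
        trainable += list(adapter.parameters())
    return torch.optim.AdamW(trainable, lr=1e-4)   # lr is illustrative
```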

Stage 2: Decoder-side training


This stage leaves the encoder side (the modality encoders and adapters) out, freezes the LLM, and trains the output projection module;

It is designed to teach the LLM either to produce signals that instruct the downstream music decoder to output music, or to directly answer questions about or caption the input music according to the instruction;

The conditional embedding produced by M2UGen's output projection module has to be aligned with the output of the music decoder's text encoder (AudioLDM 2/MusicGen), i.e., output-side alignment;

During training, this stage indicates whether music should be generated via the special audio token [AUD]: if [AUD] appears in the LLM's output, text and music are generated together (music generation); if not, only text is generated (music question answering);

The loss combines cross-entropy and mean squared error: the cross-entropy compares the audio tokens produced by the LLM with the ground-truth audio tokens, while the mean squared error compares the conditional embedding from M2UGen's output projection module with the text embedding from the music decoder's text encoder.
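The Stage-2 objective can be sketched as a combination of the two terms described above; tensor names and shapes here are illustrative.

```python
# Sketch of the Stage-2 objective: cross-entropy over the LLM's (audio) token
# predictions plus mean squared error aligning the output projection with the
# music decoder's text-encoder embedding.
import torch.nn.functional as F

def stage2_loss(llm_logits, target_token_ids, cond_embedding, decoder_text_embedding):
    # llm_logits:             (B, T, vocab)  LLM predictions including [AUD] tokens
    # target_token_ids:       (B, T)         ground-truth token ids
    # cond_embedding:         (B, D)         from M2UGen's output projection module
    # decoder_text_embedding: (B, D)         from the AudioLDM 2 / MusicGen text encoder
    ce = F.cross_entropy(llm_logits.flatten(0, 1), target_token_ids.flatten())
    mse = F.mse_loss(cond_embedding, decoder_text_embedding)
    return ce + mse
```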

Stage 3: Joint encoder-decoder training


In this stage the multi-modal encoders and the LLM are frozen, while the multi-modal understanding adapters, the output projection module, and the LoRA parameters inside the LLM are trained;

The training data for this stage includes Alpaca (general knowledge), MusicQA, MuImage, MuVideo, and MuEdit;

To let the model generate music and text at the same time, the MuImage, MuVideo, and MuEdit datasets add the special audio tokens to the LLM output during Stage 3 training (as in Stage 2).
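A hedged sketch of the Stage-3 trainable parameter set follows, using the PEFT library for the LoRA part; the LoRA hyper-parameters and target modules are assumptions, and the argument names are placeholders.

```python
# Sketch: Stage-3 trainable set = LoRA weights inside the (otherwise frozen)
# LLM + the understanding adapters + the output projection module.
# Hyper-parameters and module names are illustrative assumptions.
import torch
from peft import LoraConfig, get_peft_model

def configure_stage3(llm, adapters, output_projection):
    lora_config = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],   # LLaMA 2 attention projections
    )
    llm = get_peft_model(llm, lora_config)     # PEFT freezes the base weights;
                                               # only the LoRA matrices stay trainable
    params = (
        [p for p in llm.parameters() if p.requires_grad]   # LoRA parameters
        + [p for a in adapters for p in a.parameters()]    # understanding adapters
        + list(output_projection.parameters())             # output projection module
    )
    return llm, torch.optim.AdamW(params, lr=1e-4)         # lr is illustrative
```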

In the future, the research team will focus on further improving the model's fine-grained music understanding capabilities, improving the correlation between generated music and input instructions, and making the music editing capabilities more accurate.

Paper: https://arxiv.org/pdf/2311.11255.pdf
