With the rise of large language models for text, the field of speech synthesis is adapting rapidly. Thanks to its strong performance, large-model-based speech synthesis has become an industry trend.
Although traditional speech synthesis already achieves lifelike sound quality and rhythm, it still falls short on the finer details of emotion and intonation in complex scenarios such as audiobooks and natural conversation. The rise of large language models (LLMs) offers new possibilities for bridging these gaps and moving speech synthesis toward a more realistic, natural interactive experience.
Since Mobvoi released its first-generation TTS engine, successive iterations have brought the synthesized speech ever closer to being indistinguishable from a real human voice.
Mobvoi’s speech synthesis technology continues to iterate
Since launching its first-generation TTS engine in 2015, Mobvoi has significantly improved the realism of its synthesized speech through continuous iteration. In August 2019, we released the fourth-generation engine, meetvoice, integrated it into our product line and to-business (ToB) services, and deployed thousands of voices at scale in "Magic Sound Workshop", which was widely praised. Facing the rapid growth of the short-video market and user demand for highly realistic voices, we have continued to optimize the meetvoice engine, adding features such as pause adjustment, high-definition sound quality, and intonation control.
Now, Mobvoi's self-developed large model "Sequence Monkey" has achieved significant breakthroughs. Its language-centered capability system covers six dimensions: knowledge, dialogue, mathematics, logic, reasoning, and planning. In particular, the model has strong cross-modal knowledge transfer capabilities and can effectively carry the common-sense knowledge captured by the language model over into models for other, non-language modalities. Building on this technology, the development team used cutting-edge text large-model techniques to build an advanced speech synthesis system, meetvoice pro, Mobvoi's sixth-generation TTS engine. Grounded in Sequence Monkey's text modeling capabilities and trained on massive amounts of speech data, it produces highly natural, expressive synthetic voices, bringing AI dubbing close to the level of a real human voice.
"Sequence Monkey" empowers the speech synthesis engine
To understand the key techniques behind our new-generation speech synthesis engine, let us walk through its core architecture step by step.
01 Voice tokenization
The first key problem to solve is converting the speech signal into a form the machine can process. Unlike text, which is inherently discrete, speech arrives as a continuous waveform, and this poses the initial challenge for the synthesis engine. To address it, we adopted the encoder-decoder architecture widely recognized in the industry to discretize the continuous speech signal: the audio is decomposed into a sequence of discrete units, so-called "voice tokens". This step not only lays a solid foundation for subsequent speech generation but also helps ensure the naturalness and fluency of the synthesized speech. A simplified sketch of this idea follows the codec architecture diagram below.
Speech codec architecture diagram
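To make the idea concrete, here is a minimal, hypothetical sketch of an encoder-codebook-decoder speech tokenizer in PyTorch. The layer sizes, strides, and the 1024-entry codebook are illustrative assumptions, not meetvoice pro's actual configuration.

```python
# Hypothetical sketch: discretizing a continuous waveform into "voice tokens"
# with an encoder-codebook-decoder codec. All sizes are illustrative.
import torch
import torch.nn as nn

class SpeechTokenizer(nn.Module):
    def __init__(self, codebook_size=1024, dim=128):
        super().__init__()
        # Encoder: downsample the waveform (32x here) into latent frames.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=7, stride=2, padding=3), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=7, stride=4, padding=3), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=7, stride=4, padding=3),
        )
        # Codebook: each latent frame is snapped to its nearest entry,
        # which turns continuous features into discrete token ids.
        self.codebook = nn.Embedding(codebook_size, dim)
        # Decoder: reconstruct the waveform from the quantized frames.
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(dim, dim, kernel_size=8, stride=4, padding=2), nn.GELU(),
            nn.ConvTranspose1d(dim, dim, kernel_size=8, stride=4, padding=2), nn.GELU(),
            nn.ConvTranspose1d(dim, 1, kernel_size=8, stride=2, padding=3),
        )

    def encode(self, wav):                           # wav: (batch, 1, samples)
        z = self.encoder(wav).transpose(1, 2)        # (batch, frames, dim)
        codes = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        return torch.cdist(z, codes).argmin(dim=-1)  # (batch, frames) token ids

    def decode(self, tokens):                        # tokens: (batch, frames)
        z = self.codebook(tokens).transpose(1, 2)    # (batch, dim, frames)
        return self.decoder(z)                       # reconstructed waveform

tokenizer = SpeechTokenizer()
wav = torch.randn(1, 1, 16000)       # stand-in for one second of 16 kHz audio
voice_tokens = tokenizer.encode(wav)
reconstruction = tokenizer.decode(voice_tokens)
```

In a trained system the encoder, codebook, and decoder would be learned jointly (typically with reconstruction and quantization losses) so that the token sequence keeps enough detail for natural-sounding resynthesis.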
02 Modeling text and voice tokens
In modeling text and voice tokens, our self-developed large sequence model "Sequence Monkey" plays the central role. Leveraging its strong underlying text foundation, the model achieves a deep understanding and accurate handling of polyphonic characters, prosody, and contextual relationships, and then maps (or transfers) these textual attributes into the speech domain. In this way, "Sequence Monkey" not only improves the quality of the generated voice tokens but also strengthens the model's ability to handle complex speech phenomena. A simplified sketch of this stage follows the framework diagram below.
Speech synthesis framework based on the large model "Sequence Monkey"
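The mapping from text to voice tokens can be framed as ordinary next-token prediction. The sketch below shows a generic decoder-only setup in which text tokens and voice tokens share one vocabulary and the model autoregressively continues a text prefix with voice tokens; it illustrates the general approach, not the actual "Sequence Monkey" architecture, whose details are not public. Positional encodings and training code are omitted for brevity.

```python
# Hypothetical sketch: an autoregressive model that reads text tokens and then
# predicts voice tokens one at a time. Vocabulary sizes, depth, and width are
# illustrative; positional encodings are omitted for brevity.
import torch
import torch.nn as nn

TEXT_VOCAB, VOICE_VOCAB, DIM = 32000, 1024, 512

class TextToVoiceTokenLM(nn.Module):
    def __init__(self):
        super().__init__()
        # One shared embedding table: text ids occupy [0, TEXT_VOCAB) and
        # voice ids occupy [TEXT_VOCAB, TEXT_VOCAB + VOICE_VOCAB), so textual
        # knowledge (prosody, polyphones, context) can inform speech tokens.
        self.embed = nn.Embedding(TEXT_VOCAB + VOICE_VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(DIM, VOICE_VOCAB)      # next-voice-token logits

    def forward(self, token_ids):                    # (batch, seq_len)
        x = self.embed(token_ids)
        # Causal mask: each position may only attend to earlier positions.
        sz = token_ids.size(1)
        mask = torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)
        return self.head(self.backbone(x, mask=mask))

    @torch.no_grad()
    def generate(self, prefix_ids, max_voice_tokens=200):
        seq = prefix_ids.clone()
        for _ in range(max_voice_tokens):
            logits = self.forward(seq)[:, -1]                 # last position
            next_id = logits.argmax(dim=-1) + TEXT_VOCAB      # offset into voice ids
            seq = torch.cat([seq, next_id[:, None]], dim=1)
        # Return only the newly generated tokens, mapped back to codec ids.
        return seq[:, prefix_ids.size(1):] - TEXT_VOCAB

model = TextToVoiceTokenLM()
text_ids = torch.randint(0, TEXT_VOCAB, (1, 12))    # a tokenized input sentence
voice_tokens = model.generate(text_ids)             # feed these to the codec decoder
```

In practice the generated voice tokens would be passed to the codec decoder from the previous sketch to produce the final waveform; greedy argmax is used here only to keep the example short.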
Three advantages that deliver a real human voice experience
Supported by the new framework, this speech synthesis technology offers three outstanding advantages and takes a major step forward in realism.
01 Automatically adjusts emotion and rhythm
The new technology can lower the pitch and soften the voice when telling a sad story, or pick up the pace and brighten the tone when sharing exciting news. Such intelligent adjustment makes the synthesized voice more natural and expressive, as if you were in a real human conversation.
02 Voice cloning in just a few seconds
Voice cloning has become extremely efficient: the system can learn from an audio sample only a few seconds long and generate highly realistic speech, making the traditionally time-consuming recording and training process a thing of the past. For example, from short original recordings of Elon Musk and Steve Jobs, we were able to clone strikingly similar voices in just a few seconds. A sketch of one possible prompt-based approach follows below.
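The article does not disclose how meetvoice pro performs cloning internally. One common approach in token-based systems, sketched below under that assumption, is to tokenize the short reference clip and use it as a prompt so that generation continues in the same timbre. The sketch reuses the hypothetical SpeechTokenizer and TextToVoiceTokenLM classes from the earlier sketches.

```python
# Hedged sketch: prompt-based voice cloning from a few seconds of reference
# audio, assuming the hypothetical SpeechTokenizer and TextToVoiceTokenLM
# classes defined in the earlier sketches. This is one common approach in
# token-based TTS, not a description of meetvoice pro's internals.
import torch

TEXT_VOCAB = 32000  # must match the language model's text vocabulary size

def clone_voice(tokenizer, lm, reference_wav, text_ids, max_voice_tokens=400):
    # 1. Discretize the short reference recording into voice tokens.
    ref_tokens = tokenizer.encode(reference_wav)               # (1, frames)
    # 2. Build a prompt: the target text followed by the reference voice
    #    tokens, shifted into the model's shared vocabulary.
    prompt = torch.cat([text_ids, ref_tokens + TEXT_VOCAB], dim=1)
    # 3. Continue the sequence; the model imitates the prompt's timbre.
    new_tokens = lm.generate(prompt, max_voice_tokens=max_voice_tokens)
    # 4. Decode the generated tokens back into a waveform.
    return tokenizer.decode(new_tokens)

# Usage (with the objects from the earlier sketches):
#   reference_wav = torch.randn(1, 1, 3 * 16000)   # ~3 s reference clip
#   text_ids = torch.randint(0, TEXT_VOCAB, (1, 12))
#   cloned_wav = clone_voice(tokenizer, model, reference_wav, text_ids)
```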
03 Cross-language timbre migration
The technology also has strong cross-language capabilities: audio in different languages can be seamlessly converted into the same timbre, whether in Chinese or English, so that speakers of less widely spoken languages can communicate fluently in Chinese or English. For example, a girl whose native language is Thai can use her own timbre to introduce herself fluently in English and recite classical Chinese poems.
Ultimate Speaker voices suited to multiple scenarios
From the many voices available online, we have hand-picked a group of distinctive, high-quality speakers to recommend to content creators.
01 Audiobooks
02 Movie and TV commentary
03 Other features
Free for a limited time, with a gift for trying it out
From January 31 to February 28, "Magic Sound Workshop" is running a special event: the Ultimate Speaker series is free for all SVIP members, and non-members can redeem the CDK code aigc2024 for a free one-day SVIP membership to try it out. You are welcome to open the mini program below to use the corresponding speakers.
If you run into any questions or have comments during the trial, you can send feedback directly through the official account's message backend. We will randomly give participating users a one-day SVIP membership.
To date, Mobvoi's AIGC products have served a cumulative total of more than 12 million users, with more than 8 million registered users, over 600,000 of whom are paying users. According to the CIC consulting industry report, Mobvoi is the earliest and largest artificial intelligence company in Asia focused on generative AI.