

Yeah, the show has definitely been going downhill the past few seasons. I always catch an episode here and there, and have no desire to watch the entire season. I went on a SP binge while playing WoW a year or two ago, and it was interesting seeing the show catch its stride and figure out exactly what it was doing. The second season is similar, but it starts figuring out it can do a lot more than fart jokes and cursing. I would say the show hit its peak somewhere before the WoW episode, but really lost its touch when it seemingly became conscious of its 7-day turnaround time.

At that point, it became a forced satire machine. The jokes weren't as funny, and everything just screamed "look at me, I'm topical!"

And yeah, Cartman's voice has become extremely lazy. Granted, the first couple of seasons it grated on my ears, but now it just seems as if he doesn't even do any voice acting and relies entirely on the computer.

Its creators speculate that VALL-E could be used for high-quality text-to-speech applications, speech editing where a recording of a person could be edited and changed from a text transcript (making them say something they originally didn't), and audio content creation when combined with other generative AI models like GPT-3.

Microsoft calls VALL-E a "neural codec language model," and it builds off of a technology called EnCodec, which Meta announced in October 2022. Unlike other text-to-speech methods that typically synthesize speech by manipulating waveforms, VALL-E generates discrete audio codec codes from text and acoustic prompts. It basically analyzes how a person sounds, breaks that information into discrete components (called "tokens") thanks to EnCodec, and uses training data to match what it "knows" about how that voice would sound if it spoke other phrases outside of the three-second sample. Or, as Microsoft puts it in the VALL-E paper:

To synthesize personalized speech (e.g., zero-shot TTS), VALL-E generates the corresponding acoustic tokens conditioned on the acoustic tokens of the 3-second enrolled recording and the phoneme prompt, which constrain the speaker and content information respectively. Finally, the generated acoustic tokens are used to synthesize the final waveform with the corresponding neural codec decoder.

Microsoft trained VALL-E's speech-synthesis capabilities on an audio library, assembled by Meta, called LibriLight. It contains 60,000 hours of English language speech from more than 7,000 speakers, mostly pulled from LibriVox public domain audiobooks. For VALL-E to generate a good result, the voice in the three-second sample must closely match a voice in the training data.
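To make the VALL-E data flow from the paper excerpt a little more concrete, here is a minimal Python sketch of its three stages: turn a short voice sample into discrete tokens, produce new acoustic tokens conditioned on those tokens and a phoneme sequence, then decode the tokens back into a waveform. Everything in it is an assumption made for illustration. The `encode` and `decode` functions are a toy sample-by-sample quantizer standing in for a learned neural codec such as EnCodec, `LEVELS` is an arbitrary codebook size, and `generate_acoustic_tokens` is a placeholder that simply recycles the prompt tokens where the real system would run a trained language model.

```python
import math

LEVELS = 256  # toy codebook size (assumption, not EnCodec's real configuration)


def encode(waveform):
    """Toy codec encoder: map each sample in [-1.0, 1.0] to an integer token."""
    tokens = []
    for sample in waveform:
        clipped = max(-1.0, min(1.0, sample))
        tokens.append(round((clipped + 1.0) / 2.0 * (LEVELS - 1)))
    return tokens


def decode(tokens):
    """Toy codec decoder: map integer tokens back to samples in [-1.0, 1.0]."""
    return [token / (LEVELS - 1) * 2.0 - 1.0 for token in tokens]


def generate_acoustic_tokens(phonemes, prompt_tokens, tokens_per_phoneme=4):
    """Placeholder for the language model (hypothetical, not VALL-E itself).

    The phoneme sequence decides how many tokens are produced (content),
    and the prompt tokens decide which values they take (speaker).
    A real model would predict each token instead of recycling the prompt.
    """
    length = len(phonemes) * tokens_per_phoneme
    return [prompt_tokens[i % len(prompt_tokens)] for i in range(length)]


# A synthetic sine wave stands in for the three-second enrolled recording.
prompt_waveform = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
prompt_tokens = encode(prompt_waveform)

# Hypothetical phoneme prompt for the text to be spoken.
phonemes = ["HH", "AH", "L", "OW"]

new_tokens = generate_acoustic_tokens(phonemes, prompt_tokens)
new_waveform = decode(new_tokens)

print(len(prompt_tokens), len(new_tokens), len(new_waveform))
```

The point of the sketch is only the shape of the pipeline: the model never touches raw audio, it reads and writes integer tokens, and the codec's decoder handles the conversion back to sound.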
