Kanye Western AI cover photo

Hear me out 👂

While text, image, and video generation have been all the rage lately, audio initiatives had largely flown under the radar, at least until an AI-generated Drake rap made waves recently. So, let’s have a look at what’s going on, because there is a lot happening!

Starting with the AI raps: these are primarily voice synthesis overlaid on traditional music production. It’s worth noting that music-industry incumbents hold far more sway in defending their copyrights than their counterparts in still images do, and hordes of big-label lawyers are issuing takedown notices for AI rap by the minute. As for the artists, more have been chiming in recently, highlighting the creative potential of generative AI: Peter Gabriel lent his music to a Stability AI competition for AI-generated video, and Grimes launched elf.tech, a platform that enables voice generation from her voice print and shares royalties on any derivative works. It will be interesting to see how all this creative destruction plays out.

Personally, I really like how AI models expand the possibility space for music and audio. What would Kanye country music sound like? The Kanye Western AI account on YouTube delivers some wonderful genre-bending renditions in the voice of the renowned rapper. I thoroughly enjoy this Billy Currington cover, for example:

Or how about necrorap? Reviving Notorious B.I.G. is already somewhat of a trend, with Timbaland getting to “collaborate” with Biggie posthumously or Biggie and Tupac “covering” Jay-Z and Kanye’s Niggas in Paris here:

https://www.youtube.com/watch?v=URqnmyiVeLk

Now let’s (finally) have a look at the technology, starting with the most recent breakthrough, Bark, released only a couple of weeks ago. Unlike earlier popular voice generation models, it uses the now ubiquitous transformer architecture. The model is designed to generate highly realistic, multilingual speech as well as other audio content. Remarkably, it can fully clone voices, including aspects like tone, pitch, emotion, and prosody.

Among the earlier models used for the AI raps, Coqui TTS deserves a notable mention, as it implements no fewer than 15 audio synthesis papers. PaddleSpeech is also quite popular and uses several state-of-the-art models. 🎤

If we look beyond voice models, there is a lot of activity in audio and music generation more generally. In December last year, the Riffusion project demonstrated that a Stable Diffusion model fine-tuned on spectrogram images could generate new spectrograms, which could then be turned into audio with an inverse Fourier transform. This is surprisingly effective given that a spectrogram captures only the frequencies and amplitudes of the sound and lacks phase (the alignment of the audio waves), but brr go the GPUs and out comes audio. Building on this, several models have been released. ♬
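
To make the missing-phase point concrete: a standard way to recover a waveform from a magnitude-only spectrogram is the Griffin-Lim algorithm, which guesses a phase and iteratively refines it. Below is a minimal numpy sketch with toy parameters of my own choosing; it is an illustration of the general technique, not Riffusion’s actual pipeline.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """Short-time Fourier transform: windowed frames -> complex spectra."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

def istft(S, n_fft=256, hop=64):
    """Inverse STFT via overlap-add with window-power normalization."""
    win = np.hanning(n_fft)
    out = np.zeros((len(S) - 1) * hop + n_fft)
    norm = np.zeros_like(out)
    for i, spec in enumerate(S):
        frame = np.fft.irfft(spec, n=n_fft)
        out[i * hop : i * hop + n_fft] += frame * win
        norm[i * hop : i * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=32):
    """Reconstruct audio from magnitudes only: start with random phase,
    then alternate istft/stft while clamping magnitudes to the target."""
    phase = np.exp(2j * np.pi * np.random.rand(*mag.shape))
    for _ in range(n_iter):
        x = istft(mag * phase)
        phase = np.exp(1j * np.angle(stft(x)))
    return istft(mag * phase)

# Demo: a 440 Hz sine at 8 kHz; throw away the phase, then reconstruct.
np.random.seed(0)
sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)
mag = np.abs(stft(x))   # this is all a spectrogram image gives you
y = griffin_lim(mag)    # audio recovered without ever seeing the phase
```

The reconstruction is not sample-identical to the original (the recovered phase is only consistent, not correct), but its spectrogram closely matches the target, which is why the trick works well enough for Riffusion-style generation.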

The two state-of-the-art models in this space currently are Tango and AudioLDM. While these models can produce an amazing range of audio clips from text, they still suffer from lower sampling rates than we are used to. However, much like the resolution of images generated with diffusion models, the sampling rate is expected to increase with more powerful hardware and more efficient computation to catch up to the standard CD quality of 44.1 kHz. 🎶
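
To see why the sampling rate matters so much, recall the Nyquist limit: a signal sampled at rate sr can only represent frequencies up to sr / 2. Assuming a typical model output rate of around 16 kHz (my illustrative figure), the contrast with CD audio is stark:

```python
# Nyquist limit: representable frequencies top out at half the sampling rate.
for sr in (16_000, 44_100):
    print(f"{sr} Hz sampling rate -> frequencies up to {sr / 2:.0f} Hz")
```

Everything above the limit, including much of the "air" and brilliance of cymbals and sibilants, is simply absent at the lower rate.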

There’s still some way to go before we can generate full, coherent songs, but the ability to generate arbitrary audio is spurring innovation beyond that use case. For example, Multi-instrument Music Synthesis with Spectrogram Diffusion applies the concept of a synthesizer to the technology, enabling audio generation guided by MIDI clips.
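
The key idea there is that MIDI describes notes, not sound: the standard mapping from MIDI note number n to frequency is 440 · 2^((n − 69) / 12) Hz, and a model (or, in this toy sketch, a naive sine oscillator) turns those notes into audio. The renderer below is my own trivial stand-in for illustration, nothing like the paper’s diffusion model.

```python
import numpy as np

def midi_to_hz(note):
    """Standard equal-temperament mapping; MIDI note 69 = A4 = 440 Hz."""
    return 440.0 * 2 ** ((note - 69) / 12)

def render(notes, sr=8000):
    """Render a list of (midi_note, duration_seconds) as plain sine tones."""
    out = []
    for note, dur in notes:
        t = np.arange(int(sr * dur)) / sr
        out.append(np.sin(2 * np.pi * midi_to_hz(note) * t))
    return np.concatenate(out)

clip = [(60, 0.25), (64, 0.25), (67, 0.5)]  # C major arpeggio: C4, E4, G4
audio = render(clip)
```

A MIDI-conditioned model replaces the sine oscillator with something that can sound like any instrument, which is exactly what makes the synthesizer framing apt.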

All of the above models are slow and suffer from low sampling rates, but RAVE is trying to address this with its fast, high-quality variational autoencoder. This approach allows artists to explore the latent space of whatever the model is trained on and integrate the resulting sounds into their audio processing workflows. Here’s a pretty amazing motion-to-sound implementation of RAVE.

Lastly, and getting back to AI rap: few tools have been released that let musicians use AI within their usual audio-workstation workflows, but Synthesizer V Studio is making strides here: