synthesize me some tunes!
April 10, 2021
Symbolic music generation is all about generating abstractions, like scores or MIDI. The word “abstraction” is fairly abstract (haha!), but an analog is the relationship between text and speech. As humans, we learn to turn the visual symbols of characters (for example,
dog) into sounds in the real world, and we learn how to take the sounds we hear in the real world and turn them into characters.1 This forms the basis of written and verbal communication. We can then call text an abstraction on top of speech, meaning that there is low-level information that is missing (like intonation, timing, even pronunciation sometimes), but those gaps are filled in when the text is interpreted.
In the same way, sheet music is an abstraction on top of real music. It stores incomplete information that can be interpreted by a musician. This is part of the reason that performing music is difficult — not only do we need to build physical ability, we also need to understand what the composer is telling us to do (and where not to listen to them).2 Symbolic music generation focuses on generating the text of music, letting the interpretation happen elsewhere.
This approach is one of the easier ways to generate music. MIDI is widely used, easy to work with, and has produced good results so far (see my article to learn more about common techniques). It works very well for piano music, since MIDI captures pretty much all the degrees of freedom a pianist has: pitch, volume and timing. This means a model has pretty much all the creative latitude that Beethoven did.
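Concretely, a single piano note in MIDI boils down to a handful of numbers. The sketch below uses a plain dict for readability; real MIDI files pack the same information into status/data bytes with timing in ticks.

```python
# A single MIDI note event, reduced to its essentials. Real MIDI files
# encode this as bytes and ticks, but the information content is just
# these few numbers.

note_on = {
    "pitch": 60,       # MIDI note number: 60 is middle C
    "velocity": 90,    # how hard the key is struck (0-127), i.e. volume
    "time": 0.0,       # onset time in seconds
    "duration": 0.5,   # how long the key is held, in seconds
}

# For piano, pitch + velocity + timing is essentially the performer's
# whole control space, which is why MIDI piano models work so well.
print(note_on["pitch"], note_on["velocity"])
```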
A major limitation of this method of generation is the loss of information when it comes to other instruments. MIDI doesn't come close to capturing the degrees of freedom available to a saxophonist: the position of the mouth on the mouthpiece, the speed of the air, the thickness of the reed and the openness of the throat all affect the sound of the instrument, and none of them can be accurately captured in MIDI (barring heavy modification). Thus any MIDI-based symbolic model is severely limited in its ability to create music for instruments other than piano.
cut the middleman
One potential method around these limitations is to skip the abstraction step — just generate the raw audio (i.e. waveforms, spectrograms etc.). This might allow models to fully capture the degrees of freedom available to musicians, without having to manually design each and every variable of a certain instrument.
You can think of this as a baby learning to speak. They don’t necessarily understand the concept of words and sentences, but they begin learning (from their surroundings) how to make sounds and how to communicate what they need to, slowly building up the ability to make precisely the sounds they need to make to get more complex messages across.
There’s one key issue in this analogy to babies though — babies are using the same hardware as we are. We assume that a baby has pretty similar vocal capabilities as an adult human — they can produce the same sounds we can with the same machinery.3 This isn’t the case with music and raw audio though. A raw audio-based model has an output space that encompasses the better part of all possible sounds. It can generate white noise, speech, music, animal sounds; with some finicking, we could even get the global average temperature for all of recorded history. If a baby was able to create every possible sound at will — how would they ever be able to narrow down into human speech?
As it turns out, some raw audio models have made headway on this information problem. WaveNet used a convolutional structure, allowing it to model structure across longer time ranges (its receptive field grows exponentially with depth). It's very good at modeling local structure in audio (like the timbre of an instrument), which is especially important in a text-to-speech context. Some examples of music generation were also shown, and they sound very much like a piano. However, it isn't able to model the longer-term structures in music (i.e. melodies, motifs, chord progressions etc.).
The convolutional structure might suggest that modeling higher-order structure is easy: just add a couple more layers! The practical issue that arises, though, is that training excerpts need to be long enough to show structure at the timescale we need to model (i.e. we need excerpts of 30 seconds if we want to model 30-second correlations). That means that while the depth grows at a manageable pace (logarithmically with excerpt length), the computational cost of training grows much faster (at least linearly with excerpt length). Going from 1 to 100 seconds takes orders of magnitude more computational power.4
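To put numbers on that, here's a minimal sketch of the receptive field of a stack of dilated causal convolutions with kernel size 2 and dilation doubling each layer (the basic WaveNet scheme, ignoring its repeated dilation cycles):

```python
# Receptive field of a stack of dilated convolutions where the dilation
# doubles each layer: 1, 2, 4, ..., 2^(L-1), with kernel size 2.

def receptive_field(num_layers, kernel_size=2):
    """Receptive field in samples for a doubling-dilation stack."""
    field = 1
    for layer in range(num_layers):
        dilation = 2 ** layer
        field += (kernel_size - 1) * dilation
    return field

SAMPLE_RATE = 16_000  # 16 kHz audio

for layers in (10, 20):
    samples = receptive_field(layers)
    print(f"{layers} layers -> {samples} samples "
          f"(~{samples / SAMPLE_RATE:.3f} s)")
```

Doubling the depth from 10 to 20 layers takes the receptive field from 1,024 samples (~0.064 s) to 2^20 samples (~65 s): depth really is cheap. But every training excerpt must now be ~1000x longer to actually contain that much context, and that's where the compute goes.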
This is a problem that faces raw audio generation in general. Much of a model’s capacity (and our computational power) goes to low-level details on the order of ten thousandths of a second.
Figuring out how to model both these low-level details (for good audio) and higher-level structure (for good music) is the big gap we have to cross to get to high-quality generative music models.
wait, middleman! come back!
While this chasm is far from crossed, there have been various approaches that we can examine. We'll look at three of them: Hierarchical WaveNets (DeepMind), Jukebox (OpenAI) and Wave2MIDI2Wave (Google Magenta). All of them try in some way to handle the information density of raw audio.
The first two handle the problem by dealing with the high level and the low level in separate models. Taking inspiration from the success of symbolic approaches at modeling higher-level structure, we can try to create (more specifically, learn) our own high-level abstraction on top of music. We train one model on the abstraction, and another, lower-level model on raw audio conditioned on the high-level representation. Essentially, the model is creating its own text and simultaneously learning how to convert that text into real audio. This disentangles the higher-level problem from the lower-level problem.
This process of learning an abstraction can be repeated multiple times to get to higher orders of abstraction.
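A back-of-the-envelope calculation shows why this helps. The code rate below is made up for illustration (the real systems use different rates and multiple levels), but the arithmetic is the point:

```python
# Sequence lengths seen by the low-level model (raw audio) vs. the
# high-level model (learned discrete codes). CODE_RATE is a made-up
# illustrative number, not any particular system's.

AUDIO_RATE = 16_000   # raw audio samples per second
CODE_RATE = 50        # hypothetical discrete codes per second

seconds = 240  # a 4-minute piece
audio_tokens = AUDIO_RATE * seconds
code_tokens = CODE_RATE * seconds

print(audio_tokens, code_tokens)        # 3840000 vs 12000
print(audio_tokens // code_tokens)      # 320
```

The high-level model sees a sequence 320x shorter than the raw audio, so the same compute budget buys it 320x more musical context.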
Hierarchical WaveNets stacks multiple WaveNets, each learning an abstraction on top of the representation below it, using a VQ-VAE (vector-quantized variational autoencoder, quite the mouthful) to generate the higher-order abstractions. The basic idea is to take the traditional VAE structure (like in MusicVAE) but add a quantization bottleneck that "discretizes" the outputs of the encoder before putting them through the decoder. In essence, it maps whatever the encoder produces to the nearest vector in some discrete collection of vectors, which we'll call a codebook. The image I have in my mind is a chef trying to write down their apple pie recipe as they make it: they have to pick and choose how to round off the precise timings and measurements they've made. It's not that helpful for me, the reader, to know that they used precisely 14139 grains of rice; 1 cup is enough precision :)
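Here's a toy version of that quantization step. The codebook is random noise standing in for a learned one; in a real VQ-VAE it's trained jointly with the encoder and decoder.

```python
import numpy as np

# Toy VQ bottleneck: snap each encoder output to its nearest codebook
# vector. The codebook here is random, purely for illustration.

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))     # 8 code vectors of dimension 4

def quantize(z):
    """Map each row of z to (nearest codebook vector, its index)."""
    # Squared distance from every encoding to every code vector
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = dists.argmin(axis=1)       # discrete indices: the "text"
    return codebook[codes], codes

z = rng.normal(size=(5, 4))            # stand-in encoder outputs
quantized, codes = quantize(z)
print(codes)  # 5 integers in [0, 8): all the decoder ever sees
```

The continuous encodings are rounded off to 5 small integers, which is exactly what makes the higher-level sequence amenable to language-model-style generation.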
OpenAI's Jukebox also uses a VQ-VAE, with various improvements made specifically to prevent codebook collapse, where the model stops using some of its discrete vectors. Both models still struggle with a tradeoff between longer-term structure and audio quality.
Wave2MIDI2Wave takes a different approach, using a dataset of 172 hours of raw piano performance transcribed into MIDI using Onsets and Frames (the MAESTRO dataset). This is the Wave2MIDI part. It has the benefit of note labels that are very well aligned with the raw audio. A generative Transformer (specifically Music Transformer) is then trained on the MIDI to generate new MIDI, and a WaveNet conditioned on MIDI (and trained on MAESTRO) turns the generated MIDI back into raw audio. This is the 2Wave part.
This approach, while generating very high-quality samples, is limited to styles of music where MIDI is an accurate representation. It has the same pitfalls as pure symbolic music generation, insofar as it relies on MIDI to accurately represent music.
The ideal scenario here seems to be learning an optimally compressed abstraction that we can then start bashing away at with our language models. SlowAEs are a significant step in this direction, learning an event-based structure amenable to run-length encoding.
Maybe one day we'll have the computational power to train WaveNets on hour-long correlations, but until then a hybrid of symbolic and raw generative techniques seems like the way forward.
It should be noted that this doesn't deal with the purpose of these sounds and characters: communication. How do we condense any of the myriad types of dogs into the three characters
dog? That's outside the scope of this article. ↩
look, they'll get there eventually. ↩
Just one second of audio corresponds to a sequence with 16,000 timesteps at 16kHz, so these sequences are already very long. ↩