First published: 2018/06/26
Abstract: Realistic music generation is a challenging task. When building generative
models of music that are learnt from data, high-level representations such as
scores or MIDI are typically used, which abstract away the idiosyncrasies of a
particular performance. But these nuances are very important for our perception
of musicality and realism, so in this work we embark on modelling music in the
raw audio domain. It has been shown that autoregressive models excel at
generating raw audio waveforms of speech, but when applied to music, we find
them biased towards capturing local signal structure at the expense of
modelling long-range correlations. This is problematic because music exhibits
structure at many different timescales. In this work, we explore autoregressive
discrete autoencoders (ADAs) as a means to enable autoregressive models to
capture long-range correlations in waveforms. We find that they allow us to
unconditionally generate piano music directly in the raw audio domain, which
shows stylistic consistency across tens of seconds.
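
The abstract's key ingredient, the autoregressive discrete autoencoder, can be pictured as an encoder that compresses the waveform into a much slower sequence of discrete codes, plus an autoregressive decoder that models the raw audio conditioned on those codes. The sketch below is a minimal PyTorch illustration of that idea only: the VQ-style bottleneck, the single-layer "decoder", all sizes, and the names `VectorQuantizer` and `ADA` are assumptions made for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Maps each encoder output vector to the nearest entry of a learned codebook."""
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                   # z: (batch, steps, dim)
        # Squared distances from every encoder vector to every codebook entry.
        dist = (z.pow(2).sum(-1, keepdim=True)
                - 2 * z @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(-1))
        idx = dist.argmin(-1)                                # discrete code indices
        zq = self.codebook(idx)                              # quantized vectors
        zq = z + (zq - z).detach()                           # straight-through gradient
        return zq, idx


class ADA(nn.Module):
    """Toy autoregressive discrete autoencoder: the encoder downsamples raw audio
    into a much slower discrete code sequence, and a causal decoder predicts each
    audio sample from past samples plus the (upsampled) codes."""
    def __init__(self, dim=64, downsample=64, levels=256):
        super().__init__()
        self.encoder = nn.Conv1d(1, dim, kernel_size=downsample, stride=downsample)
        self.vq = VectorQuantizer(dim=dim)
        # A single causal conv stands in for a deep WaveNet-like decoder.
        self.decoder = nn.Conv1d(dim + 1, levels, kernel_size=2, padding=1)

    def forward(self, x):                                    # x: (batch, 1, samples)
        z = self.encoder(x).transpose(1, 2)                  # (batch, steps, dim)
        zq, idx = self.vq(z)
        cond = F.interpolate(zq.transpose(1, 2), size=x.shape[-1])  # codes at sample rate
        past = F.pad(x, (1, 0))[..., :-1]                    # shift input by one sample
        logits = self.decoder(torch.cat([cond, past], dim=1))[..., :x.shape[-1]]
        return logits, idx                                   # logits over e.g. mu-law bins


x = torch.randn(2, 1, 4096)                                  # placeholder raw-audio batch
logits, codes = ADA()(x)
print(logits.shape, codes.shape)                             # (2, 256, 4096), (2, 64)
```

A second autoregressive model over the slow code sequence (not shown here) would operate at a far lower rate than the waveform, which is presumably what makes structure over tens of seconds easier to capture than with a sample-level model alone.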