Implementation details
Spleeter contains pre-trained models for:
• vocals/accompaniment separation.
• 4 stems separation as in SiSec (Stöter, Liutkus, & Ito, 2018) (vocals, bass, drums and
other).
• 5 stems separation with an extra piano stem (vocals, bass, drums, piano, and other). It
is, to the authors’ knowledge, the first released model to perform such a separation.
The pre-trained models are U-nets (Jansson et al., 2017) and follow similar specifications as
in (Prétet, Hennequin, Royo-Letelier, & Vaglio, 2019). The U-net is an encoder/decoder
Convolutional Neural Network (CNN) architecture with skip connections. We used 12-layer
U-nets (6 layers for the encoder and 6 for the decoder). A U-net is used for estimating a
soft mask for each source (stem). The training loss is an L1-norm between masked input mix
spectrograms and source-target spectrograms. The models were trained on Deezer’s internal
datasets (notably the Bean dataset that was used in (Prétet et al., 2019)) using Adam
(Kingma & Ba, 2014). Training took approximately a full week on a single GPU.
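The masking objective described above can be sketched with toy numpy arrays. This is an illustrative sketch only: the function names, array shapes, and ratio-mask formulation here are assumptions for the example, not Spleeter's actual code.

```python
import numpy as np

def soft_masks(source_mags, eps=1e-10):
    # Ratio masks: each source's magnitude divided by the sum over all
    # sources, so the masks are non-negative and sum to (almost) one.
    return source_mags / (source_mags.sum(axis=0) + eps)

def l1_loss(mask, mix_mag, target_mag):
    # L1-norm between the masked mix spectrogram and the source target.
    return np.mean(np.abs(mask * mix_mag - target_mag))

# Toy magnitude spectrograms for 4 stems: (stems, freq_bins, time_frames).
rng = np.random.default_rng(0)
stems = rng.random((4, 1025, 100))
mix = stems.sum(axis=0)                  # idealized linear mix
masks = soft_masks(stems)
loss = l1_loss(masks[0], mix, stems[0])  # ~0 in this ideal case
```

In training, the network predicts the mask from the mix alone and the loss is taken against the ground-truth stem; in this toy example the mix is exactly the sum of the stems, so the ratio mask reconstructs each stem and the loss is near zero.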
Separation is then performed from the estimated source spectrograms using soft masking or
multichannel Wiener filtering.
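For the single-channel case, a Wiener-style filter can be sketched in a few lines of numpy; the multichannel variant used by Spleeter additionally models spatial information across channels. Names and shapes below are illustrative assumptions, not Spleeter's API.

```python
import numpy as np

def wiener_gains(source_mags, eps=1e-10):
    # Wiener-style gains: each source's power spectrogram divided by
    # the total power over all sources.
    power = source_mags ** 2
    return power / (power.sum(axis=0) + eps)

# Toy estimated magnitudes for 2 sources plus a complex mix STFT,
# shapes (sources, freq_bins, time_frames) and (freq_bins, time_frames).
rng = np.random.default_rng(1)
est = 0.1 + rng.random((2, 257, 20))  # offset keeps power away from zero
mix_stft = rng.standard_normal((257, 20)) + 1j * rng.standard_normal((257, 20))

separated = wiener_gains(est) * mix_stft  # complex STFT of each source
```

Because the gains sum to one, the separated source STFTs add back up to the original mix, which is a desirable conservation property of this family of filters.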
Training and inference are implemented in TensorFlow, which makes it possible to run the code
on a Central Processing Unit (CPU) or GPU.
Speed
As the whole separation pipeline can be run on a GPU and the model is based on a CNN,
computations are efficiently parallelized and model inference is very fast. For instance, Spleeter
is able to separate the whole musdb18 test dataset (about 3 hours and 27 minutes of audio)
into 4 stems in less than 2 minutes, including model loading time (about 15 seconds), and
audio wav files export, using a single GeForce RTX 2080 GPU and a double Intel Xeon Gold
6134 CPU @ 3.20GHz (the CPU is used for mix files loading and stem files export only). In this
setup, Spleeter is able to process 100 seconds of stereo audio in less than 1 second, which
makes it very useful for efficiently processing large datasets.
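The real-time factor implied by these figures can be verified with quick arithmetic, using the stated upper bound of 2 minutes of wall-clock time:

```python
# Timings quoted above (upper bounds from the text).
audio_seconds = 3 * 3600 + 27 * 60  # musdb18 test set: ~3 h 27 min of audio
wall_seconds = 2 * 60               # "less than 2 minutes", incl. model loading

realtime_factor = audio_seconds / wall_seconds
# over 100 seconds of stereo audio processed per second of wall-clock time
```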
Separation performance
The models compete with the state of the art on the standard musdb18 dataset (Rafii et al.,
2017), although they were not trained, validated, or optimized in any way on musdb18 data. We
report results in terms of standard source separation metrics (Vincent, Gribonval, & Fevotte,
2006), namely Signal to Distortion Ratio (SDR), Signal to Artifacts Ratio (SAR), Signal to
Interference Ratio (SIR), and source Image to Spatial distortion Ratio (ISR). They are presented
in the following table alongside Open-Unmix (Stöter, Uhlich, Liutkus, & Mitsufuji, 2019)
and Demucs (Défossez, Usunier, Bottou, & Bach, 2019) (only SDR is reported for Demucs
since the other metrics are not available in the paper), which are, to the authors’ knowledge, the
only released systems that perform near the state of the art. We present results
for soft masking and for multichannel Wiener filtering (applied using Norbert (Liutkus &
Stöter, 2019)). As can be seen, Spleeter is competitive with Open-Unmix on most metrics,
especially on SDR for all instruments, and is almost on par with Demucs.
             Spleeter Mask   Spleeter MWF   Open-Unmix   Demucs
Vocals SDR   6.55            6.86           6.32         7.05
Vocals SIR   15.19           15.86          13.33        13.94
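As a rough illustration of what SDR values mean, a simplified SDR can be computed as the energy ratio between the reference and the residual error, in dB. Note this sketch omits BSS Eval's projection of allowed distortions, so it is not the exact metric used in the table above.

```python
import numpy as np

def simple_sdr(reference, estimate):
    # Simplified signal-to-distortion ratio in dB: energy of the reference
    # over the energy of the residual. BSS Eval additionally projects out
    # allowed distortions before computing the ratio; this sketch does not.
    err = reference - estimate
    return 10 * np.log10(np.sum(reference**2) / (np.sum(err**2) + 1e-12))

# Toy signals: a 440 Hz reference plus a small interfering residual.
t = np.linspace(0, 1, 16000)
ref = np.sin(2 * np.pi * 440 * t)
est = ref + 0.01 * np.sin(2 * np.pi * 1000 * t)
sdr = simple_sdr(ref, est)  # ~40 dB: residual energy is 1e-4 of the reference
```

Each 10 dB corresponds to a tenfold reduction in residual energy, so the roughly 6–7 dB vocal SDR figures in the table indicate the estimated vocals carry several times more target energy than distortion.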
Hennequin et al., (2020). Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software,
5(50), 2154. https://doi.org/10.21105/joss.02154