Google open-sources FUSS, a dataset that brings universal sound separation within reach

We are pleased to announce the release of the Free Universal Sound Separation (FUSS) dataset.

Real-world audio recordings usually contain many different sound sources. Universal sound separation is the ability to decompose such a mixture into its component sounds, no matter what types of sound it contains. Before this work, sound separation research focused mostly on decomposing mixtures into a small number of pre-specified sound types, such as "speech" versus "non-speech", or into different instances of the same type, such as speaker 1 versus speaker 2. Moreover, the number of sounds in the mixture was usually assumed to be known in advance. The FUSS dataset shifts the focus to the more general problem of separating a variable number of arbitrary sounds from a mixture.
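As a minimal illustration of the setup (a NumPy sketch, not part of the FUSS release): a mixture is just the sample-wise sum of a variable number of source waveforms, and universal separation means recovering each source from that sum without knowing the sources' types or count in advance.

```python
import numpy as np

def mix_sources(sources):
    """Sum a variable number of single-channel source waveforms into one mixture.

    `sources` is a list of equal-length 1-D arrays; the separation task is to
    recover each element of `sources` given only the returned sum.
    """
    return np.sum(np.stack(sources, axis=0), axis=0)

# Example: a 3-source mixture; another mixture might contain 1, 2, or 4 sources.
rng = np.random.default_rng(0)
sources = [0.1 * rng.standard_normal(16000) for _ in range(3)]
mixture = mix_sources(sources)
```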

One of the main obstacles to training models in this field is that, even with high-quality recordings of sound mixtures, it is very difficult to label those recordings with ground truth. High-quality simulation is one way around this limitation. Good simulation requires three ingredients: a diverse set of sounds, a realistic room simulator, and code to mix these elements together into realistic, multi-source, multi-class audio with ground-truth labels. The FUSS dataset provides all three.
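A rough sketch of why simulation yields ground truth for free (a toy illustration with SciPy; `simulate_mixture` and its inputs are placeholders, not the released mixing code): each dry source is convolved with a room impulse response, the reverberant sources are summed into the mixture, and the individual reverberant sources are kept as reference labels that a field recording could never provide.

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_mixture(dry_sources, rirs):
    """Convolve each dry source with its room impulse response and sum.

    Returns (mixture, reverberant_sources); the reverberant sources serve as
    the ground-truth references for training and evaluation.
    """
    reverberant = [fftconvolve(s, h)[: len(s)] for s, h in zip(dry_sources, rirs)]
    mixture = np.sum(reverberant, axis=0)
    return mixture, reverberant
```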

FUSS relies on Creative Commons licensed audio clips from freesound.org. Our team filtered these clips by license type and then used a pre-release version of FSD50k to further filter out sounds that cannot be separated once mixed together. After this filtering, roughly 23 hours of audio remained, comprising 12,377 sounds usable for training mixture-separation models: 7,237 for training, 2,883 for validation, and 2,257 for evaluation. From these clips we created 20,000 training mixtures, 1,000 validation mixtures, and 1,000 evaluation mixtures.
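As a sketch of how mixtures might be assembled from these splits (the 1-to-4-source range and the helper below are assumptions for illustration, not the released mixing pipeline): each simulated mixture draws a random, variable number of source clips from the split's pool.

```python
import random

def sample_mixture_clips(clip_paths, min_sources=1, max_sources=4, seed=None):
    """Pick a variable number of source clips for one simulated mixture."""
    rng = random.Random(seed)
    num_sources = rng.randint(min_sources, max_sources)
    return rng.sample(clip_paths, num_sources)

# e.g. draw clips for 20,000 training mixtures from the 7,237 training clips:
# train_clips = [...]  # paths to the training-split source audio
# mixtures = [sample_mixture_clips(train_clips, seed=i) for i in range(20000)]
```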

We developed our own room simulator on top of the open-source machine learning platform TensorFlow. Given the locations of a sound source and a microphone, the simulator generates the impulse response of a box-shaped room with frequency-dependent wall reflections. As part of the FUSS release, we provide pre-computed room impulse responses for each audio sample along with the mixing code, so the audio research community can simulate new audio from this dataset without running the heavy computations the room simulator requires. Future work may include releasing the room simulator code itself and extending the simulator to handle richer acoustic characteristics, such as materials with different reflective properties and irregular room shapes.
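For intuition about what a room impulse response encodes, here is a deliberately simplified toy generator (a stand-in sketch, not the TensorFlow simulator described above, which additionally models a box-shaped room with frequency-dependent reflections): a direct-path delay determined by the source-microphone distance plus an exponentially decaying diffuse tail.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def toy_rir(source_pos, mic_pos, rt60=0.3, sr=16000, length=8000, seed=0):
    """Toy impulse response: direct path plus an exponentially decaying tail."""
    distance = np.linalg.norm(np.asarray(source_pos, float) - np.asarray(mic_pos, float))
    delay = int(round(distance / SPEED_OF_SOUND * sr))
    rir = np.zeros(length)
    rir[delay] = 1.0 / max(distance, 1e-3)           # direct path, 1/r attenuation
    t = np.arange(length - delay) / sr
    decay = np.exp(-6.9 * t / rt60)                  # ~60 dB amplitude decay over rt60 seconds
    tail = np.random.default_rng(seed).standard_normal(length - delay) * decay
    rir[delay:] += 0.05 * tail
    return rir
```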

Finally, we also released a masking-based separation model built on an improved time-domain convolutional network (TDCN++). On the evaluation set, the model achieves 12.5 dB of scale-invariant signal-to-noise ratio improvement (SI-SNRi) on mixtures with two to four sources, while reconstructing single-source mixtures at 37.6 dB absolute SI-SNR.
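For reference, SI-SNR and its improvement over leaving the mixture untouched can be computed as below. This is the standard definition, sketched in NumPy for clarity rather than taken from the released evaluation code.

```python
import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference so that overall scale is ignored.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))

def si_snri(estimate, reference, mixture):
    """Improvement of the estimate's SI-SNR over the unprocessed mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)
```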

The dataset, including the source audio, the room impulse responses, the reverberated mixtures and sources created by the mixing code, and a baseline model checkpoint, is available for download. The code for reverberating and mixing the audio data and for training the released model is available on our GitHub page: https://github.com/google-research/sound-separation.

This dataset is also being used as a component of the Sound Event Detection and Separation task in the DCASE challenge organized by the IEEE. The released model will serve as the baseline for that competition and as a benchmark against which future progress can be measured.

We hope this dataset will lower the barriers to new research, and in particular will allow rapid iteration and application of novel techniques from other machine learning fields to the challenges of sound separation.