Towards Listening to 10 People Simultaneously (Sinkhorn PIT) – Demo Page
H. Tachibana, “Towards listening to 10 people simultaneously: An efficient permutation invariant training of audio source separation using Sinkhorn’s algorithm.”
The model was trained on the LibriMix train+valid sets and tested on the LibriMix test set (5-mix and 10-mix).
LibriMix is based on LibriSpeech, an English ASR corpus.
Notes on LibriSpeech
LibriSpeech is distributed under the Creative Commons Attribution 4.0 International (CC BY 4.0) License.
V. Panayotov et al. “LibriSpeech: An ASR Corpus based on Public Domain Audio Books,” Proc. ICASSP 2015.
Changes:
Each recording was resampled from 16 kHz to 8 kHz.
Recordings were trimmed to a common length.
Multiple recordings were mixed to simulate several people talking simultaneously.
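The three changes listed above can be sketched in a few lines. This is only an illustration, not the script used to build the demo data: the function name `prepare_mixture` is hypothetical, trimming to the shortest utterance is an assumption about how the lengths were aligned, and no loudness normalization is applied.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly


def prepare_mixture(utterances, orig_sr=16000, target_sr=8000):
    """Resample, trim, and sum single-speaker signals (hypothetical helper).

    utterances: list of 1-D arrays sampled at orig_sr.
    Returns (mixture, sources), where sources has shape (n_speakers, n_samples).
    """
    g = gcd(orig_sr, target_sr)
    up, down = target_sr // g, orig_sr // g  # 1 and 2 for 16 kHz -> 8 kHz

    # Polyphase resampling applies an anti-aliasing filter before decimation.
    resampled = [
        resample_poly(np.asarray(u, dtype=np.float64), up, down)
        for u in utterances
    ]

    # Assumption: align lengths by trimming everything to the shortest signal.
    n = min(len(u) for u in resampled)
    sources = np.stack([u[:n] for u in resampled])

    # The simulated multi-talker input is the plain sum of the sources.
    mixture = sources.sum(axis=0)
    return mixture, sources
```

For a 5-mix or 10-mix item, `utterances` would hold 5 or 10 single-speaker recordings, and `sources` provides the ground-truth signals against which the separated outputs are scored.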
5-mix
Example 1
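Speaker ID | Estimated | Ground Truth | Input SI-SDR | Output SI-SDR | SI-SDR improvement
1 | (audio) | (audio) | –2.66 dB | 5.56 dB | 8.21 dB
2 | (audio) | (audio) | –6.66 dB | 5.72 dB | 12.39 dB
3 | (audio) | (audio) | –4.77 dB | 8.14 dB | 12.91 dB
4 | (audio) | (audio) | –9.94 dB | –1.61 dB | 8.32 dB
5 | (audio) | (audio) | –8.16 dB | 3.73 dB | 11.89 dB
Average | | | | | 10.75 dB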
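The SI-SDR columns in these tables (input, output, and their difference) can be computed as in the sketch below. It follows the common definition of scale-invariant SDR and is not taken from the author's evaluation code: the function names are hypothetical, and removing the mean of each signal before projection is an assumption that many implementations share.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)

    # Assumption: both signals are made zero-mean first.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()

    # Project the estimate onto the reference to get the scaled target.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    residual = estimate - target

    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


def si_sdr_improvement(estimate, reference, mixture):
    """Output SI-SDR minus input SI-SDR, where the input is the raw mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```

Here the input SI-SDR scores the unprocessed mixture against a speaker's ground truth, the output SI-SDR scores the separated signal, and the improvement is their difference. Because each value is rounded to two decimals for display, a listed improvement can differ from the difference of the listed input and output by 0.01 dB.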
Example 2
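Speaker ID | Estimated | Ground Truth | Input SI-SDR | Output SI-SDR | SI-SDR improvement
1 | (audio) | (audio) | –8.69 dB | –1.31 dB | 7.38 dB
2 | (audio) | (audio) | –11.86 dB | 0.41 dB | 12.27 dB
3 | (audio) | (audio) | –5.35 dB | –3.04 dB | 2.31 dB
4 | (audio) | (audio) | –2.15 dB | –1.13 dB | 1.02 dB
5 | (audio) | (audio) | –5.68 dB | –7.34 dB | –1.66 dB
Average | | | | | 4.26 dB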
Example 3
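Speaker ID | Estimated | Ground Truth | Input SI-SDR | Output SI-SDR | SI-SDR improvement
1 | (audio) | (audio) | –5.95 dB | 8.02 dB | 13.97 dB
2 | (audio) | (audio) | –3.14 dB | 7.26 dB | 10.40 dB
3 | (audio) | (audio) | –7.91 dB | 4.00 dB | 11.91 dB
4 | (audio) | (audio) | –8.04 dB | 1.16 dB | 9.20 dB
5 | (audio) | (audio) | –5.75 dB | 9.18 dB | 14.93 dB
Average | | | | | 12.08 dB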
10-mix
Example 1
This test data is the same as the one shown in the paper.
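Speaker ID | Estimated | Ground Truth | Input SI-SDR | Output SI-SDR | SI-SDR improvement
1 | (audio) | (audio) | –8.29 dB | 3.95 dB | 12.24 dB
2 | (audio) | (audio) | –8.76 dB | –0.40 dB | 8.37 dB
3 | (audio) | (audio) | –10.92 dB | –11.52 dB | –0.59 dB
4 | (audio) | (audio) | –9.26 dB | 2.16 dB | 11.42 dB
5 | (audio) | (audio) | –6.52 dB | 1.99 dB | 8.51 dB
6 | (audio) | (audio) | –14.13 dB | –13.59 dB | 0.54 dB
7 | (audio) | (audio) | –16.20 dB | –12.13 dB | 4.07 dB
8 | (audio) | (audio) | –11.20 dB | –5.26 dB | 5.94 dB
9 | (audio) | (audio) | –9.66 dB | –1.47 dB | 8.19 dB
10 | (audio) | (audio) | –7.52 dB | 3.19 dB | 10.71 dB
Average | | | | | 6.94 dB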
Example 2
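Speaker ID | Estimated | Ground Truth | Input SI-SDR | Output SI-SDR | SI-SDR improvement
1 | (audio) | (audio) | –12.75 dB | –2.69 dB | 10.06 dB
2 | (audio) | (audio) | –13.67 dB | –11.22 dB | 2.44 dB
3 | (audio) | (audio) | –11.69 dB | –7.26 dB | 4.42 dB
4 | (audio) | (audio) | –6.28 dB | 0.65 dB | 6.93 dB
5 | (audio) | (audio) | –10.85 dB | –1.68 dB | 9.16 dB
6 | (audio) | (audio) | –6.70 dB | –5.38 dB | 1.33 dB
7 | (audio) | (audio) | –10.96 dB | 2.13 dB | 13.09 dB
8 | (audio) | (audio) | –10.77 dB | –8.00 dB | 2.78 dB
9 | (audio) | (audio) | –8.14 dB | –0.44 dB | 7.70 dB
10 | (audio) | (audio) | –8.24 dB | –2.60 dB | 5.64 dB
Average | | | | | 6.35 dB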
Example 3
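Speaker ID | Estimated | Ground Truth | Input SI-SDR | Output SI-SDR | SI-SDR improvement
1 | (audio) | (audio) | –3.66 dB | 3.56 dB | 7.22 dB
2 | (audio) | (audio) | –15.12 dB | –0.60 dB | 14.52 dB
3 | (audio) | (audio) | –16.71 dB | –1.94 dB | 14.77 dB
4 | (audio) | (audio) | –12.19 dB | –5.11 dB | 7.10 dB
5 | (audio) | (audio) | –7.01 dB | –0.67 dB | 6.35 dB
6 | (audio) | (audio) | –10.71 dB | –6.63 dB | 4.08 dB
7 | (audio) | (audio) | –12.28 dB | –3.93 dB | 8.35 dB
8 | (audio) | (audio) | –9.52 dB | –0.52 dB | 9.00 dB
9 | (audio) | (audio) | –8.26 dB | –4.15 dB | 4.11 dB
10 | (audio) | (audio) | –11.99 dB | –6.07 dB | 5.91 dB
Average | | | | | 8.14 dB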
Example 4: trained on the LibriMix (English data), tested on the JVS Corpus / nonpara30 (Japanese data)
JVS corpus
S. Takamichi, et al. “JVS corpus: free Japanese multi-speaker voice corpus,” arXiv preprint, 1908.06248, Aug. 2019.
Note that the system heard only English speech during training and never heard Japanese speech. Neither fine-tuning nor domain adaptation was applied.
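Speaker ID | Estimated | Ground Truth | Input SI-SDR | Output SI-SDR | SI-SDR improvement
1 | (audio) | (audio) | –12.63 dB | –9.86 dB | 2.77 dB
2 | (audio) | (audio) | –12.71 dB | –9.09 dB | 3.62 dB
3 | (audio) | (audio) | –10.26 dB | –2.27 dB | 7.99 dB
4 | (audio) | (audio) | –14.21 dB | –10.30 dB | 3.91 dB
5 | (audio) | (audio) | –16.86 dB | –13.80 dB | 3.07 dB
6 | (audio) | (audio) | –8.23 dB | –7.41 dB | 0.82 dB
7 | (audio) | (audio) | –6.99 dB | –2.88 dB | 4.11 dB
8 | (audio) | (audio) | –10.03 dB | –8.77 dB | 1.26 dB
9 | (audio) | (audio) | –3.87 dB | 0.37 dB | 4.24 dB
10 | (audio) | (audio) | –9.26 dB | 4.24 dB | 13.51 dB
Average | | | | | 4.53 dB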
Copyright ©️ Hideyuki Tachibana. All rights reserved.