Demo page of an efficiently trainable TTS

This is the demo page for the paper “Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention”, ICASSP 2018.
The paper is available at:
- IEEE Xplore
- arXiv

@inproceedings{tachibana2018efficiently,
   author={Tachibana, H. and Uenoyama, K. and Aihara, S.},
   title={Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention},
   booktitle={2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
   publisher={IEEE},
   month={Apr},
   year={2018},
   DOI={10.1109/icassp.2018.8461829}
}

Samples

Long Sentences

icassp stands for the international conference on acoustics, speech and signal processing.

Synthesized audio (15 hours of training).

Note: the words “ICASSP” and “acoustics” are not included in our training data.

generative adversarial network or variational autoencoder.

training time	synthesized audio
2 hours
7 hours
15 hours
40 hours

a matrix is positive definite, if all eigenvalues are positive.

training time	synthesized audio
2 hours
7 hours
15 hours
40 hours

a matrix is positive semi-definite, if all eigenvalues are non-negative.

training time	synthesized audio
2 hours
7 hours
15 hours
40 hours

if the hessian matrix of a multivariate function is positive semi-definite, the function is convex.

training time	synthesized audio
2 hours
7 hours
15 hours
40 hours

there always exists at least one mixed strategy nash equilibrium.

training time	synthesized audio
2 hours
7 hours
15 hours
40 hours

a spectrogram is obtained by applying es-tee-ef-tee to a signal.

training time	synthesized audio
2 hours
7 hours
15 hours
40 hours

the two player zero-sum game of wasserstein gan is derived by considering kantorovich-rubinstein duality.

training time	synthesized audio
2 hours
7 hours
15 hours
40 hours

keras is a famous deep learning framework for python, which uses tensorflow as a backend.

training time	synthesized audio
2 hours
7 hours
15 hours
40 hours

solidity is a turing complete programming language for ethereum, which is the second largest cryptocurrency based on blockchain.

training time	synthesized audio
2 hours
7 hours
15 hours
40 hours

brahms’s symphony number one is sometimes called beethoven’s tenth symphony.

training time	synthesized audio
2 hours
7 hours
15 hours
40 hours

a british author, kazuo ishiguro, famous for his works such as the remains of the day, was awarded the nobel prize in literature in two thousand seventeen.

training time	synthesized audio
2 hours
7 hours
15 hours
40 hours

tokyo is the most populous megacity in the world, with nearly thirty-eight million population.

training time	synthesized audio
2 hours
7 hours
15 hours
40 hours

in the chinese language, the word, crisis, is composed of two characters, one representing danger and the other, opportunity. (John F. Kennedy)

Note: This famous quote is not included in the training data.

training time	synthesized audio
2 hours
7 hours
15 hours
40 hours

to be, or not to be, that is the question. (Shakespeare)

Note: This famous quote is not included in the training data.

training time	synthesized audio
2 hours
7 hours
15 hours
40 hours