Latorre, J and Gales, MJF and Buchholz, S and Knill, K and Tamura, M and Ohtani, Y and Akamine, M (2011) Continuous F0 in the source-excitation generation for HMM-based TTS: Do we need voiced/unvoiced classification? ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. pp. 4724-4727. ISSN 1520-6149
Full text not available from this repository.
Most HMM-based TTS systems use a hard voiced/unvoiced classification to produce a discontinuous F0 signal, which is used to generate the source excitation. When a mixed source excitation is used, this decision can be based on two different sources of information: the state-specific MSD-prior of the F0 models, and/or the frame-specific features generated by the aperiodicity model. This paper examines the meaning of these variables in the synthesis process, their interaction, and how they affect the perceived quality of the generated speech. The results of several perceptual experiments show that when using mixed excitation, subjects consistently prefer samples with very few or no false unvoiced errors, whereas a reduction in the rate of false voiced errors does not produce any perceptual improvement. This suggests that rather than using any form of hard voiced/unvoiced classification, e.g., the MSD-prior, it is better for synthesis to use a continuous F0 signal and rely on the frame-level soft voiced/unvoiced decision of the aperiodicity model. © 2011 IEEE.
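The contrast the abstract draws between a hard voiced/unvoiced switch and a soft, aperiodicity-driven decision can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single frequency band, treats the per-frame aperiodicity value as the noise share of the excitation power, and uses hypothetical names throughout (`mixed_excitation`, `voiced`, `frame_len`).

```python
import math
import random


def mixed_excitation(f0, aperiodicity, voiced=None, fs=16000, frame_len=80, seed=0):
    """Toy single-band mixed excitation (illustrative assumption, not the paper's method).

    f0           -- per-frame F0 in Hz, continuous (defined for every frame)
    aperiodicity -- per-frame noise share of the excitation power, in [0, 1]
    voiced       -- optional per-frame hard flags; a False flag forces pure noise
    """
    rng = random.Random(seed)
    phase = 0.0
    samples = []
    for i, (hz, ap) in enumerate(zip(f0, aperiodicity)):
        if hz <= 0.0 or not 0.0 <= ap <= 1.0:
            raise ValueError("need f0 > 0 and aperiodicity in [0, 1]")
        if voiced is not None and not voiced[i]:
            ap = 1.0  # hard decision: discard the periodic component entirely
        for _ in range(frame_len):
            phase += hz / fs
            pulse = 0.0
            if phase >= 1.0:
                phase -= 1.0
                pulse = math.sqrt(fs / hz)  # pulse train with unit average power
            noise = rng.gauss(0.0, 1.0)
            # Soft decision: blend pulse and noise according to aperiodicity.
            samples.append(math.sqrt(1.0 - ap) * pulse + math.sqrt(ap) * noise)
    return samples


# Ten 5 ms frames at 100 Hz with a fully periodic aperiodicity estimate.
f0 = [100.0] * 10
soft = mixed_excitation(f0, [0.0] * 10)
# Same frames, but a hard classifier wrongly marks them all unvoiced.
hard = mixed_excitation(f0, [0.0] * 10, voiced=[False] * 10)
```

In this sketch a false unvoiced decision replaces the pulse train with noise regardless of what the aperiodicity estimate says, whereas the continuous-F0 path keeps the periodic component and lets the aperiodicity value set the mix frame by frame.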
|Divisions:||Div F > Machine Intelligence|