A unit selection text-to-speech-and-singing synthesis framework from neutral speech: proof of concept

Freixes, Marc; Alías-Pujol, Francesc; Socoró, Joan Claudi

dc.contributor	Universitat Ramon Llull. La Salle
dc.contributor.author	Freixes, Marc
dc.contributor.author	Alías-Pujol, Francesc
dc.contributor.author	Socoró, Joan Claudi
dc.date.accessioned	2020-03-27T15:11:35Z
dc.date.accessioned	2023-10-02T06:42:13Z
dc.date.available	2020-03-27T15:11:35Z
dc.date.available	2023-10-02T06:42:13Z
dc.date.created	2019
dc.date.issued	2019-12
dc.identifier.uri	http://hdl.handle.net/20.500.14342/3421
dc.description.abstract	Text-to-speech (TTS) synthesis systems have been widely used in general-purpose applications based on the generation of speech. Nonetheless, there are some domains, such as storytelling or voice output aid devices, which may also require singing. To enable a corpus-based TTS system to sing, a supplementary singing database should be recorded. This solution, however, might be too costly for occasional singing needs, or even unfeasible if the original speaker is unavailable or unable to sing properly. This work introduces a unit selection-based text-to-speech-and-singing (US-TTS&S) synthesis framework, which integrates speech-to-singing (STS) conversion to enable the generation of both speech and singing from an input text and a score, respectively, using the same neutral speech corpus. The viability of the proposal is evaluated considering three vocal ranges and two tempos on a proof-of-concept implementation using a 2.6-h Spanish neutral speech corpus. The experiments show that challenging STS transformation factors are required to sing beyond the corpus vocal range and/or with notes longer than 150 ms. While score-driven US configurations allow the reduction of pitch-scale factors, time-scale factors are not reduced due to the short length of the spoken vowels. Moreover, in the MUSHRA test, text-driven and score-driven US configurations obtain similar naturalness rates of around 40 for all the analysed scenarios. Although these naturalness scores are far from those of Vocaloid, the singing scores of around 60 which were obtained validate that the framework could reasonably address occasional singing needs.	eng
dc.format.extent	14 p.
dc.language.iso	eng
dc.publisher	SpringerLink
dc.relation.ispartof	EURASIP Journal on Audio, Speech, and Music Processing. 2019:22
dc.rights	© The author
dc.rights	Attribution 4.0 International
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/
dc.source	RECERCAT (Dipòsit de la Recerca de Catalunya)
dc.subject.other	Speech
dc.title	A unit selection text-to-speech-and-singing synthesis framework from neutral speech: proof of concept
dc.type	info:eu-repo/semantics/article
dc.type	info:eu-repo/semantics/publishedVersion
dc.rights.accessLevel	info:eu-repo/semantics/openAccess
dc.embargo.terms	none
dc.subject.udc	78
dc.identifier.doi	https://doi.org/10.1186/s13636-019-0163-y
dc.relation.projectID	info:eu-repo/grantAgreement/MINECO i FEDER/PN I+D Excelencia/TEC2016-81107-P
dc.relation.projectID	info:eu-repo/grantAgreement/SUR del DEC i FSE/FI/2016FI_B2 00094
dc.relation.projectID	info:eu-repo/grantAgreement/URL i La Caixa/Intensificació recerca PDI/2018-URL-IR1rQ-021
dc.relation.projectID	info:eu-repo/grantAgreement/URL i La Caixa/Intensificació recerca PDI/2018-URL-IR2nQ-029