
Evaluation

Three levels of tests were carried out. The first was within a particular dialect, to confirm that the pronunciations of letters in different contexts were properly treated.

Looking at the decision cluster trees, we can see that letter context (and position in the word) is being used to differentiate between the multiple realizations of a letter-phone. For example, both voices managed to learn the distinction between the three different ways to pronounce ``c'' (/k/, /ch/, and either /th/ or /s/, depending on the dialect) and the two different ways to pronounce ``g'' (/g/ and /j/).


Word       Castilian          Gloss

casa       /k a s a/          house
cesa       /th e s a/         stop
cine       /th i n e/         cinema
cosa       /k o s a/          thing
cuna       /k u n a/          cradle
hechizo    /e ch i th o/      charm, spell

In Spanish the letter ``c'' may be pronounced /k/, /ch/, and either /th/ or /s/ (depending on dialect). The choice of phone is determined by the letter context.

Other examples of how these letter-based voices learned context-sensitive differences can be found in the sequences ``que-'', ``qui-'', ``gue-'' and ``gui-'', where the ``u'' is not pronounced (quien, querida, guerra, guitarra), and in the single ``r'', which is pronounced /rr/ when it appears at the beginning of the sentence (rosado), as opposed to /r/ when it appears in an intervocalic position (coral).
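The contexts listed above can be written down as a handful of explicit rules, which makes it easier to see what the cluster trees had to discover from audio alone. The following is only an illustrative sketch, not the method used to build the voices: the function name `letters_to_phones` is hypothetical, the rules cover only the unaccented Castilian examples discussed here, and treating every word-initial ``r'' as /rr/ is an assumption based on the standard Spanish rule.

```python
FRONT_VOWELS = set("ei")


def letters_to_phones(word):
    """Hypothetical hand-written rules for the contexts discussed above.

    The voices in the text learn these distinctions from data; this
    function only restates them explicitly (Castilian variant).
    """
    w = word.lower()
    phones = []
    i = 0
    while i < len(w):
        ch = w[i]
        nxt = w[i + 1] if i + 1 < len(w) else ""
        after = w[i + 2] if i + 2 < len(w) else ""
        step = 1
        if ch == "c" and nxt == "h":
            phones.append("ch")          # digraph "ch"
            step = 2
        elif ch == "c":
            # "c" before e/i is /th/, otherwise /k/
            phones.append("th" if nxt in FRONT_VOWELS else "k")
        elif ch == "z":
            phones.append("th")
        elif ch == "q":
            phones.append("k")           # "qu": the "u" is silent
            if nxt == "u":
                step = 2
        elif ch == "g":
            if nxt == "u" and after in FRONT_VOWELS:
                phones.append("g")       # "gue", "gui": the "u" is silent
                step = 2
            elif nxt in FRONT_VOWELS:
                phones.append("j")
            else:
                phones.append("g")
        elif ch == "r":
            if nxt == "r":
                phones.append("rr")      # written double "rr"
                step = 2
            elif i == 0:
                phones.append("rr")      # assumption: word-initial "r"
            else:
                phones.append("r")
        elif ch == "h":
            pass                         # silent "h"
        else:
            phones.append(ch)
        i += step
    return " ".join(phones)


for example in ["casa", "cine", "hechizo", "guitarra", "rosado", "coral"]:
    print(example, "/" + letters_to_phones(example) + "/")
```

A letter-based voice never sees rules of this kind; the point of the evaluation is that the same distinctions emerge from the clustering when the recordings contain enough examples of each context.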

This shows that, given enough audio examples of all these different contexts, the synthesizers were able to learn context-sensitive differences. It is therefore possible to build a voice for a language without knowing what its phoneme set is.

The letter ``x'' performed worst: the systems seem to always pronounce it as /ks/, but in many cases it should be pronounced as /s/. This may be caused by the relatively rare occurrence of the letter in the database, where it appears only 52 times.

The second level of evaluation investigated how dialect differences are reflected. The most obvious difference between Castilian Spanish and Colombian Spanish is the use of /th/ and /s/ for the letters ``c'' and ``z''.


Word       Castilian          Colombian          Gloss

caza       /k a th a/         /k a s a/          hunting
cesa       /th e s a/         /s e s a/          stop
cine       /th i n e/         /s i n e/          cinema
hechizo    /e ch i th o/      /e ch i s o/       charm, spell

Dialectal differences for the letters ``c'' and ``z'' captured correctly by our two voices.

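For the words in this table, the two dialects differ only in that every Castilian /th/ corresponds to a Colombian /s/. A minimal sketch of that correspondence is given below. It assumes this single substitution is the only difference being modelled, which holds for the table but not for the dialects in general, and the function name `castilian_to_colombian` is hypothetical. The two voices themselves were each built from their own recordings, not by mapping one phone string onto the other.

```python
def castilian_to_colombian(phones):
    """Map a space-separated Castilian phone string to its Colombian
    counterpart, assuming /th/ -> /s/ is the only difference."""
    return " ".join("s" if p == "th" else p for p in phones.split())


table = {
    "caza": "k a th a",
    "cesa": "th e s a",
    "cine": "th i n e",
    "hechizo": "e ch i th o",
}

for word, castilian in table.items():
    print(word, "/" + castilian + "/", "/" + castilian_to_colombian(castilian) + "/")
```

Note that /ch/ is left untouched because the substitution is applied to whole phones, not to characters.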
The third evaluation was less specific to particular identifiable phenomena, and focused on the overall synthesis quality. Two short paragraphs were taken from La Vanguardia (May 20, 2002) and were synthesized by each of the two voices.

Sevilla, Agencias. Los sindicatos UGT y CC.OO. han exigido al presidente del Gobierno, José María Aznar, que convoque la mesa de negociación de la reforma del sistema de protección por desempleo, tras reunirse con el presidente de la Junta de Andalucía, Manuel Chaves, y el de Extremadura, Juan Carlos Rodríguez Ibarra.

El secretario general de UGT, Cándido Méndez, junto al responsable de CC.OO., José María Fidalgo, reiteró la necesidad de que sea Aznar quien convoque y esté presente en esta mesa, si bien precisó que esta reunión no servirá para nada si la cita no comienza con el anuncio del Gobierno de que retirará su actual propuesta de reforma.

This passage consists of 109 words. The synthesized versions from the Castilian and Colombian letter-based synthesizers were listened to by a native Spanish speaker (who is the Castilian speaker and an author of this paper). Each word was assigned a value of good, poor or bad.

Dialect      good    poor    bad    % good

Castilian    102     6       1      93.57%
Colombian    99      5       5      90.82%

Here ``poor'' is defined to be words that are not clearly synthesized. An example of a token that was labeled as ``poor'' is CC.OO., which was not analyzed properly and was thus assigned a default error word pronunciation.
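The ``% good'' column is the number of words judged good divided by the 109 words in the passage. The short sketch below recomputes it from the counts in the table; the function name `percent_good` is hypothetical. It truncates to two decimal places because that reproduces the reported figures (rounding instead would give 93.58 and 90.83).

```python
def percent_good(good, poor, bad):
    """Share of words judged good, as a percentage truncated to two decimals."""
    total = good + poor + bad
    # Integer arithmetic avoids floating-point surprises in the truncation.
    return (good * 10000 // total) / 100


results = {
    "Castilian": (102, 6, 1),
    "Colombian": (99, 5, 5),
}

for dialect, counts in results.items():
    print(dialect, sum(counts), "words,", percent_good(*counts), "% good")
```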

It should be noted that no hand correction of the labels was done, and some of these errors are due to more conventional unit selection errors rather than to the letter/phone restrictions that we are imposing on these particular builds. Hand correction of segmental boundaries is always worthwhile in a unit selection synthesizer, but at this stage we did not wish to introduce that complication in this experiment.

The one phonetic error in the Castilian voice was ``Sevilla'', pronounced as /s e v i l a/ rather than /s e v i y a/.

The Colombian voice made the same error in ``Sevilla'' and also pronounced as /s/ an instance of ``c'' which should have been pronounced as /k/ (``actual'' was rendered as /asetual/, where the ``e'' is probably due to bad alignment). The other bad examples may be better attributed to bad alignments (as were all extra inserted vowels).


Alan W Black 2002-10-01