Talking clocks are good as toy examples and for debugging the process, but few applications require such a closed domain. The question we need to address is how this technique performs on larger domains. As with general unit selection synthesizers, it is clear that when it works the quality is excellent; what must be investigated more carefully is how often the technique fails and how badly. Since we are proposing not just a system that offers high quality synthesis but also a method for building such voices, we must also test the reliability of the voice-building process.
We devised a simple weather report system that downloaded weather reports for named US cities from weather.gov. This is a simple slot-filling template problem, with a template of the form
The weather at HOUR, on DAY DATE, outlook OUTLOOK, TEMPERATURE degrees, winds WINDIRECTION WINDSPEED (with gusts to WINDSPEED).

We generated 250 utterances of this type, looping through values for the slots, e.g.
The weather at 1 A.M., on Sunday January 1, outlook cloudy, 20 degrees, winds North 2 miles per hour.

The first hundred were recorded and used to build a limited domain synthesizer as described above. The second hundred were used to find problems, which were then fixed by correcting the automatic labeling. The final 50 utterances were held out for testing only.
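As an illustration of how such a prompt list can be produced, the sketch below loops over slot values to fill the template. The slot value lists, file-name scheme, and the omission of the optional gusts clause are illustrative assumptions, not the actual values used in the experiment.

```python
import itertools

# Illustrative slot values; the real experiment cycled through the full
# range of hours, dates, outlooks, temperatures and winds.
HOURS = [f"{h} A.M." for h in range(1, 13)] + [f"{h} P.M." for h in range(1, 13)]
DAYS_DATES = [("Sunday", "January 1"), ("Monday", "January 2")]
OUTLOOKS = ["cloudy", "sunny", "rain"]
TEMPERATURES = [20, 45, 63]
WINDS = [("North", 2), ("Northwest", 10)]

# Gusts clause omitted for brevity.
TEMPLATE = ("The weather at {hour}, on {day} {date}, outlook {outlook}, "
            "{temp} degrees, winds {wind_dir} {wind_speed} miles per hour.")

prompts = []
for hour, (day, date), outlook, temp, (wind_dir, wind_speed) in itertools.product(
        HOURS, DAYS_DATES, OUTLOOKS, TEMPERATURES, WINDS):
    prompts.append(TEMPLATE.format(hour=hour, day=day, date=date, outlook=outlook,
                                   temp=temp, wind_dir=wind_dir, wind_speed=wind_speed))

# Keep 250 prompts: 100 to record and build the voice, 100 for diagnosis,
# 50 held out for testing.
prompts = prompts[:250]
for i, text in enumerate(prompts):
    print(f"weather_{i:03d}  {text}")
```

The printed list can then serve directly as the prompt file for recording and labeling.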
Once recorded, the basic voice took less than an hour to build on a 500 MHz Pentium III running Linux. Less than a day was then spent by one person fixing problems; most of that time went into a visual check of all the phone labels. The second set of one hundred sentences was used as a diagnostic test. Most of the problems found were minor segmental labeling errors, though in three cases the speaker said a different word from the prompt: ``west'' for ``east'' and ``pm'' for ``am'' (twice). The autolabeller can (unfortunately) cope with such mismatches, but this of course causes a problem when the utterance actually spoken is semantically different from, but phonetically similar to, the one requested. However, as pointed out above, this robustness is also sometimes valuable.
The 50 held-out test sentences were then evaluated, both with the fully automatic but uncorrected labeling and with the corrected labeling. Three categories were identified: correct, where no notable errors in synthesis were heard; minor, where some notable glitch in synthesis occurs but the sentence is still fully understandable; and wrong, where a semantic error occurs (wrong word) or the synthesis has a major problem that affects understandability.
            Correct   Minor   Wrong
Automatic     60%      32%      8%
Corrected     90%      10%      0%
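For concreteness, and assuming the reported percentages are exact fractions of the 50 held-out sentences, they correspond to whole-utterance counts of 30/16/4 (automatic) and 45/5/0 (corrected). The short sketch below only illustrates that arithmetic; the tallies are inferred from the percentages, not taken from the original score sheets.

```python
# Hypothetical tallies consistent with the reported percentages over
# the 50 held-out test sentences.
results = {
    "Automatic": {"correct": 30, "minor": 16, "wrong": 4},
    "Corrected": {"correct": 45, "minor": 5, "wrong": 0},
}

for condition, counts in results.items():
    total = sum(counts.values())  # 50 test sentences in each condition
    pcts = {label: 100 * n / total for label, n in counts.items()}
    print(condition, pcts)        # e.g. Automatic {'correct': 60.0, 'minor': 32.0, 'wrong': 8.0}
```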
This experiment suggests that we do have a relatively robust method for reliably building new voices in a very short time.