Throughout we have given at least three performance figures (junctures-correct, breaks-correct and juncture-insertions) for each experiment. While these give a reasonably representative indicator of true performance, and have been invaluable in system development, they must be treated with caution. First of all, it is difficult to judge the relative importance of insertion and deletion errors. Our best performing systems often increase the breaks-correct figure to the detriment of the juncture-insertion figure. From listening to the output, we believe that this is an acceptable tradeoff, and the best systems from a perceptual point of view are the ones which have the highest breaks-correct while maintaining a high junctures-correct score.
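To make the three figures concrete, the following is a minimal sketch of how they can be computed from a reference labelling and a system's prediction. The exact normalisations (in particular the denominator used for juncture-insertions) are assumptions for illustration, not definitions taken from the paper.

```python
def score(reference, predicted):
    """Compute the three figures for one labelled text.

    reference, predicted: sequences with one label per word juncture,
    'B' for a phrase break and 'N' for a non-break.
    """
    assert len(reference) == len(predicted)
    n_junctures = len(reference)
    n_ref_breaks = sum(1 for r in reference if r == 'B')

    # junctures-correct: fraction of all junctures labelled correctly
    junctures_correct = sum(1 for r, p in zip(reference, predicted) if r == p)
    # breaks-correct: fraction of reference breaks the system found
    breaks_correct = sum(1 for r, p in zip(reference, predicted)
                         if r == 'B' and p == 'B')
    # juncture-insertions: breaks placed where the reference has none
    insertions = sum(1 for r, p in zip(reference, predicted)
                     if r == 'N' and p == 'B')

    return {
        "junctures-correct": junctures_correct / n_junctures,
        "breaks-correct": breaks_correct / n_ref_breaks,
        "juncture-insertions": insertions / n_junctures,
    }
```

Under these definitions the tradeoff discussed above is visible directly: predicting breaks more aggressively raises breaks-correct but also raises juncture-insertions.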
Although formal listening tests might give a better indication of the relative merits of different systems, they are extremely costly compared to the automatic scoring techniques used here. However, it should be noted that a certain amount of ``perceptual tuning'' can be carried out by scaling the phrase break and POS sequence models individually, thus controlling the relative importance of the two components in equation 7. The scaling affects the relative numbers of insertions and deletions, and hence the amount of scaling can be set according to which type of error is deemed more significant.
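The effect of such scaling can be sketched in the log domain. The component probabilities and the weight alpha below are invented numbers for illustration; the real components are the phrase break sequence model and the POS sequence model of equation 7.

```python
def combined_score(log_p_pos_given_breaks, log_p_breaks, alpha):
    """Weighted log-domain combination: alpha scales the phrase break
    model relative to the POS sequence model."""
    return log_p_pos_given_breaks + alpha * log_p_breaks

def best_candidate(candidates, alpha):
    """candidates: dict mapping a candidate break sequence to the pair
    (log P(POS | breaks), log P(breaks))."""
    return max(candidates,
               key=lambda name: combined_score(*candidates[name], alpha))

# Two hypothetical break sequences for the same juncture: inserting a
# break fits the POS evidence better but is less likely a priori.
candidates = {
    "with-break":    (-5.0, -2.0),
    "without-break": (-6.0, -1.0),
}
```

With a small alpha the POS evidence dominates and the break is inserted; raising alpha penalises unlikely break sequences more heavily and suppresses the insertion, which is exactly how the insertion/deletion balance can be tuned.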
The scoring technique compares the system's placement of phrase breaks with that of a human labeller and so it is important to ask how consistently humans label phrase breaks. Unfortunately, no consistency figures are available for our data, but we think it is safe to assume that the labelling consistency is about the same as that measured by Pitrelli et al. \cite{tobi_comparison}, who obtained a breaks-correct figure of 92.5% for a similar task. This effectively defines the upper limit one can expect from an automatic system.
It is also important to note that not all errors are of equal importance. This is partly because speakers do not always place breaks in the same positions: some junctures can take either a break or a non-break without sounding odd, while other junctures must always be of the same type. Ostendorf and Veilleux \cite{ostendorf:94} proposed a solution to this by having the text in the test set spoken by several different speakers. Sometimes all the speakers agreed, sometimes not. By comparing the output of their system with each instance of the test sentences, it was possible to assess whether an error under the usual criterion was actually in a potentially acceptable place. Unfortunately such a comparison measure is not available to us, as we only have a single version of each sentence in the test set.
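The multi-speaker criterion can be sketched as follows: an apparent error is counted as acceptable if the system's label matches any speaker's rendition of that juncture. The function and data below are illustrative assumptions, not the scoring code of Ostendorf and Veilleux.

```python
def acceptable_errors(predicted, renditions):
    """Count errors against a primary reference, and how many of those
    errors match at least one speaker's rendition.

    predicted:  one 'B'/'N' label per juncture from the system.
    renditions: list of per-speaker label sequences for the same
                sentence; the first is taken as the primary reference.
    """
    primary = renditions[0]
    errors = 0
    acceptable = 0
    for i, p in enumerate(predicted):
        if p != primary[i]:
            errors += 1
            # The "error" is acceptable if some speaker realised the
            # juncture the same way the system did.
            if any(r[i] == p for r in renditions):
                acceptable += 1
    return errors, acceptable
```

With only one rendition per sentence, as in our test set, every error against the single reference necessarily counts as a genuine error, which is why this measure is unavailable to us.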