We use a standard HMM-based tagging framework of the kind found in many systems (e.g. [DeRose, 1988]). The model consists of two parts: an n-gram model of part-of-speech tag sequences and a model of the likelihood of part-of-speech tags for individual words. The two are combined using Bayes' theorem, and the Viterbi algorithm is used to find the most probable tag sequence for a given sequence of words.
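In the standard formulation (the notation here is ours, not taken from the cited systems), the most probable tag sequence $\hat{t}_1^n$ for a word sequence $w_1^n$ is obtained by applying Bayes' theorem and dropping the constant term $P(w_1^n)$:
\[
\hat{t}_1^n \;=\; \operatorname*{arg\,max}_{t_1^n} P(t_1^n \mid w_1^n)
\;=\; \operatorname*{arg\,max}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1},\ldots,t_{i-m+1}),
\]
where the second factor is the n-gram POS sequence model, the first is the word likelihood model, and the Viterbi algorithm finds the maximising sequence without enumerating all possible tag sequences.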
Although the Spoken English Corpus is marked with POS tags, it contains too few words to train an HMM POS tagger. We therefore used the Penn Treebank [Marcus et al., 1993], which consists of around 1.2 million words from the Wall Street Journal (WSJ). Apart from size, we do not think the two corpora differ significantly in POS behaviour. The WSJ data were tagged automatically and then hand corrected. Punctuation is reduced to a single tag, giving a base tagset of K=37. A generic unknown-word POS distribution is built from the POS distributions of a set of less frequent words, and a special distribution is used for words consisting only of digits. The tagger correctly tagged 94.4% of the words in an independent test set of 113,000 words.
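As a rough sketch of how such a generic unknown-word distribution might be built (the function name, data format, and frequency cutoff below are our assumptions; the exact rarity threshold is not stated above):

```python
from collections import Counter

def unknown_word_distribution(tagged_words, max_freq=5):
    """Pool the tag counts of infrequent words into a single POS
    distribution to use for words unseen in training.
    `tagged_words` is a list of (word, tag) pairs; `max_freq` is an
    assumed rarity cutoff, not taken from the original description."""
    word_freq = Counter(word for word, _ in tagged_words)
    tag_counts = Counter(tag for word, tag in tagged_words
                         if word_freq[word] <= max_freq)
    total = sum(tag_counts.values())
    return {tag: count / total for tag, count in tag_counts.items()}

# A separate distribution for purely numeric tokens could be built the
# same way, restricting the pool to words for which word.isdigit() holds.
```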
The parameters of our POS sequence model are estimated from POS tag occurrences, so while the full tagset is potentially the most discriminative, it also leads to sparse data problems. A series of experiments found a tagset of size 23 to be the best overall. The reduction can be carried out in two ways: either by mapping the output of the tagger onto the smaller tagset, or by training the tagger on the smaller set directly. Post-mapping the tags gave an accuracy of 97.0%, while training and testing on the reduced set gave a worse figure of 96.2%. We therefore always tag with the full tagset and reduce the size of the set afterwards.
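Post-mapping amounts to tagging with all 37 tags and then collapsing the output. A minimal sketch is shown below; the mapping is purely illustrative and does not reproduce the actual groupings that yield the 23-tag set:

```python
# Illustrative only: collapse some fine-grained Penn Treebank tags into
# coarser classes; the real full-to-reduced mapping is not given here.
TAG_MAP = {
    "NN": "N", "NNS": "N", "NNP": "N", "NNPS": "N",   # nouns
    "VB": "V", "VBD": "V", "VBG": "V", "VBN": "V",    # verbs
    "VBP": "V", "VBZ": "V",
}

def post_map(tagged_sentence, tag_map=TAG_MAP):
    """Run the tagger with the full tagset, then map each tag onto the
    reduced set; tags without an entry are left unchanged."""
    return [(word, tag_map.get(tag, tag)) for word, tag in tagged_sentence]
```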