Using a Decision Tree For Quoted-Speech Type Identification

We trained a decision tree (CART) to identify the aforementioned types of quoted speech using local features of the story text. The training data consisted of 16 children's stories taken from works by Hans Christian Andersen and Lewis Carroll, containing a total of 1198 pieces of quoted speech. To label the training data, we first approximated the quoted-speech types with a naive rule: if the first word of the quoted speech is not capitalized, the quote is classified as type "CONT"; otherwise it is classified as type "NEW". The output of this initial pass was then hand-corrected to eliminate any incorrect type assignments produced by the rule. From the labeled data, we extracted a number of features for each piece of quoted speech in order to train the decision tree. These features include:

the word string preceding the quote
the word string succeeding the quote
capitalization of the first word in the quote
whether the quote starts the paragraph
whether the quote is the first quote in the paragraph
the end-punctuation of the previous quote within the same paragraph
the punctuation of the word preceding the quote
the length (number of words) of the gap between the last quote and the current one (within the same paragraph)
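
The first-pass labeling rule and the features listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation used in the system: the function names (`naive_quote_type`, `extract_features`, `trailing_punct`), the dictionary keys, the use of straight double quotes as delimiters, and the choice of a single word for the preceding and following "word string" are all hypothetical.

```python
import re
import string

# Assumption: quotes are delimited by straight double quotes inside one paragraph.
QUOTE_RE = re.compile(r'"([^"]*)"')


def naive_quote_type(quote: str) -> str:
    """First-pass rule from the text: an uncapitalized first word means CONT, else NEW."""
    first_char = quote.lstrip()[:1]
    return "CONT" if first_char.islower() else "NEW"


def trailing_punct(token: str) -> str:
    """Return the final character of a token if it is punctuation, else an empty string."""
    return token[-1] if token and token[-1] in string.punctuation else ""


def extract_features(paragraph: str) -> list:
    """Build one feature dictionary per quote found in a single paragraph."""
    paragraph = paragraph.strip()
    features = []
    prev_end = 0        # character offset where the previous quote ended
    prev_quote = None   # text of the previous quote in this paragraph
    for index, match in enumerate(QUOTE_RE.finditer(paragraph)):
        quote = match.group(1)
        before = paragraph[:match.start()].split()
        after = paragraph[match.end():].split()
        prev_word = before[-1] if before else ""
        features.append({
            # Assumption: one word on each side stands in for the "word string".
            "prev_word": prev_word,
            "next_word": after[0] if after else "",
            "capitalized": quote.lstrip()[:1].isupper(),
            "starts_paragraph": not paragraph[:match.start()].strip(),
            "first_in_paragraph": index == 0,
            "prev_quote_end_punct": trailing_punct(prev_quote.strip()) if prev_quote else "",
            "prev_word_punct": trailing_punct(prev_word),
            # Gap is undefined (None) for the first quote in the paragraph.
            "gap_len": len(paragraph[prev_end:match.start()].split()) if index > 0 else None,
        })
        prev_end = match.end()
        prev_quote = quote
    return features
```

Calling `extract_features('"Hi," said Alice, "how are you?"')` yields two feature dictionaries; the second records a lowercase first word, a two-word gap, and a comma as the punctuation of the preceding word. Labels produced by `naive_quote_type` would still need hand-correction, as described above.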

The performance of the decision tree after training was as follows:

**Table 2:** *Performance of Decision Tree on Identifying Type of Quoted Speech*

| New | Cont. |
|---|---|
| 98.8% | 82.6% |

Examining the tree shows that the most reliable predictor of quoted-speech type is the capitalization feature. Other features, such as the punctuation of the token preceding the quote, might intuitively seem like good predictors as well, but statistically they proved less reliable.
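
To make "statistically less reliable" concrete, the sketch below scores candidate features by the weighted Gini impurity left after splitting on each one, which is the criterion commonly associated with CART. It is an illustration only: the function names are hypothetical, the four rows are invented toy examples rather than data from the training set, and partitioning by every distinct value is a simplification of CART's binary splits (the two coincide for a boolean feature such as capitalization).

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the list is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(rows, labels, feature):
    """Weighted Gini impurity after partitioning the rows by one feature's value."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    n = len(labels)
    return sum(len(group) / n * gini(group) for group in groups.values())


def rank_features(rows, labels):
    """Order feature names from lowest to highest post-split impurity (best first)."""
    return sorted(rows[0], key=lambda name: split_impurity(rows, labels, name))


# Invented toy data for illustration only; not taken from the stories.
toy_rows = [
    {"capitalized": True, "prev_word_punct": "."},
    {"capitalized": True, "prev_word_punct": ","},
    {"capitalized": False, "prev_word_punct": ","},
    {"capitalized": False, "prev_word_punct": ","},
]
toy_labels = ["NEW", "NEW", "CONT", "CONT"]

ranking = rank_features(toy_rows, toy_labels)
```

On this toy data the capitalization split separates the two classes completely, while the punctuation split leaves one mixed group, so `ranking` places `capitalized` first. Real data would of course be noisier, as the lower accuracy on the CONT class suggests.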

Alan W Black 2003-10-20