Next: Character Identification in a
Up: ESPER: architecture
Previous: Identifying Quoted Speech Types
We trained a decision tree (CART) to identify the
aforementioned types of quoted speech using local feature information
in the story text. The training data consisted of 16
children's stories taken from works by Hans Christian Andersen and
Lewis Carroll, with a total of 1198 pieces of quoted speech. To label
the training data, we first approximated the quoted-speech types with a
naive rule: if the first word of the quoted speech is not capitalized,
the quote is classified as type "CONT"; otherwise it is classified as
type "NEW". The output of this initial pass was then hand-corrected to
eliminate any incorrect type assignments produced by the
rule. From this training data, we then extracted a number of
features for each piece of quoted speech in order to train the
decision tree. These features include:
- the word string preceding the quote
- the word string following the quote
- the capitalization of the first word in the quote
- whether the quote starts the paragraph
- whether the quote is the first quote in the paragraph
- the end punctuation of the previous quote within the same paragraph
- the punctuation of the word preceding the quote
- the length (number of words) of the gap between the previous quote and the current one (within the same paragraph)
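The naive labeling rule and the feature list above can be sketched in Python as follows. This is only an illustrative sketch, not the original implementation: the `Quote` record, its field names, and the helper functions are all hypothetical, and the sketch assumes the quotes have already been segmented out of each paragraph.

```python
import string
from dataclasses import dataclass
from typing import Optional


@dataclass
class Quote:
    # Hypothetical record for one piece of quoted speech (not from the paper).
    text: str                         # quoted speech without quotation marks
    preceding: str                    # narration before the quote, same paragraph
    following: str                    # narration after the quote, same paragraph
    index_in_paragraph: int           # 0 for the first quote in the paragraph
    prev_quote: Optional[str] = None  # previous quote in the same paragraph
    gap_words: Optional[int] = None   # words between previous quote and this one


def first_word_capitalized(quote_text: str) -> bool:
    """True if the first word of the quote begins with an uppercase letter."""
    words = quote_text.split()
    return bool(words) and words[0][:1].isupper()


def naive_type(quote_text: str) -> str:
    """First-pass label: not capitalized -> CONT, otherwise NEW."""
    return "NEW" if first_word_capitalized(quote_text) else "CONT"


def trailing_punct(s: str) -> str:
    """Return the final character of s if it is punctuation, else ''."""
    s = s.rstrip()
    return s[-1] if s and s[-1] in string.punctuation else ""


def extract_features(q: Quote) -> dict:
    """Collect the eight local features listed above for one quote."""
    return {
        "preceding_words": q.preceding.strip(),
        "following_words": q.following.strip(),
        "first_word_capitalized": first_word_capitalized(q.text),
        "starts_paragraph": q.index_in_paragraph == 0 and not q.preceding.strip(),
        "first_quote_in_paragraph": q.index_in_paragraph == 0,
        "prev_quote_end_punct": trailing_punct(q.prev_quote or ""),
        "preceding_word_punct": trailing_punct(q.preceding),
        "gap_words": q.gap_words,
    }


# Invented example sentence: "I am tired," said Alice, "and I want to go home."
first = Quote("I am tired,", "", "said Alice,", 0)
second = Quote("and I want to go home.", "said Alice,", "", 1,
               prev_quote="I am tired,", gap_words=2)
```

Under these assumptions, `naive_type(second.text)` gives the first-pass label that would then be hand-corrected, and `extract_features(second)` gives the feature dictionary that would be passed to the decision tree learner.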
The performance of the decision tree after training was as follows:
Table 2: Performance of Decision Tree on Identifying Type of Quoted Speech
New | Cont.
98.8% | 82.6%
Examining the tree shows that the most reliable predictor of
quoted-speech type is the capitalization feature. Although other
features, such as the punctuation of the token preceding the quote,
might intuitively seem like good predictors of quote type,
statistically they were deemed to be less reliable.
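One way to see how a CART-style learner can rank features in this way is through the decrease in Gini impurity obtained by splitting on each feature. The sketch below is a minimal illustration under stated assumptions: the six rows are invented toy data (not taken from the training stories), and the feature names `cap` and `prev_comma` are hypothetical stand-ins for the capitalization and preceding-punctuation features.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_gain(rows, labels, feature):
    """Impurity decrease from splitting the data on one categorical feature."""
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    weighted = sum(len(g) / n * gini(g) for g in groups.values())
    return gini(labels) - weighted


# Invented toy data: each row is one quote, described by two boolean features.
rows = [
    {"cap": True, "prev_comma": False},
    {"cap": True, "prev_comma": True},
    {"cap": True, "prev_comma": False},
    {"cap": False, "prev_comma": True},
    {"cap": False, "prev_comma": True},
    {"cap": True, "prev_comma": True},
]
labels = ["NEW", "NEW", "NEW", "CONT", "CONT", "NEW"]

for name in ("cap", "prev_comma"):
    print(name, round(gini_gain(rows, labels, name), 4))
```

In this toy data the capitalization split separates the two classes completely, so it yields the larger impurity decrease and would be chosen first; the punctuation feature is correlated with the label but leaves one group mixed.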
Alan W Black
2003-10-20