Individual Variation in the Choice of Referential Form

This study aims to measure the variation between writers in their choices of referential form by collecting and analysing a new and publicly available corpus of referring expressions. The corpus is composed of referring expressions produced by different participants in identical situations. Results, measured in terms of normalized entropy, reveal substantial individual variation. We discuss the problems and prospects of this finding for automatic text generation applications.


Introduction
Automatic text generation is the process of automatically converting data into coherent text. Practical applications range from weather reports (Goldberg et al., 1994) to neonatal intensive care reports (Portet et al., 2009). One important way to achieve coherence in texts is by generating appropriate referring expressions throughout the text (Krahmer and van Deemter, 2012). In this generation process, the choice of referential form is a crucial task (Reiter and Dale, 2000): when referring to a person or object in a text, should the system use a proper name ("Phillip Anschutz"), a definite description ("the American entrepreneur") or a pronoun ("he")?
Despite the large number of algorithms developed for deciding upon the form of a referring expression (Callaway and Lester, 2002; Greenbacker and McCoy, 2009; Gupta and Bandopadhyay, 2009; Orăsan and Dornescu, 2009; Greenbacker et al., 2010), it is difficult to know how well these algorithms actually perform. Typically, such algorithms are evaluated against a corpus of human-written texts, predicting what form each reference should have in a given context. Now consider a situation in which the algorithm predicts that a reference should be a description, while this same reference is a pronoun in the corpus text. Should this count as an error? The answer is: it depends. The use of a pronoun does not necessarily mean that the use of a description is incorrect. In fact, other writers might have used a description as well.
In general, corpora of referring expressions have only one gold standard referential form for each situation, while different writers may conceivably vary in the referential form they would use. This complicates the development and evaluation of text generation algorithms, since these will typically attempt to predict the corpus gold standard, which may not always be representative of the choices of different writers. Although recent work in text generation has explored individual variation in the content determination of definite descriptions (Viethen and Dale, 2010; Ferreira and Paraboni, 2014), to the best of our knowledge this has not been systematically explored for choosing referential forms.
In this paper, we collect and analyze a new corpus to address this issue. In the collection, we presented different writers with texts in which all references to the main topic of the text have been replaced with gaps. The task of the participants was to fill each of those gaps with a reference to the topic. In the analysis, we estimated to what extent different writers agree with each other in terms of normalized entropy. In addition, we study whether this variation depends on the text genre, comparing encyclopedic texts with news and product reports. Moreover, we discuss the implications of our findings for automatic text generation, exploring whether factors such as syntactic structure, referential status and recency affect the variation between the writers' choices. The annotated corpus is made publicly available.

Material
For our study, we used 36 English texts, equally distributed over three different genres: news texts, reviews of commercial products and encyclopedic texts. The encyclopedic texts were selected from the GREC corpus (Belz et al., 2010), which is a standard corpus for testing and evaluating models for choice of referential form. The news and review texts were selected from the AQUAINT-2 corpus and the SFU Review corpus (Konstantinova et al., 2012), respectively.
Note that, depending on the genre, texts may address different kinds of topics. For instance, the news texts are usually about a person, a company or a group; the product reviews may be about a book, a movie or a phone; and the encyclopedic texts about a mountain, a river or a country. In all texts, all expressions referring to the topic were replaced with gaps, which the participants had to fill in.

Participants
Participants were recruited through CrowdFlower. In total, 78 participants completed the survey, of whom 53 were female and 25 were male. Their average age was 37 years. Most were native speakers of English (73 participants); the remaining 5 were fluent in English.

Procedure
The participants were first presented with an introduction to the experiment, explaining the procedure and asking for their consent. Next, they were asked for their age, demographic information and English language proficiency. After this, participants were randomly assigned to a list containing 9 texts (3 per genre).
The task of the participants was to fill in each gap with a reference to the topic of the text. To inform the participants about the entities, a short description, extracted from the Wikipedia page about the topic, was provided before each text.
Participants were encouraged to fill in the gaps according to their preferences, so that they felt the texts would be easy to understand. We made sure that participants did not fill all the gaps in a text with only one referring expression (to avoid copy/paste behaviour). Participants could also not leave any gap empty (they were instructed to use the "-" symbol for empty references).
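The two constraints on submissions described above can be sketched as a simple validity check. This is only an illustrative sketch, not the survey's actual implementation: the function name `is_valid_submission` is hypothetical, and the case-insensitive comparison of answers is an assumption.

```python
def is_valid_submission(answers):
    """Check the answers one participant gave for the gaps of one text.

    Hypothetical helper: the source only states the two constraints,
    not how they were enforced.
    """
    cleaned = [answer.strip() for answer in answers]
    # Every gap must be filled; "-" is the agreed marker for an empty reference.
    if any(answer == "" for answer in cleaned):
        return False
    # Reject texts where every gap holds the same expression (copy/paste).
    # Comparing case-insensitively is an assumption of this sketch.
    if len(cleaned) > 1 and len({answer.lower() for answer in cleaned}) == 1:
        return False
    return True
```

For example, `["Anschutz", "he", "-"]` passes the check, whereas `["he", "he", "he"]` and `["he", ""]` do not.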

Annotation
The first author of this study annotated the referring expressions produced by participants for referential form, syntactic position, referential status, and recency. Coding was straightforward, and the few difficult cases were resolved in discussions between the co-authors.
Following the GREC Project scheme (Belz et al., 2010), referring expressions were annotated for three syntactic positions: subject noun phrases, object noun phrases, and genitive noun phrases that function as determiners (e.g., "Google's stock"). Referential status refers to whether a referring expression is a first mention of the topic (new) or not (old). We annotated this at the level of the text, paragraph and sentence, so that a reference can be new in the paragraph, but old in the text. Recency, finally, is the distance between a given referring expression and the most recent previous reference to the same topic, measured in number of words within a paragraph. If the referring expression was the first mention of the topic in the paragraph, its recency is set to 0.
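The recency annotation described above can be illustrated with a minimal sketch. It assumes a paragraph that is already tokenized into words, with each reference to the topic represented by a single placeholder token; the placeholder `<REF>` and the function name `recencies` are hypothetical, and the exact counting convention used in the annotation is not specified in the text.

```python
REF = "<REF>"  # hypothetical placeholder marking a reference to the topic


def recencies(paragraph_tokens):
    """Return one recency value per topic reference in a tokenized paragraph."""
    values = []
    last_ref_index = None
    for index, token in enumerate(paragraph_tokens):
        if token != REF:
            continue
        if last_ref_index is None:
            # First mention of the topic in the paragraph: recency is 0.
            values.append(0)
        else:
            # Assumption: count the words strictly between the two references.
            values.append(index - last_ref_index - 1)
        last_ref_index = index
    return values


tokens = (
    "<REF> was founded in 1998 and today <REF> employs thousands "
    "while <REF> keeps growing"
).split()
print(recencies(tokens))  # [0, 6, 3]
```

Under this counting convention, two directly adjacent references would also receive a value of 0, so a real annotation tool would need a way to keep them apart from first mentions.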
In total, 10,977 referring expressions were collected.

Analysis
We measured variation between participants' choices for each gap, using the normalized entropy measure, defined in Equation 1, where X corresponds to the references in a given gap, and n = 5 is the number of referential forms annotated.
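For reference, the standard definition of normalized entropy that matches this description is given below. It is written under the assumption that p(x_i) denotes the proportion of references in gap X that take the i-th referential form, and that terms with p(x_i) = 0 contribute 0 to the sum.

```latex
H(X) = -\sum_{i=1}^{n} \frac{p(x_i)\,\log p(x_i)}{\log n}
\qquad (1)
```

Dividing by log n rescales the entropy so that a uniform distribution over the n forms yields exactly 1.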
The measure ranges from 0 to 1, where 0 indicates complete agreement among the participants on a particular referential form, and 1 indicates complete variation among their choices. Figure 1 presents the main result, depicting the amount of individual variation in referential forms, measured in terms of entropy, as a function of text genre. The averaged entropies are significantly higher than 0 for all three genres according to a Wilcoxon signed-rank test (News: V = 20,910.0, p < .001; Reviews: V = 11,476.0, p < .001; and Encyclopedic texts: V = 10,153.0, p < .001). This clearly shows that different writers can vary substantially in their choices for a referential form. Comparing the three different genres, we find that writers' choices of referential form varied most in review texts and least in news texts, with encyclopedic texts falling in between (Kruskal-Wallis H = 70.73, p < .001).
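The normalized entropy measure described above can be sketched in Python as follows. This is a minimal illustration rather than the analysis code used in the study; the function name and the form labels in the example are hypothetical, and only n = 5 is taken from the text.

```python
import math
from collections import Counter

N_FORMS = 5  # number of annotated referential forms, as stated in the text


def normalized_entropy(forms, n_forms=N_FORMS):
    """Entropy of the forms chosen for one gap, divided by log(n_forms)."""
    total = len(forms)
    if total == 0:
        raise ValueError("a gap needs at least one response")
    counts = Counter(forms)
    # Forms that were never chosen contribute nothing to the sum.
    entropy = sum(
        (count / total) * math.log(total / count) for count in counts.values()
    )
    return entropy / math.log(n_forms)


# Hypothetical form labels, used only for illustration.
full_agreement = ["pronoun"] * 4
full_variation = ["name", "pronoun", "description", "demonstrative", "empty"]
```

Here `normalized_entropy(full_agreement)` is 0, since all writers chose the same form, and `normalized_entropy(full_variation)` is 1 up to floating-point rounding, since the five forms are used equally often.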

Results
In comparison with the original texts, 44% of the referring expressions produced by the writers differ from the original ones in the same referential gap. Furthermore, the form of the original referring expressions differs from the majority choice of the writers in 38% of the referential gaps.
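The two percentages above can be computed along the following lines. This sketch assumes that the comparison is made on referential form and that each gap is stored as a pair of the original form and the list of forms chosen by participants; the data layout, function names and toy data are hypothetical.

```python
from collections import Counter


def majority_form(forms):
    """Most frequent form in one gap (ties go to the form seen first)."""
    return Counter(forms).most_common(1)[0][0]


def compare_with_original(gaps):
    """gaps: list of (original_form, participant_forms) pairs.

    Returns two shares: individual choices that differ from the original,
    and gaps whose majority form differs from the original.
    """
    total_choices = sum(len(forms) for _, forms in gaps)
    differing_choices = sum(
        sum(1 for form in forms if form != original) for original, forms in gaps
    )
    differing_gaps = sum(
        1 for original, forms in gaps if majority_form(forms) != original
    )
    return differing_choices / total_choices, differing_gaps / len(gaps)


# Toy data, not taken from the corpus.
toy_gaps = [
    ("pronoun", ["pronoun", "pronoun", "name", "pronoun"]),
    ("description", ["name", "name", "description", "pronoun"]),
]
choice_share, gap_share = compare_with_original(toy_gaps)
print(choice_share, gap_share)  # 0.5 0.5
```

In the toy data, 4 of the 8 individual choices differ from the original form, and the majority form differs from the original in 1 of the 2 gaps.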
To get a better understanding of factors potentially influencing individual variation, we investigate the effects of three linguistic factors: syntactic position, referential status and recency. Figure 2 depicts the average entropies for each of these.
Comparing the three syntactic positions, Figure 2a suggests that the highest variation is found when writers need to choose referential forms in the object position of a sentence, whereas the lowest variation is found for references that function as a genitive noun phrase determiner (Kruskal-Wallis H = 52.53, p < .001). Figure 2b depicts individual variation in the choice of referential form for old and new references in the text, paragraph and sentence. The data suggest a higher amount of individual variation when writers need to refer to a topic already mentioned in the text rather than a first mention (Mann-Whitney U = 3,916.0, p < .001), presumably because for a topic which is new in the text, writers were more likely to agree to use proper names (91% of the choices). Looking at old and new references within paragraphs reveals no significant differences in individual variation (Mann-Whitney U = 32,669.5, p < .094). At the sentence level, finally, there is more individual variation for references to a new topic than for references to a previously mentioned one (Mann-Whitney U = 21,873.0, p < .001). When writers referred to a previously mentioned referent in the sentence, they tended to agree on the use of a pronoun (76% of the choices). Figure 2c shows the individual variation in referential form as a function of recency. Except for the relatively nearby intervals (between 0 and 10 words, and between 11 and 20 words), the data suggest that when the distance between two consecutive references gets larger, the variation among writers increases.

Discussion
In this paper, we studied individual variation in the choice of referential form by collecting a new (and publicly available) dataset in which different participants (writers) were asked to refer to the same referent throughout a text. This was done for different genres (news, product review and encyclopedic texts) by measuring the variation between participants in terms of normalized entropy. If all participants used the same referential form in the same gap, we would expect entropy values of 0 (no individual variation), but instead we found a clearly different pattern in all three text genres. Moreover, we also saw a considerable difference in form between the original referring expressions and the ones produced by the participants. This reveals that substantial individual variation between writers exists in terms of referential form.
To get a better understanding of which factors influence individual variation, we analysed to what extent three linguistic factors had an impact on the entropy scores: syntactic position, referential status and recency. We found a higher amount of individual variation for references in the direct object position, for references to topics previously mentioned in the text, for first mentions within a sentence, and for references that were relatively distant from the most recent antecedent reference to the same topic.
These findings can be related to theories of reference involving the salience of a referent (Gundel et al., 1993; Grosz et al., 1995, among others). Brennan (1995), for example, argued that references in the role of the subject of a sentence are more likely to be salient than references in the role of the object. Chafe (1994), to give a second example, pointed out that references to previously mentioned referents in the discourse and ones that are close to their antecedent are more likely to be salient than references to new referents or ones that are distant from their antecedents. Note, incidentally, that none of these earlier studies address the issue of individual variation in referential form.
Arguably, the amount of individual variation is even larger than the data reported here suggest. To illustrate this, consider that different participants referred to Phillip Frederick Anschutz, the main topic of one of the texts used, as Phillip Frederick Anschutz, Mr. Phillip Frederick Anschutz, Anschutz, Mr. Anschutz and Phillip Anschutz. Even though these all have the same referential form (proper names), there is also a lot of variation within this category. Indeed, it would be interesting in future research to explore which factors account for this within-form variation.
The current findings are important for automatic text generation algorithms in two ways. First, they are beneficial for developers of text generation systems, since they allow for a better understanding of the range of variation that is possible in referring expression generation. Second, they allow for a more principled evaluation of algorithms predicting referential form. In fact, the collected corpus paves the way for developing models which predict frequency distributions over referential forms, rather than merely predicting a single form in a particular context (as current models do).