With the first experiment I carried out, I inadvertently stumbled upon what turned out to be the hot topic in NLG: System Evaluation. At INLG2006 in Sydney, a special session on evaluation gave me an idea of the different positions that exist in the research community on whether and how NLG systems should be subjected to common evaluation schemes, metrics or even competitions, as is usual in other fields of NLP such as Machine Translation, Parsing, Text Summarisation, and Question Answering.
Opinions on these questions differed widely, with some people pushing for a shared task evaluation competition (STEC) for NLG (e.g. Belz & Kilgarriff 2006 and Reiter & Belz 2006), and others calling for caution in view of the special characteristics of NLG and the diversity of application domains and purposes (e.g. Paris et al. 2006 and Scott & Moore 2006).
There is no question that most, if not all, NLG tasks, including REG, have a number of characteristics that make comparative evaluation schemes and automatic metrics problematic. These issues recur throughout the recent literature on the topic, much of which can be found in this small Bibliography on Evaluation of Referring Expression Generation and NLG.
The most prominent issues arising for the automatic comparative evaluation of REG systems include:
the lack of a single definitive gold standard of human language use against which to compare system output;
the lack of a clear definition of the expected system output, in terms of both its linguistic level and the number of descriptions per object;
the danger of stifling research by concentrating too many of the few NLG researchers on the same subtask;
the intricate interplay between the underlying knowledge representation and the actual algorithm running each system.
In 2009, the REG tasks were part of the wider Generation Challenges 2009, which also included the GIVE Challenge for instruction giving in virtual environments.
my related resources
I have compiled a small Bibliography on Evaluation of Referring Expression Generation and NLG that lists position papers and other recent literature, as well as a few older papers, with no claim to completeness or currency.
Have a look at my "Do They Do What People Do?" experiment evaluating three existing REG algorithms against a small corpus of human-produced descriptions, which you are welcome to download and use for your own evaluations.
I have collected two more data sets of referring expressions, the GRE3D3 and GRE3D7 corpora, in the context of my work on spatial relations in referring expressions, which I'm happy for people to use.
my related writings
Jette Viethen and Robert Dale (2006). Towards the evaluation of referring expression generation. In Proceedings of the 4th Australasian Language Technology Workshop, Sydney, Australia. [ pdf | poster ]
Jette Viethen and Robert Dale (2007). Evaluation in natural language generation: lessons from referring expression generation. In Traitement Automatique des Langues 48(1), special edition "Principles of Evaluation in Natural Language Processing". [ pdf ]
Jette Viethen (2007). Automatic evaluation of referring expression generation is possible. Position paper for the Workshop on Shared Tasks and Comparative Evaluation in Natural Language Generation, Arlington VA, USA. [ pdf | slides ]
Jette Viethen (2007). Referring expressions: from the good enough to the most useful. Position paper presented at the Research Workshop on Generating Referring Expressions that Are Optimal for Hearers, University of Aberdeen, Scotland. [ pdf | slides ]
Mariët Theune, Pascal Touset, Jette Viethen, and Emiel Krahmer (2007). Cost-based attribute selection for GRE (GRAPH-SC/ GRAPH-FP). In Proceedings of the UCNLG+MT Workshop at MT Summit XI, Copenhagen, Denmark. [ pdf ]
Emiel Krahmer, Mariët Theune, Jette Viethen, and Iris Hendrickx (2008). GRAPH: The costs of redundancy in referring expressions. In Proceedings of the 5th International Conference on Natural Language Generation, Salt Fork OH, USA. [ pdf | poster ]
Anja Belz, Eric Kow, Jette Viethen and Albert Gatt (2008). The GREC Challenge: Overview and Evaluation Results. In Proceedings of the 5th International Conference on Natural Language Generation, Salt Fork OH, USA. [ pdf ]