One of the first things that struck me when I started reading the literature on Referring Expression Generation (GRE) algorithms was that there is no universal definition of what constitutes a "good" referring expression. The closest thing to a common denominator among the goals of the different algorithms is that the descriptions they produce should be distinguishing in context: a description should identify the intended referent and no other object.
Apart from that, the algorithm designer is pretty much free to decide what the descriptions the algorithm produces look like. The two most common goals are to generate referring expressions that are as short as possible or as naturalistic as possible. Immediately, the problem of determining what "naturalistic" means rears its ugly head, and it almost never seems to mean the same thing from one approach to the next.
In other CL fields such as Machine Translation, there are clear assessment metrics that compare the performance of a new system to some gold standard, which usually reflects how humans perform the same task. In GRE there is no such thing, and most algorithms have never been tested against real data. This is especially true of the older algorithms, some of which have by now established themselves as the foundations of most new approaches to GRE.
Prompted by this observation, the first thing I did for my PhD was to assess the output of some of these more established algorithms against a data set of human-produced referring expressions. The data set consists of descriptions of drawers in a 4x4 grid of filing cabinets. Here is a picture of the filing cabinets, along with a few sample descriptions:
d10: The blue drawer in the third row
d1: The blue drawer in the top left corner
d11: The yellow drawer that's the second from the bottom in the filing cabinet that's the second from the right
d12: The orange drawer below the pink drawer
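The notion of a description being "distinguishing in context" can be made concrete with a toy version of the drawer domain. This is only a minimal sketch: the attribute names and grid positions below are my own simplified illustration, not the actual annotation scheme of the Drawer Data set.

```python
# A toy drawer domain: each drawer is a set of attribute-value pairs.
# (Colours and positions are illustrative, not taken from the real data.)
domain = {
    "d1":  {("colour", "blue"),   ("row", 1), ("col", 1)},
    "d10": {("colour", "blue"),   ("row", 3), ("col", 2)},
    "d11": {("colour", "yellow"), ("row", 3), ("col", 3)},
    "d12": {("colour", "orange"), ("row", 2), ("col", 4)},
}

def is_distinguishing(description, referent, domain):
    """A description distinguishes its referent if the referent is the
    only object in the domain whose attributes satisfy the description."""
    matches = [obj for obj, attrs in domain.items()
               if description <= attrs]  # subset test: all pairs must hold
    return matches == [referent]

# "The blue drawer" is ambiguous between d1 and d10 ...
print(is_distinguishing({("colour", "blue")}, "d10", domain))              # False
# ... but adding the row makes it distinguishing.
print(is_distinguishing({("colour", "blue"), ("row", 3)}, "d10", domain))  # True
```

Most classic GRE algorithms can be viewed as different search strategies for finding a description that passes this kind of test, differing mainly in which attributes they try first and when they stop.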
I presented a paper about this work at INLG06:
Jette Viethen and Robert Dale (2006). Algorithms for generating referring expressions: Do they do what people do? In Proceedings of the 4th International Conference on Natural Language Generation, 63-70, Sydney, Australia. [ pdf | slides ]
You can also download the complete Drawer Data set, including pictures of the drawer domain.