How NOT to Say It: On the need for human evaluation of what to AVOID in realization ranking

Abstract for INLG-06 Open Mic Session

Michael White
Dept. Linguistics, The Ohio State University
http://www.ling.ohio-state.edu/~mwhite/

Surface realization is a natural task for comparative evaluation. During the open mic session, I will argue that a comparative evaluation of different approaches to surface realization should involve human judgements of quality (adequacy and fluency) in context, and furthermore, should focus on the extent to which different realizers can rank possible outputs in accord with human judgements of acceptability and unacceptability. I will suggest that efforts that evaluate quality solely based on string accuracy or BLEU scores, such as (Langkilde-Geary, 2002) and (Velldal and Oepen, 2005), do not give us an adequate picture of their effectiveness.

In support of this position, I will cite (Callison-Burch, Osborne and Koehn, 2006), who caution against relying too heavily on BLEU even for MT research, where output quality is typically much lower than in NLG systems. I will also highlight the need to be able to generate a range of acceptable paraphrases, while avoiding problematic ones, in order to optimize the selection of what to say based on how well it is likely to be synthesized (Nakatsu and White, 2006); exhibit personality and alignment in dialogue (Isard, Brockmann and Oberlander, 2006); or produce more interestingly varied multimodal output (Stone et al., 2004; Foster and Oberlander, 2006).

References

Chris Callison-Burch, Miles Osborne and Philipp Koehn, 2006. Re-evaluating the Role of Bleu in Machine Translation Research. In Proc. EACL-06.

Mary Ellen Foster and Jon Oberlander, 2006. Data-driven generation of emphatic facial displays. In Proc. EACL-06.

Amy Isard, Carsten Brockmann and Jon Oberlander, 2006. Individuality and Alignment in Generated Dialogues. To appear in Proc. INLG-06.

Irene Langkilde-Geary, 2002. An Empirical Verification of Coverage and Correctness for a General-Purpose Sentence Generator. In Proc. INLG-02.

Crystal Nakatsu and Michael White, 2006. Learning to Say It Well: Reranking Realizations by Predicted Synthesis Quality. To appear in Proc. ACL-06.

Matthew Stone, Doug DeCarlo, Insuk Oh, Christian Rodriguez, Adrian Stere, Alyssa Lees and Chris Bregler, 2004. Speaking with hands: Creating Animated Conversational Characters from Recordings of Human Performance. ACM Transactions on Graphics 23(3) (SIGGRAPH).

Erik Velldal and Stephan Oepen, 2005. Maximum Entropy Models for Realization Ranking. In Proc. of the 10th MT-Summit (X).