NLP EVALUATION
Fair and adequate evaluation and comparison are of fundamental importance to the NLP community for properly tracking progress, especially during the current deep learning revolution, in which new state-of-the-art results are reported at ever shorter intervals. This concerns designing adequate metrics for evaluating performance in high-level text generation tasks such as question and dialogue generation, summarization, machine translation, image captioning, and poetry generation; properly evaluating word and sentence embeddings; and rigorously determining whether, and under which conditions, one system is better than another. Desirable properties of such metrics include (i) high correlation with human judgments; (ii) the ability to distinguish high-quality outputs from mediocre or low-quality ones; (iii) robustness across input and output sequence lengths; and (iv) speed. A further goal is cross-domain metrics that can reliably and robustly measure the quality of system outputs from heterogeneous modalities (e.g., image and text).
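Property (i) above is typically checked by correlating a metric's scores with human ratings over the same outputs, using Pearson or Spearman correlation. The following is a minimal sketch of that check; the metric scores and human ratings below are invented for illustration, and ties in the rank computation are ignored for simplicity.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(xs):
    """1-based ranks of the values in xs (ties not handled)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: automatic metric scores and human ratings
# for five system outputs (purely illustrative values).
metric_scores = [0.82, 0.45, 0.67, 0.91, 0.30]
human_ratings = [4.5, 2.0, 3.5, 4.8, 1.5]

print(f"Pearson r:    {pearson(metric_scores, human_ratings):.3f}")
print(f"Spearman rho: {spearman(metric_scores, human_ratings):.3f}")
```

Pearson captures linear agreement with human scores, while Spearman only requires the metric to rank outputs in the same order as humans do, which is often the weaker but more relevant requirement for system comparison.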