Studying Summarization Evaluation Metrics in the Appropriate Scoring Range

In summarization, automatic evaluation metrics are usually compared based on their ability to correlate with human judgments. Unfortunately, the few existing human judgment datasets were created as by-products of the manual evaluations performed during the DUC/TAC shared tasks. However, modern systems are typically better than the best systems submitted at the time of these shared tasks. We show that, surprisingly, evaluation metrics which behave similarly on these datasets (average-scoring range) strongly disagree in the higher-scoring range in which current systems now operate. This is problematic: the metrics disagree, yet existing human judgments do not cover this range, so we cannot decide which one to trust. This is a call for collecting human judgments for high-scoring summaries, as this would resolve the debate over which metrics to trust. It would also greatly benefit the further improvement of summarization systems and metrics alike.
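To make the comparison described in the abstract concrete, the following is a minimal sketch of how agreement between two metrics could be measured over all summaries and again over only the highest-scoring ones. Everything here is an assumption for illustration: the helper names (`kendall_tau`, `agreement_by_range`), the choice of Kendall's tau as the agreement measure, the 10% cutoff, and the synthetic scores. It does not reproduce the authors' data, metrics, or procedure.

```python
import random
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


def agreement_by_range(metric_a, metric_b, top_fraction=0.1):
    """Return (tau over all summaries, tau over the top-scoring subset)."""
    # Hypothetical selection rule: rank summaries by the sum of both metrics.
    order = sorted(range(len(metric_a)),
                   key=lambda i: metric_a[i] + metric_b[i], reverse=True)
    top = order[:max(2, int(len(order) * top_fraction))]
    tau_all = kendall_tau(metric_a, metric_b)
    tau_top = kendall_tau([metric_a[i] for i in top],
                          [metric_b[i] for i in top])
    return tau_all, tau_top


# Synthetic scores only: two noisy views of one hidden "quality" value.
rng = random.Random(0)
quality = [rng.random() for _ in range(400)]
metric_a = [q + rng.gauss(0, 0.1) for q in quality]
metric_b = [q + rng.gauss(0, 0.1) for q in quality]

tau_all, tau_top = agreement_by_range(metric_a, metric_b)
print(f"tau (all summaries): {tau_all:.2f}")
print(f"tau (top 10%):       {tau_top:.2f}")
```

In this toy setup, agreement drops in the top range simply because the noise outweighs the small remaining spread in quality. Whether real metrics disagree for the same reason is not something the sketch can show.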

