# Fabbri 2020

This dataset contains expert and Turker annotations for model summaries of the CNN/DailyMail dataset, as collected in [1]. The setup command saves the summaries and references for all of the systems, along with their corresponding annotations and input documents. See this GitHub repository for more details.

```bash
sacrerouge setup-dataset fabbri2020 <output-dir>
```

The output files are the following:

- `summaries.jsonl`: The model output summaries with their input documents and the ground-truth references
- `summaries-with-crowd.jsonl`: The model output summaries with their input documents, the ground-truth references, and ten crowdsourced references
- `metrics.jsonl`: The expert and Turker annotations that correspond to `summaries.jsonl` and `summaries-with-crowd.jsonl`
- `all-summaries-preproc-refs.jsonl.gz`: All of the model outputs across the entire CNN/DM test dataset. The corresponding reference is maintained for each model output; it is a preprocessed version of the original reference that appears in `summaries.jsonl`. That is, the outputs are grouped by `instance_id`, but each `instance_id` may have several different references due to differences in model preprocessing.
- `all-summaries-orig-refs.jsonl.gz`: All of the model outputs across the entire CNN/DM test dataset. This version uses the documents and references as extracted by the Hugging Face CNN/DM scripts, so the documents and references should be identical across the same `instance_id`.

For `all-summaries-preproc-refs.jsonl.gz` and `all-summaries-orig-refs.jsonl.gz`, the aligned system outputs contain duplicate instances. We only keep the first occurrence of any instance and ensure that the summary which was judged is selected.
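As a rough sketch of how the judged summaries and their annotations can be joined (the field names `instance_id`, `summarizer_id`, and `metrics` follow the standard SacreROUGE jsonl format, but verify them against your copy of the data):

```python
import json

output_dir = 'fabbri2020'  # wherever setup-dataset wrote its files

# Each line of a SacreROUGE jsonl file is a standalone JSON object.
with open(f'{output_dir}/summaries.jsonl') as f:
    summaries = [json.loads(line) for line in f]
with open(f'{output_dir}/metrics.jsonl') as f:
    annotations = [json.loads(line) for line in f]

# Index the annotations by (instance_id, summarizer_id) and align them
# to the summaries they judge.
scores = {(m['instance_id'], m['summarizer_id']): m['metrics'] for m in annotations}
for summary in summaries:
    key = (summary['instance_id'], summary['summarizer_id'])
    print(key, scores.get(key))
```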

Notes:

- The raw data does not identify which reference summary is the original ground-truth reference, but after checking a handful of instances, it appears to always be the first reference in the list (this has since been confirmed). That first reference is the one included in `summaries.jsonl`.
- To make the crowd summaries distinct, each is given a `summarizer_id` of `turker-` followed by a number from 1 to 10. The summaries identified by `turker-i` were not necessarily all written by the same person and should not be treated as such. (See the sketch after this list.)
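For example, to separate the original ground-truth reference from the crowd references in `summaries-with-crowd.jsonl` (this assumes each entry in the `references` list is a dict carrying the `summarizer_id` described above; adjust if your copy stores references differently):

```python
import json

with open('fabbri2020/summaries-with-crowd.jsonl') as f:  # path from setup-dataset
    for line in f:
        instance = json.loads(line)
        references = instance['references']
        # Per the first note above, the original ground-truth reference
        # is the first entry in the list.
        original = references[0]
        # The ten crowdsourced references carry summarizer_ids of the
        # form 'turker-1' through 'turker-10'.
        crowd = [r for r in references
                 if str(r.get('summarizer_id', '')).startswith('turker-')]
```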

## Correlations

Here are the correlations of some of the metrics implemented in this library to the responsiveness scores in this dataset. In each table, r, p, and k denote Pearson's r, Spearman's ρ, and Kendall's τ, respectively.
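A table like the ones below can be regenerated with SacreROUGE's `correlate` command. The sketch below follows the flag names in the library's correlation docs; the paths and the exact metric keys (e.g., the name under which responsiveness is stored in `metrics.jsonl`) are placeholders to adapt to your setup:

```bash
sacrerouge correlate \
  --metrics-jsonl-files <output-dir>/metrics.jsonl \
  --metrics <responsiveness-key> <metric-key> \
  --summarizer-type peer \
  --output-file correlations.json
```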

### Single-reference, summary-level

| Fabbri2020 | r | p | k |
|---|---|---|---|
| R1-P | 0.13 | 0.12 | 0.09 |
| R1-R | 0.31 | 0.28 | 0.23 |
| R1-F1 | 0.28 | 0.26 | 0.20 |
| R2-P | 0.15 | 0.13 | 0.09 |
| R2-R | 0.26 | 0.23 | 0.18 |
| R2-F1 | 0.23 | 0.19 | 0.14 |
| BERTScore-P | 0.17 | 0.17 | 0.13 |
| BERTScore-R | 0.37 | 0.35 | 0.27 |
| BERTScore-F1 | 0.29 | 0.28 | 0.22 |
| MoverScore | 0.28 | 0.24 | 0.18 |
| QAEval-EM | 0.23 | 0.23 | 0.19 |
| QAEval-F1 | 0.30 | 0.29 | 0.22 |

### Single-reference, system-level

| Fabbri2020 | r | p | k |
|---|---|---|---|
| R1-P | 0.29 | 0.15 | 0.03 |
| R1-R | 0.55 | 0.56 | 0.42 |
| R1-F1 | 0.61 | 0.62 | 0.50 |
| R2-P | 0.49 | 0.41 | 0.25 |
| R2-R | 0.65 | 0.78 | 0.57 |
| R2-F1 | 0.64 | 0.60 | 0.43 |
| BERTScore-P | 0.18 | 0.11 | 0.02 |
| BERTScore-R | 0.84 | 0.91 | 0.75 |
| BERTScore-F1 | 0.54 | 0.40 | 0.28 |
| MoverScore | 0.56 | 0.54 | 0.42 |
| QAEval-EM | 0.80 | 0.91 | 0.77 |
| QAEval-F1 | 0.82 | 0.91 | 0.77 |

### Multi-reference, summary-level

| Fabbri2020 | r | p | k |
|---|---|---|---|
| R1-P | 0.13 | 0.14 | 0.10 |
| R1-R | 0.33 | 0.29 | 0.23 |
| R1-F1 | 0.36 | 0.33 | 0.25 |
| R2-P | 0.20 | 0.21 | 0.16 |
| R2-R | 0.34 | 0.31 | 0.24 |
| R2-F1 | 0.33 | 0.29 | 0.22 |
| BERTScore-P | 0.18 | 0.19 | 0.14 |
| BERTScore-R | 0.42 | 0.38 | 0.29 |
| BERTScore-F1 | 0.31 | 0.31 | 0.24 |
| MoverScore | 0.33 | 0.27 | 0.21 |
| QAEval-EM | 0.33 | 0.29 | 0.22 |
| QAEval-F1 | 0.40 | 0.35 | 0.27 |

### Multi-reference, system-level

| Fabbri2020 | r | p | k |
|---|---|---|---|
| R1-P | 0.03 | 0.08 | 0.02 |
| R1-R | 0.38 | 0.30 | 0.23 |
| R1-F1 | 0.55 | 0.77 | 0.58 |
| R2-P | 0.34 | 0.26 | 0.13 |
| R2-R | 0.41 | 0.29 | 0.23 |
| R2-F1 | 0.57 | 0.64 | 0.43 |
| BERTScore-P | 0.13 | 0.14 | 0.05 |
| BERTScore-R | 0.80 | 0.85 | 0.70 |
| BERTScore-F1 | 0.41 | 0.48 | 0.38 |
| MoverScore | 0.46 | 0.36 | 0.30 |
| QAEval-EM | 0.60 | 0.58 | 0.43 |
| QAEval-F1 | 0.62 | 0.65 | 0.48 |
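To make the summary-level versus system-level distinction above concrete, here is an illustrative sketch of the two computations (not SacreROUGE's own implementation), using Pearson's r as the example; human and metric scores are arranged as `(num_instances, num_systems)` matrices:

```python
import numpy as np
from scipy.stats import pearsonr

def summary_level(human: np.ndarray, metric: np.ndarray) -> float:
    # Correlate the systems' scores within each input document,
    # then average the per-document correlations.
    return float(np.mean([pearsonr(h, m)[0] for h, m in zip(human, metric)]))

def system_level(human: np.ndarray, metric: np.ndarray) -> float:
    # Average each system's scores over all documents, then
    # correlate the per-system means.
    return float(pearsonr(human.mean(axis=0), metric.mean(axis=0))[0])
```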

## References

[1] Alexander R. Fabbri, Wojciech Kryscinski, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. "SummEval: Re-evaluating Summarization Evaluation". 2020.