👀 What?

This repository contains code for the $d_{HM}$ evaluation method proposed in:
"Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition", in Findings of EMNLP 2024.

Note: Despite being proposed specifically for visual storytelling, this method is generalizable and can be extended to any task involving model-generated outputs with corresponding references.

🤔 Why?

$d_{HM}$ enables human-centric evaluation of model-generated stories along different dimensions important for visual story generation.

🤖 How?

$d_{HM}$ combines three reference-free evaluation metrics—GROOViST¹ (for visual grounding), RoViST-C² (for coherence), and RoViST-NR² (for non-redundancy/repetition)—by computing the average of absolute metric-level deviations between human stories and corresponding model generations.
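
As a rough illustration of the aggregation described above, here is a minimal sketch. It is not the code in dHM.py: the function name `d_hm`, the dictionary layout, and the choice to take deviations per story (rather than on dataset-level averages) are all assumptions made for this example.

```python
from statistics import mean

# Metric keys used in this README: G (grounding), C (coherence), R (repetition).
METRICS = ("G", "C", "R")


def d_hm(human, model):
    """Hypothetical helper: average absolute metric-level deviation.

    `human` and `model` map each metric key to a list of per-story scores,
    aligned so that index i refers to the same image sequence in both.
    Returns (aggregate distance, per-metric distances).
    """
    per_metric = {}
    for metric in METRICS:
        h_scores, m_scores = human[metric], model[metric]
        if len(h_scores) != len(m_scores):
            raise ValueError(f"score lists for {metric!r} differ in length")
        # Assumption: deviation is taken story by story, then averaged.
        per_metric[metric] = mean(abs(h - m) for h, m in zip(h_scores, m_scores))
    # The aggregate value is the plain average over the three metrics.
    return mean(per_metric.values()), per_metric
```

Under this reading, a lower value means the model's stories score closer to the human stories on all three dimensions; the per-metric values correspond to $d_{HM}^G$, $d_{HM}^C$, and $d_{HM}^R$.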

Setup

Install Python (e.g., version 3.11) and the dependencies listed in requirements.txt, e.g., using:
pip install -r requirements.txt

Step 0: Generate stories

For generating stories using the models and settings proposed in this work, refer to this documentation.

Step 1A: Compute metric-level scores for human stories

To compute visual grounding scores (G), check out the GROOViST repository.

To compute coherence (C) and repetition (R) scores, use the following utility, adapted from RoViST. E.g.,
python evaluate/eval_C_R.py -i ./data/stories/vist/gt_test.json -o ./data/scores/vist/gt_test

Note 1: Download the pre-trained ALBERT model from here and place it under the data/ folder.

Note 2: The requirements for this utility differ; check out the evaluate/requirements file.

Step 1B: Compute metric-level scores for model-generated stories

Follow the same procedure as in Step 1A, using the model-generated stories as input.

Step 2: Evaluate using $d_{HM}$

To obtain aggregate $d_{HM}$ values along with the corresponding metric-level distances ($d_{HM}^G, d_{HM}^C, d_{HM}^R$), use the following utility. E.g.,
python dHM.py -d VIST

🔗 If you find this work useful, please consider citing it:

@inproceedings{
   EMNLP 2024 Findings (to appear) 
}

Footnotes

¹ https://aclanthology.org/2023.emnlp-main.202/
² https://aclanthology.org/2022.findings-naacl.206/
