🐛 Describe the bug
Hi folks,
Thanks again for your work on this library.
I noticed an issue where similarity scores do not get updated when I change my expected fields. Only when I re-run the experiment are the values updated.
Bug
Steps to reproduce:
from prompttools.experiment import OpenAIChatExperiment

models = ["gpt-3.5-turbo", "gpt-3.5-turbo-0613"]
messages = [
    [
        {"role": "system", "content": "Who is the first president of the US? Give me only the name"},
    ]
]
temperatures = [0.0]

experiment = OpenAIChatExperiment(models, messages, temperature=temperatures)
experiment.run()
experiment.visualize()
from prompttools.utils import semantic_similarity

experiment.evaluate("similar_to_expected", semantic_similarity, expected=["George Washington"] * 2)
experiment.visualize()
from prompttools.utils import semantic_similarity

experiment.evaluate("similar_to_expected", semantic_similarity, expected=["Lady Gaga"] * 2)
experiment.visualize()  # the scores shown here are unchanged, as if "Lady Gaga" were semantically identical to "George Washington"
In my opinion, evaluate() should re-compute metrics every time it is called, rather than being coupled to run(). I haven't tested other eval_fns, but it may be worth checking whether they behave the same way.
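As a rough sketch of the behavior I have in mind (using only the evaluate() call shown above; evaluate_fresh is a hypothetical helper, not part of prompttools), each call writes its scores into a fresh column so they are always recomputed:

# Hypothetical helper (not part of prompttools): evaluate into a new,
# suffixed metric column on every call so the scores are never skipped.
_eval_count = 0

def evaluate_fresh(exp, metric_name, eval_fn, **kwargs):
    global _eval_count
    _eval_count += 1
    exp.evaluate(f"{metric_name}_{_eval_count}", eval_fn, **kwargs)

evaluate_fresh(experiment, "similar_to_expected", semantic_similarity, expected=["Lady Gaga"] * 2)
experiment.visualize()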
Your observation is correct. Currently, if a metric already exists ("similar_to_expected" in your case), evaluate() emits a warning and skips re-computation (as seen in your notebook: "WARNING: similar_to_expected is already present, skipping") rather than overwriting it.
If you change the metric name in the second .evaluate call (e.g. experiment.evaluate("similar_to_expected_2", ...)), it will compute a new column, as shown below.
We are open to overwriting the existing metric in this situation. Let us know what you think.
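For example, under the current behavior the workaround would look something like this (a sketch reusing the snippet from above; the metric name similar_to_expected_2 is arbitrary):

from prompttools.utils import semantic_similarity

# Evaluating under a new metric name adds a new column with freshly computed
# scores, instead of hitting the "already present, skipping" warning.
experiment.evaluate("similar_to_expected_2", semantic_similarity, expected=["Lady Gaga"] * 2)
experiment.visualize()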