docs: Fix the precision calculation in context-precision metric doc #685

Merged

Conversation

amit-timalsina
Contributor

Precision@1 = 0/1 ≠ 1; it should be 0.
This PR fixes the calculation of Precision@1 and, consequently, the final result: $\text{Context Precision} = \frac{0 + 0.5}{2} = 0.25$


@shahules786 shahules786 merged commit 0b54876 into explodinggradients:main Feb 29, 2024
4 checks passed
joy13975 pushed a commit to joy13975/ragas that referenced this pull request Mar 4, 2024
…xplodinggradients#685)

![image](https://github.com/explodinggradients/ragas/assets/30175128/c92cc0ec-516b-4d62-b85a-2802dfafaa6a)

Co-authored-by: Amit Timalsina <amit@kniru.com>
shahules786 pushed a commit that referenced this pull request Mar 4, 2024
Even though @amit-timalsina fixed the Context Precision docs a few
days ago in #685, the description is still wrong in multiple ways.

1) The denominator is the total number of **relevant items**, not the
total number of items. That is, in the example calculation the
denominator needs to be 1, not 2, leading to a result of 0.5, not 0.25.
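
As a worked illustration of point 1, assuming the documentation example has two retrieved chunks of which only the second is relevant (so Precision@1 = 0 and Precision@2 = 0.5), dividing by the number of relevant items instead of the number of items gives:

$$
\text{Context Precision@2} = \frac{0 + 0.5}{\text{number of relevant items}} = \frac{0 + 0.5}{1} = 0.5
\qquad \text{rather than} \qquad \frac{0 + 0.5}{2} = 0.25
$$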

2) The formula for Context Precision@k does not reflect how the
calculation is actually done in code:

```python
def _calculate_average_precision(self, json_responses: t.List[t.Dict]) -> float:
    score = np.nan
    json_responses = [
        item if isinstance(item, dict) else {} for item in json_responses
    ]
    verdict_list = [
        int("1" == resp.get("verdict", "").strip())
        if resp.get("verdict")
        else np.nan
        for resp in json_responses
    ]
    denominator = sum(verdict_list) + 1e-10
    numerator = sum(
        [
            (sum(verdict_list[: i + 1]) / (i + 1)) * verdict_list[i]
            for i in range(len(verdict_list))
        ]
    )
    score = numerator / denominator
    if np.isnan(score):
        logger.warning(
            "Invalid response format. Expected a list of dictionaries with keys 'verdict'"
        )
    return score
```

In the code, each precision@k is weighted by the relevance indicator of the item at rank k:
```python
     (sum(verdict_list[: i + 1]) / (i + 1)) * verdict_list[i]
     for i in range(len(verdict_list))
```

Without this weighting, results greater than 1 would be possible, e.g.
verdict_list = [1,0] would give (1/1 + 1/2) / 1 = 1.5.
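
The effect of the weighting can be checked in isolation with a minimal sketch. The helper `average_precision` and its `weighted` flag are hypothetical names introduced here for illustration, not part of ragas; unlike the library code, the sketch returns 0.0 explicitly when nothing is relevant instead of adding `1e-10` to the denominator.

```python
from typing import List


def average_precision(verdicts: List[int], weighted: bool = True) -> float:
    """Average precision over binary relevance verdicts (1 = relevant).

    Hypothetical helper for illustration only; it is not the ragas API.
    """
    relevant = sum(verdicts)
    if relevant == 0:
        # Assumption: treat "no relevant items" as a score of 0.0.
        return 0.0
    total = 0.0
    for k in range(1, len(verdicts) + 1):
        precision_at_k = sum(verdicts[:k]) / k
        # Weighted: only ranks that hold a relevant item contribute.
        total += precision_at_k * (verdicts[k - 1] if weighted else 1)
    return total / relevant


weighted_score = average_precision([1, 0])
unweighted_score = average_precision([1, 0], weighted=False)
print(weighted_score, unweighted_score)  # 1.0 1.5
```

With the relevance weight the score stays within [0, 1]; dropping it lets the irrelevant second rank add Precision@2 = 0.5 to the numerator, which pushes the result to 1.5.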


You can easily verify this with the following example:
```python
from ragas.metrics import ContextPrecision
from datasets import Dataset
from ragas import evaluate

import os
# Placeholder: supply your own key here; never commit a real API key.
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"
os.environ["RAGAS_DO_NOT_TRACK"] = "true"

context_precision = ContextPrecision()

questions = [
    "Where is France and what is it’s capital?",
    "Where is France and what is it’s capital?",
    "Where is France and what is it’s capital?",
    "Where is France and what is it’s capital?", 
  ]

ground_truths = [
  "France is in Western Europe and its capital is Paris.",
  "France is in Western Europe and its capital is Paris.",
  "France is in Western Europe and its capital is Paris.",
  "France is in Western Europe and its capital is Paris.",

  ]

contexts =[
    [   # all bad
        "The country is also renowned for its wines and sophisticated cuisine. Lascaux’s ancient cave drawings, Lyon’s Roman theater and", 
        "The country is also renowned for its wines and sophisticated cuisine. Lascaux’s ancient cave drawings, Lyon’s Roman theater and", 
    ],
    [ # wrong order
        "The country is also renowned for its wines and sophisticated cuisine. Lascaux’s ancient cave drawings, Lyon’s Roman theater and", 
        "France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower",
    ],
    [ # right order
        "France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower",
        "The country is also renowned for its wines and sophisticated cuisine. Lascaux’s ancient cave drawings, Lyon’s Roman theater and", 

    ],
    [ # all good
        "France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower",
        "France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower",
    ]
]

data = {
    "question": questions,
    "contexts": contexts,
    "ground_truth": ground_truths
}

# Convert dict to dataset
dataset = Dataset.from_dict(data)
dataset


result = evaluate(
    dataset = dataset, 
    metrics=[
        context_precision,
    ],
)


print(result.to_pandas())
``` 
```
                                   question  ... context_precision
0  Where is France and what is it’s capital?  ...               0.0
1  Where is France and what is it’s capital?  ...               0.5
2  Where is France and what is it’s capital?  ...               1.0
3  Where is France and what is it’s capital?  ...               1.0
```
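
The reported scores can also be reproduced offline, without an LLM call, by applying the same arithmetic as the quoted method to the verdicts one would expect for each scenario. This is only a sketch: the verdict lists in `assumed_verdicts` are assumptions derived from the scenario comments ("all bad", "wrong order", and so on), not actual model output, and `weighted_average_precision` is a hypothetical standalone function.

```python
from typing import Dict, List


def weighted_average_precision(verdicts: List[int]) -> float:
    """Same arithmetic as the quoted method, applied to 0/1 verdicts."""
    denominator = sum(verdicts) + 1e-10
    numerator = sum(
        (sum(verdicts[: i + 1]) / (i + 1)) * verdicts[i]
        for i in range(len(verdicts))
    )
    return numerator / denominator


# Assumed verdicts per scenario (1 = chunk judged relevant); not real LLM output.
assumed_verdicts: Dict[str, List[int]] = {
    "all bad": [0, 0],
    "wrong order": [0, 1],
    "right order": [1, 0],
    "all good": [1, 1],
}

scores = {
    name: round(weighted_average_precision(verdicts), 4)
    for name, verdicts in assumed_verdicts.items()
}
print(scores)
```

Under these assumed verdicts, the rounded scores are 0.0, 0.5, 1.0 and 1.0, matching the table above.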

3) The `ground_truth` is used as well to calculate the metric.