
LLM Hallucination Evaluation

YurtsAI developed a pipeline to evaluate the well-known hallucination problem of large language models. Refer to Illusions Unraveled: The Magic and Madness of Hallucinations in LLMs — Part 1 to learn more about hallucinations and the evaluation pipeline.

Evaluation Pipeline

🔧 Setup

Requirements: Python 3.10 and Poetry.

First, create a virtual environment and activate it:

python3.10 -m virtualenv .venv
source .venv/bin/activate

To install the required dependencies (assuming you have Poetry installed), run:

make install

This also logs you into the 🤗 Hub and prompts for your 🤗 Hub token.

For a development install, run:

make install-dev
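
The login step uses the standard Hugging Face authentication. If you'd rather authenticate ahead of time (or the token prompt doesn't appear in your environment), the huggingface_hub client can log you in directly; a minimal sketch:

>>> from huggingface_hub import login

>>> # Prompts for your 🤗 Hub token and stores it locally for later use.
>>> login()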

⚙️ Evaluation

To evaluate a model on the provided TechCrunch dataset, run:

python -m llm_eval \
    --model_name_or_path tiiuae/falcon-7b-instruct \
    --max_length 512 \
    --data_max_size 100 \
    --num_proc 4 \
    --batch_size 8 \
    --compute_reward

For more information, run llm_eval --help or python -m llm_eval --help.

Some models expect different input formatting, e.g. the addition of special tokens or a specific prompt template. To handle this, use the --input_format flag. For example, to preprocess the input for the OpenAssistant/falcon-7b-sft-mix-2000 model, run:

python -m llm_eval \
    --model_name_or_path OpenAssistant/falcon-7b-sft-mix-2000 \
    --data_max_size 100 \
    --input_format "<|prompter|>{}<|endoftext|><|assistant|>" \
    --batch_size 8 \
    --shuffle \
    --max_length 512 \
    --compute_reward
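
The template passed to --input_format looks like a standard Python format string, with the raw prompt substituted for the {} placeholder. A minimal sketch of how such a template would be applied (illustrative only; the wrap_prompt helper is hypothetical and the actual preprocessing lives inside llm_eval):

>>> template = '<|prompter|>{}<|endoftext|><|assistant|>'

>>> def wrap_prompt(prompt: str) -> str:
...     # Insert the raw prompt into the model-specific template.
...     return template.format(prompt)

>>> wrap_prompt('Summarize the article in one sentence.')
'<|prompter|>Summarize the article in one sentence.<|endoftext|><|assistant|>'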

📊 Visualize

If you'd like to explore the results further, you can use pandas or your favorite data analysis library to visualize the data.

If you're not familiar with pandas, you can use the following snippet. Make sure to pip install pandas first.

>>> import pandas as pd

>>> # Load the data to a pandas dataframe.
>>> df = pd.read_json('res/eval/falcon-7b-instruct_tech-crunch.jsonl', lines=True)

>>> # Bucket responses by reward; negative reward flags Type-1 hallucinations.
>>> good = df[df.reward == 1]
>>> neutral = df[df.reward == 0]
>>> bad = df[df.reward < 0]

>>> # Get the number of good, neutral, and bad responses.
>>> n, n_good, n_neutral, n_bad = len(df), len(good), len(neutral), len(bad)

>>> print(f'Good: {n_good} ({n_good / n:.2%})')
>>> print(f'Neutral: {n_neutral} ({n_neutral / n:.2%})')
>>> print(f'Bad: {n_bad} ({n_bad / n:.2%})')
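
As a starting point for a visualization, the counts above can be turned into a simple bar chart with matplotlib (a minimal sketch; pip install matplotlib first):

>>> import matplotlib.pyplot as plt

>>> # Plot the distribution of reward buckets computed above.
>>> plt.bar(['good', 'neutral', 'bad'], [n_good, n_neutral, n_bad])
>>> plt.ylabel('Number of responses')
>>> plt.title('falcon-7b-instruct on TechCrunch')
>>> plt.show()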

You're welcome to submit a pull request with your visualizations!

🧑‍💻 Contribution

You are very welcome to modify this code and use it in your own projects.

Please keep a link to the original repository. If you have made a fork with substantial modifications that you feel may be useful, then please open a new issue on GitHub with a link and short description.

⚖️ License (MIT)

This project is released under the MIT License, which allows very broad use for both private and commercial purposes.

A few of the images used for demonstration purposes may be under copyright; they are included under fair use.
