
llm-benchmark


Test and compare different large language models on various tasks.

Tasks

  • Code Generation
  • Mathematical Reasoning
  • Creative Writing
  • Data Analysis
  • Logical Reasoning
  • Summarisation
  • Technical Explanation
  • Problem Solving
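
The repository defines its own prompts for each of these categories; purely as an illustration, the sketch below shows one way such tasks could be organised and iterated per model. The task_prompts mapping and run_task function are hypothetical and are not the project's actual API.

# Hypothetical sketch of a task/prompt structure (not the repository's real prompts).
task_prompts = {
    "Code Generation": "Write a Python function that checks whether a string is a palindrome.",
    "Mathematical Reasoning": "A train covers 120 km in 90 minutes. What is its average speed in km/h?",
    "Creative Writing": "Write a four-line poem about autumn.",
    "Data Analysis": "Given monthly sales of [120, 95, 140, 160], describe the trend.",
    "Logical Reasoning": "If all A are B and some B are C, does it follow that some A are C?",
    "Summarisation": "Summarise the plot of a novel you know in one sentence.",
    "Technical Explanation": "Explain how HTTPS protects data in transit.",
    "Problem Solving": "How would you detect a cycle in a linked list?",
}

def run_task(model_name: str, prompt: str) -> str:
    """Hypothetical placeholder: send the prompt to the named model and return its reply."""
    raise NotImplementedError

# A benchmark loop would score each (model, task) pair on quality, latency and cost.
for task, prompt in task_prompts.items():
    pass  # e.g. answer = run_task("OpenAI-gpt-4.1", prompt)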

Install Dependencies

Use the package manager pip to install the following dependencies:

## Prerequisites
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install uv  # needed for the uv pip command below
uv pip install -r requirements.txt

LLM Benchmark

## Set Environment Variables
export OPENAI_API_KEY="your_api_key"
export ANTHROPIC_API_KEY="your_api_key"
export XAI_API_KEY="your_api_key"
export DEEPSEEK_TOKEN="your_api_key"
export GOOGLE_API_KEY="your_api_key"
export MOONSHOT_API_KEY="your_api_key"
export OPENROUTER_API_KEY="your_api_key"

## View Setup Guide
python3 -m scripts.benchmark setup

## Execute The Benchmark
python3 -m scripts.benchmark

## Deactivate Virtual Environment
deactivate
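
Before running the benchmark, it can be useful to confirm that every provider key is actually visible to the Python process. A minimal check, assuming the variable names listed above (this snippet is only illustrative and is not part of the repository):

import os

# Provider API keys the benchmark expects (names taken from the section above).
required_keys = [
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "XAI_API_KEY",
    "DEEPSEEK_TOKEN",
    "GOOGLE_API_KEY",
    "MOONSHOT_API_KEY",
    "OPENROUTER_API_KEY",
]

missing = [key for key in required_keys if not os.environ.get(key)]
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All provider keys are set.")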

Results

Results generated on 13 July 2025.

================================================================================
LLM BENCHMARK SUMMARY
================================================================================
Model                               Avg Quality (%)  Quality Std  Avg Latency (s)  Med Latency (s)  Avg Cost ($)  Total Cost ($)  Error Rate (%)
Anthropic-claude-sonnet-4-20250514           90.977       11.390           15.438           11.114         0.019           0.298            0.00
DeepSeek-deepseek-chat                       89.502       11.510           34.283           28.485         0.001           0.013            0.00
Google-gemini-2.5-pro                        77.065       26.758           21.421           21.223         0.002           0.026            6.25
Moonshot-moonshot-v1-8k                      91.278       10.618            7.345            6.967         0.001           0.014            0.00
OpenAI-gpt-4.1                               90.628       10.935            7.645            5.598         0.010           0.156            0.00
Qwen-qwen3-32b                               48.991       45.993           33.438           37.540         0.000           0.000            0.00
xAI-grok-4-0709                              95.341        7.110           23.227           21.574         0.026           0.409            0.00
================================================================================

Best Overall Quality: xAI-grok-4-0709
Fastest Response: Moonshot-moonshot-v1-8k
Most Cost-Effective: Qwen-qwen3-32b

📁 All results saved to results directory
🏆 Best Overall Model: xAI-grok-4-0709
📈 Overall Average Quality: 83.4%
💰 Total Cost: $0.9164
⚡ Average Latency: 20.40s
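
The headline figures are consistent with the per-model table: the overall average quality is the mean of the Avg Quality column, the total cost is the sum of the Total Cost column, and the average latency is the mean of the Avg Latency column (small differences are due to rounding in the table). A quick check, with the values copied from the table above:

# Per-model figures copied from the summary table.
avg_quality = [90.977, 89.502, 77.065, 91.278, 90.628, 48.991, 95.341]
total_cost  = [0.298, 0.013, 0.026, 0.014, 0.156, 0.000, 0.409]
avg_latency = [15.438, 34.283, 21.421, 7.345, 7.645, 33.438, 23.227]

print(round(sum(avg_quality) / len(avg_quality), 1))  # 83.4 (%)
print(round(sum(total_cost), 3))                      # 0.916 (reported as $0.9164 before rounding)
print(round(sum(avg_latency) / len(avg_latency), 2))  # 20.4 (reported as 20.40s)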

License

This project is licensed under the Modified MIT License.

Citation

@misc{llmbenchmark,
  author       = {Oketunji, A.F.},
  title        = {LLM Benchmark},
  year         = 2025,
  version      = {0.0.5},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.15875079},
  url          = {https://doi.org/10.5281/zenodo.15875079}
}

Copyright

(c) 2025 Finbarrs Oketunji. All Rights Reserved.
