Awesome Foundation Model Leaderboard is a curated list of awesome foundation model leaderboards (for an explanation of what a leaderboard is, please refer to this post), along with various development tools and evaluation organizations, compiled according to our survey:
On the Workflows and Smells of Leaderboard Operations (LBOps):
An Exploratory Study of Foundation Model Leaderboards
Zhimin (Jimmy) Zhao, Abdul Ali Bangash, Filipe Roseiro Côgo, Bram Adams, Ahmed E. Hassan
Software Analysis and Intelligence Lab (SAIL)
If you find this repository useful, please consider giving us a star ⭐ and citation:
@article{zhao2024workflows,
  title={On the Workflows and Smells of Leaderboard Operations (LBOps): An Exploratory Study of Foundation Model Leaderboards},
  author={Zhao, Zhimin and Bangash, Abdul Ali and C{\^o}go, Filipe Roseiro and Adams, Bram and Hassan, Ahmed E},
  journal={arXiv preprint arXiv:2407.04065},
  year={2024}
}
Additionally, we provide a search toolkit that helps you quickly navigate through the leaderboards.
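For illustration, a keyword search over the entries in this list can be as simple as the sketch below. The `Leaderboard` dataclass, the `LEADERBOARDS` sample data, and the `search` helper are hypothetical and only show the idea; the actual toolkit may expose a different interface.

```python
# Hypothetical sketch of a keyword search over the curated leaderboard entries.
# The Leaderboard dataclass, LEADERBOARDS sample data, and search() helper are
# illustrative only; the actual search toolkit may expose a different interface.
from dataclasses import dataclass


@dataclass
class Leaderboard:
    name: str
    description: str


LEADERBOARDS = [
    Leaderboard("Open LLM Leaderboard",
                "Tracks progress and ranks the performance of LLMs in English."),
    Leaderboard("MTEB",
                "Massive benchmark for text embedding models on diverse tasks."),
]


def search(query: str) -> list[Leaderboard]:
    """Return entries whose name or description contains the query, case-insensitively."""
    q = query.lower()
    return [entry for entry in LEADERBOARDS
            if q in entry.name.lower() or q in entry.description.lower()]


if __name__ == "__main__":
    for hit in search("embedding"):
        print(f"{hit.name}: {hit.description}")
```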
If you want to contribute to this list (please do), feel free to open a pull request.
If you have any suggestions, critiques, or questions regarding this list, feel free to raise an issue.
Also, a leaderboard is only included if:
- It is actively maintained.
- It is related to foundation models.
Name | Description |
---|---|
gradio_leaderboard | gradio_leaderboard helps users build fully functional and performant leaderboard demos with gradio. |
Demo leaderboard | Demo leaderboard helps users easily deploy their leaderboards with a standardized template (a minimal Gradio sketch follows this table). |
Leaderboard Explorer | Leaderboard Explorer helps users navigate the diverse range of leaderboards available on Hugging Face Spaces. |
open_llm_leaderboard | open_llm_leaderboard helps users access Open LLM Leaderboard data easily. |
open-llm-leaderboard-renamer | open-llm-leaderboard-renamer helps users rename their models in Open LLM Leaderboard easily. |
Open LLM Leaderboard Results PR Opener | Open LLM Leaderboard Results PR Opener helps users showcase Open LLM Leaderboard results in their model cards. |
Open LLM Leaderboard Scraper | Open LLM Leaderboard Scraper helps users scrape and export data from Open LLM Leaderboard. |
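To make the first two rows of the table above more concrete, here is a minimal sketch of a leaderboard demo written against core Gradio only (`gr.Blocks` and `gr.Dataframe`) rather than the gradio_leaderboard component itself; the model names and scores are placeholders.

```python
# Minimal leaderboard-style demo using core Gradio only (gr.Blocks + gr.Dataframe),
# not the gradio_leaderboard component; model names and scores are placeholders.
import gradio as gr
import pandas as pd

# Placeholder results, sorted by score the way a leaderboard would display them.
results = pd.DataFrame(
    {
        "Model": ["model-a", "model-b", "model-c"],
        "Average Score": [71.2, 68.5, 64.9],
    }
).sort_values("Average Score", ascending=False)

with gr.Blocks() as demo:
    gr.Markdown("## Demo leaderboard")
    gr.Dataframe(value=results, interactive=False)

if __name__ == "__main__":
    demo.launch()
```

The gradio_leaderboard component builds on this basic table pattern and layers leaderboard-specific features, such as search and column filtering, on top of it.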
Name | Description |
---|---|
Allen Institute for AI | Allen Institute for AI is a non-profit research institute with the mission of conducting high-impact AI research and engineering in service of the common good. |
Papers With Code | Papers With Code is a community-driven platform for learning about state-of-the-art research papers on machine learning. |
Name | Description |
---|---|
CompassRank | CompassRank is a platform to offer a comprehensive, objective, and neutral evaluation reference of foundation models for industry and research. |
FlagEval | FlagEval is a comprehensive platform for evaluating foundation models. |
GenAI-Arena | GenAI-Arena hosts the visual generation arena, where various vision models compete based on their performance in image generation, image editing, and video generation. |
Holistic Evaluation of Language Models | Holistic Evaluation of Language Models (HELM) is a reproducible and transparent framework for evaluating foundation models. |
nuScenes | nuScenes enables researchers to study challenging urban driving situations using the full sensor suite of a real self-driving car. |
SuperCLUE | SuperCLUE is a series of benchmarks for evaluating Chinese foundation models. |
Name | Description |
---|---|
ACLUE | ACLUE is an evaluation benchmark for ancient Chinese language comprehension. |
AIR-Bench | AIR-Bench is a benchmark to evaluate heterogeneous information retrieval capabilities of language models. |
AlignBench | AlignBench is a multi-dimensional benchmark for evaluating LLMs' alignment in Chinese. |
AlpacaEval | AlpacaEval is an automatic evaluator designed for instruction-following LLMs. |
ANGO | ANGO is a generation-oriented Chinese language model evaluation benchmark. |
Arabic Tokenizers Leaderboard | Arabic Tokenizers Leaderboard compares the efficiency of LLMs in parsing Arabic in its different dialects and forms. |
Arena-Hard-Auto | Arena-Hard-Auto is a benchmark for instruction-tuned LLMs. |
Auto-Arena | Auto-Arena is a benchmark in which various language model agents engage in peer-battles to evaluate their performance. |
BeHonest | BeHonest is a benchmark to evaluate honesty in LLMs, covering awareness of knowledge boundaries (self-knowledge), avoidance of deceit (non-deceptiveness), and consistency in responses (consistency). |
BenBench | BenBench is a benchmark to evaluate the extent to which LLMs conduct verbatim training on the training set of a benchmark over the test set to enhance capabilities. |
BiGGen-Bench | BiGGen-Bench is a comprehensive benchmark to evaluate LLMs across a wide variety of tasks. |
Biomedical Knowledge Probing Leaderboard | Biomedical Knowledge Probing Leaderboard aims to track, rank, and evaluate biomedical factual knowledge probing results in LLMs. |
BotChat | BotChat assesses the multi-round chatting capabilities of LLMs through a proxy task, evaluating whether two ChatBot instances can engage in smooth and fluent conversation with each other. |
C-Eval | C-Eval is a Chinese evaluation suite for LLMs. |
C-Eval Hard | C-Eval Hard is a more challenging version of C-Eval, which involves complex LaTeX equations and requires non-trivial reasoning abilities to solve. |
Capability leaderboard | Capability leaderboard is a platform to evaluate the long context understanding capabilities of LLMs. |
Chain-of-Thought Hub | Chain-of-Thought Hub is a benchmark to evaluate the reasoning capabilities of LLMs. |
ChineseFactEval | ChineseFactEval is a factuality benchmark for Chinese LLMs. |
CLEM | CLEM is a framework designed for the systematic evaluation of chat-optimized LLMs as conversational agents. |
CLiB | CLiB is a benchmark to evaluate Chinese LLMs. |
CMB | CMB is a multi-level medical benchmark in Chinese. |
CMMLU | CMMLU is a benchmark to evaluate the performance of LLMs in various subjects within the Chinese cultural context. |
CMMMU | CMMMU is a benchmark to test the capabilities of multimodal models in understanding and reasoning across multiple disciplines in the Chinese context. |
CompMix | CompMix is a benchmark for heterogeneous question answering. |
Compression Leaderboard | Compression Leaderboard is a platform to evaluate the compression performance of LLMs. |
CoTaEval | CoTaEval is a benchmark to evaluate the feasibility and side effects of copyright takedown methods for LLMs. |
ConvRe | ConvRe is a benchmark to evaluate LLMs' ability to comprehend converse relations. |
CriticBench | CriticBench is a benchmark to evaluate LLMs' ability to make critique responses. |
CRM LLM Leaderboard | CRM LLM Leaderboard is a platform to evaluate the efficacy of LLMs for business applications. |
DecodingTrust | DecodingTrust is an assessment platform to evaluate the trustworthiness of LLMs. |
Domain LLM Leaderboard | Domain LLM Leaderboard is a platform to evaluate the popularity of domain-specific LLMs. |
DyVal | DyVal is a dynamic evaluation protocol for LLMs. |
Enterprise Scenarios leaderboard | Enterprise Scenarios Leaderboard aims to assess the performance of LLMs on real-world enterprise use cases. |
EQ-Bench | EQ-Bench is a benchmark to evaluate aspects of emotional intelligence in LLMs. |
Factuality Leaderboard | Factuality Leaderboard compares the factual capabilities of LLMs. |
FuseReviews | FuseReviews aims to advance grounded text generation tasks, including long-form question-answering and summarization. |
FELM | FELM is a meta benchmark to evaluate factuality evaluators for LLMs. |
GAIA | GAIA aims to test fundamental abilities that an AI assistant should possess. |
GPT-Fathom | GPT-Fathom is an LLM evaluation suite, benchmarking 10+ leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings. |
Guerra LLM AI Leaderboard | Guerra LLM AI Leaderboard compares and ranks the performance of LLMs across quality, price, performance, context window, and others. |
Hallucinations Leaderboard | Hallucinations Leaderboard aims to track, rank and evaluate hallucinations in LLMs. |
HalluQA | HalluQA is a benchmark to evaluate the phenomenon of hallucinations in Chinese LLMs. |
HellaSwag | HellaSwag is a benchmark to evaluate common-sense reasoning in LLMs. |
HHEM Leaderboard | HHEM Leaderboard evaluates how often a language model introduces hallucinations when summarizing a document. |
IFEval | IFEval is a benchmark to evaluate LLMs' instruction following capabilities with verifiable instructions. |
Indic LLM Leaderboard | Indic LLM Leaderboard is a benchmark to track progress and rank the performance of Indic LLMs. |
InstructEval | InstructEval is an evaluation suite to assess instruction selection methods in the context of LLMs. |
Japanese Chatbot Arena | Japanese Chatbot Arena hosts the chatbot arena, where various LLMs compete based on their performance in Japanese. |
JustEval | JustEval is a powerful tool designed for fine-grained evaluation of LLMs. |
Ko Chatbot Arena | Ko Chatbot Arena hosts the chatbot arena, where various LLMs compete based on their performance in Korean. |
KoLA | KoLA is a benchmark to evaluate the world knowledge of LLMs. |
L-Eval | L-Eval is a Long Context Language Model (LCLM) evaluation benchmark that assesses the performance of LLMs in handling extensive contexts. |
Language Model Council | Language Model Council (LMC) is a benchmark to evaluate tasks that are highly subjective and often lack majoritarian human agreement. |
LawBench | LawBench is a benchmark to evaluate the legal capabilities of LLMs. |
LogicKor | LogicKor is a benchmark to evaluate the multidisciplinary thinking capabilities of Korean LLMs. |
Long In-context Learning Leaderboard | Long In-context Learning Leaderboard is a platform to evaluate the long in-context learning capabilities of LLMs. |
LAiW | LAiW is a benchmark to evaluate Chinese legal language understanding and reasoning. |
LLM Benchmarker Suite | LLM Benchmarker Suite is a benchmark to evaluate the comprehensive capabilities of LLMs. |
LLM Leaderboard | LLM Leaderboard is a platform to evaluate LLMs in the Chinese context. |
LLM Leaderboard (en) | LLM Leaderboard (en) is a platform to evaluate LLMs in the English context. |
LLM Safety Leaderboard | LLM Safety Leaderboard aims to provide a unified evaluation for language model safety. |
LLM-Leaderboard | LLM-Leaderboard is a joint community effort to create one central leaderboard for LLMs. |
LLM-Perf Leaderboard | LLM-Perf Leaderboard aims to benchmark the performance of LLMs with different hardware, backends, and optimizations. |
LLMs Disease Risk Prediction Leaderboard | LLMs Disease Risk Prediction Leaderboard is a platform to evaluate LLMs on disease risk prediction. |
LLMEval | LLMEval is a benchmark to evaluate the quality of open-domain conversations with LLMs. |
LLMHallucination Leaderboard | LLMHallucination Leaderboard evaluates LLMs based on an array of hallucination-related benchmarks. |
LLMPerf | LLMPerf is a tool to evaluate the performance of LLMs using both load and correctness tests. |
LMSYS Chatbot Arena Leaderboard | LMSYS Chatbot Arena Leaderboard hosts the chatbot arena, where various LLMs compete based on their performance in English. |
LongBench | LongBench is a benchmark for assessing the long context understanding capabilities of LLMs. |
LucyEval | LucyEval offers a thorough assessment of LLMs' performance in various Chinese contexts. |
M3KE | M3KE is a massive multi-level multi-subject knowledge evaluation benchmark to measure the knowledge acquired by Chinese LLMs. |
MINT | MINT is a benchmark to evaluate LLMs' ability to solve tasks with multi-turn interactions by using tools and leveraging natural language feedback. |
MedBench | MedBench is a benchmark to evaluate the mastery of knowledge and reasoning abilities in medical LLMs. |
Meta Open LLM leaderboard | The Meta Open LLM leaderboard serves as a central hub for consolidating data from various open LLM leaderboards into a single, user-friendly visualization page. |
Mistral ChatBot Arena | Mistral ChatBot Arena hosts the chatbot arena, where various LLMs compete based on their performance in chatting. |
MixEval | MixEval is a benchmark to evaluate LLMs by strategically mixing off-the-shelf benchmarks. |
ML.ENERGY Leaderboard | ML.ENERGY Leaderboard evaluates the energy consumption of LLMs. |
MMLU | MMLU is a benchmark to evaluate the performance of LLMs across a wide array of natural language understanding tasks. |
MMLU-by-task Leaderboard | MMLU-by-task Leaderboard provides a platform for evaluating and comparing various ML models across different language understanding tasks. |
MMLU-Pro | MMLU-Pro is a more challenging version of MMLU to evaluate the reasoning capabilities of LLMs. |
ModelScope LLM Leaderboard | ModelScope LLM Leaderboard is a platform to evaluate LLMs objectively and comprehensively. |
MSTEB | MSTEB is a benchmark for measuring the performance of text embedding models in Spanish. |
MTEB | MTEB is a massive benchmark for measuring the performance of text embedding models on diverse embedding tasks across 112 languages. |
MY Malay LLM Leaderboard | MY Malay LLM Leaderboard aims to track, rank, and evaluate open LLMs on Malay tasks. |
MY Malaysian Embedding Leaderboard | MY Malaysian Embedding Leaderboard measures and ranks the performance of text embedding models on diverse embedding tasks in Malay. |
NoCha | NoCha is a benchmark to evaluate how well long-context language models can verify claims written about fictional books. |
NPHardEval | NPHardEval is a benchmark to evaluate the reasoning abilities of LLMs through the lens of computational complexity classes. |
Occiglot Euro LLM Leaderboard | Occiglot Euro LLM Leaderboard compares LLMs across the main European languages from the Okapi benchmark and Belebele (French, Italian, German, Spanish, and Dutch). |
OlympicArena | OlympicArena is a benchmark to evaluate the advanced capabilities of LLMs across a broad spectrum of Olympic-level challenges. |
oobabooga | Oobabooga is a benchmark to perform repeatable performance tests of LLMs with the oobabooga web UI. |
Open-LLM-Leaderboard | Open-LLM-Leaderboard evaluates LLMs in terms of language understanding and reasoning by transitioning from multiple-choice questions (MCQs) to open-style questions. |
Open-source Model Fine-Tuning Leaderboard | Open-source Model Fine-Tuning Leaderboard is a platform to rank and showcase models that have been fine-tuned using open-source datasets or frameworks. |
OpenEval | OpenEval is a multidimensional and open evaluation system to assess Chinese LLMs. |
OpenLLM Turkish leaderboard | OpenLLM Turkish leaderboard tracks progress and ranks the performance of LLMs in Turkish. |
Open Arabic LLM Leaderboard | Open Arabic LLM Leaderboard tracks progress and ranks the performance of LLMs in Arabic. |
Open Dutch LLM Evaluation Leaderboard | Open Dutch LLM Evaluation Leaderboard tracks progress and ranks the performance of LLMs in Dutch. |
Open ITA LLM Leaderboard | Open ITA LLM Leaderboard tracks progress and ranks the performance of LLMs in Italian. |
Open Ko-LLM Leaderboard | Open Ko-LLM Leaderboard tracks progress and ranks the performance of LLMs in Korean. |
Open LLM Leaderboard | Open LLM Leaderboard tracks progress and ranks the performance of LLMs in English. |
Open Medical-LLM Leaderboard | Open Medical-LLM Leaderboard aims to track, rank, and evaluate open LLMs in the medical domain. |
Open MLLM Leaderboard | Open MLLM Leaderboard aims to track, rank and evaluate LLMs and chatbots. |
Open MOE LLM Leaderboard | Open MOE LLM Leaderboard assesses the performance and efficiency of various Mixture of Experts (MoE) LLMs. |
Open Multilingual LLM Evaluation Leaderboard | Open Multilingual LLM Evaluation Leaderboard tracks progress and ranks the performance of LLMs in multiple languages. |
Open PL LLM Leaderboard | Open PL LLM Leaderboard is a platform for assessing the performance of various LLMs in Polish. |
Open PT LLM Leaderboard | Open PT LLM Leaderboard tracks progress and ranks the performance of LLMs in Portuguese. |
OR-Bench | OR-Bench is a benchmark to evaluate over-refusal behavior in safety-enhanced LLMs. |
Powered-by-Intel LLM Leaderboard | Powered-by-Intel LLM Leaderboard evaluates, scores, and ranks LLMs that have been pre-trained or fine-tuned on Intel Hardware. |
PubMedQA | PubMedQA is a benchmark to evaluate biomedical research question answering. |
PromptBench | PromptBench is a benchmark to evaluate the robustness of LLMs on adversarial prompts. |
QuALITY | QuALITY is a benchmark for evaluating multiple-choice question-answering with a long context. |
RABBITS | RABBITS is a benchmark to evaluate the robustness of LLMs by evaluating their handling of synonyms, specifically brand and generic drug names. |
RedTeam Arena | RedTeam Arena is a red-teaming platform for LLMs. |
Red Teaming Resistance Benchmark | Red Teaming Resistance Benchmark is a benchmark to evaluate the robustness of LLMs against red teaming prompts. |
Reviewer Arena | Reviewer Arena hosts the reviewer arena, where various LLMs compete based on their performance in critiquing academic papers. |
Robust Reading Competition | Robust Reading Competition hosts challenges on robust reading, the research area concerned with interpreting written communication in unconstrained settings. |
RoleEval | RoleEval is a bilingual benchmark to evaluate the memorization, utilization, and reasoning capabilities of role knowledge of LLMs. |
Safety Prompts | Safety Prompts is a benchmark to evaluate the safety of Chinese LLMs. |
SafetyBench | SafetyBench is a benchmark to evaluate the safety of LLMs. |
SALAD-Bench | SALAD-Bench is a benchmark for evaluating the safety and security of LLMs. |
ScandEval | ScandEval is a benchmark to evaluate LLMs on tasks in Scandinavian languages as well as German, Dutch, and English. |
SciKnowEval | SciKnowEval is a benchmark to evaluate LLMs based on their proficiency in studying extensively, enquiring earnestly, thinking profoundly, discerning clearly, and practicing assiduously. |
SCROLLS | SCROLLS is a benchmark to evaluate the reasoning capabilities of LLMs over long texts. |
SeaExam | SeaExam is a benchmark to evaluate LLMs for Southeast Asian (SEA) languages. |
SEAL | SEAL is an expert-driven private evaluation platform for LLMs. |
SeaEval | SeaEval is a benchmark to evaluate the performance of multilingual LLMs in understanding and reasoning with natural language, as well as comprehending cultural practices, nuances, and values. |
Spec-Bench | Spec-Bench is a benchmark to evaluate speculative decoding methods across diverse scenarios. |
SuperBench | SuperBench is a comprehensive evaluation system of tasks and dimensions to assess the overall capabilities of LLMs. |
SuperGLUE | SuperGLUE is a benchmark to evaluate the performance of LLMs on a set of challenging language understanding tasks. |
SuperLim | SuperLim is a benchmark to evaluate the language understanding capabilities of LLMs in Swedish. |
Swahili LLM-Leaderboard | Swahili LLM-Leaderboard is a joint community effort to create one central leaderboard for LLMs. |
T-Eval | T-Eval is a benchmark for evaluating the tool utilization capability of LLMs. |
TAT-DQA | TAT-DQA is a benchmark to evaluate LLMs on the discrete reasoning over documents that combine both structured and unstructured information. |
TAT-QA | TAT-QA is a benchmark to evaluate LLMs on the discrete reasoning over documents that combine both tabular and textual content. |
The Pile | The Pile is a benchmark to evaluate the world knowledge and reasoning ability of LLMs. |
TOFU Leaderboard | TOFU Leaderboard is a benchmark to evaluate the unlearning performance of LLMs in realistic scenarios. |
Science Leaderboard | Science Leaderboard is a platform to evaluate LLMs' capabilities to solve science problems. |
Toloka LLM Leaderboard | Toloka LLM Leaderboard is a benchmark to evaluate LLMs based on authentic user prompts and expert human evaluation. |
Toolbench | ToolBench is a platform for training, serving, and evaluating LLMs specifically for tool learning. |
Toxicity Leaderboard | Toxicity Leaderboard evaluates the toxicity of LLMs. |
Trustbit LLM Leaderboards | Trustbit LLM Leaderboards is a platform that provides benchmarks for building and shipping products with LLMs. |
TrustLLM | TrustLLM is a benchmark to evaluate the trustworthiness of LLMs. |
UGI Leaderboard | UGI Leaderboard measures and compares the uncensored and controversial information known by LLMs. |
ViDoRe | ViDoRe is a benchmark to evaluate retrieval models on their capacity to match queries to relevant documents at the page level. |
VLLMs Leaderboard | VLLMs Leaderboard aims to track, rank and evaluate open LLMs and chatbots. |
Xiezhi | Xiezhi is a benchmark for holistic domain knowledge evaluation of LLMs. |
Yet Another LLM Leaderboard | Yet Another LLM Leaderboard is a platform for tracking, ranking, and evaluating open LLMs and chatbots. |
Name | Description |
---|---|
AesBench | AesBench is a benchmark to evaluate multimodal LLMs (MLLM) on image aesthetics perception. |
BLINK | BLINK is a benchmark to evaluate the core visual perception abilities of MLLMs. |
CCBench | CCBench is a benchmark to evaluate the multi-modal capabilities of MLLMs specifically related to Chinese culture. |
CharXiv | CharXiv is a benchmark to evaluate chart understanding capabilities of MLLMs. |
ChEF | ChEF is a benchmark to evaluate MLLMs across various visual reasoning tasks. |
ConTextual | ConTextual is a benchmark to evaluate MLLMs across context-sensitive text-rich visual reasoning tasks. |
CORE-MM | CORE-MM is a benchmark to evaluate the open-ended visual question-answering (VQA) capabilities of MLLMs. |
DreamBench++ | DreamBench++ is a human-aligned benchmark automated by multimodal models for personalized image generation. |
EgoPlan-Bench | EgoPlan-Bench is a benchmark to evaluate planning abilities of MLLMs in real-world, egocentric scenarios. |
GlitchBench | GlitchBench is a benchmark to evaluate the reasoning capabilities of MLLMs in the context of detecting video game glitches. |
HallusionBench | HallusionBench is a benchmark to evaluate the image-context reasoning capabilities of MLLMs. |
InfiMM-Eval | InfiMM-Eval is a benchmark to evaluate the open-ended VQA capabilities of MLLMs. |
LRVS-Fashion | LRVS-Fashion is a benchmark to evaluate LLMs regarding image similarity search in fashion. |
LVLM Leaderboard | LVLM Leaderboard is a platform to evaluate the visual reasoning capabilities of MLLMs. |
M3CoT | M3CoT is a benchmark for multi-domain multi-step multi-modal chain-of-thought of MLLMs. |
Mementos | Mementos is a benchmark to evaluate the reasoning capabilities of MLLMs over image sequences. |
MJ-Bench | MJ-Bench is a benchmark to evaluate multimodal judges in providing feedback for image generation models across four key perspectives: alignment, safety, image quality, and bias. |
MLLM-Bench | MLLM-Bench is a benchmark to evaluate the visual reasoning capabilities of MLLMs. |
MMBench | MMBench is a benchmark to evaluate the visual reasoning capabilities of MLLMs. |
MME | MME is a benchmark to evaluate the visual reasoning capabilities of MLLMs. |
MMMU | MMMU is a benchmark to evaluate the performance of multimodal models on tasks that demand college-level subject knowledge and expert-level reasoning across various disciplines. |
MMStar | MMStar is a benchmark to evaluate the multi-modal capacities of MLLMs. |
MMT-Bench | MMT-Bench is a benchmark to evaluate MLLMs across a wide array of multimodal tasks that require expert knowledge as well as deliberate visual recognition, localization, reasoning, and planning. |
Multimodal Hallucination Leaderboard | Multimodal Hallucination Leaderboard compares MLLMs based on hallucination levels in various tasks. |
MULTI | MULTI is a benchmark to evaluate MLLMs on understanding complex tables and images, and reasoning with long context. |
MultiTrust | MultiTrust is a benchmark to evaluate the trustworthiness of MLLMs across five primary aspects: truthfulness, safety, robustness, fairness, and privacy. |
NPHardEval4V | NPHardEval4V is a benchmark to evaluate the reasoning abilities of MLLMs through the lens of computational complexity classes. |
OCRBench | OCRBench is a benchmark to evaluate the OCR capabilities of multimodal models. |
Open CoT Leaderboard | Open CoT Leaderboard tracks LLMs' abilities to generate effective chain-of-thought reasoning traces. |
Open Parti Prompts Leaderboard | Open Parti Prompts Leaderboard compares text-to-image models to each other according to human preferences. |
PCA-Bench | PCA-Bench is a benchmark to evaluate the embodied decision-making capabilities of multimodal models. |
Q-Bench | Q-Bench is a benchmark to evaluate the visual reasoning capabilities of MLLMs. |
RewardBench | RewardBench is a benchmark to evaluate the capabilities and safety of reward models. |
ScienceQA | ScienceQA is a benchmark to evaluate the multi-hop reasoning ability and interpretability of AI systems in the context of science question answering. |
SciGraphQA | SciGraphQA is a benchmark to evaluate the MLLMs in scientific graph question-answering. |
SEED-Bench | SEED-Bench is a benchmark to evaluate the text and image generation of multimodal models. |
UnlearnCanvas | UnlearnCanvas is a stylized image benchmark to evaluate machine unlearning for diffusion models. |
UnlearnDiffAtk | UnlearnDiffAtk is a benchmark to evaluate the robustness of safety-driven unlearned diffusion models (DMs) (i.e., DMs after unlearning undesirable concepts, styles, or objects) across a variety of tasks. |
URIAL Bench | URIAL Bench is a benchmark to evaluate the capacity of language models for alignment without introducing the factors of fine-tuning (learning rate, data, etc.), which are hard to control for fair comparisons. |
UPD Leaderboard | UPD Leaderboard is a platform to evaluate the trustworthiness of MLLMs in unsolvable problem detection. |
Vibe-Eval | Vibe-Eval is a benchmark to evaluate MLLMs for challenging cases. |
VideoHallucer | VideoHallucer is a benchmark to detect hallucinations in MLLMs. |
VisIT-Bench | VisIT-Bench is a benchmark to evaluate the instruction-following capabilities of MLLMs for real-world use. |
WHOOPS! | WHOOPS! is a benchmark to evaluate the visual commonsense reasoning abilities of MLLMs. |
WildBench | WildBench is a benchmark for evaluating language models on challenging tasks that closely resemble real-world applications. |
WildVision Arena Leaderboard | WildVision Arena Leaderboard hosts the chatbot arena, where various MLLMs compete based on their performance in visual understanding. |
Name | Description |
---|---|
Aider LLM Leaderboards | Aider LLM Leaderboards evaluate LLMs' ability to follow system prompts to edit code. |
Berkeley Function Calling Leaderboard | Berkeley Function Calling Leaderboard evaluates the ability of LLMs to call functions (also known as tools) accurately. |
BigCodeBench | BigCodeBench is a benchmark for code generation with practical and challenging programming tasks. |
Big Code Models Leaderboard | Big Code Models Leaderboard assesses the performance of LLMs on code-related tasks. |
BIRD | BIRD is a benchmark to evaluate the performance of text-to-SQL parsing systems. |
CanAiCode Leaderboard | CanAiCode Leaderboard is a platform to assess the code generation capabilities of LLMs. |
ClassEval | ClassEval is a benchmark to evaluate LLMs on class-level code generation. |
Code Lingua | Code Lingua is a benchmark to compare the ability of code models to understand what the code implements in source languages and translate the same semantics in target languages. |
Coding LLMs Leaderboard | Coding LLMs Leaderboard is a platform to evaluate and rank LLMs across various programming tasks. |
CRUXEval | CRUXEval is a benchmark to evaluate code reasoning, understanding, and execution capabilities of LLMs. |
CyberSafetyEval | CYBERSECEVAL is a benchmark to evaluate the cybersecurity of LLMs as coding assistants. |
DevOps-Eval | DevOps-Eval is a benchmark to evaluate code models in the DevOps/AIOps field. |
DS-1000 | DS-1000 is a benchmark to evaluate code generation models on data science tasks. |
EffiBench | EffiBench is a benchmark to evaluate the efficiency of LLMs in code generation. |
EvalPlus | EvalPlus is a benchmark to evaluate the code generation performance of LLMs. |
EvoCodeBench | EvoCodeBench is an evolutionary code generation benchmark aligned with real-world code repositories. |
EvoEval | EvoEval is a benchmark to evaluate the coding abilities of LLMs, created by evolving existing benchmarks into different targeted domains. |
InfiBench | InfiBench is a benchmark to evaluate code models on answering freeform real-world code-related questions. |
InterCode | InterCode is a benchmark to standardize and evaluate interactive coding with execution feedback. |
LiveCodeBench | LiveCodeBench is a benchmark to evaluate code models across code-related scenarios over time. |
Long Code Arena | Long Code Arena is a suite of benchmarks for code-related tasks with large contexts, up to a whole code repository. |
NaturalCodeBench | NaturalCodeBench is a benchmark to mirror the complexity and variety of scenarios in real coding tasks. |
Nexus Function Calling Leaderboard | Nexus Function Calling Leaderboard is a platform to evaluate code models on performing function calling and API usage. |
Program Synthesis Models Leaderboard | Program Synthesis Models Leaderboard provides a ranking and comparison of open-source code models based on their performance. |
RepoQA | RepoQA is a benchmark to evaluate the long-context code understanding ability of LLMs. |
Spider | Spider is a benchmark to evaluate the performance of natural language interfaces for cross-domain databases. |
StableToolBench | StableToolBench is a benchmark to evaluate tool learning that aims to provide a well-balanced combination of stability and reality. |
SWE-bench | SWE-bench is a benchmark for evaluating LLMs on real-world software issues collected from GitHub. |
Name | Description |
---|---|
MathBench | MathBench is a multi-level difficulty mathematics evaluation benchmark for LLMs. |
MathEval | MathEval is a benchmark to evaluate the mathematical capabilities of LLMs. |
MathVerse | MathVerse is a benchmark to evaluate vision-language models in interpreting and reasoning with visual information in mathematical problems. |
MathVista | MathVista is a benchmark to evaluate mathematical reasoning in visual contexts. |
Open Multilingual Reasoning Leaderboard | Open Multilingual Reasoning Leaderboard tracks and ranks the reasoning performance of LLMs on multilingual mathematical reasoning benchmarks. |
SciBench | SciBench is a benchmark to evaluate the reasoning capabilities of LLMs for solving complex scientific problems. |
TabMWP | TabMWP is a benchmark to evaluate LLMs in mathematical reasoning tasks that involve both textual and tabular data. |
We-Math | We-Math is a benchmark to evaluate the human-like mathematical reasoning capabilities of LLMs with problem-solving principles beyond the end-to-end performance. |
Name | Description |
---|---|
AutoEval-Video | AutoEval-Video is a benchmark to evaluate the capabilities of video models in the context of open-ended video question answering. |
LongVideoBench | LongVideoBench is a benchmark to evaluate the capabilities of video models in answering referred reasoning questions, which are dependent on long frame inputs and cannot be well-addressed by a single frame or a few sparse frames. |
MLVU | MLVU is a benchmark to evaluate video models in multi-task long video understanding. |
MMToM-QA | MMToM-QA is a multimodal benchmark to evaluate machine Theory of Mind (ToM), the ability to understand people's minds. |
MVBench | MVBench is a benchmark to evaluate the temporal understanding capabilities of video models in dynamic video tasks. |
VBench | VBench is a benchmark to evaluate video generation capabilities of video models. |
Video-Bench | Video-Bench is a benchmark to evaluate the video-exclusive understanding, prior knowledge incorporation, and video-based decision-making abilities of video models. |
Video-MME | Video-MME is a benchmark to evaluate the video analysis capabilities of video models. |
VNBench | VNBench is a benchmark to evaluate the fine-grained understanding and spatio-temporal modeling capabilities of video models. |
Name | Description |
---|---|
Agent CTF Leaderboard | Agent CTF Leaderboard is a platform to evaluate the performance of LLM-driven agents in the field of cybersecurity, particularly CTF (capture the flag) competition issues. |
AgentBench | AgentBench is a benchmark to evaluate LLMs as agents across a diverse spectrum of environments. |
AgentStudio | AgentStudio is an integrated solution featuring in-depth benchmark suites, realistic environments, and comprehensive toolkits. |
LLM Colosseum Leaderboard | LLM Colosseum Leaderboard is a platform to evaluate LLMs by fighting in Street Fighter 3. |
TravelPlanner | TravelPlanner is a benchmark to evaluate LLM agents in tool use and complex planning within multiple constraints. |
VisualWebArena | VisualWebArena is a benchmark to evaluate the performance of multimodal web agents on realistic visually grounded tasks. |
WebArena | WebArena is a standalone, self-hostable web environment to evaluate autonomous agents. |
Name | Description |
---|---|
MY Malaysian Speech-to-Text Leaderboard | MY Malaysian Speech-to-Text (STT) Leaderboard aims to track, rank and evaluate Malaysian STT models. |
Open ASR Leaderboard | Open ASR Leaderboard provides a platform for tracking, ranking, and evaluating Automatic Speech Recognition (ASR) models. |
TTS Arena | TTS Arena hosts the text-to-speech (TTS) arena, where various TTS models compete based on their performance in generating speech. |
Name | Description |
---|---|
3D Arena | 3D Arena hosts the 3D generation arena, where various 3D generative models compete based on their performance in generating 3D models. |
3D-POPE | 3D-POPE is a benchmark to evaluate object hallucination in 3D generative models. |
3DGen-Arena | 3DGen-Arena hosts the 3D generation arena, where various 3D generative models compete based on their performance in generating 3D models. |
BOP | BOP is a benchmark for 6D pose estimation of a rigid object from a single RGB-D input image. |
GPTEval3D Leaderboard | GPTEval3D Leaderboard checks how well MLLMs understand 3D content given multi-view images as input. |
Name | Description |
---|---|
Artificial Analysis | Artificial Analysis is a platform to help users make informed decisions on AI model selection and hosting providers. |
Papers Leaderboard | Papers Leaderboard is a platform to evaluate the popularity of machine learning papers. |
Provider Leaderboard | LLM API Providers Leaderboard is a platform to compare the performance of API providers across LLM endpoints on key performance metrics. |
Name | Description |
---|---|
DataComp - CLIP | DataComp - CLIP is a benchmark to evaluate the performance of various image/text pairs when used with a fixed model architecture. |
DataComp - LM | DataComp - LM is a benchmark to evaluate the performance of various text datasets when used with a fixed model architecture. |
Name | Description |
---|---|
AlignScore | AlignScore evaluates the performance of different metrics in assessing factual consistency. |
Name | Description |
---|---|
Open Leaderboards Leaderboard | Open Leaderboards Leaderboard is a meta-leaderboard that leverages human preferences to compare machine learning leaderboards. |