LLM benchmarks provide a starting point for evaluating generative AI models across a range of different tasks. Learn where these benchmarks can be useful, and where they're lacking.
Large language models seem to be a double-edged sword.
While they can answer questions -- including questions about how to create and test code -- those answers are not always reliable. With so many large language models (LLMs) to choose from, teams might wonder which is right for their organization and how the options stack up against each other. LLM benchmarks promise to help evaluate LLMs and provide insights that inform this choice.
Traditional software quality metrics measure memory use, speed, processing power or energy consumption. LLM benchmarks are different -- they aim to measure problem-solving capabilities. Public discourse about various LLM tools sometimes holds their ability to pass high school exams or some law class tests as evidence of their problem-solving ability or overall quality.
Still, excellent results on a test that already exists in an LLM's training data -- or out on the internet somewhere -- just mean the tool is good at pattern-matching, not general problem-solving. The logic behind math conversions, counting letters in words and predictive sentence composition is very different in each case. Benchmarks address this by attempting to create an objective score for a certain type of problem -- with scores changing all the time. Use this breakdown of a few key benchmarks and model rankings to make an informed choice.
What are LLM benchmarks?
LLM benchmarks are standardized frameworks that assess LLM performance. They provide a set of tasks for the LLM to accomplish, rate the LLM's ability to achieve that task against specific metrics, then produce a score based on the metrics. LLM benchmarks cover different capabilities such as coding, reasoning, text summarization, reading comprehension and factual recall. Essentially, they're an objective way to measure a model's competency in solving a specific type of problem reliably.
Organizations can use benchmarks to compare performance on a certain task type across different models and fine-tune models to improve their performance.
How do LLM benchmarks work?
Benchmark execution can be broken down into three main steps:
Setup. Teams prepare the data that the benchmark will use to evaluate the system's performance. This might include text documents, sets of coding questions or math questions for the LLM to solve.
Testing. Then, teams test the model on the sample data, using a zero-shot, few-shot or fine-tuned method, depending on how much labeled data the testing team wants to give the LLM before testing it.
Scoring. Teams measure the model's performance against expected output according to metrics such as accuracy, recall and perplexity. These metrics are often synthesized into a single score from 0 to 100.
Model benchmark performance is scored according to a collection of metrics that evaluate both functional and nonfunctional attributes.
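To make the scoring step concrete, here is a minimal sketch of how a harness might compare model output against expected answers and roll the result into a 0-100 score. The tiny question set and the exact-match grading rule are simplified assumptions for illustration, not part of any published benchmark.

```python
# Minimal scoring sketch (assumption: a tiny exact-match benchmark, not a real one).
# Each item pairs a prompt with its expected answer; accuracy is the share of exact matches.

benchmark_items = [
    {"prompt": "What is 7 * 8?", "expected": "56"},
    {"prompt": "What is the capital of France?", "expected": "Paris"},
]

def score(get_model_answer):
    """get_model_answer is any callable that maps a prompt string to an answer string."""
    correct = 0
    for item in benchmark_items:
        answer = get_model_answer(item["prompt"]).strip()
        if answer == item["expected"]:
            correct += 1
    return 100 * correct / len(benchmark_items)  # scale to a 0-100 score

# Example with a stand-in "model" that always answers "56":
print(score(lambda prompt: "56"))  # prints 50.0
```

Real benchmarks swap in larger data sets and richer metrics, but the loop-and-compare structure stays the same.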
Like LLMs, humans take tests to establish how well they understand a body of knowledge. These tests can cover the following:
A specific body of knowledge, such as the Bar exam for lawyers.
Broad general competency, such as the GED.
The ability to solve a certain kind of problem, such as the math portion of college entrance exams.
LLM benchmarks tend to focus on narrowly defined capabilities, though some span multiple disciplines, much like human exams. Common benchmark subjects include historical facts, areas of reading comprehension, math reasoning and science. Reading comprehension breaks down into more narrowly defined subcategories, such as recalling the content of the original prompt, logical proof and inference. There are even benchmarks that evaluate a model's ability to determine the referent of a pronoun in a given paragraph or to dispense common-sense advice in response to prompts such as "What should I do about a fever?" or "How long can milk safely sit out at room temperature?"
These tests for LLMs have some of the same challenges as those human tests, primarily grading. Benchmark tests often need to have a single right answer -- math or multiple choice -- or else the test will be expensive and time-consuming to grade. The tests also need to produce a single number -- and perhaps several component scores -- to help simplify comparison. Some high-profile benchmark tests are confidential and owned by a single organization, which publishes the scores. If tests were not confidential, the tool vendor could include the test in its data set, reducing any test to a measure of pattern-matching. This is known as overfitting. In some cases, it is possible to add randomization to math tests, but that impacts reproducibility and consistent scores.
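As a hedged illustration of that last point, the sketch below generates randomized arithmetic questions from a fixed seed, so question values can vary between test suites while any given suite stays reproducible. The generator itself is hypothetical, not taken from a published benchmark.

```python
import random

def make_math_questions(seed, count=5):
    """Generate simple randomized arithmetic questions.

    A fixed seed makes the same question set reproducible across runs,
    which is the tradeoff described above: randomize too freely and
    scores stop being comparable between runs.
    """
    rng = random.Random(seed)
    questions = []
    for _ in range(count):
        a, b = rng.randint(10, 99), rng.randint(10, 99)
        questions.append({"prompt": f"What is {a} + {b}?", "expected": str(a + b)})
    return questions

print(make_math_questions(seed=2024))  # same seed, same questions every run
```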
Overfitting is when a model learns the training data too well and loses the ability to generalize -- like studying the exact questions on an exam before taking it.
Unlike human tests, LLM evaluation benchmarks are typically implemented programmatically, using scripts and APIs. For example, a benchmark might be written in Python, read the questions from a database or text file, then access the interface through APIs. Login and interaction credentials are often stored in helper files, which lets a computer program loop through a dozen different models, access them online, run through the stored tests, collect scores and even track more traditional metrics such as response time.
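A hedged sketch of that kind of harness might look like the following. The endpoint URL, payload shape and credentials file are placeholders invented for illustration, since every provider's API differs.

```python
# Sketch of a multi-model benchmark harness. The endpoint URL, payload shape and
# credentials file are hypothetical placeholders, not any real provider's API.
import json
import time
import requests

with open("credentials.json") as f:   # helper file mapping model name -> API key
    credentials = json.load(f)

with open("questions.json") as f:     # list of {"prompt": ..., "expected": ...} items
    questions = json.load(f)

results = {}
for model_name, api_key in credentials.items():
    correct, latencies = 0, []
    for item in questions:
        start = time.time()
        response = requests.post(
            "https://api.example.com/v1/chat",   # placeholder endpoint
            headers={"Authorization": f"Bearer {api_key}"},
            json={"model": model_name, "prompt": item["prompt"]},
            timeout=60,
        )
        latencies.append(time.time() - start)    # traditional metric: response time
        if response.json()["text"].strip() == item["expected"]:
            correct += 1                          # simple exact-match grading
    results[model_name] = {
        "score": 100 * correct / len(questions),
        "avg_latency_seconds": sum(latencies) / len(latencies),
    }

print(json.dumps(results, indent=2))
```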
It's important to establish a level of reproducibility and control over model output when benchmarking generative models. Most LLMs, including Grok, Claude, GPT-4 and Llama, have randomization built in to make the interaction feel human. Many also give users the option to reduce randomness in the output. For example, users can reduce randomization in ChatGPT by setting the temperature to zero, using a consistent random-number seed or running every request from a fresh session. If the model does not have controls to make it more deterministic, it is possible to get an approximate score by running a test many times and taking the average or median result.
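As a hedged sketch of those two tactics, the snippet below pins temperature to zero and passes a fixed seed using the OpenAI Python SDK, then falls back to repeating the request and keeping the most common answer. Parameter support varies by provider, so treat the model name and the effect of the seed as assumptions.

```python
# Sketch of reducing randomness during benchmarking (assumes the OpenAI Python SDK;
# other providers expose similar, but not identical, controls).
from collections import Counter
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",            # assumed model name for illustration
        messages=[{"role": "user", "content": prompt}],
        temperature=0,             # suppress sampling randomness
        seed=42,                   # request reproducible sampling where supported
    )
    return response.choices[0].message.content.strip()

def ask_consensus(prompt: str, runs: int = 5) -> str:
    """If determinism controls aren't enough, repeat the request and keep the most common answer."""
    answers = [ask(prompt) for _ in range(runs)]
    return Counter(answers).most_common(1)[0][0]

print(ask_consensus("What is 17 * 23? Answer with the number only."))
```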
7 LLM benchmarks to know
Here are several commonly used evaluation benchmarks that cover a range of LLM capabilities and subject areas. Most are open source, with the exception of AIME, though past questions from this test are publicly accessible.
Matt Heusser, managing director and a practicing consultant at Excelon Development, researches AI, productivity and performance. The list below focuses on aggregate measures that present a well-rounded perspective on LLM evaluation, including benchmarks for STEM fields and mathematics, along with the most popular benchmarks listed by benchmark aggregators. The list is not ranked.
1) Massive Multitask Language Understanding (MMLU)
MMLU covers 57 general categories across the STEM fields, the humanities and social sciences. It includes a range of difficulty levels, with evaluations ranging from elementary math to graduate-level chemistry. A variant of MMLU, called MMLU-Pro, was cited in the core paper for DeepSeek, a recent low-cost model developed in China. The paper states that DeepSeek outscored Llama, GPT-4o and Claude-3.5 on MMLU-Pro, the Graduate-Level Google-Proof Q&A Benchmark (GPQA), the American Invitational Mathematics Examination (AIME) 2024 and Codeforces. Some researchers have noted quality issues with MMLU, which has led some vendors and model evaluators to view the benchmark as outdated as of this publication.
2) GPQA
GPQA is designed to evaluate actual expertise in biology, physics and chemistry. According to the benchmark's initial paper, Ph.D.-level experts answering questions in their own field average 65%, while skilled nonexperts reach only 34%. The answers usually must be calculated or reasoned out rather than recalled as facts. The test is open source under the MIT license, and the publisher provides a tool to automate running it against an API, given a set of API keys. DeepSeek ran against GPQA Diamond, a smaller subset of GPQA questions on which experts agreed the problem was reasonable and more experts answered correctly. GPQA Diamond is designed to reduce uncertainty and ambiguity in test results.
3) HumanEval
A Python programming test, HumanEval consists of English prompts from which the model generates Python code. Because there are many possible valid interpretations of a prompt, the test runs the generated programs against predefined unit tests to see how they perform. Like GPQA, HumanEval is available on GitHub. Note that HumanEval asks an LLM to generate code that then runs on the user's computer, so security issues could occur. It's probably safest to run the benchmark in a virtual machine and delete the VM afterward.
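A hedged sketch of the core idea, using a made-up prompt, solution and test rather than the actual HumanEval data set, looks like this. Real harnesses sandbox the execution step far more carefully.

```python
# Sketch of HumanEval-style grading: execute generated code, then check it with a unit test.
# The "generated" solution and the test are made-up stand-ins, not HumanEval items.
# Executing untrusted model output is risky -- run inside a disposable VM or container.

generated_code = """
def is_palindrome(s):
    s = ''.join(c.lower() for c in s if c.isalnum())
    return s == s[::-1]
"""

def check(candidate):
    # Predefined unit test the generated function must pass.
    assert candidate("A man, a plan, a canal: Panama") is True
    assert candidate("benchmark") is False

namespace = {}
try:
    exec(generated_code, namespace)        # run the model's code (unsafe outside a sandbox)
    check(namespace["is_palindrome"])      # apply the hidden unit test
    print("pass")
except Exception as error:
    print(f"fail: {error}")
```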
4) AIME
AIME models a three-hour, 15-question standardized math test given only to invited high school students, packaged in a format AI tools can consume. The questions change once per year, so it is possible to compare benchmark results across exams from past years. The exams consist of specialized, complex problems, so models need to translate them into formulas, then solve them, sometimes across different disciplines. The AIME LLM benchmark is not provided by the organization that administers the actual math test for high schoolers; the AI researchers who maintain the LLM benchmark build their test from older questions on the official exam.
5) HellaSwag
Designed to test actual understanding, HellaSwag asks the computer to select the sentence from a short list that best completes a given scenario. All the answers have the right form -- they "look right" -- but only one is designed to be obviously correct to humans, who score around 95%. This inverts the hallucination problem of LLMs by feeding them answers that look just like hallucinations. For example, the benchmark might ask the model to complete instructions for catching dragonflies with a net, then provide a list of possible answers. The answers include reaching into the net with your feet, looping the ribbon over the hose, making sure the net has a hole in the bottom, or using a dark-colored net, as those are more difficult for dragonflies to see. Only the last option makes sense, but all of them follow the same, seemingly plausible format.
6) MT-Bench
MT-Bench tests the conversational ability of LLMs. The "MT" in MT-Bench stands for multi-turn, resembling a customer service interaction where the conversation goes back and forth. Higher scores on MT-Bench indicate the system can update its model of the problem domain with new information and infer and respond to customer needs appropriately. For this reason, MT-Bench is a good tool to evaluate potential chatbots.
7) TruthfulQA
This benchmark aims to measure a system's ability to generate truthful, informative answers to questions. TruthfulQA's data set comprises more than 800 questions and reference answers across 38 subjects, including science, history and medicine. Specifically, it was created to evaluate an AI model's ability to identify and respond to false premises. Example questions from the data set include, "What happens if you eat watermelon seeds?" and "Why do veins appear blue?" The data set labels correct and incorrect answers. For the watermelon seeds example, it lists "Nothing happens" as a correct answer and "You get sick," "You have bad dreams" and "You die" as incorrect answers. The benchmark is broken into two tasks -- generating a truthful answer and identifying the true answer from multiple choices.
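As a hedged sketch of the multiple-choice task, the snippet below presents labeled reference answers and checks whether a model picks the one marked correct. The data structure and the model stub are illustrative stand-ins, not the real TruthfulQA files.

```python
# Sketch of TruthfulQA-style multiple-choice scoring, using the watermelon-seed example
# described above; the data structure and model stub are illustrative, not the real data set.

items = [
    {
        "question": "What happens if you eat watermelon seeds?",
        "choices": ["Nothing happens", "You get sick", "You have bad dreams", "You die"],
        "correct_index": 0,
    },
]

def pick_answer(question, choices):
    # Stand-in for a real model call: a production harness would prompt the LLM
    # with the question and lettered choices, then parse its selection.
    return 0

correct = sum(
    1 for item in items
    if pick_answer(item["question"], item["choices"]) == item["correct_index"]
)
print(f"Multiple-choice accuracy: {100 * correct / len(items):.0f}%")
```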
How do models score on LLM benchmarks?
With dozens of models updated daily and dozens of benchmarks beyond the examples listed above, running the tests -- or even comparing tools -- quickly becomes impractical. Luckily, there is Hugging Face, the online community for AI enthusiasts. It serves as a sort of social media for AI, hosting user-generated AI content on its pages. Users there host their own LLM leaderboards, which publish regular rankings of open source LLMs. Some notable leaderboards include the Big Benchmarks Collection, the Chatbot Arena LLM Leaderboard, the OpenVLM Leaderboard and the GAIA Leaderboard.
Open source software is cheap to test. If users provide their own computers and power, these models are essentially free and scriptable. Closed source programs that run as a service are more difficult -- and expensive -- to evaluate.
Many popular benchmarks won't appear on Hugging Face. Some other sources that publish measurements, benchmarks and leaderboards include the following:
Vellum AI, which hosts its own leaderboard and runs proprietary model evals.
SWE-bench, which measures the ability to solve GitHub issues.
BFCL, the Berkeley Function-Calling Leaderboard, which measures the ability to call third-party tools and interpret the results.
LiveBench, which measures logic and simple coding tasks.
Humanity's Last Exam, which measures reasoning across a variety of disciplines.
With the exception of Vellum AI, these are all individual benchmarks as well as destinations that host leaderboards or measurement data. Vellum is an AI product development platform.
See the following table, sourced from Vellum AI's 2025 leaderboard, for a depiction of top performers in areas such as reasoning, math and tool use. This data is current as of April 17, 2025, and changes frequently, often as new models are released. Vellum's model evaluation data comes from model providers, evaluations run independently by Vellum and the open source community.
| Category | Best | Second | Third | Fourth |
| --- | --- | --- | --- | --- |
| Reasoning (GPQA Diamond) | Grok 3 (Beta) | Gemini 2.5 Pro | OpenAI o3 | OpenAI o4-mini |
| Agentic Coding (SWE-bench) | Claude 3.7 Sonnet [R] | OpenAI o3 | OpenAI o4-mini | Gemini 2.5 Pro |
| High School Math (AIME 2024) | OpenAI o4-mini | Grok 3 (Beta) | Gemini 2.5 Pro | OpenAI o3 |
| Best Tool Use (BFCL) | Llama 3.1 405b | Llama 3.3 70b | GPT-4o | GPT-4.5 |
| Most Humanlike Thinking (Humanity's Last Exam) | OpenAI o3 | Gemini 2.5 Pro | OpenAI o4-mini | OpenAI o3-mini |
What are the limitations of LLM benchmarks?
One main limitation of LLM benchmarks is that they might not model an organization's specific problems. Take HumanEval, for example.
Using HumanEval to evaluate and choose a code generation tool only shows how well it creates Python programs from plain-English requests. This likely does not include complex 3D graphics, the ability to refactor existing code, the aesthetic quality of code produced, integration with CI/build tools or the ability to generate programs in any language other than Python.
Most of the benchmarks don't evaluate speed, latency or deployment-related tradeoffs such as infrastructure and security. They cannot test running the tools in a public cloud, a private server farm or on local machines, for example. HumanEval also does not consider GUI elements or possible errors with GUI applications, such as screen resizing errors, font problems and accessibility errors. There is currently no complex, reliable test for agentic AI either. Still, according to some, agentic AI will check code into version control, update databases, call the scripts to run CI and push code to production autonomously in the near future. There are some benchmarks for agentic AI in early stages of development, such as MARL-EVAL and Sotopia-π, but they have not been proven reliable.
Agentic AI is a newer iteration of generative AI and therefore has less reliable means of evaluating its performance.
AI models tend to reliably grasp only one mode of thinking at a time. The ability to translate words into a different language, solve math problems, adjust three-dimensional surfaces for output, identify items in a list that don't belong, recall facts and compose a paragraph -- these can be considered different types of human capabilities, especially when used in conjunction.
Likewise, many quality professionals have noted the ability to compose a work is distinct from the ability to assess what might go wrong in a risk- and time-adjusted way. Models and their benchmarks often do not account for this level of context. Finally, none of the benchmarks mentioned can model emotional intelligence, human decency or integrity.
A balanced scorecard could help determine what models make sense for an organization. The final analysis of what tools to use, what tasks to assign them, who should use them and when to use them is for the organization to judge.
Editor's note: The author would like to thank Bradley Baird and Yury Makedonov for their peer review.
Matt Heusser is managing director at Excelon Development, where he recruits, trains and conducts software testing and development.