Evaluating LLM Arithmetic Capabilities Using External Tools
Session Number
CMPS(ai) 02
Advisor(s)
Murat Keçeli, Argonne National Laboratory
Discipline
Computer Science
Start Date
17-4-2025 2:15 PM
End Date
17-4-2025 2:30 PM
Abstract
The rapid advancement of Large Language Models (LLMs) has significantly enhanced natural language processing, coupling human-like text generation with a comparable level of reasoning. Although LLMs have shown success in many fields, their arithmetic accuracy remains inconsistent. This study benchmarks LLMs' strengths, limitations, and practical usefulness when their arithmetic capabilities are applied to realistic tasks. By building a benchmark that evaluates LLM-generated answers against human evaluations, we assess their numerical reasoning and problem-solving effectiveness across a range of difficulty levels and scenarios. Our methodology involves a literature review, analysis of existing benchmarks, and an iterative process of developing a new evaluation framework to assess various LLMs, including both widely used public models and newer, still-developing ones, in order to capture the current state of LLM performance in arithmetic. Additionally, we explore the integration of external tools to improve LLMs' arithmetic accuracy, with the goal of refining LLM benchmarking standards and informing their use in larger scientific and technical workflows.
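The abstract does not specify the evaluation framework in code, but a minimal sketch of the kind of exact-match arithmetic harness it describes might look like the following. The query_llm stub, the problem generator, and all parameter choices are illustrative assumptions for this sketch, not the study's actual implementation.

```python
import operator
import random

# Hypothetical stand-in for the model under evaluation; replace with a real
# API call. Here it always answers "0" so the harness runs end to end.
def query_llm(prompt: str) -> str:
    return "0"

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def make_problem(digits: int) -> tuple[str, int]:
    """Generate a random two-operand arithmetic problem and its exact answer."""
    a = random.randint(10 ** (digits - 1), 10 ** digits - 1)
    b = random.randint(10 ** (digits - 1), 10 ** digits - 1)
    op = random.choice(list(OPS))
    return f"{a} {op} {b}", OPS[op](a, b)

def exact_match_accuracy(n_problems: int = 100, digits: int = 6) -> float:
    """Fraction of problems the model answers with the exact integer result."""
    correct = 0
    for _ in range(n_problems):
        expr, truth = make_problem(digits)
        reply = query_llm(f"Compute {expr}. Reply with only the integer result.")
        try:
            correct += int(reply.strip().replace(",", "")) == truth
        except ValueError:
            pass  # a malformed reply counts as incorrect
    return correct / n_problems

if __name__ == "__main__":
    print(f"Exact-match accuracy: {exact_match_accuracy():.2%}")
```

The same harness could also compare a tool-augmented configuration, for example one in which the model emits an expression that is then executed exactly by a calculator or interpreter, against the model's unaided answers.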