Evaluating LLM Arithmetic Capabilities Using External Tools

Session Number

CMPS(ai) 02

Advisor(s)

Murat Keçeli, Argonne National Laboratory

Discipline

Computer Science

Start Date

17-4-2025 2:15 PM

End Date

17-4-2025 2:30 PM

Abstract

Rapid advances in Large Language Models (LLMs) have significantly enhanced natural language processing, coupling human-like text generation with a comparable level of reasoning. Although LLMs have shown success across many fields, their arithmetic accuracy remains largely inconsistent. This study benchmarks LLMs' strengths, limitations, and practical implications when their arithmetic capabilities are applied to realistic tasks. By constructing a benchmark that evaluates LLM-generated answers against human evaluations, we assess their numerical reasoning and problem-solving effectiveness across a range of difficulty levels and scenarios. Our methodology involves a literature review, analysis of existing benchmarks, and an iterative process of developing a new evaluation framework to assess a variety of LLMs, including models commonly used by the public as well as emerging ones, in order to capture the current state of LLM arithmetic performance. Additionally, we explore the integration of external tools to improve LLMs' arithmetic accuracy, with the goal of refining LLM benchmarking standards and informing recommendations for their use in larger scientific and technical workflows.
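To illustrate the kind of tool-assisted checking described above, the sketch below shows one way an LLM's arithmetic answer could be scored against a reference value computed by an external tool (here, Python itself). It is a minimal illustration under stated assumptions, not the study's actual framework: `query_llm` is a hypothetical stand-in for whatever model API is used, and the extraction and tolerance choices are illustrative.

```python
# Minimal sketch of tool-assisted arithmetic scoring (illustrative only).
import math
import re


def query_llm(prompt: str) -> str:
    """Hypothetical model call; replace with a real API client."""
    raise NotImplementedError


def extract_number(text: str) -> float | None:
    """Pull the last numeric token out of a free-form model response."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None


def score_arithmetic(expression: str, rel_tol: float = 1e-6) -> bool:
    """Compare the model's answer to a reference computed by an external tool."""
    # External tool: Python evaluates the expression exactly.
    # Assumes benchmark expressions are trusted inputs.
    reference = eval(expression, {"__builtins__": {}}, {})
    response = query_llm(f"Compute {expression}. Reply with the number only.")
    answer = extract_number(response)
    return answer is not None and math.isclose(answer, reference, rel_tol=rel_tol)
```

In a full benchmark, a scoring function like this would be applied across many expressions and models, with accuracy compared between tool-free and tool-assisted settings.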
