A Simulation-Based Sample Size Determination Package in R for Prediction Models
Session Number
3
Advisor(s)
Zheyang Wu, Worcester Polytechnic Institute
Location
A129
Discipline
Computer Science
Start Date
15-4-2026 2:15 PM
End Date
15-4-2026 3:00 PM
Abstract
In predictive studies, it is important to know how much data a given model needs to predict with reliable accuracy. In principle, larger datasets allow for more accurate prediction but are more expensive to collect, so knowing the minimum amount of data needed to reach a target level of predictive accuracy can make a study more cost-effective. Existing approaches include general rules of thumb (e.g., 10 samples per parameter in the predictive model) and other criteria-based methods for determining sample size, but these do not always translate directly to the predictive performance metrics that investigators care about (e.g., AUC, MSE). This project develops an R package that estimates the minimum sample size needed to achieve a user-specified criterion in predictive performance using a general simulation-based framework. In this framework, users specify models, metrics, and data generation rules, and the framework identifies the smallest sample size that satisfies predefined accuracy criteria. In principle, it allows users to specify complex models (e.g., decision trees, neural networks) and arbitrary metrics for realistic studies where analytical sample size calculation is difficult. We created a workflow, implemented it as an R package, tested it on real datasets, and compared it with other methods to verify its effectiveness.
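The simulation-based idea described in the abstract can be illustrated with a minimal sketch. This is not the package's R interface, which the abstract does not show. It is a language-agnostic illustration written in Python, and every name in it (`simulate_data`, `fit_linear`, `mse`, `expected_metric`, `min_sample_size`) is hypothetical. It assumes a simple data generation rule (one Gaussian predictor, linear signal, Gaussian noise), a one-predictor least-squares model, MSE as the metric, and a criterion of the form "average test MSE at or below a target".

```python
import random
import statistics


def simulate_data(n, rng, slope=2.0, noise_sd=1.0):
    """Hypothetical data generation rule: y = slope * x + Gaussian noise."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [slope * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def fit_linear(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def mse(model, xs, ys):
    """Mean squared prediction error of a fitted (intercept, slope) model."""
    a, b = model
    return statistics.fmean([(y - (a + b * x)) ** 2 for x, y in zip(xs, ys)])


def expected_metric(n, n_sims, rng, n_test=1000):
    """Average test MSE over n_sims simulated training sets of size n."""
    scores = []
    for _ in range(n_sims):
        model = fit_linear(*simulate_data(n, rng))
        # Evaluate on a fresh simulated test set from the same rule.
        scores.append(mse(model, *simulate_data(n_test, rng)))
    return statistics.fmean(scores)


def min_sample_size(candidates, target_mse, n_sims=200, seed=1):
    """Smallest candidate n whose average simulated test MSE meets the target.

    Returns None if no candidate satisfies the criterion.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    for n in sorted(candidates):
        if expected_metric(n, n_sims, rng) <= target_mse:
            return n
    return None
```

A call such as `min_sample_size([10, 40, 160], target_mse=1.12)` scans the candidate sizes in increasing order and stops at the first one that meets the criterion. Swapping in a different model, metric, or data generation rule only requires replacing the corresponding function, which is the flexibility the abstract attributes to the simulation-based approach. A practical implementation would likely also account for Monte Carlo error in the estimated metric and search the sample size grid more efficiently.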