An R Package to Determine Sample Size for Desired Predictive Power in Linear Regression

Session Number

CMPS 36

Advisor(s)

Zheyang Wu, Worcester Polytechnic Institute

Discipline

Computer Science

Start Date

17-4-2025 2:45 PM

End Date

17-4-2025 3:00 PM

Abstract

In predictive studies, it is important to know how much data is needed to reach a given prediction accuracy. Models built on larger datasets typically predict more accurately but are more expensive, so knowing the minimal sample size that achieves a target accuracy level can save costs. This project develops an R package that computes an efficient sample size needed to meet a given criterion for predictive accuracy. The criterion is based on the predictive mean squared error (PMSE) and the relative proportional reduction of PMSE. The package's function accepts different measures of effect size as inputs and accounts for the number of predictors and their correlations. It can take parameters such as predictor error variance and the covariance matrix of the predictors, or alternative quantities such as R² and Cohen’s f², to obtain a sample size. These values can typically be obtained from previous studies or from appropriate assumptions. The function is applied to a real study of pain intensity in older black women, demonstrating its practical use in planning prediction-related studies. The calculations are also validated by simulations on the data.
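
To make the idea of a PMSE-based sample size criterion concrete, the following is a minimal sketch in Python and not the package's R interface. It rests on assumptions that do not come from the abstract: predictors are multivariate normal, the model is ordinary least squares with an intercept, and the criterion is a target value for the proportional reduction of PMSE relative to the outcome variance. Under those assumptions the expected PMSE has a standard closed form, sigma^2 (n + 1)(n - 2) / (n (n - p - 2)), which does not depend on the predictor covariance matrix. All function names here are hypothetical.

```python
def f2_to_r2(f2):
    """Convert Cohen's f^2 to R^2 using f^2 = R^2 / (1 - R^2)."""
    return f2 / (1.0 + f2)


def pmse_ratio(n, p):
    """Expected PMSE divided by the error variance sigma^2.

    Assumption: OLS with an intercept and p multivariate normal predictors,
    for which E[PMSE] = sigma^2 * (n + 1)(n - 2) / (n (n - p - 2)), n > p + 2.
    """
    if n <= p + 2:
        raise ValueError("n must exceed p + 2 for the expectation to exist")
    return (n + 1) * (n - 2) / (n * (n - p - 2))


def predictive_r2(n, p, r2):
    """Proportional reduction of PMSE relative to the outcome variance.

    With sigma^2 = (1 - R^2) * Var(y), this equals 1 - (1 - R^2) * pmse_ratio.
    """
    return 1.0 - (1.0 - r2) * pmse_ratio(n, p)


def min_sample_size(p, r2, target, n_max=1_000_000):
    """Smallest n whose expected proportional PMSE reduction reaches target.

    pmse_ratio decreases in n, so the first n that meets the target is minimal.
    """
    if not 0.0 <= target < r2 < 1.0:
        raise ValueError("require 0 <= target < r2 < 1")
    for n in range(p + 3, n_max + 1):
        if predictive_r2(n, p, r2) >= target:
            return n
    raise ValueError("target not reached within n_max")
```

For instance, under these assumptions `min_sample_size(p=5, r2=0.5, target=0.45)` returns 67, and an effect size given as Cohen's f² can first be converted with `f2_to_r2`. The package described here may use a different criterion or a more general derivation.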

