Distributed Classification by Divide and Conquer Approach

Session Number

CMPS 19

Advisor(s)

Qiang Wu, Middle Tennessee State University

Discipline

Computer Science

Start Date

17-4-2024 11:05 AM

End Date

17-4-2024 11:20 AM

Abstract

In this paper, we investigate the efficacy of the divide and conquer approach for implementing distributed logistic regression and distributed support vector machine (SVM) algorithms for the classification of large-scale datasets. This approach is designed to handle datasets that exceed the capacity of a single processor, necessitating the partitioning of data into multiple subsets. Logistic regression or SVM is then applied to each subset, yielding individual local classifiers. Subsequently, a global classifier is derived by aggregating these local classifiers to make the final decision. We propose three strategies for the aggregation stage: voting based on predicted labels, averaging of real-valued predictions, and averaging of posterior probabilities. Our analysis reveals that for distributed logistic regression, probability averaging is the most robust approach and is therefore recommended. Conversely, in the context of distributed SVM, probability averaging requires additional modeling while offering minimal performance benefit, so functional averaging is recommended instead.
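The pipeline described in the abstract can be sketched as follows. This is an illustrative sketch, not the authors' actual implementation: the synthetic dataset, the number of partitions, and the scikit-learn estimator are all assumptions made for demonstration. It partitions the training data, fits one local logistic regression per subset, then compares the three aggregation strategies (label voting, functional averaging of decision scores, and posterior-probability averaging).

```python
# Illustrative sketch (assumed setup, not the paper's experiments):
# divide-and-conquer distributed classification with three aggregation rules.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
X_train, y_train = X[:2400], y[:2400]
X_test, y_test = X[2400:], y[2400:]

# Divide: partition the training data into m disjoint subsets.
m = 6
parts = np.array_split(rng.permutation(len(X_train)), m)

# Conquer: fit one local classifier per subset.
local_clfs = [
    LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    for idx in parts
]

# Aggregate, three ways:
# 1) Majority vote on predicted labels (ties broken toward class 1).
votes = np.stack([clf.predict(X_test) for clf in local_clfs])
vote_pred = (votes.mean(axis=0) >= 0.5).astype(int)

# 2) Functional averaging: average the real-valued decision scores.
scores = np.stack([clf.decision_function(X_test) for clf in local_clfs])
func_pred = (scores.mean(axis=0) >= 0).astype(int)

# 3) Probability averaging: average the posterior probabilities.
probs = np.stack([clf.predict_proba(X_test)[:, 1] for clf in local_clfs])
prob_pred = (probs.mean(axis=0) >= 0.5).astype(int)

for name, pred in [("voting", vote_pred), ("functional", func_pred),
                   ("probability", prob_pred)]:
    print(name, "accuracy:", (pred == y_test).mean())
```

Note that logistic regression yields posterior probabilities directly, so strategy 3 is natural for it; for SVM, `predict_proba` requires an extra calibration step (e.g. Platt scaling), which is the "additional modeling" the abstract refers to.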
