Distributed Classification by Divide and Conquer Approach
Session Number
CMPS 19
Advisor(s)
Qiang Wu, Middle Tennessee State University
Discipline
Computer Science
Start Date
17-4-2024 11:05 AM
End Date
17-4-2024 11:20 AM
Abstract
In this paper, we investigate the efficacy of the divide and conquer approach for implementing distributed logistic regression and distributed support vector machine (SVM) algorithms for classification of large-scale datasets. This approach is designed to handle datasets that exceed the capacity of a single processor, necessitating the partitioning of data into multiple subsets. Logistic regression or SVM is then applied to each subset, yielding individual local classifiers. Subsequently, a global classifier is derived by aggregating these local classifiers to make the final decision. We propose three strategies for the aggregation stage: voting based on predicted labels, averaging of real-valued predictions, and averaging of posterior probabilities. Our analysis reveals that for distributed logistic regression, probability averaging is the most robust approach and is therefore recommended. Conversely, for distributed SVM, probability averaging requires additional modeling while having minimal impact on performance; functional averaging is therefore recommended instead.
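The divide-and-conquer pipeline described in the abstract can be sketched end to end. The following is a minimal illustration, not the authors' implementation: it uses a small synthetic two-class dataset, a plain gradient-descent logistic regression as the local learner, and all sizes, learning rates, and the number of partitions are arbitrary choices for demonstration. It shows the three aggregation strategies named above: voting on predicted labels, averaging of real-valued predictions (functional averaging), and averaging of posterior probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary data: two Gaussian blobs (a stand-in for a large-scale
# dataset; all sizes here are illustrative).
n, d = 2000, 5
X = np.vstack([rng.normal(-1.0, 1.0, size=(n // 2, d)),
               rng.normal(+1.0, 1.0, size=(n // 2, d))])
y = np.array([0] * (n // 2) + [1] * (n // 2))
perm = rng.permutation(n)
X, y = X[perm], y[perm]
X_train, y_train, X_test, y_test = X[:1500], y[:1500], X[1500:], y[1500:]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, lr=0.1, steps=500):
    """Plain gradient-descent logistic regression (bias folded into w)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        g = Xb.T @ (sigmoid(Xb @ w) - y) / len(y)
        w -= lr * g
    return w

def decision(w, X):
    """Real-valued score; sigmoid of this is the posterior probability."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return Xb @ w

# Divide: partition the training data into disjoint subsets.
# Conquer: fit one local classifier per subset.
n_parts = 5
local_ws = [fit_logreg(X_train[idx], y_train[idx])
            for idx in np.array_split(np.arange(len(X_train)), n_parts)]

# Local real-valued predictions on the test set, one row per local model.
scores = np.array([decision(w, X_test) for w in local_ws])

# Aggregate: the three strategies from the abstract.
y_vote = (np.mean(scores > 0, axis=0) > 0.5).astype(int)       # label voting
y_func = (np.mean(scores, axis=0) > 0).astype(int)             # functional averaging
y_prob = (np.mean(sigmoid(scores), axis=0) > 0.5).astype(int)  # probability averaging

for name, pred in [("voting", y_vote), ("functional", y_func),
                   ("probability", y_prob)]:
    print(f"{name}: accuracy = {np.mean(pred == y_test):.3f}")
```

On well-separated data like this all three aggregations perform similarly; the abstract's distinction matters for logistic regression versus SVM, where an SVM's decision values are not probabilities and probability averaging would require an extra calibration step.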