IVML - publications



A. Alexopoulos, G. Drakopoulos, A. Kanavos, Ph. Mylonas, G. Vonitsanos

Two-Step Classification with SVD Preprocessing of Distributed Massive Datasets in Apache Spark

Algorithms 2020, 13(3), 71, MDPI, March 2020

ABSTRACT

At the dawn of the 10V data era, a considerable number of sources such as smartphones, IoT devices, social media, smart city sensors, and the health care system constitute but a small portion of the data lakes feeding the entire big data ecosystem. It is even possible to generate synthetic data which fulfil certain criteria so that rare phenomena can be simulated and thoroughly examined. Big data growth poses two primary challenges, namely storage and processing. Regarding the former, new technologies enable long-term, efficient, and reliable massive data storage. Concerning the latter, new large-scale frameworks have been developed, including distributed platforms such as the Hadoop ecosystem. Classification is a major machine learning task, and many algorithmic techniques have been developed for it. In this work, a two-step architecture is proposed both for performing classification and for determining which dataset attributes contribute the most to classification in terms of precision, recall, and the F1 metric. The proposed architecture first applies a singular value decomposition to the matrix of observations in order to select a few, possibly transformed, features capturing the essence of the remaining ones. Then, various classifiers are applied to the reduced attribute set. The intuition behind this approach stems from the engineering principle of breaking down complex problems into simpler and more manageable tasks. The proposed architecture was tested with readily available Spark MLlib classifiers on the well-known Higgs and PAMAP datasets. The experiments, all run on the same Spark cluster, indicate that the two-step architecture outperforms the individual classifiers with respect to the three aforementioned metrics.
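The two-step idea described in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' Spark implementation: NumPy stands in for Spark MLlib, a nearest-centroid rule stands in for the MLlib classifiers, and the data are synthetic rather than Higgs or PAMAP. Centering the observations before the decomposition, the number of retained components k, and all function names are assumptions made for illustration only.

```python
import numpy as np


def make_synthetic(n, d, seed):
    """Hypothetical two-class data: only the first two attributes carry signal."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, d))
    X[:, 0] += 3.0 * y
    X[:, 1] -= 3.0 * y
    return X, y


def svd_reduce(X_train, X_test, k):
    """Step 1: project both splits onto the top-k right singular vectors
    of the (centered, by assumption) training observation matrix."""
    mean = X_train.mean(axis=0)
    _, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
    basis = vt[:k].T  # shape (n_features, k)
    return (X_train - mean) @ basis, (X_test - mean) @ basis


def nearest_centroid_fit(Z, y):
    """Step 2 (stand-in classifier): store one centroid per class."""
    labels = np.unique(y)
    centroids = np.stack([Z[y == c].mean(axis=0) for c in labels])
    return labels, centroids


def nearest_centroid_predict(model, Z):
    """Assign each row of Z to the class of its closest centroid."""
    labels, centroids = model
    dists = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
    return labels[dists.argmin(axis=1)]


def precision_recall_f1(y_true, y_pred, positive=1):
    """Binary precision, recall, and F1 for the given positive label."""
    tp = int(np.sum((y_pred == positive) & (y_true == positive)))
    fp = int(np.sum((y_pred == positive) & (y_true != positive)))
    fn = int(np.sum((y_pred != positive) & (y_true == positive)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    X_train, y_train = make_synthetic(400, 10, seed=0)
    X_test, y_test = make_synthetic(200, 10, seed=1)
    Z_train, Z_test = svd_reduce(X_train, X_test, k=3)
    model = nearest_centroid_fit(Z_train, y_train)
    y_pred = nearest_centroid_predict(model, Z_test)
    print(precision_recall_f1(y_test, y_pred))
```

Comparing these metrics against the same classifier trained on the full attribute set would mirror, on a toy scale, the kind of comparison the abstract reports; the sketch makes no claim about which variant scores higher.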

21 March 2020

