IVML  

E. Dritsas, M. Trigka, G. Vonitsanos, A. Kanavos, Ph. Mylonas
An Apache Spark Implementation for Text Document Clustering
17th International Workshop on Semantic & Social Media Adaptation & Personalization (SMAP '22), November 3-4, 2022, Online
ABSTRACT
As the volume of data generated and stored daily keeps increasing, so does the need for techniques that automatically discover information in it. Text mining addresses this need with methods drawn from data mining, information retrieval, machine learning, and natural language processing. This paper addresses the problem of extracting textual information from large collections of documents by efficiently exploiting clustering techniques in a cloud computing infrastructure. Clustering was performed with three algorithms: k-Means, Bisecting k-Means, and the Gaussian Mixture Model (GMM). To evaluate the quality of these methods, we ran experiments in the Apache Spark distributed environment on several well-known datasets whose documents have been manually clustered.
03 November 2022