| G. Vonitsanos, A. Kanavos, Ph. Mylonas |
| A Systematic Comparison of Statistical and Neural Frameworks for Spanish POS Tagging |
| IEEE International Conference on Big Data (IEEE BigData 2025), December 8-11, 2025, Macau, China |
|
ABSTRACT
|
| This paper presents a systematic comparison of statistical and neural frameworks for Spanish Part-of-Speech (POS) tagging, focusing on three widely used NLP toolkits: NLTK, spaCy, and Stanza. A unified experimental protocol was implemented using the Spanish portion of the CoNLL-2002 corpus, with consistent preprocessing, sentence reconstruction, and an XPOS→UPOS mapping to ensure cross-framework comparability. The results show that NLTK's statistical n-gram backoff tagger achieves the highest overall performance (97.16% accuracy, 1,373 errors) with negligible runtime (0.39 s), confirming the strong advantage of corpus-aligned tagsets and lightweight probabilistic modeling. Among the neural systems, Stanza delivers higher linguistic fidelity (87.40% F1) and fewer severe confusion errors than spaCy, but incurs substantial computational overhead due to expensive initialization and BiLSTM inference. spaCy offers significantly faster processing yet exhibits the highest error count, reflecting the limitations of compact UPOS-based pipelines when evaluated against fine-grained XPOS annotations. Overall, the study demonstrates how corpus–model alignment, tagset granularity, and architectural complexity jointly shape accuracy, stability, and runtime efficiency.
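To make the two techniques named in the abstract concrete, the following is a minimal sketch of an n-gram backoff tagger (bigram, then unigram, then a default tag) together with an XPOS→UPOS mapping. It is not the authors' implementation and does not use NLTK: the function names, the toy training sentences, the default tag, and the contents of the mapping table are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

# Toy training data with CoNLL-2002-style tags (illustrative, not from the paper).
TRAIN = [
    [("El", "DA"), ("gato", "NC"), ("duerme", "VMI"), (".", "Fp")],
    [("La", "DA"), ("casa", "NC"), ("es", "VSI"), ("grande", "AQ"), (".", "Fp")],
]

# Hypothetical XPOS -> UPOS table; the paper's actual mapping may differ.
XPOS_TO_UPOS = {
    "DA": "DET", "NC": "NOUN", "VMI": "VERB",
    "VSI": "AUX", "AQ": "ADJ", "Fp": "PUNCT",
}


def train_backoff_tagger(tagged_sentences, default_tag="NC"):
    """Build a tagger that backs off: bigram -> unigram -> default tag."""
    unigram = defaultdict(Counter)  # word -> tag counts
    bigram = defaultdict(Counter)   # (previous tag, word) -> tag counts
    for sentence in tagged_sentences:
        previous = "<s>"  # sentence-start marker
        for word, tag in sentence:
            unigram[word][tag] += 1
            bigram[(previous, word)][tag] += 1
            previous = tag

    def tag(words):
        previous, output = "<s>", []
        for word in words:
            if (previous, word) in bigram:    # most specific context first
                choice = bigram[(previous, word)].most_common(1)[0][0]
            elif word in unigram:             # back off to the word alone
                choice = unigram[word].most_common(1)[0][0]
            else:                             # unseen word
                choice = default_tag
            output.append((word, choice))
            previous = choice
        return output

    return tag


def to_upos(tagged_words, mapping=XPOS_TO_UPOS):
    """Map fine-grained tags to coarse tags; unmapped tags become 'X'."""
    return [(word, mapping.get(tag, "X")) for word, tag in tagged_words]


def accuracy(predicted, gold):
    """Token-level accuracy between two equally long tagged sequences."""
    correct = sum(p == g for (_, p), (_, g) in zip(predicted, gold))
    return correct / len(gold)


tagger = train_backoff_tagger(TRAIN)
predicted = tagger(["La", "casa", "grande"])  # "grande" is resolved by unigram backoff
```

In this sketch, a word seen in training but in a new context falls back to its most frequent tag, and an unseen word receives the default tag. Applying `to_upos` to both predictions and gold annotations before calling `accuracy` is one way to compare systems that emit different tagsets on a common footing.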
|
| 08 December 2025 |
| G. Vonitsanos, A. Kanavos, Ph. Mylonas, "A Systematic Comparison of Statistical and Neural Frameworks for Spanish POS Tagging", IEEE International Conference on Big Data (IEEE BigData 2025), December 8-11, 2025, Macau, China |