Short text classification applied to item description: Some methods evaluation

DOI:

https://doi.org/10.5433/1679-0375.2022v43n2p189

Keywords:

Text classification, Product description, Short text, Logistic regression, Bag of words

Abstract

The growing demand for content-based classification of information in the age of social media and e-commerce has created a need for automated product classification from product descriptions. This study evaluates several techniques for this task, focusing on descriptions written in Portuguese. A pipeline preprocesses the data by lowercasing, removing accents, and tokenizing into unigrams. The bag-of-words method then converts the text into numerical data, and five classification techniques are applied: three from information retrieval (argmaxtf, argmaxtfnorm, and argmaxtfidf) and two machine learning methods (logistic regression and support vector machines). Each technique is evaluated by simple accuracy under thirty-fold cross-validation. The results show that logistic regression achieves the highest mean accuracy among the evaluated techniques.
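
The preprocessing and bag-of-words steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `preprocess` and `bag_of_words`, the tokenization rule (runs of letters or digits), and the two sample descriptions are assumptions made here for illustration only.

```python
import re
import unicodedata
from collections import Counter


def preprocess(description: str) -> list[str]:
    """Lowercase, strip accents, and split into unigram tokens."""
    lowered = description.lower()
    # NFKD splits an accented letter into a base letter plus combining marks;
    # dropping the combining marks removes the accent.
    decomposed = unicodedata.normalize("NFKD", lowered)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: a unigram is a maximal run of ASCII letters or digits.
    return re.findall(r"[a-z0-9]+", unaccented)


def bag_of_words(descriptions: list[str]) -> tuple[list[str], list[list[int]]]:
    """Return a sorted vocabulary and one term-count row per description."""
    token_lists = [preprocess(d) for d in descriptions]
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    index = {term: i for i, term in enumerate(vocabulary)}
    matrix = []
    for tokens in token_lists:
        row = [0] * len(vocabulary)
        for term, count in Counter(tokens).items():
            row[index[term]] = count
        matrix.append(row)
    return vocabulary, matrix


# Made-up item descriptions, not taken from the study's dataset.
vocabulary, matrix = bag_of_words(["Feijão Preto 1kg", "Açúcar Refinado 1kg"])
```

Each row of `matrix` is the numerical representation of one description, with one column per vocabulary term. A classifier such as logistic regression would be trained on these rows together with the category labels.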

Author Biographies

Gilsiley Henrique Darú, Universidade de São Paulo - USP

MSc., Institute of Mathematics and Computer Sciences, USP, São Carlos, São Paulo

Felipe Daltrozo da Motta, Universidade de São Paulo - USP

MSc., Institute of Mathematics and Computer Sciences, USP, São Carlos, São Paulo

Antonio Castelo, Universidade de São Paulo - USP

Prof. Dr., Institute of Mathematics and Computer Sciences, USP, São Carlos, São Paulo

Gustavo Valentim Loch, Universidade Federal do Paraná - UFPR

Prof. Dr., PPGMNE, Universidade Federal do Paraná - UFPR, Curitiba, PR

Published

2022-12-30

How to Cite

Darú, G. H., Motta, F. D. da M., Castelo, A., & Loch, G. V. (2022). Short text classification applied to item description: Some methods evaluation. Semina: Ciências Exatas E Tecnológicas, 43(2), 189–198. https://doi.org/10.5433/1679-0375.2022v43n2p189

Section

Original Article