Short text classification applied to item description: Some methods evaluation




Text classification, Product description, Short text, Logistic regression, Bag of words


The growing demand for content-based information classification in the age of social media and e-commerce has created a need to classify products automatically from their descriptions. This study evaluates several techniques for this task, with a focus on descriptions written in Portuguese. A pipeline is implemented to preprocess the data, including lowercasing, accent removal, and unigram tokenization. The bag-of-words method is then used to convert the text into numerical data, and five classification techniques are applied: three from information retrieval (argmaxtf, argmaxtfnorm, and argmaxtfidf) and two machine learning methods (logistic regression and support vector machines). The performance of each technique is evaluated by simple accuracy under thirty-fold cross-validation. The results show that logistic regression achieves the highest mean accuracy among the evaluated techniques.
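
The preprocessing and bag-of-words steps named in the abstract can be illustrated with a minimal sketch. It uses only the Python standard library and is not the authors' implementation: the function names (`preprocess`, `build_vocabulary`, `bag_of_words`), the whitespace-based tokenization, and the sample descriptions are all assumptions made for illustration.

```python
import unicodedata
from collections import Counter


def preprocess(text):
    """Lowercase, strip accents, and split into unigram tokens.

    Whitespace splitting is an assumption; the original tokenizer may differ.
    """
    lowered = text.lower()
    # NFD decomposition separates base letters from combining accent marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unaccented.split()


def build_vocabulary(token_lists):
    """Map each distinct token to a column index (sorted for reproducibility)."""
    distinct = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: idx for idx, tok in enumerate(distinct)}


def bag_of_words(tokens, vocabulary):
    """Return a term-count vector; tokens outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for tok, count in Counter(tokens).items():
        idx = vocabulary.get(tok)
        if idx is not None:
            vector[idx] = count
    return vector


# Hypothetical Portuguese item descriptions, invented for this example.
descriptions = ["Café Torrado 500g", "Açúcar Refinado 1kg", "Café Solúvel café"]
tokens = [preprocess(d) for d in descriptions]
vocabulary = build_vocabulary(tokens)
vectors = [bag_of_words(t, vocabulary) for t in tokens]
```

Each entry of `vectors` is a row of term counts that could then be passed to any of the classifiers mentioned above; accent removal ensures that "Café" and "café" fall into the same column.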

Author Biographies

Gilsiley Henrique Darú, Universidade de São Paulo - USP

MSc., Institute of Mathematics and Computer Sciences, USP, São Carlos, São Paulo

Felipe Daltrozo da Motta, Universidade de São Paulo - USP

MSc., Institute of Mathematics and Computer Sciences, USP, São Carlos, São Paulo

Antonio Castelo, Universidade de São Paulo - USP

Prof. Dr., Institute of Mathematics and Computer Sciences, USP, São Carlos, São Paulo

Gustavo Valentim Loch, Universidade Federal do Paraná - UFPR

Prof. Dr., PPGMNE, Universidade Federal do Paraná - UFPR, Curitiba, PR


Original Article