Topic Modeling of Bot Preferences in Tweets from the COVID-19 Parliamentary Inquiry in Brazil
DOI:
https://doi.org/10.5433/1679-0375.2025.v46.52599
Keywords:
natural language processing, social media analysis, text mining, automated accounts detection, information dissemination
Abstract
Twitter is a microblogging social network that allows users to send and receive short messages in text and image format, and it is one of the most widely used platforms of its kind. Given its relevance to many aspects of contemporary society, including politics, many users operate automated profiles (bots) that post hundreds or even thousands of tweets. This study applies Latent Dirichlet Allocation (LDA) to determine whether bot users show a preference for any of the modeled topics. Data were collected via the Twitter API between April 18 and May 30, 2021, using the keywords "CPI" (Parliamentary Inquiry Committee) and "COVID," resulting in 459,145 tweets in Portuguese from 109,027 distinct users. These users were analyzed through the Pegabot platform, which estimates the probability that an account is a bot. After preprocessing, only users with 100 or more tweets during the period were retained, resulting in 26,966 observations from 189 accounts. LDA identified four main topics: 1 - Health Secretary; 2 - Senator Renan Calheiros; 3 - President Jair Bolsonaro; and 4 - the government. In all topics, posts from bot accounts were more frequent than posts from human users, with Topic 1 having the lowest proportion of accounts classified as bots.
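The filtering and comparison steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the toy records, the helper names `filter_active` and `bot_share_by_topic`, the lowered activity cutoff, and the 0.5 bot-probability threshold are all assumptions made for illustration. The sketch also assumes each tweet already carries a bot probability (as Pegabot would supply) and a dominant topic (as an LDA fit would supply), and it measures the share of tweets per topic rather than the share of accounts.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (user_id, bot_probability, dominant_topic).
# In the study, bot probabilities come from Pegabot and topics from LDA;
# the values below are invented purely for illustration.
tweets = [
    ("u1", 0.9, 1), ("u1", 0.9, 2), ("u1", 0.9, 2),
    ("u2", 0.2, 1), ("u2", 0.2, 1), ("u2", 0.2, 3),
    ("u3", 0.8, 4),  # too few tweets, dropped by the activity filter
]

MIN_TWEETS = 3        # the study used 100; lowered here for the toy data
BOT_THRESHOLD = 0.5   # assumed cutoff, not stated in the abstract


def filter_active(records, min_tweets):
    """Keep only tweets from users with at least `min_tweets` tweets."""
    counts = Counter(user for user, _, _ in records)
    return [r for r in records if counts[r[0]] >= min_tweets]


def bot_share_by_topic(records, threshold):
    """Return {topic: fraction of tweets posted by accounts flagged as bots}."""
    totals = defaultdict(int)
    bots = defaultdict(int)
    for _, prob, topic in records:
        totals[topic] += 1
        if prob >= threshold:
            bots[topic] += 1
    return {topic: bots[topic] / totals[topic] for topic in totals}


active = filter_active(tweets, MIN_TWEETS)
shares = bot_share_by_topic(active, BOT_THRESHOLD)
for topic in sorted(shares):
    print(f"Topic {topic}: {shares[topic]:.2f} of tweets from likely bots")
```

Setting `MIN_TWEETS` to 100 would reproduce the activity filter stated in the abstract; the bot threshold would need to follow whatever rule the full article specifies.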
License
Copyright (c) 2025 Gabriel Thompson Marques Arruda, Anderson Castro Soares de Oliveira, Lia Hanna Martins Morita, José Nilton da Cruz

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
The Copyright Declaration for articles published in this journal is the author's right. Since manuscripts are published in an open access Journal, they are free to use, with their own attributions, in educational and non-commercial applications. The Journal has the right to make, in the original document, changes regarding linguistic norms, orthography, and grammar, with the purpose of ensuring the standard norms of the language and the credibility of the Journal. It will, however, respect the writing style of the authors. When necessary, conceptual changes, corrections, or suggestions will be forwarded to the authors. In such cases, the manuscript shall be subjected to a new evaluation after revision. Responsibility for the opinions expressed in the manuscripts lies entirely with the authors.
This journal is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.