Topic Modeling of Bot Preferences in Tweets from the COVID-19 Parliamentary Inquiry in Brazil

Authors

DOI:

https://doi.org/10.5433/1679-0375.2025.v46.52599

Keywords:

natural language processing, social media analysis, text mining, automated accounts detection, information dissemination

Abstract

Twitter is a microblogging social network that allows users to send and receive short messages in text and image format, and it is one of the most widely used platforms of its kind. Given its relevance to many aspects of contemporary society, including politics, many users operate automated profiles (bots) that post hundreds or even thousands of tweets. This study applies Latent Dirichlet Allocation (LDA) to determine whether bot accounts show a preference for any of the modeled topics. Data were collected via the Twitter API between April 18 and May 30, 2021, using the keywords "CPI" (Parliamentary Inquiry Committee) and "COVID," yielding 459,145 tweets in Portuguese from 109,027 distinct users. These users were analyzed with the Pegabot platform, which estimates the probability that an account is a bot. After preprocessing, only users with 100 or more tweets during the period were retained, resulting in 26,966 observations from 189 accounts. LDA identified four main topics: (1) the Health Secretary; (2) Senator Renan Calheiros; (3) President Jair Bolsonaro; and (4) the government. In all four topics, posts from bot accounts were more frequent than posts from human users, with Topic 1 having the lowest proportion of accounts classified as bots.
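The two bookkeeping steps around the topic model, retaining only accounts with at least 100 tweets and then comparing bot and human posting per topic, can be sketched as follows. This is a minimal illustration and not the authors' code: the record layout, the function names `retain_active_users` and `bot_share_by_topic`, and the 0.5 cutoff on the bot probability are all assumptions, and the topic labels are taken as already assigned by a fitted LDA model.

```python
from collections import Counter, defaultdict

MIN_TWEETS = 100     # retention rule stated in the abstract
BOT_THRESHOLD = 0.5  # ASSUMPTION: cutoff on the bot probability; not given in the abstract


def retain_active_users(tweets, min_tweets=MIN_TWEETS):
    """Keep only tweets whose author posted at least `min_tweets` tweets."""
    counts = Counter(t["user"] for t in tweets)
    return [t for t in tweets if counts[t["user"]] >= min_tweets]


def bot_share_by_topic(tweets, bot_prob, threshold=BOT_THRESHOLD):
    """Return {topic: fraction of tweets posted by accounts flagged as bots}.

    `bot_prob` maps each user to an estimated bot probability (hypothetical
    stand-in for a score such as the one Pegabot reports).
    """
    totals = defaultdict(int)
    bots = defaultdict(int)
    for t in tweets:
        totals[t["topic"]] += 1
        if bot_prob[t["user"]] >= threshold:
            bots[t["topic"]] += 1
    return {topic: bots[topic] / totals[topic] for topic in sorted(totals)}


if __name__ == "__main__":
    # Toy records (hypothetical); a real run would use the LDA topic of each tweet.
    toy = [
        {"user": "a", "topic": 1}, {"user": "a", "topic": 2},
        {"user": "a", "topic": 2}, {"user": "b", "topic": 1},
        {"user": "b", "topic": 2}, {"user": "c", "topic": 1},
    ]
    kept = retain_active_users(toy, min_tweets=2)  # small threshold for the toy data
    print(bot_share_by_topic(kept, {"a": 0.9, "b": 0.2, "c": 0.8}))
```

A share above 0.5 for a topic means that posts from flagged accounts outnumber posts from the remaining accounts in that topic, which is the kind of comparison the abstract reports.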

Author Biographies

Gabriel Thompson Marques Arruda, Universidade Federal de Mato Grosso

Bachelor of Science in Statistics, Federal University of Mato Grosso, Cuiabá, Mato Grosso, Brazil.

Anderson Castro Soares de Oliveira, Universidade Federal de Mato Grosso

Professor with Doctorate, Department of Statistics, Federal University of Mato Grosso, Cuiabá, Mato Grosso, Brazil.

Lia Hanna Martins Morita, Universidade Federal de Mato Grosso

Professor with Doctorate, Department of Statistics, Federal University of Mato Grosso, Cuiabá, Mato Grosso, Brazil.

José Nilton da Cruz, Universidade Federal de Mato Grosso

Professor with Doctorate, Department of Statistics, Federal University of Mato Grosso, Cuiabá, Mato Grosso, Brazil.

Published

2025-07-08

How to Cite

Arruda, G. T. M., de Oliveira, A. C. S., Morita, L. H. M., & da Cruz, J. N. (2025). Topic Modeling of Bot Preferences in Tweets from the COVID-19 Parliamentary Inquiry in Brazil. Semina: Ciências Exatas e Tecnológicas, 46, e52599. https://doi.org/10.5433/1679-0375.2025.v46.52599

Section

Statistic