Topic Modeling of Bot Preferences in Tweets from the COVID-19 Parliamentary Inquiry in Brazil
Arruda, G. T. M.; de Oliveira, A. C. S.; Morita, L. H. M.; da Cruz, J. N.
DOI 10.5433/1679-0375.2025.v46.52599
Citation Semin., Ciênc. Exatas Tecnol. 2025, v. 46: e52599
Received: March 26, 2025 Received in revised form: June 11, 2025 Accepted: June 23, 2025 Available online: July 8, 2025
Abstract:
Twitter is a microblogging social network that allows users to send and receive short messages in text and image format, and it is one of the most widely used platforms of its kind. Given its relevance in various aspects of contemporary society, including politics, many users operate automated profiles (bots) that post hundreds or even thousands of tweets. This study applies the Latent Dirichlet Allocation (LDA) technique to identify whether bot users show preference for any of the modeled topics. Data were collected via the Twitter API between April 18 and May 30, 2021, using the keywords “CPI” (Parliamentary Inquiry Committee) and “COVID,” resulting in 459,145 tweets in Portuguese from 109,027 distinct users. These users were analyzed through the Pegabot platform, which estimates the probability of an account being a bot. After preprocessing, only users with 100 or more tweets during the period were retained, resulting in 26,966 observations from 189 accounts. LDA identified four main topics: 1 - the Health Secretary; 2 - Senator Renan Calheiros; 3 - President Jair Bolsonaro; and 4 - the government. In all topics, posts from bot accounts were more frequent than posts from human users, with Topic 1 having the lowest proportion of accounts classified as bots.
Keywords: natural language processing, social media analysis, text mining, automated accounts detection, information dissemination
Introduction
The internet has revolutionized social interactions, providing users with the ability to create, modify, interpret, and evaluate a vast array of content. In this context, social networks have emerged as the primary platforms for online interaction. According to the 2023 global report on the digital world, approximately 5 billion people are users of at least one social media platform, representing more than half of the global population. This number indicates that digitalization is a consolidated reality, with a massive volume of data being continuously generated.
Social media platforms have become privileged spaces for the production and dissemination of various types of content, transforming users into active producers of information, opinions, and ideas. To maximize information dissemination, the activity of automated profiles (bots) on social media has become increasingly common. These bots are programmed to focus on specific content and to attract the attention of targeted segments of users.
Given the large volume of data generated daily and the increasing presence of bots, it is essential to develop tools to monitor and analyze the behavior and patterns of these automated agents. The process of text mining involves multiple stages, including information extraction, data cleaning, classification, and clustering. Topic modeling, particularly through Latent Dirichlet Allocation (LDA), aims to classify text by identifying latent topics within documents. LDA is a probabilistic modeling technique that assumes each document is a mixture of topics, and each topic is characterized by a distribution over words.
The vast amount of data on social media poses challenges to traditional analytical methods. In the present study, the LDA technique was employed for topic modeling, enabling a deeper understanding of user behavior and the operational patterns of bots on social networks. Given the high volume of data generated on Twitter and the need to structure this information, the main objective of this study is to apply topic modeling techniques, with emphasis on the LDA model, to identify the preferences of users previously classified as bots by the Pegabot platform.
The application of the LDA model to Twitter involves collecting, organizing, and modeling data using the aforementioned technique in order to determine whether users identified as bots exhibit preferences for specific topics generated by the model. Thus, the study seeks to answer the following research question: how can the LDA technique be used to identify and analyze the influence of bot activity in the dissemination of information and the formation of topics on Twitter?
Theoretical framework
Social networks
Social networks are essential tools in contemporary society, connecting individuals and organizations based on shared interests, and facilitating real-time communication and information dissemination. In 2023, approximately 5 billion people used social media platforms, representing 62.3% of the global population.
Platforms such as Facebook, Instagram, LinkedIn, Twitter, YouTube, Spotify, and Netflix serve distinct roles, ranging from social interaction and content sharing to business promotion and information diffusion. The broad reach of these media enables the amplification of discourse, the construction of digital identities, and the emergence of new forms of political and cultural engagement.
In addition to fostering social and professional interactions, social media platforms have become strategic spaces for digital marketing, social mobilization, and news circulation. Companies and influencers use these tools to strengthen brands and engage audiences, while governments and institutions explore their potential for public communication and political campaigns.
However, the widespread use of these platforms also brings challenges such as the spread of misinformation, manipulation of debates through bots, and threats to user privacy. Therefore, understanding the dynamics of social networks and their societal impact is essential for ensuring more aware and responsible use.
User behavior and bot influence
Studies of user behavior on Twitter during politically relevant events highlight the growing influence of bots on social networks, emphasizing the need to understand the impact of automated accounts on the dissemination of information and opinions. Through data collection and analysis, it was observed that the COVID-19 Parliamentary Inquiry Committee (CPI) was a prominent topic on the platform, with a high volume of tweets posted daily.
The presence of bots in these discussions is significant, as these automated agents can influence information diffusion and shape public perception. Detecting bots and understanding their content dissemination strategies are crucial for conducting contextualized analyses of social media discourse. Bots on Twitter—often new accounts with few followers—frequently post intensively to influence trends and debates, underscoring the need for regulation.
Analyses show that bots and humans tend to focus on different types of content, with bots more often targeting institutional topics. Such manipulation may distort public perception and affect democratic integrity, highlighting the importance of transparency and authenticity in digital interactions.
Additionally, it was found that one-third of the fake news about the pandemic in 2020 followed three main strategies: downplaying the threat, accusing certain actors of benefiting from the disease’s spread, and promoting unproven treatments. Corrections issued by fact-checkers received limited user response, ranging from acceptance to outright rejection, suggesting user apathy or resistance to engaging with fact-checkers or automated accounts.
Bot detection
A previous study analyzed the behavior of Twitter profiles from multiple perspectives. However, the Botometer tool used in its methodology exhibited a significant limitation: its accuracy was restricted primarily to English-language content. This constraint negatively impacted the detection of bots in Portuguese, reducing the effectiveness of identifying accounts responsible for spreading harmful or misleading content.
Despite these limitations, the analysis identified patterns of automated behavior, suggesting the presence of disinformation networks operating on the platform. To complement their investigation, the authors employed Gephi software to visualize the structure of interactions between profiles through information transmission graphs. This approach highlighted central users involved in the online discourse surrounding the dismissal of former Health Minister Luiz Henrique Mandetta during the COVID-19 pandemic in Brazil, demonstrating the role of specific profiles in amplifying the debate and propagating particular narratives.
Another study used the PEGABOT tool to analyze bot activity in propagating hashtags on Twitter, even after the topics ceased to trend. The study compared two hashtags: #BolsonaroDay, which showed signs of bot-driven amplification, and #TowelDay, which had predominantly human participation. Results showed that over 45% of profiles posting under the first hashtag exhibited signs of bot activity, a proportion that increased to 91% among those posting more than 10 tweets with the same hashtag. Furthermore, 6 out of the 10 most frequent retweeters were classified as bots.
In contrast, #TowelDay showed an opposite pattern, with over 81% of profiles identified as human. Among users posting more than 10 tweets with the hashtag, about 50% were human, and 9 out of the 10 most active retweeters were also classified as human. As a next step, the study proposes making the PEGABOT tool publicly accessible and validating its results using additional methodologies and tools to improve bot detection on social media platforms.
Text mining
Text Mining is a process within Knowledge Discovery in Databases (KDD) that employs analytical and extraction techniques to derive meaningful insights from unstructured textual data. It identifies implicit and useful patterns that would be difficult to retrieve through traditional methods. The process involves information extraction, data cleaning, classification, and clustering, ultimately transforming textual data into a structured numerical matrix for analysis.
KDD is an interactive process aimed at identifying valid, useful, and interpretable patterns in data. Its main phases include: domain understanding, creation of the target dataset, data cleaning and preprocessing, data transformation, selection of mining methods, exploratory analysis, model and hypothesis definition, pattern extraction, and interpretation of results for knowledge-based decision-making.
Natural Language Processing (NLP) is fundamental to text mining, combining computational techniques with statistical methods to analyze and interpret human language. NLP involves four primary stages: morphological analysis, syntactic analysis, semantic analysis, and pragmatic analysis.
Most text data sources are incomplete, noisy, and redundant, making preprocessing an essential step. Key techniques include the removal of hashtags, mentions, white spaces, punctuation, numbers, URLs, and stopwords, as well as case normalization. Stemming algorithms are also applied to reduce morphological word variations and extract root forms.
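As a brief illustration of these steps, the R sketch below removes Portuguese stopwords and stems the remaining tokens with the tm and SnowballC packages; the example sentence is hypothetical and the snippet is not part of the pipeline reported in this study.

```r
library(tm)         # text-mining infrastructure (stopword lists)
library(SnowballC)  # Snowball stemming algorithms

# Hypothetical example sentence (illustrative only)
texto  <- "A CPI da COVID investigou as vacinas e os gastos do governo"
tokens <- scan(text = tolower(texto), what = character(), quiet = TRUE)

# Remove Portuguese stopwords, then reduce the remaining words to their stems
tokens <- tokens[!tokens %in% stopwords("portuguese")]
wordStem(tokens, language = "portuguese")
```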
Topic modeling
Topic Modeling is essential for organizing and summarizing large text corpora by using machine learning algorithms to identify textual patterns and group words into latent topics. One of the most widely used methods is Latent Dirichlet Allocation (LDA), proposed by Blei et al. (2003), which enables the inference of the latent topic structure within a document collection.
From a probabilistic perspective, LDA employs Dirichlet distributions to estimate topic-term and document-topic relationships, allowing each document to be represented as a mixture of topics. The model operates by sampling multinomial variables for each topic and document, as well as selecting topics and terms for each word position within documents. Due to the interdependence between observed and latent variables, inference in LDA is challenging and typically performed using methods such as Gibbs Sampling or Variational Inference.
LDA belongs to the class of topic models in which the dependent variable is qualitative (the topic) and is generated from independent variables (text terms). This method assumes that each topic has a set of words (terms) that can define the entire document, and a document can be a mixture of topics.
In probabilistic notation, let \(P(z)\) denote the topic distribution for a specific document \(d\). Each document has a conditional probability distribution \(P(w|z)\), where \(w\) are the words and \(z\) the topics.
Thus, the generalized expression is given by equation (1): \[P(w_{i}) = \sum_{j=1}^{T} P(w_{i} \mid z_{i}=j) \cdot P(z_{i}=j), \tag{1}\] where \(T\) is the number of topics, \(P(w \mid z=j)\) represents the word distribution for topic \(j\), and \(P(z)\) is the topic distribution for document \(d\).
The LDA model estimates topic-term distributions using the Dirichlet distribution, whose probability density function is given by: \[ f(z;\alpha) = \frac{1}{B(\alpha)}\prod_{i=1}^{K}z_{i}^{\alpha_{i}-1},\] where \(z = (z_1, \ldots, z_K)\) is a \(K\)-dimensional variable with \(0 \leq z_i \leq 1\) and \(\sum_{i=1}^K z_i = 1\); \(\alpha = (\alpha_1, \ldots, \alpha_K)\) is the vector of hyperparameters of the distribution; and \(B(\alpha)\) is the Beta function, which can be expressed using the Gamma function:
\[B(\alpha) = \frac{\prod_{i=1}^{K} \Gamma(\alpha_{i})}{\Gamma\left(\sum_{i=1}^{K} \alpha_{i}\right)}.\]
In the LDA model, the variables \(\phi\) and \(\theta\) are defined, where \(\phi\) is an \(n\)-dimensional variable, with \(n\) representing the vocabulary size, and \(\theta\) is a \(K\)-dimensional variable, with \(K\) representing the number of topics.
Assuming a collection of \(M\) documents, the LDA generative process for a document \(d_j\) can be described as follows:
1. Sample \(K\) multinomials \(\phi_k \sim Dir(\beta)\), one for each topic \(k\);
2. Sample \(M\) multinomials \(\theta_j \sim Dir(\alpha)\), one for each document \(d_j\);
3. For each position \(i\) of the words in document \(d_j\):
   (a) select a topic \(z_{j,i}\) from the distribution \(\theta_j\);
   (b) select a word \(w_{j,i}\) from the distribution \(\phi_{z_{j,i}}\).
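To make these generative steps concrete, the following R sketch simulates them for a small synthetic corpus; the dimensions and hyperparameter values are arbitrary illustrative choices, not those of the model fitted later in this paper.

```r
# Simulate the LDA generative process for a toy corpus (illustrative values only)
rdirichlet <- function(alpha) {        # draw one sample from a Dirichlet(alpha)
  x <- rgamma(length(alpha), shape = alpha)
  x / sum(x)
}

set.seed(1)
K <- 4; V <- 20; M <- 5; N_j <- 50     # topics, vocabulary size, documents, words per document
alpha <- rep(0.1, K); beta <- rep(0.01, V)

phi <- t(sapply(1:K, function(k) rdirichlet(beta)))   # K x V topic-term distributions

docs <- lapply(1:M, function(j) {
  theta_j <- rdirichlet(alpha)                                  # document-topic distribution
  z <- sample(1:K, N_j, replace = TRUE, prob = theta_j)         # topic of each position
  w <- sapply(z, function(k) sample(1:V, 1, prob = phi[k, ]))   # word at each position
  data.frame(doc = j, position = 1:N_j, topic = z, word = w)
})
```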
Considering the observed and latent variables, the joint distribution is: \[p(z, w, \phi, \theta \mid \alpha, \beta) = \prod_{k=1}^{K} p(\phi_k \mid \beta) \prod_{j=1}^{M} p(\theta_j \mid \alpha) \prod_{i=1}^{N_j} p(z_{j,i} \mid \theta_j)\, p(w_{j,i} \mid z_{j,i}, \phi_{z_{j,i}}),\] where \(N_j\) is the number of word positions in document \(d_j\). This joint distribution makes evident the strong dependency between observed and latent variables. Therefore, the main challenge in LDA is to estimate the posterior \[p(z, \phi, \theta \mid w, \alpha, \beta),\] where \(w\) denotes all the observed words in the document collection.
There are several inference methods for LDA; the most commonly used are Gibbs Sampling and Variational Inference.
Materials and methods
Data collection
The tweet collection was carried out using the R software through the rtweet package, which allows access to the Twitter API and enables the collection of up to 18,000 tweets every 15 minutes.
Data were collected daily between April 18 and May 30, 2021, using the keywords “CPI” and “COVID,” resulting in a total of approximately 459,000 tweets in Portuguese. These tweets were generated by approximately 109,000 unique users. Each profile was analyzed by the Pegabot software to estimate the probability of being a bot. The higher the percentage assigned by Pegabot, the higher the likelihood that the profile is automated rather than operated by a real user.
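A minimal sketch of this collection step with rtweet is shown below; the exact query string, authentication setup, and daily scheduling used in the original collection are not reported here, so they are treated as assumptions.

```r
library(rtweet)  # client for the Twitter API

# Collect up to 18,000 recent tweets in Portuguese matching the keywords
# (authentication via an app token is assumed to be configured beforehand).
tweets <- search_tweets(
  q = "CPI COVID",           # assumed query combining the two keywords
  n = 18000,                 # API limit per 15-minute window
  lang = "pt",
  include_rts = TRUE,
  retryonratelimit = FALSE
)

# In the study, a call like this would be repeated daily between April 18 and
# May 30, 2021, and the results appended to a cumulative dataset, e.g.:
# all_tweets <- dplyr::bind_rows(all_tweets, tweets)
```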
Data selection
From the total of 109,000 users, only those who posted 100 or more tweets during the collection period were considered, ensuring a dataset composed of users with different probabilities of being identified as bots. For each selected user, all tweets collected between April 18 and May 30, 2021, were analyzed.
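This selection step amounts to a simple frequency filter, sketched below with dplyr; screen_name is assumed to be the user identifier column in the collected data.

```r
library(dplyr)

# Keep only users with 100 or more tweets in the collection period.
# 'tweets' is assumed to have one row per tweet and a 'screen_name' column.
active_users <- tweets %>%
  count(screen_name, name = "n_tweets") %>%
  filter(n_tweets >= 100)

selected <- tweets %>%
  semi_join(active_users, by = "screen_name")
```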
Topic modeling
The tweets were preprocessed to minimize irrelevant information contained in the posts. The following steps were applied: text normalization to lowercase, removal of hashtags and mentions, deletion of URLs, removal of punctuation and numbers, among other noise. Preprocessing was performed using functions from the tm package in R.
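A sketch of this cleaning pipeline with the tm package follows; the regular expressions used for URLs, hashtags, and mentions are illustrative choices rather than the exact ones used in the study.

```r
library(tm)

corpus <- VCorpus(VectorSource(selected$text))  # 'text' column assumed

remove_pattern <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, remove_pattern, "http\\S+|www\\S+")   # URLs
corpus <- tm_map(corpus, remove_pattern, "#\\S+|@\\S+")        # hashtags and mentions
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("portuguese"))
corpus <- tm_map(corpus, stripWhitespace)

dtm <- DocumentTermMatrix(corpus)          # document-term matrix used as LDA input
dtm <- dtm[slam::row_sums(dtm) > 0, ]      # drop tweets left empty after cleaning
```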
In topic modeling, determining the number of topics significantly influences model performance. To determine this value, the Cao Juan metric (Cao et al., 2009) and the Deveaud metric (Deveaud et al., 2014) were used.
The Cao Juan metric adopts an adaptive, density-based method for selecting the best LDA model. According to this metric, the optimal number of topics is the one at which the average cosine distance between topics reaches its minimum value. The Deveaud metric, in turn, performs an LDA analysis over a range of possible numbers of topics. For each LDA model, the number of latent concepts is estimated by optimizing the information divergence \(D\) between all topic pairs \((k_i, k_j)\); the optimal number of topics is the one that maximizes the divergence \(D\).
To apply the Cao Juan and Deveaud metrics, the ldatuning package in R was used.
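The topic-number search can be sketched with the ldatuning package as follows; the candidate range and the seed are assumed values.

```r
library(ldatuning)

metrics <- FindTopicsNumber(
  dtm,
  topics  = seq(2, 15, by = 1),               # assumed candidate range
  metrics = c("CaoJuan2009", "Deveaud2014"),  # the two metrics described above
  method  = "Gibbs",
  control = list(seed = 123),
  verbose = TRUE
)

FindTopicsNumber_plot(metrics)  # minimize CaoJuan2009, maximize Deveaud2014
```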
Once the number of topics was defined and the dataset was preprocessed, topic modeling was performed using the LDA technique. For model fitting, the topicmodels package in R was used.
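A sketch of the model fitting with the topicmodels package, using the four topics selected later in the Results and the Gibbs sampler with 50,000 iterations reported there; the burn-in and seed are assumed values.

```r
library(topicmodels)

k <- 4
lda_fit <- LDA(
  dtm,
  k = k,
  method = "Gibbs",
  control = list(seed = 123, iter = 50000, burnin = 1000)  # burn-in is an assumed value
)

terms(lda_fit, 10)  # ten most probable terms per topic
```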
Tweet classification
Based on the bot probability estimated by Pegabot, profiles with a probability lower than 0.7 were classified as non-bots, while those with a probability greater than or equal to 0.7 were classified as bots.
With the LDA model fitted, the probability of each tweet belonging to each topic was extracted. Thus, a given tweet was considered to belong to the topic for which it had the highest score (probability). As a result, each tweet was assigned both the bot classification (yes or no) and its most probable topic. Using this information, the distribution of tweets by topic and user type was analyzed.
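The per-tweet topic assignment and bot labeling described above can be sketched as follows; the column holding the Pegabot score (bot_probability) is an assumed name, and the rows of the posterior matrix are assumed to align with the cleaned tweet data.

```r
library(dplyr)
library(topicmodels)

# Posterior probability of each topic for each tweet (document)
doc_topics <- posterior(lda_fit)$topics

classified <- selected %>%
  mutate(
    topic = apply(doc_topics, 1, which.max),               # most probable topic per tweet
    bot   = ifelse(bot_probability >= 0.7, "yes", "no")    # Pegabot score; column name assumed
  )

# Distribution of tweets by topic and user type
table(classified$topic, classified$bot)
```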
After classifying the tweets via the fitted LDA model and obtaining the respective probabilities, the proportions of posts made by bot and non-bot users were calculated. The response variable “bot” is categorical (yes or no); therefore, a multiple comparison of the proportions of bots across topics was performed using a Generalized Linear Model (GLM) with binomial distribution, through the glm function available in the stats package in R.
For categorical data of this kind, assuming a binomial distribution is more accurate than using a normal approximation. Subsequently, multiple comparisons were conducted using the glht function from the multcomp package.
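A sketch of the binomial GLM and the multiple comparisons with multcomp, assuming the classified data frame built in the previous step:

```r
library(multcomp)

classified$topic  <- factor(classified$topic)
classified$is_bot <- as.integer(classified$bot == "yes")

# Binomial GLM: probability that a tweet was posted by a bot, by topic
fit <- glm(is_bot ~ topic, family = binomial, data = classified)

# Pairwise (Tukey-type) comparisons of bot proportions across topics
comparisons <- glht(fit, linfct = mcp(topic = "Tukey"))
summary(comparisons)
```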
Results and discussion
Database description
During the data collection period, a total of 459,145 tweets were obtained from 109,027 distinct users. These tweets originated from 390 different posting sources, the five most frequent being Twitter for Android (59.22%), Twitter Web App (19.37%), Twitter for iPhone (19.04%), Twitter for iPad, and TweetDeck (0.39%), as illustrated in Figure 1. The top three sources account for over 97% of the dataset.
Twitter bot automation can be performed through various custom platforms that enable the programming of specific behaviors, such as automated posting, retweets, likes, and replies to other users. Moreover, these bots may operate using standard clients such as Twitter for Android or iPhone, which are common interfaces for human users. This ability to use standard clients makes it difficult to identify bots based solely on the posting source, as both humans and bots may utilize the same tools to interact with the platform. The use of public and private Twitter APIs also facilitates automation by allowing developers to create scripts and applications that interact automatically with the social network. Therefore, the posting source is not always a reliable indicator of bot activity, since automation can be masked as typical human behavior.
Figure 2 shows the classification of users into bots and non-bots, based on the results from the Pegabot platform. A majority of 81.48% of the accounts were identified as non-bots, while 18.52% were classified as bots. This distinction is relevant for understanding how automated accounts may influence the spread of content and engagement patterns within the dataset.
One previous study involving 6,435,932 Twitter accounts found that 1,084,967 (16.9%) were classified as bots. Similarly, another study selected a sample of 122,884 Twitter user accounts that produced 263,821 tweets related to five political discourses in five Western democracies, identifying 27,363 bot accounts (22.0% of all accounts). These studies highlight the significant prevalence of bots on Twitter, particularly in political contexts.
This study focused on users who posted 100 or more tweets, reducing the dataset to 27,120 tweets from 189 distinct accounts. Among these, 75% of users posted up to 156 tweets during the data collection period. User classification revealed that 67.52% were considered bots and 32.48% non-bots (Figure 3), indicating a higher prevalence of bots among the most active users.
The tendency of bots on Twitter to generate a higher number of posts has also been observed in previous studies, which emphasized that bots aim to send as many tweets as possible to reach a broader audience. The results of this study support these findings, suggesting that bots play a substantial role in content generation, especially among the most active users. The greater activity of bots, identified both in our study and in the earlier literature, reinforces the importance of considering posting frequency when analyzing bot influence on Twitter.
Latent Dirichlet Allocation
Figure 4 shows the number of topics suggested by the Cao Juan and Deveaud metrics (Cao et al., 2009; Deveaud et al., 2014). According to the Cao Juan metric, the minimum average cosine distance is achieved with \(k = 9\) topics, although a similar value occurs at \(k = 4\). For the Deveaud metric, the maximum information divergence is reached at \(k = 4\). Therefore, the LDA model was fitted with 4 topics.
[Figure 4: (a) Cao Juan metric; (b) Deveaud metric.]
After defining the number of topics, the LDA model was fitted using a Gibbs sampler with 50,000 iterations. The word clouds for the four topics generated by the LDA model are presented in Figure 5.
[Figure 5: word clouds for Topics 1-4, panels (a)-(d).]
In Figure 5, the most frequent terms for each topic were:
Topic 1: "governo sobr pazuello cloroquina vacina mandetta depoimento via video hoje genocidio queiroga";
Topic 2: "nao renan calheiro president senador contra brasil relat diz pode senado ser pai agora filho corrupcao corrupto";
Topic 3: "bolsonaro president todo crime stf sao covid corrupto povo lula politico verdad medo";
Topic 4: "vai governador nao senador saud ministro pra investigar ser estado ate quer investigado".
The word clouds help identify the main themes discussed in each topic. Topic 1 is mostly associated with the Health Secretary, Topic 2 refers to Senator Renan Calheiros, Topic 3 focuses on President Bolsonaro, and Topic 4 addresses general issues related to the government.
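For reference, word clouds such as those in Figure 5 can be drawn approximately from the fitted model's topic-term probabilities using the wordcloud package; the number of terms displayed is an arbitrary choice.

```r
library(topicmodels)
library(wordcloud)

topic_terms <- posterior(lda_fit)$terms   # matrix: topics x vocabulary terms

par(mfrow = c(2, 2))
for (k in 1:4) {
  probs <- sort(topic_terms[k, ], decreasing = TRUE)[1:40]  # 40 most probable terms (arbitrary)
  wordcloud(words = names(probs), freq = probs, min.freq = 0,
            scale = c(3, 0.5), random.order = FALSE)
  title(main = paste("Topic", k))
}
```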
Tweets are a reflection of ongoing societal discussions, derived from various sources such as newspapers, official websites, and institutional events. Additionally, users themselves act as disseminators of information within the dynamics of the social network. Therefore, the topics generated tend to reflect the debates that occurred during the data collection period.
Figure 6, which shows the distribution of tweet proportions by topic over time, indicates that all four topics maintained a relatively balanced proportion throughout the data collection period, suggesting that all of them were widely discussed during this time. Such a distribution implies that no single topic dominated the discourse, indicating a diversity of interests and concerns among users.
Furthermore, the temporal analysis of tweets suggests that particular events may have influenced increases or decreases in the discussion of each topic.
Conclusions
This study presented a detailed analysis of bot activity on Twitter during a specific period, employing advanced text mining techniques, such as Latent Dirichlet Allocation (LDA). The focus was on the COVID-19 Parliamentary Inquiry Committee (CPI), with approximately 459,000 tweets in Portuguese collected between April 18 and May 30, 2021, generated by around 109,000 distinct users.
The analysis of tweet distribution by posting source showed that most publications originated from standard Twitter clients, such as Twitter for Android and Twitter Web App. This highlights the difficulty of identifying bots based solely on the posting source, since both humans and bots may use the same tools to interact on the platform. Therefore, it underscores the importance of employing more sophisticated methods to detect automated behavior, such as analyzing activity patterns and applying machine learning algorithms to identify typical bot characteristics.
The classification of users into bots and non-bots revealed that 18.52% were identified as bots. However, when considering only users who posted 100 or more tweets during the collection period, this proportion increased to 67.52%. This finding is consistent with previous studies, which also identified a significant presence of bots on Twitter. The higher activity levels among bots, identified both in our study and in the referenced literature, reinforce the importance of considering posting frequency when analyzing the influence of bots on social networks.
The application of the LDA technique enabled the identification of four main discussion topics related to the COVID-19 CPI: issues related to the Health Secretary, Senator Renan Calheiros, President Jair Bolsonaro, and the government in general. The balanced distribution of tweet proportions by topic over time indicates that these subjects were widely discussed throughout the data collection period, reflecting the diversity of user interests and concerns.
In all topics, users classified as bots published more tweets than those classified as non-bots. A significant level of bot activity was observed in amplifying information dissemination and influencing public debate. However, in Topic 1, which was associated with the Health Secretary, there was a lower proportion of tweets generated by bots compared to the other topics. These results emphasize the significant influence of bots in spreading information on Twitter, especially on topics related to political events and figures.
This study contributes to a better understanding of bot behavior dynamics on Twitter and highlights the importance of advanced analytical techniques in identifying automated behavior patterns. Future research may explore the application of these techniques in other contexts and timeframes, as well as develop more effective methods for detecting and mitigating bot influence on social media platforms. It is essential to continue developing and improving tools for detecting and curbing automated activity on social networks, aiming to preserve the transparency and reliability of information shared online.
Author contributions
G. T. M. Arruda was responsible for conceptualization, data curation, formal analysis, investigation, methodology, and writing the original draft. A. C. S. de Oliveira contributed through supervision, validation, and writing, review, and editing. L. H. M. Morita also participated in the supervision of the project, as well as in visualization and writing, review, and editing. J. N. da Cruz provided technical support, assisted with formal analysis, and contributed to writing, review, and editing.
Conflicts of interest
The authors declare no conflict of interest.
Acknowledgments
The authors would like to thank the Federal University of Mato Grosso (UFMT) and the Núcleo de Estudos, Pesquisas e Inovação em Estatística Aplicada (Center for Studies, Research and Innovation in Applied Statistics) for their support in the development of this study.
References
Alvares, R. V. (2014). Algoritmos de Stemming e o Estudo de Proteomas.
Agresti, Alan (2018). An introduction to categorical data analysis. John Wiley & Sons.
Assenmacher, D., Clever, L., Frischlich, L., Quandt, T., Trautmann, H. & Grimme, C. (2020). Demystifying social bots: On the intelligence of automated social media actors. 6(3), 1--14. https://doi.org/10.1177/2056305120939264
Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003). Latent Dirichlet Allocation. 993--1022.
Bolsover, Gillian & Howard, Philip (2019). Chinese computational propaganda: Automation, algorithms and the manipulation of information about Chinese politics on Twitter and Weibo. 22(14), 2063--2080. https://doi.org/10.1080/1369118X.2018.1476576
Bulegon, H. & Moro, C. M. C. (2010). Mineração de texto e o processamento de linguagem natural em sumários de alta hospitalar. 2(2), 51--56.
Cao, Juan, Xia, Tian, Li, Jintao, Zhang, Yongdong & Tang, Shuicheng (2009). A density-based method for adaptive LDA model selection. 72(7-9), 1775--1781. https://doi.org/10.1016/j.neucom.2008.06.011
Chowdhury, G. G. (2003). Natural language processing. 37(1), 51--89. https://doi.org/10.1002/aris.1440370103
Ciribeli, J. P. & Paiva, V. H. P. (2011). Redes e Mídias Sociais na Internet: Realidades e Perspectivas de um Mundo Conectado. 13(12), 57--74.
Costa, P. H. E. C., Lima, J. R., Marques, R. A., Trindade, D. R. & Komati, K. S. (2021). Estudos de caso de análise de perfis de usuários agrupados por hashtags no Twitter. In Sociedade Brasileira de Computação, Anais da Escola Regional de Banco de Dados [Anais]. 16º Escola Regional de Banco de Dados, Santa Maria, Brasil. https://doi.org/10.5753/erbd.2021.17250
Kemp, Simon (2021). Digital 2021: Global Overview Report. Datareportal.
Deveaud, Romain, SanJuan, Eric & Bellot, Patrice (2014). Accurate and effective latent concept modeling for ad hoc information retrieval. 17(1), 61--84. https://doi.org/10.3166/DN.17.1.61-84
de Oliveira, A. C. S., Paixão, C. A., Morita, L. H. M., de Barros, R. C. B. & Ferreira, E. B. (2024). CPI da Covid-19 no Twitter: Uma análise da participação de robôs nas discussões e sentimentos observados. 1--23.
de Sousa, M. d. C. E. (2015). A dinâmica da notícia nas redes sociais na internet: A forma de apresentação das postagens no Twitter e no Facebook. 17(2), 199--212.
Griffiths, T. & Steyvers, M. (2004). Finding Scientific Topics. 101(Suppl. 1), 5228--5235.
(1996). Advances in knowledge discovery and data mining. American Association for Artificial Intelligence.
Feinerer, Ingo, Hornik, Kurt & Meyer, David (2008). Text Mining Infrastructure in R. 25(5), 1–54. https://doi.org/10.18637/jss.v025.i05
Feldman, Ronen & Sanger, James (2006). The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press.
Hothorn, T., Bretz, F. & Westfall, P. (2008). Simultaneous inference in general parametric models. 50(3), 346--363. https://doi.org/10.18637/jss.v040.i13
Krestel, R. & Fankhauser, P. (2010). Language Models and Topic Models for Personalizing Tag Recommendation.
Liu, X. (2019). A big data approach to examining social bots on Twitter. Routledge. 33(4), 369--379. https://doi.org/10.1108/JSM-02-2018-0049
Martini, F., Samula, P., Keller, T. R. & Klinger, U. (2021). Bot, or not? Comparing three methods for detecting social bots in five political discourses. 8(2), 1--13.
Paganotti, I. (2021). Acolhimento e resistência a correções de fake news na pandemia: a experiência do robô Fátima, da agência Aos Fatos, no Twitter. 15(3), 169--193.
(2018). Pegabot [Verificador de perfil Twitter].
R Core Team (2020). R: A Language and Environment for Statistical Computing.
Raquel Recuero (2009). Redes sociais na internet. Sulina. 4(42), 206. https://doi.org/10.21105/joss.01829
Santos, A. E. G. O. (2020). Modelo Probabilístico de Tópicos e Estatística Multivariada Aplicados à Análise Textual: Um Módulo de Detecção de Conversas Fora do Contexto para Analisar Conversas em Grupo.
Silge, Julia & Robinson, David (2017). Text mining with R: A tidy approach. O'Reilly Media. 194.
Steyvers, M. & Griffiths, T. (2007). Probabilistic Topic Models. Routledge. 439--460.
Zizka, Jan, Darena, Filip & Svoboda, Ales (2019). Text mining with machine learning: principles and techniques. CRC Press. https://doi.org/10.1201/9780429469275
Yang, Kai-Cheng, Varol, Onur, Davis, Clayton A., Ferrara, Emilio, Flammini, Alessandro & Menczer, Filippo (2019). Arming the public with artificial intelligence to counter social bots. 1(1), 48--61. https://doi.org/10.1002/hbe2.115