BERTugues: A Novel BERT Transformer Model Pre-trained for Brazilian Portuguese
DOI: https://doi.org/10.5433/1679-0375.2024.v45.50630
Keywords: large language models, LLMs, BERT, NLP, foundation models
Abstract
Large Language Models (LLMs) are typically trained either for English or as multilingual models, with superior performance in English. This disparity arises because, when multilingual models are trained, only a relatively small amount of data is added for each additional language. Consequently, while these models can operate in Portuguese, their performance is suboptimal. The first BERT (Bidirectional Encoder Representations from Transformers) model trained specifically for Brazilian Portuguese was BERTimbau, released in 2020, which improved performance across a range of text-related tasks. For BERTugues, we followed the training approach of BERT/BERTimbau while implementing several improvements. These included removing characters rarely used in Portuguese, such as East Asian characters, from the tokenizer, which freed space for the addition of more than 7,000 new tokens. As a result, the average number of words per sentence split into more than one token fell from 3.8 to 3.0, which improved embedding performance on metrics relevant to classification problems. Two further enhancements were the inclusion of emojis as tokens, an important step for capturing conversational nuance, and the filtering of low-quality texts from the training dataset. Together, these modifications improved performance across a variety of tasks, raising the average F1 score from 64.8% with BERTimbau to 67.9% with BERTugues.
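As a rough illustration of the tokenizer comparison described in the abstract, the sketch below counts how many words per sentence a given tokenizer splits into more than one subword token. It is not code from the paper: the BERTugues checkpoint name is taken from the references, while the BERTimbau model identifier and the sample sentences are assumptions made only for demonstration.

```python
# Illustrative sketch (not from the paper): approximate, for a given tokenizer,
# the average number of words per sentence that are split into more than one
# subword token, the fragmentation metric discussed in the abstract.
from transformers import AutoTokenizer

# Sample sentences invented for demonstration purposes only.
sentences = [
    "O atendimento foi excelente e a entrega chegou antes do prazo 😀",
    "Não recomendo este produto, veio com defeito e o suporte não respondeu",
]

def avg_words_split_into_subwords(tokenizer, texts):
    """Average count, per sentence, of whitespace-separated words that the
    tokenizer breaks into more than one token (a rough approximation)."""
    per_sentence = []
    for text in texts:
        split_words = sum(
            1 for word in text.split() if len(tokenizer.tokenize(word)) > 1
        )
        per_sentence.append(split_words)
    return sum(per_sentence) / len(per_sentence)

# Model IDs: BERTugues comes from the references; the BERTimbau ID is assumed.
for model_id in [
    "neuralmind/bert-base-portuguese-cased",     # BERTimbau (assumed ID)
    "ricardoz/BERTugues-base-portuguese-cased",  # BERTugues (from references)
]:
    tok = AutoTokenizer.from_pretrained(model_id)
    print(model_id, avg_words_split_into_subwords(tok, sentences))
```

A tokenizer whose vocabulary covers Portuguese (and emojis) well will report a lower average here, which is the effect the abstract attributes to the BERTugues vocabulary changes.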
References
Beautiful Soup. (2023). Beautiful Soup Documentation. https://www.crummy.com/software/BeautifulSoup/bs4/doc/#module-bs4
de Souza, F. C. (2020). BERTimbau: Modelos BERT pré-treinados para português brasileiro [Master’s Thesis, Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação]. Repositório. https://repositorio.unicamp.br/Busca/Download?codigoArquivo=466423&tipoMidia=0
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv preprint arXiv:1810.04805. https://arxiv.org/abs/1810.04805
Fred, L. (2019). IMDB PT-BR. https://www.kaggle.com/datasets/luisfredgs/imdb-ptbr
IMDb. (2024). IMDb Service. https://www.imdb.com/pt/
Lhoest, Q., Villanova del Moral, A., Jernite, Y., Thakur, A., von Platen, P., Patil, S., Chaumond, J., Drame, M., Plu, J., Tunstall, L., Davison, J., Šaško, M., Chhablani, G., Malik, B., Brandeis, S., Le Scao, T., Sanh, V., Xu, C., Patry, N., … Wolf, T. (2021). Datasets: A Community Library for Natural Language Processing. In H. Adel & S. Shi (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (pp. 175–184). Association for Computational Linguistics.
Llama Team. (2024). The Llama 3 Herd of Models. ArXiv, 3, 1–92.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. ArXiv preprint arXiv:1301.3781.
OpenAI. (2023). GPT-4 Technical Report.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830. https://www.jmlr.org/papers/volume12/pedregosa11a/pedregosa11a.pdf
Pires, R., Abonizio, H., Almeida, T. S., & Nogueira, R. (2023). Sabiá: Portuguese Large Language Models. In Sociedade Brasileira de Computação, Proceedings of the Brazilian Conference on Intelligent Systems. 13th Brazilian Conference on Intelligent Systems, Porto Alegre, Brazil.
Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., Rutherford, E., Hennigan, T., Menick, J., Cassirer, A., Powell, R., Driessche, G. v. d., Hendricks, L. A., Rauh, M., Huang, P.-S., … Irving, G. (2021). Scaling Language Models: Methods, Analysis & Insights from Training Gopher. ArXiv, 2, 1–120.
Real, L., Fonseca, E., & Oliveira, H. G. (2020). The ASSIN 2 Shared Task: A Quick Overview. In P. Quaresma, R. Vieira, S. Aluísio, H. Moniz, F. Batista, & T. Gonçalves (Eds.), Computational Processing of the Portuguese Language (pp. 406–412). Springer International Publishing.
Ribeiro, E., Mamede, N., & Baptista, J. (2024, March). Automatic Text Readability Assessment in European Portuguese. In P. Gamallo, D. Claro, A. Teixeira, L. Real, M. Garcia, H. G. Oliveira, & R. Amaro (Eds.), Proceedings of the 16th International Conference on Computational Processing of Portuguese (pp. 97–107). Association for Computational Linguistics.
Silveira, R., Ponte, C., Almeida, V., Pinheiro, V., & Furtado, V. (2023). LegalBert-pt: Um modelo de linguagem pré-treinado para o domínio jurídico do português brasileiro. In M. C. Naldi & R. A. C. Bianchi (Eds.), Intelligent Systems. BRACIS 2023. Lecture Notes in Computer Science (Vol. 14197). Springer, Cham.
Souza, F. C., Nogueira, R. F., & Lotufo, R. A. (2020). BERTimbau: Pretrained BERT Models for Brazilian Portuguese. In R. Cerri, & R. C. Prati (Eds.), Intelligent Systems (pp. 403–417). Springer, Cham.
Souza, F. C., Nogueira, R. F., & Lotufo, R. A. (2023). BERT Models for Brazilian Portuguese: Pretraining, Evaluation and Tokenization Analysis. Applied Soft Computing, 149, 110901.
Speer, R. (2019). ftfy: Fixes Text for You, Version 5.5.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. ArXiv preprint arXiv:1706.03762.
Viegas, C. F. O. (2022). JurisBERT: Transformer-Based Model for Embedding Legal Texts [Master’s Thesis, Universidade Federal de Mato Grosso do Sul]. Repositório.
Viegas, C. F. O., & Alfaneo. (2023). Brazilian-Legal-Text-Benchmark. https://github.com/alfaneo-ai/brazilian-legal-text-benchmark
Wagner Filho, J. A., Wilkens, R., Idiart, M., & Villavicencio, A. (2018). The brWaC Corpus: A New Open Resource for Brazilian Portuguese. In N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis, & T. Tokunaga (Eds.), Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA). https://aclanthology.org/L18-1686
Wikimedia Downloads. (2024). https://dumps.wikimedia.org/backup-index.html
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., … Rush, A. (2020). Transformers: State-of-the-art natural language processing. In Q. Liu, & D. Schlangen (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (pp. 38–45).
Zago, R. (2023). BERTugues-base-portuguese-cased. https://huggingface.co/ricardoz/BERTugues-base-portuguese-cased
Zucker, A. (2024). Huggingface tokenizers. https://github.com/huggingface/tokenizers
License
Copyright (c) 2024 Ricardo Mazza Zago, Luciane Agnoletti dos Santos Pedotti
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.