A Small Brazilian Portuguese Speech Corpus for Speaker Recognition Study

Alberto Yoshihiro Nakano; Hélio Rodrigues da Silva; Juliano Rodrigues Dourado; Felipe Walter Dafico Pfrimer

doi:10.5433/1679-0375.2024.v45.50518

Authors

Alberto Yoshihiro Nakano Universidade Tecnológica Federal do Paraná https://orcid.org/0000-0002-3757-1427
Hélio Rodrigues da Silva Universidade Tecnológica Federal do Paraná https://orcid.org/0009-0009-4864-1337
Juliano Rodrigues Dourado Universidade Tecnológica Federal do Paraná https://orcid.org/0009-0008-2459-4751
Felipe Walter Dafico Pfrimer Universidade Tecnológica Federal do Paraná https://orcid.org/0000-0002-0586-1585

Keywords:

Brazilian Portuguese speech corpus, GMM, MFCC, Speaker recognition

Abstract

A small Brazilian Portuguese speech corpus was created for educational purposes to study a state-of-the-art speaker recognition system. The system models each speaker statistically with a Gaussian Mixture Model (GMM) and uses Mel-frequency cepstral coefficients (MFCC) as acoustic features. Results obtained with clean and noisy speech agree with expectations: the larger the mismatch between training and test conditions, the worse the performance. Performance also improves as utterance length increases. Finally, the results can serve as baselines for comparison with other statistical speaker models built with different acoustic features under different acoustic conditions.
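
As an illustration of the kind of system described above, the following is a minimal sketch of GMM-based speaker identification. It assumes the usual maximum-likelihood decision rule (choose the speaker whose model gives the highest average per-frame log-likelihood) and diagonal covariances; neither detail is stated in the abstract. The speaker names, model parameters, and two-dimensional synthetic frames are hypothetical stand-ins for trained models and real MFCC vectors, and the sketch is not the authors' implementation.

```python
import math
import random


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, gmm):
    """Log p(x | gmm) for a mixture given as (weight, mean, var) triples."""
    terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    peak = max(terms)  # log-sum-exp trick for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def average_log_likelihood(frames, gmm):
    """Mean per-frame log-likelihood of an utterance under one speaker model."""
    return sum(gmm_log_likelihood(x, gmm) for x in frames) / len(frames)


def identify(frames, speaker_models):
    """Return the name of the speaker whose GMM scores the utterance highest."""
    return max(
        speaker_models,
        key=lambda name: average_log_likelihood(frames, speaker_models[name]),
    )


# Hypothetical two-component models for two made-up speakers (not trained).
speaker_models = {
    "speaker_a": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [2.0, 2.0], [1.0, 1.0])],
    "speaker_b": [(0.5, [6.0, 6.0], [1.0, 1.0]), (0.5, [8.0, 8.0], [1.0, 1.0])],
}

# Synthetic "utterance": 200 frames drawn near one of speaker_b's components.
rng = random.Random(0)
test_frames = [[rng.gauss(8.0, 1.0), rng.gauss(8.0, 1.0)] for _ in range(200)]

predicted = identify(test_frames, speaker_models)
print(predicted)
```

In a real experiment the mixture parameters would be estimated from each speaker's training utterances (for example with the EM algorithm), and the frames would be MFCC vectors extracted from the test recording.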

Author Biographies

Alberto Yoshihiro Nakano, Universidade Tecnológica Federal do Paraná

Alberto Yoshihiro Nakano is an Associate Professor at the Federal University of Technology - Paraná, Toledo campus. He received his Master's Degree from the University of São Paulo, Brazil (2005) and his Dr. Eng. Degree from the Toyohashi University of Technology, Japan (2010).

Hélio Rodrigues da Silva, Universidade Tecnológica Federal do Paraná

Hélio Rodrigues da Silva received his Bachelor's Degree in Electronic Engineering (2018) and his Master's Degree (2022) from the Federal University of Technology - Paraná.

Juliano Rodrigues Dourado, Universidade Tecnológica Federal do Paraná

Juliano Rodrigues Dourado received his Bachelor's Degree in Electronic Engineering (2018) from the Federal University of Technology - Paraná.

Felipe Walter Dafico Pfrimer, Universidade Tecnológica Federal do Paraná

Felipe Walter Dafico Pfrimer is an Associate Professor at the Federal University of Technology - Paraná, Toledo campus. He received his Master's Degree in 2009 and his Dr. Eng. Degree in 2013, both from the University of Campinas, Brazil.
