Performance evaluation of an automatic text summarization software

Hamilton Rodrigues Tabosa; Osvaldo de Souza; José Carlos dos Santos Cândido; Ana Cristina Azevedo Ursulino Melo; Keila Giullianna Braga Reis

doi:10.5433/1981-8920.2020v25n1p189

Authors

Hamilton Rodrigues Tabosa Universidade Federal do Ceará http://orcid.org/0000-0002-2757-195X
Osvaldo de Souza Universidade Federal do Ceará
José Carlos dos Santos Cândido Universidade Federal do Ceará
Ana Cristina Azevedo Ursulino Melo Universidade Federal do Ceará
Keila Giullianna Braga Reis Universidade Federal do Ceará

Keywords:

Automatic Summarization of Texts, Access to Information, Natural Language Processing, Mediation (Practice)

Abstract

Introduction: Since 2014 we have conducted research to develop a software prototype capable of producing automatic summaries of texts, based on Natural Language Processing techniques and word-frequency statistics. The first empirical tests of the tool indicated a significant reduction in the dimensionality of the texts, with considerable preservation of their semantic value. Objective: In this article, we present the results of the continuation of that investigation, based on a human evaluation of the quality of these abstracts in blind tests. Methodology: A group of three librarians received a mixed, unlabeled set of abstracts (some written by humans, others generated automatically by the software) and evaluated them according to the criteria of grammatical correctness, preservation of central ideas, coherence and readability, length of the abstract, whether fragments were paraphrased or copied, and whether ideas not contained in the original text were introduced. Results: In four of the five evaluation criteria adopted, the abstracts produced by humans and those produced by the software were qualitatively equivalent. This suggests relative success, since the prototype could replace a person in the task of summarizing texts without loss of quality, except in the fifth criterion, the length of the abstract, for which the text produced by the software was judged longer than necessary. Conclusions: Despite the prototype's good results, we identified the need to improve its performance and to evaluate it with more comprehensive methods, more representative samples, and a larger group of evaluators.
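
To make the idea of summarization by word-frequency statistics concrete, the following is a minimal sketch in Python of a generic frequency-based extractive summarizer. It is not the prototype evaluated in this article, whose internals are not described here; the stop-word list, the naive sentence splitter, the scoring rule (average word frequency per sentence), and the names `summarize`, `tokenize`, and `split_sentences` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list used only for this illustration.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that", "on", "for"}


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercase word tokens with stop words removed.
    return [w for w in re.findall(r"\w+", sentence.lower()) if w not in STOPWORDS]


def summarize(text, max_sentences=2):
    """Keep the sentences whose words are, on average, most frequent in the text."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score, then restore original order for readability.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)


if __name__ == "__main__":
    sample = "Cats chase mice. Cats like cats. Dogs bark."
    print(summarize(sample, max_sentences=1))
```

Because a sketch like this only selects whole sentences, the length of its output is governed entirely by `max_sentences`; a real system would need a more careful length policy, which is the dimension on which the evaluated prototype was judged weakest.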

Author Biographies

Hamilton Rodrigues Tabosa, Universidade Federal do Ceará

PhD in Information Science from the Universidade Federal da Paraíba. Professor in the Department of Information Sciences at the Universidade Federal do Ceará.

Osvaldo de Souza, Universidade Federal do Ceará

PhD in Teleinformatics Engineering from the Universidade Federal do Ceará (UFC). Professor in the Department of Information Sciences at UFC.

José Carlos dos Santos Cândido, Universidade Federal do Ceará

Undergraduate student in Library Science at the Universidade Federal do Ceará. PIBIC scholarship holder.

Ana Cristina Azevedo Ursulino Melo, Universidade Federal do Ceará

Master's degree in Public Policy Evaluation and bachelor's degree in Library Science from the Universidade Federal do Ceará.

Keila Giullianna Braga Reis, Universidade Federal do Ceará

Bachelor's degree in Library Science from the Universidade Federal do Ceará.