Journal "Software Engineering"
a journal on theoretical and applied science and technology
ISSN 2220-3397

Issue No. 1, 2023

DOI: 10.17587/prin.14.42-50
Problems of Automatic Processing of Scientific Texts based on Extraction of Information from Encyclopedias of Relevant Domain Areas
O. I. Bachishe, Undergraduate, bachisheo@yandex.ru, E. N. Kruchkova, Professor, kruchkova_elena@mail.ru, D. S. Shushakov, Undergraduate, shushakov4@yandex.ru, Altai State Technical University, Barnaul, 656038, Russian Federation
Corresponding author: Elena N. Kruchkova, Professor, Altai State Technical University, Barnaul, 656038, Russian Federation, E-mail: kruchkova_elena@mail.ru
Received on October 10, 2022
Accepted on October 25, 2022

The article discusses problems that arise in the automatic processing of scientific texts and presents a combined method for aspect-oriented analysis of scientific texts in fundamental disciplines. The method takes into account both knowledge of the subject area and statistical methods of text processing. Thematic encyclopedias are proposed as training data: they serve not only as a source of professional scientific terminology but also as an information resource for extracting knowledge about the subject area. The paper proposes a structure of templates for extracting information from the partially structured text of an encyclopedia, describes the structure of the extracted sets of professional terms, and presents an algorithm for forming semantic relationships between special terms. Knowledge extraction is demonstrated on four encyclopedias: mathematical, physical, chemical, and medical. The general principles by which domain scientific terminology is formed are highlighted, and statistical data on the terminological composition of each examined area are given. From the encyclopedia texts, basic semantic graphs of the corresponding scientific fields are constructed, with relations defined between the professional terms. The basic graphs accumulate knowledge about a scientific field and are intended for the subsequent thematic analysis of unstructured texts of scientific articles. The implemented algorithm for extracting the semantics of a given scientific text is based both on amplifying the weights of nodes (terms of the applied domain) and on correcting the semantic relations between graph nodes according to the processed text. Results of experiments on automatically constructing the keyword list of an article are given and compared with the keyword list specified by the article's author. The relevance of correctly extracted terms is determined mainly by semantic links in the basic domain graph and depends significantly less on the number of keywords in the original article, which demonstrates the advantage of the proposed combined method over simple frequency analysis. A sample analysis of articles on mathematics showed good accuracy in extracting key terms relative to the author-specified keyword lists.
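The idea of amplifying node weights through semantic links, as opposed to ranking by frequency alone, can be illustrated with a minimal sketch. It is not the authors' implementation. The toy graph, the link strengths, the `damping` parameter, and the function names are all assumptions introduced here for illustration. Matching is limited to exact single-word terms with no lemmatization, and the correction of relations between nodes is not modeled.

```python
import re
from collections import Counter

# Hypothetical base graph: term -> {related term: link strength}.
# A real graph would be built from encyclopedia entries; these values are invented.
BASE_GRAPH = {
    "group": {"subgroup": 0.8, "homomorphism": 0.6},
    "subgroup": {"group": 0.8},
    "homomorphism": {"group": 0.6, "kernel": 0.7},
    "kernel": {"homomorphism": 0.7},
    "integral": {"measure": 0.5},
    "measure": {"integral": 0.5},
}


def term_counts(text, terms):
    """Count exact occurrences of each single-word term in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return {term: counts[term] for term in terms}


def amplified_weights(text, graph, damping=0.5):
    """Weight of a term = its own count plus damped support from linked terms."""
    counts = term_counts(text, graph)
    weights = {}
    for term, links in graph.items():
        support = sum(strength * counts[other] for other, strength in links.items())
        weights[term] = counts[term] + damping * support
    return weights


def top_keywords(text, graph, k=3, damping=0.5):
    """Return up to k terms with the highest positive amplified weight."""
    weights = amplified_weights(text, graph, damping)
    ranked = sorted(weights, key=lambda t: (-weights[t], t))
    return [t for t in ranked[:k] if weights[t] > 0]


text = ("Every kernel of a homomorphism is a subgroup. "
        "The homomorphism maps one group to another.")
print(top_keywords(text, BASE_GRAPH))
```

In this toy text "group", "kernel", and "subgroup" each occur once, yet "group" ranks above the other two because it is linked to two terms that also occur. This is the kind of effect the abstract attributes to semantic links rather than raw frequency.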

Keywords: aspect-oriented analysis, scientific lexicon, semantic graph, scientific text classification
pp. 42–50
For citation:
Bachishe O. I., Kryuchkova E. N., Shushakov D. S. Problems of Automatic Processing of Scientific Texts based on Extraction of Information from Encyclopedias of Relevant Domain Areas, Programmnaya Ingeneria, 2023, vol. 14, no. 1, pp. 42–50.