Journal "Software Engineering"
a journal on theoretical and applied science and technology
ISSN 2220-3397

Issue No. 6, 2025

DOI: 10.17587/prin.16.280-291
The Use of the Missing Sample Simulation Modeling to Create a Classification Model for 3 or more Classes with the Example of the Carbohydrate Metabolism Disorder Degree Detecting Problem
R. S. Novikov1, 2, Senior Researcher, rnovikov@ec-leasing.ru, M. A. Novopashin1, Head of the Math Cardiology Department, mnovopashin@ec-leasing.ru, B. A. Pozin1, 2, 3, Technical Director, bpozin@ec-leasing.ru,
1 Closed Joint Stock Company "EC — Leasing", Moscow, 117587, Russian Federation,
2 HSE University, Moscow, 101000, Russian Federation,
3 Ivannikov Institute for System Programming of the Russian Academy of Sciences, Moscow, 109004, Russian Federation
Corresponding author: Roman S. Novikov, Senior Researcher, Closed Joint Stock Company "EC — Leasing", 117587, Moscow, Russian Federation, E-mail: rnovikov@ec-leasing.ru
Received on January 29, 2025
Accepted on March 25, 2025

The aim of this study is to develop a method for constructing a sorting model that classifies objects into 3 or more ordinal classes when there is too little labeled data to build such a model with machine learning methods. When data are searched for or collected to build machine learning models, it can happen that there is enough data to train a binary classifier that recognizes the "top" and the "bottom" class relative to the class order, but not enough to train a model to recognize the other classes. To build a new classification (sorting) model that recognizes 3 or more ordinal classes, a data sampling simulation technique is proposed. It is based on the available information about the distribution of all classes in the general population and about the frequency of a positive result of the existing binary classifier for each ordinal class. The purpose of data sampling simulation is to generate a labeled sample: the class of each object is drawn at random according to the known class distribution, and the measurements for each object, that is, the results of the binary classifier ("positive"/"negative"), are drawn at random according to the known frequency of a positive result of the existing binary classifier for that class. The sorting model is then formed from this sample. The sorting model classifies an object by analyzing the proportion of positive binary classifier results among all measurements of the object. Building this model amounts to finding the optimal ranges of the proportion of positive binary classifier results for each class. As an example of applying the method, a sorting model is presented that solves the problem of detecting the degree of carbohydrate metabolism disorder ("Type 2 diabetes mellitus"/"Prediabetes"/"Healthy") from an ECG series with sufficiently high quality, given at least 11 ECG measurements per patient.
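
The simulation and threshold search described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The class shares (PRIORS), the per-class frequencies of a positive result (POSITIVE_RATE), the sample size, the accuracy criterion and the exhaustive search over cut points are all assumptions chosen for illustration. Only the three class names and the figure of 11 measurements per patient come from the abstract.

```python
import random
from itertools import combinations

# Hypothetical inputs (NOT taken from the paper): population share of each
# ordinal class and frequency of a "positive" binary-classifier result per class.
CLASSES = ["Healthy", "Prediabetes", "Type 2 diabetes mellitus"]
PRIORS = [0.60, 0.25, 0.15]
POSITIVE_RATE = [0.10, 0.50, 0.90]
N_MEASUREMENTS = 11  # the abstract mentions at least 11 ECG measurements per patient


def simulate_sample(n_objects, rng):
    """Generate (class_index, proportion_of_positive_results) pairs."""
    sample = []
    for _ in range(n_objects):
        # Draw the object's class according to the assumed class distribution.
        cls = rng.choices(range(len(CLASSES)), weights=PRIORS)[0]
        # Draw each measurement as a positive/negative binary-classifier result.
        positives = sum(rng.random() < POSITIVE_RATE[cls]
                        for _ in range(N_MEASUREMENTS))
        sample.append((cls, positives / N_MEASUREMENTS))
    return sample


def classify(proportion, cuts):
    """Return the index of the ordinal class whose range contains the proportion."""
    return sum(proportion >= c for c in cuts)


def accuracy(sample, cuts):
    """Share of simulated objects assigned to their true class."""
    return sum(classify(p, cuts) == cls for cls, p in sample) / len(sample)


def fit_cuts(sample):
    """Exhaustive search over cut points k / N_MEASUREMENTS, maximising accuracy."""
    grid = [k / N_MEASUREMENTS for k in range(1, N_MEASUREMENTS + 1)]
    return max(combinations(grid, len(CLASSES) - 1),
               key=lambda cuts: accuracy(sample, cuts))


rng = random.Random(0)
train = simulate_sample(5000, rng)
cuts = fit_cuts(train)
# Evaluate the fitted ranges on a fresh simulated sample.
print(cuts, accuracy(simulate_sample(5000, rng), cuts))
```

In this sketch the two cut points split the interval [0, 1] into three ranges of the positive-result proportion, one per ordinal class. A different quality criterion, such as one that weights errors by class, could replace accuracy without changing the rest of the sketch.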

Keywords: sorting, simulation modeling, machine learning, carbohydrate metabolism disorder
pp. 280—291
For citation:
Novikov R. S., Novopashin M. A., Pozin B. A. The Use of the Missing Sample Simulation Modeling to Create a Classification Model for 3 or more Classes with the Example of the Carbohydrate Metabolism Disorder Degree Detecting Problem, Programmnaya Ingeneria, 2025, vol. 16, no. 6, pp. 280—291. DOI: 10.17587/prin.16.280-291. (in Russian).
References:
  1. Furems E. M. Stepclass-based approach to multicriteria sorting, Iskusstvennyj intellekt i prinyatie reshenij, 2012, no. 4, pp. 104–115. DOI: 10.3103/S0147688215060064 (in Russian).
  2. Dedov I. I., Shestakova M. V., Mayorov A. Y. et al. Clinical recommendations. Type 2 diabetes mellitus in adults, Saharnyj diabet, 2020, vol. 23, no. S2, pp. 4–102. DOI: 10.14341/DM20202S (in Russian).
  3. Dedov I. I., Shestakova M. V., Galstyan G. R. The prevalence of type 2 diabetes mellitus in the adult population of Russia (NATION study), Saharnyj diabet, 2016, vol. 19, no. 2, pp. 104–112. DOI: 10.14341/DM2004116-17 (in Russian).
  4. Otdel'nova K. A. Determination of the required number of observations in social and hygienic research, Sb. trudov 2-go MMI, 1980, vol. 150, no. 6, pp. 18–22 (in Russian).
  5. Paniotto V. I., Maksimenko V. S. Quantitative methods in sociological research, Kyiv, Naukova dumka, 1982, 272 p. (in Russian).
  6. Cundill B., Alexander N. D. E. Sample size calculations for skewed distributions, BMC Medical Research Methodology, 2015, vol. 15, no. 1, article 28. DOI: 10.1186/s12874-015-0023-0.
  7. Muthen L. K., Muthen B. O. How to use a Monte Carlo study to decide on sample size and determine power, Structural Equation Modeling: A Multidisciplinary Journal, 2002, vol. 9, no. 4, pp. 599–620. DOI: 10.1207/S15328007SEM0904_8.
  8. Figueira J. R., Mousseau V., Roy B. ELECTRE methods, Multiple criteria decision analysis: State of the art surveys, 2016, pp. 155–185.
  9. Jacquet-Lagreze E., Siskos J. Assessing a set of additive utility functions for multicriteria decision-making, the UTA method, European Journal of Operational Research, 1982, vol. 10, no. 2, pp. 151–164. DOI: 10.1016/0377-2217(82)90155-2.
  10. Shmid A., Mkrtumyan A. Remote noninvasive detection of carbohydrate metabolism disorders by first-lead ECG screening in CardioQVARK project, 2019 Actual Problems of Systems and Software Engineering (APSSE), IEEE, 2019, pp. 139–145. DOI: 10.1109/APSS47353.2019.00025.
  11. Schmid A. V., Berezin A. A., Novopashin M. A., Novikov R. S., Pozin B. A., Mkrtumyan A. M., Markova T. N. Komp'yuterizirovannyj sposob neinvazivnogo vyyavleniya narushenij uglevodnogo obmena po elektrokardiogramme [Computerized method for non-invasive detection of carbohydrate metabolism disorders by electrocardiogram]: Patent RF no. 2728869, 2020.
  12. Novopashin M. A., Shmid A. V., Berezin A. A. Fermi-Pasta-Ulam auto recurrence in the description of the electrical activity of the heart, Medical Hypotheses, 2017, vol. 101, pp. 12–16. DOI: 10.1016/j.mehy.2017.02.002.