Journal "Software Engineering"
a journal on theoretical and applied science and technology
ISSN 2220-3397

Issue No. 6, 2025

DOI: 10.17587/prin.16.300-310
Survey on Automatic Feature Construction in Machine Learning on Tabular Data
N. A. Radeev, Postgraduate Student, n.radeev@g.nsu.ru, Novosibirsk State University, Novosibirsk, 630090, Russian Federation
Corresponding author: Nikita A. Radeev, Postgraduate Student, Novosibirsk State University, Novosibirsk, 630090, Russian Federation. E-mail: n.radeev@g.nsu.ru
Received on January 23, 2025
Accepted on April 15, 2025

This study surveys modern methods of automated feature construction for machine learning tasks on tabular data. Such methods are essential for improving model performance and simplifying data preparation, particularly when domain expertise is unavailable. The survey covers a broad range of approaches, from classical algorithms based on predefined transformations, such as Deep Feature Synthesis (DFS), FCTree, ExploreKit, AutoLearn, FICUS, FEADIS, and Cognito, to search-based methods such as GP-DR, GP-MaL-MO, and DIFER, which discover complex feature interactions through evolutionary or differentiable optimization. Each method is analyzed in terms of its key characteristics, limitations, and practical applications. Special attention is given to techniques for evaluating the quality of constructed features and for pruning the feature space so that only the most relevant features are retained. The survey shows how modern approaches uncover non-linear relationships and generate informative features without relying on domain experts, which makes them suitable for a wide range of machine learning tasks. At the same time, the analysis highlights scalability challenges: these methods often struggle with the growing volume and complexity of data. A further trade-off exists between local and global strategies for feature construction: local methods excel at capturing subtle, dataset-specific patterns but may generalize poorly to new data, whereas global methods are more robust yet can miss intricate local dependencies. Despite these challenges, automated feature construction has a transformative effect on the machine learning pipeline: by substantially reducing the time and effort required for data preparation, it allows practitioners to focus on model development, paving the way for more accurate and efficient machine learning applications. Future research will likely address the computational challenges and further improve the scalability of these approaches.
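To make the expand-and-select pattern shared by the transformation-based constructors concrete, the sketch below generates candidate features with a fixed set of arithmetic transformations and ranks them by mutual information with the target. It is a minimal illustration only, assuming pandas and scikit-learn; the function and column names are hypothetical and do not come from any of the surveyed systems, which layer their own search strategies and feature-quality estimators on top of this loop.

```python
# Illustrative sketch only: a generic expand-and-select loop in the spirit of
# transformation-based constructors such as ExploreKit. All names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def expand_features(X: pd.DataFrame) -> pd.DataFrame:
    """Apply a fixed set of unary and binary transformations to numeric columns."""
    cols = X.select_dtypes(include=np.number).columns
    candidates = {}
    for c in cols:  # unary: shifted log keeps the argument non-negative
        candidates[f"log1p({c})"] = np.log1p(X[c] - X[c].min())
    for i, a in enumerate(cols):  # binary: pairwise products and sums
        for b in cols[i + 1:]:
            candidates[f"{a}*{b}"] = X[a] * X[b]
            candidates[f"{a}+{b}"] = X[a] + X[b]
    return pd.DataFrame(candidates, index=X.index)

def select_top_k(candidates: pd.DataFrame, y, k: int = 10) -> pd.DataFrame:
    """Keep the k candidates with the highest mutual information with the target."""
    scores = mutual_info_classif(candidates.fillna(0.0), y, random_state=0)
    keep = candidates.columns[np.argsort(scores)[::-1][:k]]
    return candidates[keep]

# Hypothetical usage: X is a numeric feature table, y a classification target.
# new_features = select_top_k(expand_features(X), y, k=20)
```

Evolutionary methods such as GP-DR replace the fixed enumeration above with a population of expression trees that is mutated and recombined, using a feature-quality score of this kind as the fitness function.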

Keywords: machine learning, feature construction, feature engineering, tabular data, feature space, genetic algorithms, dimensionality reduction, evolutionary algorithms
pp. 300—310
For citation:
Radeev N. A. Survey on Automatic Feature Construction in Machine Learning on Tabular Data, Programmnaya Ingeneria, 2025, vol. 16, no. 6, pp. 300—310. DOI: 10.17587/prin.16.300-310. (in Russian).
References:
  1. Radeev N. A. Approach to Research Feature Interactions, 2023 IEEE 24th International Conference of Young Professionals in Electron Devices and Materials (EDM), IEEE, 2023, pp. 1720—1724. DOI: 10.1109/EDM58354.2023.10225150.
  2. Turner C., Fuggetta A., Lavazza L., Wolf A. A conceptual basis for feature engineering, Journal of Systems and Software, 1999, vol. 49, no. 1, pp. 3—15. DOI: 10.1016/S0164-1212(99)00062-X.
  3. Salazar R., Neutatz F., Abedjan Z. Automated feature engineering for algorithmic fairness, Proceedings of the VLDB Endowment, 2021, vol. 14, no. 9, pp. 1694—1702. DOI: 10.14778/3461535.3463474.
  4. Gosiewska A., Kozak A., Biecek P. Simpler is better: Lifting interpretability-performance trade-off via automated feature engineering, Decision Support Systems, 2021, vol. 150, article 113556. DOI: 10.1016/j.dss.2021.113556.
  5. LeDell E., Poirier S. H2O AutoML: Scalable Automatic Machine Learning, 7th ICML Workshop on Automated Machine Learning (AutoML), July 2020, available at: https://www.automl.org/wp-content/uploads/2020/07/AutoML_2020_paper_61.pdf (date of access 10.03.2025).
  6. Erickson N., Mueller J., Shirkov A. et al. AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data, arXiv, 2020. DOI: 10.48550/arXiv.2003.06505.
  7. Olson R., Moore J. TPOT: A tree-based pipeline optimization tool for automating machine learning, Workshop on Automatic Machine Learning, 2016, pp. 66—74. DOI: 10.1007/978-3-030-05318-5_8.
  8. Parmentier L., Nicol O., Jourdan L., Kessaci M. TPOT-SH: A faster optimization algorithm to solve the AutoML problem on large datasets, 2019 IEEE 31st International Conference on Tools with Artificial Intelligence, 2019, pp. 471—478. DOI: 10.1109/ICTAI.2019.00072.
  9. Feurer M., Eggensperger K., Falkner S. et al. Auto-Sklearn 2.0: Hands-free AutoML via meta-learning, Journal of Machine Learning Research, 2022, vol. 23, no. 261, pp. 1—61. DOI: 10.48550/arXiv.2007.04074.
  10. He X., Zhao K., Chu X. AutoML: A survey of the state-of-the-art, Knowledge-Based Systems, 2021, vol. 212, article 106622. DOI: 10.1016/j.knosys.2020.106622.
  11. Jiang Y., Bosch N., Baker R. et al. Expert feature-engineering vs. deep neural networks: which is better for sensor-free affect detection? Artificial Intelligence in Education: 19th International Conference, AIED 2018, London, UK, June 27—30, 2018, Proceedings, Part I. 2018, pp. 198—211. DOI: 10.1007/978-3-319-93843-1_15.
  12. Tymoshenko K., Bonadiman D., Moschitti A. Convolutional neural networks vs. convolution kernels: Feature engineering for answer sentence reranking, Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, 2016, pp. 1268—1278. DOI: 10.18653/v1/N16-1152.
  13. Kanter J. M., Veeramachaneni K. Deep feature synthesis: Towards automating data science endeavors, 2015 IEEE international conference on data science and advanced analytics (DSAA), IEEE, 2015, pp. 1—10. DOI: 10.1109/DSAA.2015.7344858.
  14. Fan W. et al. Generalized and heuristic-free feature construction for improved accuracy, Proceedings of the 2010 SIAM International Conference on Data Mining, SIAM, 2010, pp. 629—640. DOI: 10.1137/1.9781611972801.55.
  15. Katz G., Shin R., Song D. ExploreKit: Automatic feature generation and selection, 2016 IEEE 16th international conference on data mining (ICDM), Barcelona, Spain, IEEE, 2016, pp. 979—984. DOI: 10.1109/ICDM.2016.0123.
  16. Kaul A., Maheshwary S., Pudi V. AutoLearn — automated feature generation and selection, 2017 IEEE International Conference on data mining (ICDM), IEEE, 2017, pp. 217—226. DOI: 10.1109/ICDM.2017.31.
  17. Markovitch S., Rosenstein D. Feature generation using general constructor functions, Machine Learning, 2002, vol. 49, pp. 59—98. DOI: 10.1023/A:1014046307775.
  18. Kent J. T. Information gain and a general measure of correlation, Biometrika, 1983, vol. 70, no. 1, pp. 163—173. DOI: 10.2307/2335954.
  19. Khurana U. et al. Cognito: Automated feature engineering for supervised learning, 2016 IEEE 16th international conference on data mining workshops (ICDMW), IEEE, 2016, pp. 1304—1307. DOI: 10.1109/ICDMW.2016.0190.
  20. Koza J. R. Genetic programming as a means for programming computers by natural selection, Statistics and Computing, 1994, vol. 4, pp. 87—112.
  21. Billard L., Diday E. Symbolic regression analysis, Classification, clustering, and data analysis: recent advances and applications, Springer, 2002, pp. 281—288. DOI: 10.1007/978-3-642-56181-8_31.
  22. Uriot T. On genetic programming representations and fitness functions for interpretable dimensionality reduction, Proceedings of the Genetic and Evolutionary Computation Conference, 2022, pp. 458—466. DOI: 10.1145/3512290.3528849.
  23. Zhu G. et al. DIFER: differentiable automated feature engineering, International Conference on Automated Machine Learning, PMLR, 2022, 17 p. DOI: 10.48550/arXiv.2010.08784.
  24. Lensen A., Zhang M., Xue B. Multi-objective genetic programming for manifold learning: balancing quality and dimensionality, Genetic Programming and Evolvable Machines, 2020, vol. 21, no. 3, pp. 399—431. DOI: 10.1007/s10710-020-09375-4.
  25. Lensen A., Zhang M., Xue B. Can genetic programming do manifold learning too? Genetic Programming: 22nd European Conference Proceedings, Springer, 2019, pp. 114—130. DOI: 10.1007/978-3-030-16670-0_8.
  26. Zhang Q., Li H. MOEA/D: A multiobjective evolutionary algorithm based on decomposition, IEEE Transactions on evolutionary computation, 2007, vol. 11, no. 6, pp. 712—731. DOI: 10.1109/TEVC.2007.892759.
  27. Abdi H., Williams L. J. Principal component analysis, Wiley interdisciplinary reviews: computational statistics, 2010, vol. 2, no. 4, pp. 433—459. DOI: 10.1002/wics.101.
  28. Roweis S. T., Saul L. K. Nonlinear dimensionality reduction by locally linear embedding, Science, 2000, vol. 290, no. 5500, pp. 2323—2326. DOI: 10.1126/science.290.5500.2323.
  29. Torgerson W. S. Multidimensional scaling: I. Theory and method, Psychometrika, 1952, vol. 17, no. 4, pp. 401—419. DOI: 10.1007/BF02288916.
  30. Healy J., McInnes L. Uniform manifold approximation and projection, Nature Reviews Methods Primers, 2024, vol. 4, no. 1, article 82. DOI: 10.1038/s43586-024-00363-x.