Informacionnye Tehnologii, 2023, vol. 29, no. 12, pp. 650-663

Ðóññêèé

ABSTRACTS OF ARTICLES OF THE JOURNAL "INFORMATION TECHNOLOGIES".
No. 12. Vol. 29. 2023

DOI: 10.17587/it.29.650-663

U. A. Grigorev, Dr. of Sc., Professor,
Bauman Moscow State Technical University, Moscow, 105005, Russian Federation

Estimating the Cardinality of Queries Based on a Sample from a Full Outer Join of Tables

Cardinality estimation (CardEst) plays an important role in creating high-quality query execution plans in the DBMS. In the last decade, a large number of methods have been developed: traditional methods (histograms, samples), machine learning methods based on queries or data. But all of them are based on different restrictions and assumptions, and the cardinality estimation with their help worsens with an increase in the number of joined tables. The article proposes two new methods based on the theory of approximate calculation of aggregates and allowing to remove most of the restrictions. Method 1 fetches blocks after executing the subqueries of the original query and does not require a preliminary analysis of the filter conditions. The condition for joining tables can be arbitrary (not necessarily the equality of attributes). Method 2 allows you to calculate the probabilities of reading blocks based on the metadata accumulated in the process of populating the database. Metadata takes up little memory, and the overhead of maintaining it is low. Method 2, in contrast to method 1, takes into account the true cardinality of the connection of neighboring blocks in the sample obtained from metadata. Therefore, the prospect opens up for a more accurate assessment of the cardinality of joining a large number of tables. Method 1 (EVACAR method) is implemented and compared with modern machine learning methods BayesCard, DeepDB, FLAT on a special STATS test. The results of the experiments confirmed the effectiveness of the EVACAR method. The EVACAR method is more accurate or its maximum q-error is comparable to machine learning methods for 75-88 % of evaluated queries (subplans). In the future, it is planned to implement the 2nd method for assessing the cardinality of queries.
Keywords: cardinality estimation, CardEst, sampling, full outer join, approximate calculation of aggregates

P. 650-663

References

Han Y., Wu Z., Wu P., Zhu R., Yang J., Tan L. W., Zeng K., Cong G., Qin Y., Pfadler A., Qian Z., Zhou J., Li J., Cui B. Cardinality Estimation in DBMS: A Comprehensive Benchmark Evaluation, Proceedings of the VLDB Endowment, 2021, vol. 15, no. 4, pp. 752—765.
Gunopulos D., Kollios G., Tsotras V. J., Domeniconi C. Selectivity estimators for multidimensional range queries over real attributes, The VLDB Journal, 2005, vol. 14, pp. 137—154.
Khachatryan A., Mtiller E., Stier C., Bohm K. Improving accuracy and robustness of self-tuning histograms by subspace clustering, IEEE Transactions on Knowledge and Data Engineering, 2015, vol. 27, no. 9, pp. 2377—2389.
Stillger M., Lohman G. M., Markl V., Kandil M. LEO-DB2's learning optimizer, VLDB, 2001, vol. 1, pp. 19—28.
Wu C., Jindal A., Amizadeh S., Patel H., Le W., Qiao S., Rao S. Towards a learning optimizer for shared clouds, Proceedings of the VLDB Endowment, 2018, vol. 12, no. 3, pp. 210—222.
Heimel M., Kiefer M., Markl V. Self-tuning, GPU-accelerated kernel density models for multidimensional selectivity estimation, Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, 2015, pp. 1477—1492.
Kiefer M., Heimel M., BreB S., Markl V. Estimating join selectivities using bandwidth-optimized kernel density models, Proceedings of the VLDB Endowment, 2017, vol. 10, no. 13, pp. 2085—2096.
Leis V., Radke B., Gubichev A., Kemper A., Neumann T. Cardinality Estimation Done Right: Index-Based Join Sampling, CIDR 2017 — 8th Biennial Conference on Innovative Data Systems Research, 2017, pp. 1—8.
Li F., Wu B., Yi K., Zhao Z. Wander join: Online aggregation via random walks, Proceedings of the 2016 International Conference on Management of Data, 2016, pp. 615—629.
Zhao Z., Christensen R., Li F., Hu X., Yi K. Random sampling over joins revisited, Proceedings of the 2018 International Conference on Management of Data, 2018, pp. 1525—1539.
Cai W., Balazinska M., Suciu D. Pessimistic cardinality estimation: Tighter upper bounds for intermediate join cardinalities, Proceedings of the 2019 International Conference on Management of Data, 2019, pp. 18—35.
Kipf A., Kipf T., Radke B., Leis V., Boncz P., Kemper A. Learned cardinalities: Estimating correlated joins with deep learning, CIDR 19 — 9th Biennial Conference on Innovative Data Systems Research, 2019, pp. 1—8.
13. Dutt A., Wang C., Nazi A., Kandula S., Narasayya V., Chaudhuri S. Selectivity estimation for range predicates using lightweight models, Proceedings of the VLDB Endowment, 2019, vol. 12, no. 9, pp. 1044—1057.
14. Yang Z., Kamsetty A., Luan S., Liang E., Duan Y., Chen X., Stoica I. NeuroCard: one cardinality estimator for all tables, Proceedings of the VLDB Endowment, 2020, vol. 14, no. 1, pp. 61—73.
15. Wu Z., Shaikhha A., Zhu R., Zeng K., Han Y., Zhou J. Bayescard: Revitilizing bayesian frameworks for cardinality estimation, arXiv preprint arXiv:2012.14743, 2020.
Hilprecht B., Schmidt A., Kulessa M., Molina A., Kers-ting K., Binnig C. DeepDB: Learn from Data, not from Queries!, Proceedings of the VLDB Endowment, 2020, vol. 13, no. 7, pp. 992—1005.
Zhu R., Wu Z., Han Y., Zeng K., Pfadler A., Qian Z., Zhou J., Cui B. FLAT: fast, lightweight and accurate method for cardinality estimation, Proceedings of the VLDB Endowment, 2021, vol. 14, no. 9, pp. 1489—1502.
Wu Z., Yu P., Yang P., Zhu R., Han Y., Li Y., Lian D., Zeng K., Zhou J. A unified transferable model for ml-enhanced dbms, arXiv preprint arXiv.2105.02418, 2021.
Zhang X., Wang J., Yin J. Sapprox: enabling efficient and accurate approximations on sub-datasets with distribution-aware online sampling, Proceedings of the VLDB Endowment, 2016, vol. 10, no. 3, pp. 109—120.
10. Burdakov A., Grigorev U., Ploutenko A., Ermakov O. Approximate Query Processing for Lambda Architecture, Proceedings of the 6th International Conference on Internet of Things, Big Data and Security, 2021, vol. 1, pp. 253—261.
Grigorev U., Ploutenko A., Burdakov A., Ermakov O. Comparison of Data Sampling Strategies for Approximate Processing of Queries to a Large Database, Informatsionnyye Tekhnologii, 2022, vol. 28, no. 5, pp. 240—249 (in Russian).
Grigorev U. Evaluation method the cardinality of tables joins in relational DBMS, Informatika i Sistemy Upravleniya, 2023, no. 1, pp. 3—15 (in Russian).
Zhao Z., Christensen R., Li F., Hu X., Yi K. Random sampling over joins revisited, Proceedings of the 2018 International Conference on Management of Data, 2018, pp. 1525—1539.
Matthew N., Stones R. Accessing PostgreSQL from C Using libpq, Beginning Databases with PostgreSQL: From Novice to Professional, 2005, pp. 385—417.
Leis V., Gubichev A., Mirchev A., Boncz P., Kemper A., Neumann T. How good are query optimizers, really?, Proceedings of the VLDB Endowment, 2015, vol. 9, no. 3, pp. 204—215.

To the contents