ABSTRACTS OF ARTICLES OF THE JOURNAL "INFORMATION TECHNOLOGIES".
No. 5. Vol. 28. 2022

DOI: 10.17587/it.28.240-249

U. A. Grigorev1, Dr. of Sci., Professor, A. D. Ploutenko2, Dr. of Sci., Rector, A. V. Burdakov1, Ph.D., Researcher, O. Y. Ermakov1, Postgraduate Student
1Bauman Moscow State Technical University, Moscow, 105005, Russian Federation
2Amur State University, Blagoveschensk, 675027, Russian Federation

Comparison of Data Sampling Strategies for Approximate Processing of Queries to a Large Database

Interactive analytics requires database systems to process aggregation queries promptly. As data volumes continue to grow rapidly, this becomes increasingly difficult. The database community has proposed two approaches to this problem: Approximate Query Processing (AQP) and Aggregate Precomputation (AggPre). This article focuses on AQP. Many techniques have been proposed for optimizing AQP performance, most of them based on constructing the best stratified samples. However, none of the existing approaches answers the following questions: "Under what conditions is a sampling strategy optimal? What is the maximum error in estimating an aggregate under data skew when the strategy is used?" The article answers these questions. It considers three strategies for selecting segments from a data array: 1) uniform, 2) based on the distribution of read records across segments, 3) based on the distribution of read records and aggregate values across segments. Conditions for the optimality of each strategy are defined. The proposed analytical methods solve the optimization problem and determine the conditions under which the confidence interval of the estimated aggregate is minimal. The effect of data skew on the accuracy of aggregate estimation is also studied: the maximum value of the confidence interval is calculated at the boundary of the admissible region. Experiments on depersonalized banking transactions showed that the third sampling strategy is much better than the first two.
Keywords: Approximate Query Processing, AQP, aggregation, sampling strategies
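
The abstract does not state the formulas behind the three strategies, so the following is only an illustrative sketch and not the authors' method. It assumes that segments are drawn with replacement and that SUM is estimated with the standard Hansen-Hurwitz estimator. The weighting rules (equal weights, weights proportional to the record count, weights proportional to the sum of absolute values) and all function names are hypothetical stand-ins for strategies 1 to 3.

```python
import random
import statistics


def selection_probabilities(segments, strategy):
    """Return one selection probability per segment.

    The three weighting rules are assumptions made for illustration.
    """
    if strategy == "uniform":
        weights = [1.0] * len(segments)
    elif strategy == "by_count":
        # Weight by the number of records read from each segment.
        weights = [float(len(seg)) for seg in segments]
    elif strategy == "by_count_and_value":
        # Weight by the absolute aggregate mass held in each segment.
        weights = [float(sum(abs(v) for v in seg)) for seg in segments]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    total = sum(weights)
    return [w / total for w in weights]


def estimate_sum(segments, probs, m, rng):
    """Hansen-Hurwitz estimate of the overall SUM from m sampled segments."""
    drawn = rng.choices(range(len(segments)), weights=probs, k=m)
    return sum(sum(segments[i]) / probs[i] for i in drawn) / m


def make_skewed_segments(rng, n_segments=50):
    """Synthetic data: every tenth segment holds far more records (skew)."""
    segments = []
    for i in range(n_segments):
        size = rng.randint(200, 400) if i % 10 == 0 else rng.randint(1, 20)
        segments.append([rng.uniform(1.0, 100.0) for _ in range(size)])
    return segments


def spread(segments, strategy, m=10, trials=200, seed=1):
    """Standard deviation of the SUM estimate over repeated trials."""
    rng = random.Random(seed)
    probs = selection_probabilities(segments, strategy)
    estimates = [estimate_sum(segments, probs, m, rng) for _ in range(trials)]
    return statistics.pstdev(estimates)


if __name__ == "__main__":
    data = make_skewed_segments(random.Random(0))
    for name in ("uniform", "by_count", "by_count_and_value"):
        print(name, spread(data, name))
```

Under these assumptions, and with strictly positive values, weighting by the segment sum makes every term of the estimator equal to the true total, so its spread collapses to floating-point noise. Real data with mixed signs or unknown segment sums would not behave this cleanly.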

P. 240–249
