DOI: 10.17587/prin.14.311-328
Analysis of Modern Methods and Approaches to Object Detection in the Computer Vision Task
V. V. Shvyrov, Associate Professor, slsh@i.ua, D. A. Kapustin, Associate Professor, kap-kapchik@mail.ru, FSPU HE "Lugansk State Pedagogical University", Lugansk, 91011, Lugansk People's Republic
Corresponding author: Denis A. Kapustin, Associate Professor, FSPU HE "Lugansk State Pedagogical University", Lugansk, 91011, Lugansk People's Republic, kap-kapchik@mail.ru
Received on May 16, 2023
Accepted on June 01, 2023
Object detection is one of the most important tasks in computer vision and is actively used in many applied fields. The rapid growth of the entire field of artificial intelligence and the emergence of numerous new methods and approaches have produced an extremely large number of publications on object detection in images and video streams, image classification, semantic segmentation, and related problems.
In this paper, we analyze a corpus of 5792 English-language publications on object detection published in 2018-2023 and identify the key tendencies and directions in the field. In particular, data were obtained on the datasets used, high-level neural network frameworks, current feature-extraction architectures, and the neural network architectures used in the object detection module. Based on a frequency analysis of the corpus of publications, trends and priority areas in object detection over the past five years have been identified.
The results of the work should provide answers to the following questions:
- Q1. What image datasets are used in object detection tasks?
- Q2. What high-level frameworks for object detection are relevant at the moment?
- Q3. What feature-extraction (backbone) architectures are the most relevant at the moment?
- Q4. What neural network architectures for object detection are relevant, and how do the various object detection methods rank by current usage?
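The frequency analysis mentioned above can be illustrated by a minimal sketch: counting how many abstracts in a previously collected corpus mention each dataset, framework, or backbone architecture. The term lists, function names, and sample texts below are hypothetical illustrations and are not taken from the authors' pipeline.

```python
# Minimal sketch of a term-frequency count over a corpus of abstracts.
# Assumes the abstracts have already been collected as plain-text strings
# (e.g., scraped from arXiv); term groups here are illustrative only.
import re
from collections import Counter

TERM_GROUPS = {
    "datasets":   ["COCO", "Pascal VOC", "ImageNet", "KITTI", "nuScenes", "Open Images"],
    "frameworks": ["PyTorch", "TensorFlow", "Keras", "MXNet", "Darknet", "MMDetection"],
    "backbones":  ["ResNet", "VGG", "EfficientNet", "CSPNet", "Swin Transformer", "ViT"],
}

def count_mentions(abstracts: list[str]) -> dict[str, Counter]:
    """For each term group, count the number of abstracts mentioning each term."""
    counts = {group: Counter() for group in TERM_GROUPS}
    for text in abstracts:
        for group, terms in TERM_GROUPS.items():
            for term in terms:
                # Whole-word, case-insensitive match to avoid partial hits.
                if re.search(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE):
                    counts[group][term] += 1
    return counts

if __name__ == "__main__":
    sample = [
        "We train a Faster R-CNN detector with a ResNet backbone on COCO using PyTorch.",
        "YOLO-style detection on KITTI with a CSPNet backbone implemented in Darknet.",
    ]
    for group, counter in count_mentions(sample).items():
        print(group, counter.most_common(3))
```

In this sketch, each abstract contributes at most one hit per term, so the resulting counts reflect how many publications mention a dataset, framework, or backbone rather than how often it is repeated within a single text.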
Keywords: backbone architecture, datasets, frequency analysis, object detection, computer vision, convolutional neural network, pattern recognition, Python, semantic analysis, neural network frameworks
pp. 311–328
For citation:
Shvyrov V. V., Kapustin D. A. Analysis of Modern Methods and Approaches to Object Detection in the Computer Vision Task, Programmnaya Ingeneria, 2023, vol. 14, no. 7, pp. 311—328. DOI: 10.17587/prin.14.311-328. (in Russian)
References:
- Zou Z., Chen K., Shi Z., Guo Y., Ye J. Object Detection in 20 Years: A Survey, Proceedings of the IEEE, 2023, vol. 111, no. 3, pp. 257—276. DOI: 10.1109/JPROC.2023.3238524.
- Viola P., Jones M. Rapid object detection using a boosted cascade of simple features, CVPR, IEEE, 2001, vol. 1, pp. 511—518. DOI: 10.1109/CVPR.2001.990517.
- Viola P., Jones M. Robust real-time face detection, International Journal of Computer Vision, 2004, vol. 57, no. 2, pp. 137—154. DOI: 10.1023/B:VISI.0000013087.49260.fb.
- Krizhevsky A., Sutskever I., Hinton G. E. Imagenet classification with deep convolutional neural networks, Advances in neural information processing systems, 2012, pp. 1097—1105. DOI: 10.1145/3065386.
- Girshick R., Donahue J., Darrell T., Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR, 2014, pp. 580—587. DOI: 10.1109/CVPR.2014.81.
- Girshick R., Donahue J., Darrell T., Malik J. Region-based convolutional networks for accurate object detection and segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, vol. 38, no. 1, pp. 142—158. DOI: 10.1109/TPAMI.2015.2437384.
- He K., Zhang X., Ren S., Sun J. Spatial pyramid pooling in deep convolutional networks for visual recognition, ECCV, Springer, 2014, pp. 346—361. DOI: 10.1007/978-3-319-10578-9_23.
- Girshick R. Fast R-CNN, ICCV, 2015, pp. 1440—1448. DOI: 10.1109/ICCV.2015.169.
- Ren S., He K., Girshick R., Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks, Advances in neural information processing systems, 2015, pp. 91—99. DOI: 10.1109/TPAMI.2016.2577031.
- Lin T.-Y., Dollar P., Girshick R. B. et al. Feature pyramid networks for object detection, CVPR, 2017, vol. 1, no. 2, pp. 4. DOI: 10.1109/CVPR.2017.106.
- Redmon J., Divvala S., Girshick R., Farhadi A. You only look once: Unified, real-time object detection, CVPR, 2016, pp. 779—788. DOI: 10.1109/CVPR.2016.91.
- Liu W., Anguelov D., Erhan D. et al. SSD: Single shot multibox detector, ECCV. Springer, 2016, pp. 21—37. DOI: 10.1007/978-3-319-46448-0_2.
- Vaswani A., Shazeer N., Parmar N. et al. Attention is all you need, Advances in Neural Information Processing Systems, 2017, pp. 5998—6008.
- Carion N., Massa F., Synnaeve G. et al. End-to-end object detection with transformers, European Conference on Computer Vision, Springer, 2020, pp. 213—229. DOI: 10.1007/978-3-030-58452-8_13.
- Liu L., Ouyang W., Wang X. et al. Deep Learning for Generic Object Detection: A Survey, International Journal of Computer Vision, 2019, vol. 128, pp. 261—318. DOI: 10.1007/s11263-019-01247-4.
- Search | arXiv e-print repository, available at: https://arxiv.org/search/?searchtype=all&query=%22object+detection%22&abstracts=show&size=200&order=-announced_date_first (date of access 28.04.2023).
- Deerwester S., Dumais S. T., Furnas G. W., Landauer T. K., Harshman R. Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, 1990, vol. 41, pp. 391—407. DOI: 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9.
- Foltz P. Latent Semantic Indexing for text-based research, Behavior Research Methods, Instruments, and Computers, 1996, vol. 28, pp. 197—202.
- Foltz P., Kintsch W., Landauer T. K. The measurement of textual coherence with Latent Semantic Analysis, Discourse Processes, 1998, vol. 25, pp. 285—307. DOI: 10.1080/01638539809545029.
- Shvyrov V. V., Korop G. V., Nechay T. A., Shishlakova V. N. Study of Methods and Approaches for Solving the SLAM Problem Using Statistical Analysis of the Corpus of English-Language Publications, Bulletin of the Luhansk State Pedagogical University: Collection of Scientific Papers, Series 5: Humanities, Technical Sciences, Lugansk, Knita, 2022, vol. 3, no. 89, pp. 75—89 (in Russian).
- Kapustin D. A., Shvyrov V. V., Shulika T. I. Static Analysis of the Source Code of Python Applications, Programmnaya Ingeneria, 2022, vol. 13, no. 8, pp. 394—403. DOI: 10.17587/prin.13.394-403 (in Russian).
- beautifulsoup4 4.12.2, available at: https://pypi.org/project/beautifulsoup4/ (date of access 28.04.2023).
- Salton G., McGill M. J. Introduction to Modern Information Retrieval, McGraw-Hill Book Co., New York, 1983.
- Ipsen G. The Ancient Orient and the Indo-Europeans, in: Festschrift für W. Streitberg, Heidelberg, 1924, pp. 30—45.
- Shur G. S. Field theories in linguistics: a monograph, Moscow, Nauka, 1974, 254 p. (in Russian).
- Admoni V. G. Syntax of modern German language: A system of relations and a system of construction, Leningrad, Nauka, 1973, 366 p. (in Russian).
- Everingham M., Gool L. V., Williams C., Winn J., Zisserman A. The PASCAL Visual Object Classes (VOC) Challenge, IJCV, 2010, vol. 88, no. 2, pp. 303—338. DOI: 10.1007/s11263-009-0275-4.
- Deng J., Dong W., Socher R. et al. ImageNet: A large-scale hierarchical image database, 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 2009, pp. 248—255. DOI: 10.1109/CVPR.2009.5206848.
- Lin T., Maire M., Belongie S. et al. Microsoft COCO: Common objects in context, ECCV, 2014, pp. 740—755. DOI: 10.1007/978-3-319-10602-1_48.
- Machine Learning Datasets, available at: https://paperswithcode.com/datasets (date of access 28.04.2023).
- Geiger A., Lenz P., Urtasun R. Are we ready for autonomous driving? The KITTI vision benchmark suite, 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 2012, pp. 3354—3361. DOI: 10.1109/CVPR.2012.6248074.
- Caesar H., Bankiti V., Lang A. H. et al. nuScenes: A multimodal dataset for autonomous driving, CVPR, 2020, pp. 11618—11628. DOI: 10.1109/CVPR42600.2020.01164.
- Kuznetsova A., Rom H., Alldrin N. G. et al. The Open Images Dataset V4, International Journal of Computer Vision, 2018, vol. 128, pp. 1956—1981. DOI: 10.1007/s11263-020-01316-z.
- Xia G., Bai X., Ding J. et al. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 3974—3983. DOI: 10.1109/CVPR.2018.00418.
- Krishna R., Zhu Y., Groth O. et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, International Journal of Computer Vision, 2016, vol. 123, pp. 32—73. DOI: 10.1007/s11263-016-0981-7.
- Shao S., Zhao Z., Li B. et al. CrowdHuman: A Benchmark for Detecting Human in a Crowd, ArXiv, 2018, abs/1805.00123.
- Shao S., Li Z., Zhang T. et al. Objects365: A Large-Scale, High-Quality Dataset for Object Detection, 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 8429—8438. DOI: 10.1109/ICCV.2019.00852.
- Collobert R., Bengio S., Mariethoz J. Torch: a modular machine learning software library, Technical Report IDIAP-RR 02-46, IDIAP, 2002.
- Bergstra J., Breuleux O., Bastien F. et al. Theano: A CPU and GPU Math Expression Compiler, Proceedings of the 9th Python in Science Conference, 2010, pp. 18—24. DOI: 10.25080/MAJORA-92BF1922-003.
- Jia Y., Shelhamer E., Donahue J. et al. Caffe: Convolutional Architecture for Fast Feature Embedding, Proceedings of the 22nd ACM international conference on Multimedia, 2014, pp. 675—678. DOI: 10.1145/2647868.2654889.
- Abadi M., Agarwal A., Barham P. et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems, ArXiv, 2016, abs/1603.04467.
- TensorFlow, available at: https://www.tensorflow.org/ (date of access 28.04.2023).
- Paszke A., Gross S., Massa F. et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library, ArXiv, 2019, abs/1912.01703.
- PyTorch, available at: https://pytorch.org/ (date of access 28.04.2023).
- Darknet: Open source neural networks in C, available at: http://pjreddie.com/darknet (date of access 30.05.2023).
- Chen T., Li M., Li Y. et al. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems, ArXiv, 2015, abs/1512.01274.
- Huawei Technologies Co., Ltd. Huawei MindSpore AI Development Framework, Artificial Intelligence Technology, Springer, Singapore, 2023. DOI: 10.1007/978-981-19-2879-6_5.
- Keras: Deep Learning for humans, available at: https://keras.io/ (date of access 30.05.2023).
- Detectron, available at: https://github.com/facebookresearch/Detectron/ (date of access 30.05.2023).
- Chen K., Wang J., Pang J. et al. MMDetection: Open MMLab Detection Toolbox and Benchmark, ArXiv, 2019, abs/1906.07155.
- Krizhevsky A., Sutskever I., Hinton G. E. Imagenet classification with deep convolutional neural networks, NIPS, 2012, pp. 1097—1105. DOI: 10.1145/3065386.
- Szegedy C., Liu W., Jia Y. et al. Going deeper with convolutions, CVPR, 2015, pp. 1—9. DOI: 10.1109/CVPR.2015.7298594.
- Hu J., Shen L., Sun G. Squeeze-and-excitation networks, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 7132—7141. DOI: 10.1109/CVPR.2018.00745.
- Li W., Wang L., Li W., Agustsson E., Gool L. WebVision Database: Visual Learning and Understanding from Web Data, ArXiv, 2017, abs/1708.02862.
- Elharrouss O., Akbari Y., Almaadeed N., Al-ma'adeed S. Backbones-Review: Feature Extraction Networks for Deep Learning and Deep Reinforcement Learning Approaches, ArXiv, 2022, abs/2206.08016. DOI: 10.48550/arXiv.2206.08016.
- Lin M., Chen Q., Yan S. Network In Network, CoRR, 2014, abs/1312.4400.
- Simonyan K., Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition, CoRR, 2014, abs/1409.1556.
- He K., Zhang X., Ren S., Sun J. Deep residual learning for image recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016, pp. 770—778. DOI: 10.1109/cvpr.2016.90.
- Tan M., Le Q. V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, Long Beach, 9—15 June 2019, pp. 6105—6114.
- Elsken T., Metzen J. H., Hutter F. Neural Architecture Search: A Survey, Journal of Machine Learning Research, 2018, vol. 20, pp. 55:1—55:21, arXiv:1808.05377.
- Tan M., Pang R., Le Q. EfficientDet: Scalable and Efficient Object Detection, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 2020, pp. 10778—10787. DOI: 10.1109/CVPR42600.2020.01079.
- Du X., Lin T., Jin P. et al. SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 2020, pp. 11589—11598. DOI: 10.1109/CVPR42600.2020.01161.
- Zhou X., Wang D., Krähenbühl P. Objects as Points, ArXiv, 2019, abs/1904.07850.
- Qin Z., Li Z., Zhang Z. et al. ThunderNet: Towards Real-Time Generic Object Detection on Mobile Devices, 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), 2019, pp. 6717—6726. DOI: 10.1109/ICCV.2019.00682.
- Wang C., Liao H. M., Yeh I. et al. CSPNet: A New Backbone that can Enhance Learning Capability of CNN, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 2020, pp. 1571—1580. DOI: 10.1109/CVPRW50498.2020.00203.
- Xie S., Girshick R., Dollar P. et al. Aggregated residual transformations for deep neural networks, Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1492—1500. DOI: 10.1109/CVPR.2017.634.
- Liu Y., Zhang Y., Wang Y. et al. A Survey of Visual Transformers, IEEE Transactions on Neural Networks and Learning Systems, 2023, pp. 1—21. DOI: 10.1109/TNNLS.2022.3227717.
- Dosovitskiy A., Beyer L., Kolesnikov A. et al. An image is worth 16x16 words: Transformers for image recognition at scale, ICLR, 2020, ArXiv, abs/2010.11929.
- Liu Z., Lin Y., Cao Y. et al. Swin transformer: Hierarchical vision transformer using shifted windows, 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 2021, pp. 9992—10002. DOI: 10.1109/ICCV48922.2021.00986.
- Fedus W., Zoph B., Shazeer N. M. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, ArXiv, 2021, abs/2101.03961.
- Liu Z., Hu H., Lin Y. et al. Swin Transformer V2: Scaling Up Capacity and Resolution, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 11999—12009. DOI: 10.1109/CVPR52688.2022.01170.
- Zhai X., Kolesnikov A., Houlsby N., Beyer L. Scaling Vision Transformers, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 12104—12113. DOI: 10.1109/CVPR52688.2022.01179.
- An Overview of Object Detection Models | Papers With Code, available at: https://paperswithcode.com/methods/category/object-detection-models (date of access 30.05.2023).