Journal "Software Engineering"
a journal on theoretical and applied science and technology
ISSN 2220-3397

Issue No. 1, 2026

DOI: 10.17587/prin.17.14-23
Review of Deep Learning Models for Cross-View Geo-Localization Tasks Accounting for Contextual Information
T. V. Yermolenko, Associate Professor, naturewild71@yandex.ru, V. I. Bondarenko, Associate Professor, mail@vibondarenko.ru, Donetsk State University, Donetsk, 283001, Donetsk People's Republic, Russian Federation
Corresponding author: Tatyana V. Yermolenko, Associate Professor, Donetsk State University, 283001, Donetsk, Donetsk People's Republic, Russian Federation, E-mail: naturewild71@yandex.ru
Received on June 27, 2025
Accepted on August 06, 2025

This paper reviews state-of-the-art deep learning models designed for cross-view geo-localization tasks and presents a comparative analysis of their performance. We address key challenges in applying machine learning to autonomous UAV navigation systems, including variations in imaging conditions, occlusions, the scarcity of annotated datasets, and class imbalance. Convolutional networks are the primary architectures for matching aerial images (UAV and satellite). To capture local features and contextual information from all image regions, these networks employ image partitioning techniques; a notable convolutional model for context-aware processing is the Local Pattern Network (LPN), which uses a square-ring partitioning strategy. Transformers, with their self-attention mechanism, serve as powerful extractors of context-dependent features. The Vision Transformer (ViT) architecture, adapted for image analysis, processes all image patches through Transformer layers; its attention mechanism iteratively refines patch embeddings, progressively incorporating semantic relationships between them. Building on ViT, the feature segmentation and region alignment approach partitions images into regions according to heatmap distributions and aligns UAV imagery with satellite data; it is robust to object displacement and scale variations across image pairs. To further improve matching accuracy, multimodal models such as the Multi-Branch Feature Fusion Network (MBF) are employed. MBF combines the strengths of convolutional networks and Transformers by integrating natural language descriptions of objects alongside visual data. However, multimodal approaches face limitations, including scarce multimodal datasets and the high computational cost of large base networks, which results in slow inference.
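The square-ring partitioning mentioned above can be illustrated with a minimal sketch: spatial locations of a feature map are grouped into concentric square rings (Chebyshev-distance bands around the center), and each ring is pooled into one descriptor. The function name, the number of rings, and the choice of average pooling are illustrative assumptions, not the exact LPN implementation.

```python
import numpy as np

def square_ring_partition(feat, num_rings=4):
    """Partition an (H, W, C) feature map into concentric square rings
    and average-pool each ring into one C-dimensional descriptor.
    A simplified sketch of a square-ring partitioning strategy."""
    H, W, C = feat.shape
    ys, xs = np.mgrid[0:H, 0:W]
    cy, cx = (H - 1) / 2.0, (W - 1) / 2.0
    # Chebyshev distance of each cell from the map center, normalized to [0, 1)
    dist = np.maximum(np.abs(ys - cy) / (H / 2.0), np.abs(xs - cx) / (W / 2.0))
    # Ring index: 0 (innermost) .. num_rings - 1 (outermost)
    ring = np.minimum((dist * num_rings).astype(int), num_rings - 1)
    # One pooled descriptor per ring
    return np.stack([feat[ring == r].mean(axis=0) for r in range(num_rings)])

desc = square_ring_partition(np.random.rand(16, 16, 256), num_rings=4)
print(desc.shape)  # (4, 256)
```

Because each descriptor depends only on a cell's distance from the image center, the pooled representation is insensitive to rotations of the aerial view around that center, which is the property that makes this partitioning attractive for UAV-to-satellite matching.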

Keywords: cross-view geo-localization, local pattern network, vision transformer, feature segmentation and region alignment, cross-modality feature fusion, multi-branch fusion network
pp. 14—23
For citation:
Yermolenko T. V., Bondarenko V. I. Review of Deep Learning Models for Cross-View Geo-Localization Tasks Accounting for Contextual Information, Programmnaya Ingeneria, 2026, vol. 17, no. 1, pp. 14—23. DOI: 10.17587/prin.17.14-23. (in Russian).
References:
  1. Sonkar S. Transfer Learning: Leveraging Pre-Trained Models for Limited Datasets Through Fine-Tuning, International Journal of Information Technology and Management Information Systems, 2025, vol. 16, no. 2, pp. 564—576. DOI: 10.34218/IJIT-MIS_16_02_037.
  2. Ali B., Sadekov R. N., Tsodokova V. V. UAV Navigation Algorithms Using Computer Vision Systems, Giroskopiya i Navigatsiya, 2022, vol. 30, no. 4 (119), pp. 87—105 (in Russian).
  3. Lin J., Zheng Z., Zhong Z. et al. Joint Representation Learning and Keypoint Detection for Cross-view Geo-localization, IEEE Transactions on Image Processing, 2022, vol. 31, pp. 4560—4572. DOI: 10.1109/TIP.2022.3175601.
  4. Wang X., Girshick R., Gupta A., He K. Non-local neural networks, Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'18), Salt Lake City, UT, USA, IEEE, 2018, pp. 7794—7803. DOI: 10.1109/CVPR.2018.00813.
  5. Vaswani A., Shazeer N., Parmar N. et al. Attention is all you need, Advances in Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, Curran Associates, 2017, pp. 5998—6008.
  6. Wang T., Zheng Z., Yan C. et al. Each Part Matters: Local Patterns Facilitate Cross-view Geo-localization, IEEE Transactions on Circuits and Systems for Video Technology, 2020, vol. 32, no. 2, pp. 867—879. DOI: 10.1109/TCSVT.2021.3061265.
  7. Chopra S., Hadsell R., LeCun Y. Learning a similarity metric discriminatively, with application to face verification, Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), San Diego, CA, USA, IEEE, 2005, vol. 1, pp. 539—546. DOI: 10.1109/CVPR.2005.202.
  8. Simonyan K., Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition, International Conference on Learning Representations (ICLR), 2015, available at: https://arxiv.org/abs/1409.1556 (date of access 01.12.2025).
  9. He K., Zhang X., Ren S., Sun J. Deep residual learning for image recognition, Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR'16), Las Vegas, NV, USA, IEEE, 2016, pp. 770—778. DOI: 10.1109/CVPR.2016.90.
  10. Dosovitskiy A., Beyer L., Kolesnikov A. et al. An image is worth 16x16 words: Transformers for image recognition at scale, 9th International Conference on Learning Representations (ICLR 2021), Virtual, 2021, pp. 1—21, available at: https://openreview.net/forum?id=YicbFdNTTy (date of access 01.12.2025).
  11. Tian Y., Chen C., Shah M. Cross-view image matching for geolocalization in urban environments, Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR'17), Honolulu, HI, USA, IEEE, 2017, pp. 1998—2006.
  12. Altwaijry H., Trulls E., Hays J. et al. Learning to match aerial images with deep attentive architectures, Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR'16), Las Vegas, NV, USA, IEEE, 2016, pp. 3539—3547. DOI: 10.1109/CVPR.2016.385.
  13. Cai S., Guo Y., Khan S. et al. Ground-to-aerial image geo-localization with a hard exemplar reweighting triplet loss, Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV'19), Seoul, Korea, IEEE, 2019, pp. 8390—8399. DOI: 10.1109/ICCV.2019.00848.
  14. Jaderberg M., Simonyan K., Zisserman A., Kavukcuoglu K. Spatial transformer networks, Advances in Neural Information Processing Systems (NIPS 2015), Montreal, Canada, Curran Associates, 2015, pp. 2017—2025.
  15. Ren S., He K., Girshick R., Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, vol. 39, no. 6, pp. 1137—1149. DOI: 10.1109/TPAMI.2016.2577031.
  16. Rodrigues R., Tani M. Are these from the same place? Seeing the unseen in cross-view image geo-localization, Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, IEEE, 2021, pp. 3752—3760. DOI: 10.1109/WACV48630.2021.00380.
  17. Yang H., Lu X., Zhu Y. Cross-view geo-localization with layer-to-layer transformer, Advances in Neural Information Processing Systems (NeurIPS 2021), Virtual, Curran Associates, 2021, pp. 1—12.
  18. Dai M., Hu J., Zhuang J., Zheng E. A Transformer-Based Feature Segmentation and Region Alignment Method for UAV-View Geo-Localization, IEEE Transactions on Circuits and Systems for Video Technology, 2021, vol. 32, no. 7, pp. 4376—4389. DOI: 10.1109/TCSVT.2021.3135013.
  19. Schroff F., Kalenichenko D., Philbin J. FaceNet: A Unified Embedding for Face Recognition and Clustering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 815—823. DOI: 10.1109/CVPR.2015.7298682.
  20. Chen Y. C., Li L., Yu L. et al. UNITER: Universal image-text representation learning, Computer Vision — ECCV 2020: 16th European Conference, Glasgow, UK, August 23—28, 2020, Proceedings, Part XVI, Berlin, Heidelberg, Springer, 2020, pp. 104—120. DOI: 10.1007/978-3-030-58577-8_7.
  21. Huang Z., Zeng Z., Huang Y. et al. Seeing out of the box: End-to-end pre-training for vision-language representation learning, Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'21), Virtual, IEEE, 2021, pp. 12976—12985. DOI: 10.1109/CVPR46437.2021.01278.
  22. Kim W., Son B., Kim I. ViLT: Vision-and-language transformer without convolution or region supervision, Proceedings of the 38th International Conference on Machine Learning (ICML 2021), Virtual, PMLR, 2021, pp. 5583—5594.
  23. Tan H., Bansal M. LXMERT: Learning cross-modality encoder representations from transformers, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), Hong Kong, China, Association for Computational Linguistics, 2019, pp. 5100—5111. DOI: 10.18653/v1/D19-1514.
  24. Radford A., Kim J. W., Hallacy C. et al. Learning transferable visual models from natural language supervision, Proceedings of the 38th International Conference on Machine Learning (ICML 2021), Virtual, PMLR, 2021, pp. 8748—8763.
  25. Talreja V., Valenti M. C., Nasrabadi N. M. Deep hashing for secure multimodal biometrics, IEEE Transactions on Information Forensics and Security, 2020, vol. 16, pp. 1306—1321. DOI: 10.1109/TIFS.2020.3033189.
  26. Lin T. Y., RoyChowdhury A., Maji S. Bilinear CNN models for fine-grained visual recognition, Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, IEEE, 2015, pp. 1449—1457. DOI: 10.1109/ICCV.2015.170.
  27. Wang X., Li S., Chen C. et al. Data-level recombination and lightweight fusion scheme for RGB-D salient object detection, IEEE Transactions on Image Processing, 2020, vol. 30, pp. 458—471. DOI: 10.1109/TIP.2020.3037470.
  28. George A., Marcel S. Cross-modal focal loss for RGB-D face anti-spoofing, Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'21), Virtual, IEEE, 2021, pp. 7882—7891. DOI: 10.1109/CVPR46437.2021.00779.
  29. Zheng A., Wang Z., Chen Z. et al. Robust multi-modality person re-identification, Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI 2021), Virtual, AAAI Press, 2021, vol. 35, no. 4, pp. 3529—3537.
  30. Alay N., Al-Baity H. Deep learning approach for multi-modal biometric recognition system based on fusion of iris, face, and finger vein traits, Sensors, 2020, vol. 20, no. 19, article 5523. DOI: 10.3390/s20195523.
  31. Soleymani S., Dabouei A., Kazemi H. et al. Multi-level feature abstraction from convolutional neural networks for multimodal biometric identification, Proceedings of the 24th International Conference on Pattern Recognition (ICPR 2018), Beijing, China, IEEE, 2018, pp. 3469—3476. DOI: 10.1109/ICPR.2018.8545061.
  32. Zhu R., Yang M., Yin L. et al. UAV's status is worth considering: A fusion representations matching method for geo-localization, Sensors, 2023, vol. 23, no. 2, article 720. DOI: 10.3390/s23020720.
  33. He K., Zhang X., Ren S., Sun J. Identity mappings in deep residual networks, Computer Vision — ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11—14, 2016, Proceedings, Part IV, Berlin, Heidelberg, Springer, 2016, pp. 630—645. DOI: 10.1007/978-3-319-46493-0_38.
  34. Chopra S., Hadsell R., LeCun Y. Learning a similarity metric discriminatively, with application to face verification, Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), San Diego, CA, USA, IEEE, 2005, vol. 1, pp. 539—546. DOI: 10.1109/CVPR.2005.202.
  35. He K., Zhang X., Ren S., Sun J. Deep residual learning for image recognition, Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR'16), Las Vegas, NV, USA, IEEE, 2016, pp. 770—778. DOI: 10.1109/CVPR.2016.90.
  36. Devlin J., Chang M.-W., Lee K., Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Minneapolis, MN, USA, Association for Computational Linguistics, 2019, vol. 1, pp. 4171—4186. DOI: 10.18653/v1/N19-1423.
  37. Zheng Z., Wei Y., Yang Y. University-1652: A Multi-view Multi-source Benchmark for Drone-based Geo-localization, Proceedings of the 28th ACM International Conference on Multimedia (ACM MM), 2020, pp. 1—9. DOI: 10.1145/3394171.3413896.