Journal "Software Engineering"
A journal on theoretical and applied science and technology
ISSN 2220-3397

Issue No. 4, 2026

DOI: 10.17587/prin.17.191-200
A Cross-Attention and Deformable Convolution Approach for Generating Invariant Descriptors in Visual Geolocation Tasks
S. A. Morozov, Master's Student, serzh51.51@yandex.ru, D. E. Rokhlin, Master's Student, emptyline.13@yandex.ru, E. A. Skorniakova, Cand. Sc. (Eng.), Associate Professor, skorniakova_ea@voenmeh.ru, Baltic State Technical University "VOENMEH" named after D. F. Ustinov, Saint Petersburg, 190005, Russian Federation
Corresponding author: Sergey A. Morozov, Master's Student, Baltic State Technical University "VOENMEH" named after D. F. Ustinov, Saint Petersburg, 190005, Russian Federation, E-mail: serzh51.51@yandex.ru
Received on November 15, 2025
Accepted on December 18, 2025

This article addresses the critical task of visual geolocalization for unmanned aerial vehicles (UAVs) in GNSS-denied environments. It focuses on overcoming the limitations of classical feature-based methods, which are highly sensitive to photometric and textural variations in aerial images captured at different times and under different conditions. The paper proposes a comprehensive approach based on a novel convolutional neural network, CADE-Net, which integrates an adaptive architecture, a cross-attention mechanism, and a hybrid loss function. The core challenge lies in matching a current onboard aerial image against a georeferenced satellite or aerial map from a database despite significant changes caused by varying seasons, weather, and lighting conditions. The proposed CADE-Net architecture is specifically designed to tackle these challenges. It employs a two-stream structure with deformable convolutions to achieve geometric invariance to scale and rotation. Furthermore, a cross-attention mechanism is integrated between the streams to explicitly model the relationships between image pairs, enabling the network to focus on semantically corresponding regions despite their visual differences. The training process is enhanced by a hybrid loss function that combines metric learning principles with an adversarial approach. A key innovation is the intelligent mining strategy for hard negatives, which forces the model to learn fine-grained details by distinguishing between structurally similar but semantically different objects. The proposed approach generates robust, invariant features that are resilient to complex, non-linear distortions. This work is of significant practical importance for developing fully autonomous and reliable UAV navigation systems capable of operating effectively without GNSS signals.
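To make the described design concrete, below is a minimal PyTorch sketch of the two ideas the abstract names: a shared two-stream encoder built from deformable convolutions with a cross-attention exchange between the streams, and a triplet-style metric loss with in-batch hard-negative mining. All class names, layer sizes, and hyperparameters here are illustrative assumptions for exposition, not the authors' published CADE-Net implementation, and the adversarial term of the hybrid loss is omitted.

    # Minimal sketch of a two-stream encoder with deformable convolutions
    # and cross-attention, under the assumptions stated above.
    import torch
    import torch.nn as nn
    from torchvision.ops import DeformConv2d

    class DeformBlock(nn.Module):
        """Conv block whose sampling grid is learned, giving tolerance
        to local geometric distortions such as scale and rotation."""
        def __init__(self, c_in, c_out, k=3):
            super().__init__()
            # A plain conv predicts 2 offsets (dx, dy) per kernel tap.
            self.offset = nn.Conv2d(c_in, 2 * k * k, k, padding=k // 2)
            self.deform = DeformConv2d(c_in, c_out, k, padding=k // 2)
            self.act = nn.ReLU(inplace=True)

        def forward(self, x):
            return self.act(self.deform(x, self.offset(x)))

    class CrossAttentionFusion(nn.Module):
        """Lets each stream attend to the other, so the network can align
        semantically corresponding regions despite photometric changes."""
        def __init__(self, dim, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, a, b):
            # Flatten spatial maps to token sequences: (B, C, H, W) -> (B, HW, C).
            B, C, H, W = a.shape
            ta = a.flatten(2).transpose(1, 2)
            tb = b.flatten(2).transpose(1, 2)
            fa, _ = self.attn(ta, tb, tb)   # stream A queries stream B
            fb, _ = self.attn(tb, ta, ta)   # stream B queries stream A
            back = lambda t: t.transpose(1, 2).reshape(B, C, H, W)
            return back(fa), back(fb)

    class TwoStreamEncoder(nn.Module):
        """Shared-weight deformable encoder plus cross-attention, producing
        one L2-normalized global descriptor per image."""
        def __init__(self, dim=128):
            super().__init__()
            self.backbone = nn.Sequential(
                DeformBlock(3, 64), nn.MaxPool2d(2),
                DeformBlock(64, dim), nn.MaxPool2d(2),
            )
            self.fusion = CrossAttentionFusion(dim)
            self.pool = nn.AdaptiveAvgPool2d(1)

        def forward(self, uav_img, map_img):
            a, b = self.backbone(uav_img), self.backbone(map_img)
            a, b = self.fusion(a, b)
            da = nn.functional.normalize(self.pool(a).flatten(1), dim=1)
            db = nn.functional.normalize(self.pool(b).flatten(1), dim=1)
            return da, db

The metric-learning half of the hybrid loss can be sketched with in-batch hard-negative mining; the exact mining policy of the paper is not reproduced here:

    # Hedged sketch: triplet loss with in-batch hard negatives.
    def triplet_hard_loss(anchors, positives, margin=0.3):
        """anchors, positives: (B, D) L2-normalized descriptors of matching
        UAV/map pairs; other positives in the batch serve as negatives."""
        d = torch.cdist(anchors, positives)             # (B, B) pairwise distances
        pos = d.diag()                                  # distances of true matches
        # Mask out the true match, then take the closest wrong match: these
        # hard negatives are structurally similar but wrong locations.
        neg = (d + torch.eye(len(d), device=d.device) * 1e9).min(dim=1).values
        return torch.relu(pos - neg + margin).mean()

    # Usage example with random tensors standing in for image batches:
    # enc = TwoStreamEncoder()
    # da, db = enc(torch.randn(8, 3, 128, 128), torch.randn(8, 3, 128, 128))
    # loss = triplet_hard_loss(da, db)

Mining the hardest in-batch negative is one simple approximation of the intelligent hard-negative mining the abstract describes: the closest non-matching descriptor is precisely the structurally similar but semantically different candidate the model must learn to reject.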

Keywords: visual geolocalization, unmanned aerial vehicle (UAV), GNSS, convolutional neural networks (CNN), image matching, invariant features, deep learning, robust descriptors, image-based navigation
pp. 191—200
For citation:
Morozov S. A., Rokhlin D. E., Skorniakova E. A. A Cross-Attention and Deformable Convolution Approach for Generating Invariant Descriptors in Visual Geolocation Tasks, Programmnaya Ingeneria, 2026, vol. 17, no. 4, pp. 191—200. DOI: 10.17587/prin.17.191-200. (in Russian).
References:
  1. Wang X., Kealy A., Li W. et al. Toward Autonomous UAV Localization via Aerial Image Registration, Electronics, 2021, vol. 10, no. 4, article 435. DOI: 10.3390/electronics10040435.
  2. Schleiss M. Translating aerial images into street-map-like representations for visual self-localization of UAVs, The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2019, vol. XLII-2/W13, pp. 575—582. DOI: 10.5194/isprs-archives-XLII-2-W13-575-2019.
  3. Silva Filho P., Shiguemori E. H., Saotome O. UAV visual autolocalization based on automatic landmark recognition, ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2017, vol. IV-2/W3, pp. 89—94. DOI: 10.5194/isprs-annals-IV-2-W3-89-2017.
  4. He H., Chen M., Chen T., Li D. Matching of Remote Sensing Images with Complex Background Variations via Siamese Convolutional Neural Network, Remote Sensing, 2018, vol. 10, no. 2, article 355. DOI: 10.3390/rs10020355.
  5. Chibunichev A. G., Kobzev A. A. Investigation of the possibility of joint photogrammetric processing of multi-temporal aerial photographs, Izvestiya VUZov. Geodesy and Aerophotosurveying, 2021, vol. 65, no. 3, pp. 292—301 (in Russian).
  6. Chen J., Xie H., Zhang L. et al. SAR and Optical Image Registration Based on Deep Learning with Co-Attention Matching Module, Remote Sensing, 2023, vol. 15, no. 15, article 3879. DOI: 10.3390/rs15153879.
  7. Antipova N. V., Gvozdev O. G., Kozub V. A. et al. Restoration of Structural Information on Anthropogenic Objects from Single Aerospace Images, Journal of Computer and Systems Sciences International, 2023, vol. 62, no. 3, pp. 90—105. DOI: 10.31857/S0002338823030010.
  8. He S., Zhou R., Li S. et al. Disparity Estimation of High-Resolution Remote Sensing Images with Dual-Scale Matching Network, Remote Sensing, 2021, vol. 13, no. 24, article 5050. DOI: 10.3390/rs13245050.
  9. Harvey W., Rainwater C., Cothren J. Direct Aerial Visual Geolocalization Using Deep Neural Networks, Remote Sensing, 2021, vol. 13, no. 19, article 4017. DOI: 10.3390/rs13194017.
  10. Isola P., Zhu J. Y., Zhou T., Efros A. A. Image-to-Image Translation with Conditional Adversarial Networks, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5967—5976. DOI: 10.1109/CVPR.2017.632.
  11. Chen P., Fu Y., Hu J. et al. An Adaptive Remote Sensing Image-Matching Network Based on Cross-Attention and Deformable Convolution, Electronics, 2023, vol. 12, no. 13, article 2889. DOI: 10.3390/electronics12132889.
  12. Chen Y., Jiang J. A Two-Stage Deep Learning Registration Method for Remote Sensing Images Based on Sub-Image Matching, Remote Sensing, 2021, vol. 13, no. 17, article 3443. DOI: 10.3390/rs13173443.