Journal "Software Engineering"
A journal on theoretical and applied science and technology
ISSN 2220-3397

Issue No. 4, 2026

DOI: 10.17587/prin.17.191-200
A Cross-Attention and Deformable Convolution Approach for Generating Invariant Descriptors in Visual Geolocation Tasks
S. A. Morozov, Master's Student, serzh51.51@yandex.ru, D. E. Rokhlin, Master's Student, emptyline.13@yandex.ru, E. A. Skorniakova, Cand. Sc. (Eng.), Associate Professor, skorniakova_ea@voenmeh.ru, Baltic State Technical University "VOENMEH" named after D. F. Ustinov, Saint Petersburg, 190005, Russian Federation
Corresponding author: Sergey A. Morozov, Master's Student, Baltic State Technical University "VOENMEH" named after D. F. Ustinov, Saint Petersburg, 190005, Russian Federation, E-mail: serzh51.51@yandex.ru
Received on November 15, 2025
Accepted on December 18, 2025

This article addresses the critical task of visual geolocalization for unmanned aerial vehicles (UAVs) in GNSS-denied environments. It focuses on overcoming the limitations of classical feature-based methods, which are highly sensitive to photometric and textural variations in aerial images captured at different times and under different conditions. The paper proposes a comprehensive approach based on a novel convolutional neural network, CADE-Net, which integrates an adaptive architecture, a cross-attention mechanism, and a hybrid loss function. The core challenge lies in matching a current onboard aerial image against a georeferenced satellite or aerial map from a database despite significant changes caused by varying seasons, weather, and lighting conditions. The proposed CADE-Net architecture is specifically designed to tackle these challenges. It employs a two-stream structure with deformable convolutions to achieve geometric invariance to scale and rotation. Furthermore, a cross-attention mechanism is integrated between the streams to explicitly model the relationships between image pairs, enabling the network to focus on semantically corresponding regions despite their visual differences. The training process is enhanced by a hybrid loss function that combines metric learning principles with an adversarial approach. A key innovation is the intelligent mining strategy for hard negatives, which forces the model to learn fine-grained details by distinguishing between structurally similar but semantically different objects. The proposed approach generates robust, invariant features that are resilient to complex, non-linear distortions. This work is of significant practical importance for developing fully autonomous and reliable UAV navigation systems capable of operating effectively without GNSS signals.
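To make the described design concrete, below is a minimal PyTorch sketch of the two ideas the abstract names: a shared two-stream encoder built from deformable convolutions with a cross-attention exchange between the streams, and a triplet-style metric loss with in-batch hard-negative mining. All class names, layer sizes, and hyperparameters here are illustrative assumptions for exposition, not the authors' published CADE-Net implementation, and the adversarial term of the hybrid loss is omitted.

    # Minimal sketch of a two-stream encoder with deformable convolutions
    # and cross-attention, under the assumptions stated above.
    import torch
    import torch.nn as nn
    from torchvision.ops import DeformConv2d

    class DeformBlock(nn.Module):
        """Conv block whose sampling grid is learned, giving tolerance
        to local geometric distortions such as scale and rotation."""
        def __init__(self, c_in, c_out, k=3):
            super().__init__()
            # A plain conv predicts 2 offsets (dx, dy) per kernel tap.
            self.offset = nn.Conv2d(c_in, 2 * k * k, k, padding=k // 2)
            self.deform = DeformConv2d(c_in, c_out, k, padding=k // 2)
            self.act = nn.ReLU(inplace=True)

        def forward(self, x):
            return self.act(self.deform(x, self.offset(x)))

    class CrossAttentionFusion(nn.Module):
        """Lets each stream attend to the other, so the network can align
        semantically corresponding regions despite photometric changes."""
        def __init__(self, dim, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, a, b):
            # Flatten spatial maps to token sequences: (B, C, H, W) -> (B, HW, C).
            B, C, H, W = a.shape
            ta = a.flatten(2).transpose(1, 2)
            tb = b.flatten(2).transpose(1, 2)
            fa, _ = self.attn(ta, tb, tb)   # stream A queries stream B
            fb, _ = self.attn(tb, ta, ta)   # stream B queries stream A
            back = lambda t: t.transpose(1, 2).reshape(B, C, H, W)
            return back(fa), back(fb)

    class TwoStreamEncoder(nn.Module):
        """Shared-weight deformable encoder plus cross-attention, producing
        one L2-normalized global descriptor per image."""
        def __init__(self, dim=128):
            super().__init__()
            self.backbone = nn.Sequential(
                DeformBlock(3, 64), nn.MaxPool2d(2),
                DeformBlock(64, dim), nn.MaxPool2d(2),
            )
            self.fusion = CrossAttentionFusion(dim)
            self.pool = nn.AdaptiveAvgPool2d(1)

        def forward(self, uav_img, map_img):
            a, b = self.backbone(uav_img), self.backbone(map_img)
            a, b = self.fusion(a, b)
            da = nn.functional.normalize(self.pool(a).flatten(1), dim=1)
            db = nn.functional.normalize(self.pool(b).flatten(1), dim=1)
            return da, db

The metric-learning half of the hybrid loss can be sketched with in-batch hard-negative mining; the exact mining policy of the paper is not reproduced here:

    # Hedged sketch: triplet loss with in-batch hard negatives.
    def triplet_hard_loss(anchors, positives, margin=0.3):
        """anchors, positives: (B, D) L2-normalized descriptors of matching
        UAV/map pairs; other positives in the batch serve as negatives."""
        d = torch.cdist(anchors, positives)             # (B, B) pairwise distances
        pos = d.diag()                                  # distances of true matches
        # Mask out the true match, then take the closest wrong match: these
        # hard negatives are structurally similar but wrong locations.
        neg = (d + torch.eye(len(d), device=d.device) * 1e9).min(dim=1).values
        return torch.relu(pos - neg + margin).mean()

    # Usage example with random tensors standing in for image batches:
    # enc = TwoStreamEncoder()
    # da, db = enc(torch.randn(8, 3, 128, 128), torch.randn(8, 3, 128, 128))
    # loss = triplet_hard_loss(da, db)

Mining the hardest in-batch negative is one simple approximation of the intelligent hard-negative mining the abstract describes: the closest non-matching descriptor is precisely the structurally similar but semantically different candidate the model must learn to reject.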

Keywords: visual geolocalization, unmanned aerial vehicle (UAV), GNSS, convolutional neural networks (CNN), image matching, invariant features, deep learning, robust descriptors, image-based navigation
pp. 191—200
For citation:
Morozov S. A., Rokhlin D. E., Skorniakova E. A. A Cross-Attention and Deformable Convolution Approach for Generating Invariant Descriptors in Visual Geolocation Tasks, Programmnaya Ingeneria, 2026, vol. 17, no. 4, pp. 191—200. DOI: 10.17587/prin.17.191-200. (in Russian).
References:
  1. Wang X., Kealy A., Li W. et al. Toward Autonomous UAV Localization via Aerial Image Registration, Electronics, 2021, vol. 10, no. 4, article 435. DOI: 10.3390/electronics10040435.
  2. Schleiss M. Translating aerial images into street-map-like representations for visual self-localization of UAVs, The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2019, vol. XLII-2/W13, pp. 575—582. DOI: 10.5194/isprs-archives-XLII-2-W13-575-2019.
  3. Silva Filho P., Shiguemori E. H., Saotome O. UAV visual autolocalization based on automatic landmark recognition, ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2017, vol. IV-2/W3, pp. 89—94. DOI: 10.5194/isprs-annals-IV-2-W3-89-2017.
  4. He H., Chen M., Chen T., Li D. Matching of Remote Sensing Images with Complex Background Variations via Siamese Convolutional Neural Network, Remote Sensing, 2018, vol. 10, no. 2, article 355. DOI: 10.3390/rs10020355.
  5. Chibunichev A. G., Kobzev A. A. Investigation of the possibility of joint photogrammetric processing of multi-temporal aerial photographs, Izvestiya VUZov. Geodesy and Aerophotosurveying, 2021, vol. 65, no. 3, pp. 292—301 (in Russian).
  6. Chen J., Xie H., Zhang L. et al. SAR and Optical Image Registration Based on Deep Learning with Co-Attention Matching Module, Remote Sensing, 2023, vol. 15, no. 15, article 3879. DOI: 10.3390/rs15153879.
  7. Antipova N. V., Gvozdev O. G., Kozub V. A. et al. Restoration of Structural Information on Anthropogenic Objects from Single Aerospace Images, Journal of Computer and Systems Sciences International, 2023, vol. 62, no. 3, pp. 90—105. DOI: 10.31857/S0002338823030010.
  8. He S., Zhou R., Li S. et al. Disparity Estimation of High-Resolution Remote Sensing Images with Dual-Scale Matching Network, Remote Sensing, 2021, vol. 13, no. 24, article 5050. DOI: 10.3390/rs13245050.
  9. Harvey W., Rainwater C., Cothren J. Direct Aerial Visual Geolocalization Using Deep Neural Networks, Remote Sensing, 2021, vol. 13, no. 19, article 4017. DOI: 10.3390/rs13194017.
  10. Isola P., Zhu J. Y., Zhou T., Efros A. A. Image-to-Image Translation with Conditional Adversarial Networks, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5967—5976. DOI: 10.1109/CVPR.2017.632.
  11. Chen P., Fu Y., Hu J. et al. An Adaptive Remote Sensing Image-Matching Network Based on Cross-Attention and Deformable Convolution, Electronics, 2023, vol. 12, no. 13, article 2889. DOI: 10.3390/electronics12132889.
  12. Chen Y., Jiang J. A Two-Stage Deep Learning Registration Method for Remote Sensing Images Based on Sub-Image Matching, Remote Sensing, 2021, vol. 13, no. 17, article 3443. DOI: 10.3390/rs13173443.