SPIDER: Spatial Image CorresponDence Estimator for Robust Calibration

Zhimin Shao1     Abhay Yadav1     Rama Chellappa1     Cheng Peng1
1Johns Hopkins University

Visualized in 3D: SPIDER jointly predicts pixel-wise warps and feature descriptors even across large viewpoint changes, unifying the appearance sensitivity of RoMa with the geometric consistency of Aerial-MASt3R in a single framework. This enables accurate camera calibration and pose estimation, matching or surpassing state-of-the-art performance on challenging benchmarks.

Abstract

Reliable image correspondences form the foundation of vision-based spatial perception, enabling recovery of 3D structure and camera poses. However, unconstrained feature matching across domains such as aerial, indoor, and outdoor scenes remains challenging due to large variations in appearance, scale, and viewpoint. Feature matching has conventionally been formulated as a 2D-to-2D problem; recent 3D foundation models instead provide spatial matching properties grounded in two-view geometry. While powerful, we observe that these spatially coherent matches often concentrate on dominant planar regions, e.g., walls or ground surfaces, and are less sensitive to fine-grained geometric details, particularly under large viewpoint changes. To better understand these trade-offs, we first perform linear-probe experiments to evaluate various vision foundation models for image matching. Building on these insights, we introduce SPIDER, a universal feature-matching framework that integrates a shared feature-extraction backbone with two specialized network heads that estimate both 2D-based and 3D-based correspondences from coarse to fine. Finally, we introduce an image-matching evaluation benchmark that focuses on unconstrained scenarios with large baselines. SPIDER significantly outperforms SoTA methods, demonstrating its strong ability as a universal image-matching method.

SPIDER

Given two input images $I^A$ and $I^B$, our method builds on 3D VFM features and ConvNet features to combine semantic alignment with geometric consistency. A dual-head architecture operates in a coarse-to-fine manner: (1) the descriptor head aggregates multi-scale features through attention-based Fusion Gates to produce geometry-aware descriptors and confidence maps; (2) the warp head predicts dense correspondence fields and confidence maps, progressively refined across multiple scales. Final correspondences are sampled from the predicted warp and from fast nearest-neighbor (fastNN) search over the descriptors.
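As a rough illustration of how such a dual-head design can be wired up, consider the minimal PyTorch sketch below. Module names, channel widths, and the gating mechanism are our own illustrative assumptions, not the released implementation; the real model correlates features across images and refines the warp coarse-to-fine, which we only hint at here.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionGate(nn.Module):
    """Gated blend of 3D-VFM and ConvNet features; a hypothetical
    stand-in for the attention-based Fusion Gates described above."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * dim, dim, 1), nn.Sigmoid())

    def forward(self, f_vfm, f_conv):
        g = self.gate(torch.cat([f_vfm, f_conv], dim=1))
        return g * f_vfm + (1.0 - g) * f_conv

class DualHeads(nn.Module):
    """Sketch of the two heads at a single scale: geometry-aware
    descriptors plus a dense warp field, each with a confidence map."""
    def __init__(self, dim=256, desc_dim=128):
        super().__init__()
        self.desc_dim = desc_dim
        self.fuse = FusionGate(dim)
        self.desc_head = nn.Conv2d(dim, desc_dim + 1, 1)      # descriptors + confidence
        self.warp_head = nn.Conv2d(2 * dim, 3, 3, padding=1)  # 2-ch flow + confidence

    def forward(self, fA_vfm, fA_conv, fB_vfm, fB_conv):
        fA = self.fuse(fA_vfm, fA_conv)
        fB = self.fuse(fB_vfm, fB_conv)
        descA, confA = self.desc_head(fA).split([self.desc_dim, 1], dim=1)
        # The actual model correlates A against B and refines across scales;
        # concatenating resampled features here is purely for illustration.
        fB_on_A = F.interpolate(fB, size=fA.shape[-2:], mode="bilinear",
                                align_corners=False)
        warp, conf_w = self.warp_head(torch.cat([fA, fB_on_A], dim=1)).split([2, 1], dim=1)
        return F.normalize(descA, dim=1), confA.sigmoid(), warp, conf_w.sigmoid()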

Evaluation on In-domain Benchmarks and Zero-Shot Benchmark

We evaluate our method on three two-view geometry datasets covering outdoor, indoor, and aerial-to-ground scenarios, and further test generalization under a zero-shot setting in which all test datasets are unseen during training. RoMa, which is trained and validated only on MegaDepth, generalizes poorly on ZEB. Our method achieves strong zero-shot performance, with gains of +5.7, +6.3, and +2.3 on MUL, SCE, and ICL over the next-best method, and an overall +5.7 over Aerial-MASt3R, which SPIDER uses as its backbone.

SoTA comparison on existing image matching benchmarks, measured by AUC@5. G.R + A.M denotes the concatenation of GIM-RoMa and Aerial-MASt3R. The best and second-best results are highlighted.
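For reference, AUC@τ is the area under the pose-error recall curve up to a threshold of τ degrees, normalized by τ. A minimal NumPy sketch of the standard computation, assuming a list of per-pair pose errors in degrees (the benchmark's exact implementation may differ):

import numpy as np

def pose_auc(errors, thresholds=(5, 10, 20)):
    """Area under the pose-error recall curve at the given thresholds (degrees)."""
    errors = np.sort(np.asarray(errors, dtype=np.float64))
    recall = (np.arange(len(errors)) + 1) / len(errors)
    errors = np.concatenate(([0.0], errors))   # curve starts at (0, 0)
    recall = np.concatenate(([0.0], recall))
    aucs = []
    for t in thresholds:
        last = np.searchsorted(errors, t)      # pairs with error below t
        r = np.concatenate((recall[:last], [recall[last - 1]]))
        e = np.concatenate((errors[:last], [t]))
        # Trapezoidal integration, normalized so a perfect method scores 1.
        aucs.append(np.sum((e[1:] - e[:-1]) * (r[1:] + r[:-1])) / (2.0 * t))
    return aucs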

Unconstrained Evaluation

We collected images from four distinct scenes to evaluate matching robustness in unconstrained environments: one synthetic city scene with multiple symmetric buildings (Urban), one real scene from the ULTRRA Challenge covering a cluster of warehouses (Warehouses), and two real scenes of a single campus building (Campus 1 and Campus 2). All scenes contain complex building structures; only Urban has perfect ground-truth poses, while the real scenes are calibrated with COLMAP constrained by RTK-grade GPS.

Visualization of camera positions for four multi-elevation scenes collected in unconstrained scenarios.

We evaluate unconstrained matching using camera pose error AUC@20. SPIDER achieves the highest average performance, while VGGT and MASt3R fail on this aerial-to-ground benchmark. RoMa again demonstrates impressive accuracy on Campus 1 and 2 aerial-to-ground matching despite never being trained on Aerial-MegaDepth. However, on Urban's ground-to-ground pairs, MASt3R is much better than RoMa, demonstrating the robustness of geometry-based matching to visual ambiguities. Even in aerial-to-ground settings, RoMa fails on the multi-building scenes Urban and Warehouses due to visual ambiguities between similar-looking buildings. VGGT exhibits severe out-of-domain failures, likely because its feedforward camera pose prediction is domain-specific and less generalizable than COLMAP.

SoTA comparison on two-view geometry. For the Urban scene, both Aerial-Ground (A+G) and Ground-Ground (G+G) pairs are evaluated.
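The per-pair pose error underlying these AUC numbers is conventionally the maximum of the rotation error and the translation-direction error recovered from the matches. A hedged OpenCV sketch of that two-view evaluation step (function and variable names are ours; the benchmark's exact RANSAC settings may differ):

import cv2
import numpy as np

def two_view_pose_error(pts1, pts2, K, R_gt, t_gt):
    """Estimate relative pose from matched points and compare to ground truth.

    pts1, pts2: (N, 2) float32 pixel coordinates of matches; K: (3, 3) intrinsics.
    Returns max(rotation error, translation-direction error) in degrees."""
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    # Rotation error: angle of the residual rotation R_gt^T R.
    cos_r = np.clip((np.trace(R_gt.T @ R) - 1.0) / 2.0, -1.0, 1.0)
    err_R = np.degrees(np.arccos(cos_r))
    # Translation error: angle between unit directions (scale is unobservable).
    t, t_gt = t.ravel(), np.asarray(t_gt, dtype=np.float64).ravel()
    cos_t = abs(t @ t_gt) / (np.linalg.norm(t) * np.linalg.norm(t_gt))
    err_t = np.degrees(np.arccos(np.clip(cos_t, 0.0, 1.0)))
    return max(err_R, err_t)

Feeding these per-pair errors into the pose_auc sketch above with a threshold of 20 yields the AUC@20 numbers reported in the table.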

Visualization

Visual comparison under unconstrained settings. Image-pattern-driven methods, e.g., RoMa, find diverse matches across many planes; however, matches between the two visually similar sides of a building may be false. Geometry-driven methods are better at matching planes, but can degenerate into a homography when a single confident plane dominates, e.g., Aerial-MASt3R matching the wrong signs in Urban with high confidence. SPIDER combines both approaches and produces diverse, accurate matches.



Acknowledgements

This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. 140D0423C0076. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

BibTeX


    @article{shao2025spider,
        title={SPIDER: Spatial Image CorresponDence Estimator for Robust Calibration},
        author={Shao, Zhimin and Yadav, Abhay and Chellappa, Rama and Peng, Cheng},
        journal={arXiv preprint arXiv:2511.17750},
        year={2025}
    }