Selfi: Self Improving Reconstruction Engine
via 3D Geometric Feature Alignment


Cornell University · Google · UC Berkeley

Abstract

Novel View Synthesis (NVS) has traditionally relied on models with explicit 3D inductive biases combined with camera parameters obtained beforehand via Structure-from-Motion (SfM). Recent vision foundation models like VGGT take an orthogonal approach: 3D knowledge is gained implicitly through training data and loss objectives, enabling feed-forward prediction of both camera parameters and 3D representations directly from a set of uncalibrated images. While flexible, VGGT features lack explicit multi-view geometric consistency, and we find that improving such 3D feature consistency benefits both NVS and pose estimation. We introduce Selfi, a self-improving 3D reconstruction pipeline that transforms a VGGT backbone into a high-fidelity 3D reconstruction engine via feature alignment, leveraging the backbone's own outputs as pseudo-ground-truth. Specifically, we train a lightweight feature adapter using a reprojection-based consistency loss, which distills VGGT outputs into a new geometrically aligned feature space that captures spatial proximity in 3D. This enables state-of-the-art performance in both NVS and camera pose estimation, demonstrating the benefits of feature alignment for downstream 3D reasoning.


Method

Starting from a pretrained VGGT backbone, we treat its predicted depth and camera parameters as pseudo-ground-truth to align features produced by a DPT adapter on top of the VGGT image tokens. We sample query points in a source view and reproject them into a target view using the predicted depth and camera parameters. Our loss encourages the features at these two corresponding locations in the source and target frames to be similar.
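
As a concrete illustration, below is a minimal PyTorch sketch of how such a reprojection-based consistency loss could be set up, assuming per-view adapter features plus VGGT-predicted depth, intrinsics, and relative pose as inputs. The function name, tensor layouts, and the cosine-similarity objective are illustrative assumptions, not the released Selfi code.

import torch
import torch.nn.functional as F

def reprojection_consistency_loss(feat_src, feat_tgt, depth_src,
                                  K_src, K_tgt, T_src2tgt, num_queries=1024):
    # feat_*: (C, H, W) adapter features; depth_src: (H, W) predicted depth (pseudo-GT);
    # K_*: (3, 3) intrinsics; T_src2tgt: (4, 4) relative pose (pseudo-GT).
    C, H, W = feat_src.shape

    # 1. Sample random query pixels in the source view.
    u = torch.randint(0, W, (num_queries,))
    v = torch.randint(0, H, (num_queries,))
    z = depth_src[v, u]                                        # (N,)
    uf, vf = u.float(), v.float()

    # 2. Unproject to 3D using source intrinsics and predicted depth.
    pix = torch.stack([uf, vf, torch.ones_like(uf)], dim=0)    # (3, N)
    pts_src = (torch.linalg.inv(K_src) @ pix) * z              # (3, N), source camera
    pts_h = torch.cat([pts_src, torch.ones(1, num_queries)], dim=0)

    # 3. Transform into the target camera and project with target intrinsics.
    pts_tgt = (T_src2tgt @ pts_h)[:3]                          # (3, N)
    proj = K_tgt @ pts_tgt
    uv_tgt = proj[:2] / proj[2:].clamp(min=1e-6)               # (2, N)

    # 4. Bilinearly sample features at the corresponding locations
    #    (grid_sample expects coordinates normalized to [-1, 1];
    #    masking of out-of-bounds reprojections is omitted for brevity).
    def sample(feat, uv):
        grid = torch.stack([uv[0] / (W - 1) * 2 - 1,
                            uv[1] / (H - 1) * 2 - 1], dim=-1).view(1, 1, -1, 2)
        return F.grid_sample(feat[None], grid, align_corners=True)[0, :, 0]  # (C, N)

    f_src = sample(feat_src, torch.stack([uf, vf]))
    f_tgt = sample(feat_tgt, uv_tgt)

    # 5. Pull matched features together (cosine distance as a stand-in
    #    for the paper's actual objective).
    return (1.0 - F.cosine_similarity(f_src, f_tgt, dim=0)).mean()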


Qualitative Comparisons

We compare our method with AnySplat and WorldMirror on both DL3DV and RealEstate10K.


Feature Matching Visualization

We sample points in the first frame and use our aligned features to find correspondences in the subsequent frames.
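
A minimal sketch of how such correspondences could be computed from the aligned features is shown below; the function name and tensor shapes are assumptions, and it simply takes the cosine-similarity nearest neighbor in the target feature map for each query point.

import torch
import torch.nn.functional as F

def match_features(feat_ref, feat_tgt, query_uv):
    # feat_*: (C, H, W) aligned feature maps; query_uv: (N, 2) integer (u, v)
    # pixel coordinates in the reference frame. Returns (N, 2) matches in the target.
    C, H, W = feat_ref.shape
    # Gather and L2-normalize the query descriptors from the reference frame.
    q = F.normalize(feat_ref[:, query_uv[:, 1], query_uv[:, 0]], dim=0)   # (C, N)
    # Normalize all target descriptors and compute dense cosine similarity.
    t = F.normalize(feat_tgt.reshape(C, -1), dim=0)                       # (C, H*W)
    sim = q.t() @ t                                                       # (N, H*W)
    idx = sim.argmax(dim=1)                                               # best match per query
    return torch.stack([idx % W, idx // W], dim=1)                        # back to (u, v)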


Pseudo-GT Generation

We compare two different methods for generating pseudo-GT for geometric feature alignment. Please see the supplementary material for detailed explanations.


(a) KNN Pseudo-GT Generation

(b) Projection-Based Pseudo-GT Generation
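
Since the details are deferred to the supplementary material, the sketch below only illustrates one plausible reading of the two strategies: pairing points by nearest-neighbor search in the predicted 3D point space versus reprojecting source pixels into the target view with predicted depth and cameras, as in the Method section. All names are hypothetical; this is an interpretation of the panel titles, not the authors' procedure.

import torch

def knn_pseudo_gt(pts_src, pts_tgt, max_dist=0.05):
    # pts_*: (N, 3) / (M, 3) predicted world-space points for two views.
    # Pairs each source point with its nearest target point in 3D.
    d = torch.cdist(pts_src, pts_tgt)        # (N, M) pairwise distances
    nn_dist, nn_idx = d.min(dim=1)           # nearest target point per source point
    valid = nn_dist < max_dist               # reject matches that are too far apart
    return nn_idx, valid

def projection_pseudo_gt(uv_src, depth_src, K_src, K_tgt, T_src2tgt):
    # Reprojects source pixels into the target view using predicted depth and
    # cameras, the same construction used by the alignment loss above.
    u, v = uv_src[:, 0].float(), uv_src[:, 1].float()
    z = depth_src[uv_src[:, 1], uv_src[:, 0]]
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0)        # (3, N)
    pts = (torch.linalg.inv(K_src) @ pix) * z                   # source camera space
    pts = (T_src2tgt @ torch.cat([pts, torch.ones(1, pts.shape[1])], dim=0))[:3]
    proj = K_tgt @ pts
    return (proj[:2] / proj[2:].clamp(min=1e-6)).t()            # (N, 2) target pixels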

Acknowledgments

We would like to thank Xichen Pan and Noah Snavely for insightful discussions during the project, and Clément Godard, Michael Broxton, and Maggie Oh for help with compute support. Additionally, we thank Stephen Lombardi, Ryan Overbeck, and Jason Lawrence for helpful suggestions and feedback.


BibTeX

@article{deng2025selfi,
    title={Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment},
    author={Deng, Youming and Peng, Songyou and Zhang, Junyi and Heal, Kathryn and Sun, Tiancheng and Flynn, John and Marschner, Steve and Chai, Lucy},
    journal={arXiv preprint arXiv:2512.08930},
    year={2025}
}