Abstract
Depth Anything 3 (DA3) uses a plain transformer for geometry prediction from visual inputs, achieving state-of-the-art results in camera pose estimation, any-view geometry, visual rendering, and monocular depth estimation.
We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing prior SOTA VGGT by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.
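To make the two insights above concrete, here is a minimal, hypothetical PyTorch sketch of the "plain transformer backbone + single depth-ray prediction target" combination. It is not the authors' implementation: it processes a single view only, uses a randomly initialized encoder instead of a pretrained DINO model, and the class name `PlainTransformerDepthRay`, the 4-channel head, and all layer sizes are assumptions made purely for illustration.

```python
# Illustrative sketch only (not the DA3 code): a plain transformer encoder
# followed by one head that predicts per-pixel depth (1 channel) and a unit
# ray direction (3 channels) -- a single "depth-ray" target.
import torch
import torch.nn as nn


class PlainTransformerDepthRay(nn.Module):
    """Patchify -> plain transformer encoder -> per-pixel depth + ray head."""

    def __init__(self, img_size=224, patch=14, dim=384, depth=6, heads=6):
        super().__init__()
        grid = img_size // patch
        # Patch embedding (stand-in for a vanilla ViT/DINO tokenizer).
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, grid * grid, dim))
        # Plain transformer: no cross-attention, no task-specific blocks.
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Single prediction head: 4 channels = 1 depth + 3 ray direction.
        self.head = nn.Sequential(
            nn.Conv2d(dim, dim // 2, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim // 2, 4, 3, padding=1))

    def forward(self, images):                            # (B, 3, H, W)
        b, _, h, w = images.shape
        tokens = self.embed(images)                        # (B, dim, H/p, W/p)
        gh, gw = tokens.shape[-2:]
        tokens = tokens.flatten(2).transpose(1, 2)         # (B, N, dim)
        tokens = self.encoder(tokens + self.pos[:, : gh * gw])
        feat = tokens.transpose(1, 2).reshape(b, -1, gh, gw)
        out = nn.functional.interpolate(
            self.head(feat), size=(h, w), mode="bilinear", align_corners=False)
        depth = out[:, :1].exp()                           # positive depth
        rays = nn.functional.normalize(out[:, 1:], dim=1)  # unit ray directions
        return depth, rays


if __name__ == "__main__":
    model = PlainTransformerDepthRay()
    depth, rays = model(torch.randn(2, 3, 224, 224))
    print(depth.shape, rays.shape)  # (2, 1, 224, 224) and (2, 3, 224, 224)
```

The point of the sketch is the modeling minimalism the abstract argues for: one generic encoder and one geometric target, rather than a specialized multi-branch, multi-task architecture. Handling an arbitrary number of posed or unposed views, the teacher-student training, and the actual pretrained backbone are all beyond this toy example; see the official repository linked on this page for the real model.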
Community
Depth Anything 3 (DA3) uses a simple Transformer and a single depth-ray target to handle multi-view geometry without needing camera poses.
I want to write code based on this paper.
Hi @naveediqbal15 - You can find their model on the right side of this page, and the GitHub link is under the abstract. Feel free to contribute!
arXiv explained breakdown of this paper: https://arxivexplained.com/papers/depth-anything-3-recovering-the-visual-space-from-any-views
Models citing this paper: 1
Datasets citing this paper: 0
No datasets link this paper.