ECCV 2026

CasaMaestro: Multi-View Panoramas for House-Scale 3D Reconstruction

Sparse indoor panoramas in; metric depth, camera poses, and full-house point clouds out.

Yuzhou Ji* Xiaotian Yang* Zhipeng Zhang

School of Artificial Intelligence, Shanghai Jiao Tong University

*Equal contribution. Corresponding author.

Code coming soon.
CasaMaestro teaser comparing dense pinhole capture with sparse panoramic reconstruction
20-50 sparse panoramas
0.56 s demonstrated reconstruction time
0.927 AUC@30 on Realsee-Syn
0.078 overall Realsee AbsRel

Abstract

Fast metric reconstruction for real homes.

The rise of home-deployed embodied AI systems creates a growing need for fast, metric 3D reconstruction of residential spaces. Pinhole-camera pipelines struggle with large indoor residences because their narrow fields of view require dense capture and long alignment chains. CasaMaestro addresses this by taking only 20 to 50 sparse multi-view indoor panoramas and directly predicting metric depth and camera poses, enabling immediate point-cloud reconstruction of an entire house with full coverage.

CasaMaestro combines a multi-view DINO backbone, a panoramic camera pose decoder, and ERP (equirectangular projection) data augmentation. Experiments show robust, high-quality results on both real-world and synthetic scenes, with 0.903 AUC@30 on Realsee-Real and 0.927 on Realsee-Syn, making sparse panoramic capture a practical foundation for house-scale indoor assets and closed-loop simulation.

Motivation

Panoramas reduce capture density without sacrificing coverage.

Pinhole models either lose context under sparse capture or accumulate drift across dense sequences. CasaMaestro uses sparse panoramic viewpoints to keep the full house visible while avoiding long incremental alignment chains.

Comparison of pinhole sparse capture, dense scanning, and CasaMaestro panoramic reconstruction performance

Method

A compact feedforward pipeline.

CasaMaestro pipeline with ERP augmentation, multi-view DINO, DPT depth head, and pose head
01. Multi-view panorama tokens

Each panorama is treated as a view and encoded by a DINOv2-based backbone with local and cross-view attention.
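
The combination of local and cross-view attention can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the attention is single-head with identity projections (no learned weights), and the ordering of local and cross-view steps is one plausible arrangement rather than the published architecture.

```python
import math

def attention(tokens):
    """Single-head self-attention with identity projections (illustrative only)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product scores of this query against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) / total
                    for i in range(dim)])
    return out

def local_then_cross_view(views):
    """views: list of panoramas, each a list of token vectors."""
    # Local attention: tokens only attend to tokens from the same panorama.
    local = [attention(view) for view in views]
    # Cross-view attention: flatten all panoramas into one sequence and attend.
    flat = [tok for view in local for tok in view]
    mixed = attention(flat)
    # Split the mixed sequence back into per-view token lists.
    out, start = [], 0
    for view in local:
        out.append(mixed[start:start + len(view)])
        start += len(view)
    return out
```

The local step keeps per-panorama structure intact, while the cross-view step lets every token see evidence from all other viewpoints, which is what allows pose and depth to be reasoned about jointly.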

02. Panoramic pose decoding

A lightweight cross-view pose head aligns condensed view features to predict translation and quaternion rotation.
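
A predicted quaternion is typically normalized and converted to a rotation matrix before it is used as a camera rotation. The sketch below shows that conversion under an assumed (w, x, y, z) component order; the function names are hypothetical and the page does not specify the convention the pose head uses.

```python
import math

def normalize_quaternion(q):
    """Scale a raw 4-vector to unit length so it represents a valid rotation."""
    n = math.sqrt(sum(c * c for c in q))
    if n == 0.0:
        raise ValueError("a zero quaternion does not define a rotation")
    return tuple(c / n for c in q)

def quaternion_to_matrix(q):
    """Convert a quaternion in (w, x, y, z) order to a 3x3 rotation matrix."""
    w, x, y, z = normalize_quaternion(q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]
```

Normalizing first means the network output does not need to be exactly unit length for the resulting matrix to be a proper rotation.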

03. Metric depth back-projection

DPT-style depth prediction and camera extrinsics are fused into metrically consistent house-scale point clouds.
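
Back-projection of an equirectangular depth map can be sketched as follows. The conventions are assumptions, not taken from the paper: depth is treated as radial distance along each viewing ray, the camera frame is y-up with longitude measured from the +z axis, and the pose is camera-to-world. The function names are hypothetical.

```python
import math

def erp_ray(u, v, width, height):
    """Unit viewing ray through the center of ERP pixel (u, v)."""
    lon = (u + 0.5) / width * 2.0 * math.pi - math.pi    # longitude in [-pi, pi)
    lat = math.pi / 2.0 - (v + 0.5) / height * math.pi   # latitude in (-pi/2, pi/2)
    return (math.cos(lat) * math.sin(lon), math.sin(lat), math.cos(lat) * math.cos(lon))

def back_project(depth, rotation, translation):
    """Lift an ERP depth map (list of rows) to world-space 3D points."""
    height, width = len(depth), len(depth[0])
    points = []
    for v in range(height):
        for u in range(width):
            ray = erp_ray(u, v, width, height)
            cam = [depth[v][u] * c for c in ray]  # point in the camera frame
            # Camera-to-world transform: p_world = R @ p_cam + t.
            world = tuple(
                sum(rotation[i][j] * cam[j] for j in range(3)) + translation[i]
                for i in range(3)
            )
            points.append(world)
    return points
```

Because the depth is metric and every panorama shares the same world frame, concatenating the outputs of `back_project` over all views yields a single point cloud without a separate alignment step.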

04. ERP augmentation

Equirectangular remapping generates additional pose-view pairs while preserving scene geometry.
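
The simplest instance of this idea is a rotation about the vertical axis, which in an equirectangular image is a circular shift of columns. The sketch below covers only that special case and is not the paper's augmentation code; the helper names, the y-up frame, and the camera-to-world pose convention are assumptions. General rotations would require resampling the image.

```python
import math

def roll_panorama(image, shift):
    """Circularly shift ERP columns: new[v][u] = old[v][(u + shift) % width]."""
    width = len(image[0])
    return [[row[(u + shift) % width] for u in range(width)] for row in image]

def yaw_matrix(angle):
    """Rotation about the vertical (y) axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def augment_yaw(image, rotation, translation, shift):
    """Return a new (image, pose) pair that depicts the same scene geometry."""
    width = len(image[0])
    angle = 2.0 * math.pi * shift / width
    new_image = roll_panorama(image, shift)
    # A ray in the new camera frame maps to the old frame via the yaw rotation,
    # so the camera-to-world rotation is right-multiplied by it.
    new_rotation = matmul3(rotation, yaw_matrix(angle))
    # The camera center does not move, so the translation is unchanged.
    return new_image, new_rotation, translation
```

The scene itself is untouched: only the camera orientation and the pixel layout change together, so depth values stay valid and the new pair provides extra pose supervision for free.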

Supplementary Figure 1

Axis-wise max loss focuses supervision on the worst pose axis.

The supplementary comparison shows that uniform mean aggregation can hide large errors on one axis, while axis-wise max loss keeps optimization focused on the weakest pose component.
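
A small numeric illustration of the difference, assuming a per-axis absolute error on a three-component pose vector (the exact per-axis error used in the paper is not stated on this page, and the function names are hypothetical):

```python
def mean_pose_loss(pred, target):
    """Uniform mean of per-axis absolute errors."""
    errors = [abs(p - t) for p, t in zip(pred, target)]
    return sum(errors) / len(errors)

def axiswise_max_pose_loss(pred, target):
    """Largest per-axis absolute error, so the worst axis drives the loss."""
    return max(abs(p - t) for p, t in zip(pred, target))

# Two axes are nearly perfect, one axis is badly wrong.
target = (0.0, 0.0, 0.0)
pred = (0.01, 0.01, 0.9)
mean_value = mean_pose_loss(pred, target)         # about 0.307: the bad axis is diluted
max_value = axiswise_max_pose_loss(pred, target)  # 0.9: the bad axis dominates
```

With the mean, two accurate axes pull the loss down to roughly a third of the worst error; with the max, the gradient flows only through the axis that is currently failing.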

Supplementary Figure 1 comparing uniform mean pose loss with axis-wise max pose loss
Comparison of pose loss types, from the supplementary material.

Interactive Point Clouds

Rotate through sampled CasaMaestro reconstructions.

The viewer loads lightweight RGB samples derived from the provided PLY scenes. Drag to orbit, scroll to zoom, and switch between houses below.


Results

Stable pose and depth across house-scale scenes.

Qualitative reconstruction comparison on Realsee-Syn
Qualitative reconstruction comparison on Realsee-Real

Pose on Realsee-Real

0.903

AUC@30, with a mean rotation error of 2.608 and a mean translation error of 2.132.

Pose on Realsee-Syn

0.927

AUC@30, outperforming prior methods under sparse panoramic capture.
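
For readers unfamiliar with the metric, the sketch below follows one common convention for pose AUC: each camera pair contributes a single angular error in degrees, accuracy is measured at integer thresholds from 1 to 30, and the AUC is the mean of those accuracies. Whether CasaMaestro's evaluation uses exactly this definition is an assumption, and `pose_auc` is a hypothetical name.

```python
def pose_auc(errors_deg, max_threshold=30):
    """Mean accuracy over integer thresholds 1..max_threshold (degrees)."""
    if not errors_deg:
        raise ValueError("at least one pose error is required")
    n = len(errors_deg)
    accuracies = []
    for threshold in range(1, max_threshold + 1):
        # Fraction of camera pairs whose error falls below this threshold.
        accuracies.append(sum(1 for e in errors_deg if e < threshold) / n)
    return sum(accuracies) / len(accuracies)
```

Under this convention a score close to 1 means that nearly all pairs have errors of only a few degrees, since large errors are penalized at every threshold they exceed.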

Depth on Realsee

0.078

Overall AbsRel, with 0.205 RMSE and 0.972 δ1 accuracy.
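
These three numbers follow the standard monocular-depth definitions, sketched here over flat lists of predicted and ground-truth depths. The function name and the choice to skip pixels with non-positive ground truth are assumptions for illustration.

```python
import math

def depth_metrics(pred, gt):
    """Return (AbsRel, RMSE, delta1) over pixels with valid ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    n = len(pairs)
    # AbsRel: mean absolute error relative to the true depth.
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    # RMSE: root mean squared error in the depth unit.
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # delta1: fraction of pixels whose ratio to ground truth is within 1.25.
    delta1 = sum(1 for p, g in pairs if p > 0 and max(p / g, g / p) < 1.25) / n
    return abs_rel, rmse, delta1
```

Lower is better for AbsRel and RMSE, while higher is better for delta1.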

Zero-shot depth

21.98%

Average decrease in AbsRel relative to the best previous results.

Citation

BibTeX

@inproceedings{ji2026casamaestro,
  title     = {CasaMaestro: Multi-View Panoramas for House-Scale 3D Reconstruction},
  author    = {Ji, Yuzhou and Yang, Xiaotian and Zhang, Zhipeng},
  booktitle = {European Conference on Computer Vision},
  year      = {2026}
}