ECCV 2026

CasaMaestro: Multi-View Panoramas for House-Scale 3D Reconstruction

Yuzhou Ji^* Xiaotian Yang^* Zhipeng Zhang^†

AutoLab, School of Artificial Intelligence, Shanghai Jiao Tong University

^*Equal contribution. ^†Corresponding author.

Looping navigation demo for reconstructed synthetic scene 03521: instant simulation construction for embodied systems from panoramas alone. *The point cloud is downsampled here for visibility.

Abstract

Fast metric reconstruction for real homes.

The rise of home-deployed embodied AI systems creates a growing need for fast, metric 3D reconstruction of residential spaces. Pinhole-camera pipelines struggle with large indoor residences because narrow fields of view require dense capture and long alignment chains. CasaMaestro addresses this by taking only twenty to fifty sparse multi-view indoor panoramas and directly predicting metric depth and camera poses, enabling immediate point-cloud reconstruction of an entire house with full coverage.

CasaMaestro combines a multi-view DINO backbone, a panoramic camera pose decoder, and ERP data augmentation. Experiments show robust high-quality results on both real-world and synthetic scenes, making sparse panoramic capture a practical foundation for house-scale indoor assets and closed-loop simulation.

CasaMaestro teaser comparing dense pinhole capture with sparse panoramic reconstruction

20-50 sparse panoramas

0.56 s demonstrated reconstruction time

0.927 AUC@30 on Realsee-Syn

0.078 overall Realsee AbsRel

Motivation

Panoramas reduce capture density without sacrificing coverage.

Pinhole models either lose context under sparse capture or accumulate drift across dense sequences. CasaMaestro uses sparse panoramic viewpoints to keep the full house visible while avoiding long incremental alignment chains.

Comparison of pinhole sparse capture, dense scanning, and CasaMaestro panoramic reconstruction performance

Method

A compact feedforward pipeline.

CasaMaestro pipeline with ERP augmentation, multi-view DINO, DPT depth head, and pose head

Multi-view panorama tokens

Each panorama is treated as a view and encoded by a DINOv2-based backbone with local and cross-view attention.

Panoramic pose decoding

A lightweight cross-view pose head aligns condensed view features to predict translation and quaternion rotation.
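As a rough illustration of how a predicted quaternion and translation could be turned into a usable camera pose, the sketch below normalizes the quaternion and assembles a homogeneous transform. The `(w, x, y, z)` ordering, the camera-to-world interpretation, and the helper names `quat_to_rotmat` and `pose_to_matrix` are assumptions made for this example; the page does not specify them.

```python
import math

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) into a 3x3 rotation matrix.

    The scalar-first ordering is an assumption for this sketch.
    """
    w, x, y, z = q
    norm = math.sqrt(w * w + x * x + y * y + z * z)
    if norm == 0.0:
        raise ValueError("quaternion must have non-zero norm")
    # Raw network outputs are rarely unit length, so normalize first.
    w, x, y, z = w / norm, x / norm, y / norm, z / norm
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def pose_to_matrix(q, t):
    """Assemble a 4x4 homogeneous transform from a quaternion and a translation."""
    R = quat_to_rotmat(q)
    return [
        R[0] + [t[0]],
        R[1] + [t[1]],
        R[2] + [t[2]],
        [0.0, 0.0, 0.0, 1.0],
    ]
```

Normalizing inside the conversion keeps the result a valid rotation even when the regressed quaternion drifts away from unit length.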

Metric depth back-projection

DPT-style depth prediction and camera extrinsics are fused into metrically consistent house-scale point clouds.
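To make the back-projection step concrete, here is a minimal sketch that lifts one equirectangular depth map into world-space points. It assumes that depth is the metric radial distance along each viewing ray, that the pose is camera-to-world, and that the panorama uses a +y-up frame with +z at the image center; none of these conventions are stated on this page, and the function names are hypothetical. Plain Python loops are used for readability, whereas a real pipeline would vectorize this.

```python
import math

def erp_pixel_to_ray(u, v, width, height):
    """Unit viewing ray for ERP pixel (u, v), sampled at the pixel center.

    Assumed convention: +y up, +z at the image center, longitude grows rightwards.
    """
    lon = (u + 0.5) / width * 2.0 * math.pi - math.pi   # in [-pi, pi)
    lat = math.pi / 2.0 - (v + 0.5) / height * math.pi  # top row near +pi/2
    return (
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
        math.cos(lat) * math.cos(lon),
    )

def backproject_panorama(depth, R, t):
    """Lift an ERP depth map (rows of radial distances) to world-space points.

    R is a 3x3 camera-to-world rotation and t a translation (both assumed).
    Non-positive depths are treated as invalid and skipped.
    """
    height, width = len(depth), len(depth[0])
    points = []
    for v in range(height):
        for u in range(width):
            d = depth[v][u]
            if d <= 0.0:
                continue
            ray = erp_pixel_to_ray(u, v, width, height)
            cam = [d * c for c in ray]  # point in the camera frame
            world = tuple(
                sum(R[i][j] * cam[j] for j in range(3)) + t[i] for i in range(3)
            )
            points.append(world)
    return points
```

Concatenating the output of `backproject_panorama` over all views yields a single point cloud in the shared world frame, provided every view's pose is expressed in that frame.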

ERP augmentation

Equirectangular remapping generates additional pose-view pairs while preserving scene geometry.
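The simplest instance of such a remapping is a pure yaw rotation, which for an equirectangular image reduces to a circular column shift. The sketch below shows that special case only: it shifts the image and composes the matching rotation into the pose so that every pixel still maps to the same world direction. Which rotations the actual augmentation samples is not stated here, and the frame convention (+y up, +z at the image center, camera-to-world rotation) and all function names are assumptions of this example.

```python
import math

def erp_ray(u, v, width, height):
    """Unit ray for ERP pixel (u, v); +y up, +z at the image center (assumed)."""
    lon = (u + 0.5) / width * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - (v + 0.5) / height * math.pi
    return (
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
        math.cos(lat) * math.cos(lon),
    )

def yaw_matrix(angle):
    """Rotation about the +y (up) axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def matmul3(A, B):
    """Product of two 3x3 matrices stored as nested lists."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]

def yaw_augment(image, R, shift):
    """Shift an ERP image left by `shift` columns and update its rotation.

    New pixel u shows what old pixel (u + shift) showed, so the new camera
    frame is the old one rotated by delta about the up axis.
    """
    width = len(image[0])
    k = shift % width
    new_image = [row[k:] + row[:k] for row in image]
    delta = 2.0 * math.pi * k / width
    new_R = matmul3(R, yaw_matrix(delta))  # translation is unchanged
    return new_image, new_R
```

Because the camera center does not move, the translation stays the same, and a ground-truth depth map for the view would need the same column shift applied. Rotations that tilt the up axis need a full per-pixel resampling rather than a shift.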

Interactive Meshes

Check reconstructed house structure.

Downsampled reconstruction points are converted into compact surfel meshes for smoother browser playback.

Results

Stable pose and depth across house-scale scenes.

Qualitative reconstruction comparison on Realsee-Syn

Qualitative reconstruction comparison on Realsee-Real

Pose on Realsee-Real

0.903

AUC@30, with a mean rotation error of 2.608 and a mean translation error of 2.132.

Pose on Realsee-Syn

0.927

AUC@30, outperforming prior methods under sparse panoramic capture.
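This page does not define AUC@30. One commonly used definition takes, for each evaluated image pair, the larger of the rotation and translation angular errors in degrees, measures the fraction of pairs below each integer threshold up to 30, and averages those fractions. The sketch below follows that assumption; the helper name `pose_auc` is hypothetical and the paper's exact protocol may differ.

```python
def pose_auc(rot_err_deg, trans_err_deg, max_threshold=30):
    """Area under the accuracy-vs-threshold curve for relative pose errors.

    rot_err_deg / trans_err_deg: per-pair angular errors in degrees.
    A pair counts as correct at threshold T when both errors are below T.
    """
    if len(rot_err_deg) != len(trans_err_deg) or not rot_err_deg:
        raise ValueError("need two non-empty error lists of equal length")
    worst = [max(r, t) for r, t in zip(rot_err_deg, trans_err_deg)]
    n = len(worst)
    # Accuracy at each integer threshold 1, 2, ..., max_threshold degrees.
    accuracies = [
        sum(1 for e in worst if e < thr) / n
        for thr in range(1, max_threshold + 1)
    ]
    return sum(accuracies) / max_threshold
```

Under this definition a score near 1 means almost every pair is accurate even at the tightest thresholds, so the metric rewards small errors rather than merely staying under 30 degrees.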

Depth on Realsee

0.078

Overall AbsRel, with an RMSE of 0.205 and a δ1 accuracy of 0.972.
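For readers unfamiliar with these depth metrics, the sketch below computes them under their standard definitions: AbsRel is the mean of |pred - gt| / gt, RMSE is the root mean squared error, and delta1 is the fraction of pixels whose ratio max(pred / gt, gt / pred) falls below 1.25. The page does not spell out its evaluation protocol (valid-pixel masking, depth range, alignment), so the function name and the masking rule here are assumptions.

```python
import math

def depth_metrics(pred, gt, delta_threshold=1.25):
    """Compute AbsRel, RMSE and delta1 over flat lists of depth values.

    Pixels with non-positive ground truth or prediction are skipped
    (an assumed masking rule for this sketch).
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0.0 and g > 0.0]
    if not pairs:
        raise ValueError("no valid depth pairs to evaluate")
    n = len(pairs)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    delta1 = sum(1 for p, g in pairs if max(p / g, g / p) < delta_threshold) / n
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}
```

Lower is better for AbsRel and RMSE, while higher is better for delta1.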

Zero-shot depth

21.98%

Average AbsRel decrease compared with the best previous results.

Supplementary Figure 1

Axis-wise max loss focuses supervision on the worst pose axis.

The supplementary comparison shows that uniform mean aggregation can hide large errors on one axis, while axis-wise max loss keeps optimization focused on the weakest pose component.

Supplementary Figure 1: comparison of uniform mean pose loss with axis-wise max pose loss.
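The difference between the two aggregation schemes can be shown numerically. The sketch below assumes a per-axis mean absolute error over a batch of 3D translation vectors, then aggregates across axes either by mean or by max. The exact loss used in the paper (error norm, which pose components count as axes, any weighting) is not given on this page, and the function names are hypothetical.

```python
def per_axis_errors(pred, target):
    """Mean absolute error per axis over a batch of equal-length vectors."""
    if len(pred) != len(target) or not pred:
        raise ValueError("need two non-empty batches of equal size")
    n, dims = len(pred), len(pred[0])
    return [
        sum(abs(p[k] - t[k]) for p, t in zip(pred, target)) / n
        for k in range(dims)
    ]

def uniform_mean_loss(pred, target):
    """Average the per-axis errors; a single bad axis gets diluted."""
    errs = per_axis_errors(pred, target)
    return sum(errs) / len(errs)

def axiswise_max_loss(pred, target):
    """Keep only the worst axis, so it cannot be hidden by the others."""
    return max(per_axis_errors(pred, target))
```

For a prediction that is exact on two axes but off by 3 on the third, `uniform_mean_loss` reports 1.0 while `axiswise_max_loss` reports 3.0. In a gradient-based framework the max would also route the gradient to that worst axis alone.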

Supplementary Figure 2

Failure case on Zillow Indoor Dataset (ZInD).

A complete lack of connecting viewpoints between adjacent rooms can lead to poor predictions. This suggests that the remaining failures may require substantially more data or stronger spatial layout awareness to resolve.

Supplementary Figure 2: failure case on the Zillow Indoor Dataset (ZInD), with a top-down pose overlay.

Citation

BibTeX

@inproceedings{ji2026casamaestro,
  title     = {CasaMaestro: Multi-View Panoramas for House-Scale 3D Reconstruction},
  author    = {Ji, Yuzhou and Yang, Xiaotian and Zhang, Zhipeng},
  booktitle = {European Conference on Computer Vision},
  year      = {2026}
}