Long-LRM: Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats

High-quality large-scene 3D Gaussian reconstruction in 1.3 seconds.

Ziwen Chen¹, Hao Tan², Kai Zhang², Sai Bi², Fujun Luan², Yicong Hong², Fuxin Li¹, Zexiang Xu²

¹ Oregon State University
² Adobe Research

Abstract

We propose Long-LRM, a generalizable 3D Gaussian reconstruction model capable of reconstructing a large scene from a long sequence of input images. Specifically, our model can process 32 source images at 960×540 resolution in only 1.3 seconds on a single A100 80G GPU. Our architecture mixes recent Mamba2 blocks with classical transformer blocks, which allows many more tokens to be processed than in prior work, and adds efficient token merging and Gaussian pruning steps that balance quality and efficiency. Unlike previous feed-forward models, which are limited to 1 to 4 input images and can reconstruct only a small portion of a large scene, Long-LRM reconstructs the entire scene in a single feed-forward step. On large-scale scene datasets such as DL3DV-140 and Tanks and Temples, our method achieves performance comparable to optimization-based approaches while being two orders of magnitude more efficient.

Long-LRM takes up to 32 input images together with their Plücker ray embeddings as model input. These are patchified, linearly transformed, and concatenated into token sequences. The tokens are processed by an optional token merging module, followed by a sequence of Mamba2 blocks (×7) and a transformer block (×1). This processing structure is repeated three times (×3) to handle the long input sequence and extract features effectively. After processing, the tokens are unpatchified, and Gaussian pruning is applied to produce the final 3D GS representation. The bottom section of the figure shows the resulting novel view synthesis and wide-coverage Gaussian reconstruction, illustrating that Long-LRM can handle extensive view coverage and produce high-quality, photorealistic reconstructions.
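Two pieces of the pipeline described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation. The function names `plucker_ray` and `block_schedule` are hypothetical, the (moment, direction) ordering of the 6-D Plücker vector is an assumed convention, and the schedule assumes that the repetition applies to the group of seven Mamba2 blocks plus one transformer block.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def cross(a: Vec3, b: Vec3) -> Vec3:
    """Standard 3D cross product a x b."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def plucker_ray(origin: Vec3, direction: Vec3) -> Tuple[float, ...]:
    """Return a 6-D Pluecker embedding (o x d, d) for one camera ray.

    Assumption: the direction is normalized first and the moment comes
    before the direction; the source does not specify either choice.
    """
    norm = math.sqrt(sum(c * c for c in direction))
    if norm == 0.0:
        raise ValueError("ray direction must be non-zero")
    d = (direction[0] / norm, direction[1] / norm, direction[2] / norm)
    moment = cross(origin, d)
    return moment + d


def block_schedule(
    mamba_per_group: int = 7,
    transformer_per_group: int = 1,
    groups: int = 3,
) -> List[str]:
    """List block types in order: (7 Mamba2 + 1 transformer) repeated 3 times.

    Assumption: only this group is repeated; where the optional token
    merging module sits relative to the repeats is not modeled here.
    """
    group = ["mamba2"] * mamba_per_group + ["transformer"] * transformer_per_group
    return group * groups


if __name__ == "__main__":
    layout = block_schedule()
    print(len(layout), "blocks,", layout.count("transformer"), "transformer blocks")
    print(plucker_ray((1.0, 0.0, 0.0), (0.0, 0.0, 2.0)))
```

Under these assumptions the default schedule contains 24 blocks, of which 3 are transformer blocks. In a full model each pixel would receive one such 6-D ray vector, computed from the camera pose and intrinsics, before patchification.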

Video rendered from our wide-coverage Gaussians, reconstructed with 32 images in 1.3 seconds.

BibTeX

@article{ziwen2024llrm,
  title={Long-LRM: Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats},
  author={Ziwen, Chen and Tan, Hao and Zhang, Kai and Bi, Sai and Luan, Fujun and Hong, Yicong and Fuxin, Li and Xu, Zexiang},
  journal={arXiv preprint 2410.12781},
  year={2024}
}