Long-LRM++: Preserving Fine Details in Feed-Forward Wide-Coverage Reconstruction

1Adobe Research   2Tripo AI   3Hillbot   4Oregon State University

Abstract

Recent advances in generalizable Gaussian splatting (GS) have enabled feed-forward reconstruction of scenes from tens of input views. Long-LRM notably scales this paradigm to 32 input images at 960×540 resolution, achieving 360° scene-level reconstruction in a single forward pass. However, directly predicting millions of Gaussian parameters at once remains highly error-sensitive: small inaccuracies in positions or other attributes lead to noticeable blurring, particularly in fine structures such as text. In parallel, implicit-representation methods such as LVSM and LaCT have demonstrated significantly higher rendering fidelity by compressing scene information into model weights rather than explicit Gaussians and decoding RGB frames with the full Transformer or test-time-training (TTT) backbone. However, this computationally intensive decompression must be repeated for every rendered frame, making real-time rendering infeasible. These observations raise key questions: Is the deep, sequential "decompression" process necessary? Can we retain the benefits of implicit representations while enabling real-time performance? We address these questions with Long-LRM++, a model that adopts a semi-explicit scene representation combined with a lightweight decoder. Long-LRM++ matches the rendering quality of LaCT on DL3DV while achieving real-time rendering at 14 FPS on an A100 GPU, overcoming the speed limitations of prior implicit methods. Our design also scales to 64 input views at 960×540 resolution, demonstrating strong generalization to longer input sequences. Additionally, Long-LRM++ delivers superior novel-view depth prediction on ScanNetv2 compared to direct depth rendering from Gaussians. Extensive ablation studies validate the effectiveness of each component of the proposed framework.



Video comparison with Long-LRM

Architecture of Long-LRM++

Long-LRM++ takes up to 64 input images at 960×540 resolution, along with their camera poses, and processes them with a backbone of interleaved Mamba2 and Transformer blocks, similar to Long-LRM. Each image token predicts $K$ free-moving feature Gaussians (visualized with the color of the originating pixel). During rendering, we introduce a multi-space partitioning step that divides the Gaussians into multiple subsets, each rendered and decoded independently. The target-frame decoder incorporates translation-invariant local-attention blocks to improve robustness and rendering quality. Finally, the decoded feature maps are merged and passed through a linear layer to produce the novel-view color or depth rendering.
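To make the decode-and-merge stage concrete, below is a minimal PyTorch sketch written under our own assumptions: the feature-Gaussian splatting step is omitted (the demo substitutes random feature maps for the rasterizer output), the merge rule (mean over subsets) is a guess, and the window size, feature width, block depth, subset count, and output channels are illustrative choices rather than the paper's values. The local-attention block here uses windowed self-attention with no absolute positional encoding, which is one simple way to realize translation-invariant local attention.

# Sketch only: splatting of feature Gaussians into per-subset feature maps is
# assumed to happen upstream and is not shown here.
import torch
import torch.nn as nn


class LocalAttentionBlock(nn.Module):
    """Self-attention over non-overlapping windows of the feature map.

    No absolute positional encoding is used, so the block behaves identically
    under translations of the window grid.
    """

    def __init__(self, dim: int, window: int = 8, heads: int = 4):
        super().__init__()
        self.window = window
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, H, W, C)
        B, H, W, C = x.shape
        w = self.window
        # Partition into (H/w)*(W/w) non-overlapping windows of w*w tokens each.
        t = x.reshape(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        t = t.reshape(-1, w * w, C)
        h = self.norm1(t)
        t = t + self.attn(h, h, h, need_weights=False)[0]
        t = t + self.mlp(self.norm2(t))
        # Undo the window partition.
        t = t.reshape(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        return t.reshape(B, H, W, C)


class TargetFrameDecoder(nn.Module):
    """Lightweight decoder: decode each subset's feature map, merge, linear head."""

    def __init__(self, dim: int = 64, out_ch: int = 4, depth: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList([LocalAttentionBlock(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, out_ch)  # e.g. RGB + depth (illustrative)

    def forward(self, feat_maps):  # list of (B, H, W, C), one per Gaussian subset
        decoded = []
        for f in feat_maps:
            for blk in self.blocks:
                f = blk(f)
            decoded.append(f)
        merged = torch.stack(decoded, dim=0).mean(dim=0)  # merge subsets (assumed rule)
        return self.head(merged)  # (B, H, W, out_ch)


if __name__ == "__main__":
    # Stand-in for multi-space partitioning + feature splatting: pretend four
    # Gaussian subsets were each rasterized into a 64-channel feature map for a
    # small 96x96 target view.
    feat_maps = [torch.randn(1, 96, 96, 64) for _ in range(4)]
    out = TargetFrameDecoder(dim=64, out_ch=4)(feat_maps)
    print(out.shape)  # torch.Size([1, 96, 96, 4])

Because the heavy backbone runs only once per scene and the per-frame cost is limited to splatting plus this shallow windowed decoder, rendering stays cheap enough for the real-time rates reported above.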

BibTeX

@article{ziwen2025llrm2,
  title={Long-LRM++: Preserving Fine Details in Feed-Forward Wide-Coverage Reconstruction},
  author={Ziwen, Chen and Tan, Hao and Wang, Peng and Xu, Zexiang and Fuxin, Li},
  journal={arXiv preprint},
  year={2025}
}