Abstract

Recently, the Segment Anything Model (SAM) has showcased remarkable zero-shot segmentation capabilities, while NeRF (Neural Radiance Fields) has gained popularity as a method for various 3D problems beyond novel view synthesis. Although there have been initial attempts to combine these two methods for 3D segmentation, they struggle to segment objects accurately and consistently in complex scenarios. In this paper, we introduce Segment Anything for NeRF in High Quality (SANeRF-HQ), which achieves high-quality 3D segmentation of any object in a given scene. SANeRF-HQ utilizes SAM for open-world object segmentation guided by user-supplied prompts, while leveraging NeRF to aggregate information from different viewpoints. To overcome the aforementioned challenges, we employ the density field and RGB similarity to enhance the accuracy of segmentation boundaries during aggregation. Emphasizing segmentation accuracy, we evaluate our method quantitatively on multiple NeRF datasets where high-quality ground truths are available or manually annotated. SANeRF-HQ shows a significant quality improvement over previous state-of-the-art methods in NeRF object segmentation, provides higher flexibility for object localization, and enables more consistent object segmentation across multiple views.

Overview

Video

Mask Aggregation

Previous works either directly decode masks from distilled SAM features or use self-prompting to iteratively inverse-render SAM masks. The former suffers from aliasing in the rendered SAM features and cannot produce consistent masks across views, while the latter may accumulate errors from incorrect self-prompts. We therefore propose to aggregate the imperfect 2D SAM masks in 3D space to generate high-quality and consistent 3D masks.
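
The snippet below is a minimal sketch of this aggregation idea, assuming a frozen NeRF that supplies per-ray compositing weights and an auxiliary object field that predicts per-point object logits; all class and function names here are illustrative, not the paper's actual API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectField(nn.Module):
    """Hypothetical per-point object-logit field (names are illustrative)."""
    def __init__(self, pos_dim=3, hidden=128, num_objects=1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_objects),
        )

    def forward(self, xyz):
        # xyz: (R, S, 3) sampled points -> (R, S, K) object logits
        return self.mlp(xyz)

def render_object_logits(logits, weights):
    """Volume-render per-sample logits with the frozen NeRF's ray weights.
    logits:  (R, S, K) object logits at sampled points
    weights: (R, S)    alpha-compositing weights from the density field
    """
    return (weights.unsqueeze(-1) * logits).sum(dim=1)  # (R, K)

def mask_aggregation_loss(rendered_logits, sam_mask):
    """Supervise rendered object probabilities with (possibly noisy) 2D SAM masks.
    rendered_logits: (R, K); sam_mask: (R, K) with values in {0, 1}.
    """
    return F.binary_cross_entropy_with_logits(rendered_logits, sam_mask.float())
```

Because the object field is supervised by SAM masks from many views at once, view-specific mask errors tend to cancel out in the aggregated 3D result.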

Ray-Pair RGB Loss

Segmentation errors in both 3D and 2D are most likely to occur at object boundaries. One observation is that humans usually distinguish object boundaries by the color and texture differences on the two sides. We therefore introduce the Ray-Pair RGB loss, which incorporates color and spatial information to improve segmentation quality.


We sample rays from regions with high training error and reproject the corresponding 3D surface points onto different training views to gather image patches. The Ray-Pair RGB loss is then applied between a set of reference rays and all other rays in these patches, allowing us to regularize the object field using appearance information from multiple views simultaneously.
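
Below is a minimal sketch of the ray-pair idea: rays whose colors are similar are encouraged to share object predictions. The Gaussian color affinity, the disagreement measure, and the tensor shapes are assumptions made for illustration, not the paper's exact formulation.

```python
import torch

def ray_pair_rgb_loss(ref_rgb, ref_logits, nbr_rgb, nbr_logits, sigma=0.1):
    """Illustrative Ray-Pair RGB loss (weighting and threshold are assumptions).

    ref_rgb:    (P, 3)     colors of reference rays
    ref_logits: (P, K)     object logits rendered for reference rays
    nbr_rgb:    (P, N, 3)  colors of rays gathered from reprojected patches
    nbr_logits: (P, N, K)  object logits rendered for those rays
    """
    # Color affinity between each reference ray and its neighboring rays.
    color_dist = (nbr_rgb - ref_rgb.unsqueeze(1)).pow(2).sum(-1)   # (P, N)
    affinity = torch.exp(-color_dist / (2 * sigma ** 2))           # (P, N)

    ref_prob = torch.sigmoid(ref_logits).unsqueeze(1)              # (P, 1, K)
    nbr_prob = torch.sigmoid(nbr_logits)                           # (P, N, K)

    # Penalize disagreement between similarly colored rays, weighted by affinity.
    disagreement = (ref_prob - nbr_prob).abs().mean(-1)            # (P, N)
    return (affinity * disagreement).mean()
```

In effect, the loss pulls the object field toward decisions that respect color edges, which is where both SAM masks and rendered masks tend to be least reliable.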

Additional Comparison

Other Prompts

Our method also supports text prompts via Grounding-DINO, as well as automatically segmenting everything in the scene using a grid of point prompts:
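
As a rough sketch of how these prompt types reach SAM, the snippet below feeds a text-grounded box into SAM's predictor and uses SAM's automatic generator for the grid-of-points mode. The checkpoint path, image, and box values are placeholders, and obtaining the box from Grounding-DINO is outside this sketch.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor, SamAutomaticMaskGenerator

# Placeholder checkpoint and image (a rendered training view would be used in practice).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
image = np.zeros((512, 512, 3), dtype=np.uint8)  # RGB, HWC

# 1) Box prompt: a text query is converted to a box by a grounded detector
#    (e.g., Grounding-DINO); the box below is a placeholder value.
predictor = SamPredictor(sam)
predictor.set_image(image)
box = np.array([100, 120, 380, 400])  # x0, y0, x1, y1 from the detector
masks, scores, _ = predictor.predict(box=box, multimask_output=False)

# 2) "Segment everything": SAM tiles the image with a grid of point prompts
#    and returns one mask per detected region.
mask_generator = SamAutomaticMaskGenerator(sam, points_per_side=32)
all_masks = mask_generator.generate(image)  # list of dicts, each with a 'segmentation' array
```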

Dynamic NeRF

We present a preliminary demonstration showing that our method extends easily to 4D dynamic NeRF representations:

Citation

Acknowledgements

The website template was borrowed from Michaël Gharbi and Ref-NeRF.