In this paper, we propose an object recognition system that provides a complete understanding of the main objects in an indoor scene from a single 360º image in equirectangular projection. Our method extends the BlitzNet model to perform both object detection and semantic segmentation, adapting it to the nature of the equirectangular input. We train the network to predict 14 classes of common indoor objects. The detection and segmentation predictions are post-processed to obtain instance segmentation masks, which are further refined by exploiting the room layout. We show not only the potential of exploiting the 2D room layout to improve the instance segmentation masks, but also the possibility of leveraging the 3D layout to generate 3D object bounding boxes directly from the improved masks.
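The post-processing step that turns the two network outputs into instance masks can be sketched as follows. This is a minimal illustration, not the authors' actual code: it assumes detections as pixel-coordinate boxes with class labels and a per-pixel semantic map, and simply keeps, inside each box, the segmentation pixels of the matching class. The helper name `instance_masks_from_predictions` is hypothetical, and for simplicity the sketch ignores objects that wrap around the 360º seam of the panorama.

```python
import numpy as np

def instance_masks_from_predictions(boxes, labels, sem_seg):
    """Hypothetical sketch: intersect each detection with the semantic map.

    boxes:   (N, 4) array of [x1, y1, x2, y2] in pixel coordinates.
    labels:  (N,) array of class ids, one per box.
    sem_seg: (H, W) array of per-pixel class ids.
    Returns a list of (H, W) boolean instance masks, one per detection.
    """
    masks = []
    for (x1, y1, x2, y2), cls in zip(boxes.astype(int), labels):
        mask = np.zeros(sem_seg.shape, dtype=bool)
        # Keep only pixels of the detected class inside the box.
        mask[y1:y2, x1:x2] = (sem_seg[y1:y2, x1:x2] == cls)
        masks.append(mask)
    return masks
```

In the paper these raw masks are then refined using the room layout, e.g. to discard pixels that fall on walls or outside plausible object regions.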
A re-implementation of our work in TensorFlow 2.0 is available here!
In this work we use 666 indoor panoramas from the SUN360 dataset (Zhang et al.), which we extend with segmentation labels. For every panorama, we generate individual masks encoding each object's spatial extent, and we combine all the masks to obtain a semantic segmentation panoramic image with per-pixel classification.
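The combination of the individual object masks into one per-pixel map can be sketched as below. This is an illustrative snippet under stated assumptions, not the labeling tool used for the dataset: it assumes one boolean mask and one class id per object, reserves class 0 for background, and lets later masks overwrite earlier ones where objects overlap.

```python
import numpy as np

def combine_masks(masks, labels, height, width):
    """Hypothetical sketch: merge per-object binary masks into a class map.

    masks:  list of (H, W) boolean arrays, one per annotated object.
    labels: list of class ids, one per mask (0 is reserved for background).
    Returns an (H, W) array with a class id at every pixel.
    """
    sem_seg = np.zeros((height, width), dtype=np.int32)
    for mask, cls in zip(masks, labels):
        # Later masks overwrite earlier ones on overlapping pixels.
        sem_seg[mask] = cls
    return sem_seg
```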
By handling the inherent characteristics and challenges of equirectangular panoramas, our method clearly outperforms the state of the art.
Here we present qualitative results of our Panoramic BlitzNet CNN with EquiConvs on both the object detection and semantic segmentation tasks.