MVImgNet: A Large-scale Dataset of Multi-view Images
CVPR 2023


Abstract

Being data-driven is one of the most iconic properties of deep learning algorithms. The birth of ImageNet drove a remarkable trend of "learning from large-scale data" in computer vision. Pretraining on ImageNet to obtain rich universal representations has been shown to benefit various 2D visual tasks and has become a standard practice in 2D vision. However, due to the laborious collection of real-world 3D data, there is yet no generic dataset serving as a counterpart of ImageNet in 3D vision, so how such a dataset would impact the 3D community remains unexplored. To remedy this defect, we introduce MVImgNet, a large-scale dataset of multi-view images, which are highly convenient to collect by shooting videos of real-world objects in daily life. It contains 6.5 million frames from 219,188 videos covering objects from 238 classes, with rich annotations of object masks, camera parameters, and point clouds. The multi-view attribute endows our dataset with 3D-aware signals, making it a soft bridge between 2D and 3D vision.

We conduct pilot studies to probe the potential of MVImgNet on a variety of 3D and 2D visual tasks, including radiance field reconstruction, multi-view stereo, and view-consistent image understanding, where MVImgNet demonstrates promising performance while leaving many possibilities open for future exploration.

Besides, via dense reconstruction on MVImgNet, we derive a 3D object point cloud dataset, called MVPNet, covering 87,200 samples from 150 categories, with a class label for each point cloud. Experiments show that MVPNet can benefit real-world 3D object classification while posing new challenges to point cloud understanding.

MVImgNet and MVPNet will be publicly available; we hope they can inspire the broader vision community.

Dataset -- MVImgNet

Statistics

The statistics of MVImgNet are shown in Tab. 1. MVImgNet includes 6.5 million frames from 219,188 videos, covering 238 object classes. Fig. 1 shows some frames randomly sampled from MVImgNet. The annotations comprehensively cover object masks, camera parameters, and point clouds.

Tab. 1: Statistics of the amount of data generated by the collection pipeline, the valid amount remaining after our cleaning, and the GPU hours used for processing.


Fig. 1: A variety of multi-view images in MVImgNet.
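
To make the annotation format concrete, below is a minimal Python sketch of how one multi-view sample could be loaded. The per-object directory layout (images/, masks/, cameras.json) and all field names are assumptions made for illustration, not the official release format.

    import json
    from pathlib import Path

    import numpy as np
    from PIL import Image


    def load_object_views(obj_dir):
        """Load all views of one captured object.

        Assumed (hypothetical) layout, not the official release format:
            obj_dir/images/000.jpg ...  RGB frames extracted from the video
            obj_dir/masks/000.png  ...  binary foreground object masks
            obj_dir/cameras.json        per-frame intrinsics K and extrinsics R, t
        """
        obj_dir = Path(obj_dir)
        with open(obj_dir / "cameras.json") as f:
            cameras = json.load(f)  # e.g. {"000": {"K": [...], "R": [...], "t": [...]}, ...}

        views = []
        for img_path in sorted((obj_dir / "images").glob("*.jpg")):
            frame_id = img_path.stem
            rgb = np.asarray(Image.open(img_path).convert("RGB"))
            mask = np.asarray(Image.open(obj_dir / "masks" / f"{frame_id}.png").convert("L")) > 127
            cam = cameras[frame_id]
            views.append({
                "rgb": rgb,                   # (H, W, 3) uint8 frame
                "mask": mask,                 # (H, W) bool, True on the object
                "K": np.asarray(cam["K"]),    # (3, 3) camera intrinsics
                "R": np.asarray(cam["R"]),    # (3, 3) world-to-camera rotation
                "t": np.asarray(cam["t"]),    # (3,) world-to-camera translation
            })
        return views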

Category taxonomy

Fig. 2: Taxonomy of MVImgNet, where the angle of each class denotes its actual data proportion. Interior: parent classes. Exterior: child classes.

Dataset -- MVPNet

Derived from the dense reconstruction on MVImgNet, a new large-scale real-world 3D object point cloud dataset, MVPNet, is built, containing 87,200 point clouds across 150 categories. Compared with existing 3D object datasets, MVPNet contains a conspicuously richer amount of real-world object point clouds, with abundant categories covering many common objects in real life. The category distribution of MVPNet is shown in Fig. 3, and example point clouds are shown in Fig. 4.

Fig. 3: Category distribution of MVPNet.

Fig. 4: A variety of 3D object point clouds in MVPNet.
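
Since every point cloud in MVPNet carries a class label, a natural use case is training real-world 3D object classifiers. The sketch below shows a common preprocessing step (fixed-size subsampling plus unit-sphere normalization) before feeding a point cloud to a classifier; the .npy file path and function name are hypothetical, not part of the official release.

    import numpy as np


    def prepare_point_cloud(points, num_points=1024, seed=0):
        """Subsample and normalize one (N, 3) xyz point cloud for classification."""
        rng = np.random.default_rng(seed)

        # Randomly subsample (or resample with replacement if the scan is small)
        # to a fixed number of points, as PointNet-style classifiers expect.
        idx = rng.choice(len(points), size=num_points, replace=len(points) < num_points)
        pts = points[idx].astype(np.float32)

        # Center at the origin and scale into the unit sphere so that scans of
        # real objects with different physical sizes become comparable.
        pts -= pts.mean(axis=0, keepdims=True)
        pts /= np.max(np.linalg.norm(pts, axis=1)) + 1e-8
        return pts


    # Hypothetical usage: one sample stored as an .npy array of xyz coordinates.
    # pts = prepare_point_cloud(np.load("mvpnet/bottle/0001.npy"))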

Video (YouTube)

Video (Bilibili)

BibTeX

    @inproceedings{yu2023mvimgnet,
        title     = {MVImgNet: A Large-scale Dataset of Multi-view Images},
        author    = {Yu, Xianggang and Xu, Mutian and Zhang, Yidan and Liu, Haolin and Ye, Chongjie and Wu, Yushuang and Yan, Zizheng and Liang, Tianyou and Chen, Guanying and Cui, Shuguang and Han, Xiaoguang},
        booktitle = {CVPR},
        year      = {2023}
    }