•June, 2022: The CVPR 2022 oral video is available below. The slides are available in Keynote and PDF formats here.
•May, 2022: We applied 3D Common Corruptions to ImageNet and created the ImageNet-3DCC benchmark. It is now also part of RobustBench! Click here for a quickstart. The dataset can be accessed here.
•March, 2022: A live demo is available. Try your own images on our depth and surface normal models!
CVPR 2022 oral video (5 minutes) is available below.
You can also watch an extended overview video (12 minutes) here.
Using 3D information to generate real-world corruptions.
The top row shows 2D corruptions applied uniformly over the image, e.g. as in Common Corruptions, disregarding 3D information.
This leads to corruptions that are unlikely to occur in the real world, e.g. the same motion blur over the entire image irrespective of distance from the camera (top left).
The middle row shows their 3D counterparts from 3D Common Corruptions (3DCC). The circled regions highlight the effect of incorporating 3D information.
More specifically, in 3DCC,
1. motion blur has a motion parallax effect where objects further away from the camera seem to move less,
2. defocus blur has a depth of field effect, akin to a large aperture effect in real cameras, where certain regions of the image can be selected to be in focus,
3. lighting takes the scene geometry into account when illuminating the scene and casts shadows on objects,
4. fog gets denser further away from the camera,
5. occlusions of a target object, e.g. fridge (blue mask), are created by changing the camera’s viewpoint and having its view naturally obscured by another object, e.g. the plant (red mask).
This is in contrast to its 2D counterpart that randomly discards patches.
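The depth dependence of several of these corruptions has a simple form. Fog, for instance, can be approximated with the standard atmospheric scattering model, in which transmittance decays exponentially with distance from the camera. The sketch below illustrates the idea under that assumption; the function name and the `beta` and `airlight` parameters are illustrative, not the exact formulation used by 3DCC:

```python
import numpy as np

def apply_fog(image, depth, beta=0.1, airlight=0.8):
    """Blend an RGB image with a homogeneous airlight using a
    depth-dependent transmittance t = exp(-beta * depth), so the fog
    gets denser further away from the camera.

    image: HxWx3 float array in [0, 1]
    depth: HxW float array of per-pixel distances from the camera
    """
    t = np.exp(-beta * depth)[..., None]      # HxWx1 transmittance
    return image * t + airlight * (1.0 - t)   # scattering model: fade to airlight
```

Increasing `beta` plays the role of the shift intensity (severity): a larger attenuation coefficient makes the fog denser at every depth.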
The video below compares 2D and 3D corruptions for different queries and shift intensities.
Robustness in the Real World
Computer vision models deployed in the real world will encounter naturally occurring distribution shifts from their training data.
These shifts range from lower-level distortions, such as motion blur and illumination changes, to semantic ones, like object occlusion.
Each of them represents a possible failure mode of a model and has been frequently shown to result in profoundly unreliable predictions.
Thus, systematically testing vulnerabilities to these shifts is critical before deploying models in the real world.
A New Set of Realistic Corruptions
This work presents a set of distribution shifts to test model robustness.
In contrast to previously proposed shifts which perform uniform 2D modifications over the image, such as Common Corruptions (2DCC),
our shifts incorporate 3D information to generate corruptions that are consistent with the scene geometry.
This leads to shifts that are more likely to occur in the real world (see the figure above).
The resulting set includes 20 corruptions, each representing a distribution shift from training data, which we denote as 3D Common Corruptions (3DCC).
3DCC addresses several aspects of the real world, such as camera motion, weather, occlusions, depth of field, and lighting.
The figure below provides an overview of all corruptions. As shown in the figure above, the corruptions in 3DCC are more diverse and realistic than those of 2D-only approaches.
The new corruptions.
We propose a diverse set of new corruption operations ranging from defocusing (near/far focus) to lighting changes and 3D-semantic ones, e.g. object occlusion.
These corruptions are all automatically generated, efficient to compute, and can be applied to most datasets.
We show that they expose vulnerabilities in models and are a good approximation of realistic corruptions.
The video below shows some corruptions from 3DCC with increasing shift intensities (severities).
Key Aspects of Our Work
Our paper incorporates 3D information into robustness benchmarking and training which opens up a promising direction for robustness research. Specifically:
•A challenging benchmark: We show that the performance of methods aiming to improve robustness, including those with diverse data augmentation, drops drastically under 3D Common Corruptions (3DCC). Furthermore, we observe that the robustness issues exposed by 3DCC correlate well with corruptions generated via photorealistic synthesis. Thus, 3DCC can serve as a challenging test bed for real-world corruptions, especially those that depend on scene geometry.
•New 3D data augmentations: Motivated by this, our framework also introduces new 3D data augmentations. They take the scene geometry into account, as opposed to 2D augmentations, thus enabling models to build invariances against more realistic corruptions. We show that they significantly boost model robustness against such corruptions, including ones that 2D augmentations cannot address.
•Easy to generate and extend: The proposed corruptions are generated programmatically with exposed parameters, enabling fine-grained analysis of robustness, e.g. by continuously increasing the 3D motion blur. The corruptions are also efficient to compute and can be computed on-the-fly during training as data augmentation with a small increase in computational cost. They are also extendable, i.e. they can be applied to standard vision datasets, such as ImageNet, that do not come with 3D labels.
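Because the corruptions are parametric and cheap to compute, they can be dropped into a training pipeline as on-the-fly augmentation. A minimal sketch of that usage, assuming the individual depth-aware corruption functions (3D motion blur, 3D fog, etc.) are provided elsewhere; the class name and interface are hypothetical, not the released API:

```python
import random

class Random3DCorruption:
    """On-the-fly 3D data augmentation: apply a randomly chosen
    depth-aware corruption at a random severity to each sample.

    `corruptions` maps a name to a function f(image, depth, severity);
    the corruption functions themselves are assumed to exist elsewhere.
    """
    def __init__(self, corruptions, severities=(1, 2, 3, 4, 5), p=0.5, seed=None):
        self.corruptions = corruptions
        self.severities = severities
        self.p = p                         # probability of corrupting a sample
        self.rng = random.Random(seed)

    def __call__(self, image, depth):
        if self.rng.random() > self.p:
            return image                   # keep some clean samples
        fn = self.rng.choice(list(self.corruptions.values()))
        severity = self.rng.choice(self.severities)
        return fn(image, depth, severity)
```

Exposing the severity as a parameter is also what enables fine-grained robustness analysis, e.g. sweeping the 3D motion blur magnitude continuously rather than over a fixed grid.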
Benchmarking with 3DCC
3DCC provides a challenging and realistic test bed to identify model failures. Let's look at some results.
• The video below shows predictions of a baseline surface normals model degrading with corrupted inputs.
• See also the predictions of a baseline segmentation model degrading with occlusions.
• The figure below quantitatively shows the degraded performance for normals and depth estimation tasks under real-world corruptions approximated by 3DCC.
Existing robustness mechanisms are found to be insufficient for addressing real-world corruptions approximated by 3DCC.
The performance of models with different robustness mechanisms under 3DCC for the surface normals (left) and depth (right) estimation tasks is shown.
Each bar shows the l1 error averaged over all 3DCC corruptions (lower is better).
The red line denotes the performance of the baseline model on clean (uncorrupted) data.
The results show that existing robustness mechanisms, including those with diverse augmentations, perform poorly under 3DCC.
Please see the paper for details.
• Please see the paper for more results, e.g. evaluations demonstrating that 3DCC exposes vulnerabilities in models that are not captured by 2DCC and that the generated corruptions are similar to expensive photorealistic synthetic ones.
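The aggregation behind these bar charts, an l1 error averaged over every (corruption, severity) pair, can be sketched as follows. The function names and the `(image, depth, target)` sample format are illustrative assumptions, not the released evaluation code:

```python
import numpy as np

def l1_error(pred, target):
    """Mean absolute error between a prediction and its ground truth."""
    return np.mean(np.abs(pred - target))

def benchmark_3dcc(model, samples, corruptions, severities=(1, 2, 3, 4, 5)):
    """Average l1 error of `model` over every (corruption, severity) pair.

    `corruptions` maps a name to f(image, depth, severity); `samples`
    is an iterable of (image, depth, target) tuples. Hypothetical API.
    """
    errors = []
    for corrupt in corruptions.values():
        for severity in severities:
            for image, depth, target in samples:
                pred = model(corrupt(image, depth, severity))
                errors.append(l1_error(pred, target))
    return float(np.mean(errors))
```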
Applying 3DCC to Standard Vision Datasets
While we employed datasets with full scene geometry information such as Taskonomy, 3DCC can also be applied to standard datasets without 3D information.
We exemplify this on the ImageNet and COCO validation sets by leveraging depth predictions from a state-of-the-art depth estimator. The figure below shows example corrupted images from 3DCC. The generated images are physically plausible, demonstrating that the community can use 3DCC to generate a diverse set of image corruptions for other datasets.
3DCC can be applied to most datasets,
even those that do not come with 3D information.
Several query images from the ImageNet and COCO datasets are shown above with the near focus, far focus, and fog 3D corruptions applied.
Notice how the objects in the circled regions go from sharp to blurry depending on the focus region and scene geometry.
To obtain the depth information needed to create these corruptions, predictions from the MiDaS model are used.
This gives a good enough approximation to generate realistic corruptions (as also analyzed in the paper).
The video below shows sample corrupted ImageNet and COCO images with increasing shift intensities.
•ImageNet-3DCC & COCO-3DCC: See our repository which contains the scripts and depth data to apply 3DCC on ImageNet and COCO datasets.
We demonstrate in the paper, both qualitatively and quantitatively, that the proposed augmentations significantly improve model robustness compared to baselines. The figure below shows qualitative results on a diverse set of query images.
Qualitative results of learning with 3D data augmentation
on random queries from OASIS, Adobe After Effects generated data, manually collected DSLR data, and in-the-wild YouTube videos for surface normals.
The ground truth is gray when it is not available, e.g. for YouTube. Our predictions in the last row are noticeably sharper and more accurate compared to baselines. Please see the paper for more details.
We also evaluate the performance on several query videos. The predictions are made frame-by-frame with no temporal smoothing. Our model using the proposed augmentations is significantly more robust compared to baselines.
Live Demo & Pretrained Models (v2)
For a live demo on user-uploaded images, please visit here. The models are trained using cross-task consistency and 3D data augmentations on Omnidata.
You can download the pretrained depth and surface normal models used in the live demo from here.
If you use these pretrained models, please cite the following papers: