Equivariant and Geometry-Aware 3D Perception for Robots and Automated Vehicles

Zhu, Minghan

2023

Abstract

This dissertation presents novel equivariant and geometry-aware learning methods to address 3D perception challenges in robotic and automated-driving applications. 3D perception allows computers to comprehend the real-world environment from the data of sensors such as cameras and LiDARs. Although deep neural networks are powerful, they cannot perfectly fit all input-output relations, because model capacity and data coverage are limited relative to the enormous variation in the real world. This work applies known geometric properties to improve the performance, efficiency, and reliability of deep models for two types of 3D perception problems: point-cloud-based and monocular-image-based. In point-cloud-based perception, the input data and the output targets are usually in the same space, the Euclidean 3D space where the physical world lives, so the input and output spaces carry the same transformations. We embed the equivariance property into deep models, which guarantees that a transformation applied to the input produces the corresponding transformation of the output, enabling generalization to transformed inputs. However, existing equivariant models are complex and computationally expensive. We design models equivariant to 3D rotations and rigid body transformations with a simpler network structure and significantly reduced computational cost. These are applied to various robotic perception tasks, including object classification, object pose estimation, keypoint matching, and point cloud registration, showing superior performance and robustness. We also apply the equivariant models to a larger-scale outdoor scenario and a more complicated perception task, 4D panoptic segmentation, where, for the first time, an equivariant model achieves higher performance and lower computational cost simultaneously. For monocular 3D perception tasks, the input data and the output targets are typically not in the same space, as we need to recover 3D information from a 2D projected image.
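As a rough illustration of the equivariance property described above for point clouds, the following sketch checks numerically whether a function f satisfies f(RX + t) = R f(X) + t for a random rigid body transformation. The functions `centroid` and `coordinate_max`, and the helper names, are hypothetical toy stand-ins chosen for this example; they are not the models developed in the dissertation.

```python
import numpy as np

def random_rotation(rng):
    # QR decomposition of a Gaussian matrix yields an orthogonal matrix;
    # flip one column if needed so that det = +1 (a proper rotation).
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

def centroid(points):
    # Toy "model": maps an (N, 3) point cloud to a single 3D point.
    return points.mean(axis=0)

def coordinate_max(points):
    # Toy counterexample: the per-axis maximum depends on the axes chosen.
    return points.max(axis=0)

def equivariance_gap(f, points, rotation, translation):
    # Largest absolute difference between f(transformed input)
    # and the transformed f(input); zero means exact equivariance.
    transformed_then_f = f(points @ rotation.T + translation)
    f_then_transformed = f(points) @ rotation.T + translation
    return float(np.max(np.abs(transformed_then_f - f_then_transformed)))

rng = np.random.default_rng(0)
cloud = rng.normal(size=(100, 3))
R = random_rotation(rng)
t = rng.normal(size=3)

gap_centroid = equivariance_gap(centroid, cloud, R, t)
gap_max = equivariance_gap(coordinate_max, cloud, R, t)
print(f"centroid gap: {gap_centroid:.2e}, coordinate-max gap: {gap_max:.2e}")
```

The centroid commutes with rigid body transformations up to floating-point error, while the per-axis maximum does not, which is the kind of behavior an equivariant architecture guarantees by construction rather than learning from data.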
The relationship between the 2D image and the underlying 3D scene can be explained by homography, i.e., projective geometry. Therefore, we incorporate the homography structure into our monocular perception models. Leveraging the homography between the road and image planes, we build a monocular 3D object detection network for roadside traffic cameras without intrinsic and extrinsic calibrations. We then generalize the homography between fixed road planes and cameras to a variable homography between moving cameras and moving objects, based on which we develop a monocular 3D object detection network for driver-view cameras. Our network achieves higher accuracy through local homography and is the first to estimate object depth without camera intrinsic parameters. While SE(3)-equivariance is infeasible for monocular 3D models because depth is lost during projection, we draw on the experience from point cloud learning that non-equivariant models can perform decently when the learning target is transformation-invariant. Therefore, we propose learning viewpoint-invariant targets for monocular 3D object detection models, i.e., the relative pose between objects. Experiments show that the proposed inter-object estimation module improves overall 3D object detection performance as well as the estimation of relative poses between objects and of object motion states. Overall, this dissertation explores the value of equivariance and geometric structures in 3D perception tasks for robotic and automated-driving applications. On the one hand, we develop equivariant models that are efficient and easy to incorporate into general deep-learning models, improving their practicality. On the other hand, our work validates that exploiting the inductive bias of symmetry helps reduce computation and improve performance in large-scale 3D perception problems for robots and automated driving.
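To make the road-plane homography mentioned above concrete, here is a minimal sketch of how a 3x3 homography maps points on a plane to image pixels and back. The matrix `H_road_to_image` and the sample points are invented for illustration only; the dissertation's networks estimate this structure from images and do not assume a known matrix like this one.

```python
import numpy as np

def apply_homography(H, pts):
    # Lift (N, 2) points to homogeneous coordinates, apply H,
    # then divide by the last coordinate to return to 2D.
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

# Hypothetical homography from road-plane coordinates (meters)
# to image coordinates (pixels); values are made up for this example.
H_road_to_image = np.array([
    [800.0,    0.0,   640.0],
    [  0.0, -200.0,   700.0],
    [  0.0,    0.002,   1.0],
])

# Three sample points on the road plane: (lateral, forward) in meters.
road_pts = np.array([[0.0, 10.0], [3.5, 20.0], [-3.5, 40.0]])

# Forward mapping: where each road point appears in the image.
pixels = apply_homography(H_road_to_image, road_pts)

# Inverse mapping: recover road-plane positions from pixels.
recovered = apply_homography(np.linalg.inv(H_road_to_image), pixels)
print(np.round(pixels, 1))
print(np.round(recovered, 3))
```

Because the mapping between a plane and its image is invertible, a pixel known to lie on the road can be placed on the road plane without a separate depth measurement, which is the geometric structure the abstract refers to.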

Deep Blue DOI

https://dx.doi.org/10.7302/8251

Subjects

computer vision

equivariant learning

robotics

deep learning

automated driving

Types

Thesis

Handle

https://hdl.handle.net/2027.42/177794

Collections

Dissertations and Theses (Ph.D. and Master's)
