HTCV
Abstract: Reconstructing humans in 3D from a single image is an inherently ambiguous problem, since multiple 3D poses can lead to the same reprojection. However, most related works return only one pose estimate for each input image. This fails to capture the multimodal aspect of the problem and results in systems with potentially non-interpretable or non-trustworthy behavior. In this talk, I will discuss our recent work that tries to embrace this multimodality and recasts the problem as learning a mapping from the input to a distribution of plausible 3D poses. Our approach is based on normalizing flows and offers a series of advantages. For conventional applications, where a single 3D estimate is required, our formulation allows for efficient mode computation. Using the mode leads to performance that is comparable with the state of the art among deterministic unimodal regression models. Simultaneously, we demonstrate that our model is useful in a series of downstream tasks...
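To make the "mapping from the input to a distribution of plausible 3D poses" concrete, here is a minimal, hypothetical PyTorch sketch of a conditional normalizing flow. The pose dimensionality, conditioning feature, and single coupling layer are illustrative assumptions, not the authors' implementation; the sketch only shows how one image feature can yield many pose samples and, as the abstract's "efficient mode computation" suggests, a single estimate by pushing the base distribution's mode through the flow.

```python
# Minimal sketch (not the authors' code): a conditional normalizing flow that
# maps an image feature to a distribution over 3D pose parameters.
# Dimensions and the single coupling layer are illustrative assumptions.
import torch
import torch.nn as nn

class ConditionalAffineCoupling(nn.Module):
    """One affine coupling layer conditioned on an image feature."""
    def __init__(self, pose_dim, feat_dim, hidden=256):
        super().__init__()
        self.half = pose_dim // 2
        # Predicts scale and shift for the second half of the pose vector
        # from the first half concatenated with the conditioning feature.
        self.net = nn.Sequential(
            nn.Linear(self.half + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (pose_dim - self.half)),
        )

    def forward(self, z, feat):
        # Latent -> pose direction (used for sampling plausible poses).
        z1, z2 = z[:, :self.half], z[:, self.half:]
        s, t = self.net(torch.cat([z1, feat], dim=1)).chunk(2, dim=1)
        x2 = z2 * torch.exp(s) + t
        return torch.cat([z1, x2], dim=1)

    def inverse(self, x, feat):
        # Pose -> latent direction (used for likelihood evaluation).
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(torch.cat([x1, feat], dim=1)).chunk(2, dim=1)
        z2 = (x2 - t) * torch.exp(-s)
        log_det = -s.sum(dim=1)
        return torch.cat([x1, z2], dim=1), log_det

pose_dim, feat_dim = 72, 512        # e.g. body pose parameters, CNN feature size
flow = ConditionalAffineCoupling(pose_dim, feat_dim)
feat = torch.randn(1, feat_dim)     # feature extracted from the input image

# Multimodality: several plausible 3D poses for the same image,
# obtained by sampling different latents from the base distribution.
samples = flow(torch.randn(16, pose_dim), feat.expand(16, -1))

# Single estimate: push the base distribution's mode (z = 0) through the flow.
mode_pose = flow(torch.zeros(1, pose_dim), feat)
```

In this toy setup, sampling and the single-mode estimate reuse the same forward pass, which is why a probabilistic formulation can still serve conventional applications that need one 3D output.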