Snipper: A Spatiotemporal Transformer for Simultaneous Multi-Person 3D Pose Estimation Tracking and Forecasting on a Video Snippet

Zou, Shihao; Xu, Yuanlu; Li, Chao; Ma, Lingni; Cheng, Li; Vo, Minh

arXiv:2207.04320 (cs)

[Submitted on 9 Jul 2022 (v1), last revised 12 Sep 2023 (this version, v3)]

Abstract: Multi-person pose understanding from RGB videos involves three complex tasks: pose estimation, tracking, and motion forecasting. Intuitively, accurate multi-person pose estimation facilitates robust tracking, and robust tracking builds the history needed for correct motion forecasting. Most existing works either focus on a single task or employ multi-stage approaches that solve the tasks separately, which tends to produce sub-optimal decisions at each stage and fails to exploit correlations among the three tasks. In this paper, we propose Snipper, a unified framework that performs multi-person 3D pose estimation, tracking, and motion forecasting simultaneously in a single stage. We propose an efficient yet powerful deformable attention mechanism to aggregate spatiotemporal information from the video snippet. Building upon this deformable attention, a video transformer is learned to encode the spatiotemporal features from the multi-frame snippet and to decode informative pose features for multi-person pose queries. Finally, these pose queries are regressed to predict multi-person pose trajectories and future motions in a single shot. In the experiments, we show the effectiveness of Snipper on three challenging public datasets, where our generic model rivals specialized state-of-the-art baselines for pose estimation, tracking, and forecasting.
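The abstract does not spell out the deformable attention mechanism, so the following is only a minimal sketch of the general idea of deformable attention applied across a snippet: a query looks at a few sampled locations around a reference point in every frame and takes a softmax-weighted sum of the features found there. It is not the authors' implementation. The function names, the nested-list feature layout, and the choice to pass `offsets` and `scores` in as plain inputs (in a real model they would be predicted from the query by learned layers) are all assumptions made for illustration.

```python
import math


def bilinear_sample(frame, x, y):
    """Interpolate an H x W x C feature map (nested lists) at a continuous (x, y).

    Coordinates are clamped to the map bounds before sampling.
    """
    h, w = len(frame), len(frame[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    channels = len(frame[0][0])
    return [
        (1 - ay) * ((1 - ax) * frame[y0][x0][c] + ax * frame[y0][x1][c])
        + ay * ((1 - ax) * frame[y1][x0][c] + ax * frame[y1][x1][c])
        for c in range(channels)
    ]


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def deformable_attention(snippet, reference, offsets, scores):
    """Aggregate features for one query over a T-frame snippet.

    snippet:   list of T feature maps, each H x W x C (nested lists)
    reference: (x, y) reference point shared by all frames
    offsets:   offsets[t] is a list of (dx, dy) sampling offsets for frame t
               (hypothetical input; a model would predict these from the query)
    scores:    scores[t][k] is the attention logit for sample k in frame t
               (hypothetical input; likewise predicted in a real model)
    """
    # Normalize jointly over every sample in every frame, so the query can
    # weight information from different time steps against each other.
    weights = softmax([s for frame_scores in scores for s in frame_scores])
    channels = len(snippet[0][0][0])
    out = [0.0] * channels
    k = 0
    for t, frame in enumerate(snippet):
        for dx, dy in offsets[t]:
            sample = bilinear_sample(frame, reference[0] + dx, reference[1] + dy)
            for c in range(channels):
                out[c] += weights[k] * sample[c]
            k += 1
    return out
```

The point of the design is cost: each query touches only a handful of sampled locations per frame instead of attending to every pixel of every frame, which is what makes attention over a multi-frame snippet tractable.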

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2207.04320 [cs.CV]
	(or arXiv:2207.04320v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2207.04320

Submission history

From: Shihao Zou
[v1] Sat, 9 Jul 2022 18:42:14 UTC (5,395 KB)
[v2] Wed, 13 Jul 2022 07:55:51 UTC (5,398 KB)
[v3] Tue, 12 Sep 2023 21:21:35 UTC (4,184 KB)
