HydraView.

Improving Viewpoint-Invariance and Temporal Consistency for Action Detection

1Université Bourgogne Europe, 2TEB Group
Overview of our method.

Overview of our temporal action detection method. For each view, an untrimmed sequence is passed to our spatio-temporal encoder to generate features for each temporal window. These features are then refined by our multi-view, multi-scale temporal encoder and projected to predict the start and end of each action over time.

Abstract

Temporal Action Detection (TAD) necessitates the precise recognition and localization of actions within untrimmed videos. Current approaches predominantly focus on single-view systems, which are constrained by a single perspective during both training and inference. Furthermore, existing skeleton-based methods typically rely on local temporal windows, often failing to capture long-range dependencies between these windows. To address these limitations, this paper introduces a novel multi-view framework designed to leverage complementary perspectives for more accurate action boundary detection. Our method employs a specialized encoder to extract motion features from localized temporal windows. These features are then integrated by HydraView, a multi-view and multi-scale temporal encoder that aggregates information across different perspectives to perform frame-level action detection. To mitigate the high computational overhead associated with managing long sequences in multi-view systems, we build HydraView upon the recent Mamba architecture, ensuring linear scaling and reduced inference time. Experimental results demonstrate that our approach outperforms several state-of-the-art TAD models on the BABEL and PKU-MMD datasets. Our code and pre-trained models will be made publicly available.

Method

Window feature encoding.

SWGCN. Motion features are learned over small temporal windows for each viewpoint. The encoder (SWGCN) is then frozen and used to extract window features with improved temporal consistency.
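The windowed encoding step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the stride, window size, and the stand-in encoder (a fixed projection replacing the frozen SWGCN) are all assumptions made for the example.

```python
import numpy as np

def extract_window_features(sequence, window_size, stride, encoder):
    """Slide a temporal window over an untrimmed sequence and encode
    each window independently with a frozen encoder."""
    T = sequence.shape[0]
    features = []
    for start in range(0, T - window_size + 1, stride):
        window = sequence[start:start + window_size]  # (W, J, C) skeleton window
        features.append(encoder(window))              # encoder is frozen: no training here
    return np.stack(features)                         # (num_windows, D)

# Stand-in for the frozen SWGCN (hypothetical): mean-pool over time and
# joints, then apply a fixed linear projection.
rng = np.random.default_rng(0)
proj = rng.standard_normal((3, 8))                    # C=3 coords -> D=8 feature dims
encoder = lambda w: w.mean(axis=(0, 1)) @ proj

seq = rng.standard_normal((100, 25, 3))               # 100 frames, 25 joints, xyz
feats = extract_window_features(seq, window_size=16, stride=8, encoder=encoder)
print(feats.shape)                                    # (11, 8)
```

With a stride smaller than the window size, neighbouring windows overlap, which is one common way to keep boundary information available to the temporal encoder that follows.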

Temporal Encoding with HydraView.

HydraView. Multi-view, multi-scale learning for temporal encoding with HydraView. HydraView is composed of multiple ViewMamba blocks, each responsible for a specific scale in view and time. A ViewMamba block applies a 2D convolution followed by a dilation module that scales the input for SS2D. The representations from all ViewMamba blocks are then processed by a Multi-Scale Fuser to produce the action predictions.
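The multi-scale structure can be sketched as below. This is a simplified NumPy illustration under stated assumptions: the dilation rates, the nearest-neighbour upsampling, and the averaging fuser are placeholders; the actual ViewMamba blocks (convolution plus SS2D) that would process each scale are elided.

```python
import numpy as np

def dilate(x, rate):
    """Temporal dilation: keep every `rate`-th window feature, giving
    coarser scales a wider effective receptive field."""
    return x[:, ::rate]                                # (V, T // rate, D)

def multi_scale_fuse(per_scale, T):
    """Placeholder Multi-Scale Fuser: nearest-neighbour upsample each
    scale back to full temporal resolution, then average."""
    up = [np.repeat(f, -(-T // f.shape[1]), axis=1)[:, :T] for f in per_scale]
    return np.mean(up, axis=0)                         # (V, T, D)

rng = np.random.default_rng(1)
x = rng.standard_normal((2, 32, 8))                    # V=2 views, T=32 windows, D=8
scales = [dilate(x, r) for r in (1, 2, 4)]             # three temporal scales
# ... each scale would be refined by its own ViewMamba block here ...
fused = multi_scale_fuse(scales, T=32)
print(fused.shape)                                     # (2, 32, 8)
```

The point of the sketch is the shape bookkeeping: every scale sees the same views but a coarser time axis, and the fuser brings them back to a common per-window resolution before prediction.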

Scanning strategies.

ViewScan. Our bidirectional scanning strategies along both the view and temporal dimensions. The scheme shows actions seen from two viewpoints; trailing empty cells indicate a missing skeleton detection.
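The scan orders themselves amount to flattening the (view, time) grid in different directions. The following sketch is an assumption about how such scans could be laid out, not the paper's exact ViewScan scheme: it produces four 1D sequences (time-major and view-major, each forwards and backwards) from a grid of window features.

```python
import numpy as np

def view_time_scans(x):
    """Flatten a (V, T, D) grid of window features into four 1D scan
    orders: time-major and view-major, each forward and reversed."""
    V, T, D = x.shape
    time_major = x.transpose(1, 0, 2).reshape(V * T, D)  # t0v0, t0v1, t1v0, ...
    view_major = x.reshape(V * T, D)                     # v0t0, v0t1, ..., v1t0, ...
    return [time_major, time_major[::-1],                # forward / backward in time
            view_major, view_major[::-1]]                # forward / backward across views

rng = np.random.default_rng(2)
x = rng.standard_normal((2, 5, 4))                       # 2 views, 5 windows, 4-dim features
scans = view_time_scans(x)
print(len(scans), scans[0].shape)                        # 4 (10, 4)
```

Running a selective-state-space model over each of these orders lets information propagate across views at a fixed time step as well as along time within a view, in both directions.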

Results

Results on PKU-MMD.

PKU-MMD. Results on two sequences from PKU-MMD v1 (cross-subject split). Our model effectively scales to long-form sequences (up to 4 minutes) with a large action vocabulary (51 classes). For visual clarity, we aggregate "others" instances into a single category, though they retain their individual class labels for evaluation. The results demonstrate the model's ability to maintain temporal consistency over extended durations.

Results on BABEL.

BABEL. Results on two sequences from BABEL (split 1). The predictions indicate the classification of each temporal window. Each action instance is correctly labeled; while some detections exhibit minor temporal shifts at the boundaries, they maintain an Intersection over Union (IoU) above 50%.