HUM4D provides synchronized multi-view RGB-D sequences aligned with professional Vicon motion capture ground truth, designed to benchmark markerless human motion capture under severe occlusion and multi-person interactions.
Marker-based motion capture (MoCap) systems have long been the gold standard for accurate 4D human modeling, yet their reliance on specialized hardware and markers limits scalability and real-world deployment. Advancing reliable markerless 4D human motion capture requires datasets that reflect the complexity of real-world human interactions. Yet, existing benchmarks often lack realistic multi-person dynamics, severe occlusions, and challenging interaction patterns, leading to a persistent domain gap. In this work, we present a new dataset and evaluation for complex 4D markerless human motion capture. Our proposed MoCap dataset captures both single and multi-person scenarios with intricate motions, frequent inter-person occlusions, rapid position exchanges between similarly dressed subjects, and varying subject distances. It includes synchronized multi-view RGB and depth sequences, accurate camera calibration, ground-truth 3D motion capture from a Vicon system, and corresponding SMPL/SMPL-X parameters. This setup ensures precise alignment between visual observations and motion ground truth. Benchmarking state-of-the-art markerless MoCap models reveals substantial performance degradation under these realistic conditions, highlighting limitations of current approaches. We further demonstrate that targeted fine-tuning improves generalization, validating the dataset's realism and value for model development. Our evaluation exposes critical gaps in existing models and provides a rigorous foundation for advancing robust markerless 4D human motion capture.
The dataset includes challenging scenarios such as Jittering, Identity Switching, Occlusion, and Near-Far Interaction.
Pipeline. Multi-view RGB-D capture is synchronized with Vicon motion capture. Marker trajectories are reconstructed and retargeted to SMPL to produce pose (θ), shape (β), and translation (t), along with evaluation-ready annotations.
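The alignment between visual observations and the Vicon ground truth relies on the released camera calibration: given intrinsics and extrinsics, 3D joints can be projected into any RGB view. The sketch below illustrates a standard pinhole projection; the intrinsic and extrinsic values are made-up placeholders, not the actual HUM4D calibration.

```python
import numpy as np

# Placeholder calibration (NOT the actual HUM4D values):
K = np.array([[615.0,   0.0, 320.0],   # fx,  0, cx
              [  0.0, 615.0, 240.0],   #  0, fy, cy
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                          # world-to-camera rotation
t = np.array([0.0, 0.0, 3.0])          # camera 3 m in front of the origin

def project(points_w, K, R, t):
    """Project Nx3 world-frame points to Nx2 pixel coordinates (pinhole model)."""
    p_cam = points_w @ R.T + t           # world -> camera frame
    p_img = p_cam @ K.T                  # apply intrinsics
    return p_img[:, :2] / p_img[:, 2:3]  # perspective divide

# Two toy "ground-truth joints" in the world frame.
joints_w = np.array([[0.0, 0.0, 0.0],
                     [0.0, 0.5, 0.0]])
uv = project(joints_w, K, R, t)
print(uv)  # first point lands at the principal point (320, 240)
```

Overlaying such projections on the RGB frames is a quick sanity check that a sequence's calibration and ground truth are loaded consistently.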
Capture Environment. Professional motion capture studio with 44 synchronized infrared Vicon cameras and a multi-view RGB-D setup.
Hardware Setup. From left to right: RGB-D camera perspective layout (1.45 m height), top-view circular arrangement (3 m radius), Intel RealSense D455 sensor, and the Vicon motion capture system.
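The circular rig geometry described above (3 m radius, 1.45 m mounting height) can be parameterized directly; this is useful, for instance, when simulating viewpoints. The camera count in this sketch is an assumption for illustration, not the actual number of RGB-D sensors.

```python
import numpy as np

RADIUS, HEIGHT = 3.0, 1.45   # rig dimensions from the setup description
N_CAMS = 8                   # ASSUMED camera count, for illustration only

def rig_positions(n=N_CAMS, r=RADIUS, h=HEIGHT):
    """Place n cameras evenly on a circle of radius r at height h."""
    angles = 2 * np.pi * np.arange(n) / n
    return np.stack([r * np.cos(angles),
                     r * np.sin(angles),
                     np.full(n, h)], axis=1)

cams = rig_positions()
# Every camera sits exactly RADIUS from the vertical axis through the center.
dists = np.linalg.norm(cams[:, :2], axis=1)
print(dists)  # all 3.0
```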
Dataset structure. HUM4D provides synchronized RGB-D sequences aligned with marker-based MoCap ground truth. We release evaluation-ready annotations and organized data in a hierarchical structure for easy navigation.
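A hierarchical release like this is typically navigated by indexing sequence directories under the dataset root. The directory names below are illustrative placeholders, not the actual HUM4D layout; consult the release for the real structure.

```python
import tempfile
from pathlib import Path

# Illustrative layout only (NOT the released HUM4D structure):
layout = [
    "seq_001/rgb/cam0/000000.png",
    "seq_001/depth/cam0/000000.png",
    "seq_001/annotations/smpl_params.npz",
    "seq_002/rgb/cam0/000000.png",
]

# Build a toy tree so the indexing below is runnable end to end.
root = Path(tempfile.mkdtemp())
for rel in layout:
    p = root / rel
    p.parent.mkdir(parents=True, exist_ok=True)
    p.touch()

def list_sequences(root):
    """Return the sorted names of sequence directories under the root."""
    return sorted(d.name for d in root.iterdir() if d.is_dir())

print(list_sequences(root))  # ['seq_001', 'seq_002']
```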
Behind the scenes. Footage from the HUM4D recording sessions, illustrating the multi-sensor setup and multi-person interactions.
You can download the dataset from the following link:
https://drive.google.com/drive/folders/1OnaU6yBmZEyM6ZM0C2IOoWplgkLKpo6P?usp=drive_link
For data included in HUM4D that are not publicly available, contact us at cszghp [at] gmail.com.
For questions, please contact cszghp [at] gmail.com.
The authors would like to thank Michael Walsh for his assistance with the human motion capture acquisition at the RELLIS Starlab facility at Texas A&M University (TAMU). We also thank Morgan Jenks for managing and operating the Vicon motion capture system and for overseeing all aspects of data acquisition. We further acknowledge valuable discussions and feedback from John Keyser of the Department of CSCE at TAMU. Additionally, we thank Jyothi Naidu for support in facilitating the IRB approval process.
@inproceedings{park2026hum4d,
title={A Dataset and Evaluation for Complex 4D Markerless Human Motion Capture},
author={Park, Yeeun and Naduthodi, Miqdad and Kumar, Suryansh},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}