Markerless tracking of complex human motions from multiple views

Published on Nov 1, 2006 in Computer Vision and Image Understanding (3.121)
DOI: 10.1016/J.CVIU.2006.07.010
R. Kehl (ETH Zurich), estimated H-index: 4
Luc Van Gool (Katholieke Universiteit Leuven), estimated H-index: 127
Abstract
We present a method for markerless tracking of complex human motions from multiple camera views. In the absence of markers, the task of recovering the pose of a person during such motions is challenging and requires strong image features and robust tracking. We propose a solution which integrates multiple image cues such as edges, color information and volumetric reconstruction. We show that a combination of multiple image cues helps the tracker to overcome ambiguous situations such as limbs touching or strong occlusions of body parts. Following a model-based approach, we match an articulated body model built from superellipsoids against these image cues. Stochastic Meta Descent (SMD) optimization is used to find the pose which best matches the images. Stochastic sampling makes SMD robust against local minima and lowers the computational costs as a small set of predicted image features is sufficient for optimization. The power of SMD is demonstrated by comparing it to the commonly used Levenberg-Marquardt method. Results are shown for several challenging sequences showing complex motions and full articulation, with tracking of 24 degrees of freedom in ≈ 1 frame per second.
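The abstract mentions a body model built from superellipsoids and an optimizer that works on a small random subset of features. As a rough illustration only, the sketch below evaluates the standard superellipsoid inside-outside function and scores how many randomly sampled 3D points fall inside one shape. It is not the authors' implementation: the function names, the semi-axes, the exponents, and the random point cloud are all hypothetical, and the fraction-inside score is a stand-in for the paper's actual matching cost.

```python
import random


def inside_outside(point, radii, eps1, eps2):
    """Standard superellipsoid inside-outside function (local frame).

    Returns a value < 1 inside the shape, 1 on its surface, > 1 outside.
    """
    x, y, z = point
    a1, a2, a3 = radii
    xy_term = (abs(x / a1) ** (2.0 / eps2)
               + abs(y / a2) ** (2.0 / eps2)) ** (eps2 / eps1)
    return xy_term + abs(z / a3) ** (2.0 / eps1)


def sampled_fit(points, radii, eps1, eps2, sample_size, rng):
    """Fraction of a random subset of points lying inside the shape.

    Hypothetical score: only a subset is evaluated, mimicking the idea
    that a small random sample can stand in for the full data set.
    """
    subset = rng.sample(points, min(sample_size, len(points)))
    if not subset:
        return 0.0
    inside = sum(1 for p in subset
                 if inside_outside(p, radii, eps1, eps2) <= 1.0)
    return inside / len(subset)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical "reconstruction": random points in a 2 x 2 x 2 cube.
    cloud = [tuple(rng.uniform(-1.0, 1.0) for _ in range(3))
             for _ in range(2000)]
    limb = (0.3, 0.3, 0.9)  # hypothetical semi-axes of an elongated limb
    print(sampled_fit(cloud, limb, 0.5, 1.0, 200, rng))
```

With both exponents set to 1 the shape reduces to an ordinary ellipsoid; smaller values of `eps1` make the ends of the limb more box-like.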
References (37)
#1 Tomas Svoboda (CTU: Czech Technical University in Prague), H-Index: 21
#2 Daniel Martinec (CTU: Czech Technical University in Prague), H-Index: 8
Last. Tomas Pajdla (CTU: Czech Technical University in Prague), H-Index: 52
Virtual immersive environments or telepresence setups often consist of multiple cameras that have to be calibrated. We present a convenient method for doing this. The minimum is three cameras, but there is no upper limit. The method is fully automatic and a freely moving bright spot is the only calibration object. A set of virtual 3D points is made by waving the bright spot through the working volume. Its projections are found with subpixel precision and verified by a robust RANSAC analysis. The...
429 Citations
Jun 20, 2005 in CVPR (Computer Vision and Pattern Recognition)
#1 R. Kehl, H-Index: 4
#2 M. Bray, H-Index: 6
Last. L. Van Gool, H-Index: 73
We present a novel approach for full body pose tracking using stochastic sampling. A volumetric reconstruction of a person is extracted from silhouettes in multiple video images. Then, an articulated body model is fitted to the data with stochastic meta descent (SMD) optimization. By comparing even a simplified version of SMD to the commonly used Levenberg-Marquardt method, we demonstrate the power of stochastic compared to deterministic sampling, especially in cases of noisy and incomplete data...
113 Citations
#1 A. Griesser, H-Index: 5
#2 Stefaan De Roeck, H-Index: 2
Last. Luc Van Gool, H-Index: 127
(4 authors in total)
We present a GPU-based foreground-background segmentation that processes image sequences in less than 4 ms per frame. Change detection with respect to the background is based on a color similarity test in a small pixel neighbourhood, and is integrated into a Bayesian estimation framework. An iterative MRF-based model is applied, exploiting parallelism on modern graphics hardware. Resulting segmentation exhibits compactness and smoothness in foreground areas as well as for inter-frame temporal contiguity. Fur...
50 Citations
#1 Christian Theobalt (MPG: Max Planck Society), H-Index: 79
#2 Joel Carranza (MPG: Max Planck Society), H-Index: 7
Last. Marcus Magnor (MPG: Max Planck Society), H-Index: 6
High-quality nonintrusive human motion capture is necessary for acquisition of model-based free-viewpoint video of human actors. Silhouette-based approaches have demonstrated that they are able to accurately recover a large range of human motion from multiview video. However, they fail to make use of all available information, specifically that of texture information. This paper presents an algorithm that uses motion fields constructed from optical flow in multiview video sequences. The use of m...
43 Citations
#1 I. Mikic, H-Index: 18
#2 Mohan M. Trivedi (UCSD: University of California, San Diego), H-Index: 86
Last. Pamela C. Cosman (UCSD: University of California, San Diego), H-Index: 44
(4 authors in total)
We present an integrated system for automatic acquisition of the human body model and motion tracking using input from multiple synchronized video streams. The video frames are segmented and the 3D voxel reconstructions of the human body shape in each frame are computed from the foreground silhouettes. These reconstructions are then used as input to the model acquisition and tracking algorithms. The human body model consists of ellipsoids and cylinders and is described using the twists framework...
297 Citations
#1 Peter Sand, H-Index: 1
#2 Leonard McMillan, H-Index: 1
Last. Jovan Popović, H-Index: 1
We describe a method for the acquisition of deformable human geometry from silhouettes. Our technique uses a commercial tracking system to determine the motion of the skeleton, then estimates geome...
53 Citations
Jul 1, 2003 in SIGGRAPH (International Conference on Computer Graphics and Interactive Techniques)
#1 Joel Carranza, H-Index: 7
#2 Christian Theobalt, H-Index: 79
Last. Hans-Peter Seidel, H-Index: 118
(4 authors in total)
In free-viewpoint video, the viewer can interactively choose a viewpoint in 3-D space to observe the action of a dynamic real-world scene from arbitrary perspectives. The human body and its motion play a central role in most visual media, and their structure can be exploited for robust motion estimation and efficient visualization. This paper describes a system that uses multi-view synchronized video footage of an actor's performance to estimate motion parameters and to interactively re-render t...
510 Citations
Jul 1, 2003 in SIGGRAPH (International Conference on Computer Graphics and Interactive Techniques)
#1 Peter Sand (MIT: Massachusetts Institute of Technology), H-Index: 5
#2 Leonard McMillan (UNC: University of North Carolina at Chapel Hill), H-Index: 54
Last. Jovan Popović (MIT: Massachusetts Institute of Technology), H-Index: 40
We describe a method for the acquisition of deformable human geometry from silhouettes. Our technique uses a commercial tracking system to determine the motion of the skeleton, then estimates geometry for each bone using constraints provided by the silhouettes from one or more cameras. These silhouettes do not give a complete characterization of the geometry for a particular point in time, but when the subject moves, many observations of the same local geometries allow the construction of a comp...
127 Citations
Jun 18, 2003 in CVPR (Computer Vision and Pattern Recognition)
#1 Kong-Man (German) Cheung (CMU: Carnegie Mellon University), H-Index: 5
#2 Simon Baker (CMU: Carnegie Mellon University), H-Index: 70
Last. Takeo Kanade (CMU: Carnegie Mellon University), H-Index: 160
Shape-from-silhouette (SFS), also known as visual hull (VH) construction, is a popular 3D reconstruction method, which estimates the shape of an object from multiple silhouette images. The original SFS formulation assumes that the entire silhouette images are captured either at the same time or while the object is static. This assumption is violated when the object moves or changes shape. Hence the use of SFS with moving objects has been restricted to treating each time instant sequentially and ...
356 Citations
#2 Marc Lapierre, H-Index: 6
Last. Edmond Boyer, H-Index: 5
(4 authors in total)
In this paper, we show how to capture an actor with no intrusive trackers and without any special environment like blue set, how to estimate its 3D-geometry and how to insert this geometry into a virtual world in real-time. We use several cameras in conjunction with background subtraction to produce silhouettes of the actor as observed from the different camera viewpoints. These silhouettes allow the 3D-geometry of the actor to be estimated by a voxel based method. This geometry is rendered with...
45 Citations
Cited By (94)
We present a new method to capture detailed human motion, sampling more than 1000 unique points on the body. Our method outputs highly accurate 4D (spatio-temporal) point coordinates and, crucially, automatically assigns a unique label to each of the points. The locations and unique labels of the points are inferred from individual 2D input images only, without relying on temporal tracking or any human body shape or skeletal kinematics models. Therefore, our captured point trajectories contain a...
#1 Nor Azrini Jaafar (UTM: Universiti Teknologi Malaysia), H-Index: 1
#2 Nor Azman Ismail (UTM: Universiti Teknologi Malaysia), H-Index: 6
Last. Yusman Azimi Yusoff (UTM: Universiti Teknologi Malaysia), H-Index: 4
In Muslim life, there is an important ritual that they need to do in their daily lives, a prayer known as salat. There was evidence that showed performing salat correctly is good for better health. This paper developed a motion recognition system for salat movement using a cooperative multisensor approach based on salat law. Existing work in this related field could recognize a few salat movements; however, they could not cover salat movements based on salat law by using a single camera. This pa...
1 Citation
#1 Yichao Yan (SJTU: Shanghai Jiao Tong University), H-Index: 10
#2 Bingbing Ni (SJTU: Shanghai Jiao Tong University), H-Index: 41
Last. Xiaokang Yang (SJTU: Shanghai Jiao Tong University), H-Index: 70
(5 authors in total)
Video generation is a challenging task due to the extremely high-dimensional distribution of the solution space. Good constraints in the solution domain would thus reduce the difficulty of approximating optimal solutions. In this paper, instead of directly generating high-dimensional video data, we propose using object landmarks as explicit structure constraints to address this issue. Specifically, we propose a two-stage framework for an action-conditioned video generation task. In our framework...
3 Citations
Jun 15, 2019 in CVPR (Computer Vision and Pattern Recognition)
#1 Aliaksandra Shysheya (Samsung), H-Index: 3
#2 Dmitry Ulyanov (Samsung), H-Index: 17
Last. Igor Pasechnik (Samsung), H-Index: 3
(12 authors in total)
We present a system for learning full body neural avatars, i.e. deep networks that produce full body renderings of a person for varying body pose and varying camera pose. Our system takes the middle path between the classical graphics pipeline and the recent deep learning approaches that generate images of humans using image-to-image translation. In particular, our system estimates an explicit two-dimensional texture map of the model surface. At the same time, it abstains from explicit shape mod...
51 Citations
#1 Hanbyul Joo (CMU: Carnegie Mellon University), H-Index: 12
#2 Tomas Simon (CMU: Carnegie Mellon University), H-Index: 22
Last. Yaser Sheikh (CMU: Carnegie Mellon University), H-Index: 46
(13 authors in total)
We present an approach to capture the 3D motion of a group of people engaged in a social interaction. The core challenges in capturing social interactions are: (1) occlusion is functional and frequent; (2) subtle motion needs to be measured over a space large enough to host a social group; (3) human appearance and configuration variation is immense; and (4) attaching markers to the body may prime the nature of interactions. The Panoptic Studio is a system organized around the thesis that social ...
132 Citations
Jan 5, 2018 in CVPR (Computer Vision and Pattern Recognition)
#1 Hanbyul Joo (CMU: Carnegie Mellon University), H-Index: 12
#2 Tomas Simon, H-Index: 22
Last. Yaser Sheikh (Facebook), H-Index: 46
We present a unified deformation model for the markerless capture of human movement at multiple scales, including facial expressions, body motion, and hand gestures. An initial model is generated by locally stitching together models of the individual parts of the human body, which we refer to as "Frank". This model enables the full expression of part movements, including face and hands, by a single seamless model. We capture a dataset of people wearing everyday clothes and optimize the Frank mod...
225 Citations
Oct 19, 2017 in MM (ACM Multimedia)
#1 Yichao Yan (SJTU: Shanghai Jiao Tong University), H-Index: 10
#2 Jingwei Xu (SJTU: Shanghai Jiao Tong University), H-Index: 22
Last. Xiaokang Yang (SJTU: Shanghai Jiao Tong University), H-Index: 70
(5 authors in total)
This work makes the first attempt to generate articulated human motion sequence from a single image. On one hand, we utilize paired inputs including human skeleton information as motion embedding and a single human image as appearance reference, to generate novel motion frames based on the conditional GAN infrastructure. On the other hand, a triplet loss is employed to pursue appearance smoothness between consecutive frames. As the proposed framework is capable of jointly exploiting the image ap...
69 Citations
#1 Sung Soo Hwang (KAIST), H-Index: 3
#2 Hee-Dong Kim (KAIST), H-Index: 3
Last. Seong-Dae Kim (KAIST), H-Index: 13
(7 authors in total)
This paper presents an image-based object reconstruction with a low memory footprint using run-length representation. While conventional volume-based approaches, which utilize voxels as primitives, are intuitive and easy to manipulate 3D data, they require a large amount of memory and computation during the reconstruction process. To overcome these burdens, this paper uses 3D runs to represent a 3D object and reconstructs each 3D run from multi-view silhouettes with a small amount of memory. The...
3 Citations
#1 Piotr Szczuko (GUT: Gdańsk University of Technology), H-Index: 10
The article presents a method for video anonymization and replacing real human silhouettes with virtual 3D figures rendered on a screen. Video stream is processed to detect and to track objects, whereas anonymization stage employs animating avatars accordingly to behavior of detected persons. Location, movement speed, direction, and person height are taken into account during animation and rendering phases. This approach requires a calibrated camera, and utilizes results of visual object trackin...
2 Citations
#1 Glyn Lawson (University of Nottingham), H-Index: 16
#2 Davide Salanitri (University of Nottingham), H-Index: 6
Last. Brian Waterfield (Jaguar Land Rover), H-Index: 5
Virtual Reality (VR) can reduce time and costs, and lead to increases in quality, in the development of a product. Given the pressure on car companies to reduce time-to-market and to continually improve quality, the automotive industry has championed the use of VR across a number of applications, including design, manufacturing, and training. This paper describes interviews with 11 engineers and employees of allied disciplines from an automotive manufacturer about their current physical...
107 Citations