Detecting and Segmenting Humans in Crowded Scenes
ACM Multimedia Conference (ACMMM) 2007, Augsburg, Germany.
Introduction
Accurately detecting and segmenting humans in video sequences is an essential component of many applications, such as dynamic scene analysis, human-computer interface design, driver assistance systems, and the development of intelligent environments. Nevertheless, human detection poses numerous challenges. Effective solutions must account not only for the nearly 250 degrees of freedom of which the human body is capable, but also for the variability introduced by factors such as different clothing styles and occluding accessories such as backpacks and briefcases. Furthermore, many scenes, such as urban environments, contain substantial amounts of clutter and occlusion.
The shape of the human silhouette is often very different from the shape of other objects in a scene. Therefore, shape-based detection of humans represents a powerful cue which can be integrated into existing lines of research. As opposed to the appearance-based models, human shapes tend to be somewhat isotropic. Hence, shape-based methods coupled with other cues, such as motion, can provide a discriminating factor for recognition.
In this work we address the challenge of detecting and segmenting humans in video sequences containing crowded real-world scenes. Due to the difficulty of this problem, reliance on any single global model or feature alone would be ineffective. Therefore, successful approaches must be capable of integrating both global and local shape cues.
Detecting Humans in Cluttered Scenes
The method begins by learning a set of global posture clusters which are used to initialize segmentation. Additionally, we learn a codebook of local shape distributions based on humans in the training set. When the system is presented with a new frame from the testing video sequence, it extracts contours from the foreground blobs in each frame, samples them using shape context, finds instances of the learned local shape codebook, and casts votes for human locations and their respective postures in the frame. Subsequently, the system searches for consistent hypotheses by finding maxima within the voting space. Given the locations and postures of humans in the scene, the method proceeds to segment each subject. This is achieved by projecting the mean posture shape corresponding to the posture cluster of every consistent hypothesis around the centroid vote.
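The final projection step can be illustrated with a minimal sketch. It assumes that each posture cluster's mean shape is stored as a small binary mask and that a hypothesis is a (row, column) centroid; the function name `project_posture` and the label-map output are hypothetical choices made for illustration, not details taken from the paper.

```python
def project_posture(frame_h, frame_w, mean_mask, centroid, label, out=None):
    """Stamp a binary mean-posture mask into a label map, centred on `centroid`.

    mean_mask: 2D list of 0/1 values (assumed representation of a cluster's mean shape).
    centroid:  (row, col) of the voted human centre.
    label:     integer id written into the label map for this subject.
    """
    if out is None:
        out = [[0] * frame_w for _ in range(frame_h)]
    mask_h, mask_w = len(mean_mask), len(mean_mask[0])
    cy, cx = centroid
    # Top-left corner of the mask so that its centre lands on the centroid.
    top, left = cy - mask_h // 2, cx - mask_w // 2
    for r in range(mask_h):
        for c in range(mask_w):
            y, x = top + r, left + c
            # Clip silently at the frame border; subjects may be partly out of view.
            if mean_mask[r][c] and 0 <= y < frame_h and 0 <= x < frame_w:
                out[y][x] = label
    return out


# Usage: a 3x3 cross-shaped "posture" stamped at the centre of a 5x5 frame.
cross = [[0, 1, 0],
         [1, 1, 1],
         [0, 1, 0]]
segmentation = project_posture(5, 5, cross, centroid=(2, 2), label=1)
```

Passing the returned map back in through `out` with a different `label` lets several hypotheses share one segmentation map; in this sketch later subjects simply overwrite earlier ones where they overlap.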
Given a testing video sequence, at each frame we extract contours from the foreground blobs produced by background subtraction. These contours are then sampled using shape context, producing a series of shape context descriptors for each contour. Each descriptor is then compared to the learned codebook of local shapes. If a match is found, the corresponding codebook entry casts votes for the possible centroid of a human in the scene (Figure 4) and for the posture cluster to which it belongs. Votes are aggregated in a voting space, and Mean-Shift is used to find the maxima.
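The voting and mode-finding steps can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: descriptor-to-codebook matching is taken as already done, each codebook entry is assumed to store a list of (centroid offset, posture id, weight) tuples under a hypothetical `"offsets"` key, and the Mean-Shift uses a flat kernel with an arbitrary bandwidth.

```python
import math


def cast_votes(matches):
    """Turn matched contour samples into centroid votes.

    matches: list of ((x, y), entry), where entry["offsets"] is a list of
             ((dx, dy), posture_id, weight) tuples (assumed codebook layout).
    Returns a list of (vote_x, vote_y, posture_id, weight).
    """
    votes = []
    for (x, y), entry in matches:
        for (dx, dy), posture, weight in entry["offsets"]:
            votes.append((x + dx, y + dy, posture, weight))
    return votes


def mean_shift_modes(votes, bandwidth=8.0, iters=30, min_weight=2.0):
    """Find maxima of the voting space with a flat-kernel Mean-Shift."""
    modes = []
    for vx, vy, posture, weight in votes:
        mx, my = float(vx), float(vy)
        for _ in range(iters):
            sw = sx = sy = 0.0
            for x, y, _, w in votes:
                if math.hypot(x - mx, y - my) <= bandwidth:
                    sw += w
                    sx += w * x
                    sy += w * y
            if sw == 0.0:
                break
            nx, ny = sx / sw, sy / sw
            if math.hypot(nx - mx, ny - my) < 1e-3:
                break  # converged
            mx, my = nx, ny
        # Merge this vote into a nearby mode, or open a new one.
        for mode in modes:
            if math.hypot(mode["x"] - mx, mode["y"] - my) <= bandwidth / 2:
                break
        else:
            mode = {"x": mx, "y": my, "weight": 0.0, "postures": {}}
            modes.append(mode)
        mode["weight"] += weight
        mode["postures"][posture] = mode["postures"].get(posture, 0.0) + weight
    # Keep well-supported modes; the posture is the weighted majority vote.
    hypotheses = [
        (m["x"], m["y"], max(m["postures"], key=m["postures"].get), m["weight"])
        for m in modes
        if m["weight"] >= min_weight
    ]
    return sorted(hypotheses, key=lambda h: -h[3])


# Usage: three samples agree on a centroid near (50, 50); one is an outlier.
head = {"offsets": [((0, 10), 2, 1.0)]}   # e.g. a head-like fragment votes downward
feet = {"offsets": [((0, -10), 2, 1.0)]}  # e.g. a foot-like fragment votes upward
matches = [((50, 40), head), ((51, 61), feet), ((49, 39), head), ((200, 100), head)]
hypotheses = mean_shift_modes(cast_votes(matches))
```

The `min_weight` threshold stands in for whatever consistency criterion is used to accept a hypothesis; here it simply discards the isolated vote.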
Evaluation
We evaluated the performance of our system for detecting and segmenting humans on a set of challenging video sequences containing significant amounts of partial occlusion. Furthermore, the videos included in the testing procedure featured humans performing a diverse set of activities within different contexts, such as walking on a busy city street, running a marathon, playing soccer, and participating in a crowded festival.
Our training set ranged from 700 to 1,100 frames drawn from the various video sequences containing human samples. Metadata in the training set included the centroid of each silhouette, along with the posture cluster to which it belonged. Our testing database covered a wide range of scenes, totalling 34,100 frames and containing 312 humans whose torsos were visible. The size of the humans across the video sequences averaged 22x52 pixels.
Conclusion
We have presented a framework for detecting and segmenting humans in real-world crowded scenes which integrates both local and global shape cues. Cluttered scenes containing many occlusions render global shape representations ineffective when used alone. Instead, we aggregate local shape evidence via a codebook of local shape distributions for humans in various postures. Additionally, we found that a set of learned global posture clusters aids the segmentation process. Our experiments indicate that local shape distribution represents a powerful cue which can be integrated into existing lines of research.
Resources
Camera-ready version of the paper