People Detection
Ali Taalimi, 11/19/2012


2 Outline
- Object Detection: macro scheme, main components
- People Detection
- Crowd
- Suggested Approach
12/1/2017 Slide 2/90

3 Macro Scheme
- Pipeline blocks: Input Frames -> Object Detection -> Object Tracking -> Scene Understanding
- Supporting components: Scene Modeling (for detection), Temporal Coherency (for tracking)

4 Macro Scheme: Bottom-Up Scheme
- Diagram components: Frames -> Patch Detector -> Patch List -> Patch Classifier -> Identified Objects -> Grouping -> Object List -> Tracking
- The patch classifier relies on training a class-specific patch model.
- Source: "A Review of Computer Vision Techniques for the Analysis of Urban Traffic," 2011

5 Macro Scheme: Top-Down Scheme
- Diagram components: Frames -> Background Model (history) -> Foreground Estimation -> Mask -> Grouping (connected components) -> Silhouette List -> Classifier -> Object List -> Tracking -> Trajectories
- The classifier uses fixed rules or training.
- Source: "A Review of Computer Vision Techniques for the Analysis of Urban Traffic," 2011

6 Main Components
- Feature extraction: a method to extract relevant information from the image area occupied by a target. It can be as simple as extracting low-level features (color), or involve object classification, change detection, or motion classification.
- Target representation: encodes the appearance and shape of a target, defining the characteristics used by the tracker. It should be descriptive enough to cope with clutter and similar targets.
- State propagation: a method to propagate the state of the target over time. It links different instances of the same object across frames and has to compensate for occlusion, clutter, and illumination changes.

7 Features
- Low-level features:
  - Color: RGB, CIE, CIELUV, HSI
  - Gradient: local intensity changes within the object (different reflectance of object parts, e.g., skin vs. hair) and at the boundary (different reflectance of object vs. background); Laplacian of Gaussian (LoG) / Difference of Gaussians (DoG)
  - Motion: to detect and localize objects over time (optical flow)
  - Low-level features alone cannot describe image contents completely.
- Mid-level features: subsets of pixels that represent structures (edges, interest points/regions). Interest-point detectors select highly distinctive features (e.g., corners) that can be localized across multiple frames under pose and illumination changes.
- High-level features: detect the object as a whole based on its appearance, by grouping mid-level features, background modeling, or object modeling. Object modeling learns representative features of a predefined class of targets (example: color-based segmentation for face detection).
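To make the gradient feature above concrete, here is a small orientation histogram (a HOG-style cell). This is an illustrative sketch, not code from the slides; the central-difference gradient and the unsigned-orientation binning are my assumptions:

```python
import math

def gradient_orientation_histogram(img, n_bins=8):
    """Histogram of gradient orientations for a small grayscale image.

    `img` is a list of rows of intensities; gradients are computed by
    central differences, and each interior pixel votes into an
    orientation bin weighted by its gradient magnitude.
    """
    h, w = len(img), len(img[0])
    hist = [0.0] * n_bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = img[y][x + 1] - img[y][x - 1]
            gy = img[y + 1][x] - img[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue
            # orientation folded to [0, pi): unsigned gradients, as in HOG
            theta = math.atan2(gy, gx) % math.pi
            b = min(int(theta / math.pi * n_bins), n_bins - 1)
            hist[b] += mag
    return hist
```

For a vertical step edge, all the mass lands in the horizontal-gradient bin, which is the kind of boundary response the slide describes.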

8 Target Representation
- How can we define a target in terms of its shape and appearance? A target representation is a model of the object, based on its shape and appearance, that is used by a tracking algorithm.
- Shape and appearance information can be encoded at different levels of resolution. Examples: a bounding box or a deformable contour to approximate the shape of the target; a pdf of appearance features computed within the target area to encode appearance.
- Uncertainty factors such as illumination changes, clutter, target interaction, and occlusion should be accounted for when choosing how to represent the target.

9 Target Representation: Shape Representation
- Basic, articulated, and deformable representations.
- A. Yilmaz, O. Javed, and M. Shah, "Object tracking: A survey," ACM Comput. Surv., vol. 38, no. 4, pp. 1–45, 2006.

10 Target Representation: Shape Representation, Basic Models
- Point approximation. Limitation: target occlusion (points provide no estimate of target size).
- Area approximation: bound the target with a rectangle or ellipse. Motion, color, and gradient of the entire target area are used for tracking, which is performed by estimating the parameters (center, axes) of possible transformations of these approximated shapes. Limitation: lack of depth.
- Volume approximation: occlusions between multiple objects can be handled using the spatial volume occupied by the target. Limitation: lack of generality (only for objects available in the training dataset).
- Articulated models: approximate the shape of the target by combining a set of rigid models with topological connections and motion constraints, e.g., full-body tracking of a human using the topology of the human skeleton.

11 Target Representation: Shape Representation, Deformable Models
- Shape rigidity or kinematic assumptions do not hold for all target classes: prior information on the object shape may not be available, or the object has deformations that are not well modeled by canonical joints.
- Fluid models: instead of tracking the entire target area, interest points are identified on the object and used to track its parts, with no explicit motion constraints between the parts. Example: tracking detectable corners of the object, which gives stable tracking under occlusion. Problem: how to group the points (which of them belong to the same object?).
- Contours: a more accurate description of the target. Contour-based trackers use a set of control points positioned along the contour; the concatenation of the control-point coordinates is the representation. They can use prior information on the target shape.

12 Part-Based Model
- A model for an object consists of a global "root" filter and several part models; each part model specifies a spatial model and a part filter.
- The spatial model defines a set of allowed placements for a part relative to the detection window, and a deformation cost for each placement.
- Both root and part filters are scored by computing the dot product between a set of weights and histogram of oriented gradients (HOG) features within a window.
- Two scales: coarse features are captured by a rigid template of the entire detection window; finer-scale features are captured by higher-resolution part templates placed relative to the detection window.
- Handling partially labeled data is a significant issue in machine learning for computer vision: the position of each part is treated as a latent variable. The system uses a scanning-window approach.
- P. Felzenszwalb, "A discriminatively trained, multiscale, deformable part model," CVPR 2008.
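The root-plus-parts score above can be sketched in a few lines. This is a toy illustration of the latent-placement scoring rule only; the feature vectors, part weights, and deformation costs are hypothetical, not Felzenszwalb's actual HOG pipeline:

```python
def dpm_window_score(root_feat, root_w, part_feats, part_ws, def_costs):
    """Toy part-based model score for one detection window.

    root_feat: feature vector of the whole window (coarse scale).
    part_feats[p][l]: feature vector of part p at allowed placement l.
    def_costs[p][l]: deformation cost of placing part p at l.
    Each part contributes its best placement: max over l of
    (part filter response - deformation cost), with the placement
    treated as a latent variable.
    """
    dot = lambda w, f: sum(wi * fi for wi, fi in zip(w, f))
    score = dot(root_w, root_feat)
    for p in range(len(part_ws)):
        score += max(dot(part_ws[p], part_feats[p][l]) - def_costs[p][l]
                     for l in range(len(part_feats[p])))
    return score
```

A detector would evaluate this score for every window of a scanning-window search and threshold it.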

13 Target Representation: Appearance Representation
- The appearance representation is a model of the expected projection of the object appearance onto the image plane. Unlike shape models, it may be specific to a single object and need not generalize across objects of the same class.
- It is usually paired with a function that, given the image, estimates the likelihood of the object being in a particular state.
- Pipeline: input image + target position -> feature extraction (color, gradient) -> appearance representation (template, histogram) -> learning -> target model.
- Template: encodes the positional information of the color values of all pixels within the target area.
- Histogram: of color, of gradient.
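A minimal sketch of the histogram variant of the appearance representation (grayscale only; the bin count and the Bhattacharyya coefficient as a similarity measure are common choices assumed here, not taken from the slides):

```python
def color_histogram(pixels, n_bins=4):
    """Normalized intensity histogram over a target region.

    `pixels` are grayscale values in [0, 256). Positional information
    is discarded, which makes the representation robust to small
    deformations, unlike a template.
    """
    hist = [0.0] * n_bins
    for v in pixels:
        hist[min(v * n_bins // 256, n_bins - 1)] += 1.0
    total = sum(hist)
    return [h / total for h in hist]

def bhattacharyya(p, q):
    """Similarity between two normalized histograms (1 = identical)."""
    return sum((pi * qi) ** 0.5 for pi, qi in zip(p, q))
```

A tracker can score a candidate region by comparing its histogram to the stored target model with `bhattacharyya`.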

14 Outline
- Object Detection
- People Detection: ROI Selection, Background Modeling, Classification, Tracking
- Crowd
- Suggested Approach

15 Challenges (example images: Jitendra Malik)

16 People Detection
- Main components of a people detection system: hypothesis generation (ROI selection), classification (model matching, verification), tracking (temporal integration), and evaluation.
- ROI selection is initialized by general low-level features or prior scene knowledge.
- Classification and tracking require models of the people class, in terms of geometry, appearance, or dynamics.
- Training and use: target-class and non-target-class samples -> learning -> class model; new images + class model -> classification -> detected candidate targets.
- Source: "Monocular Pedestrian Detection: Survey and Experiments"

17 ROI Selection
- Sliding-window technique: detector windows at various scales and locations are shifted over the image. Computational costs are often too high.
- Speedups come from coupling with a classifier cascade, or from restricting the search space using prior information about the target object class:
  - geometry of pedestrians, e.g., object height or aspect ratio
  - features derived from the image data
  - object motion (background subtraction)
  - interest-point detectors (e.g., ISM), which recover regions with high information content from local discontinuities of the image brightness function, often occurring at object boundaries
- Confidence density of the detector:
  - Sliding window: the density is implicitly sampled on a discrete 3-D grid (location and scale) by evaluating the detection windows with a classifier.
  - Feature based: the density is explicitly created bottom-up through probabilistic votes cast by matching local features.
- Source: "Monocular Pedestrian Detection: Survey and Experiments"
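The cost argument above is easy to see by enumerating the windows a sliding-window detector must classify. This sketch (base window size, scales, and stride are made-up parameters) returns the candidate boxes; restricting scales or locations with scene priors shrinks this list directly:

```python
def sliding_windows(img_w, img_h, base_w, base_h, scales, stride=8):
    """Enumerate detector windows over locations and scales.

    Returns (x, y, w, h) boxes; a classifier would be evaluated on
    each one, which is why exhaustive search is expensive and why
    ROI restriction (geometry, motion) matters.
    """
    boxes = []
    for s in scales:
        w, h = int(base_w * s), int(base_h * s)
        for y in range(0, img_h - h + 1, stride):
            for x in range(0, img_w - w + 1, stride):
                boxes.append((x, y, w, h))
    return boxes
```

Even this tiny example produces 15 windows for a single scale on a 64x128 image; real detectors evaluate orders of magnitude more.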

18 Classification
- Receives a list of ROIs that are likely to contain a person; verification (classification) works on people appearance models, using various spatial and temporal cues.
- A given image/subregion is assigned to either the people or non-people class based on its class posterior probabilities: the probability of a person being in that region, given a model.
- How to estimate the posterior probability: generative and discriminative models.
- Challenges:
  - missing detections: not all people are detected in each frame
  - false positive detections: non-people detected as people
  - without depth or scene information (e.g., a ground plane), the detector does not know where in the image to expect objects of which size
- Source: "Monocular Pedestrian Detection: Survey and Experiments"
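The posterior decision above follows directly from Bayes' rule. A minimal sketch (the likelihood values and the prior are placeholders; real systems estimate them from the generative or discriminative models just mentioned):

```python
def pedestrian_posterior(lik_ped, lik_bg, prior_ped=0.01):
    """Class posterior P(person | x) from class-conditional likelihoods.

    lik_ped = p(x | person), lik_bg = p(x | background). The prior
    for the person class is typically small; a region is labeled
    'person' when the posterior exceeds a decision threshold.
    """
    num = lik_ped * prior_ped
    den = num + lik_bg * (1.0 - prior_ped)
    return num / den
```

With equal priors the posterior reduces to the likelihood ratio, which is why a small person-class prior pushes the decision threshold up and trades missed detections against false positives.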

19 Classification
- D. Geronimo et al., "Survey of pedestrian detection for advanced driver assistance systems," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 32, no. 7, pp. 1239–1258, 2010.

20 Classification: Generative Models
- Generative approaches model the appearance of the people class in terms of its class-conditional density function; combined with the class priors, the posterior probability for the people class can be inferred using a Bayesian approach.
- Shape models (the focus here is on 2-D people shape models, commonly learned from shape contour examples) reduce variations in appearance due to lighting or clothing:
  - Discrete approaches represent the shape manifold by a large set of exemplar shapes.
  - Continuous shape models use a parametric representation of the class-conditional density, learned from a set of training shapes.
- Combined shape and texture models: combine shape and texture information within a compound parametric appearance model, with separate statistical models for shape and intensity variations. The texture cue represents the variation of the intensity pattern across the image region of target objects.
- Source: "Monocular Pedestrian Detection: Survey and Experiments"

21 Classification: Discriminative Models
- Approximate the Bayesian MAP decision by learning the parameters of a discriminant function (decision boundary) between the people and non-people classes from training examples.
- Features:
  - Haar wavelets
  - codebook feature patches
  - gradient orientation histograms: dense (HOG) and sparse (SIFT)
  - spatial configurations of salient edge-like structures: shapelets, edgelets
  - spatiotemporal features to capture human motion, especially gait
- Classifier architectures (determine an optimal decision boundary between pattern classes in a feature space):
  - feed-forward multilayer neural networks
  - linear/nonlinear support vector machines (SVMs)
  - AdaBoost, where the cascade structure is tuned to detect almost all people while rejecting non-people as early as possible
- Source: "Monocular Pedestrian Detection: Survey and Experiments"
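A minimal sketch of the AdaBoost architecture named above, using 1-D threshold stumps as weak learners (the kind of weak rule cascade detectors boost over). The training data in the test are made up; this shows the boosting loop only, not a cascade:

```python
import math

def train_adaboost(X, y, n_rounds=5):
    """Discrete AdaBoost with one-dimensional threshold stumps.

    X: list of feature vectors, y: labels in {-1, +1}. Each round
    picks the stump (feature, threshold, polarity) with the lowest
    weighted error, then reweights examples so the next weak
    learner focuses on the current mistakes.
    """
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        best = None  # (weighted error, feature, threshold, polarity)
        for f in range(len(X[0])):
            for t in sorted({x[f] for x in X}):
                for pol in (1, -1):
                    err = sum(w[i] for i in range(n)
                              if (pol if X[i][f] >= t else -pol) != y[i])
                    if best is None or err < best[0]:
                        best = (err, f, t, pol)
        err, f, t, pol = best
        err = max(err, 1e-10)  # avoid log(0) on a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, f, t, pol))
        # reweight: misclassified examples gain weight
        for i in range(n):
            pred = pol if X[i][f] >= t else -pol
            w[i] *= math.exp(-alpha * y[i] * pred)
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def adaboost_predict(ensemble, x):
    score = sum(a * (p if x[f] >= t else -p) for a, f, t, p in ensemble)
    return 1 if score >= 0 else -1
```

In a detection cascade, early stages would use a few such stumps tuned for near-perfect recall, rejecting most non-people windows cheaply.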

22 Classification: Multipart Representations
- Break down the complex appearance of the people class into subparts: build local pose-specific people clusters (e.g., half face / full face), train a specialized expert classifier for each subspace, and integrate the individual expert responses into a final decision (with a model for the geometric relations between parts).
- All experts run in parallel; the final decision combines local expert responses using maximum selection, majority voting, AdaBoost, trajectory-based data association, or probabilistic shape-based weighting.
- Framework overview: multi-cue component-based expert classifiers are trained offline on features derived from intensity, depth, and motion. Online, multi-cue segmentation determines occlusion-dependent component weights for expert fusion. Data samples are shown as intensity images, dense depth maps, and dense optical flow (left to right).
- M. Enzweiler et al., "Multi-cue pedestrian classification with partial occlusion handling," IEEE Conf. Computer Vision and Pattern Recognition, 2010.

23 Classification: Multipart Representations (Codebooks)
- Codebook representations model people bottom-up as assemblies of local codebook features, combined with top-down verification.
- Codebook generation (training): local feature descriptors (e.g., SIFT, shape context) are extracted around interest points (e.g., Harris points, DoG, Harris-Laplace) and clustered to form an appearance codebook. For each codebook entry, a spatial occurrence distribution is learned and stored in non-parametric form (as a list of occurrences).
- Trade-offs vs. full-body classification:
  1. Full-body classification needs a huge number of training examples to adequately cover the set of possible appearances.
  2. Missing parts due to scene occlusions or inter-object occlusions are easier to address, particularly if explicit inter-object occlusion reasoning is incorporated into the model.
  3. Higher complexity in both model generation (training) and application (testing).
- B. Leibe, E. Seemann, and B. Schiele, "Pedestrian detection in crowded scenes," CVPR 2005; Seemann et al., "Pedestrian detection in crowded street scenes," CVPR 2005.

24 Different Learning Algorithms / Classifiers
- SVM: finds a decision boundary by maximizing the margin between the different classes. The boundary can be linear or nonlinear (kernel); data can be of any type, i.e., scalar or vector features, intensity.
- AdaBoost: constructs a strong classifier by attaching weak classifiers in an iterative greedy manner. High speed because of cascades; can be combined with any classifier to find weak rules.
- Neural networks: different layers of neurons provide a nonlinear decision. Many configurations and parameters to choose; raw data (intensity, gradient magnitude) is often used.
- Source: "Survey of Pedestrian Detection for Advanced Driver Assistance Systems," PAMI 2010

25 Result of Classification
- Holistic algorithms are unable to deal with high variability: nonstandard poses greatly affect their performance, and the diversity of poses leaves many people poorly represented during training (e.g., running people, children).
- Part-based algorithms that rely on dynamic part detection handle pose changes better than holistic approaches.
- Support vector machines (SVMs) and boosting are the most popular choices of learning algorithm/classifier.
- Nearly all modern detectors employ gradient histograms, grayscale features (e.g., Haar wavelets), color, texture, self-similarity, and motion features.
- Source: "Survey of Pedestrian Detection for Advanced Driver Assistance Systems," PAMI 2010

26 Overall Performance (Detection Only, Without Tracking)
1. For people at least 80 pixels tall, 20-30% of all people are missed (at 1 false alarm per 10 images).
2. Performance degrades catastrophically at smaller scales: for people 30-80 pixels tall, around 80% are missed by the best detectors (at 1 false alarm per 10 images).
3. Performance degrades similarly under partial occlusion (under 35% occluded).
4. Performance is very bad at far scales (under 30 pixels) and under heavy occlusion (over 35% occluded); under these conditions nearly all people are missed, even at high false positive rates.
- P. Dollar, C. Wojek, B. Schiele, and P. Perona, "Pedestrian detection: An evaluation of the state of the art," PAMI, 2012.

27 Background Modeling
- Approaches: statistical background modeling (GMM, KDE), background clustering, neural-network background modeling, background estimation.
- Steps: background modeling, background initialization, background maintenance, foreground detection.
- Design choices: feature size (a pixel, a block, or a cluster) and feature type (color/edge/motion/texture features).
- Critical situations in video sequences: bootstrapping, camouflage, moved background objects, inserted background, waking foreground objects, sleeping foreground objects, shadows, dynamic backgrounds, illumination changes.
- Two constraints: low computation time and low memory requirements.
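The maintenance/detection steps above can be sketched with the simplest statistical model: a per-pixel running Gaussian. The frame is a flat list of intensities, and the learning rate and deviation threshold are typical but assumed values, not taken from the slides:

```python
def update_background(mean, var, frame, alpha=0.05, k=2.5):
    """Per-pixel running-Gaussian background model, one frame pass.

    Each pixel keeps a mean and variance; a pixel is flagged as
    foreground when it deviates more than k standard deviations
    from the mean. Statistics are then updated in place with
    learning rate alpha (background maintenance).
    """
    fg = []
    for i, v in enumerate(frame):
        d = v - mean[i]
        fg.append(d * d > (k * k) * var[i])
        mean[i] += alpha * d
        var[i] = (1 - alpha) * var[i] + alpha * d * d
    return fg
```

Because statistics keep updating, a "sleeping" foreground object is slowly absorbed into the background, one of the critical situations listed above.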

28 Statistical Background Modeling
- First category: Mixture of Gaussians (MOG), Kernel Density Estimation (KDE), Principal Component Analysis (PCA)
- Second category: Support Vector Machine (SVM), Support Vector Regression (SVR), Support Vector Data Description (SVDD)
- Third category: Single General Gaussian (SGG), Mixture of General Gaussians (MOGG), Independent Component Analysis (ICA), Incremental Non-negative Matrix Factorization (INMF), Incremental Rank-(R1,R2,R3) Tensor (IRT)
- T. Bouwmans, F. El Baf, and B. Vachon, "Statistical Background Modeling for Foreground Detection: A Survey," Handbook of Pattern Recognition and Computer Vision, vol. 4, ch. 3, World Scientific Publishing, 2010.
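For the MOG entry in the first category, the per-pixel background/foreground decision can be sketched as follows (Stauffer-Grimson style; the component weights, variances, and thresholds below are illustrative values, not from the cited survey):

```python
def mog_classify(pixel, gaussians, T=0.8, k=2.5):
    """MOG match test for one pixel, as a minimal sketch.

    gaussians: (weight, mean, var) tuples sorted by weight,
    descending. The first components whose cumulative weight
    reaches T model the background; a pixel matching one of them
    within k standard deviations is background, otherwise it is
    foreground.
    """
    cum = 0.0
    for w, m, v in gaussians:
        cum += w
        if (pixel - m) ** 2 <= (k * k) * v:
            return 'background'
        if cum > T:
            break  # remaining low-weight components are not background
    return 'foreground'
```

Raising T lets more components count as background (e.g., a bimodal, dynamic background), which is exactly the K-selection issue listed as a MOG weakness on the next slide.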

29 Disadvantages of MOG
- The number of Gaussians must be predetermined.
- Good initializations are needed.
- The results depend on the true distribution, which can be non-Gaussian.
- Slow recovery from failures.
- A series of training frames absent of moving objects is needed.
- Memory is required for this step.
- Solutions: intrinsic and extrinsic improvements.

30 MOG Improvements
- Intrinsic (by background step):
  - Background initialization: variable K; parameters μ, σ, ω
  - Background maintenance: learning rates α, ρ
  - Foreground detection: different measures for the matching test; probabilities; a foreground model
- Extrinsic (methods): Markov random fields, hierarchical approaches, multi-level approaches, multiple backgrounds, graph cuts, multi-layer approaches, tracking feedback

31 Feature Improvements of the MOG
- The original method uses only the RGB values of a pixel, without assuming spatial knowledge.
- Feature size: pixel, block, cluster.
- Feature type:
  - color features: normalized RGB, YUV, HSV, HSI, Luv, improved HLS, YCrCb
  - edge features, texture features, spatial features, motion features, HOG features, video features

32 Subspace Learning Using PCA
- Subspace learning offers a good framework to deal with illumination changes, as it takes spatial information into account. Assuming the larger part of the image is background, reconstructing from the M largest eigenvectors is expected to yield only the background.
- Limitations:
  - the size of the foreground objects must be small
  - foreground should not appear in the same location for a long period during training (stationary or slow-moving foreground)
  - time consuming, especially for color images

33 Tracking
- Tracking provides correspondences between the regions of consecutive frames, based on features and a dynamic model.
- Challenges in video tracking: robustness to clutter and occlusion, false positives/negatives, stability.
- The target changes its pose, and hence its appearance as seen by the camera, and can be occluded by another moving object.
- Clutter in video tracking: objects in the background share a similar color/shape with the target.

34 Tracking Challenges
- Similarity of appearance (clutter).
- Variations of appearance, caused by: illumination, scene, weather, partial/total occlusions, rotation, translation, deformation, object pose, sensor noise.

35 Video Tracking
- Tracking helps detection: it prevents false detections by using all of the data within a spatial/temporal sliding window over the most recent part of the video, uses motion estimates to obtain correct data associations, and predicts future positions of people, feeding the foreground segmentation algorithm with pre-candidates.
- Video tracking in computer vision (from "Video Tracking: A Concise Survey"):
  - Window tracking: windows are tracked from frame to frame by correlation-like correspondence (matching) methods, under the assumption that the intensity pattern changes little between frames. (A "window" is any subimage, generally rectangular; a "region" is a subimage with specific properties, generally free-form. Region-based tracking systems have been reported for surveillance, i.e., people detection and tracking.)
  - Feature tracking: features are detectable parts of an image that can support a vision task, for instance corners, lines, contours, or specially defined regions. Feature tracking first locates features in two subsequent frames, then matches each feature in I(t) with one feature in I(t+1), if such a match exists. Local features offer some invariance to image changes caused by scene or illumination changes, improving detectability over time. Variants: tracking local features (corners, edges, lines), optic-flow methods, tracking extended features (ellipses, rectangles, contours, regions), tracking deformable contours.
  - Tracking deformable contours: (1) snakes, i.e., image contours formed of discrete particles (pixels) bound together by internal elastic forces and sensitive to forces created by image gradients; both types of forces create a potential energy that the snake minimizes by changing its shape, and at convergence the snake has moulded itself along an image contour defining an area of interest, in our case the object being tracked. (2) A parametric geometric model, typically B-splines, to represent the contour being tracked. (3) Deformable templates, distinguished by the use of a specific prototype shape (e.g., an eye, a fish) specified by a set of landmark points whose motion is restricted by a motion model; tracking a complex shape is thereby reduced to tracking a discrete set of points forming a fixed, deformable template.
  - Visual learning: instead of capturing shape (appearance) and dynamics in a-priori models, learn them from example videos or images. The collection of examples (usually hundreds or thousands of images) is processed to extract a common description, typically via PCA, statistical learning (e.g., support vector machines), or estimation theory. The algorithm first learns a shape model capturing the space of all possible shapes, then uses the model to track targets.
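The window-based, correlation-like matching described above can be sketched as a sum-of-squared-differences search over a small region. The toy intensities are made up, and real systems often use normalized cross-correlation or feature-based scores instead of raw SSD:

```python
def match_template(frame, template, search):
    """Window-based matching: locate a template inside a search region.

    frame/template: 2-D lists of intensities; search: (x0, y0, x1, y1)
    range of candidate top-left positions. Returns the (x, y)
    minimizing the sum of squared differences (SSD), relying on the
    assumption that appearance changes little between frames.
    """
    th, tw = len(template), len(template[0])
    best, best_xy = None, None
    x0, y0, x1, y1 = search
    for y in range(y0, y1 + 1):
        for x in range(x0, x1 + 1):
            ssd = sum((frame[y + j][x + i] - template[j][i]) ** 2
                      for j in range(th) for i in range(tw))
            if best is None or ssd < best:
                best, best_xy = ssd, (x, y)
    return best_xy
```

The search region would normally come from the motion model (next slide), which is what keeps this exhaustive comparison cheap.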

36 Tracking: Motion and Matching
- Tracking systems address two problems: motion and matching.
- Motion problem: identify a limited search region in which the element is expected to be found with high probability. The simplest approach defines the search area in the next frame as a fixed-size region surrounding the target position in the previous frame. Tools: Kalman filter (KF), particle filtering, spatio-temporal MRF, graph correspondence, event cones.
- Matching problem (also known as detection or location): identify the image element in the next frame within the designated search region, using a similarity metric to compare candidates in the previous and current frames.
- Data association (temporal coherency) is a tracking-specific problem that occurs in the presence of interfering targets/trajectories: finding the true position of the moving target among equally valid candidates for the similarity metric.
- Robustness to clutter: the tracker should not be distracted by image elements resembling the target. Robustness to occlusion: tracking should not be definitively lost because of temporary target occlusion (drop-out), but resumed correctly when the target reappears (drop-in). False positives/negatives: only valid targets should be classified as such, and any other image element ignored (in practice, the number of false alarms should be as small as possible).
- Sources: "Stable Multi-Target Tracking in Real-Time Surveillance Video"; "A Review of Computer Vision Techniques for the Analysis of Urban Traffic"; "Video Tracking: A Concise Survey"
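The Kalman filter mentioned for the motion problem can be shown as a minimal 1-D constant-velocity predict/update cycle. The unit time step, diagonal process noise Q = qI, and the noise values are simplifying assumptions for illustration:

```python
def kalman_step(state, P, z, q=1.0, r=4.0):
    """One predict/update cycle of a 1-D constant-velocity Kalman filter.

    state = (position, velocity); P is the 2x2 state covariance as a
    list of lists; z is the measured position. The predicted position
    defines the center of the search region for matching, and the
    gate radius can be taken proportional to sqrt(P[0][0]).
    """
    x, v = state
    # predict (dt = 1): x' = x + v, v' = v; P' = F P F^T + Q
    x, v = x + v, v
    p00 = P[0][0] + P[0][1] + P[1][0] + P[1][1] + q
    p01 = P[0][1] + P[1][1]
    p10 = P[1][0] + P[1][1]
    p11 = P[1][1] + q
    # update with a position measurement z (H = [1, 0], noise r)
    s = p00 + r
    k0, k1 = p00 / s, p10 / s
    innov = z - x
    x, v = x + k0 * innov, v + k1 * innov
    P = [[(1 - k0) * p00, (1 - k0) * p01],
         [p10 - k1 * p00, p11 - k1 * p01]]
    return (x, v), P
```

When the measurement agrees with the prediction the state is unchanged; when it does not, the estimate moves part-way toward the measurement, with the gain set by the predicted uncertainty.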

37 Tracking by Detection
- The tracking-by-detection approach involves the continuous application of a detection algorithm in individual frames and the association of detections across frames. It is generally robust to changing backgrounds and moving cameras.
- Why is association between detections and targets difficult? Detection results degrade in occluded scenes, and detector output is unreliable and sparse: detectors only deliver a discrete set of responses and usually yield false positives and missing detections.
- Figure: the output of a person detector (right: ISM, left: HOG) with false positives and missing detections.
- M. D. Breitenstein et al., "Online multi-person tracking-by-detection from a single, uncalibrated camera," IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), vol. 33, no. 9, 2011.
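The association step above can be sketched with greedy IoU matching between predicted track boxes and the frame's detections. The boxes in the test are made up, and real systems (e.g., Breitenstein et al.) use richer affinities and probabilistic data association rather than this greedy baseline:

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    return inter / float(aw * ah + bw * bh - inter)

def greedy_associate(tracks, detections, min_iou=0.3):
    """Greedy data association between track boxes and detections.

    Repeatedly takes the highest-IoU (track, detection) pair above
    min_iou; leftovers are missed tracks or new detections, which is
    how the false positives and missing detections of the detector
    surface in the tracker.
    """
    pairs = sorted(((iou(t, d), ti, di)
                    for ti, t in enumerate(tracks)
                    for di, d in enumerate(detections)), reverse=True)
    used_t, used_d, matches = set(), set(), []
    for s, ti, di in pairs:
        if s < min_iou:
            break
        if ti not in used_t and di not in used_d:
            matches.append((ti, di))
            used_t.add(ti)
            used_d.add(di)
    return matches
```

Unmatched detections typically spawn tentative new tracks; tracks unmatched for several frames are terminated.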

38 Outline
- Object Detection
- People Detection
- Crowd: Crowd Challenges, Crowd Information Extraction, Crowd Dynamics/Analysis, Tracking in Crowded Scenes
- Suggested Approach

39 Different Levels of Crowd Density
- (a) Very low density; (b) low density; (c) moderate density; (d) high density; (e) very high density.
- A. Marana, L. da Costa, R. Lotufo, and S. Velastin, "On the efficacy of texture analysis for crowd monitoring," Proc. Int. Symp. Computer Graphics, Image Processing, and Vision (SIBGRAPI'98), Washington, DC, 1998, p. 354.

40 Crowd Challenges
- Straightforward extensions of techniques designed for non-crowded scenes do not work in crowded situations: because of severe occlusion, it is difficult to segment and track each individual in a crowd.
- In high-density video sequences, the accuracy of traditional object-tracking methods decreases as the density of people increases.
- The dynamics of a crowd itself are complex (goal-directed and psychological characteristics).
- Occlusion reasoning: if a partially occluded person is detected and associated to a trajectory, the classifier will be updated with noise and performance will degrade.

41 Crowd Information Extraction
1. Crowd density measurement: estimating crowd density or counting the number of people.
2. Recognition: in extremely cluttered scenes, segmenting individual people is exceedingly difficult.
   - face and head recognition
   - people and crowd recognition: occlusion handling; moving cameras (e.g., on-board vision systems to assist a driver); spatio-temporal methods
3. Tracking: resolve occlusion during and after its occurrence.
   - using traceable image features
   - human body models: model human body parts; tracking is implemented by probabilistic data association, i.e., matching the object hypotheses with the detected responses
   - tracking inference strategies: particle filtering, MHT, JPDAF, Hungarian algorithm, greedy search, etc. Using independent trackers requires solving a data-association problem to assign detections to targets.
- Source: "Crowd Analysis: A Survey," 2008

42 Crowd Analysis
- "Crowd analysis using computer vision techniques," Jacques Junior et al., IEEE Signal Processing Magazine, 2010.

43 Crowd Dynamics/Analysis
- Crowds can be characterized from three different perspectives: the image-space domain, the sociological domain (which has studied the behavior of people in crowds for many years), and the computer-graphics domain.
- Three issues in the analysis of crowded scenes: people counting / density estimation models, tracking in crowded scenes, and crowd behavior understanding models.
- Counting vs. tracking: the goal of both is to identify the participants of a crowd, but counting only estimates the number of people, without considering position and temporal evolution, whereas tracking determines the position of each person in the scene as a function of time.

44 Crowd Analysis: People Counting / Density Estimation Models
- Pixel-based analysis: based on very local features (individual pixel analysis via background subtraction or edge detection); mostly focused on density estimation rather than precise people counting.
- Texture-based analysis: requires the analysis of image patches, exploring higher-level features than pixel-based approaches.
- Object-level analysis: tries to identify individual objects in a scene and produces more accurate results than the other two, but identifying individuals is only feasible in low-density crowds and is very hard for denser crowds. Highly dependent on the extraction of the foreground blobs that generate the image features.
- Pixel-based and texture-based approaches explore lower-level features and do not try to identify individuals; they are usually less accurate for people counting but tend to work better in very high-density crowds. Object-level approaches are adequate for accurate counting and localization in low or moderately dense crowds, since occlusions become significant in packed crowds.
- Crowd density analysis can be used to measure the comfort level in public spaces, or to detect potentially dangerous situations in still images or video sequences.
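The pixel-based class above can be sketched in a few lines: count foreground pixels and map the occupancy ratio to a density level. The five levels mirror slide 39, but the thresholds here are invented for illustration, not taken from the cited work:

```python
def pixel_density(fg_mask):
    """Pixel-based crowd density: fraction of foreground pixels.

    fg_mask is a 2-D list of 0/1 values from background subtraction.
    Returns the occupied-area ratio, a coarse density cue that does
    not count or localize individuals (object-level analysis would).
    """
    total = sum(len(row) for row in fg_mask)
    occupied = sum(sum(row) for row in fg_mask)
    return occupied / float(total)

def density_level(ratio, thresholds=(0.1, 0.3, 0.5, 0.7)):
    """Map the occupancy ratio to one of five density levels.

    The thresholds are illustrative placeholders; a deployed system
    would calibrate them per scene and camera geometry.
    """
    labels = ('very low', 'low', 'moderate', 'high', 'very high')
    for t, lab in zip(thresholds, labels):
        if ratio < t:
            return lab
    return labels[-1]
```

Because the cue is a simple ratio, it stays usable in packed crowds where segmentation of individuals fails, which is the trade-off the slide describes.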

45 Tracking in Crowded Scenes
- Unstructured environments: the motion of a crowd appears random, with different participants moving in different directions over time (e.g., a crossway). The approach should allow each location of the scene to exhibit various crowd behaviors.
- Use the head as the point of reference rather than the entire body: heads are rarely obscured from overhead surveillance cameras and are generally not obscured by clothing.
- Multiple-target tracking:
  - appearance-based methods: feed-forward systems that use only current and past observations to estimate the current state
  - data-association-based methods: also use future information to estimate the current state, allowing ambiguities to be resolved more easily at the cost of increased latency
- Source: "Stable Multi-Target Tracking in Real-Time Surveillance Video"

46 Outline
- Object Detection and Tracking
- People Detection
- Crowd
- Suggested Approach

47 Goal
- Make detection reliable: low false positives and low false negatives (missed detections).
- Segment multiple, possibly occluded humans in the image.
- Detect the head/face of each person in video sequences.
- Obtain consistent trajectories of multiple, possibly occluding humans in the video sequence; track humans robustly, with occlusion reasoning.
- Detect faces: about 100 pixels between the outer eye corners of a face have been found necessary for 3-D modeling and recognition (from "Identifying Non-cooperative Subjects at a Distance Using Face Images and Inferred 3D Face Models").

48 My Main Scheme
- Preprocessing -> Foreground Segmentation (regions of interest) -> Object Classification (labeled ROIs) -> Verification/Refinement (verified and refined ROIs) -> Tracking

49 Detection
- Object detection can be performed by modeling and then classifying background and foreground:
  1. Train a classifier on the appearance of the background pixels; a detection is then associated with each connected region/blob of foreground pixels.
  2. Train a set of classifiers to encode the people (foreground), using statistical shape-and-texture appearance models to define a representation of people's appearance.
  3. Distinguish people from other objects (e.g., cars) using shape and periodic motion cues.
  4. Construct a model for each person during tracking that can be used to identify people after occlusion.
- Use multiple detection modules: shape-based detection plus texture-based classification. We should explain why certain modules were selected and how they were integrated in the overall system.

50 Detection/Tracking
- Head detection for tracking: detecting the head specifically, and part-based models in general, can be more reliable, since heads are rarely obscured from cameras or by clothing. Locating faces directly may not be possible due to occlusion, pose variations, or a relatively small face region compared to the whole image.
- Make a robust multi-target tracker: instead of only using detection and classification results to guide the tracker, couple detection and tracking using tracklets, updating the tracking model with the detection confidence density, and use spatio-temporal knowledge such as a motion model and object attributes.
- Example: when detection fails under occlusion, the tracker may help; when the tracker drifts under abrupt motion changes, detection can help.

51 References
- A Survey of Advances in Vision-Based Human Motion Capture and Analysis, 2006
- Object Tracking: A Survey, 2006
- Video Tracking: A Concise Survey, 2006
- Tracking People in Crowded Scenes, 2006
- Machine Recognition of Human Activities: A Survey, 2008
- Crowd Analysis: A Survey, 2008
- Pedestrian Tracking by Associating Tracklets Using Detection Residuals, 2008
- Robust Object Tracking by Hierarchical Association of Detection Responses, 2008
- Monocular Pedestrian Detection: Survey and Experiments, 2009
- Survey of Pedestrian Detection for Advanced Driver Assistance Systems, 2010
- Crowd Analysis Using Computer Vision Techniques, 2010
- A Review of Computer Vision Techniques for the Analysis of Urban Traffic, 2011
- Stable Multi-Target Tracking in Real-Time Surveillance Video, 2011
- Online Multiperson Tracking-by-Detection from a Single, Uncalibrated Camera, 2011
- Pedestrian Detection: An Evaluation of the State of the Art, 2012
- Monocular Visual Scene Understanding: Understanding Multi-Object Traffic Scenes, 2013