Motion Pattern Matching Using Divergence Minimization




Introduction



This paper proposes a novel method for recognition and classification of events represented by Mixture distributions of location and flow. The main idea is to classify observed events into semantically meaningful groups even when mo- tion is observed from distinct viewpoints. Events in the pro- posed framework are modeled as motion patterns, which are represented by mixtures of multivariate Gaussians, and are obtained by hierarchical clustering of optical flow in the four dimensional space (x, y, u, v). Such motion pat- terns observed from varying viewpoints, and in distinct lo- cations or datasets, can be compared using different fami- lies of divergences between statistical distributions, given that a transformation between the views is known. One of the major contributions of this paper is to compare and match two motion pattern mixture distributions by estimat- ing the similarity transformation between them, that min- imizes their Kullback−Leibler (KL) divergence. The KL di- vergence between Gaussian mixtures is approximated by Monte Carlo sampling, and the minimization is accom- plished by employing an iterative nonlinear least squares estimation method, which bears close resemblance to the Iterative Closest Point (ICP) algorithm. We present a ro- bust framework for matching of high-dimensional, sampled point sets representing statistical distributions, by defining similarity measures between them, for global energy min- imization. The proposed approach is tested for classifica- tion of events observed across several datasets, captured from both static and moving cameras, involving real world pedestrian as well as vehicular motion. Encouraging re- sults are obtained which demonstrate the feasibility and va- lidity of the proposed approach.



Event Representation



Our choice for the representation is a mixture of Gaussian model, which is learned over the four dimensional space (x, y, u, v), where (x, y) corresponds to a pixel location, and (u, v) is the instantaneous flow observed at (x, y). The reason for this choice is that this is a rich generative representation which is ideally suited to obtaining an analytical form for the representation of transformed motion patterns, as opposed to histogram representations and topic models.



Problem Description



Although location of observed flow is an important cue and constraint in grouping and analysis of spatiotemporal flow data, the same constraint transforms into the limitation of extremely view-dependent representations, which in turn makes comparison and recognition much harder. Due to the presence of spatial dimension in these representations, the above mentioned methods are unable to perform generalized event recognition or classification, for behaviors captured from varying view points, for videos of distinct datasets, or moving camera videos. As opposed to human action videos which often capture frontal pose of the actors, from ground level viewpoints, events related to surveillance scenarios lend themselves appropriately to exploration of the task under consideration, i.e., transformation invariant event matching and recognition. This is due in part to the fact that surveillance videos depict a a reasonably wide field of view of the scene, and are often captured from high oblique to nadir viewpoints. The proposed similarity transformation therefore is well suited to view invariant matching of events, given that due to overhead views, the most dominant components of transformation are mostly related to rotation and scaling, in addition to translation, as opposed to perspective. Given input videos, the goal of the proposed framework therefore, is to estimate a measure of similarity between representations of two observed event instances, F and G, regardless of the view from which they were observed. This goal can be achieved in three main steps: (i) Automatic detection of events and their representation as probabilistic distributions, f(x) and g(x); (ii) Analysis of the effect of a transformation, T, on the multivariate distribution, g(x), to analytically obtain a transformed distribution, gT(x); and (iii) Estimation of the transformation, T, such that some measure of dissimilarity or divergence, D, between f and gT is minimized. A test example of an event can then successfully be matched against all learned event models for the purpose of classification or labeling.



Algorithm



given two events, represented by f(x) and g(x) respectively, we can compute a measure of dissimilarity, Z, between them as, Z (f, g) = D (f || gt) , (8) where D is chosen to be the KL divergence. There however, is no closed form expression for KL divergence between Gaussian mixtures and various approximations are often used. In this paper, divergence is approximated by Monte Carlo sampling over the first distribution, f. A large set of N, 4d points, is thus sampled from the mixture distribution f, where N is typically equal to 1000 in our experiments. We then prove in our work that the problem of estimating the optimal transformation that minimizes the KL divergence, is equivalent to maximum likelihood estimation (MLE) of the Gaussian mixture parameters given the set of sampled points. Another equivalent way of describing this problem is that the goal is to estimate the transformation, T, such that the inverse transformed sampled points have the maximum joint likelihood, given the parameters of g. Notice however, that this is not an easier problem compared to estimation of mixture parameters (e.g., using Expectation Maximization), and an analytical expression of the joint likelihood derivative, with respect to even the linear transformation T, is non-trivial. We therefore propose an approximate, iterative maximization algorithm that shares similarities with the Iterative Closest Point (ICP) algorithm. Given the point set sampled from f, which we now write as Xf , we assume the initial value of the transformation to be identity, and write it as T0. In other words, R0 = I2x2, and t0 = [0 0]. The proposed maximization algorithm then begins by first sampling another set of N points, XgT0 , from gT0 = g. At each iteration, i >= 1, the goal of the maximization process is to iteratively find 1-1 correspondence between the point sets, Xf and XgTi-1 , and compute intermediate transformation, ˆTi, so that, Ti = ˆTiTi-1.

Experiments and Results

We test our method on 11 event types commonly found in traffic scenarios using two different measure. KL divergence and minimum graph cost at final iteration of transformation.

Confusion table using KL divergence:


Confusion table using graph cost


Presentation

PowerPoint