In this work, we address the problem of extracting the geospatial trajectory of the camera from unconstrained videos recorded in a city. The proposed method is intended for user-uploaded videos available on the web, which typically exhibit recording defects such as blurred or uninformative frames, abrupt changes in camera motion, zoom effects, dynamic scene content such as vehicles and pedestrians occluding distinctive features, and missing information (e.g., metadata) about the initial position and pose at which the video was recorded.
This project addresses the detection and tracking of a large number of objects
in Wide Area Surveillance (WAS) data. WAS data is aerial data characterized by a very large field
of view. The dataset used for this project is the CLIF dataset, where CLIF stands for
Columbus Large Image Format; the dataset is available here.
With the proliferation of wide-area video sensor networks, video surveillance, especially in public areas, is gaining importance at an unprecedented rate.
We argue that, over the period of its operation, an intelligent tracking system should be able to learn the scene from its observations and
improve its performance based on this model. The high-level knowledge necessary to make such inferences derives from domain knowledge, past experience,
scene geometry, and learned traffic and target behavior patterns in the area. This argument forms the basis of this project, in which we model
and learn the scene activity observed by a static camera. The motion patterns of the objects in the scene are modeled as a multivariate non-parametric
probability density function of spatio-temporal variables. Kernel density estimation is used to learn this model in a completely unsupervised fashion,
by observing the trajectories of objects over extended periods of time.
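
The density model described above can be illustrated with a minimal sketch. It is not the project's implementation: the feature vector (x, y, t), the Gaussian product kernel, the fixed per-dimension bandwidths, and the function name `kde_density` are all assumptions made for illustration, and the observations are synthetic.

```python
import numpy as np

def kde_density(samples, query, bandwidth):
    """Evaluate a Gaussian product-kernel density estimate at query points.

    samples:   (n, d) observed feature vectors, e.g. (x, y, t) per observation
    query:     (m, d) points at which the density is evaluated
    bandwidth: (d,) per-dimension kernel widths (assumed, not from the source)
    """
    samples = np.asarray(samples, dtype=float)
    query = np.asarray(query, dtype=float)
    h = np.asarray(bandwidth, dtype=float)
    d = samples.shape[1]
    # Scaled differences between every query point and every sample: (m, n, d)
    z = (query[:, None, :] - samples[None, :, :]) / h
    # Normalizing constant of a d-dimensional Gaussian with diagonal widths h
    norm = (2.0 * np.pi) ** (d / 2.0) * np.prod(h)
    kernels = np.exp(-0.5 * np.sum(z ** 2, axis=2)) / norm
    # The estimate is the average kernel response over all samples: (m,)
    return kernels.mean(axis=1)

# Synthetic stand-in for tracked object observations (x, y, t) around one path
rng = np.random.default_rng(0)
samples = rng.normal(loc=[50.0, 20.0, 5.0], scale=[2.0, 2.0, 1.0], size=(500, 3))

# One query on the frequently observed path, one far away from it
query = np.array([[50.0, 20.0, 5.0], [10.0, 80.0, 5.0]])
density = kde_density(samples, query, bandwidth=[1.5, 1.5, 0.5])
print(density)
```

Because no labels are involved, the estimate is built from observed tracks alone: a query near frequently observed motion receives a high density, while a query far from any observation receives a density close to zero. In practice the bandwidths would have to be chosen from the data rather than fixed by hand.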