First International Workshop on Visual Analysis and Geo-Localization of Large-Scale Imagery

In conjunction with ECCV 2012, Firenze, Italy October 7, 2012





Large-scale geo-localization of imagery is a challenging task that has received significant attention in recent years, largely due to the increased availability of geo-tagged images on social and photo-sharing websites and the improved computational power of modern computers. Challenges related to visual analysis and geo-localization of large-scale imagery arise in a variety of contexts. Consumer and intelligence analysts alike may be interested in determining when and where an image or video was taken, who is in it, what objects the depicted scene contains, and how those objects relate to each other. Local law enforcement agencies may be interested in obtaining content statistics from imagery around a city location, such as population density and ethnic distribution, to plan their logistics. Similarly, local businesses may use such content statistics to target-market their products based on the 'where', 'what', 'when', and 'how' that can be extracted automatically during visual analysis and geo-localization of large-scale imagery. Developing mathematical models of individuals, objects, and their interactions for simulation and modeling purposes is yet another area of interest.

We believe research on visual analysis ('what', 'when', and 'how') and geo-localization ('where') of large-scale imagery can benefit greatly from bringing together researchers from computer vision, computer graphics, photogrammetry, computational optimization, and other related fields. Such a gathering will lay the foundation for an integrated, context-based visual analysis approach to building earth-scale models, in which complementary viewpoints and techniques from these areas are used to develop additional insight into the visual analysis and geo-localization problem. The focus of this workshop is the exchange of ideas on how to develop visual analysis and geo-localization capabilities that make use of the vast amount of contextual information available on the internet. As a byproduct, the relevant communities will also benefit from improved methods for data-driven modeling and analysis of large-scale imagery. Papers describing novel and original research are solicited in areas related to visual analysis and geo-localization of large-scale imagery. Topics of interest include, but are not limited to:

  • Visual Feature and Information Extraction from Large-Scale Imagery
  • Understanding and Modeling Uncertainties in Visual and Geospatial Data
  • Semantic Generalization of Visual and Geospatial Data
  • Representation, Indexing, Storage, and Analysis of City-to-Earth Scale Models
  • Automated 3D Modeling Pipelines for Complex Large-Scale Architectures
  • Integrated Processing of Point Clouds, Image, and Video Data
  • Multi-Modal Visual Sensor Data Fusion
  • Control Mechanisms that aid in Visual Analysis and Geo-Localization
  • Rendering and Visualization of Large-Scale Models, Semantic Labels and Imagery
  • Applications of Visual Analysis and Geo-Localization of Large-Scale Imagery
  • Datasets and Model Validation Methods for Analysis and Geo-localization Research

Call for Papers 



Papers should describe original and unpublished work on the above or closely related topics. Each paper will receive double-blind reviews, moderated by the workshop chairs. Authors should take the following into account:

  • All papers must be written in English and submitted in PDF format.
  • Papers must be submitted online through the ECCV 2012 submission CMT system.
  • The maximum paper length is 10 pages. The workshop paper format guidelines are the same as the Main Conference papers.
  • Submissions will be rejected without review if they contain more than 10 pages or violate the double-blind or dual-submission policy.
  • Authors will have the opportunity to submit up to 30MB of supporting material.

The author kit provides a LaTeX2e template for submissions and an example paper demonstrating the format. Please refer to this example for detailed formatting instructions. If you intend to use something other than LaTeX to prepare your paper (e.g., MS Word), please refer to the Springer LNCS site.

A paper ID will be allocated to you during submission. Please replace the asterisks in the example paper with your paper's own ID before uploading your file.

Important Dates

Submission deadline: July 7, 2012
Author notification: July 31, 2012
Camera-ready deadline: August 7, 2012
Workshop date: October 7, 2012



General Chairs

Mubarak Shah, Univ. of Central Florida, USA
Luc Van Gool, ETH, Switzerland


Program Chairs

Asaad Hakeem, ObjectVideo, USA
Alexei Efros, Carnegie Mellon Univ., USA
Niels Haering, ObjectVideo, USA
James Hays, Brown Univ., USA
Hui Cheng, SRI Sarnoff, USA

Program Committee

Josef Sivic, INRIA, France
Shih-Fu Chang, Columbia Univ., USA
Rama Chellappa, Univ. of Maryland, USA
Saad Ali, SRI Sarnoff, USA
Robert Pless, Washington Univ., USA
Omar Javed, SRI Sarnoff, USA
Andrew Bagnell, Carnegie Mellon Univ., USA
Yaser Sheikh, Carnegie Mellon Univ., USA
Nathan Jacobs, Univ. of Kentucky, USA
Himaanshu Gupta, ObjectVideo, USA
Arslan Basharat, Kitware, USA
Serge Belongie, UC San Diego, USA
David Crandall, Indiana Univ., USA
Zeeshan Rasheed, ObjectVideo, USA
Anthony Hoogs, Kitware, USA
Jana Kosecka, George Mason Univ., USA
Giovanni Marchisio, DigitalGlobe, USA
Daniel Huttenlocher, Cornell Univ., USA
B. S. Manjunath, UC Santa Barbara, USA
Chris Stauffer, BAE Systems, USA
Antonio Torralba, MIT, USA
Alper Yilmaz, Ohio State Univ., USA
Grant Schindler, Georgia Tech, USA
Ram Nevatia, Univ. of Southern California, USA
Mei Han, Google Research, USA


Invited Talks

Josef Sivic, INRIA, France
Alexei Efros, Carnegie Mellon University, USA
Till Quack, Kooaba, Switzerland
Hui Cheng, SRI International Sarnoff, USA
Noah Snavely, Cornell University, USA
Cordelia Schmid, INRIA, France

Representations for place recognition: from visual words to view-dependent contours
Josef Sivic, INRIA/Ecole Normale Superieure, Paris

Abstract: We consider the problem of visual place recognition: given a query image of a particular street or building facade, the objective is to find one or more images in a geotagged database depicting the same place. This problem can be cast as image retrieval using the efficient bag-of-visual-words representation. In contrast to image retrieval, where the database is typically an unstructured collection of images, place recognition databases are often structured: images have geotags, are localized on a map, and depict a consistent 3D world. Taking advantage of this structure can lead to new, more accurate, and more efficient representations.

First, we show that the 2D map structure of street-view imagery leads to an efficient matching scheme based on linear interpolation of descriptors from neighboring views. Second, we demonstrate that geotags can be used as a form of weak supervision to select and learn local scene features that are distinctive for a particular place or geo-spatial area. Finally, we show that the recovered 3D structure can be used to extract view-dependent contours that enable visual localization in situations where local invariant features fail, for example when localizing non-photographic depictions such as paintings or drawings.

Results will be shown on collections of Internet images from Flickr and street-view.
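The bag-of-visual-words retrieval step the abstract builds on can be sketched roughly as follows. This is a minimal illustration, not the speaker's implementation; the tf-idf weighting, cosine ranking, and toy data are assumptions:

```python
import numpy as np

def bovw_histogram(word_ids, vocab_size):
    """Quantized local descriptors -> normalized visual-word histogram."""
    h = np.bincount(word_ids, minlength=vocab_size).astype(float)
    return h / max(h.sum(), 1.0)

def rank_database(query_words, db_words, vocab_size):
    """Rank geotagged database images by cosine similarity of
    tf-idf weighted bag-of-visual-words vectors."""
    db_hists = np.array([bovw_histogram(w, vocab_size) for w in db_words])
    # idf: down-weight visual words that occur in many database images
    df = (db_hists > 0).sum(axis=0)
    idf = np.log((1 + len(db_words)) / (1 + df))
    q = bovw_histogram(query_words, vocab_size) * idf
    d = db_hists * idf
    sims = (d @ q) / (np.linalg.norm(d, axis=1) * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims)  # indices of best-matching places first
```

In a real system the vocabulary has hundreds of thousands of words and an inverted index replaces the dense similarity computation; the geotag of the top-ranked database image then serves as the location hypothesis for the query.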

Bio: Josef Sivic received a degree from the Czech Technical University, Prague, in 2002 and the PhD degree from the University of Oxford in 2006. His thesis on efficient visual search of images and videos was awarded the British Machine Vision Association 2007 Sullivan Thesis Prize and was shortlisted for the British Computer Society 2007 Distinguished Dissertation Award. His research interests include visual search and object recognition applied to large image and video collections. After spending six months as a postdoctoral researcher in the Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology, he now holds a permanent position as an INRIA researcher at the Departement d'Informatique, Ecole Normale Superieure, Paris. He has published over 40 scientific publications and serves as an Associate Editor of the International Journal of Computer Vision.

"What makes Paris look like Paris?": mining geo-informative visual features
Alexei Efros, CMU

Abstract: Visual geo-location is frequently posed as a very-large-scale exact place recognition problem. This works extremely well for many practical tasks where data is densely sampled (e.g., Google Car), but if the exact query location is not in the database, the algorithm is completely at a loss. Instead, here I will address the problem of coarse-scale visual geo-location: localizing an image on the scale of a city, a region, or a country given a sparsely sampled database. Interestingly, while people are surprisingly good at this task [Hays and Efros, VSS'09], it is not clear how they do it.

In this talk, I will first present an overview of our work on data-driven Earth-scale geo-location of images (im2gps) and image sequences and compare it with human performance. The types of images where humans are beating computers will be used to motivate our recent work on finding local geo-informative visual features. Given a large repository of geotagged imagery, we seek to automatically find visual elements, e.g. windows, balconies, street signs, that are most distinctive for a certain geo-spatial area, for example the city of Paris. We show that geographically representative image elements can be discovered automatically from Google Street View imagery in a discriminative manner. We demonstrate that these elements are visually interpretable and perceptually geo-informative. The discovered visual elements can also support a variety of computational geography tasks, such as mapping architectural correspondences and influences within and across cities, finding representative elements at different geo-spatial scales, and geographically-informed image retrieval.
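The core idea of scoring candidate visual elements by how discriminative they are for a target area can be sketched with a simple nearest-neighbor purity criterion. This is a deliberately simplified stand-in for the discriminative clustering described in the talk; the descriptor space, the `k` parameter, and the purity score are assumptions:

```python
import numpy as np

def geo_informativeness(candidate, patches, is_target, k=5):
    """Score a candidate visual element (a patch descriptor) by the
    fraction of its k nearest neighbors that come from the target
    geo-spatial area (e.g. Paris) rather than the rest of the world.
    A score near 1.0 marks a geo-informative element; a score near the
    target's base rate marks a generic one (cars, sidewalks, trees)."""
    dists = np.linalg.norm(patches - candidate, axis=1)
    nn = np.argsort(dists)[:k]           # k nearest patch descriptors
    return is_target[nn].mean()          # nearest-neighbor purity
```

Elements that survive such a filter (window grills, balcony railings, street signs) are the kind of "geographically representative image elements" the abstract refers to.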

Bio: Alexei "Alyosha" Efros is an associate professor at the Robotics Institute and the Computer Science Department at Carnegie Mellon University. His research is in the area of computer vision and computer graphics, especially at the intersection of the two. He is particularly interested in using data-driven techniques to tackle problems that are very hard to model parametrically but where large quantities of data are readily available. Alyosha received his PhD in 2003 from UC Berkeley under Jitendra Malik and spent the following year as a post-doctoral fellow in Andrew Zisserman's group in Oxford, England. Alyosha is a recipient of the CVPR Best Paper Award (2006), an NSF CAREER award (2006), a Sloan Fellowship (2008), a Guggenheim Fellowship (2008), an Okawa Grant (2008), the Finmeccanica Career Development Chair (2010), the SIGGRAPH Significant New Researcher Award (2010), and an ECCV Best Paper Honorable Mention (2010).

Large-scale Mining of Landmark Databases for Fun and Profit
Till Quack, Kooaba

Abstract: Real-world systems for visual search, augmented reality, or photo tagging require large-scale databases. This talk focuses on the construction of such databases for landmark recognition. The databases should be created without manual intervention by crawling the data from the web. Not only do the landmark clusters themselves have to be crawled; they also have to be labeled with useful descriptions so that applications can offer a real benefit to the user. We present our early work on the topic, put it into context with current research, and also cover recent commercialization efforts.

Bio: Till Quack is co-founder and CTO of kooaba AG. He obtained an M.Sc. degree in Information Technology and Electrical Engineering from ETH Zurich in 2004. From 2004 to 2009, he was a research assistant and Ph.D. student in the Computer Vision Group of the Swiss Federal Institute of Technology (ETH), Zurich, Switzerland, with Prof. Luc Van Gool. He received his Ph.D. in Computer Science from ETH Zurich in 2009 for his work on mining and retrieval of visual data in a multimodal context. His recent research interests include multi-modal mining of data from community photo collections.

Geo-Locating Photos and Videos over Urban and Natural Terrain
Hui Cheng, SRI International Sarnoff

Abstract: Photos and videos taken with cell phones, laptop cameras, and other devices often contain critical information about the locations of the people, buildings, and objects seen in the imagery. The widespread availability of these images and videos has greatly expanded the volume of information requiring labor-intensive analysis. To address this problem, SRI has developed and demonstrated a prototype system for geo-location of photos and video frames using hierarchical analysis and exploitation of urban and natural terrain features. The system reduces geo-location workload while improving accuracy.

Bio: Dr. Hui Cheng is the Senior Technical Manager of the Cognitive and Adaptive Vision Systems Group at SRI International Sarnoff. He received his Ph.D. in Electrical Engineering from Purdue University. His research focuses on image/video understanding, data mining, machine learning, pattern analysis, cognitive systems, and informatics. Dr. Cheng has led a number of programs in full-motion video, wide-area surveillance video, and web video exploitation, summarization, and indexing. He is also the technical lead in geo-location of metadata-free photos and videos and in automated behavior analysis and event detection for surveillance, training, and social media understanding. Dr. Cheng has more than 40 publications and 15 patents. He is a senior member of the IEEE, a member of the IEEE Technical Committee on Multimedia Systems and Applications, and the Chair of the Princeton/Central Jersey Chapter of the IEEE Signal Processing Society.

Building a World-Scale Geometric Database
Noah Snavely, Cornell University

Abstract: We live in a world of ubiquitous imagery, in which the number of images at our fingertips is growing at a seemingly exponential rate. These images come from a wide variety of sources, including millions of photographers around the world uploading billions and billions of images to social media and photo-sharing websites such as Facebook. Our goal is to reconstruct as much of the world in 3D as we can from this data, and to use the resulting data in applications such as pixel-accurate location recognition -- recognizing the camera matrix of an image in a geographic coordinate system to pixel precision. This talk describes our recent work on large-scale 3D modeling and world-wide camera pose estimation.

Bio: Noah Snavely is an assistant professor of Computer Science at Cornell University, where he has been on the faculty since 2009. He received a B.S. in Computer Science and Mathematics from the University of Arizona in 2003, and a Ph.D. in Computer Science and Engineering from the University of Washington in 2008. Noah works in computer graphics and computer vision, with a particular interest in using vast amounts of imagery from the Internet to reconstruct and visualize our world in 3D, and in creating new tools for enabling people to capture and share their environments. Noah is the recipient of a Microsoft New Faculty Fellowship and an NSF CAREER Award, and has been recognized by Technology Review's TR35.

Aggregating local image descriptors for large-scale image retrieval and classification
Cordelia Schmid, INRIA Rhone-Alpes

Abstract: We address the problems of large-scale image retrieval and classification. In both cases an appropriate image representation is important. We present and evaluate different ways of aggregating local image descriptors into a vector and show that the Fisher kernel achieves better performance than the reference bag-of-visual-words approach for any given vector dimension.

In the context of large-scale image retrieval we show how to jointly optimize dimensionality reduction and indexing in order to obtain a precise vector comparison as well as a compact representation. We show that the image representation can be reduced to a few dozen bytes while preserving high accuracy; searching a 100 million image dataset takes about 250 ms on one processor core. For large-scale image classification we show and interpret the importance of an appropriate vector normalization. Furthermore, we discuss how to learn with stochastic gradient descent given a large number of classes and images, and show results on ImageNet10k.
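The aggregation step the abstract describes can be illustrated with VLAD, a simplified relative of the Fisher vector that also sums statistics of local descriptors relative to a codebook. This is a rough sketch under assumed toy data, not the talk's actual pipeline (which uses Fisher kernels and learned codebooks):

```python
import numpy as np

def vlad(descriptors, codebook):
    """Aggregate a set of local descriptors into a single vector by
    summing, for each codeword, the residuals of the descriptors
    assigned to it (VLAD), then L2-normalizing the concatenation."""
    # squared distances between every descriptor and every codeword
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)            # nearest codeword per descriptor
    K, D = codebook.shape
    v = np.zeros((K, D))
    for k in range(K):
        sel = descriptors[assign == k]
        if len(sel):
            v[k] = (sel - codebook[k]).sum(axis=0)   # residual sum
    v = v.ravel()                          # one K*D-dimensional vector
    return v / (np.linalg.norm(v) + 1e-12)  # L2 normalization
```

The resulting fixed-length vector can then be compressed (e.g., with PCA and product quantization) to the "few dozen bytes" per image mentioned above while remaining comparable by inner product.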

Bio: Cordelia Schmid holds an M.S. degree in Computer Science from the University of Karlsruhe and a Doctorate, also in Computer Science, from the Institut National Polytechnique de Grenoble (INPG). Her doctoral thesis on "Local Greyvalue Invariants for Image Matching and Retrieval" received the best thesis award from INPG in 1996. She received the Habilitation degree in 2001 for her thesis entitled "From Image Matching to Learning Visual Models". Dr. Schmid was a post-doctoral research assistant in the Robotics Research Group of Oxford University in 1996--1997. Since 1997 she has held a permanent research position at INRIA Rhone-Alpes, where she is a research director and directs the INRIA team called LEAR, for LEArning and Recognition in Vision. Dr. Schmid is the author of over a hundred technical publications. She has been an Associate Editor for IEEE PAMI (2001--2005) and for IJCV (2004--present), a program chair of IEEE CVPR 2005 and ECCV 2012, and a general chair of IEEE CVPR 2015. In 2006, she was awarded the Longuet-Higgins prize for fundamental contributions in computer vision that have withstood the test of time. She is a fellow of the IEEE.




Best Paper Award

A best paper will be recommended by the program committee during peer review and selected by the workshop chairs. The winner will receive a recognition certificate and a check for USD 500, sponsored by ObjectVideo.




Program

09:00-09:20 Opening notes from Workshop Organizers
09:20-10:00 Invited Speaker - Josef Sivic INRIA, France
10:00-10:40 Invited Speaker - Alexei Efros Carnegie Mellon University, USA
10:40-11:00 Break
11:00-11:40 Invited Speaker - Till Quack Kooaba, Switzerland
11:40-12:20 Invited Speaker - Cordelia Schmid INRIA, France
12:20-12:40 Oral Presentation - 77: A memory efficient discriminative approach for location aided recognition
12:40-13:00 Oral Presentation - 80: Ultra-wide Baseline Facade Matching for Geo-Localization
13:00-14:30 Lunch
14:30-15:10 Invited Speaker - Noah Snavely Cornell University, USA
15:10-16:10 Panel Discussion: Jill Crisman, Rama Chellappa, Marc Pollefeys, Ram Nevatia, Bastian Leibe
