In conjunction with ECCV 2012, Firenze, Italy, October 7, 2012
Large-scale geolocalization of imagery is a challenging task that has received significant attention in recent years, largely due to the increased availability of geo-tagged images on social and photo-sharing websites and the improved computational power of modern computers. Challenges related to visual analysis and geo-localization of large-scale imagery arise in a variety of contexts. Consumer and intelligence analysts alike may be interested in determining when and where an image or video was taken, who is in it, what the objects in the depicted scene are, and how they relate to each other. Local law enforcement agencies may be interested in content statistics gathered from imagery around a city, such as population density and ethnic distribution, to plan their logistics. Similarly, local businesses may use such statistics to target-market their products based on the 'where', 'what', 'when', and 'how' that can be extracted automatically during visual analysis and geo-localization of large-scale imagery. Developing mathematical models of individuals, objects, and their interactions for simulation and modeling purposes is yet another area of interest.
We believe research on visual analysis ('what', 'when', and 'how') and geo-localization ('where') of large-scale imagery can greatly benefit from bringing together researchers from computer vision, computer graphics, photogrammetry, computational optimization, and other related fields. Such a gathering will lay the foundation for an integrated, context-based visual analysis approach to building earth-scale models, where complementary viewpoints and techniques from these areas are used to develop additional insight into the visual analysis and geo-localization problem. The focus of this workshop will be the exchange of ideas on how to develop visual analysis and geo-localization capabilities that make use of the vast amount of contextual information available on the internet. As a byproduct, the relevant communities will also benefit, as this workshop will lead to improved methods for data-driven modeling and analysis of large-scale imagery. Papers describing novel and original research are solicited in areas related to visual analysis and geo-localization of large-scale imagery. Topics of interest include, but are not limited to:
Papers should describe original and unpublished work on the above or closely related topics. Each paper will receive double-blind reviews, moderated by the workshop chairs. Authors should take into account the following:
The author kit provides a LaTeX2e template for submissions, and an example paper to demonstrate the format. Please refer to this example for detailed formatting instructions. If you intend to use something other than LaTeX to prepare your paper (e.g. MS Word), you may refer to this Springer LNCS site.
A paper ID will be allocated to you during submission. Please replace the asterisks in the example paper with your paper's own ID before uploading your file.
Paper submission: July 7, 2012
Notification of acceptance: July 31, 2012
Camera-ready due: August 7, 2012
Workshop: October 7, 2012
Luc Van Gool
Brown Univ., USA
SRI Sarnoff, USA
B. S. Manjunath
Columbia Univ., USA
SRI Sarnoff, USA
Washington Univ., USA
SRI Sarnoff, USA
Univ. of Kentucky, USA
Indiana Univ., USA
Cornell Univ., USA
BAE Systems, USA
Georgia Tech, USA
Google Research, USA
Carnegie Mellon University, USA
SRI International Sarnoff
Cornell University, USA
Representations for place recognition: from visual words to view-dependent contours
Abstract: We consider the problem of visual place recognition: given a query image of a particular street or building facade, the objective is to find one or more images in a geotagged database depicting the same place. This problem can be cast as image retrieval using the efficient bag-of-visual-words representation. In contrast to image retrieval, where the database is typically an unstructured collection of images, place recognition databases are often structured: images have geotags, are localized on a map, and depict a consistent 3D world. Taking advantage of this structure can lead to new, more accurate and efficient representations.
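For readers unfamiliar with the bag-of-visual-words representation mentioned in the abstract, the core idea can be sketched in a few lines of numpy. This is only a toy illustration: real place recognition systems extract local descriptors such as SIFT, learn codebooks of up to a million visual words with approximate k-means, and score with inverted files and tf-idf weighting. All names and sizes below are illustrative.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Quantize local descriptors to their nearest visual word and
    return an L2-normalized term-frequency histogram."""
    # pairwise squared distances, shape (n_descriptors, n_words)
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

def retrieve(query_descriptors, database_hists, codebook):
    """Rank database images by cosine similarity to the query's BoW vector."""
    q = bow_histogram(query_descriptors, codebook)
    scores = database_hists @ q  # histograms are already L2-normalized
    return np.argsort(-scores)   # best match first

# toy example: a 3-word codebook over random 8-D "descriptors"
rng = np.random.default_rng(0)
codebook = rng.normal(size=(3, 8))
db = np.stack([bow_histogram(rng.normal(size=(20, 8)), codebook)
               for _ in range(5)])
ranking = retrieve(rng.normal(size=(15, 8)), db, codebook)
```

The geotags and map structure the abstract highlights would enter on top of this baseline, e.g. by restricting or re-weighting the candidate set.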
Bio: Josef Sivic received a degree from the Czech Technical University, Prague, in 2002 and the PhD degree from the University of Oxford in 2006. His thesis dealing with efficient visual search of images and videos was awarded the British Machine Vision Association 2007 Sullivan Thesis Prize and was shortlisted for the British Computer Society 2007 Distinguished Dissertation Award. His research interests include visual search and object recognition applied to large image and video collections. After spending six months as a postdoctoral researcher in the Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology, he currently holds a permanent position as an INRIA researcher at the Departement d'Informatique, Ecole Normale Superieure, Paris. He has published over 40 scientific publications and serves as an Associate Editor for the International Journal of Computer Vision.
"What makes Paris look like Paris?": mining geo-informative visual features
Abstract: Visual geo-location is frequently posed as a very-large-scale exact place recognition problem. This works extremely well for many practical tasks where the data is densely sampled (e.g. Google Car), but if the exact query location is not in the database, the algorithm is completely at a loss. Instead, here I will address the problem of coarse-scale visual geo-location -- localizing an image at the scale of a city, a region, or a country given a sparsely sampled database. Interestingly, while people are surprisingly good at this task [Hays and Efros, VSS'09], it is not clear how they do it.
Bio: Alexei "Alyosha" Efros is an associate professor at the Robotics Institute and the Computer Science Department at Carnegie Mellon University. His research is in the area of computer vision and computer graphics, especially at the intersection of the two. He is particularly interested in using data-driven techniques to tackle problems which are very hard to model parametrically but where large quantities of data are readily available. Alyosha received his PhD in 2003 from UC Berkeley under Jitendra Malik and spent the following year as a post-doctoral fellow in Andrew Zisserman's group in Oxford, England. Alyosha is a recipient of CVPR Best Paper Award (2006), NSF CAREER award (2006), Sloan Fellowship (2008), Guggenheim Fellowship (2008), Okawa Grant (2008), Finmeccanica Career Development Chair (2010), SIGGRAPH Significant New Researcher Award (2010), and ECCV Best Paper Honorable Mention (2010).
Large-scale Mining of Landmark Databases for Fun and Profit
Abstract: Real-world systems for visual search, augmented reality, or photo tagging require large-scale databases. This talk focuses on the construction of such databases in the area of landmark recognition. The databases should be created without manual intervention, by crawling the data from the web. Not only do the landmark clusters themselves have to be crawled; they also have to be labeled with useful descriptions so that applications can offer a real benefit to the user. We present our early work on the topic, put it in context with current research, and also discuss recent commercialization efforts.
Bio: Till Quack is co-founder and CTO of kooaba AG. He obtained a M.Sc. degree in Information Technology and Electrical Engineering from ETH Zurich in 2004. From 2004 to 2009, he was a research assistant and Ph.D. student at the Computer Vision Group of the Swiss Federal Institute of Technology (ETH) Zurich, Switzerland, with Prof. Luc Van Gool. He received his Ph.D. in Computer Science (Dr.sc.techn.) from ETH Zurich in 2009 for his work on mining and retrieval of visual data in a multimodal context. His recent research interests include multi-modal mining of data from community photo collections.
Geo-Locating Photos and Videos over Urban and Natural Terrain
Abstract: Photos and videos taken from cell phones, laptop cameras and other devices often contain critical information about the locations of the people, buildings and objects seen in the imagery. The widespread availability of these images and videos has greatly expanded the volume of information requiring labor-intensive analysis. To address this problem, SRI has developed and demonstrated a prototype system for geo-location of photos and video frames using hierarchical analysis and exploitation of urban and natural terrain features. The system reduces geo-location workload while improving accuracy.
Bio: Dr. Hui Cheng is the Sr. Technical Manager of the Cognitive and Adaptive Vision Systems Group at SRI International Sarnoff. He received his Ph.D. in Electrical Engineering from Purdue University. His research focuses on image/video understanding, data mining, machine learning, pattern analysis, cognitive systems, and informatics. Dr. Cheng has led a number of programs in the areas of full-motion video, wide-area surveillance video, and web video exploitation, summarization, and indexing. He is also the technical lead in geo-location of metadata-free photos and videos, and in automated behavior analysis and event detection for surveillance, training, and social media understanding. Dr. Cheng has more than 40 publications and fifteen patents. He is a senior member of the IEEE, a member of the IEEE Technical Committee on Multimedia Systems and Applications, and the Chair of the Princeton/Central Jersey Chapter of the IEEE Signal Processing Society.
Building a World-Scale Geometric Database
Abstract: We live in a world of ubiquitous imagery, in which the number of images at our fingertips is growing at a seemingly exponential rate. These images come from a wide variety of sources, including millions of photographers around the world uploading billions and billions of images to social media and photo-sharing websites such as Facebook. Our goal is to reconstruct as much of the world in 3D as we can from this data, and to use the resulting data in applications such as pixel-accurate location recognition -- recognizing the camera matrix of an image in a geographic coordinate system to pixel precision. This talk describes our recent work on large-scale 3D modeling and world-wide camera pose estimation.
Bio: Noah Snavely is an assistant professor of Computer Science at Cornell University, where he has been on the faculty since 2009. He received a B.S. in Computer Science and Mathematics from the University of Arizona in 2003, and a Ph.D. in Computer Science and Engineering from the University of Washington in 2008. Noah works in computer graphics and computer vision, with a particular interest in using vast amounts of imagery from the Internet to reconstruct and visualize our world in 3D, and in creating new tools for enabling people to capture and share their environments. Noah is the recipient of a Microsoft New Faculty Fellowship and an NSF CAREER Award, and has been recognized by Technology Review's TR35.
Aggregating local image descriptors for large-scale image retrieval and classification
Abstract: We address the problems of large scale image retrieval and classification. In both cases an appropriate image representation is important. We present and evaluate different ways of aggregating local image descriptors into a vector and show that the Fisher kernel achieves better performance than the reference bag-of-visual words approach for any given vector dimension.
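As a rough illustration of the Fisher-kernel encoding this talk compares against bag-of-visual-words, here is a toy numpy sketch of its mean-gradient component: each local descriptor is soft-assigned to the components of a diagonal-covariance GMM, and the normalized residuals to each component mean are accumulated. The published representation also includes gradients with respect to the mixture weights and variances, plus power and L2 normalization; the names and sizes below are illustrative only.

```python
import numpy as np

def fisher_vector_means(descriptors, means, sigmas, weights):
    """Simplified Fisher vector: gradient of the GMM log-likelihood
    with respect to the component means only."""
    # soft assignment gamma[n, k] of descriptor n to Gaussian k
    log_p = -0.5 * (((descriptors[:, None, :] - means[None]) / sigmas[None]) ** 2).sum(-1)
    log_p += np.log(weights)[None] - np.log(sigmas).sum(-1)[None]
    log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    n = len(descriptors)
    # accumulate sigma-normalized residuals to each component mean
    fv = (gamma[:, :, None] * (descriptors[:, None, :] - means[None]) / sigmas[None]).sum(0)
    fv /= n * np.sqrt(weights)[:, None]
    return fv.ravel()  # dimension K * D

# toy example: a 4-component GMM over 6-D descriptors
rng = np.random.default_rng(1)
means = rng.normal(size=(4, 6))
sigmas = np.ones((4, 6))
weights = np.full(4, 0.25)
fv = fisher_vector_means(rng.normal(size=(30, 6)), means, sigmas, weights)
```

Note the key contrast with the bag-of-words histogram: for the same number of components K, the encoding is K times D dimensional rather than K dimensional, which is one reason it performs better at a fixed vocabulary size.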
Bio: Cordelia Schmid holds a M.S. degree in Computer Science from the University of Karlsruhe and a Doctorate, also in Computer Science, from the Institut National Polytechnique de Grenoble (INPG). Her doctoral thesis on "Local Greyvalue Invariants for Image Matching and Retrieval" received the best thesis award from INPG in 1996. She received the Habilitation degree in 2001 for her thesis entitled "From Image Matching to Learning Visual Models". Dr. Schmid was a post-doctoral research assistant in the Robotics Research Group at Oxford University in 1996--1997. Since 1997 she has held a permanent research position at INRIA Rhone-Alpes, where she is a research director and directs the INRIA team called LEAR, for LEArning and Recognition in Vision. Dr. Schmid is the author of over a hundred technical publications. She has been an Associate Editor for IEEE PAMI (2001--2005) and for IJCV (since 2004), a program chair of IEEE CVPR 2005 and ECCV 2012, as well as a general chair of IEEE CVPR 2015. In 2006, she was awarded the Longuet-Higgins prize for fundamental contributions in computer vision that have withstood the test of time. She is a Fellow of the IEEE.
A best paper award will be recommended by the program committee during peer review and selected by the workshop chairs. The winner will receive a recognition certificate and a US$500 check sponsored by ObjectVideo.