What Makes Big Visual Data Hard? Alyosha Efros, CMU

Alyosha Efros, an associate professor at the Robotics Institute and the Computer Science Department at Carnegie Mellon University, gave a talk in the Big Data Lecture Series -- Fall 2012 at MIT CSAIL. In his talk he discussed how data-driven techniques can harness big visual data to tackle computer vision problems that are very hard to model parametrically.

Referring to “The Unreasonable Effectiveness of Data” (Halevy, Norvig, and Pereira, 2009), he argued that while parts of our world such as physics, chemistry, and astronomy can be explained by elegant mathematics, much of it -- psychology, genetics, economics, visual understanding -- cannot. For problems that resist mathematical modeling, only huge amounts of data can help. Citing Google as an example, he explained how Big Data has driven great advances in fields such as speech recognition and machine translation. Even simple algorithms, given enough data, can achieve unreasonable effectiveness.
We are already seeing a data deluge. There are an estimated 3.5 trillion photographs in the world, of which 10% have been taken in the past 12 months.  Facebook alone reports 6 billion photo uploads per month.  Every minute, 72 hours of video are uploaded to YouTube. Cisco estimates that in the next few years, visual data (photos and video) will account for over 85% of total internet traffic.  
Unfortunately, effective computational methods for making sense of this mass of visual data do not yet exist. Visual data is difficult to handle. Unlike text, which is clean, segmented, compact, one-dimensional, and indexable, visual content is noisy, unsegmented, high-entropy, and often multi-dimensional. Visual data is the Internet's "digital dark matter" [Perona, 2010] -- it's just sitting there! The central problem is that there is no good measure of similarity for visual data: two images can look essentially the same without a single matching pixel.
Current visual similarity is an extension of text similarity. Images are indexed on visual words (each visual word is a small patch in the image), and similarity is computed by matching visual words. This technique works for near-duplicate images, but once lighting, viewpoint, and so on change, it simply breaks down.
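The visual-word indexing described above can be sketched as a toy bag-of-visual-words pipeline: quantize each patch descriptor to its nearest codebook centroid, build a histogram of word counts, and compare histograms. The function names and the tiny 2-D descriptors are illustrative assumptions; real systems use high-dimensional descriptors such as SIFT, large learned vocabularies, and inverted indices.

```python
import numpy as np

def visual_word_histogram(descriptors, codebook):
    """Quantize each patch descriptor to its nearest 'visual word'
    (codebook centroid) and build a normalized histogram of counts."""
    # Squared distance from every descriptor to every codeword.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    words = d2.argmin(axis=1)                      # nearest visual word per patch
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / (hist.sum() or 1.0)              # normalize (guard empty image)

def cosine_similarity(h1, h2):
    """Compare two images by the cosine of their word histograms."""
    return float(h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2) + 1e-12))
```

Because matching happens only at the level of quantized patches, two photos of the same scene under different lighting can fall into different visual words, which is exactly the failure mode noted above.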
Understanding visual correspondence is a challenging problem and the solution is -- add more and more data. Once we put a lot of data in the system, even basic distance metrics (applied on patches) start making a lot of sense. Adding lots and lots of visual data helps for more common scenes but there are still going to be many rare kind of images for which we still need to find a way to understand visual correspondences. One way is to score different patches differently. For most rare images, there are some patches that distinguish them from other images; once we know those patches that appear in most other images, we can conclude that they are not so important and thus can be given a lower weight. It would allow the system to find the visual correspondence based on the visually important patches in the image.
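The down-weighting idea above is analogous to idf in text retrieval: a visual word that occurs in nearly every image tells us little, so its matches should count for less. A minimal sketch, assuming images are already represented as visual-word histograms (the function names are mine, not from the talk):

```python
import numpy as np

def idf_weights(word_histograms):
    """Down-weight visual words that appear in many images, analogous
    to inverse document frequency in text retrieval."""
    present = np.asarray(word_histograms) > 0   # which images contain each word
    df = present.sum(axis=0)                    # document frequency per word
    n = len(word_histograms)
    return np.log((n + 1) / (df + 1))           # smoothed idf: rare -> large weight

def weighted_similarity(h1, h2, w):
    """Cosine similarity after re-weighting each visual word by its idf."""
    a, b = h1 * w, h2 * w
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```

With weights like these, a match on a distinctive patch (say, an unusual ornament) contributes far more to similarity than a match on a ubiquitous one (sky, pavement).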
As an application of this approach, he used Google Street View data for the entire city of Paris to ask "What makes Paris look like Paris?". The goal was to identify visual elements that capture Paris: elements that are frequent (occur often in Paris) and discriminative (are rarely found outside Paris). While the idea is very similar to tf-idf, what makes it much harder for visual data is that, unlike text, where the unit of the tf-idf index is a word, the unit of a visual element is unknown. To solve this, he described a machine learning approach that starts by clustering initial patches, finds the clusters that are mostly Parisian, and then iteratively refines the clusters to make them more Parisian until convergence.
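The iterative refinement might be sketched as follows for a single cluster. This is a toy version under stated assumptions: the scoring rule (close to the cluster mean in the Paris set, far from the non-Paris set) and all names are illustrative, not the exact method from the work, which trains discriminative classifiers per element.

```python
import numpy as np

def refine_cluster(pos, neg, seed_idx, k=5, iters=3):
    """Iteratively refine one candidate visual element.
    pos: patch descriptors from Paris; neg: descriptors from other cities.
    Keeps the k Paris patches that are close to the current cluster mean
    (frequent) yet far from any non-Paris patch (discriminative)."""
    center = pos[seed_idx]
    members = np.array([seed_idx])
    for _ in range(iters):
        d_pos = np.linalg.norm(pos - center, axis=1)  # closeness to the element
        # Per-patch distance to the nearest non-Paris patch.
        d_neg = np.linalg.norm(pos[:, None, :] - neg[None, :, :], axis=2).min(axis=1)
        score = d_neg - d_pos                         # discriminative and frequent
        members = np.argsort(-score)[:k]              # keep the top-k Paris patches
        center = pos[members].mean(axis=0)            # re-estimate the element
    return members, center
```

Each pass makes the cluster "more Parisian": patches that also occur in other cities score low and drop out, and the re-estimated mean drifts toward the distinctive element.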
He concluded by emphasizing that Big Visual Data is one of the biggest kinds of data out there. It is not easy to handle -- it's not clear what a visual unit is, and there are no good distance metrics -- so to make sense of it we need interdisciplinary solutions from research communities such as vision, learning, systems, databases, and theory.