We believe the solution to big data is fundamentally multi-disciplinary. Our approach brings together world leaders in parallel architecture, massive-scale data processing, algorithms, machine learning, visualization, and interfaces to collectively identify and address the fundamental technology challenges we face with Big Data.
Our approach focuses on four broad research themes, summarized below:
Computational Platforms
We are building several parallel data processing platforms, including SciDB and BlinkDB, as well as several cloud-based deployment platforms, including FOS and Relational Cloud. The goal of these platforms is to let developers of big data applications write programs much as they would in a single-node computational environment, and then rapidly deploy those applications on tens or hundreds of nodes. Additionally, as the computation and storage requirements of applications change, these platforms should adapt dynamically and elastically to those changes.
Scalable Algorithms
We are developing a range of algorithms designed to deal with very large volumes of data, and to process that data in parallel. These include parallel implementations of known algorithms such as matrix computations, statistical operations like regression, optimization methods like gradient descent, and machine learning algorithms like clustering and classification. In addition, we are developing fundamentally new types of algorithms designed to handle the challenges of Big Data. For example, we are working on sublinear algorithms that can compute a range of statistics, such as estimates of the number of distinct items in a set, using space that is exponentially smaller than the input. Additionally, we are developing new algorithms for encoding, comparing, and searching massive data sets; specific examples include hash-based similarity search on massive-scale data, and algorithms for compressed sensing that provide a new way to encode sparse matrices that arise in a number of scientific applications.
Machine Learning and Understanding
On top of these algorithms, we are deploying a number of novel machine learning applications focused on machine understanding in specific domains. For example, in work on scene understanding in images, we are building tools that automatically label parts of an image, or that classify an image as belonging to one or more categories based on the types of objects that appear in it. As a second example, we are using natural language processing to convert massive quantities of tweets and reviews on the web into structured information about products, restaurants, and services, such as the type of content in a piece of text (e.g., a food review or a rating) and an assessment of its sentiment.
Privacy and Security
Finally, because much of the mining and analysis in a big data context involves sensitive, private information, we are working on technologies and policies for protecting and anonymizing data, and for allowing people to retain control over their data. As an example, in the CryptDB project, we are building a database system that stores data in an encrypted format in the cloud, in such a way that a curious database or system administrator cannot decrypt the data. Users retain the encryption keys for their data, but can execute queries over the encrypted data on the database server, which yields much better performance than sending the data back and decrypting it on the client’s machine.
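The general idea of running a query on a server that never sees plaintext can be sketched with deterministic keyed tokens. This is a simplified illustration under stated assumptions, not CryptDB's actual design: the classes `Client` and `UntrustedServer` and the function `equality_token` are hypothetical names, and HMAC tokens are one-way, so a real system would also store decryptable ciphertexts alongside them. Deterministic tokens also reveal which rows share a value, a trade-off any such scheme has to weigh.

```python
import hashlib
import hmac


def equality_token(key: bytes, column: str, value: str) -> str:
    """Deterministic keyed token: equal values yield equal tokens.

    Without the key, the server cannot recompute or invert the token.
    """
    message = f"{column}\x00{value}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


class UntrustedServer:
    """Stores only tokens and answers equality queries on them."""

    def __init__(self):
        self.rows = []  # each row maps a column name to a token

    def insert(self, token_row: dict) -> None:
        self.rows.append(token_row)

    def count_where(self, column: str, token: str) -> int:
        # The match is done on tokens; no plaintext or key is needed here.
        return sum(
            1 for row in self.rows
            if hmac.compare_digest(row.get(column, ""), token)
        )


class Client:
    """Holds the secret key and rewrites inserts and queries into tokens."""

    def __init__(self, key: bytes, server: UntrustedServer):
        self.key = key
        self.server = server

    def insert(self, row: dict) -> None:
        self.server.insert(
            {col: equality_token(self.key, col, val) for col, val in row.items()}
        )

    def count_where(self, column: str, value: str) -> int:
        token = equality_token(self.key, column, value)
        return self.server.count_where(column, token)


if __name__ == "__main__":
    server = UntrustedServer()
    client = Client(b"example-secret-key", server)
    client.insert({"name": "alice", "city": "Boston"})
    client.insert({"name": "bob", "city": "Boston"})
    client.insert({"name": "carol", "city": "Cambridge"})
    print(client.count_where("city", "Boston"))
```

The filtering work happens on the server, and only the small result crosses the network, which is the performance argument made above for querying encrypted data in place rather than shipping it back to the client.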
Work in these four areas is coupled with application experts in finance (Professor Andrew Lo), medicine (Professor John Guttag), science (Professor Michael Stonebraker), education (through a relationship with the MITx initiative), and transportation (Professor Balakrishnan and Professor Madden).