Big Data promises a better world.  A world where data will be used to make better decisions, from how we invest money to how we manage our healthcare to how we educate our children and manage our cities and resources.  These changes are enabled by a proliferation of new technologies and tools that have the ability to measure, monitor, record, combine and query all kinds of data about us and the world around us -- but how will that data get used and for what purpose?  Who owns the data?   How do we assure accountability for misuse? 

Just as Big Data holds out many promises, it raises many questions and challenges when it comes to privacy.  We must think carefully about the role of technology and how we design and engineer next-generation systems to appropriately protect and manage privacy, in particular within the context of how policy and laws are developed to protect personal privacy.  Decisions about how to address privacy in big data systems will impact almost everyone as we push to make more data open and available, both inside organizations and publicly.  Governments around the world are pushing themselves and private companies to make data transparent and accessible.  Some of this will be personal data.  We will need new tools and technologies for analysis, for anonymizing data, for running queries over encrypted data, for auditing and tracking information, and for managing and sharing our own personal data in the future.  Because issues of data privacy will be relevant across so many aspects of our lives, including banking, insurance, medicine, public health, and government, we believe it is important to collectively address the major challenges of managing data privacy in a big data world.

Workshop: Big Data Privacy
Exploring the Future Role of Technology in Protecting Privacy

The goal of this workshop [held in June 2013] is to bring together a select group of thought leaders from academia, industry and government to focus on the future of Big Data and some of the unique issues and challenges around data privacy.  Our aim is to think longer term (5+ years) and to better understand and help define the role of technology in protecting and managing privacy, particularly when large and diverse data sets are collected and combined.  We will use the workshop to collectively articulate major challenges and begin to lay out a roadmap for future research and technology needs.

This workshop was supported by the MIT Big Data Initiative at CSAIL and by a grant from the Alfred P. Sloan Foundation.




MIT White House Big Data Privacy Workshop
Advancing the State of the Art in Technology and Practice

The White House Office of Science and Technology Policy (OSTP) and MIT co-hosted a public workshop entitled “Big Data Privacy: Advancing the State of the Art in Technology and Practice” on March 3, 2014. The event was part of a series of workshops on big data and privacy organized by the MIT Big Data Initiative at CSAIL and the MIT Information Policy Project. The workshop was also the first in a series of events being held across the country in response to President Obama’s call for a review of privacy issues in the context of increased digital information and the computing power to process it.

The workshop convened key stakeholders and thought leaders from across academia, government, industry, and civil society for a thoughtful dialogue on the future role of technology in protecting and managing privacy. The discussion concentrated on core technical challenges associated with big data applications and on providing a theoretical grounding for privacy considerations in large-scale information systems. Participants also explored the state of the art in privacy-protecting technologies and how those technologies can be applied to a diversity of big data applications.

Topics included:

    Big Data Opportunities and Risks
    State of the Art of Privacy Protection
    Review of Emerging Privacy Technologies
    Industry, Government, Academic Roundtable

The workshop was co-hosted by the White House Office of Science and Technology Policy, the Massachusetts Institute of Technology, the MIT Big Data Initiative at CSAIL, the MIT Information Policy Project at CSAIL, and the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL).



Defining “Privacy” in a Big Data World

David Vladeck - Georgetown University Law Center

Professor Vladeck formerly served as Director of the Bureau of Consumer Protection at the US Federal Trade Commission.

How can we make sure that we harness the power of big data effectively without sacrificing personal privacy completely?



In the financial industry, privacy is a hotly contested issue.  The problem is that the financial system protects its intellectual property through trade secrecy rather than patents, so we equate data privacy with profitability.  This is "big data versus big dollars".



In a report, “Personal Data: Emergence of a New Asset Class,” prepared for the World Economic Forum, we proposed a "New Deal on Data" framework with a vision of ownership rights, personal data stores, and peer-to-peer contract law.  The report findings helped to shape the EU Human Rights on Data document and the US Consumer Privacy Bill of Rights.  Among the proposals in the report was the notion that a combination of informed consent and contract law could allow for auditing of data about oneself.  The personal data would also be accompanied by metadata showing provenance, permissions, context, and ownership.  At MIT, an open source version of such a scheme has been created in the form of the Open Personal Data Store (openPDS) [ref:]



I think there is great promise in the marriage of Big Data and Differential Privacy. Big data brings with it a promise for research and society, but often the data contains detailed sensitive information about individuals, making privacy a real issue. Heuristic privacy protection techniques were designed for an information regime very different from today's, and many failures of these techniques were demonstrated in the last decade.  Differential privacy provides a provable guarantee for individuals and provides good utility on larger datasets.
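
As a rough illustration of the kind of guarantee described above, the following sketch applies the Laplace mechanism, one standard way of achieving differential privacy, to a counting query. The function names, the sample data, and the choice of epsilon are assumptions made for this example only; they are not taken from the workshop material.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one zero-mean Laplace sample with the given scale.

    The difference of two independent exponential samples with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the records that satisfy `predicate`.

    A counting query has sensitivity 1 (adding or removing one person
    changes the true count by at most 1), so Laplace noise with scale
    1/epsilon gives epsilon-differential privacy for this single query.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    # Hypothetical data: ages of eight study participants.
    ages = [23, 35, 41, 52, 67, 70, 29, 44]
    rng = random.Random()
    noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng)
    print(f"noisy count of participants aged 40 or over: {noisy:.2f}")
```

The noise scale depends only on epsilon, not on the number of records, so the relative error of the answer shrinks as the dataset grows. This is one way to read the remark about good utility on larger datasets. A smaller epsilon gives stronger privacy and a noisier answer.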



The issues relating to data privacy in the real world exhibit some similarities and some differences across domains.  In the medical environment, there is medical data, genomic data, or other research data based on private information about patients and subjects.  Functional uses of the private information include finding correlations between a disease and a geographic region, or between a genome and a disease.  In an advertising context, social media firms focus on the clicks and browsing habits of their users, assessing trends by region, age, gender, or other distinguishing features of the user population.  Private information includes an individual’s personal profile and “friends”.  Functional uses of the data could entail recommending certain things to certain groups of users or producing ads targeted to users based on their social networks.  Different applications will have different requirements for the level of privacy needed.  The big challenge in any of these applications is: "How do you really trade off privacy for utility?"



Much of the research in the area of data privacy focuses on controlling access to the data.  But, as we have seen, it is possible to break these kinds of systems.  You can in fact infer private information from anonymized data sets; examples include the re-identification of medical records, exposure of sexual orientation on Facebook, and breaking the anonymity of the Netflix prize dataset.  What we are proposing is an accountability approach to privacy for cases in which security approaches are insufficient.  The accountability approach is a supplement to, and not a replacement for, upfront prevention.
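
To make the re-identification risk concrete, here is a minimal sketch of a linkage attack: an "anonymized" table that still carries quasi-identifiers is joined against a public table that carries names. All records, field names, and the choice of quasi-identifiers below are hypothetical and chosen only for illustration; this is not a reconstruction of any of the incidents mentioned above.

```python
from collections import defaultdict

# Hypothetical "anonymized" medical records: names removed, quasi-identifiers kept.
ANONYMIZED = [
    {"zip": "02139", "birth_date": "1961-07-12", "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_date": "1975-03-02", "sex": "M", "diagnosis": "diabetes"},
    {"zip": "02139", "birth_date": "1975-03-02", "sex": "M", "diagnosis": "hypertension"},
]

# Hypothetical public records (for example, a voter roll) that include names.
PUBLIC = [
    {"name": "Alice Example", "zip": "02139", "birth_date": "1961-07-12", "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_date": "1975-03-02", "sex": "M"},
    {"name": "Carol Example", "zip": "02140", "birth_date": "1980-11-30", "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")


def link_records(anonymized, public, keys=QUASI_IDENTIFIERS):
    """Return {name: anonymized record} for every unambiguous match.

    A match counts only when exactly one anonymized record and exactly
    one public record share the same quasi-identifier values.
    """
    anonymized_index = defaultdict(list)
    for row in anonymized:
        anonymized_index[tuple(row[k] for k in keys)].append(row)

    public_index = defaultdict(list)
    for person in public:
        public_index[tuple(person[k] for k in keys)].append(person)

    matches = {}
    for key, people in public_index.items():
        candidates = anonymized_index.get(key, [])
        if len(people) == 1 and len(candidates) == 1:
            matches[people[0]["name"]] = candidates[0]
    return matches


if __name__ == "__main__":
    for name, record in link_records(ANONYMIZED, PUBLIC).items():
        print(f"{name} -> {record['diagnosis']}")
```

In this toy data only one person is re-identified, because the other two anonymized records share the same quasi-identifier values and so cannot be told apart. Access control alone does not prevent this kind of inference once both tables are available to the same party.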



In 2012, hackers extracted 6.5 million hashed passwords from LinkedIn’s database and were able to reverse most of them. This is a problem we are all familiar with: confidentiality leaks. There are many reasons why data leaks, and for the purpose of this talk I’m going to group them into two threats. First, consider the layout of an application that has data stored in a database.  The first threat is attacks on the database server.  These are attacks in which an adversary gets full access to the database server but does not modify the data; it only reads it.  The second threat is more general and includes any attacks, passive or active, on any part of the servers.  For example, hackers today can infiltrate the application systems and even obtain root access.  How do we protect data confidentiality in the face of these threats?
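
The password example can be illustrated with a short sketch. It assumes the leaked passwords were hashed with a fast, unsalted function (SHA-1 is used here for illustration; the text above does not say which scheme was used), and contrasts that with a salted, deliberately slow hash from the Python standard library. The function names and the word list are hypothetical. The sketch covers only the password case, not the broader server-compromise threats described above.

```python
import hashlib
import hmac
import os
from typing import Optional


def unsalted_hash(password: str) -> str:
    """Hash a password with plain SHA-1 (assumed weak scheme, for illustration)."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest()


def dictionary_attack(leaked_hashes, wordlist):
    """Map each leaked hash that appears in the word list back to its password.

    Without a salt, one precomputed table works against every account.
    """
    lookup = {unsalted_hash(word): word for word in wordlist}
    return {h: lookup[h] for h in leaked_hashes if h in lookup}


def salted_hash(password: str, salt: Optional[bytes] = None, iterations: int = 200_000):
    """Return (salt, digest) using PBKDF2-HMAC-SHA256 with a per-password salt."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, digest


def verify(password: str, salt: bytes, digest: bytes, iterations: int = 200_000) -> bool:
    """Recompute the salted hash and compare in constant time."""
    _, candidate = salted_hash(password, salt, iterations)
    return hmac.compare_digest(candidate, digest)


if __name__ == "__main__":
    wordlist = ["123456", "password", "letmein"]
    leaked = [unsalted_hash("password"), unsalted_hash("a long unusual passphrase")]
    print(dictionary_attack(leaked, wordlist))
```

With a random salt per password, the attacker can no longer reuse one lookup table across all accounts, and the iteration count makes each guess more expensive. Neither measure helps if the adversary can read plaintext data elsewhere on the servers, which is the more general threat raised above.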