BayesDB: A Bayesian Database Table

BayesDB, a Bayesian database table, lets users query the probable implications of their tabular data as easily as an SQL database lets them query the data itself. Using the built-in Bayesian Query Language (BQL), users with no statistics training can solve basic data science problems, such as detecting predictive relationships between variables, inferring missing values, simulating probable observations, and identifying statistically similar database entries.


Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments. It provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and an extensive mathematical function library...


This research focuses on building a Living Lab for MIT. The Living Lab will enable members of the community to collect their personal data. The Living Lab platform will analyze and visualize that data and create applications for the rest of the community that run on top of everyone's personal data. A personal data store for everyone at MIT will be created in a privacy-preserving manner...

Interfacing with Big Data Repositories

START, the world's first Web-based question answering system, has been on-line and continuously operating since December, 1993. It has been developed by Boris Katz and his associates of the InfoLab Group at the MIT Computer Science and Artificial Intelligence Laboratory. Unlike information retrieval systems (e.g., search engines), START aims to supply users with "just the right information," instead of merely providing a list of hits...

WikiScout: Generalized Knowledge From Specific Examples

This research works toward delivering access to Big Data using natural language processing. The WikiScout system automatically generates natural language annotations for semistructured data in Wikipedia infoboxes...

Declarative, Graphical Construction of Complex Report Queries

Many use cases for business-oriented databases involve the creation of tailor-made summaries known as "reports". Report development is tedious because multiple SQL queries may be required to generate a single report, because queries may include complex combinations of formulas and aggregate functions (e.g. averages of totals), and because the visual output layout of non-tabular results must be manually defined through the use of templating languages or a graphical form editor.

Data Tamer: a data curation system

Data Tamer is a next-generation data curation system that collects all of the curation components into one integrated system...

SILO: in-memory database

Silo is a new in-memory database that achieves excellent performance and scalability on modern multicore machines...


DBWipes defines a notion of influence that describes how much a particular set of input data points affects an aggregated output result. It uses several algorithms to generate human-readable predicates over the input data that most influence the user-selected outlier results. It aims to help non-technical end users engage in the data analysis process...


CARTILAGE: Adding Flexibility to the Hadoop Skeleton

CARTILAGE is a comprehensive data storage framework built on top of HDFS. CARTILAGE allows users full control over their data storage, including data partitioning, data replication, data layouts, and data placement...


