As part of the Big Data Lecture Series — Fall 2012, Ion Stoica, a Professor at the EECS Department at University of California, Berkeley gave a talk on BDAS (Berkeley Data Analytics Stack). BDAS is an open-source data analytics stack for complex computations on massive data which supports efficient, large-scale in-memory data processing, and allows users and applications to trade between query accuracy, time, and cost.
Data is as useful as the decision it enables. The one most important goal of bigdata is to extract "value" through sophisticated exploratory analysis, and use it as the basis to make decisions such as personalized treatment and advertisment targeting. Data analytics tools must provide low latency computations on massive datasets for both historical and live data.
Today's data analytics tools do provide sophisticated processing on massive data but they are slow. Achieving very low latency (less than 1 second latency) on massive data is hard because data is usually stored over thousands of partitions and even if all data were in memory, it takes 10s of seconds to just scan memory of a large server. This makes these tools less suitable for real time complex computations, such as machine learning algorithms.
Some of the strategies for achieving low latency are data sampling, precomputation, reduce i/o (in memory processing and column store), increase parallelism, and fine grain sharing. Berkely Data Analytics Stack leverages open analytics stack which consists of four layers -- infrastructure layer, storage layer, data processing layer and the application layer. BDAS adds an extra layer between infrastructure and storage for resource management which uses fine grain sharing and thus enables multiple frameworks to share the same infrastructure resulting in increased parallelism. BDAS extends the storage layer to enable better data management by adding the capability of in-memory processing. BDAS also extends the data processing layer by adding the capabilities such as precomputation, in-memory processing and data sampling. In this talk he focussed mostly on these three layers and two systems -- Sparrow : a high throughput, very low latency scheduler, and BlinkDB : an approximate query processing engine that allows users to trade between accuracy, time, and cost.
Mesos is the BDAS resource management layer which allows multiple frameworks to share cluster. Today, people run each instance of framework like Hadoop or MPI in its own statically configured partition. Mesos allows multiple such instances to share resources and data. Mesos virtualizes resources to frameworks by providing fine grain sharing and isolation across frameworks.
At the data processing layer, Spark is a framework built over Mesos which is designed for low-latency and iterative computation on historical data. It provides fault-tolerant and efficient memory abstraction called Resilient Distributed Database (RDDs). Mesos also supports existing data processing frameworks like Hadoop and MPI. Shark is a SQL interface on top of Spark, and Streaming extends Spark to provide streaming functionality and low latency computations on live data.
Cheetah is the data management layer of BDAS. It provides high throughput, fault tolerant in-memory storage with shared RDDs. It allows frameworks other than spark, such as Hadoop and MPI to reliably share data.
All BDAS frameworks share two goals -- increasing parallelism and smaller response time. Sparrow is a de-centralized scheduler designed for high availability, millisecond level latency, and millions of decisions per second. The main idea behind Sparrow is batch sampling. For an n-task jobs, it probs n*d random servers (d >= 1) and picks least loaded n servers to run job’s tasks.
BlinkDB is a query processing framework which allows users to trade between accuracy, time, and cost. When dealing with massive data, even if all of it is in memory, query may take 10s of seconds because just scanning 200-300 GB RAM can take upto 10 seconds. This is too slow for real time processing and even interactive queries. Since exact results are not necessary for many decision making, BlinkDB allows users to specify for each query an error bound and a time bound. A sample query could look like SELECT avg(sessionTime) FROM table WHERE city = ‘SFO’ AND date = ‘04-25-1986’ WITHIN 1 SECONDS, for which the response could be -- “200.85 +- 11.45”. BlinkDB uses the strategy of offline sampling to maintain uniform random samples, and stratified samples on different sets of columns for different granularities. Upon receiving a query, BlinkDB Query Plan invokes the Sample Selection Engine which builds Error Latency profile for predicting the time and error of the query running on different samples. The Sample Selection Engine picks the best sample to satisfy users’ time and accuracy constraints.
In summary, all these components of BDAS work together in order to provide a very low latency computations in massive historical and live data.