As part of the Big Data Lecture Series -- Fall 2012, Google’s Chris Olston gave a talk on how to manage the processing of large data sets. In the talk he gives an overview of his work on large-scale data processing at Yahoo! Research. He begins by introducing two data processing systems: Pig, a dataflow programming environment and Hadoop-based runtime, and Nova, a workflow manager for Pig/Hadoop. The rest of the talk focuses on debugging, and looks at what can be done before, during, and after execution of a data processing operation.
For most Internet companies such as Google and Yahoo, the entire infrastructure can be represented as a pipeline with three layers: Ingestion, Storage & Processing, and Serving. The most complex and important of the three is the Storage & Processing layer. It can itself be viewed as a stack of sublayers: a scalable file system such as GFS at the bottom, a framework for distributed sorting and hashing (map-reduce) above the file system, a dataflow programming framework above map-reduce, and a workflow manager at the top. This talk mostly focuses on the top two layers.
Pig is a high-level dataflow language on top of map-reduce, used at Yahoo and many other companies. Pig has a high-level imperative syntax and is designed to make it easy for programmers to build complex pipelines: programmers prefer breaking a large data processing task into smaller pipes and gluing them together, and Pig is designed for exactly that. Experimental data shows that, on average, a 3300-line map-reduce program can be written in about 200 lines of Pig, and what takes roughly 4000 minutes to code in map-reduce takes only about 160 minutes in Pig. There are also debugging tools such as Illustrate, which generates example data and lets users interactively inspect the output at different points in the pipeline. It is used before running a Pig job to get a feel for what the job will do, so certain kinds of mistakes can be caught early and cheaply. The algorithm behind the example generator performs a combination of sampling and synthesis to balance several key properties of the example data it produces: realism, conciseness, and completeness.
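To give a feel for the dataflow style described above, here is a minimal sketch in plain Python of the kind of pipeline Pig expresses (load, filter, group, aggregate). The function names and record layout are illustrative, not Pig's actual API; each stage corresponds loosely to a Pig Latin operator noted in the comments.

```python
# Hypothetical sketch of a Pig-style dataflow as composable Python
# operators. Names (load, filter_rows, group_by, aggregate) are
# illustrative, not Pig's real syntax.

def load(records):
    # In Pig this role is played by: LOAD 'file' AS (user, url);
    return list(records)

def filter_rows(rows, pred):
    # Analogous to: FILTER rows BY pred
    return [r for r in rows if pred(r)]

def group_by(rows, key):
    # Analogous to: GROUP rows BY key
    groups = {}
    for r in rows:
        groups.setdefault(key(r), []).append(r)
    return groups

def aggregate(groups, agg):
    # Analogous to: FOREACH grouped GENERATE key, agg(rows)
    return {k: agg(v) for k, v in groups.items()}

clicks = load([
    {"user": "a", "url": "x.com"},
    {"user": "a", "url": "y.com"},
    {"user": "b", "url": "x.com"},
])
kept = filter_rows(clicks, lambda r: r["url"] != "y.com")
per_user = aggregate(group_by(kept, lambda r: r["user"]), len)
# per_user == {"a": 1, "b": 1}
```

The point of the pipe-and-glue style is visible even at this scale: each stage is a small, reusable piece, and the pipeline is built by composition rather than by writing one monolithic map-reduce job.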
Nova is a workflow manager on top of Pig. A workflow manager matters because large-scale data processing is continuous in nature, and the scheduling of workflow modules must account for that. Nova abstracts continuously arriving data (e.g., logs, user clicks) so that the static processing layer (Pig/map-reduce) can refer to the dynamic data as it comes in. It provides a high-level abstraction for scheduling different workflow modules (at different rates) and for defining dependencies between modules.
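One way to picture Nova's handling of continuous data is a module that remembers how far into its input it has already read, so each scheduled run processes only the newly arrived records. The sketch below is an assumption-laden simplification (the `Channel`/`Module` names and cursor mechanism are invented for illustration), not Nova's actual design.

```python
# Hypothetical sketch of incremental, Nova-style processing: a module
# keeps a cursor into its input channel and processes only the delta
# of records that arrived since its last run. All names are illustrative.

class Channel:
    """A continuously growing input, e.g. a click log."""
    def __init__(self):
        self.records = []

    def append(self, batch):
        self.records.extend(batch)

class Module:
    """A workflow module that runs periodically over new data only."""
    def __init__(self, source, fn):
        self.source = source
        self.fn = fn
        self.cursor = 0          # how far we have already processed
        self.output = []

    def run(self):
        delta = self.source.records[self.cursor:]   # only new records
        self.cursor = len(self.source.records)
        self.output.extend(self.fn(r) for r in delta)

logs = Channel()
normalizer = Module(logs, str.upper)

logs.append(["click1", "click2"])
normalizer.run()                 # processes the first two records
logs.append(["click3"])
normalizer.run()                 # processes only the new record
# normalizer.output == ["CLICK1", "CLICK2", "CLICK3"]
```

Each record is processed exactly once even though the module runs repeatedly, which is the essence of letting a static processing layer consume dynamic data.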
One of the challenging problems is debugging, because data passes through many subsystems, each with a different query interface, a different metadata representation, a different underlying model (some operate on files, some on records, some on workflows), and so on. It is extremely difficult to maintain consistency, so it is essential to factor debugging out of the subsystems. Ibis is an independent system that takes care of all the metadata management: every data processing subsystem ships its metadata to Ibis, which ingests and integrates it and exposes a single query interface for all metadata queries. Ibis thus provides a uniform view to users, factors out the metadata management code, and decouples metadata lifetime from data/subsystem lifetime.
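The factoring-out idea can be sketched as a central store that accepts simple metadata facts from any subsystem and answers uniform queries over all of them. This is a toy model under assumed names (`MetadataStore`, the fact fields, the example entities), not Ibis's real schema or API.

```python
# Hypothetical sketch of an Ibis-style central metadata store: every
# subsystem ships flat (subsystem, entity, key, value) facts, and a
# single query interface spans all of them. Names are illustrative.

class MetadataStore:
    def __init__(self):
        self.facts = []

    def ingest(self, subsystem, entity, **attrs):
        # Subsystems push their metadata here instead of keeping it locally.
        for k, v in attrs.items():
            self.facts.append({"subsystem": subsystem, "entity": entity,
                               "key": k, "value": v})

    def query(self, **where):
        # One uniform query interface, regardless of which subsystem
        # produced the metadata.
        return [f for f in self.facts
                if all(f.get(k) == v for k, v in where.items())]

store = MetadataStore()
store.ingest("hadoop", "job_42", status="failed")
store.ingest("pig", "script_7", launched="job_42")

failed = store.query(key="status", value="failed")
# failed[0]["entity"] == "job_42"
```

Because the store outlives any single job or subsystem, metadata lifetime is decoupled from data lifetime, which is exactly the property the talk attributes to Ibis.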
Another challenging problem is dealing with differing data and process granularities. Data granularity can vary from a web page, to a table, to a row, to a cell; process granularity can vary from a workflow, to a Pig script, to a Pig job, to a map-reduce program, to a map-reduce task. It is very hard to make an inference when the recorded relationship is at one granularity and the query is at another, so it is important to capture provenance data across the workflow. While there is no one-size-fits-all solution, a good approach is to record provenance at the finest granularity at all levels; this carries a lot of overhead, however, so some smart domain-specific techniques need to be employed.
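The finest-granularity approach can be illustrated by keeping row-level provenance links and rolling them up when a coarser, table-level question is asked. The data structures and table names below are invented for illustration; they are not from the talk.

```python
# Hypothetical sketch: provenance recorded at the finest granularity
# (output row -> set of input rows), rolled up on demand to answer a
# coarser table-level query. All names are illustrative.

# Each output row records exactly which input rows it derives from.
row_prov = {
    ("out_table", 0): {("in_a", 3), ("in_a", 7)},
    ("out_table", 1): {("in_b", 1)},
}

def table_provenance(out_table):
    """Which input *tables* did out_table derive from?"""
    tables = set()
    for (tbl, _row), sources in row_prov.items():
        if tbl == out_table:
            tables.update(src_tbl for src_tbl, _src_row in sources)
    return tables

# table_provenance("out_table") == {"in_a", "in_b"}
```

The roll-up is cheap, but storing a source set per output row is where the overhead mentioned above comes from, which is why domain-specific shortcuts are needed in practice.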
Inspector Gadget and RubySky are tools designed to help programmers debug a workflow at runtime. Inspector Gadget rewrites a Pig script by inserting special UDFs that look like no-ops to Pig. These functions observe the data as it flows by and coordinate to analyze different behaviors, such as finding the crash culprit or checking row-level integrity. Since not all behaviors can be analyzed with Inspector Gadget, RubySky complements it: a scripting framework with built-in debugging support.
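The no-op-UDF idea can be sketched as a pass-through wrapper around a pipeline stage: it records what flows by but returns the data unchanged, so the pipeline's semantics are untouched. The `observe` helper and the dedup stage below are hypothetical, standing in for Inspector Gadget's injected UDFs.

```python
# Hypothetical sketch of Inspector Gadget's instrumentation style:
# wrap a stage in a pass-through observer that counts rows in and out
# without modifying them (a "no-op" from the pipeline's perspective).
# Names are illustrative.

def observe(stage, stats):
    def wrapped(rows):
        out = stage(rows)
        stats["in"] = stats.get("in", 0) + len(rows)
        stats["out"] = stats.get("out", 0) + len(out)
        return out                      # data passes through unchanged
    return wrapped

stats = {}
# An example stage: drop duplicate rows (order-preserving).
dedupe = observe(lambda rows: list(dict.fromkeys(rows)), stats)

result = dedupe(["a", "b", "a"])
# result == ["a", "b"]; stats == {"in": 3, "out": 2}
```

Comparing `in` and `out` counts across stages is one way such observers help with row-level integrity checks, e.g. spotting a stage that silently drops records.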
To conclude, the talk gives an overview of Olston’s work on large-scale data processing at Yahoo! Research. It introduces two data processing systems, Pig and Nova, elaborates on workflow management and debugging, and looks at what can be done before, during, and after execution of a data processing operation.