The Big Data Problem
We define big data as data that is too big, too fast, or too hard for existing tools to process. Here, “too big” means that organizations increasingly have to deal with petabyte-scale collections of data, which come from click streams, transaction records, sensors, and many other sources. “Too fast” means that data is not only big but must also be processed quickly: for example, to detect fraud at a point of sale, or to decide which ad to show a user on a web page. “Too hard” is a catchall for data that doesn’t fit neatly into an existing processing tool, i.e., data that needs more complex analysis than existing tools can readily provide. Examples of the big data problem abound.
On the Internet, many websites now register millions of unique visitors per day. Each of these visitors may access and create a range of content. This can easily amount to tens to hundreds of gigabytes per day (tens of terabytes per year) of accumulated user and log data, even for medium-sized websites. Increasingly, companies want to mine this data to understand the limitations of their sites, improve response times, offer more targeted ads, and so on. Doing this requires tools that can perform complex analytics on data that far exceeds the memory of a single machine, or even of a cluster of machines.
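A quick back-of-envelope calculation shows how such volumes arise. The specific numbers below (visitors, events per visitor, bytes per log record) are illustrative assumptions, not figures from the text:

```python
# Back-of-envelope estimate of log volume for a hypothetical
# medium-sized website; all input numbers are illustrative.
visitors_per_day = 2_000_000   # unique visitors per day (assumed)
events_per_visitor = 50        # page views, clicks, etc. (assumed)
bytes_per_event = 1_000        # ~1 KB per log record (assumed)

daily_bytes = visitors_per_day * events_per_visitor * bytes_per_event
yearly_bytes = daily_bytes * 365

print(f"per day:  {daily_bytes / 10**9:.0f} GB")   # 100 GB
print(f"per year: {yearly_bytes / 10**12:.1f} TB")  # 36.5 TB
```

Even these modest per-record sizes land squarely in the “tens to hundreds of gigabytes per day, tens of terabytes per year” range described above.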
As another example, consider the big data problem as it applies to banks and other financial organizations. These organizations have vast quantities of data about consumer spending habits, credit card transactions, financial markets, and so on. This data is massive: for example, Visa processes more than 35 billion transactions per year; if it records 1 KB of data per transaction, that amounts to 35 terabytes of data per year. Visa, and the large banks that issue Visa cards, would like to use this data in a number of ways: to predict which customers are at risk of default, to detect fraud, to offer promotions, and so on. This requires complex analytics. Additionally, this processing needs to be done quickly and efficiently, and needs to be easy to tune as new models are developed and refined.
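The arithmetic behind the Visa figure is simple to check, using the transaction count and per-transaction size stated above (decimal units, 1 KB = 1,000 bytes):

```python
# Verify the Visa estimate from the text:
# 35 billion transactions/year at ~1 KB each.
transactions_per_year = 35 * 10**9
bytes_per_transaction = 1_000  # ~1 KB, as assumed in the text

total_bytes = transactions_per_year * bytes_per_transaction
print(f"{total_bytes / 10**12:.0f} TB per year")  # 35 TB per year
```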
Consider also the impact of new sensors on our ability to continuously monitor a patient's health. Recent advances in wireless networking, the miniaturization of sensors via MEMS processes, and dramatic improvements in digital imaging technology have made it possible to cheaply deploy wearable sensors that monitor a number of biological signals, even outside the doctor's office. These signals measure the functioning of the heart, brain, circulatory system, and so on. Additionally, accelerometers and touch screens can be used to assess mobility and cognitive function. This creates an unprecedented opportunity for doctors to improve outpatient care, by understanding how patients are progressing outside the doctor's office and when they need to be seen urgently. Additionally, by correlating signals from thousands of different patients, it becomes possible to develop a new understanding of what is normal or abnormal, and of what kinds of signal features indicate potentially serious problems.
Similar challenges arise across most industry sectors today, including healthcare, finance, government, transportation, biotech and drug discovery, insurance, retail, telecommunications, and energy.