“Welcome to the Age of Big Data!” The story of “Big Data, Big Impact” is similar in fields as varied as science and sports, advertising and public health: a drift toward data-driven discovery and decision-making.

What is Big Data? “Most of the Big Data surge is data in the wild — unruly stuff like words, images and video on the Web and those streams of sensor data. It is called unstructured data and is not typically grist for traditional databases.”

What’s the potential impact of Big Data? The microscope, invented four centuries ago, allowed people to see and measure things as never before — at the cellular level. It was a revolution in measurement. Data measurement, Professor Brynjolfsson explains, is the modern equivalent of the microscope. Google searches, Facebook posts and Twitter messages, for example, make it possible to measure behavior and sentiment in fine detail and as it happens.

Big Data has its perils, to be sure. With huge data sets and fine-grained measurement, statisticians and computer scientists note, there is increased risk of “false discoveries.” The trouble with seeking a meaningful needle in massive haystacks of data, says Trevor Hastie, a statistics professor at Stanford, is that “many bits of straw look like needles.”
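To make the “false discoveries” risk concrete, here is a minimal simulation sketch. It is my own illustration, not something from the article: the function name, the sample sizes, and the choice of a simple z-test with a Bonferroni correction are all assumptions. Every “feature” is pure noise, so any feature that passes the significance test is a bit of straw that looks like a needle.

```python
import math
import random


def count_false_discoveries(n_features=1000, n_per_group=50, alpha=0.05, seed=0):
    """Compare two groups on many pure-noise features and count 'significant' ones.

    Hypothetical illustration: both groups are drawn from the same N(0, 1)
    distribution, so every rejection of the null hypothesis is a false discovery.
    Returns (naive_count, bonferroni_count).
    """
    rng = random.Random(seed)
    naive = 0
    bonferroni = 0
    for _ in range(n_features):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        # Two-sample z statistic; the variance is known to be 1 by construction.
        z = (sum(a) / n_per_group - sum(b) / n_per_group) / math.sqrt(2.0 / n_per_group)
        # Two-sided p-value under the standard normal distribution.
        p = math.erfc(abs(z) / math.sqrt(2.0))
        if p < alpha:
            naive += 1
        # Bonferroni correction: divide the threshold by the number of tests.
        if p < alpha / n_features:
            bonferroni += 1
    return naive, bonferroni


if __name__ == "__main__":
    naive, bonferroni = count_false_discoveries()
    print("significant at 0.05 without correction:", naive)
    print("significant after Bonferroni correction:", bonferroni)
```

With 1,000 noise features and a 0.05 threshold, roughly 50 features should look “significant” by chance alone, while the corrected threshold should remove almost all of them. The more features we measure, the more such needles we should expect to find.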

NSF has a new solicitation out on Core Techniques and Technologies for Advancing Big Data Science & Engineering (BIGDATA) (from the post).

As a statistician, I think we should read the following column on big data: Big Data and Better Data by ASA president Robert Rodriguez. There are several points we have to pay attention to:

  • For years, statisticians have been working with large volumes of data in fields as diverse as astronomy, bioinformatics, and data mining. Big Data is different because it is generated on a massive scale by countless online interactions among people, transactions between people and systems, and sensor-enabled machinery. (This point relates to the question above: what is Big Data?)
  • Big Data is newsworthy because it promises to answer big questions. The potential of Big Data lies in innovative ways it can be linked, related, and integrated to provide more detailed and personalized information than is possible with data from a single source. (This point relates to the measurement revolution described above.)
  • We statisticians share many skills with data scientists, but what sets us apart, and why is statistical thinking critical to the process? Like other analysts, statisticians look for features in large data—and we also guard against false discovery, bias, and confounding. We build statistical models that explain, predict, and forecast—and we question the assumptions behind our models and qualify the use of our models with measures of uncertainty. We work within the limitations of available data—and we design studies and experiments to produce data with the right information content. In summary, we extract value from data not only by learning from it, but also by understanding its limitations and improving its quality. Better data matters because simply having Big Data does not guarantee reliable answers for Big Questions.

In response to the above column, Roger Peng wrote a nice post, Statistics and the Science Club, explaining what statisticians should do in the Big Data age: “Statistics should take a leadership role in this area.”

  • There’s a strong tradition in statistics of being the “outsiders” to whatever field we’re applying our methods to. In many ways, the outsider status is important because it gives us the freedom to be “arbiters” and to ensure that scientists are doing the “right” things.  However, being an arbiter by definition means that you are merely observing what is going on. You cannot be leading what is going on without losing your ability to arbitrate in an unbiased manner.
  • Because now there are data in more places than ever before, the demand for statistics is in more places than ever before. We are discovering that we can either teach people to apply the statistical methods to their data, or we can just do it ourselves! This development presents an enormous opportunity for statisticians to play a new leadership role in scientific investigations because we have the skills to extract information from the data that no one else has (at least for the moment). But now we have to choose between being “in the club” by leading the science or remaining outside the club to be unbiased arbiters.
  • I think as an individual it’s very difficult to be both. However, I think as a field, we desperately need to promote both kinds of people, if only because we are the best people for the job. We need to expand the tent of statistics and include people who are using their statistical training to lead the new science.

By the way, here is a related post: 5 Hidden Skills for Big Data Scientists.

Update 6/25/2012: There is a post, “A Reflection On Big Data (Guest post)”, from The Geomblog.

Is Big Data just a rebranding of Massive Data? Which is bigger: Big Data, Massive Data, or Very Large Data?

What came out from our discussions is that we believe that Big Data is qualitatively different from the challenges that have come before.  The major change is the increasing heterogeneity of data that is being produced.  Previously, we might have considered data that was represented in a simple form, as a very high-dimensional vector, over which we want to capture key properties.  While Big Data does include such problems, it also includes many other types of data: database tables, text data, graph data, social network data, GPS trajectories, and more.  Big Data problems can consist of a mixture of examples of such data, not all of which are individually “big”, but making sense of which represents a big problem. Addressing these challenges requires not only algorithmics and systems, but machine learning, statistics and data mining insights.

The post also mentioned some interesting related conferences:

  1. MMDS 2012. Workshop on Algorithms for Modern Massive Data Sets @ Stanford University from July 10–13, 2012. The Workshops on Algorithms for Modern Massive Data Sets (MMDS 2012) addressed algorithmic and statistical challenges in modern large-scale data analysis. The goals of this series of workshops are to explore novel techniques for modeling and analyzing massive, high-dimensional, and nonlinearly-structured scientific and internet data sets; and to bring together computer scientists, statisticians, mathematicians, and data analysis practitioners to promote the cross-fertilization of ideas.
  2. Big Data at Large: Applications and Algorithms @ Duke University from June 14–15, 2012.
  3. 2012-13 Program on Statistical and Computational Methodology for Massive Datasets @ SAMSI, September 9–12, 2012.

Update 6/26/2012: From the post The problem with small big data from Simply Statistics, there is one key thing I want to add here:

We need to get more statisticians out into the field both helping to analyze the data; and perhaps more importantly, designing good studies so that useful data are collected in the first place (as opposed to merely “big” data).

Update 7/10/2012: RSSeNews – ‘Big Data’ event now available to view