You are currently browsing the monthly archive for June 2012.

“Welcome to the Age of Big Data!” The story of “Big Data, Big Impact” is similar in fields as varied as science and sports, advertising and public health: a drift toward data-driven discovery and decision-making.

What is Big Data? “Most of the Big Data surge is data in the wild — unruly stuff like words, images and video on the Web and those streams of sensor data. It is called unstructured data and is not typically grist for traditional databases.”

What’s the potential impact of Big Data? The microscope, invented four centuries ago, allowed people to see and measure things as never before — at the cellular level. It was a revolution in measurement. Data measurement, Professor Brynjolfsson explains, is the modern equivalent of the microscope. Google searches, Facebook posts and Twitter messages, for example, make it possible to measure behavior and sentiment in fine detail and as it happens.

Big Data has its perils, to be sure. With huge data sets and fine-grained measurement, statisticians and computer scientists note, there is increased risk of “false discoveries.” The trouble with seeking a meaningful needle in massive haystacks of data, says Trevor Hastie, a statistics professor at Stanford, is that “many bits of straw look like needles.”
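To see how easily bits of straw can pass for needles, here is a minimal Python sketch (an illustration added here, not taken from the article). It runs many significance tests on pure noise, so every "discovery" is false by construction. The function names, the number of tests, the sample size, and the Bonferroni correction used for comparison are all assumptions chosen for illustration.

```python
import math
import random

def z_test_p_value(sample):
    """Two-sided p-value for H0: mean = 0, assuming a known variance of 1."""
    n = len(sample)
    z = sum(sample) / n * math.sqrt(n)
    # P(|Z| > |z|) for a standard Normal Z
    return math.erfc(abs(z) / math.sqrt(2))

def count_false_discoveries(num_tests=2000, n=50, alpha=0.05, seed=0):
    """Test many pure-noise 'features'; every rejection is a false discovery."""
    rng = random.Random(seed)
    p_values = [
        z_test_p_value([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(num_tests)
    ]
    # Naive rule: call anything with p < alpha a discovery.
    naive = sum(p < alpha for p in p_values)
    # Bonferroni rule: divide alpha by the number of tests performed.
    bonferroni = sum(p < alpha / num_tests for p in p_values)
    return naive, bonferroni

naive_hits, bonferroni_hits = count_false_discoveries()
print("naive discoveries:", naive_hits)
print("Bonferroni discoveries:", bonferroni_hits)
```

With no real signal anywhere, the naive rule should still flag roughly 5% of the tests, while the corrected threshold flags almost none. Bonferroni is only one simple (and conservative) way to guard against this.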

NSF has a new solicitation out on Core Techniques and Technologies for Advancing Big Data Science & Engineering (BIGDATA) (from the post).

As a statistician, I think we should read the following column on big data: Big Data and Better Data by ASA president Robert Rodriguez. There are several points we have to pay attention to:

  • For years, statisticians have been working with large volumes of data in fields as diverse as astronomy, bioinformatics, and data mining. Big Data is different because it is generated on a massive scale by countless online interactions among people, transactions between people and systems, and sensor-enabled machinery. (This point relates to the question above: what is Big Data?)
  • Big Data is newsworthy because it promises to answer big questions. The potential of Big Data lies in innovative ways it can be linked, related, and integrated to provide more detailed and personalized information than is possible with data from a single source. (This point relates to the measurement revolution described above.)
  • We, statisticians, share many skills with data scientists, but what sets us apart, and why is statistical thinking critical to the process? Like other analysts, statisticians look for features in large data, and we also guard against false discovery, bias, and confounding. We build statistical models that explain, predict, and forecast, and we question the assumptions behind our models and qualify the use of our models with measures of uncertainty. We work within the limitations of available data, and we design studies and experiments to produce data with the right information content. In summary, we extract value from data not only by learning from it, but also by understanding its limitations and improving its quality. Better data matters because simply having Big Data does not guarantee reliable answers for Big Questions.

In response to the above column, Roger Peng wrote a nice post, Statistics and the Science Club, in which he explained what statisticians should do in the Big Data age: “Statistics should take a leadership role in this area.”

  • There’s a strong tradition in statistics of being the “outsiders” to whatever field we’re applying our methods to. In many ways, the outsider status is important because it gives us the freedom to be “arbiters” and to ensure that scientists are doing the “right” things.  However, being an arbiter by definition means that you are merely observing what is going on. You cannot be leading what is going on without losing your ability to arbitrate in an unbiased manner.
  • Because now there are data in more places than ever before, the demand for statistics is in more places than ever before. We are discovering that we can either teach people to apply the statistical methods to their data, or we can just do it ourselves! This development presents an enormous opportunity for statisticians to play a new leadership role in scientific investigations because we have the skills to extract information from the data that no one else has (at least for the moment). But now we have to choose between being “in the club” by leading the science or remaining outside the club to be unbiased arbiters.
  • I think as an individual it’s very difficult to be both. However, I think as a field, we desperately need to promote both kinds of people, if only because we are the best people for the job. We need to expand the tent of statistics and include people who are using their statistical training to lead the new science.

By the way, here is a post 5 Hidden Skills for Big Data Scientists.

Update 6/25/2012: There is a post, “A Reflection On Big Data (Guest post)”, from The Geomblog.

Is Big Data just a rebranding of Massive Data? Which is bigger: Big Data, Massive Data, or Very Large Data?

What came out from our discussions is that we believe that Big Data is qualitatively different from the challenges that have come before.  The major change is the increasing heterogeneity of data that is being produced.  Previously, we might have considered data that was represented in a simple form, as a very high-dimensional vector, over which we want to capture key properties.  While Big Data does include such problems, it also includes many other types of data: database tables, text data, graph data, social network data, GPS trajectories, and more.  Big Data problems can consist of a mixture of examples of such data, not all of which are individually “big”, but making sense of which represents a big problem. Addressing these challenges requires not only algorithmics and systems, but machine learning, statistics and data mining insights.

The post also mentioned some interesting related conferences:

  1. MMDS 2012. Workshop on Algorithms for Modern Massive Data Sets @ Stanford University from July 10–13, 2012. The Workshops on Algorithms for Modern Massive Data Sets (MMDS 2012) addressed algorithmic and statistical challenges in modern large-scale data analysis. The goals of this series of workshops are to explore novel techniques for modeling and analyzing massive, high-dimensional, and nonlinearly-structured scientific and internet data sets; and to bring together computer scientists, statisticians, mathematicians, and data analysis practitioners to promote the cross-fertilization of ideas.
  2. Big Data at Large: Applications and Algorithms @ Duke University from June 14–15, 2012.
  3. 2012-13 Program on Statistical and Computational Methodology for Massive Datasets @ SAMSI, September 9–12, 2012.

Update 6/26/2012: From the post The problem with small big data from Simply Statistics, there is one key thing I want to add here:

We need to get more statisticians out into the field both helping to analyze the data; and perhaps more importantly, designing good studies so that useful data are collected in the first place (as opposed to merely “big” data).

Update 7/10/2012: RSSeNews – ‘Big Data’ event now available to view


What’s the difference between statistics and informatics?

Q: We always say that statistics is just about dealing with data, but informatics also extracts knowledge through data analysis. For example, bioinformatics people can get by entirely without biostatistics. What is the essential difference between statistics and informatics?

A: To answer your question, I agree that, overall, statistics can’t do without computers these days. Yet one of the major aspects of statistics is inference, which has nothing to do with computers. Statistical inference is actually what makes statistics a science, because it tells you whether or not your conclusions hold up in other contexts. (From gui11aume.)

Statistics infers from data; informatics operates on data. (From stackovergio.)

What’s the difference between a statistical model and a probability model?

Q: Applied probability is an important branch of probability, and it includes computational probability. Since statistics, as I understand it, uses probability theory to construct models for dealing with data, I am wondering what the essential difference is between a statistical model and a probability model. Does a probability model not need real data? Thanks.

A: A Probability Model consists of the triplet (Ω, F, P), where Ω is the sample space, F is a σ-algebra (events) and P is a probability measure on F.

Intuitive explanation. A probability model can be interpreted as a known random variable X. For example, let X be a Normally distributed random variable with mean 0 and variance 1. In this case the probability measure P is associated with the Cumulative Distribution Function (CDF).

Generalisations. The definition of Probability Model depends on the mathematical definition of probability, see for example Free probability and Quantum probability.

A Statistical Model is a set S of probability models, that is, a set of probability measures/distributions on the sample space Ω.

This set of probability distributions is usually selected for modelling a certain phenomenon from which we have data.

Intuitive explanation. In a Statistical Model, the parameters and the distribution that describe a certain phenomenon are both unknown. An example of this is the family of Normal distributions with mean μ ∈ R and variance σ² ∈ R+, that is, both parameters are unknown and you typically want to use the data set for estimating the parameters (i.e. selecting an element of S). This set of distributions can be chosen on any Ω and F, but, if I am not mistaken, in a real example only those defined on the same pair (Ω, F) are reasonable to consider.

Generalisations. This paper provides a very formal definition of Statistical Model, but the author mentions that “Bayesian model requires an additional component in the form of a prior distribution … Although Bayesian formulations are not the primary focus of this paper”. Therefore the definition of Statistical Model depends on the kind of model we use: parametric or nonparametric. Also, in the parametric setting, the definition depends on how parameters are treated (e.g. Classical vs. Bayesian).

The difference is: in a probability model you know the probability measure exactly, for example a Normal(μ₀, σ₀²), where μ₀ and σ₀² are known parameters, while in a statistical model you consider sets of distributions, for example Normal(μ, σ²), where μ and σ² are unknown parameters.

Neither of them requires a data set, but I would say that a Statistical Model is usually selected for modelling one. (From Procrastinator.)
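The distinction in this answer can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are simulated, and `fit_normal` is a hypothetical helper that selects one element of the Normal family by maximum likelihood. In the probability model everything is known and probabilities can be computed directly; in the statistical model the data pick one member of the family.

```python
import random
from statistics import NormalDist, fmean, pstdev

# Probability model: one fully specified distribution, Normal(0, 1).
# Nothing is estimated; P(X <= 1.96) follows directly from the known CDF.
known = NormalDist(mu=0.0, sigma=1.0)
prob = known.cdf(1.96)

def fit_normal(data):
    """Select one element of the family {Normal(mu, sigma^2)} by maximum likelihood.

    Hypothetical helper for illustration: the MLE of mu is the sample mean and
    the MLE of sigma is the population-style standard deviation (divide by n).
    """
    return NormalDist(mu=fmean(data), sigma=pstdev(data))

# Statistical model: the whole family Normal(mu, sigma^2) with unknown parameters.
# The data set below is simulated (an assumption) so that the truth is known.
rng = random.Random(1)
data = [rng.gauss(2.0, 3.0) for _ in range(5000)]
fitted = fit_normal(data)

print("P(X <= 1.96) under Normal(0, 1):", round(prob, 4))
print("selected member of the family:", fitted)
```

Note that neither object needed data to be defined; the data only enter when one element of the family has to be chosen.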

Update 7/10/2012: If You’re Not A Programmer … You’re Not A Bioinformatician!

  1. An easy way to think about priors on linear regression
  2. Combining priors and downweighting in linear regression
  3. Metropolis Hastings MCMC when the proposal and target have differing support
  4. Slidify: Things are coming together fast
  5. How to Convert Sweave LaTeX to knitr R Markdown: Winter Olympic Medals Example
  6. Testing R Markdown with R Studio and posting it on
  7. Announcing The R markdown Package
  8. Announcing RPubs: A New Web Publishing Service for R
  9. Approximate Bayesian computation
  10. Load Packages Automatically in RStudio
  11. Practical advice for machine learning: bias, variance and what to do next
  12. The overview article on “Approximate Computation and Implicit Regularization for Very Large-scale Data Analysis” associated with the invited talk at the upcoming PODS 2012 meeting is on the arXiv here.
  13. The monograph on “Randomized Algorithms for Matrices and Data” is available in NOW’s “Foundations and Trends in Machine Learning” series here, and it is also available on the arXiv here.
  14. Click here for information (including the slides and video!) on the Tutorial on “Geometric Tools for Identifying Structure in Large Social and Information Networks,” given originally at ICML10 and KDD10 and subsequently at many other places. (The slides are also linked to below.)
  15. The overview chapter on “Algorithmic and Statistical Perspectives on Large-Scale Data Analysis” is finally on the arXiv here; the book in which it will appear is in press; and a video of the associated talk is here.
  16. Recent teaching: Fall 2009: CS369M: Algorithms for Massive Data Set Analysis
  17. Confidence distributions
  18. Making a singular matrix non-singular
  19. Statistics Versus Machine Learning
  20. How to post R code on WordPress blogs
  21. Pro Tips for Grad Students in Statistics/Biostatistics (Part 1)
  22. Pro Tips for Grad Students in Statistics/Biostatistics (Part 2)
  23. Why You Shouldn’t Conclude “No Effect” from Statistically Insignificant Slopes
  24. For those interested in knitr with Rmarkdown to beamer slides
  25. Notes from A Recent Spatial R Class I Gave
  26. Sparse Bayesian Methods for Low-Rank Matrix Estimation and Bayesian Group-Sparse Modeling and Variational Inference – implementation
  27. The Battle of the Bayes
  28. Ockham Workshop, Day 1
  29. Ockham Workshop, Day 2
  30. Ockham Workshop, Day 3
  31. Ockham’s Razor
  32. Occam

I have a reading plan: every night before going to bed, I will read one paper to get its general idea. The paper could come from the statistics, biostatistics, bioinformatics, or machine learning community. Every week, I will post a summary here. I hope you can recommend some papers to me here. Thanks.

This summer, I am teaching an undergraduate stats class, a first course in statistics that covers three units: descriptive statistics, probability, and statistical inference. The course webpage is here.

The following paragraph is from the thesis of Michael Phillip Lesnick. It explains the relationship among the three units:

Recall first that in statistics, we distinguish between descriptive statistics and statistical inference. Descriptive statistics, as the name suggests, is that part of statistics concerned with defining and studying descriptors of data. It involves no probability theory and aims simply to offer tools for describing, summarizing, and visualizing data. Statistical inference, on the other hand, concerns the more sophisticated enterprise of estimating descriptors of an unknown probability distribution from random samples of the distribution. The theory and methods of statistical inference are built on the tools of descriptive statistics: The estimators considered in statistical inference are of course, when stripped of their inferential interpretation, merely descriptors of data.
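A small Python sketch may make the last sentence of that paragraph concrete. The data are simulated and all names and numbers are assumptions for illustration. The same sample mean is first used as a plain descriptor of the data, and then, with a probability model added, as an estimator with a normal-approximation confidence interval.

```python
import math
import random
from statistics import NormalDist, mean, stdev

# Simulated sample (an assumption for illustration): 400 draws from Normal(170, 10^2).
rng = random.Random(42)
sample = [rng.gauss(170.0, 10.0) for _ in range(400)]

# Descriptive statistics: summarize the data at hand, no probability involved.
sample_mean = mean(sample)
sample_sd = stdev(sample)

# Statistical inference: treat the same mean as an estimator of the unknown
# population mean and attach a 95% confidence interval (normal approximation).
z = NormalDist().inv_cdf(0.975)
half_width = z * sample_sd / math.sqrt(len(sample))
ci = (sample_mean - half_width, sample_mean + half_width)

print("descriptor: mean =", round(sample_mean, 2), "sd =", round(sample_sd, 2))
print("estimator: 95% CI for the population mean =", tuple(round(x, 2) for x in ci))
```

The number in the first printed line and the center of the interval in the second are identical; only the interpretation changes.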

Teaching such an introductory course is really challenging. One thing I want to share here is a post written by Dr Nic, which says that we had better use real data collected from the students as a source of data for class examples, exercises, and testing. This is a really good idea. Today I also saw a paper in Significance discussing why statistics lectures confuse students. It’s also a very good reference for stats teachers.

Note: items 4–7 below are from Simply Statistics.
  1. A Personal Perspective on Machine Learning
  2. The differing perspectives of statistics and machine learning
  3. Kernel Methods and Support Vector Machines de-Mystified
  4. I love this article in the WSJ about the crisis at JP Morgan. The key point it highlights is that looking only at the high-level analysis and summaries can be misleading; you have to look at the raw data to see the potential problems. As data become more complex, I think it’s critical we stay in touch with the raw data, regardless of discipline. At least if I miss something in the raw data I don’t lose a couple billion. Spotted by Leonid K.
  5. On the other hand, this article in the Times drives me a little bonkers. It makes it sound like there is one mathematical model that will solve the obesity epidemic. Lines like this are ridiculous: “Because to do this experimentally would take years. You could find out much more quickly if you did the math.” The obesity epidemic is due to a complex interplay of cultural, sociological, economic, and policy factors. The idea you could “figure it out” with a set of simple equations is laughable. If you check out their model this is clearly not the answer to the obesity epidemic. Just another example of why statistics is not math. If you don’t want to hopelessly oversimplify the problem, you need careful data collection, analysis, and interpretation. For a broader look at this problem, check out this article on Science vs. PR. Via Andrew J.
  6. Some cool applications of the raster package in R. This kind of thing is fun for student projects because analyzing images leads to results that are easy to interpret/visualize.
  7. Check out John C.’s really fascinating post on determining when a white-collar worker is great. Inspired by Roger’s post on knowing when someone is good at data analysis.
  8. knitR Performance Report 3 (really with knitr) and dprint
  9. Unix doesn’t follow the Unix philosophy
  10. Advice on writing research articles
  11. knitr Performance Report–Attempt 3
  12. Permutation tests in R
  13. Understanding Bayesian Statistics – By Michael-Paul Agapow
  14. knitr, Slideshows, and Dropbox
  15. Generate LaTeX tables from CSV files (Excel)
  16. The Tomato Genome
  17. Optimization
  18. Sichuan Agricultural University and LC Sciences Uncover the Epigenetics of Obesity
  19. How to Stay Current in Bioinformatics/Genomics
  20. Interactive HTML presentation with R, googleVis, knitr, pandoc and slidy
  21. The R-Podcast Episode 7: Best Practices for Workflow Management
  22. What is the point of statistics and operations research?
  23. Question: C/C++ libraries for bioinformatics?
  24. 5 Hidden Skills for Big Data Scientists
  25. Protocol – Computational Analysis of RNA-Seq
