From now on, I start a serie of Q-A Sections which carry important information in the form of questions and answers.

Q1: What is the difference between what statisticians and machine learning researchers do?

The best answer that I have found so far can be found in
Statistical Modeling: The Two Cultures” by Leo Breiman (Statistical Science, 16:199-231, 2001).
According to this, statisticians like to start by making modeling assumptions about how the data is generated (e.g., the response is a noise added to the linear combination of the predictor variables), while in machine learning people use algorithm models and treat the data mechanism as unknown. He estimates that (back in 2001) less than 2% of statisticians work in the realm when the data mechanism is considered as unknown.
It seems that there are two problem with the data model approach.
One is that the this approach does not address the ultimate question which is making good predictions: if the data does not fit the model, this approach has nothing to offer (it does not make sense to apply a statistical test if the assumptions are not valid).
The other problem is that as data become more complex, data models become more cumbersome. Then why bother? With complex models we lose the advantage of easy interpretability, not talking about the computational complexity of fitting such models.
The increased interest in Bayesian modeling with Markov Chain Monte Carlo is viewed as the response of the statistical community to this problem. True enough, this approach might be able to scale to complex data, but does this address the first issue? Are not there computationally cheaper alternatives that can achieve the same prediction power?
He characterizes the machine learning approach, as the pragmatic approach: You have to solve a prediction problem, hence take it seriously: Estimate the prediction error and choose the algorithm that gives a predictor with the better accuracy (but let’s not forget about data snooping!).
But the paper offers more. Amongst other things it identifies three important recent lessons:

1. The multiplicity of good models: If you have many variables, there can be many models of similar prediction accuracy. Use them all by combining their predictions instead of just picking one. This should increase accuracy, reduce instability (sensitivity to perturbations of the data). Boosting, bagging, aggregation using exponential weights are relevant recent popular buzzwords.
2. The Occam dilemma: Occam’s razor tells you to choose the simplest predictor. Aggregated predictors don’t look particularly simple. But aggregation seems to be the right choice otherwise. I would think that Occam’s razor tells you only that you should have a prior preference to simple functions. I think this is rather well understood by now.
3. Bellman: dimensionality — curse or blessing: Many features are not bad per se. If your algorithm is prepared to deal with the high-dimensional inputs (SVMs, regularization, random forests are mentioned) then extracting many features can boost accuracy considerably.

In summary, I like the characterization of the difference between (classical) statistical approaches and machine learning. However, I wonder if these differences are still as significant as they were (must have been) in 2001 when the article was written and if the differences will become smaller over time. Then it will be really difficult to answer the question on the difference between the statistical and the machine learning approaches.

Q2: What kind of Pitfalls of optimality in statistics?

I was reading a little bit about robust statistics, as we are in the process of putting together a paper about entropy estimation where robustness comes up as an issue. While searching on the net for the best material to understand this topic (I am thinking about posting another article about what I have found), I have bumped into a nice paper (downloadable from here) by Peter J. Huber, one of the main figures in robust statistics, where he talks about a bunch of pitfalls around pursuing optimality in statistics. Huber writes eloquently — he gives plenty of examples, motivates definitions. He is just great. I can only recommend this paper or his book. Now, what are the pitfalls he writes about? He distinguishes 4 types with the following syndromes:

• The fuzzy concepts syndrome: sloppy translation of concepts into mathematics. Think about uniform vs. non-uniform convergence (sloppy asymptotics). In statistics a concrete example is the concept of efficiency which is defined in a non-uniform manner with respect to the estimable parameters, which allows for (weird) “super-efficient” estimators that pay special attention to some distinguished element of the parameter-space.
• The straitjacket syndrome: the use of overly restrictive side conditions, such as requiring that an estimator is unbiased or equivariant (equivariant estimates in high dimensions are inadmissible in very simple situations). In Bayesian statistics another example might be the convenient but potentially inappropriate conjugate priors.
• The scapegoat syndrome: confusing the model with reality (offering the model for the gods of statistics instead of the real thing, hoping that they will accept it). The classic example is the Eddington-Fisher argument. Eddington advocated the mean-absolute-deviation (MAD) instead of the root-mean-square (RMS) deviation as a measure of scale. Fisher argued that MAD estimates are highly inefficient (converge slowly) relative to the RMS deviation estimates if the sample comes from a normal distribution. Tukey has shown that the situation gets reversed even under small deviations from a normal model. The argument that under narrow conditions one estimator is better than some other should not be even made. Another example is perhaps classical optimal design and the fundamentalist approach in Bayesian statistics. Of course, there is nothing wrong with assumptions, but the results should be robust.
• The souped-up car syndrome: by optimizing for speed we can end up with an elaborate gas-guzzler. Optimizing for one quantity (efficiency) may degrade another one (robustness). Practical solutions must find a balance between such contradicting requirements.

These syndromes are not to hard identify in machine learning research. Wear protective gear as needed!

Q3: Discriminative vs. generative learning: which one is more efficient?

I just came across a paper by Philip M. Long, Rocco Servedio and Hans Ulrich Simon. (Here is a link to the paper titled “Discriminative Learning can Succeed where Generative Learning Fails”.) The question investigated in the paper is the following:
We are in a classification setting and the learning problem is defined by a pair of jointly distributed random variables, (X,Y), where Y can take on the values +1 and -1. Question: How many iid copies of this pair does an algorithm need to (i) find a classifier that yields close to optimal performance with high probability (ii) find two score functions, one trained with the positive examples only, the other with the negative examples only such that the sign of the difference of the two score functions gives a classifier that is almost optimal with high probability?
The result in the paper is that there exists a class of distributions, parameterized by d (determining the dimension of samples) such that there is a discriminative algorithm (tuned to this class) that can learn the correct classifier with only $2log(2/\delta)$ samples, while the number of samples required for any generative classifier is at least d.
Since it is clear that the requirements of generative learning are stronger than those of discriminative learning, it follows that in the above framework discriminative learning is strictly “easier” than generative learning.
The distribution concentrates on O(d) samples and the main idea is that the joint knowledge of positive and negative samples suffices for the easy identification of the target distribution (hence, classifier), while knowing only either the positive or negative examples alone is insufficient. Two special inputs, both marked for easy of recognition, determine the full distribution jointly but one of the inputs is in the positive set, the other is in the negative set and the knowledge of only one of them is insufficient to learning the otherwise “difficult” to learn distribution.
Although the construction given in the paper is simple and works as intended, it is arguably “artificial and contrived”, as it was also noted by the authors. In particular, does a similar result hold when we consider a continuous domain of a fixed dimension, and/or we restrict the class of algorithms to consistent ones? Further, the example shows more the limitation of algorithms that learn from positive and negative samples independently of each other than the limitation of generative algorithms (generative algorithms traditionally refer to learners that estimate the joint distribution of the inputs and outputs).
The question of whether generative or discriminative algorithms are more efficient are particularly interesting in light of an earlier paper by Andrew Y. Ng and Michael Jordan (“On Discriminative vs. Generative classifiers: A comparison of logistic regression and naive Bayes”, NIPS-14, 2002). In this paper the authors compare one particular discriminative algorithm with another particular algorithm that is “generative”. The two algorithms are logistic regression and naive Bayes and together they form what is called a “Generative-Discriminative” pair. The meaning of this is that while naive Bayes maximizes the total joint log-likelihood, $\sum_{i=1}^n \log p_\theta(x_i,y_i)$ over the samples, logistic regression maximizes the total conditional likelihood, $\sum_{i=1}^n \log p_\theta(y_i|x_i)$ over the same parametric model. In the case of these two particular algorithms the parametric model is written in terms of $p(x|y)$ and asserts independence of the individual components of the feature-vector $x$. (For continuous spaces the individual feature distributions, $p(x_k|y)$, are modeled as Gaussians with unknown mean and variance.) The agnostic setting is considered (the input distribution is not restricted). Since both learners pick a classifier from the very same class of classifiers and the discriminative learner in the limit of an infinite number of samples converges to the best classifier from this class, it follows that the ultimate loss suffered by the discriminative learner is never higher than that suffered by the generative learner. Hence, it seems that if the naive Bayes assumption made by the generative method is not met, the discriminative method can have an edge — at least ultimately (open issue: give an example that shows positive separation!). However, this is just half of the story: the generative model may converge faster! In particular, the authors state an upper bound on the convergence of loss for the generative model that scales with $\sqrt{1/n \log(d)}$ ($d$ is the number of components of $x$), while as follows from standard uniform convergence results, the same convergence rate for the discriminative method is $\sqrt{d/n \log(d/n)}$. They argue that this result follows since the hypothesis class has a VC-dimension of $d$. Note the difference in the way the two bounds scale with $d$, the dimension of the input space: In the case of the discriminative algorithm the scaling is (log-)linear with $d$, while in the case of the generative algorithm it is logarithmic in $d$. (Strictly speaking, the upper bound would call for a proof since logistic regression is not a risk-minimization algorithm for the 0/1 loss and the cited theory has been worked out for risk-minimization algorithms.) Since comparing upper bounds is not really appropriate, they point to existing lower bounds that show that the worst-case sample complexity is lower bounded by $O(d)$. (My favorite paper on this is by Antos and Lugosi: “Strong minimax lower bounds for learning”: link.) Hence, the conclusion of Ng and Jordan is that, contrary to the widely held belief, generative learning can sometimes be more efficient than discriminative learning; at least when the number of features is large compared to the number of samples and the ultimate loss of the generative learner is not much higher than that of the discriminative learner. They also performed experiments on UCI datasets to validate this claim. The experiments show that the performance of naive Bayes is indeed often better when the number of examples is smaller. Unfortunately, the figures show the loss as a function of the number of samples only and hence validate the theory only half-way since for a full validation we should know how the dimension of the dataset influences the performance.
A similar conclusion to that of Ng and Jordan was shown to hold in an earlier KDD-97 paper titled “Discriminative vs. Informative Learning” by Y. Dan Rubinstein and Trevor Hastie (link) who did a case study with simulated and real examples.
The convergence rate comparison looks quite intriguing at the first sight. However, some more thinking reveals that the story is far from being finished. To see this consider the “rate” function $r(n,d) = L_{gen}$ if $n\le D-1$, $r(n,d) =\min(L_{gen},\sqrt{d/n})$ if $n\ge D$. Imagine that $r(n,d)$ is an upper bound on the loss of some learner, where $L_{gen}$ is the loss of the generative learner. The rate of convergence then is $\sqrt{d/n}$, not contradicting with the lower bounds, but clearly, this discriminative learner will be competitive with the generative learner.
So is discriminative learning more (sample) efficient than generative learning? It seems that sometimes having both the positive and negative samples at hand can be useful. However, generative learning might be advantageous when the model considered fits well the data. Hence, the case is not yet settled. Many interesting open questions!