
The following is from Quora, on Kaggle. I think it is very useful for data scientists working on a project.

Question:

To do well in a competition, clearly many aspects are important.
But, what do you think helped you (or a top competitor you know) do better than others?

At a high level, I’m curious whether you approach a competition with a one-size-fits-all approach or try to develop a particular insight.

More specifically, I’m interested in your thoughts on utility of the following:
– bringing more data to the party
– feature engineering
– data prep/normalization
– superior knowledge of subtleties of ML
– know-how of tackling predictive modeling problems
– proprietary tools
– picking a problem you had particular insight about
– picking less competitive competition(s)
– forming a good team
– seek specialist’s advice
– persistence
– luck

Thanks for asking me to answer this question (I guess at least one person thinks I am a top kaggle competitor!).  Anyone please feel free to correct anything inaccurate or off base here.

This is a tough question to answer, because much like any competitive endeavor, any given Kaggle competition requires a unique blend of skills and several different factors.  In some competitions, luck plays a large part.  In others, an element that you had not considered at all will play a large part.

For example, I was first and/or second for most of the time that the Personality Prediction Competition [1] ran, but I ended up 18th, due to overfitting in the feature selection stage, something that I had never encountered before with the method I used.  A good post on some of the seemingly semi-random shifts that happen at the end of a competition can be found on the Kaggle blog [2].

Persistence, Persistence, and more Persistence

You have outlined some key factors to success.  Not all of them are applicable to all competitions, but finding the ones that do apply is key.  In this, persistence is very important.  It is easy to become discouraged when you don’t get into the top 5 right away, but it is definitely worth it to keep trying.  In one competition, I think that I literally tried every single published method on a topic.

In my first ever Kaggle competition, the Photo Quality Prediction [3] competition, I ended up in 50th place, and had no idea what the top competitors had done differently from me.

I managed to learn from this experience, however, and did much better in my second competition, the Algorithmic Trading Challenge [4].

What changed the result from the Photo Quality competition to the Algorithmic Trading competition was learning and persistence.  I did not really spend much time on the former competition, and it showed in the results.

Expect to make many bad submissions that do not score well.  You should absolutely be reading as much relevant literature (and blog posts, etc), as you can while the competition is running.  As long as you learn something new that you can apply to the competition later, or you learn something from your failed submission (maybe that a particular algorithm or approach is ill-suited to the data), you are on the right track.

This persistence needs to come from within, though.  In order to make yourself willing to do this, you have to ask yourself why you are engaging in a particular competition.  Do you want to learn?  Do you want to gain opportunities by placing highly?  Do you just want to prove yourself?  The monetary reward in most Kaggle competitions is not enough to motivate a significant time investment, so unless you clearly know what you want and how to motivate yourself, it can be tough to keep trying.  Does rank matter to you?  If not, you have the luxury of learning about interesting things that may or may not impact score, but you don’t if you are trying for first place.

The Rest of the Factors

Now that I have addressed what I think is the single most important factor (persistence), I will address the rest of your question:

1.  The most important data-related factor (to me) is how you prepare the data, and what features you extract.  Algorithm selection is important, but much less so.  I haven’t really seen the use of any proprietary tools among top competitors, although a couple of first place finishers have used open-source tools that they coded/maintain.

2.  I have had poor results with external data, typically.  Unless you notice someone on the leaderboard who has a huge amount of separation from the rest of the pack (or a group that has separation), it is unlikely that anyone has found “killer” external data.  That said, you should try to use all the data you are given, and there are often innovative ways to utilize what you are given to generate larger training sets.  An example is the Benchmark Bond Competition [5], where the competition hosts released two datasets because the first one could be reverse-engineered easily.  Using both more than doubled the available training data (this did not help the score, and I did not use it in the final model, but it is an illustration of the point).

3.  Initial domain-specific knowledge can be helpful (some bond pricing formulas, etc, helped me in the Benchmark Bond competition), but it is not critical, and what you need can generally be picked up by learning while you are competing.  For example, I learned NLP methods while I competed in the Hewlett Foundation ASAP Competition.  That said, you definitely need to quickly learn the relevant domain-specific elements that you don’t know, or you will not really be able to compete in most competitions.

4.  Picking a less competitive competition can definitely be useful at first.  The research competitions tend to have fewer competitors than the ones with large prizes.  Later on, I find it useful to compete in more competitive competitions because it forces you to learn more and step outside your comfort zone.

5.  Forming a good team is critical.  I have been lucky enough to work with great people on two different competitions (ASAP and Bond), and I learned a lot from them.  People tend to be split into those that almost always work alone and those that almost always team up, but it is useful to try to do both.  You can learn a lot from working in a team, but working on your own can make you learn things that you might otherwise rely on a teammate for.

6.  Luck plays a part as well.  In some competitions, .001% separates 3rd and 4th place, for example.  At that point, it’s hard to say whose approach is “better”, but only one is generally recognized as a winner.  A fact of Kaggle, I suppose.

7.  The great thing about machine learning is that you can apply similar techniques to almost any problem.  I don’t think that you need to pick problems that you have a particular insight about or particular knowledge about, because frankly, it’s more interesting to do something new and learn about it as you go along.  Even if you have a great insight on day one, others will likely think of it, but they may do so on day 20 or day 60.

8.  Don’t be afraid to get a low rank.  Sometimes you see an interesting competition, but think that you won’t be able to spend much time on it, and may not get a decent rank.  Don’t worry about this.  Nobody is going to judge you!

9.  Every winning Kaggle entry is the combination of dozens of small insights.  There is rarely one large aha moment that wins you everything.  If you do all of the above, make sure you keep learning, and keep working to iterate your solution, you will do well.

Learning is Fun?

I think that the two main elements that I stressed here are persistence and learning.  I think that these two concepts encapsulate my Kaggle experience nicely, and even if you don’t win a competition, as long as you learned something, you spent your time wisely.

###############################################################################

eQTL analysis regresses each gene’s expression against each SNP, in order to find those regulatory elements. And eQTL uses “normal” samples, right? (By normal I mean “no disease,” like the samples in the 1000 Genomes Project.)

GWAS compares SNPs between normal(control) and disease(test) samples, trying to find out those higher-frequency variants enriched for diseases.

linkage mapping/recombination mapping/positional cloning – relies on known markers (typically SNPs) that are close to the gene responsible for a disease or trait, so that the trait segregates with the marker within a family. Works great for high-penetrance, single-gene traits and diseases.

QTL mapping/interval mapping – for quantitative traits like height that are polygenic. Same as linkage mapping except the phenotype is continuous and the markers are put into a scoring scheme to measure their contribution – i.e. “marker effects” or “allelic contribution”. Big in agriculture.

GWAS/linkage disequilibrium mapping – score thousands of SNPs at once from a population of unrelated individuals. Measure association with a disease or trait with the presumption that some markers are in LD with, or actually are, causative SNPs.

So linkage mapping and QTL mapping are similar in that they rely on Mendelian inheritance to isolate loci. QTL mapping and GWAS are similar in that they typically measure association in terms of log-odds along a genetic or physical map and do not assume one gene or locus is responsible. And finally, linkage mapping and GWAS are both concerned with categorical traits and diseases.

Linkage studies are performed when you have pedigrees of related individuals and a phenotype (such as breast cancer) that is present in some but not all of the family members. These individuals could be humans or animals; linkage in humans is studied using existing families, so no breeding is involved. For each locus, you tabulate cases where parents and children who do or don’t show the phenotype also have the same allele. Linkage studies are the most powerful approach when studying highly penetrant phenotypes, which means that if you have the allele you have a strong probability of exhibiting the phenotype. They can identify rare alleles that are present in small numbers of families, usually due to a founder mutation. Linkage is how you find an allele such as the mutations in BRCA1 associated with breast cancer.

Association studies are used when you don’t have pedigrees; here the statistical test is a logistic regression or a related test for trends. They work when the phenotype has much lower penetrance; they are in fact more powerful than linkage analysis in those cases, provided you have enough informative cases and matched controls. Association studies are how you find common, low penetrance alleles such as the variations in FGFR2 that confer small increases in breast cancer susceptibility.

In The Old Days, neither association tests nor linkage tests were “genome-wide”; there wasn’t a technically feasible or affordable way to test the whole genome at once. Studies were often performed at various levels of resolution as the locus associated with the phenotype was refined. Studies were often performed with a small number of loci chosen because of prior knowledge or hunches. Now the most common way to perform these studies in humans is to use SNP chips that measure hundreds of thousands of loci spread across the whole genome, thus the name GWAS. The reason you’re testing “the whole genome” without sequencing the whole genome of each case and control is an important point that is a separate topic; if you don’t yet know how this works, start with the concept of Linkage Disequilibrium. I haven’t encountered the term GWLS myself, but I think it’s safe to say that this is just a way to indicate that the whole genome was queried for linkage to a phenotype.
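The association test at the heart of a GWAS can be sketched in a few lines. Below is a minimal Pearson chi-squared test of allelic association at a single SNP (Python; the allele counts are invented purely for illustration):

```python
# Hypothetical allele counts at a single SNP (invented numbers):
#            allele A   allele a
cases    = [120, 80]
controls = [ 90, 110]

# Pearson chi-squared test of association on the 2x2 table
observed = [cases, controls]
row = [sum(r) for r in observed]               # row totals (cases, controls)
col = [sum(c) for c in zip(*observed)]         # column totals (allele A, allele a)
n = sum(row)

chi2 = sum((observed[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
           for i in range(2) for j in range(2))
print(f"chi2 = {chi2:.2f}")   # compare to 3.84, the 5% critical value for 1 df
```

In an actual GWAS the same test is repeated across hundreds of thousands of SNPs, which is why a stringent genome-wide significance threshold (commonly 5×10⁻⁸) replaces the nominal 0.05 level.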

Genomic Convergence of Genome-wide Investigations for Complex Traits

###############################################################################

The following comes from Khader Shameer:

The following articles were really useful for me to understand the concepts around GWAS.

I would recommend the following reviews to understand the concept and methods. Most of these reviews refer to the major studies, and specific details can be obtained from the individual papers. But you can get an overall idea of the concept, the statistical methods, and the expected results of a GWAS from these review articles.

How to Interpret a Genome-wide Association Study

An easy-to-read review article that starts with basic concepts and discusses future prospects of GWAS: Genome-wide association studies and beyond.

A detailed introduction to basic concepts of GWAS from the perspective of vascular disease: Genome-wide Association Studies for Atherosclerotic Vascular Disease and Its Risk Factors

Great overview of the current state of GWAS studies: Genomewide Association Studies and Assessment of the Risk of Disease

Detailed overview of statistical methods: Prioritizing GWAS Results: A Review of Statistical Methods and Recommendations for Their Application

For a bioinformatics perspective, Jason Moore et al.’s review is a good start: Bioinformatics Challenges for Genome-Wide Association Studies

Soumya Raychaudhuri’s review provides an overview of various approaches to interpreting variants from GWAS: Mapping rare and common causal alleles for complex human diseases.

A tutorial on statistical methods for population association studies

Online Resources: I would recommend starting with the GWAS page at Wikipedia, followed by the NIH FAQ on GWAS, the NHGRI Catalog of GWAS, dbGaP, the GWAS Integrator, and related questions at BioStar.

##############################################################################

For introductory material, the new blog Genomes Unzipped has a couple of great posts: (From Neilfws)

### What’s the difference between statistics and informatics?

Q: We always say that statistics is just dealing with data. But we also know that informatics is also getting knowledge from data analysis. For example, bioinformatics people can totally go without biostatistics. I want to know what is the essential difference between statistics and informatics.

A: To answer your question, I agree that overall, statistics can’t do without computers these days.  Yet one of the major aspects of statistics is inference, which has nothing to do with computers.  Statistical inference is actually what makes statistics a science, because it tells you whether or not your conclusions hold up in other contexts.—From gui11aume.

Statistics infers from data; informatics operates on data.—From stackovergio

### What’s the difference between statistical model and probability model?

Q: Applied probability is an important branch of probability, including computational probability. Since statistics uses probability theory to construct models to deal with data, as I understand it, I am wondering: what is the essential difference between a statistical model and a probability model? Is it that a probability model does not need real data? Thanks.

A: A Probability Model consists of the triplet (Ω, F, P), where Ω is the sample space, F is a σ-algebra (the events) and P is a probability measure on F.

Intuitive explanation. A probability model can be interpreted as a known random variable X. For example, let X be a Normally distributed random variable with mean 0 and variance 1. In this case the probability measure P is associated with the Cumulative Distribution Function (CDF).

Generalisations. The definition of Probability Model depends on the mathematical definition of probability, see for example Free probability and Quantum probability.

A Statistical Model is a set S of probability models, that is, a set of probability measures/distributions on the sample space Ω.

This set of probability distributions is usually selected for modelling a certain phenomenon from which we have data.

Intuitive explanation. In a Statistical Model, the parameters and the distribution that describe a certain phenomenon are both unknown. An example of this is the family of Normal distributions with mean μ ∈ ℝ and variance σ² ∈ ℝ⁺, that is, both parameters are unknown, and you typically want to use the data set to estimate them (i.e. to select an element of S). This set of distributions can be chosen on any Ω and F, but, if I am not mistaken, in a real example only those defined on the same pair (Ω, F) are reasonable to consider.

Generalisations. This paper provides a very formal definition of Statistical Model, but the author mentions that “Bayesian model requires an additional component in the form of a prior distribution … Although Bayesian formulations are not the primary focus of this paper”. Therefore the definition of Statistical Model depends on the kind of model we use: parametric or nonparametric. Also, in the parametric setting, the definition depends on how parameters are treated (e.g. Classical vs. Bayesian).

The difference is: in a probability model you know the probability measure exactly, for example a Normal(μ₀, σ₀²) where μ₀ and σ₀² are known parameters, while in a statistical model you consider sets of distributions, for example Normal(μ, σ²) where μ and σ² are unknown parameters.

None of them require a data set, but I would say that a Statistical model is usually selected for modelling one.—From Procrastinator.
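The distinction can also be sketched computationally (a minimal Python sketch with simulated data; the particular numbers are invented): a probability model is fully specified up front, while a statistical model leaves its parameters to be estimated from data.

```python
import random
import statistics

random.seed(0)

# Probability model: the distribution is fully known, e.g. N(0, 1).
# Any quantity can be computed from it directly; no data are required.
known_mean, known_var = 0.0, 1.0

# Statistical model: the family {N(mu, sigma^2)} with unknown parameters.
# Data are used to select one member of the family (here via MLE).
data = [random.gauss(5, 2) for _ in range(10_000)]   # "observed" sample
mu_hat = statistics.fmean(data)
sigma2_hat = statistics.pvariance(data)              # MLE uses the 1/n variance

print(f"mu_hat = {mu_hat:.2f}, sigma2_hat = {sigma2_hat:.2f}")
```

With 10,000 draws from N(5, 4), the estimates land close to the true mean 5 and variance 4, i.e. fitting selects roughly the right element of the set S.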

Update 7/10/2012: If You’re Not A Programmer … You’re Not A Bioinformatician !

Information Geometry is applying differential geometry to families of probability distributions, and so to statistical models. Information does however play two roles in it: Kullback-Leibler information, or relative entropy, features as a measure of divergence (not quite a metric, because it’s asymmetric), and Fisher information takes the role of curvature.

One very nice thing about information geometry is that it gives us very strong tools for proving results about statistical models, simply by considering them as well-behaved geometrical objects. Thus, for instance, it’s basically a tautology to say that a manifold is not changing much in the vicinity of points of low curvature, and changing greatly near points of high curvature. Stated more precisely, and then translated back into probabilistic language, this becomes the Cramer-Rao inequality, that the variance of a parameter estimator is at least the reciprocal of the Fisher information. As someone who likes differential geometry, and now is interested in statistics, I find this very pleasing.
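The Cramér-Rao statement above can be checked numerically. The sketch below (plain Python, simulated data) computes the Fisher information for the mean of a normal with known variance and confirms by Monte Carlo that the sample mean attains the bound:

```python
import random
import statistics

random.seed(1)

n, mu, sigma = 50, 3.0, 2.0
fisher_info = n / sigma**2        # Fisher information for mu when sigma is known
cramer_rao = 1 / fisher_info      # lower bound on Var(mu_hat), equals sigma^2 / n

# Estimate Var(sample mean) by Monte Carlo over many repeated samples
means = [statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
         for _ in range(20_000)]
var_hat = statistics.pvariance(means)

print(f"Cramer-Rao bound: {cramer_rao:.4f}, observed variance: {var_hat:.4f}")
```

The observed variance of the estimator matches the reciprocal of the Fisher information (here 4/50 = 0.08), as the inequality requires with equality for this efficient estimator.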

As a physicist, I have always been somewhat bothered by the way statisticians seem to accept particular parametrizations of their models as obvious and natural, and build those parameterization into their procedures. In linear regression, for instance, it’s reasonably common for them to want to find models with only a few non-zero coefficients. This makes my thumbs prick, because it seems to me obvious that if I regressed on arbitrary linear combinations of my covariates, I have exactly the same information (provided the transformation is invertible), and so I’m really looking at exactly the same model — but in general I’m not going to have a small number of non-zero coefficients any more. In other words, I want to be able to do coordinate-free statistics. Since differential geometry lets me do coordinate-free physics, information geometry seems like an appealing way to do this. There are various information-geometric model selection criteria, which I want to know more about; I suspect, based purely on this disciplinary prejudice, that they will out-perform coordinate-dependent criteria.

[From Information Geometry]

The following is from the abstract of the tutorial given by Shun-ichi Amari (RIKEN Brain Science Institute) at Algebraic Statistics 2012.

We give fundamentals of information geometry and its applications. We often treat a family of probability distributions to understand stochastic phenomena in the world. When such a family includes n free parameters, it is parameterized by a real vector of n dimensions. This is regarded as a manifold, where the parameters play the role of a coordinate system. A natural question arises: what is the geometrical structure to be introduced in such a manifold? Geometrical structure gives, for example, a distance measure between two distributions and a geodesic line connecting two distributions. The second question is how useful the geometry is for understanding properties of statistical inference and designing new algorithms for inference.

The first question is answered by the invariance principle such that the geometry should be invariant under coordinate transformations of random variables. More precisely, it should be invariant by using sufficient statistics as random variables. It is surprising that this simple criterion gives a unique geometrical structure, which consists of a Riemannian metric and a family of affine connections which define geodesics.

The unique Riemannian metric is proved to be the Fisher information matrix. The invariant affine connections are not limited to the Riemannian (Levi-Civita) connection but include the exponential and mixture connections, which are dually coupled with respect to the metric. The connections are dually flat in typical cases such as exponential and mixture families.
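For concreteness, the Fisher information metric referred to here is, in a coordinate system θ = (θ¹, …, θⁿ), the standard definition:

```latex
g_{ij}(\theta) \;=\; \mathbb{E}_{\theta}\!\left[
  \frac{\partial \log p(x;\theta)}{\partial \theta^{i}}\,
  \frac{\partial \log p(x;\theta)}{\partial \theta^{j}}
\right]
```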

A dually flat manifold has a canonical divergence function, which in our case is the Kullback-Leibler divergence. This implies that the KL-divergence is induced from the geometrical flatness. Moreover, there exist two affine coordinate systems, one is the natural or canonical parameters and the other is the expectation parameters in the case of an exponential family. They are connected by the Legendre transformation. A generalized Pythagorean theorem holds with respect to the canonical divergence and the pair of dual geodesics. A generalized projection theorem is derived from it.

These properties are useful for elucidating and designing algorithms of statistical inference. They are used not only for evaluating the higher-order characteristics of estimation and testing, but also for elucidating machine learning, pattern recognition and computer vision. We further study the procedures of semi-parametric estimation together with estimating functions. It is also applied to the analysis of neuronal spiking data by decomposing the firing rates of neurons and their higher-order correlations orthogonally.

The dually flat structure is useful for optimization of various kinds. A manifold need not be connected with probability distributions. The invariance criterion does not work in such cases. However, a convex function plays a fundamental role in such a case, and we obtain a Riemannian metric together with two dually coupled flat affine connections, connected by the Legendre transformation. The Pythagorean theorem and projection theorem again play a fundamental role in applications of information geometry.

Q: [From Mathangi Gopalakrishnan] I am generating different sets of data and, within each data set, performing some calculations based on the fit of the data. The “for” loops do the job, but the simulations take days to end.

How can the efficiency of the code be improved? The R-help list suggested vectorizing the operations, but how can that be done for this problem?

The idea for this code is

for (i in 1:1000) {
  # Generate a set of data, fit it, and keep the resulting calculations.
  # (generate_data, fit_model, compute_stats stand in for the user's own routines.)
  dat <- generate_data()
  fit <- fit_model(dat)
  outer_stats <- compute_stats(fit)
  for (j in 1:1000) {
    # Use the fitted values to generate a new data set,
    # then compare its calculations with the previous ones.
    sim <- generate_data(fit)
    inner_stats <- compute_stats(sim)
    compare(inner_stats, outer_stats)
  }
}

A1[From Janice Brodsky]: Loops run slowly in R.  You want to take a look at the “apply” functions (lapply, sapply, etc.).

A2[From Anne-Sophie Charest]: I also recommend the function “replicate”, which is very similar to lapply and such. For example:

replicate(40, rnorm(100))  # a matrix of 40 columns, where each column is a different draw of 100 N(0,1) observations
lapply(1:40, function(i) rnorm(100))  # a list of 40 items, each a vector of 100 N(0,1) draws
(Note that lapply(1:40, rnorm, 100) would instead call rnorm(i, mean = 100), which is not what you want.)
I prefer the syntax of replicate when doing simulations like yours.

In your case you could do something like this:

replicate(1000, function1())
where function1() is the function that does everything you want for one data set.

Within function1(), at some point call replicate(1000, function2(fit)),
where function2() is the function that generates one data set from the fitted values (passed as an argument) and performs the calculations from the second loop.

If this is still too slow, you can also look into parallel computing. Some info at http://cran.r-project.org/web/views/HighPerformanceComputing.html.

A3[From Patrick Rhodes]: The suggestions listed thus far are excellent and should be used (really – you should be able to use some apply functions here).  Whether you use them or stay with your for loops, you almost certainly will have performance issues when you nest 1000 sets of data within another loop of 1000 sets of data – that’s 1 million sets of data that the computer has to track (internally it will have to keep memory pointers, allocate stacks and several other tasks).  Additionally, the amount of internal memory you have – no matter how great – will be insufficient, thus forcing your computer to use your hard drive to compensate, which slows things down tremendously.

There are a couple of routes you might experiment with:
1) *If* you choose to stay with nested loops, run garbage collection after each outer loop.  See here for more documentation:  http://stat.ethz.ch/R-manual/R-devel/library/base/html/gc.html

Once you are done with an outer loop run (i.e. on that set of data, it has then run the inner loop 1000 times), remove that set of data and run garbage collection.  By doing that, it *should* free up all of the stack space and pointers and whatever other resources the computer was using to track it (hopefully).

But again, nested loops are awful – the amount of time to allocate, re-allocate, etc. is tremendous.

I would not recommend this approach, however – ever.

2) Consider using the ‘by’ function.  In this event, you will run the outer loop without any inner loop, creating the 1000 outer data sets.  Store those data sets in a vector.  Then use the by function on that structure to run your inner 1000 calculations.  Again, consider using garbage collection as you do.  Without seeing the code, it’s difficult to know how this would work exactly.

Using this method might avoid some re-allocation issues, but I’m not sure without experimenting.

Again, not recommended.

3.  Use apply functions.  Honestly, however, underneath the hood these are written in C (as are the for loops) and still have looping to manage.  But you have the advantage that they were written by professional engineers who have optimized their performance.  And it saves you from reinventing the wheel.

=================

There are almost certainly some other ways to optimize your code that we can’t see here.  Are you printing out a bunch of stuff on each run?  That will slow it down significantly.  Perhaps consider storing results to a log file that you append to on each run.  Your code may also have some optimization issues within the loops – perhaps you are performing your calculations in a way that take significant resources without realizing it.

To see how your changes would work, drop your outer loop to 10 data sets as well as your inner data sets (10).  Time how long each run takes as you optimize and you will eventually get a fast iteration.  Once you have it optimized, then go back to your original sizes.

But use the apply functions.

A4[From John Dawson]: The suggestions that have been made thus far are good, but two thoughts:

1) The superiority of the apply series over loops is not what it used to be. More recent versions of R may not exhibit order-of-magnitude differences in runtime between the two, depending on what kind of computations are being done.

2) If you really want to improve speed, consider outsourcing the computation to a ‘lower-level’ language such as C or C++. An excellent package for doing so in R is Rcpp. With Rcpp, you don’t have to worry about allocating memory or knowing too much underlying C++, since most of the command structure is built to be R-like. For a day’s worth of investment in learning how to use the package, you might see two orders of magnitude improvement in runtime, as I did in one of my projects.

Response[From Mathangi Gopalakrishnan]: Thank you all for the quick and clear responses.

Use of “replicate” solves some of my problems, and as suggested by Patrick I am in the process of optimizing my runs.

I hope to invest some time learning about the Rcpp package.

Q: What are Bartlett corrections?

A: Strictly speaking, a Bartlett correction is a scalar transformation applied to the likelihood ratio (LR) statistic that yields a new, improved test statistic which has a chi-squared null distribution to order O(1/n). This represents a clear improvement over the original statistic in the sense that LR is distributed as chi-squared under the null hypothesis only to order O(1).
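In symbols (the standard presentation, stated here for concreteness): with q degrees of freedom, and writing b/n for the order-1/n term in the null expectation of LR, the Bartlett-corrected statistic is

```latex
\mathrm{LR}^{*} \;=\; \frac{\mathrm{LR}}{c}, \qquad
c \;=\; \frac{\mathbb{E}[\mathrm{LR}]}{q} \;=\; 1 + \frac{b}{n} + O(n^{-2}),
```

so that dividing by the mean correction factor c recenters LR* to have expectation q up to higher-order terms, sharpening the chi-squared approximation.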

Q: Are there extensions of Bartlett corrections?

A: Yes. Some of them arose in response to Sir David Cox’s 1988 paper, “Some aspects of conditional and asymptotic inference: a review” (Sankhya A). A particularly useful one was proposed by Gauss Cordeiro and Silvia Ferrari in a 1991 Biometrika paper. They have shown how to Bartlett-correct test statistics whose null asymptotic distribution is chi-squared with special emphasis on Rao’s score statistic.

Q: Where can I find a survey paper on Bartlett corrections?

A: There are a few around. Two particularly useful ones are:

Q: What are the alternatives to Bartlett corrections?

A: There are several alternatives.  A closely related one is the Edgeworth expansion, named after the economist and statistician Francis Ysidro Edgeworth.  There are also saddlepoint expansions.  A computer-intensive alternative, known as the bootstrap, was proposed by Bradley Efron in his 1979 Annals of Statistics paper.
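As a concrete illustration of the bootstrap alternative, here is a minimal Python sketch (the data are simulated purely for illustration): resample the observed data with replacement, recompute the statistic each time, and use the spread of the recomputed values as a standard-error estimate.

```python
import random
import statistics

random.seed(42)

data = [random.gauss(10, 3) for _ in range(200)]   # stand-in for an observed sample

# Nonparametric bootstrap estimate of the standard error of the mean
boot_means = [
    statistics.fmean(random.choices(data, k=len(data)))   # resample with replacement
    for _ in range(2_000)
]
se_boot = statistics.stdev(boot_means)
se_theory = statistics.stdev(data) / len(data) ** 0.5     # classical formula s / sqrt(n)

print(f"bootstrap SE = {se_boot:.3f}, formula SE = {se_theory:.3f}")
```

For the mean, the bootstrap agrees closely with the classical s/√n formula; its appeal is that the same recipe works for statistics with no simple standard-error formula.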

Please refer to Bartlett Corrections Page.

### What causes loss of power in hypothesis testing?

Several days ago, I asked the above question in StackExchange. And the following is the list of answers:

There is an enormous literature on this subject; I’ll just give you a quick thumbnail sketch. Let’s say you’re testing groups’ mean differences, as in T-tests. Power will be reduced if…

1. …variables are measured unreliably. This will in essence “fuzz over” the differences.
2. …variability within groups is high. This will render between-group differences less noteworthy.
3. …your criterion for statistical significance is strict, e.g., .001 as opposed to the more common .05.
4. …you are using a one-tailed test, hypothesizing that a certain group will have the higher mean, and the true difference runs in the opposite direction. A one-tailed test gives up the ability to detect differences in the unpredicted direction, cutting out the opportunistically significant findings that occur with two-tailed tests.
5. …you are dealing with (or expecting) very slim mean differences in the first place. Small effect sizes decrease power.

If you google “power calculator” you’ll find some nifty sites that demonstrate the way these factors interact, depending on the values you input.  And if you download the free G*Power 3 program you’ll have a nice tool for calculating power and considering the effects of different conditions, for a wide variety of statistical procedures.
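The interplay of these factors can also be explored directly in code. Below is a sketch of a normal-approximation power calculation for a two-sided, two-sample test (Python; the particular delta, sigma, and n values are invented for illustration):

```python
from statistics import NormalDist

def power_two_sample(delta, sigma, n, alpha=0.05):
    """Approximate power of a two-sided two-sample test with n per group,
    using the normal approximation to the t distribution."""
    z = NormalDist()
    se = sigma * (2 / n) ** 0.5          # standard error of the mean difference
    z_crit = z.inv_cdf(1 - alpha / 2)    # two-sided critical value
    shift = delta / se                   # standardized effect under the alternative
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

p_base   = power_two_sample(delta=1.0, sigma=2.0, n=64)
p_small  = power_two_sample(delta=0.5, sigma=2.0, n=64)               # smaller effect
p_strict = power_two_sample(delta=1.0, sigma=2.0, n=64, alpha=0.001)  # stricter alpha
print(f"{p_base:.2f} {p_small:.2f} {p_strict:.2f}")
```

Halving the effect size or tightening alpha from .05 to .001 each drops the power well below the baseline case, exactly as items 3 and 5 in the list predict.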

Another one is:

Think of power as your ability to recognize which of two competing data-generating processes is the true one. This will obviously be easier when:

1. The data have much signal with little noise (good SNR).
2. The competing processes are very different one from the other (null and alternative are separated).
3. You have ample amounts of data.
4. You are not too concerned with mistakes (large type I error rate).

If the above do not hold, it will be harder to choose the “true” process. I.e., you will lose power.

Q4: Choosing a First Machine Learning Project: Start by Reading or by Doing?

Sarath writes about doing a machine learning project during his final year of university:

I am writing this email to ask for some advice. Well, the thing is I haven’t decided on my project yet, as I decided it will be better if I took some time to just strengthen my fundamentals and maybe work on something small. Well, I came across this great blog called Measuring Measures where they had put up a reading list for machine learning, and it was, may I say, a bit overwhelming. http://measuringmeasures.com/blog/2010/3/12/learning-about-machine-learning-2nd-ed.html

My present goal is doing a graduate course at some good university with strong machine learning research, and one of the reasons I wanted to do a great project is that I have heard that would be a great way of getting into a good university.

So my question is: should my first priority be getting a really good and deep understanding of the subject, or should I be more concerned with doing some good project with respect to admissions?

There are others who are likely more qualified than I am to answer this one, but here are my two cents:

That post certainly has things that would be nice to learn, but you don’t need to know all of that in order to be a successful researcher. Depending on what area you go into, you might need different subsets of those references, or you might need something different altogether. (For example, a reference I go back to time and time again is Schrijver’s Combinatorial Optimization, but it’s not on that list.)

I think you should pick a project in an area that you find interesting, then just dive in. At first, I’d be less concerned with doing something new. First, focus on understanding a couple different existing approaches to the specific problem you’ve chosen, and pick up the necessary background as you go by trying to implement the algorithms and replicate published results, following references when you get confused, looking up terms, etc. Perhaps most importantly, work on your research skills. Important things:

• Clearly write up exactly what you are doing and why you are doing it. Keep it as short as possible while still having all the important information.
• Set up a framework so you are organized when running experiments.
• Even if the results are not state of the art or terribly surprising, keep track of all the outputs of all your different executions with different data sets as inputs, different parameter settings, etc.
• Visualize everything interesting about the data you are using, the execution of your algorithms, and your results. Look for patterns, and try to understand why you are getting the results that you are.

All the while, be on the lookout for specific cases where an algorithm doesn’t work very well, assumptions that seem strange, or connections between the approach you’re working on to other algorithms or problems that you’ve run across before. Any of these can be the seed of a good research project.
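As a hypothetical illustration of the bookkeeping advice above (the file name and the fields are made up), even a few lines of Python go a long way toward keeping every run on record:

```python
import csv
import time
from pathlib import Path

LOG = Path("runs.csv")  # hypothetical log file for all experiment runs

def log_run(params: dict, metrics: dict):
    """Append one experiment run (parameters + results) to a CSV log."""
    row = {"timestamp": time.strftime("%Y-%m-%d %H:%M:%S"), **params, **metrics}
    write_header = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if write_header:
            writer.writeheader()
        writer.writerow(row)

# Record every execution, even the unremarkable ones.
log_run({"dataset": "toy", "learning_rate": 0.1}, {"accuracy": 0.87})
log_run({"dataset": "toy", "learning_rate": 0.01}, {"accuracy": 0.91})
```

The exact tool matters far less than the habit: every run, with its parameters and outputs, should be recoverable later when you go looking for patterns.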

In my estimation, I’d think graduate schools would be more impressed by a relevant, carefully done project, even if it’s not terribly novel, than they would be with you saying on your application that you have read a lot of books.

If you’re looking for project ideas, check out recent projects that have been done by students of Andrew Ng’s machine learning course at Stanford:
http://www.stanford.edu/class/cs229/projects2008.html
http://www.stanford.edu/class/cs229/projects2009.html

Perhaps some readers who have experience on graduate committees can correct or add to anything that I said that was wrong or incomplete.

Q5: What are some good resources for learning about machine learning?

There were some good answers, including some resources I didn’t know about. Here’s a sampling of the answers:

My answer was Andrew Ng’s YouTube videos:

Some other good ones:
Jie Tang says…

Mike Jordan and his grad students teach a course at Berkeley called Practical Machine Learning which presents a broad overview of modern statistical machine learning from a practitioner’s perspective. Lecture notes and homework assignments from last year are available at
http://www.cs.berkeley.edu/~jordan/courses/294-fall09/

A Google search will also turn up material from past years.

Ben Newhouse says…

The textbook “Elements of Statistical Learning” has an obscene amount of material in it and is freely available in PDF form via http://www-stat.stanford.edu/~tibs/ElemStatLearn/

While more niche than general machine learning, I recently ripped through “Natural Image Statistics” (also downloadable at http://www.naturalimagestatistics.net/ ). It’s a great read, both for its explanations of standard ML algorithms (PCA, ICA, mixtures of Gaussians, etc.) and for its real-world applications/examples in trying to understand the models at work in our visual system.

Jeremy Leibs gives the staple, David MacKay’s book (I believe MacKay would say that machine learning is just information theory):

“Information Theory, Inference, and Learning Algorithms” by David MacKay has some decent introductory material if I remember. Available online:
http://www.inference.phy.cam.ac.uk/mackay/itila/book.html

Incidentally, I haven’t read Programming Collective Intelligence, but it seems popular amongst non researchers. Do any of you know more about it?

From now on, I am starting a series of Q-A Sections which carry important information in the form of questions and answers.

Q1: What is the difference between what statisticians and machine learning researchers do?

The best answer that I have found so far can be found in
“Statistical Modeling: The Two Cultures” by Leo Breiman (Statistical Science, 16:199-231, 2001).
According to this, statisticians like to start by making modeling assumptions about how the data is generated (e.g., the response is noise added to a linear combination of the predictor variables), while in machine learning people use algorithmic models and treat the data mechanism as unknown. He estimates that (back in 2001) less than 2% of statisticians worked in the realm where the data mechanism is treated as unknown.
It seems that there are two problems with the data-model approach.
One is that this approach does not address the ultimate question, which is making good predictions: if the data does not fit the model, this approach has nothing to offer (it does not make sense to apply a statistical test if its assumptions are not valid).
The other problem is that as data become more complex, data models become more cumbersome. Then why bother? With complex models we lose the advantage of easy interpretability, to say nothing of the computational complexity of fitting such models.
The increased interest in Bayesian modeling with Markov chain Monte Carlo is viewed as the response of the statistical community to this problem. True enough, this approach might be able to scale to complex data, but does it address the first issue? Are there not computationally cheaper alternatives that can achieve the same prediction power?
He characterizes the machine learning approach, as the pragmatic approach: You have to solve a prediction problem, hence take it seriously: Estimate the prediction error and choose the algorithm that gives a predictor with the better accuracy (but let’s not forget about data snooping!).
But the paper offers more. Amongst other things it identifies three important recent lessons:

1. The multiplicity of good models: If you have many variables, there can be many models of similar prediction accuracy. Use them all by combining their predictions instead of just picking one. This should increase accuracy, reduce instability (sensitivity to perturbations of the data). Boosting, bagging, aggregation using exponential weights are relevant recent popular buzzwords.
2. The Occam dilemma: Occam’s razor tells you to choose the simplest predictor. Aggregated predictors don’t look particularly simple. But aggregation seems to be the right choice otherwise. I would think that Occam’s razor tells you only that you should have a prior preference for simple functions. I think this is rather well understood by now.
3. Bellman: dimensionality — curse or blessing: Many features are not bad per se. If your algorithm is prepared to deal with the high-dimensional inputs (SVMs, regularization, random forests are mentioned) then extracting many features can boost accuracy considerably.
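Lesson 1 (the multiplicity of good models) is easy to demonstrate with a toy bagging sketch: many unstable regression stumps, each fit on a bootstrap resample, are averaged into one predictor. Everything here (the data, the stump learner) is an illustrative assumption, not Breiman's own experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_stump(x, y):
    """Fit a 1-D regression stump: a threshold with a constant value on each side."""
    best = None
    for t in np.quantile(x, np.linspace(0.05, 0.95, 19)):
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        pred_l, pred_r = left.mean(), right.mean()
        sse = ((left - pred_l) ** 2).sum() + ((right - pred_r) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, pred_l, pred_r)
    _, t, pl, pr = best
    return lambda xx: np.where(xx <= t, pl, pr)

# Smooth target with noise: a single stump is a weak, unstable model.
x = rng.uniform(-3, 3, 200)
y = np.sin(x) + rng.normal(0, 0.3, x.size)
x_test = np.linspace(-3, 3, 500)
y_test = np.sin(x_test)

single = fit_stump(x, y)

# Bagging: average many stumps fit on bootstrap resamples ("use them all").
preds = []
for _ in range(100):
    idx = rng.integers(0, x.size, x.size)
    preds.append(fit_stump(x[idx], y[idx])(x_test))
bagged = np.mean(preds, axis=0)

mse_single = float(np.mean((single(x_test) - y_test) ** 2))
mse_bagged = float(np.mean((bagged - y_test) ** 2))
```

Averaging the bootstrap replicates smooths out the sensitivity of each individual stump to perturbations of the data, which is exactly the instability-reduction point Breiman makes.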

In summary, I like the characterization of the difference between (classical) statistical approaches and machine learning. However, I wonder if these differences are still as significant as they were (must have been) in 2001 when the article was written and if the differences will become smaller over time. Then it will be really difficult to answer the question on the difference between the statistical and the machine learning approaches.

Q2: What are the pitfalls of optimality in statistics?

I was reading a little bit about robust statistics, as we are in the process of putting together a paper about entropy estimation where robustness comes up as an issue. While searching the net for the best material on this topic (I am thinking about posting another article about what I have found), I bumped into a nice paper (downloadable from here) by Peter J. Huber, one of the main figures in robust statistics, where he talks about a number of pitfalls around pursuing optimality in statistics. Huber writes eloquently: he gives plenty of examples and motivates the definitions. He is just great. I can only recommend this paper or his book. Now, what are the pitfalls he writes about? He distinguishes four types, with the following syndromes:

• The fuzzy concepts syndrome: sloppy translation of concepts into mathematics. Think about uniform vs. non-uniform convergence (sloppy asymptotics). In statistics a concrete example is the concept of efficiency which is defined in a non-uniform manner with respect to the estimable parameters, which allows for (weird) “super-efficient” estimators that pay special attention to some distinguished element of the parameter-space.
• The straitjacket syndrome: the use of overly restrictive side conditions, such as requiring that an estimator be unbiased or equivariant (equivariant estimates in high dimensions are inadmissible in very simple situations). In Bayesian statistics another example might be the convenient but potentially inappropriate conjugate priors.
• The scapegoat syndrome: confusing the model with reality (offering the model for the gods of statistics instead of the real thing, hoping that they will accept it). The classic example is the Eddington-Fisher argument. Eddington advocated the mean-absolute-deviation (MAD) instead of the root-mean-square (RMS) deviation as a measure of scale. Fisher argued that MAD estimates are highly inefficient (converge slowly) relative to the RMS deviation estimates if the sample comes from a normal distribution. Tukey has shown that the situation gets reversed even under small deviations from a normal model. The argument that under narrow conditions one estimator is better than some other should not be even made. Another example is perhaps classical optimal design and the fundamentalist approach in Bayesian statistics. Of course, there is nothing wrong with assumptions, but the results should be robust.
• The souped-up car syndrome: by optimizing for speed we can end up with an elaborate gas-guzzler. Optimizing for one quantity (efficiency) may degrade another one (robustness). Practical solutions must find a balance between such contradicting requirements.

These syndromes are not too hard to identify in machine learning research. Wear protective gear as needed!
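The Eddington-Fisher-Tukey story in the scapegoat syndrome can be replayed in a few lines of Monte Carlo (the contamination fraction and scale below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)

def scale_estimates(samples):
    """Per-row RMS deviation and (normalized) mean absolute deviation."""
    centered = samples - samples.mean(axis=1, keepdims=True)
    rms = np.sqrt((centered ** 2).mean(axis=1))
    mad = np.abs(centered).mean(axis=1) * np.sqrt(np.pi / 2)  # unbiased for sigma under normality
    return rms, mad

reps, n = 4000, 100

# Fisher's side: under an exactly normal sample, RMS is the more efficient estimate.
normal = rng.normal(0.0, 1.0, (reps, n))
rms_n, mad_n = scale_estimates(normal)

# Tukey's side: contaminate 5% of the points with scale-5 noise and the ranking flips.
contam = normal.copy()
contam[rng.random((reps, n)) < 0.05] *= 5.0
rms_c, mad_c = scale_estimates(contam)

# Coefficient of variation across replications = relative instability of the estimator.
cv = lambda est: float(est.std() / est.mean())
```

Under the pure normal model the RMS deviation is the more stable estimator; under the slightly contaminated model it is the MAD, which is the whole point of not offering the narrow model to the gods of statistics in place of the real thing.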

Q3: Discriminative vs. generative learning: which one is more efficient?

I just came across a paper by Philip M. Long, Rocco Servedio and Hans Ulrich Simon. (Here is a link to the paper titled “Discriminative Learning can Succeed where Generative Learning Fails”.) The question investigated in the paper is the following:
We are in a classification setting and the learning problem is defined by a pair of jointly distributed random variables, (X,Y), where Y can take on the values +1 and -1. Question: How many iid copies of this pair does an algorithm need to (i) find a classifier that yields close to optimal performance with high probability (ii) find two score functions, one trained with the positive examples only, the other with the negative examples only such that the sign of the difference of the two score functions gives a classifier that is almost optimal with high probability?
The result in the paper is that there exists a class of distributions, parameterized by $d$ (determining the dimension of the samples), such that there is a discriminative algorithm (tuned to this class) that can learn the correct classifier with only $2\log(2/\delta)$ samples, while the number of samples required for any generative classifier is at least $d$.
Since it is clear that the requirements of generative learning are stronger than those of discriminative learning, it follows that in the above framework discriminative learning is strictly “easier” than generative learning.
The distribution concentrates on O(d) samples and the main idea is that the joint knowledge of positive and negative samples suffices for the easy identification of the target distribution (hence, the classifier), while knowing only the positive or only the negative examples is insufficient. Two special inputs, both marked for ease of recognition, jointly determine the full distribution, but one of the inputs is in the positive set, the other in the negative set, and knowledge of only one of them is insufficient for learning the otherwise “difficult” distribution.
Although the construction given in the paper is simple and works as intended, it is arguably “artificial and contrived”, as it was also noted by the authors. In particular, does a similar result hold when we consider a continuous domain of a fixed dimension, and/or we restrict the class of algorithms to consistent ones? Further, the example shows more the limitation of algorithms that learn from positive and negative samples independently of each other than the limitation of generative algorithms (generative algorithms traditionally refer to learners that estimate the joint distribution of the inputs and outputs).
The question of whether generative or discriminative algorithms are more efficient is particularly interesting in light of an earlier paper by Andrew Y. Ng and Michael Jordan (“On Discriminative vs. Generative classifiers: A comparison of logistic regression and naive Bayes”, NIPS-14, 2002). In this paper the authors compare one particular discriminative algorithm with another particular algorithm that is “generative”. The two algorithms are logistic regression and naive Bayes, and together they form what is called a “generative-discriminative” pair. The meaning of this is that while naive Bayes maximizes the total joint log-likelihood, $\sum_{i=1}^n \log p_\theta(x_i,y_i)$, over the samples, logistic regression maximizes the total conditional log-likelihood, $\sum_{i=1}^n \log p_\theta(y_i|x_i)$, over the same parametric model. In the case of these two particular algorithms the parametric model is written in terms of $p(x|y)$ and asserts independence of the individual components of the feature vector $x$. (For continuous spaces the individual feature distributions, $p(x_k|y)$, are modeled as Gaussians with unknown mean and variance.) The agnostic setting is considered (the input distribution is not restricted). Since both learners pick a classifier from the very same class of classifiers, and the discriminative learner in the limit of an infinite number of samples converges to the best classifier in this class, it follows that the ultimate loss suffered by the discriminative learner is never higher than that suffered by the generative learner. Hence, it seems that if the naive Bayes assumption made by the generative method is not met, the discriminative method can have an edge, at least ultimately (open issue: give an example that shows positive separation!). However, this is just half of the story: the generative model may converge faster!
In particular, the authors state an upper bound on the convergence of loss for the generative model that scales with $\sqrt{(1/n)\log d}$ ($d$ is the number of components of $x$), while, as follows from standard uniform convergence results, the corresponding rate for the discriminative method is $\sqrt{(d/n)\log(n/d)}$. They argue that this follows since the hypothesis class has a VC-dimension of $d$. Note the difference in the way the two bounds scale with $d$, the dimension of the input space: in the case of the discriminative algorithm the scaling is (log-)linear in $d$, while in the case of the generative algorithm it is logarithmic in $d$. (Strictly speaking, the upper bound would call for a proof, since logistic regression is not a risk-minimization algorithm for the 0/1 loss and the cited theory has been worked out for risk-minimization algorithms.) Since comparing upper bounds is not really appropriate, they point to existing lower bounds which show that the worst-case sample complexity is lower bounded by $\Omega(d)$. (My favorite paper on this is by Antos and Lugosi: “Strong minimax lower bounds for learning”: link.) Hence, the conclusion of Ng and Jordan is that, contrary to the widely held belief, generative learning can sometimes be more efficient than discriminative learning; at least when the number of features is large compared to the number of samples and the ultimate loss of the generative learner is not much higher than that of the discriminative learner. They also performed experiments on UCI datasets to validate this claim. The experiments show that the performance of naive Bayes is indeed often better when the number of examples is small. Unfortunately, the figures show the loss as a function of the number of samples only, and hence validate the theory only half-way: for a full validation we should know how the dimension of the dataset influences the performance.
A similar conclusion to that of Ng and Jordan was shown to hold in an earlier KDD-97 paper titled “Discriminative vs. Informative Learning” by Y. Dan Rubinstein and Trevor Hastie (link) who did a case study with simulated and real examples.
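A rough sense of the Ng-Jordan comparison can be had from a self-contained sketch of the generative-discriminative pair. This is a simplified stand-in (shared unit variances, equal class priors, plain gradient ascent), not the authors' experimental setup, and all the constants are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 50     # number of features
mu = 0.18  # per-feature class mean shift (data satisfy the naive Bayes assumption)

def make_data(n):
    y = rng.integers(0, 2, n)
    x = rng.normal(0.0, 1.0, (n, d)) + np.where(y[:, None] == 1, mu, -mu)
    return x, y

def gaussian_nb(x, y):
    """Generative half of the pair: class-conditional Gaussians, independent features."""
    m0, m1 = x[y == 0].mean(axis=0), x[y == 1].mean(axis=0)
    # With shared unit variances and equal priors, predict the nearer class mean.
    return lambda xx: ((((xx - m0) ** 2) - ((xx - m1) ** 2)).sum(axis=1) > 0).astype(int)

def logistic_regression(x, y, lr=0.1, steps=500):
    """Discriminative half: maximize the conditional log-likelihood by gradient ascent."""
    xb = np.hstack([x, np.ones((len(x), 1))])  # add intercept column
    w = np.zeros(d + 1)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(xb @ w)))
        w += lr * xb.T @ (y - p) / len(y)
    return lambda xx: ((np.hstack([xx, np.ones((len(xx), 1))]) @ w) > 0).astype(int)

x_test, y_test = make_data(5000)
accuracy = lambda clf: float((clf(x_test) == y_test).mean())

results = {}  # n -> (naive Bayes accuracy, logistic regression accuracy)
for n in (25, 2000):
    x_tr, y_tr = make_data(n)
    results[n] = (accuracy(gaussian_nb(x_tr, y_tr)),
                  accuracy(logistic_regression(x_tr, y_tr)))
```

With many features and few samples the generative model already performs respectably because it only has to estimate per-feature means, while with ample data both members of the pair approach the same classifier, in line with the discussion above.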
The convergence rate comparison looks quite intriguing at first sight. However, some more thinking reveals that the story is far from finished. To see this, consider the “rate” function $r(n,d) = L_{gen}$ if $n\le d-1$ and $r(n,d) = \min(L_{gen},\sqrt{d/n})$ if $n\ge d$. Imagine that $r(n,d)$ is an upper bound on the loss of some learner, where $L_{gen}$ is the loss of the generative learner. The rate of convergence is then $\sqrt{d/n}$, not contradicting the lower bounds, but clearly this discriminative learner will be competitive with the generative learner.
So is discriminative learning more (sample-)efficient than generative learning? It seems that sometimes having both the positive and negative samples at hand can be useful. However, generative learning might be advantageous when the chosen model fits the data well. Hence, the case is not yet settled. Many interesting open questions!