In my previous post, I took a shot at creating a guide for those who want to learn about statistical learning. I received a lot of great feedback from friends in the field, from comments on my blog, and from discussions around the web like the Hacker News thread. This revised edition of the post takes that feedback into account.
I humbly submit this guide, covering material I have not fully mastered myself, in the hope that it may be of use to others trying to find their way.
[ Student working in Statistics Machine Room. source: flickr ]
Fragmentation and Ambiguity: what’s in a name?
I renamed this second edition of my post to “Learning about Machine Learning” from “Learning about Statistical Learning.” What is the better name? I don’t know. Things are a bit fragmented. There are some nuances and some of the fragmentation makes sense. In my opinion, most of it doesn’t. You have “computational learning,” “computational statistics,” “statistical learning,” “machine learning” and “learning theory”. You have Statistics vs. Machine Learning, the culture of algorithms vs. machine learning, the stats handicap, and COMPUTATIONAL statistics vs. computational STATISTICS.
It can be confusing. I bring this up to help you preempt the confusion. When you see these different terms, don’t worry about it, just focus on the topics you need to learn. In this guide I will try to help point out where you will see the same thing by different names.
Straight into the Deep End
In the previous list, I thought it would be good to recommend some lighter texts as introductions to topics like probability theory and machine learning. Based upon subsequent discussions and feedback, I’ve changed my view. Straight into the deep end is the way to go.
The caveat is that you need to be honest with yourself about when you encounter something that you do not understand and you are unable to work through despite trying your best. At this point you’ve found your way to some material that you do not have the foundation for. Put the book down, insert a bookmark. Close the book. Walk away. Go find a resource to build the foundation. Come back later. Repeat.
Don’t be afraid to admit you don’t understand something; that is the only way you are going to move forward and really understand the material. As Peter Norvig says about programming, be prepared to make a serious commitment of time and effort if you want to be a researcher.
I am making a number of changes which I mention along the way. The overall list is now structured by topic with inline notes and there are a number of new topics so that you can pick and choose based upon your areas of interest.
Now on to the books.
I’m moving the foundation in proofs and analysis up to be the top priority, as well as adding some new references from the analysis list. You won’t get far if you are intimidated by Greek characters and unable to follow proofs. If you build an intuition for programming as proof writing, that will also bend your mind in the right direction. I’m including a link to one of the fun puzzle books that I find helpful – logic puzzles can help get you into the flow of proofs. I’m not including a book on functional analysis here, but I have a few you can choose from on the analysis list if you are interested.
Introduction to Analysis – Maxwell Rosenlicht
Mathematical Analysis – Apostol
How to Prove It: A Structured Approach – Daniel J. Velleman
How to Solve It – George Pólya
Proofs are Programs [pdf] – Philip Wadler
The Lady or the Tiger? – Raymond Smullyan
You will be working with algorithms constantly, so work through a couple of books from the algorithms and data structures list. We are also in an age of rapidly increasing data volume, so I am also recommending a book to help you learn how to decompose algorithms and work with MapReduce-style parallelism – a great read that is not yet published, but the free PDF is available online.
Introduction to Algorithms – Thomas H. Cormen, et al.
The Algorithm Design Manual – Steven Skiena
Data-Intensive Text Processing with MapReduce – Jimmy Lin and Chris Dyer
Map–Reduce for Machine Learning on Multicore – from a group in the Stanford CS department
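The decomposition idea is easy to see in miniature. Here is a sketch of MapReduce-style word counting in plain Python (the phase functions and toy documents are my own, not from either reference):

```python
from collections import defaultdict
from itertools import chain

# Map phase: each document is processed independently, emitting (word, 1).
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle phase: group all intermediate pairs by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: each key's value list is combined independently.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(d) for d in docs)))
print(counts["the"])  # 3
```

Because the map phase touches each record independently and the reduce phase touches each key independently, both parallelize trivially across machines – that is the entire trick.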
Linear algebra is critical – matrices are your friends. A lot of what you will see is represented and computed naturally in matrix notation – it is convenient for both theory and practice. You can find a variety of items in the matrix fu list, and you can also find Gilbert Strang’s lectures online from MIT OpenCourseWare. Matrix algebra, like algorithms, is a fundamental building block that you will work with on a daily basis.
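To make the convenience of matrix notation concrete: evaluating a linear model over an entire data set collapses to a single matrix-vector product. A tiny sketch with made-up numbers:

```python
import numpy as np

# One linear model, three data points, one matrix-vector product:
# y_hat = X @ beta, where each row of X is [1, x1, x2] (the leading 1
# carries the intercept term).
beta = np.array([1.0, 2.0, -0.5])
X = np.array([[1.0, 3.0, 2.0],
              [1.0, 0.0, 4.0],
              [1.0, 1.0, 1.0]])
y_hat = X @ beta  # predictions for all rows at once: 6.0, -1.0, 2.5
print(y_hat)
```

The same notation carries the theory: the fitted coefficients, the covariance of the estimator, and the projection interpretation of least squares are all one-line matrix expressions.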
In this second edition of my post, I’m adding a strong focus on linear models, as they are the foundation for a lot of models that you will encounter in machine learning. The books in the machine learning section below will help to put linear models in a broader setting. For example Chapter 2 of The Elements of Statistical Learning has a nice overview section of least squares and nearest neighbors that helps you understand how many other techniques are based upon these two simple building blocks. Some sources also have nice discussions of the relationship between linear models and other approaches like neural networks. Starting with a strong foundation in linear models is wise. I had not done so properly when jumping into machine learning, and it has required a lot of backtracking.
There are many titles to choose from on the probability list, in order to build a base in probability theory. I’m also adding a reference for looking at probability from the Bayesian perspective. Probability can be very counter-intuitive. A great way to get around this is to train yourself to understand how we misinterpret things. Read one of the Kahneman and Tversky books from my finance, trading and economics list, and work through some puzzle books. Aside from psychological biases, a lot of the trickiness of empirical probability, especially conditional probabilities, is knowing what and how to count. If you want to explore counting, there are a number of good books on the probability list.
A Course in Probability Theory, Revised Second Edition – Kai Lai Chung
A First Look at Rigorous Probability Theory – Jeffrey S. Rosenthal
Probability Theory: The Logic of Science – E. T. Jaynes
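As an example of how counting is the tricky part of conditional probability, here is the classic two-children puzzle solved by exact enumeration (a sketch of my own, not from any of the books above):

```python
from fractions import Fraction
from itertools import product

# In a two-child family, given that at least one child is a girl,
# what is the probability both are girls? Enumerate the four equally
# likely outcomes and count the ones that satisfy each condition.
outcomes = list(product("GB", repeat=2))               # GG, GB, BG, BB
at_least_one_girl = [o for o in outcomes if "G" in o]  # GG, GB, BG
both_girls = [o for o in at_least_one_girl if o == ("G", "G")]

p = Fraction(len(both_girls), len(at_least_one_girl))
print(p)  # 1/3, not the 1/2 that intuition often suggests
```

The entire difficulty is in choosing the right sample space to count over: conditioning on "at least one girl" leaves three outcomes, not two.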
When you are ready to build an understanding of statistics on top of probability theory, move on to the statistics list. Statistics and inference are critical in machine learning, and it is also nice to understand how information theory fits into this picture. I haven’t included an information theory book here, but you can find a nice one that ties information theory together with statistics on the statistics list. A lot of the bits and pieces of statistics will have been covered already with probability theory and linear models. I’ve added a reference on hypothesis testing and multivariate statistics. I’ve also created a new section below on robust and non-parametric statistics.
All of Statistics: A Concise Course in Statistical Inference – Larry Wasserman
Testing Statistical Hypotheses – E. L. Lehmann
Introduction to Bayesian Statistics – William M. Bolstad
Modern Multivariate Statistical Techniques – Alan Julian Izenman
Non-parametric and robust statistics are very neat topics. Non-parametric statistics can mean a few things, but the core intuition is that the techniques do not assume a parametrized model for the probability distribution underlying the observed data. The notion of robust statistics stems from the desire to create statistical methods that are 1) resistant to the deviations from assumptions that empirical data often exhibit, and 2) tolerant of outliers, as measured by the breakdown point – the fraction of arbitrarily bad observations an estimator can absorb before it can be driven to an arbitrarily large (wrong) result.
Robust Statistics – Peter J. Huber
Robust Statistics: The Approach Based on Influence Functions – Frank R. Hampel
Robust Regression and Outlier Detection – Peter J. Rousseeuw
All of Nonparametric Statistics – Larry Wasserman
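The breakdown-point intuition fits in a few lines: a single gross outlier ruins the mean (breakdown point zero) but barely moves the median (breakdown point one half). A sketch with made-up data:

```python
import statistics

clean = [9.8, 10.1, 10.0, 9.9, 10.2]
contaminated = clean + [1e6]  # one gross outlier

# The mean is dragged arbitrarily far by one bad value; the median
# withstands contamination of almost half the sample.
print(statistics.mean(clean), statistics.median(clean))  # 10.0 10.0
print(statistics.mean(contaminated), statistics.median(contaminated))
```

The second line prints a mean above 160,000 but a median of 10.05 – the median barely noticed the contamination.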
I’ve replaced Mitchell with Bishop as a companion for EoSL. There are more to choose from on the machine learning list if you’re feeling motivated, but the three below give a very solid coverage of most of the core material. MacKay’s book is a great help with information theory if you haven’t yet gotten a solid understanding by the time you get to this point.
Pattern Recognition and Machine Learning – Christopher Bishop
Information Theory, Inference and Learning Algorithms – David J. C. MacKay
The Elements of Statistical Learning – Trevor Hastie, Robert Tibshirani, and Jerome Friedman
Understand features (or predictors, input variables, whatever you prefer to call them) and feature selection. Know your different strategies for dealing with continuous vs. discrete features. Feature selection (especially “embedded methods”), “sparsification” and “shrinkage methods” are all more or less different sides of the same hyper-coin. These ideas seek to eliminate the impact of irrelevant features upon your model by either marginalizing them out entirely or shrinking the parameter values of less relevant predictors. The two books below are focused on this problem, but you will also find a lot about regularized regression and other such techniques in other machine learning and statistical learning texts.
Feature Extraction – Isabelle Guyon, et al.
Computational Methods of Feature Selection – Huan Liu, Hiroshi Motoda
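Ridge regression is the canonical shrinkage method and makes the idea concrete: a penalty on coefficient size pulls all parameters toward zero, and the coefficient on an irrelevant feature gets pushed toward irrelevance. A sketch with simulated data (my own example, not code from either book):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
relevant = rng.normal(size=n)
irrelevant = rng.normal(size=n)               # no effect on the response
y = 3.0 * relevant + rng.normal(0.0, 0.5, n)
X = np.column_stack([relevant, irrelevant])

def ridge(X, y, lam):
    # Closed form for the ridge estimate: (X'X + lam*I)^-1 X'y.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

ols = ridge(X, y, 0.0)       # ordinary least squares: no shrinkage
shrunk = ridge(X, y, 100.0)  # both coefficients pulled toward zero
print(ols, shrunk)
```

The lasso takes the same idea one step further by using an L1 penalty that drives irrelevant coefficients exactly to zero, which is where "sparsification" and embedded feature selection meet.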
Heuristic search and optimization are important topics, and fortunately I have found a great survey text, included below from the heuristic search and optimization list, that I am working my way through right now. I am also including a link, which a friend pointed out, to a book and lecture slides on convex optimization that you can get online.
Introduction to Stochastic Search and Optimization – James C. Spall
Convex Optimization (online) – Stephen Boyd
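To see why convexity is such a prized property, here is a minimal gradient descent on a one-dimensional convex function (my own toy example, not from either text):

```python
# Minimize f(x) = (x - 3)**2 + 1, a convex function with minimum at x = 3.
# For convex f, following the negative gradient with a small enough step
# size converges to the *global* minimum -- there are no local traps.
def grad(x):
    return 2.0 * (x - 3.0)

x, step = 0.0, 0.1
for _ in range(100):
    x -= step * grad(x)

print(round(x, 6))  # 3.0
```

Non-convex objectives are where the stochastic search techniques in Spall's book earn their keep: gradient descent alone can stall in a local minimum, so you add randomness to escape.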
I’ve added two texts on Bayesian networks and graphical models. Graphical models are a cool family of techniques that let you model rich dependence structures between input variables by relying on a graph structure to represent conditional independence.
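A minimal sketch of the idea, using a three-node chain with made-up probability tables: the graph A → B → C encodes that A and C are conditionally independent given B, and you can verify this directly from the factorized joint distribution.

```python
from itertools import product

# Chain A -> B -> C with made-up conditional probability tables.
p_a = {1: 0.3, 0: 0.7}
p_b_given_a = {1: {1: 0.8, 0: 0.2}, 0: {1: 0.2, 0: 0.8}}  # indexed [a][b]
p_c_given_b = {1: {1: 0.9, 0: 0.1}, 0: {1: 0.1, 0: 0.9}}  # indexed [b][c]

# The graph structure *is* the factorization of the joint:
# P(a, b, c) = P(a) * P(b | a) * P(c | b).
joint = {(a, b, c): p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]
         for a, b, c in product((0, 1), repeat=3)}

def p_c1(b, a=None):
    # P(C=1 | B=b), optionally also conditioning on A=a.
    rows = {k: v for k, v in joint.items()
            if k[1] == b and (a is None or k[0] == a)}
    return sum(v for k, v in rows.items() if k[2] == 1) / sum(rows.values())

# Once B is known, A adds nothing: A is independent of C given B.
print(p_c1(b=1), p_c1(b=1, a=0), p_c1(b=1, a=1))  # all 0.9, up to rounding
```

The payoff of the factorization is computational as well as conceptual: the joint over k binary variables has 2^k entries, but a sparse graph lets you store and reason with only the small conditional tables.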
For those working with text, I’ve added books from the IR and NLP lists – each of which I think represents the strongest single introductory text and reference in its area. Information Retrieval (IR) and Natural Language Processing (NLP) are active areas of study that continue to become increasingly important as a result of the internet. We use IR every day in search engines, but I think new forms of retrieval systems using statistical NLP are just around the corner.
Foundations of Statistical Natural Language Processing – Christopher D. Manning
Network theory is another booming area accelerated by the Internet. Social, information and transportation networks all present very challenging and important problems. An especially ripe and tricky topic is the evolution of networks over time, or what is sometimes called “longitudinal analysis” by the social networks folks. Network theory is extremely useful for determining the importance of information or people within networks – for example, link analysis algorithms like PageRank are based upon it. I’ve tossed in a book on graph theory in case you need to brush up – you will need it for networks.
Introductory Graph Theory – Gary Chartrand
Social Network Analysis – Stanley Wasserman
Statistical Analysis of Network Data – Eric D. Kolaczyk
Network Flows – Ravindra K. Ahuja
Fundamentals of Queueing Theory – Donald Gross, et al.
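As an illustration of link analysis, here is PageRank computed by power iteration on a three-page toy web (my own sketch; the damping factor 0.85 is the conventional choice):

```python
# Three pages: A links to B and C, B links to C, C links back to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
nodes = sorted(links)
d, n = 0.85, len(nodes)  # damping factor and page count

# A page's rank is the stationary probability of a random surfer who
# follows an outgoing link with probability d and teleports to a
# uniformly random page otherwise.
rank = {v: 1.0 / n for v in nodes}
for _ in range(100):  # power iteration toward the stationary distribution
    rank = {v: (1 - d) / n
               + d * sum(rank[u] / len(links[u]) for u in nodes if v in links[u])
            for v in nodes}

# C collects links from both A and B, so it ends up most "important".
print({v: round(r, 3) for v, r in rank.items()})
```

This is exactly the eigenvector computation described in the network theory literature: the rank vector is the leading eigenvector of the damped link matrix, and power iteration is the simplest way to find it.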
You will have to write a lot of code. Learning how to do good engineering work is more important than people realize. Learn some programming languages; python (numpy/scipy), R, and at least one nice functional language (probably Haskell, Clojure, or OCaml). Learn how to work with Linux. Learn about design and testing. Picking up some things on the lean and agile software development approach will likely help – I use it for research projects just as much as software. As I mentioned earlier, algorithms and data structures are critical, and you may also want to learn about parallelism and distributed systems if you work with large data sets.
[ Students working with large data sets. source: flickr ]
You can find a lot more on my Amazon lists and ever-expanding Amazon wish lists. If you are involved in theoretical or applied research in academia or industry, and you disagree with some of these book selections, have alternate recommendations, or think I am missing some topics, I’d love to hear from you.
Lastly, here are Michael Jordan’s recommended readings via the Hacker News thread that sprang up following my last post. He knows a lot more about machine learning than I do, so perhaps you should just ignore the above recommendations and listen to him.
I personally think that everyone in machine learning should be (completely) familiar with essentially all of the material in the following intermediate-level statistics book:
1.) Casella, G. and Berger, R.L. (2001). “Statistical Inference” Duxbury Press.
For a slightly more advanced book that’s quite clear on mathematical techniques, the following book is quite good:
2.) Ferguson, T. (1996). “A Course in Large Sample Theory” Chapman & Hall/CRC.
You’ll need to learn something about asymptotics at some point, and a good starting place is:
3.) Lehmann, E. (2004). “Elements of Large-Sample Theory” Springer.
Those are all frequentist books. You should also read something Bayesian:
4.) Gelman, A. et al. (2003). “Bayesian Data Analysis” Chapman & Hall/CRC.
and you should start to read about Bayesian computation:
5.) Robert, C. and Casella, G. (2005). “Monte Carlo Statistical Methods” Springer.
On the probability front, a good intermediate text is:
6.) Grimmett, G. and Stirzaker, D. (2001). “Probability and Random Processes” Oxford.
At a more advanced level, a very good text is the following:
7.) Pollard, D. (2001). “A User’s Guide to Measure Theoretic Probability” Cambridge.
The standard advanced textbook is Durrett, R. (2005). “Probability: Theory and Examples” Duxbury.
Machine learning research also reposes on optimization theory. A good starting book on linear optimization that will prepare you for convex optimization:
8.) Bertsimas, D. and Tsitsiklis, J. (1997). “Introduction to Linear Optimization” Athena.
And then you can graduate to:
9.) Boyd, S. and Vandenberghe, L. (2004). “Convex Optimization” Cambridge.
Getting a full understanding of algorithmic linear algebra is also important. At some point you should feel familiar with most of the material in
10.) Golub, G., and Van Loan, C. (1996). “Matrix Computations” Johns Hopkins.
It’s good to know some information theory. The classic is:
11.) Cover, T. and Thomas, J. “Elements of Information Theory” Wiley.
Finally, if you want to start to learn some more abstract math, you might want to start to learn some functional analysis (if you haven’t already). Functional analysis is essentially linear algebra in infinite dimensions, and it’s necessary for kernel methods, for nonparametric Bayesian methods, and for various other topics. Here’s a book that I find very readable:
12.) Kreyszig, E. (1989). “Introductory Functional Analysis with Applications” Wiley.