You are currently browsing the category archive for the ‘Machine Learning’ category.
 Deep Learning Master Class
 Advances in Variational Inference
 Numerical Optimization: Understanding LBFGS
 An exact mapping between the Variational Renormalization Group and Deep Learning
 New ASA Guidelines for Undergraduate Statistics Programs
 Singular Value Decomposition (We Recommend a Singular Value Decomposition)
 How to explain what a neural network is, simply, vividly, and engagingly?
 Academic vs. Industry Careers
 Hadley Wickham: Impact the world by being useful
 Statisticians in World War II: They also served
 A Brief Overview of Deep Learning
 Advice for applying Machine Learning
 Deep Learning Tutorial
 Gibbs Sampling in Haskell
 How to go parallel in R – basics + tips
There has been a Machine Learning (ML) reading list of books on Hacker News for a while, in which Professor Michael I. Jordan recommends some books for getting started in ML, aimed at people who are going to devote many decades of their lives to the field and who want to get to the research frontier fairly quickly. Recently he articulated the relationship between CS and Stats amazingly well in his reddit AMA, in which he also added some books that dig still further into foundational topics. I just list them here for people’s convenience and my own reference.
 Frequentist Statistics
 Casella, G. and Berger, R.L. (2001). “Statistical Inference” Duxbury Press.—Intermediate-level statistics book.
 Ferguson, T. (1996). “A Course in Large Sample Theory” Chapman & Hall/CRC.—A slightly more advanced book that’s quite clear on mathematical techniques.
 Lehmann, E. (2004). “Elements of Large-Sample Theory” Springer.—About asymptotics, and a good starting place.
 Vaart, A.W. van der (1998). “Asymptotic Statistics” Cambridge.—A book that shows how many ideas in inference (M-estimation, the bootstrap, semiparametrics, etc.) repose on top of empirical process theory.
 Tsybakov, Alexandre B. (2008). “Introduction to Nonparametric Estimation” Springer.—Tools for obtaining lower bounds on estimators.
 B. Efron (2010). “Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction” Cambridge.—A thought-provoking book.
 Bayesian Statistics
 Gelman, A. et al. (2003). “Bayesian Data Analysis” Chapman & Hall/CRC.—About Bayesian data analysis.
 Robert, C. and Casella, G. (2005). “Monte Carlo Statistical Methods” Springer.—About Bayesian computation.
 Probability Theory
 Grimmett, G. and Stirzaker, D. (2001). “Probability and Random Processes” Oxford.—Intermediate-level probability book.
 Pollard, D. (2001). “A User’s Guide to Measure Theoretic Probability” Cambridge.—A more advanced probability book.
 Durrett, R. (2005). “Probability: Theory and Examples” Duxbury.—The standard advanced probability book.
 Optimization
 Bertsimas, D. and Tsitsiklis, J. (1997). “Introduction to Linear Optimization” Athena.—A good starting book on linear optimization that will prepare you for convex optimization.
 Boyd, S. and Vandenberghe, L. (2004). “Convex Optimization” Cambridge.
 Y. Nesterov (2003). “Introductory Lectures on Convex Optimization” Springer.—A starting point for understanding lower bounds in optimization.
 Linear Algebra
 Golub, G., and Van Loan, C. (1996). “Matrix Computations” Johns Hopkins.—Getting a full understanding of algorithmic linear algebra is also important.
 Information Theory
 Cover, T. and Thomas, J. “Elements of Information Theory” Wiley.—Classic information theory.
 Functional Analysis
 Kreyszig, E. (1989). “Introductory Functional Analysis with Applications” Wiley.—Functional analysis is essentially linear algebra in infinite dimensions, and it’s necessary for kernel methods, for nonparametric Bayesian methods, and for various other topics.
Remarks from Professor Jordan: “not only do I think that you should eventually read all of these books (or some similar list that reflects your own view of foundations), but I think that you should read all of them three times—the first time you barely understand, the second time you start to get it, and the third time it all seems obvious.”
p-value and Bayes are the two hottest words in statistics. Actually, I still cannot understand why the debate between frequentist statistics and Bayesian statistics has lasted so long. What are the essential arguments behind it? (Can anyone help me with this?) From my point of view, they are just two ways of solving practical problems. Frequentists use a randomized version of the proof-by-contradiction argument (i.e., a small p-value indicates that the null hypothesis is less likely to be true), while Bayesians use a learning argument to update their beliefs through data. Likewise, mathematicians use partial differential equations (PDEs) to model the real underlying process for analysis. These are just different methodologies for dealing with practical problems. What, then, is the point of the long-lasting debate between frequentist statistics and Bayesian statistics?
Although my current research area is mostly in the frequentist statistics domain, I am becoming more and more of a Bayesian, since the approach is so natural. When I was teaching introductory statistics courses for undergraduate students at Michigan State University, I divided the whole course into three parts: Exploratory Data Analysis (EDA) using R, Bayesian Reasoning, and Frequentist Statistics. I found that at the end of the semester, the example that had most impressed my students came from the second part (Bayesian Reasoning): the Monty Hall problem, which was mentioned in the article that just came out in the NYT. (Regarding the argument from Professor Andrew Gelman, please also check out the response from Professor Gelman.) “Mr. Hall, longtime host of the game show “Let’s Make a Deal,” hides a car behind one of three doors and a goat behind each of the other two. The contestant picks Door No. 1, but before opening it, Mr. Hall opens Door No. 2 to reveal a goat. Should the contestant stick with No. 1 or switch to No. 3, or does it matter?” The Bayesian approach to this problem “would start with one-third odds that any given door hides the car, then update that knowledge with the new data: Door No. 2 had a goat. The odds that the contestant guessed right — that the car is behind No. 1 — remain one in three. Thus, the odds that she guessed wrong are two in three. And if she guessed wrong, the car must be behind Door No. 3. So she should indeed switch.” What a natural argument! Bayesian babies and Google’s untrained search for YouTube cats (via the methods of deep learning) are further excellent examples showing that Bayesian statistics IS a remarkable way of solving problems.
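The Bayesian update above is easy to verify empirically. Here is a quick Monte Carlo sketch (the function name, trial count, and seed are my own illustrative choices):

```python
import random

def monty_hall(switch, trials=100_000, seed=0):
    """Return the empirical win rate of a stick-or-switch strategy."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        car = rng.randrange(3)    # door hiding the car
        pick = rng.randrange(3)   # contestant's first pick
        # Host opens a door that holds a goat and was not picked.
        # (If several qualify, which one he opens does not change the odds.)
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

print(monty_hall(switch=False))  # close to 1/3
print(monty_hall(switch=True))   # close to 2/3
```

Running it reproduces the one-third versus two-thirds split, exactly as the Bayesian argument predicts.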
What about p-values? This randomized proof-by-contradiction argument is also a great way of solving problems, as shown by the fact that it has helped solve so many problems across scientific areas, especially in the bio world. Check out today’s post from Simply Statistics: “You think P-values are bad? I say show me the data,” and also the earlier one: On the scalability of statistical procedures: why the p-value bashers just don’t get it.
The classical p-value does exactly what it says. But it is a statement about what would happen if there were no true effect. That can’t tell you about your long-term probability of making a fool of yourself, simply because sometimes there really is an effect. You make a fool of yourself if you declare that you have discovered something when all you are observing is random chance. From this point of view, what matters is the probability that, when you find a result “statistically significant,” there is actually a real effect. If you find a “significant” result when nothing but chance is at play, your result is a false positive, and the chance of getting a false positive is often alarmingly high. This probability is called the “false discovery rate” (or error rate), which is different from the concept of the same name in multiple comparisons. One common misinterpretation is to read the p-value as the false discovery rate, which may be much higher than the p-value. Think about Bayes’ formula and the tree diagram you learned in an introductory statistics course to figure out the relationship between the p-value and the false discovery rate.
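That exercise can be sketched directly with Bayes’ formula. In the toy numbers below (the prior, power, and threshold are my own illustrative choices, not from any study), a “p &lt; 0.05” discovery turns out to be false more than a third of the time:

```python
def false_discovery_rate(prior, power, alpha):
    """P(no real effect | result declared 'significant'), via Bayes' formula.

    prior: P(a real effect exists among tested hypotheses)
    power: P(significant | real effect)
    alpha: P(significant | no effect), i.e. the significance threshold
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return false_positives / (true_positives + false_positives)

# If only 10% of tested hypotheses are real effects, with 80% power
# and the usual 5% threshold:
print(false_discovery_rate(prior=0.10, power=0.80, alpha=0.05))  # 0.36
```

So with these (perfectly plausible) inputs, the false discovery rate is 36%, seven times the 5% that the p-value threshold seems to promise.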
 Interview with Nick Chamandy, statistician at Google
 You and Your Research + video
 Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained
 A Survival Guide to Starting and Finishing a PhD
 Six Rules For Wearing Suits For Beginners
 Why I Created C++
 More advice to scientists on blogging
 Software engineering practices for graduate students
 Statistics Matter
 What statistics should do about big data: problem forward not solution backward
 How signals, geometry, and topology are influencing data science
 The Bounded Gaps Between Primes Theorem has been proved
 A non-comprehensive list of awesome things other people did this year.
 Jake VanderPlas writes about the Big Data Brain Drain from academia.
 Tomorrow’s Professor Postings
 Best Practices for Scientific Computing
 Some tips for new researchoriented grad students
 3 Reasons Every Grad Student Should Learn WordPress
 How to Lie With Statistics (in the Age of Big Data)
 The Geometric View on Sparse Recovery
 The Mathematical Shape of Things to Come
 A Guide to Python Frameworks for Hadoop
 Statistics, geometry and computer science.
 How to Collaborate On GitHub
 Step by step to build my first R Hadoop System
 Open Sourcing a Python Project the Right Way
 Data Science MD July Recap: Python and R Meetup
 Recent thoughts on git
 10 Reasons Python Rocks for Research (And a Few Reasons it Doesn’t)
 Effective Presentations – Part 2 – Preparing Conference Presentations
 Doing Statistical Research
 How to Do Statistical Research
 Learning new skills
 How to Stand Out When Applying for An Academic Job
 Maturing from student to researcher
 False discovery rate regression (cc NSA’s PRISM)
 Job Hunting Advice, Pt. 3: Networking
 Getting Started with Git
In my office I have two NIPS posters on the wall, from 2011 and 2012. I have never attended, and I am not a computer scientist either, but I like NIPS anyway. Now it’s time for me to organize posts from others:
 NIPS ruminations I
 NIPS II: Deep Learning and the evolution of data models
 NIPS stuff…
 NIPS 2012
 NIPS 2012 Conference in Lake Tahoe, NV
 Thoughts on NIPS 2012
 The Big NIPS Post
 NIPS 2012 : day one
 NIPS 2012 : day two
 Spectral Methods for Latent Models
 NIPS 2012 Trends
And among all of the posts, there are several things I have to digest later on:
 One tutorial on Random Matrices, by Joel Tropp. People concluded in their posts that
Basically, break random matrices down into a sum of simpler, independent random matrices, then apply concentration bounds on the sum.—John Moeller. The basic result is that if you love your Chernoff bounds and Bernstein inequalities for (sums of) scalars, you can get almost exactly the same results for (sums of) matrices.—hal.
 “This year was definitely all about Deep Learning,” John Moeller said. The Geomblog mentioned that although it’s been in the news recently because of Google’s untrained search for YouTube cats, the methods of deep learning (basically neural nets without lots of back-propagation) have been growing in popularity for a long while. And we have to spend some time reading Deep Learning and the evolution of data models, which is related to manifold learning.
 “Another trend that’s been around for a while, but was striking to me, was the detailed study of Optimization methods.”—The Geomblog. There are at least two different workshops on optimization in machine learning (DISC and OPT), and numerous papers that very carefully examined the structure of optimizations to squeeze out empirical improvements.
 Kernel distances: An introduction to the kernel distance from The Geomblog. “Scott Aaronson (at his NIPS invited talk) made this joke about how nature loves ℓ2. The kernel distance is “essentially” the ℓ2 variant of EMD (which makes so many things easier). There’s been a series of papers by Sriperumbudur et al. on this topic, and in a series of works they have shown that (a) the kernel distance captures the notion of “distance covariance” that has become popular in statistics as a way of testing independence of distributions, (b) as an estimator of distance between distributions, the kernel distance has more efficient estimators than (say) the EMD because its estimator can be computed in closed form instead of needing an algorithm that solves a transportation problem, and (c) the kernel that optimizes the efficiency of the two-sample estimator can also be determined (the NIPS paper).”
 Spectral Methods for Latent Models: Spectral methods for latent variable models are based upon the method of moments rather than maximum likelihood.
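As a concrete illustration of point (b) in the kernel-distance item above, here is a minimal sketch of the (biased) closed-form two-sample kernel distance estimator with a Gaussian kernel; the function names and data are my own:

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel between two scalars."""
    return math.exp(-(x - y) ** 2 / (2 * sigma ** 2))

def kernel_distance(xs, ys, k=gaussian_kernel):
    """Squared kernel (MMD) distance between two 1-D samples, in closed form."""
    kxx = sum(k(a, b) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(k(a, b) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(k(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy

same = [0.0, 0.5, 1.0]
print(kernel_distance(same, same))                 # 0.0 for identical samples
print(kernel_distance(same, [5.0, 5.5, 6.0]) > 0)  # True for a shifted sample
```

No transportation problem is solved anywhere: the estimator is just three double sums of kernel evaluations, which is exactly what makes it easier to work with than EMD.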
Besides the papers mentioned in the hot topics above, there are some other papers from Memming’s post:
 Graphical models via generalized linear models: Eunho introduced a family of graphical models with GLM marginals and Ising-model-style pairwise interactions. He said the Poisson Markov random field version must have negative coupling, otherwise the log partition function blows up. He showed conditions under which the graph structure can be recovered with high probability in this family.
 TCA: High dimensional principal component analysis for non-Gaussian data: Using an elliptical copula model (extending the nonparanormal), the eigenvectors of the covariance of the copula variables can be estimated from Kendall’s tau statistic, which is invariant to the nonlinearity of the elliptical distribution and the transformation of the marginals. This estimator achieves close to the parametric convergence rate while being a semiparametric model.
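The Kendall’s tau trick in the TCA item above rests on a classical identity: for elliptical distributions, the correlation of the latent variables is sin(pi * tau / 2). A minimal pure-Python sketch (the function names and toy data are mine):

```python
import math
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau: (concordant - discordant pairs) / total pairs."""
    n = len(xs)
    s = 0
    for i, j in combinations(range(n), 2):
        prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
        s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

def elliptical_correlation(xs, ys):
    """Correlation of the latent elliptical variables via sin(pi * tau / 2)."""
    return math.sin(math.pi * kendall_tau(xs, ys) / 2)

# tau ignores monotone transformations of the marginals:
x = [0.1, 0.4, 0.9, 1.6, 2.5]
y = [math.exp(v) for v in x]         # a monotone transform of x
print(elliptical_correlation(x, y))  # 1.0 for a perfectly monotone pair
```

Because tau depends only on pairwise orderings, distorting the marginals with any monotone map leaves the estimate unchanged, which is exactly the invariance the paper exploits.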
Update: Make sure to check the lectures from the prominent 26th Annual NIPS Conference filmed @ Lake Tahoe 2012. Also make sure to check the NIPS 2012 Workshops, Oral sessions and Spotlight sessions which were collected for the Video Journal of Machine Learning Abstracts – Volume 3.
The following four talks take on four big issues related to big data:
 Jelani Nelson, “Sketching and streaming algorithms for processing massive data”
 Ronitt Rubinfeld, “Taming big probability distributions”
 Jeff Ullman, “Designing good MapReduce algorithms”
 Ashwin Machanavajjhala and Jerome P. Reiter, “Big Privacy”
From XRDS.
And how do we deal with the above four big issues? Here is a post about Five Trendy Open Source Technologies that can help you deal with big data.
Today I saw a question linked from reddit: How important is Java/C++ vs. just using R/Matlab for big data? I learned C++ and Matlab as an undergraduate, and I am now teaching myself R as a PhD student in a Stats department. But in this big-data era, R alone is really not enough for scientific computing. Hence this question is exactly what I want answered. Here I want to organize interesting materials, including posts, about programming, especially R and C++.
First I want to mention the top project languages on GitHub: JavaScript 20%, Ruby 14%, Python 9%, Shell 8%, Java 8%, PHP 7%, C 7%, C++ 4%, Perl 4%, Objective-C 3%, among lots of other languages including R, Julia, and Matlab. Of these top 10 languages, I only know C and C++. For people like me, here is a description of each:
 JavaScript
JavaScript is an object-oriented scripting language that runs in your web browser. It runs on a simplified set of commands, is easier to code, and doesn’t require compiling. It’s an important language since it’s embedded into HTML, which happens to be used in millions of web pages to validate forms, create cookies, detect browsers, and improve page design and formatting. Big plus: it’s easy to learn and use.
 Ruby and Ruby on Rails
Ruby is a dynamic, object-oriented, open-source programming language; Ruby on Rails is an open-source web application framework written in Ruby that closely follows the MVC (Model-View-Controller) architecture. With a focus on simplicity, productivity, and letting the computer do the work, its usage has spread quickly in a few years. Ruby is very similar to Python, but with different syntax and libraries. There’s little reason to learn both, so unless you have a specific reason to choose Ruby (e.g., if it’s the language all your colleagues use), I’d go with Python. Ruby on Rails is one of the most popular web development frameworks out there, so if you’re looking to do primarily web development you should compare Django (a Python framework) and RoR first.
 Python
Python is an interpreted, dynamically typed programming language. Python programs stress code readability, so even non-programmers should be able to decipher a Python program with relative ease. This also makes it one of the easiest languages to learn and write code in quickly. Python is very popular and has a strong set of libraries for everything from numerical and symbolic computing to data visualization and graphical user interfaces.
 Java
Java is an object-oriented programming language developed by James Gosling and colleagues at Sun Microsystems in the early 1990s. Why you should learn it: hailed by many developers as a “beautiful” language, it is central to the non-.NET programming experience. Learning Java is critical if you are non-Microsoft.
 PHP
PHP is an open-source, server-side HTML scripting language well suited for web developers, as it can easily be embedded into standard HTML pages. You can run 100% dynamic pages or hybrid pages, 50% HTML + 50% PHP.
 C
C is a standardized, general-purpose programming language. It’s one of the most pervasive languages and the basis for several others (such as C++). It’s important to learn C. Once you do, making the jump to Java or C# is fairly easy, because a lot of the syntax is common. C is a low-level, statically typed, compiled language. The main benefit of C is its speed, so it’s useful for tasks that are very computationally intensive. Because it’s compiled into an executable, it’s also easier to distribute C programs than programs written in interpreted languages like Python. The trade-off for the increased speed is decreased programmer efficiency. C++ is C with some additional object-oriented features built in. It can be slower than C, but the two are pretty comparable, so it’s up to you whether those additional features are worth it.
 Perl
Perl is an open-source, cross-platform, server-side interpreted programming language used extensively to process text through CGI programs. Perl’s power in processing piles of text has made it very popular and widely used for writing web server programs for a range of tasks.
This ranking covers only GitHub users, so it is biased. For the stats sphere, I think C/C++, R, Julia, Matlab, Java, Python, and Perl will be the popular ones.
 Advice on learning C++ from an R background
 Integrating C or C++ into R, where to start?
 R for testing, C++ for implementation?
 Some thoughts on Java—compared with C++
 A list of RSS C++ blogs
 Get started with C++ AMP
 C++11 Concurrency Series
 Google’s Python Class and Google’s C++ Class from Google Code University
 Integrating R and C++
 Learn Python on Codecademy
 Learn How to Code Without Leaving Your Browser
 Minimal Advice to Undergrads on Programming
 Learning R Via Python (or the other way around).
 Bloom teaches Python for Scientific Computing at Berkeley (available as a podcast).

What are the three most important programming languages to learn?—The following is from Waleed Kadous, PhD in Computer Science: I would focus on learning three classes of languages to really understand the nature of programming and to have a decent toolkit. Everything else is basically a variant on these.
Learn a low-level language so you understand what goes on at the bare metal and so you can make hardware dance
The obvious choice here is C, but assembly language might also be good.
Learn a language for architecting large systems
If you want to build large code bases, you’re going to need one of the strongly typed languages. Personally, I think Java is the best choice here; but C++, Scala, and even Ada are acceptable.
Learn a language for scripting things together quickly
There are a few choices here: shell, Python, Perl, Lua. Any of these will do, but Python is probably the foremost. These are great for gluing existing pieces together.
Now, if you only get three, that’s it. But I’m going to suggest two more categories.
Learn a language that forces you to think differently about programming
These are majorly different world perspectives. Examples here would be functional programming languages like Haskell, ML, etc., but also logic programming like Prolog.
Learn a language that lets you build web-based applications quickly
This could be web2py or JavaScript — but the ability to quickly hack together a web demo is really useful today.
The following is from Revolutions:
John Myles White, self-described “statistics hacker” and co-author of “Machine Learning for Hackers,” was interviewed recently by The Setup. In the interview, he describes some of his go-to R packages for data science:
Most of my work involves programming, so programming languages and their libraries are the bulk of the software I use. I primarily program in R, but, if the situation calls for it, I’ll use Matlab, Ruby or Python. …
That said, for me the specific language I use is much less important than the libraries available for that language. In R, I do most of my graphics using ggplot2, and I clean my data using plyr, reshape, lubridate and stringr. I do most of my analysis using rjags, which interfaces with JAGS, and I’ll sometimes use glmnet for regression modeling. And, of course, I use ProjectTemplate to organize all of my statistical modeling work. To do text analysis, I’ll use the tm and lda packages.
Also in JMW’s toolbox: Julia, TextMate 2, MySQL, Dropbox and a beefy MacBook. Read the full interview linked below for an insightful look at how he uses these and other tools day to day.
The Setup / Interview: John Myles White
 M. A. Álvarez, L. Rosasco and N. D. Lawrence, Kernels for vector-valued functions: a review, tech report, 2011.
 A. Argyriou, M. Pontil, and C.A. Micchelli, When is there a representer theorem? Vector versus matrix regularizers, Journal of Machine Learning Research, 10:2507–2529, 2009.
 G. Bakir, T. Hofmann, B. Schölkopf, A. Smola, B. Taskar and S. Vishwanathan (Eds.), Predicting Structured Data, MIT Press, 2007.
 C. Brouard, F. d’Alché-Buc and M. Szafranski, Semi-supervised Penalized Output Kernel Regression for Link Prediction, in Proceedings of the 28th International Conference on Machine Learning (ICML 2011), Bellevue, WA, USA, 2011.
 A. Caponnetto, M. Pontil, C.A. Micchelli and Y. Ying, Universal multi-task kernels, Journal of Machine Learning Research, 9:1615–1646, 2008.
 S. Dabo-Niang and F. Ferraty (Eds.), Functional and Operatorial Statistics, Springer-Verlag, New York, 2008.
 P. Geurts, L. Wehenkel and F. d’Alché-Buc, Kernelizing outputs of tree-based methods, in Proceedings of the 23rd International Conference on Machine Learning (ICML 2006), Pittsburgh, PA, USA, 2006, pp. 345–352.
 P. Geurts, L. Wehenkel and F. d’Alché-Buc, Gradient Boosting for Kernelized Output Spaces, in Proceedings of the 24th International Conference on Machine Learning (ICML 2007), Corvallis, Oregon, USA, 2007.
 F. Ferraty, A. Laksaci, A. Tadj and P. Vieu, Kernel regression with functional response, Electronic Journal of Statistics, 5:159–171, 2011.
 S. Jung, M. Foskey and J. S. Marron, Principal Arc Analysis on direct product manifolds, The Annals of Applied Statistics, 5:578–603, 2011.
 H. Kadri, A. Rabaoui, P. Preux, E. Duflos and A. Rakotomamonjy, Functional Regularized Least Squares Classification with Operator-valued Kernels, in Proceedings of the 28th International Conference on Machine Learning (ICML 2011), Bellevue, WA, USA, 2011.
 H. Kadri, E. Duflos, P. Preux, S. Canu and M. Davy, Nonlinear functional regression: a functional RKHS approach, in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), Italy, 2010.
 T. Kato, Perturbation theory for linear operators, Springer-Verlag, Berlin, 1966.
 C.A. Micchelli and M. Pontil, On learning vector-valued functions, Neural Computation, 17:177–204, 2005.
 J. O. Ramsay and B. W. Silverman, Functional Data Analysis, Springer-Verlag, 2nd ed., 2005.
There is a workshop for this:
1. Is a PhD worth it in machine learning?
It’s about the dilemma of a statistics graduate who wants a degree in machine learning.
2. What is the use of manifold learning?
Why do we need manifold learning? What is the biggest advantage of manifold learning?
3. Is the p-value a good measure of evidence?
“Statistics abounds criteria for assessing quality of estimators, tests, forecasting rules, classification algorithms, but besides the likelihood principle discussions, it seems to be almost silent on what criteria should a good measure of evidence satisfy.” M. Grendár
4. What’s the relationship between Principal Component Analysis (PCA) and Ordinary Least Squares (OLS)?
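On question 4, one concrete way to see the relationship: OLS minimizes squared vertical (y-direction) errors, while the first principal component minimizes squared orthogonal errors, and in two dimensions both fitted lines have closed-form slopes built from the same centered moments. A toy sketch (the function names and data are my own illustrative choices):

```python
import math

def centered_moments(x, y):
    """Centered second moments sxx, syy, sxy of paired data."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxx, syy, sxy

def ols_slope(x, y):
    """OLS line: minimizes vertical squared errors."""
    sxx, _, sxy = centered_moments(x, y)
    return sxy / sxx

def pca_slope(x, y):
    """First principal axis: minimizes orthogonal squared errors.

    Slope of the top eigenvector of the 2x2 covariance matrix
    (assumes sxy != 0).
    """
    sxx, syy, sxy = centered_moments(x, y)
    return (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.1, 1.2, 1.8, 3.3, 3.6]
print(ols_slope(x, y), pca_slope(x, y))
```

On noisy data the two slopes differ: OLS treats x as fixed and attenuates the slope toward zero, while the PCA (total least squares) line treats both coordinates as noisy and comes out steeper.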