You are currently browsing the category archive for the ‘Probability’ category.

There has been a Machine Learning (ML) reading list of books in hacker news for a while, where Professor Michael I. Jordan recommend some books to start on ML for people who are going to devote many decades of their lives to the field, and who want to get to the research frontier fairly quickly. Recently he articulated the relationship between CS and Stats amazingly well in his recent reddit AMA, in which he also added some books that dig still further into foundational topics. I just list them here for people’s convenience and my own reference.

• Frequentist Statistics
1. Casella, G. and Berger, R.L. (2001). “Statistical Inference” Duxbury Press.—Intermediate-level statistics book.
2. Ferguson, T. (1996). “A Course in Large Sample Theory” Chapman & Hall/CRC.—For a slightly more advanced book that’s quite clear on mathematical techniques.
3. Lehmann, E. (2004). “Elements of Large-Sample Theory” Springer.—About asymptotics which is a good starting place.
4. Vaart, A.W. van der (1998). “Asymptotic Statistics” Cambridge.—A book that shows how many ideas in inference (M estimation, the bootstrap, semiparametrics, etc) repose on top of empirical process theory.
5. Tsybakov, Alexandre B. (2008) “Introduction to Nonparametric Estimation” Springer.—Tools for obtaining lower bounds on estimators.
6. B. Efron (2010) “Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction” Cambridge,.—A thought-provoking book.
• Bayesian Statistics
1. Gelman, A. et al. (2003). “Bayesian Data Analysis” Chapman & Hall/CRC.—About Bayesian.
2. Robert, C. and Casella, G. (2005). “Monte Carlo Statistical Methods” Springer.—about Bayesian computation.
• Probability Theory
1. Grimmett, G. and Stirzaker, D. (2001). “Probability and Random Processes” Oxford.—Intermediate-level probability book.
2. Pollard, D. (2001). “A User’s Guide to Measure Theoretic Probability” Cambridge.—More advanced level probability book.
3. Durrett, R. (2005). “Probability: Theory and Examples” Duxbury.—Standard advanced probability book.
• Optimization
1. Bertsimas, D. and Tsitsiklis, J. (1997). “Introduction to Linear Optimization” Athena.—A good starting book on linear optimization that will prepare you for convex optimization.
2. Boyd, S. and Vandenberghe, L. (2004). “Convex Optimization” Cambridge.
3. Y. Nesterov and Iu E. Nesterov (2003). “Introductory Lectures on Convex Optimization” Springer.—A start to understand lower bounds in optimization.
• Linear Algebra
1. Golub, G., and Van Loan, C. (1996). “Matrix Computations” Johns Hopkins.—Getting a full understanding of algorithmic linear algebra is also important.
• Information Theory
1. Cover, T. and Thomas, J. “Elements of Information Theory” Wiley.—Classic information theory.
• Functional Analysis
1. Kreyszig, E. (1989). “Introductory Functional Analysis with Applications” Wiley.—Functional analysis is essentially linear algebra in infinite dimensions, and it’s necessary for kernel methods, for nonparametric Bayesian methods, and for various other topics.

Remarks from Professor Jordan: “not only do I think that you should eventually read all of these books (or some similar list that reflects your own view of foundations), but I think that you should read all of them three times—the first time you barely understand, the second time you start to get it, and the third time it all seems obvious.”

Today, there will be a talk,  Imaginary Geometry and the Gaussian Free Field, given by Jason Miller from Microsoft Research. I just googled it and found the following interesting materials:

1. Gaussian free fields for mathematicians
2. Gaussian free field and conformal field theory: In these expository lectures, it gives an elementary introduction to conformal field theory in the context of probability theory and complex analysis. It considers statistical fields, and defines Ward functionals in terms of their Lie derivatives. Based on this approach, it explains some equations of conformal field theory and outline their relation to SLE theory.
3. SLE and the free field: Partition functions and couplings
4. Schramm-Loewner evolution (SLE). See slides by Tom Alberts and 2006 ICM slides by Oded Schramm and St. Flour Lecture Notes by Wendelin Werner . See also Ito’s lemma notes .
There will be a talk,  “Landscape of Random Functions in Many Dimensions via Random Matrix Theory”, next week given by  Antonio Auffinger from the University of Chicago.
Abstract: How many critical values a typical Morse function have on a high dimensional manifold? Could we say anything about the topology of its level sets? In this talk I will survey a joint work with Gerard Ben Arous and Jiri Cerny that addresses these questions in a particular but fundamental example. We investigate the landscape of a general Gaussian random smooth function on the N-dimensional sphere. These corresponds to Hamiltonians of well-known models of statistical physics, i.e spherical spin glasses. Using the classical Kac-Rice formula, this counting boils down to a problem in Random Matrix Theory. This allows us to show an interesting picture for the complexity of these random Hamiltonians, for the bottom of the energy landscape, and in particular a strong correlation between the index and the critical value. We also propose a new invariant for the possible transition between the so-called 1-step replica symmetry breaking and a Full Replica symmetry breaking scheme and show how the complexity function is related to the Parisi functional.
This topic is kind of a combination of my majors, differential geometry, probability and statistics. I am interested in this although I can imagine that it is hard.

Today I want to say something basic:

1, We know in Calculus, taylor expansion is extremely useful, since it’s the polynomial approximation of functions. Thus in particular, for some limits, you could always refer to the taylor expansion first and then everything will be simple.

In probability and statistics, we know that a statistic is nothing but a function of the sample. Let X_{1}, X_{2},…,X_{n} be sample points. Then a statistic could just be expressed as T_{n}=f (X_{1}, X_{2},…,X_{n} ). So if we want to discuss the asymptotic properties of the statistic, a good way is to express the statistic in the taylor expansion first. And I think we should always do like this, i.e. taylor expansion first. Then delta method and slutsky’s lemma could be involved in for you to use together with the central limit theorems, which is the foundation for the discussion of asymptotic properties.

2, Why statistics? What the difference between statistics and probability?

In the reality, everything has noises so that it is difficult for us to see the underlying principle. For statistics, it deals with the raw data to find out the simple rule covered by the noised data. Thus if you want to find out the relationship between the heights and weights of humans,  why use regression method? That is because we regard the data we got are noised, we should not just use all the data points to find out the precise curve through every data point. That curve does not make any sense in reality. We should think of the different heights of some fixed weight as the noised data, and we want to use statistics to find out the simple relationship between these two variables for kind of prediction. Therefore, simple precise mathematics+noise will be statistics. How to model the noise, this is related to the measure theory.

The difference between probability and statistics is kind of probability is mathematics and statistics kind of data management. What does it mean? I mean probability definitely belongs to mathematics, since it is just based on axioms and rules, nothing else. But statistics is just the opposite. Started with raw data, you can deal with the data anyway without any rules. Play with the data as much as you can. But what’s the connection between these two? Statistics as a function of random variables, which is controlled by the underlying unknown rules (probability distributions), could have many properties got from the analysis using probability.

In machine learning, we often take probability for granted. We desire a system for representing uncertainty in the world, and Cox’s theorem tells us that if we accept some basic postulates regarding what is desirable in a system of uncertainty, we will end up with probability.

So that should be the end of the story… right? Well, maybe not. The first Cox postulate is

Divisibility and comparability – The plausibility of a statement is a real number and is dependent on information we have related to the statement,

which seems quite innocent. However, who’s to say that there is anything fundamental about real numbers? Real numbers have strange things like irrational numbers and negative numbers (crazy, I know), but they’re lacking in comparison to imaginary numbers (there’s no operation that you can apply 4 times before returning to your original value, which you can do by multiplying by i with imaginary numbers). It seems kind of arbitrary to choose real numbers. For a fun and interesting read, see the following link. It makes the point better than I can:

Negative numbers aren’t easy. Imagine you’re a European mathematician in the 1700s. You have 3 and 4, and know you can write 4 – 3 = 1. Simple.

But what about 3-4? What, exactly, does that mean? How can you take 4 cows from 3? How could you have less than nothing?

Negatives were considered absurd, something that “darkened the very whole doctrines of the equations” (Francis Maseres, 1759). Yet today, it’d be absurd to think negatives aren’t logical or useful. Try asking your teacher whether negatives corrupt the very foundations of math.

http://betterexplained.com/articles/a-visual-intuitive-guide-to-imaginary-numbers/

Imaginary numbers come up in the context of systems of uncertainty when we deal with quantum mechanics. The basic idea is that interactions operate over amplitudes (expressed as complex numbers), then to determine the likelihood of a final configuration, you look at norms of amplitudes. For a relatively straightforward explanation, see here: http://lesswrong.com/lw/pd/configurations_and_amplitude/

So I don’t necessarily have any well-formed thoughts on the matter (yet?), but it’s fun to think about other principled ways of representing uncertainty. I’m curious to know if there are types of interactions useful for machine learning that would be hard to represent with standard probability models but that would be aided by these types of quantum models.

Finally, I leave you with this blog comment from The Blog of Scott Aaronson:

“graphical models with amplitudes instead of probabilities” is a fair definition of a quantum circuit (and therefore a quantum computer).

http://scottaaronson.com/blog/?p=74#comment-1702

That seems to me, worth understanding deeper.