Information geometry applies differential geometry to families of probability distributions, and so to statistical models. Information plays two roles in it: Kullback-Leibler information, or relative entropy, features as a measure of divergence (not quite a metric, because it’s asymmetric), and Fisher information takes the role of curvature.
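
Both roles can be checked numerically on a small case. The following is a minimal sketch, assuming a Bernoulli family (my choice of example; the helper names `kl_bernoulli` and `fisher_bernoulli` are hypothetical). It shows that the divergence is asymmetric, and that for nearby distributions it is approximately half the Fisher information times the squared parameter difference, which is the sense in which Fisher information acts as a local curvature.

```python
import math

def kl_bernoulli(p, q):
    """KL divergence D(p || q) between Bernoulli(p) and Bernoulli(q), in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def fisher_bernoulli(p):
    """Fisher information of a single Bernoulli(p) observation with respect to p."""
    return 1.0 / (p * (1 - p))

# Illustrative parameter values (assumptions, not from the text).
p, q = 0.2, 0.5

# Asymmetry: swapping the arguments changes the value.
forward = kl_bernoulli(p, q)
reverse = kl_bernoulli(q, p)

# Local behaviour: D(p || p + eps) is close to 0.5 * I(p) * eps^2 for small eps.
eps = 1e-3
local_kl = kl_bernoulli(p, p + eps)
quadratic = 0.5 * fisher_bernoulli(p) * eps ** 2

print("D(p || q) =", forward)
print("D(q || p) =", reverse)
print("D(p || p + eps) =", local_kl)
print("0.5 * I(p) * eps^2 =", quadratic)
```

The agreement in the last two lines is only to leading order in eps; the higher-order terms are where the asymmetry shows up.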

One very nice thing about information geometry is that it gives us very strong tools for proving results about statistical models, simply by considering them as well-behaved geometrical objects. Thus, for instance, it’s basically a tautology to say that a manifold is not changing much in the vicinity of points of low curvature, and changing greatly near points of high curvature. Stated more precisely, and then translated back into probabilistic language, this becomes the Cramér-Rao inequality: the variance of an unbiased parameter estimator is at least the reciprocal of the Fisher information. As someone who likes differential geometry, and is now interested in statistics, I find this very pleasing.
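
For n independent observations and an unbiased estimator, the inequality reads Var(estimator) >= 1 / (n I(θ)), where I(θ) is the Fisher information of a single observation. A small simulation sketch, assuming a Bernoulli model with the sample mean as the estimator (an example of my choosing, with arbitrary values for the parameter, sample size and seed), shows the bound being attained:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

p = 0.3            # true Bernoulli parameter (illustrative value)
n = 50             # observations per simulated experiment
trials = 200_000   # number of simulated experiments

# Fisher information of one Bernoulli(p) observation with respect to p.
fisher_per_obs = 1.0 / (p * (1.0 - p))
cramer_rao_bound = 1.0 / (n * fisher_per_obs)

# The sample mean is an unbiased estimator of p.
successes = rng.binomial(n, p, size=trials)
p_hat = successes / n
empirical_variance = p_hat.var()

print(f"Cramer-Rao bound:   {cramer_rao_bound:.6f}")
print(f"empirical variance: {empirical_variance:.6f}")
```

The sample mean of Bernoulli observations has variance exactly p(1 - p)/n, which equals the bound, so the two printed numbers should agree up to simulation noise. Most estimators do not attain the bound; this case was picked because it does.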

As a physicist, I have always been somewhat bothered by the way statisticians seem to accept particular parameterizations of their models as obvious and natural, and build those parameterizations into their procedures. In linear regression, for instance, it’s reasonably common for them to want to find models with only a few non-zero coefficients. This makes my thumbs prick, because it seems to me obvious that if I regressed on arbitrary linear combinations of my covariates, I would have exactly the same information (provided the transformation is invertible), and so I would really be looking at exactly the same model; but in general I would no longer have a small number of non-zero coefficients. In other words, I want to be able to do coordinate-free statistics. Since differential geometry lets me do coordinate-free physics, information geometry seems like an appealing way to do this. There are various information-geometric model selection criteria, which I want to know more about; I suspect, based purely on this disciplinary prejudice, that they will out-perform coordinate-dependent criteria.
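
The regression point can be illustrated with a short sketch. Everything here is an assumption made for the example: the simulated data, the sparse coefficient vector `beta_true`, and the particular invertible matrix `A`. Ordinary least squares gives identical fitted values in both coordinate systems, while the sparse coefficient vector becomes dense after the change of coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

n, d = 200, 5
X = rng.normal(size=(n, d))

# A sparse "true" coefficient vector in the original coordinates (illustrative).
beta_true = np.array([2.0, 0.0, 0.0, -1.0, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# An invertible, well-conditioned change of coordinates (hypothetical choice).
A = np.ones((d, d)) + d * np.eye(d)
Z = X @ A  # new covariates: invertible linear combinations of the old ones

# Least-squares fits in both coordinate systems.
beta_x, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_z, *_ = np.linalg.lstsq(Z, y, rcond=None)

fitted_x = X @ beta_x
fitted_z = Z @ beta_z

# The same true model expressed in the new coordinates: X @ beta = Z @ (A^-1 beta).
gamma_true = np.linalg.solve(A, beta_true)

print("max difference in fitted values:", np.max(np.abs(fitted_x - fitted_z)))
print("true coefficients, old coordinates:", beta_true)
print("true coefficients, new coordinates:", gamma_true)
```

The fitted values agree to numerical precision, so the two regressions describe the same model, yet only the first coordinate system makes it look sparse.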

[From Information Geometry]

The following is from the abstract of the tutorial given by Shun-ichi Amari (RIKEN Brain Science Institute) at Algebraic Statistics 2012.

We give fundamentals of information geometry and its applications. We often treat a family of probability distributions for understanding stochastic phenomena in the world. When such a family includes n free parameters, it is parameterized by a real vector of n dimensions. This is regarded as a manifold, where the parameters play a role of a coordinate system. A natural question arises: What is the geometrical structure to be introduced in such a manifold? Geometrical structure gives, for example, a distance measure between two distributions and a geodesic line connecting two distributions. The second question is to know how useful the geometry is for understanding properties of statistical inference and designing new algorithms for inference.

The first question is answered by the invariance principle such that the geometry should be invariant under coordinate transformations of random variables. More precisely, it should be invariant by using sufficient statistics as random variables. It is surprising that this simple criterion gives a unique geometrical structure, which consists of a Riemannian metric and a family of affine connections which define geodesics.

The unique Riemannian metric is proved to be the Fisher information matrix. The invariant affine connections are not limited to the Riemannian (Levi-Civita) connection but include the exponential and mixture connections, which are dually coupled with respect to the metric. The connections are dually flat in typical cases such as exponential and mixture families.

A dually flat manifold has a canonical divergence function, which in our case is the Kullback-Leibler divergence. This implies that the KL-divergence is induced from the geometrical flatness. Moreover, there exist two affine coordinate systems, one is the natural or canonical parameters and the other is the expectation parameters in the case of an exponential family. They are connected by the Legendre transformation. A generalized Pythagorean theorem holds with respect to the canonical divergence and the pair of dual geodesics. A generalized projection theorem is derived from it.
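
The Legendre duality described above can be sketched for the simplest exponential family. This is a minimal illustration assuming the Bernoulli family (my choice, not the abstract's), with natural parameter θ = log(p / (1 - p)), log-partition function ψ(θ) = log(1 + exp(θ)), and expectation parameter η = p. The function names are hypothetical. The sketch checks the Legendre relation ψ(θ) + φ(η) = θη and that the canonical divergence built from ψ and φ coincides with the Kullback-Leibler divergence.

```python
import math

def natural_param(p):
    """Natural (canonical) parameter theta of Bernoulli(p)."""
    return math.log(p / (1 - p))

def psi(theta):
    """Log-partition function of the Bernoulli family, convex in theta."""
    return math.log(1 + math.exp(theta))

def expectation_param(theta):
    """Expectation parameter eta = d psi / d theta, i.e. the mean p."""
    return 1 / (1 + math.exp(-theta))

def phi(eta):
    """Legendre dual of psi: the negative entropy of Bernoulli(eta)."""
    return eta * math.log(eta) + (1 - eta) * math.log(1 - eta)

def kl_bernoulli(p, q):
    """KL divergence D(p || q) computed directly from the probabilities."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Illustrative parameter values (assumptions).
p, q = 0.2, 0.5
theta_p, theta_q = natural_param(p), natural_param(q)
eta_p = expectation_param(theta_p)

# Legendre relation between the two potentials at the same point.
legendre_gap = psi(theta_p) + phi(eta_p) - theta_p * eta_p

# Canonical divergence in mixed coordinates: eta for p, theta for q.
canonical = psi(theta_q) + phi(eta_p) - theta_q * eta_p

print("Legendre gap (should be about 0):", legendre_gap)
print("canonical divergence:", canonical)
print("KL divergence D(p || q):", kl_bernoulli(p, q))
```

Which argument goes in which coordinate system is a matter of convention; the one used here is the one that reproduces D(p || q) for this family.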

These properties are useful for elucidating and designing algorithms of statistical inference. They are used not only for evaluating the higher-order characteristics of estimation and testing, but also for elucidating machine learning, pattern recognition and computer vision. We further study the procedures of semi-parametric estimation together with estimating functions. It is also applied to the analysis of neuronal spiking data by decomposing the firing rates of neurons and their higher-order correlations orthogonally.

The dually flat structure is useful for optimization of various kinds. A manifold need not be connected with probability distributions. The invariance criterion does not work in such cases. However, a convex function plays a fundamental role in such a case, and we obtain a Riemannian metric together with two dually coupled flat affine connections, connected by the Legendre transformation. The Pythagorean theorem and projection theorem play again a fundamental role in applications of information geometry.
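
As a sketch of how a convex function alone generates such a structure, the divergence it induces is the Bregman divergence D(x, y) = ψ(x) - ψ(y) - <∇ψ(y), x - y>. The code below is an illustration under my own assumptions: the two convex functions, the test points, and the argument convention are all chosen for the example. It also checks the three-point identity D(x, z) = D(x, y) + D(y, z) + <x - y, ∇ψ(y) - ∇ψ(z)>, which reduces to a Pythagorean relation exactly when the inner-product term vanishes.

```python
import numpy as np

def bregman(psi, grad_psi, x, y):
    """Bregman divergence D(x, y) = psi(x) - psi(y) - <grad psi(y), x - y>."""
    return psi(x) - psi(y) - np.dot(grad_psi(y), x - y)

# Convex function 1: half the squared Euclidean norm.
def half_sq_norm(x):
    return 0.5 * np.dot(x, x)

def half_sq_norm_grad(x):
    return x

# Convex function 2: negative entropy on the positive orthant.
def neg_entropy(x):
    return np.sum(x * np.log(x) - x)

def neg_entropy_grad(x):
    return np.log(x)

# Arbitrary positive test points, not probability vectors.
x = np.array([0.2, 0.5, 1.5])
y = np.array([0.4, 0.4, 1.0])
z = np.array([1.0, 0.3, 0.7])

# Half the squared norm gives half the squared Euclidean distance.
euclid = bregman(half_sq_norm, half_sq_norm_grad, x, y)

# Negative entropy gives the generalized KL divergence.
gen_kl = bregman(neg_entropy, neg_entropy_grad, x, y)

# Three-point identity; the cross term measures the failure of orthogonality.
lhs = bregman(neg_entropy, neg_entropy_grad, x, z)
cross = np.dot(x - y, neg_entropy_grad(y) - neg_entropy_grad(z))
rhs = (bregman(neg_entropy, neg_entropy_grad, x, y)
       + bregman(neg_entropy, neg_entropy_grad, y, z)
       + cross)

print("Euclidean case:", euclid, "vs", 0.5 * np.sum((x - y) ** 2))
print("generalized KL:", gen_kl)
print("three-point identity:", lhs, "vs", rhs)
```

In the Euclidean case the cross term is an ordinary inner product of displacement vectors, so the identity reduces to the familiar law of cosines.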