Data Analysis using Computational Topology and Geometric Statistics

A standard paradigm assumes that the data comes from some underlying geometric structure, such as a curved submanifold or a singular algebraic variety. The observed data is obtained as a random sample from this space, and the objective is to statistically recover features of the underlying space and/or the distribution that generated the sample.

In Geometric Statistics, one uses the underlying Riemannian structure to recover quantitative information concerning the probability distribution and/or functionals thereof. The idea is to extend statistical estimation techniques to functions over Riemannian manifolds, utilizing spectral methods adapted to the Riemannian structure.

One then asks how accurate these estimators are. Considerable progress has been made on optimal estimation in the minimax sense. These ideas have far-reaching implications for the analysis of high-dimensional data in areas such as astronomy, biomechanics, medical imaging, microwave engineering and texture analysis.
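As a minimal illustration of a spectral estimator, consider the simplest compact Riemannian manifold, the circle, where the eigenfunctions of the Laplacian are the constant function together with cos(kθ) and sin(kθ). The sketch below builds a truncated orthogonal series estimate of a density from a random sample of angles. It is only an assumed toy example, not a method taken from the workshop: the function name circle_density_estimator, the cutoff max_freq and the von Mises test sample are all hypothetical choices made for illustration.

```python
import math
import random


def circle_density_estimator(samples, max_freq):
    """Truncated Fourier (Laplacian eigenfunction) density estimate on the circle.

    samples  : angles in radians, assumed drawn i.i.d. from the unknown density
    max_freq : largest frequency kept in the expansion (a smoothing parameter)
    """
    n = len(samples)
    # Empirical Fourier coefficients: sample means of cos(k*theta) and sin(k*theta).
    a = [sum(math.cos(k * t) for t in samples) / n for k in range(1, max_freq + 1)]
    b = [sum(math.sin(k * t) for t in samples) / n for k in range(1, max_freq + 1)]

    def f_hat(theta):
        series = sum(
            a[k - 1] * math.cos(k * theta) + b[k - 1] * math.sin(k * theta)
            for k in range(1, max_freq + 1)
        )
        # The constant term 1/(2*pi) makes the estimate integrate to one.
        return 1.0 / (2.0 * math.pi) + series / math.pi

    return f_hat


# Hypothetical test data: a von Mises sample concentrated around theta = pi.
random.seed(0)
samples = [random.vonmisesvariate(math.pi, 2.0) for _ in range(2000)]
f_hat = circle_density_estimator(samples, max_freq=5)
```

The cutoff max_freq plays the role of a bandwidth: a larger value reduces bias but increases variance, and this trade-off is what minimax analysis is used to study. A truncated series of this kind can take small negative values, so it is a sketch of the idea rather than a finished estimator.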

In Computational Topology, one attempts to recover more qualitative global features of the underlying data instead, such as connectedness, or the number of holes, or the existence of obstructions to certain constructions, based upon the random sample. In other words, one hopes to recover the underlying topology. An advantage of topology is that it is stable under deformations and thus insensitive to errors introduced in the sampling.

A combinatorial construction such as the alpha complex or the Čech complex converts the discrete data into an object for which it is possible to compute the topology. However, it is quickly apparent that such a construction and its calculated topology depend on the scale at which one considers the data. A multiscale solution to this problem is the technique of persistent homology. It quantifies the persistence of topological features as the scale changes. Persistent homology is useful for visualization, feature detection and object recognition. It has been successfully applied to analyze natural images, neurological data, gene-chip data, protein binding and sensor networks.
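The dependence on scale can be made concrete in the simplest case, that of connected components (degree-zero homology). The following sketch assumes a filtration in which two points are joined once the scale reaches their Euclidean distance. Every component is born at scale 0, and a component dies when it merges with another one. The function name zero_dim_persistence and the sample points are hypothetical, and the union-find approach shown here handles only connectedness, not holes or higher-dimensional features.

```python
import itertools
import math


def zero_dim_persistence(points):
    """Return the finite death scales of connected components (degree zero).

    Each point starts as its own component, born at scale 0. Processing edges
    in order of increasing length, a component dies whenever an edge joins
    two previously separate components.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Representative of the component containing i (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by Euclidean length (the filtration order).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )

    deaths = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj        # merge the two components
            deaths.append(length)  # one component dies at this scale
    return deaths


# Hypothetical sample: two well-separated clusters in the plane.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 0.0), (11.0, 0.0)]
deaths = zero_dim_persistence(points)
print(deaths)  # [1.0, 1.0, 1.0, 9.0]
```

In this example three components die at scale 1 and one survives until scale 9. The long-lived component reflects the two-cluster structure, whereas the short-lived ones are features of the sampling. One component never dies, so it does not appear in the list.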

Although Geometric Statistics and Computational Topology appear disparate and seem to have different objectives, it has recently been noticed that they share common ground through statistical sampling. In particular, the metric on persistent homology used in Computational Topology is intimately related to the sup-norm distance between the underlying density that generates a random sample on a Riemannian manifold and its statistical estimator. Consequently, the qualitative and quantitative data analyses are closely linked, which is not surprising given the traditionally close connection between geometry and topology.
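One way to make this relationship precise, assuming that the metric in question is the bottleneck distance between persistence diagrams, is the standard stability inequality for sufficiently well-behaved (tame) functions. The notation is introduced here for illustration and is not taken from the text above: f denotes the density on the manifold M, \hat f_n its estimator based on n sample points, Dgm the persistence diagram of the corresponding level-set filtration, and d_B the bottleneck distance.

```latex
d_B\bigl(\mathrm{Dgm}(f),\, \mathrm{Dgm}(\hat f_n)\bigr)
  \;\le\; \| f - \hat f_n \|_\infty
  \;=\; \sup_{x \in M} \bigl| f(x) - \hat f_n(x) \bigr|
```

Under these assumptions, any bound on the sup-norm error of the density estimator carries over directly to a bound on the error in the estimated persistence diagram.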

The use of geometric and topological methods for statistical data analysis is currently being pursued in the three allied fields of computer science, mathematics and statistics. Although each field has its own particular approach and questions of interest, the amount of similarity is striking, and this workshop was able to bring all three fields together. The open problems considered concerned the development of computational and statistical algorithms and methods that use aspects of geometry and topology when only data over the geometric object is available.

The investigations pertaining to the three aforementioned fields can be summarized as follows; a more detailed description is provided in the following section:

  • In computer science, the pursuit naturally focused on efficient algorithms and visualization. Specific items discussed included algorithms for the discrete approximation of the Laplacian, algorithms for approximating the cut locus, data reduction techniques, and recovery from noisy data;
  • In mathematics, the interest focused on certain constructions. Topics included zigzag persistence, Hodge theory, and recovering the topology over a random field;
  • In statistics, parameter estimation was the main interest, and topics included bootstrapping and MCMC on manifolds, geodesic PCA, asymptotic minimaxity, conditional independence, statistical multiscale analysis and analysis over the Euclidean motion group.

Additionally, some physical applications were discussed, such as brain mapping, network analysis and biomechanics of osteoarthritis.