Recently some papers discussed in our journal club  are focused on integrative clustering of multiple omics data sets. I found that they are all originated from factor analysis and make use of the advantage of factor analysis over principal component analysis.

Let’s recall the model for factor analysis:


where X\in{R}^p, \mu\in{R}^p, L\in{R}^{p\times r}, F\in{R}^r (r<p) and \epsilon\in{R}^p, with mean \mu and loading matrix L fixed, and factors F\sim\text{N}(0, I_r), \epsilon\sim\text{N}(0, \Psi) with \Psi diagonal. And we also assume that F and \epsilon are uncorrelated. Note that this model is just characterizing the covariance structure of the Gaussian random vector X\sim\text{N}(\mu, LL^\intercal+\Psi). Now we need to think about the roles of the loading matrix and factors. In fact, we can think about this model in the following way: if we are given such a random vector, then X is just what you see of this random vector in one coordinate system under which the components of X are correlated; but if you look at the random vector in another coordinate system which is a linear transformation (L) of the original coordinate system, then you will see F, whose components are uncorrelated. That is, the randomness in X and F is the same but with different views.  With the observed sample with sample size n,

X_i=\mu+LF_i+\epsilon_i, i=1, 2, \cdots, n,

we can use the EM algorithm to get the MLE of the parameters \mu, L, \Psi (note that you will find maximizing the likelihood directly is hard).

Now for principal component analysis, for clarity, we are going to differentiate the classical (non-probabilistic) principal component analysis and probabilistic principal component analysis. The classical principal component analysis actually has no statistical model. And the probabilistic principal component model is defined as the above factor analysis model with \Psi=\sigma^2 I_p and L is orthonormal. And people can show that as \sigma^2\to 0, it becomes the classical principal component analysis. Actually we know that PCA maximizes data variance captured by the low dimensional projection, or equivalently minimizes the reconstruction error under the L_2 norm of the projected data points with the original data, namely

min\|X-LZ^\intercal\|_F^2, \text{ subject to } L\in{R}^{p\times r} \text{ orthonormal}

where X\in{R}^{p\times n} here is the data matrix, and Z\in{R}^{n\times r}. And we know that solution to this problem is through SVD of the sample covariance: \hat{L}  contains the r eigenvectors corresponding to the largest r eigenvalues. And \hat{Z}^\intercal=\hat{L}^\intercal X are the projected data points. From this analysis, we could see that the difference between factor analysis and the classical principal component analysis is that PCA treats covariance and variance identically, while factor analysis is trying to model to covariance and variance separately. In fact, the r principal components are chosen to capture as much variance as possible, but the r latent variables in a factor analysis model are chosen to explain as much covariance as possible. (Note that all the correlations amongst the variables must be explained by the common factors; if we assume joint normality the observed variables will be conditionally independent given F.)

In applications, we just deal with the data matrix, X\in{R}^{n\times p}. And the loadings of the first principal component, as a vector, denoted as \phi_1=(\phi_{11},\phi_{12},\cdots,\phi_{1p})^\intercal, is a normalized vector, i.e. \|\phi_1\|_2=1, which makes X\phi_1\in{R}^{n} have the largest variance. And we call Z_1=X\phi_1\in{R}^{n} as the first principal component score vector. In the common R package “prcomp” for the principal component analysis, the following command out=prcomp(X, scale=TRUE), can give us the p\times r loading matrix by referring to out$rotation, and the n\times r score matrix by referring to out$x.  That is the loading matrix is \phi=(\phi_1,\phi_2,\cdots,\phi_r)\in{R}^{p\times r} and the score matrix is Z=X\phi\in{R}^{n\times r}, which is the n realizations of the factors F.

Now think about what is the difference between the factor analysis and the probabilistic principal component analysis (PPCA). From the above definition, we see that the main difference is that factor analysis allow individual characteristics through the error term by \epsilon\sim\text{N}(0, \Psi) instead of \Psi=\sigma^2 I_p. In this perspective, we have


with common structure \mu+LF across all components of X and individual characteristics \epsilon_j\sim\text{N}(0, \psi_j). While PPCA does not allow any individual characteristics by assuming \psi_j=\sigma^2 for all j. This essential difference will make factor analysis more useful in integrative data analysis since it has more flexibility.

The AOAS 2013 paper is exactly using the above idea for modeling the integrative clustering:

X_t=L_tF+\epsilon_t, t=1, 2, \cdots, T,

where X_t\in{R}^{p_t} with T data sources. By stacking all the data sources together, we have


which is exactly a simple factor analysis. And this factor analysis model is more useful than PCA in this data integration setup just due to the allowing of individual characteristics for different data sources through \epsilon. And their paper is also dealing with sparsity in L_t.

The 2014 arXived paper is just generalizing the above paper by allowing another layer of individual characteristis:

X_t=L_tF+W_tZ_t+\epsilon_t, t=1, 2, \cdots, T,

But the problem for this one is how to do the estimation. Instead of using EM algorithm as used in the AOAS 2013 paper, they used the one as in the PCA by minimizing the reconstruction error.