Recently some papers discussed in our journal club  are focused on integrative clustering of multiple omics data sets. I found that they are all originated from factor analysis and make use of the advantage of factor analysis over principal component analysis.

Let’s recall the model for factor analysis:

$X=\mu+LF+\epsilon,$

where $X\in{R}^p, \mu\in{R}^p, L\in{R}^{p\times r}, F\in{R}^r$ ($r) and $\epsilon\in{R}^p$, with mean $\mu$ and loading matrix $L$ fixed, and factors $F\sim\text{N}(0, I_r), \epsilon\sim\text{N}(0, \Psi)$ with $\Psi$ diagonal. And we also assume that $F$ and $\epsilon$ are uncorrelated. Note that this model is just characterizing the covariance structure of the Gaussian random vector $X\sim\text{N}(\mu, LL^\intercal+\Psi)$. Now we need to think about the roles of the loading matrix and factors. In fact, we can think about this model in the following way: if we are given such a random vector, then $X$ is just what you see of this random vector in one coordinate system under which the components of $X$ are correlated; but if you look at the random vector in another coordinate system which is a linear transformation ($L$) of the original coordinate system, then you will see $F$, whose components are uncorrelated. That is, the randomness in $X$ and $F$ is the same but with different views.  With the observed sample with sample size $n$,

$X_i=\mu+LF_i+\epsilon_i, i=1, 2, \cdots, n,$

we can use the EM algorithm to get the MLE of the parameters $\mu, L, \Psi$ (note that you will find maximizing the likelihood directly is hard).

Now for principal component analysis, for clarity, we are going to differentiate the classical (non-probabilistic) principal component analysis and probabilistic principal component analysis. The classical principal component analysis actually has no statistical model. And the probabilistic principal component model is defined as the above factor analysis model with $\Psi=\sigma^2 I_p$ and $L$ is orthonormal. And people can show that as $\sigma^2\to 0$, it becomes the classical principal component analysis. Actually we know that PCA maximizes data variance captured by the low dimensional projection, or equivalently minimizes the reconstruction error under the $L_2$ norm of the projected data points with the original data, namely

$min\|X-LZ^\intercal\|_F^2, \text{ subject to } L\in{R}^{p\times r} \text{ orthonormal}$

where $X\in{R}^{p\times n}$ here is the data matrix, and $Z\in{R}^{n\times r}$. And we know that solution to this problem is through SVD of the sample covariance: $\hat{L}$  contains the $r$ eigenvectors corresponding to the largest $r$ eigenvalues. And $\hat{Z}^\intercal=\hat{L}^\intercal X$ are the projected data points. From this analysis, we could see that the difference between factor analysis and the classical principal component analysis is that PCA treats covariance and variance identically, while factor analysis is trying to model to covariance and variance separately. In fact, the $r$ principal components are chosen to capture as much variance as possible, but the $r$ latent variables in a factor analysis model are chosen to explain as much covariance as possible. (Note that all the correlations amongst the variables must be explained by the common factors; if we assume joint normality the observed variables will be conditionally independent given $F$.)

In applications, we just deal with the data matrix, $X\in{R}^{n\times p}$. And the loadings of the first principal component, as a vector, denoted as $\phi_1=(\phi_{11},\phi_{12},\cdots,\phi_{1p})^\intercal$, is a normalized vector, i.e. $\|\phi_1\|_2=1$, which makes $X\phi_1\in{R}^{n}$ have the largest variance. And we call $Z_1=X\phi_1\in{R}^{n}$ as the first principal component score vector. In the common R package “prcomp” for the principal component analysis, the following command out=prcomp(X, scale=TRUE), can give us the $p\times r$ loading matrix by referring to out$rotation, and the $n\times r$ score matrix by referring to out$x.  That is the loading matrix is $\phi=(\phi_1,\phi_2,\cdots,\phi_r)\in{R}^{p\times r}$ and the score matrix is $Z=X\phi\in{R}^{n\times r}$, which is the $n$ realizations of the factors $F$.

Now think about what is the difference between the factor analysis and the probabilistic principal component analysis (PPCA). From the above definition, we see that the main difference is that factor analysis allow individual characteristics through the error term by $\epsilon\sim\text{N}(0, \Psi)$ instead of $\Psi=\sigma^2 I_p$. In this perspective, we have

$X=\mu+LF+\epsilon,$

with common structure $\mu+LF$ across all components of $X$ and individual characteristics $\epsilon_j\sim\text{N}(0, \psi_j)$. While PPCA does not allow any individual characteristics by assuming $\psi_j=\sigma^2$ for all $j$. This essential difference will make factor analysis more useful in integrative data analysis since it has more flexibility.

The AOAS 2013 paper is exactly using the above idea for modeling the integrative clustering:

$X_t=L_tF+\epsilon_t, t=1, 2, \cdots, T,$

where $X_t\in{R}^{p_t}$ with $T$ data sources. By stacking all the data sources together, we have

$X=LF+\epsilon.$

which is exactly a simple factor analysis. And this factor analysis model is more useful than PCA in this data integration setup just due to the allowing of individual characteristics for different data sources through $\epsilon$. And their paper is also dealing with sparsity in $L_t$.

The 2014 arXived paper is just generalizing the above paper by allowing another layer of individual characteristis:

$X_t=L_tF+W_tZ_t+\epsilon_t, t=1, 2, \cdots, T,$

But the problem for this one is how to do the estimation. Instead of using EM algorithm as used in the AOAS 2013 paper, they used the one as in the PCA by minimizing the reconstruction error.

On this Tuesday, Professor Xuming He presented their recent work on subgroup analysis, which is very interesting and useful in reality. Think about the following very much practical problem (since the drug is expensive or has certain amount of side effect):

If you are given the drug response, some baseline covariates which have nothing to do with the treatment, and the treatment indicator as well as some post-treatment measurements, how could you come up with a statistical model to tell whether there exist subgroups which respond to the treatment differently?

Think about 5 minutes and continue the following!

Dr He borrowed a very traditional model in Statistics, logistic-normal mixture model to study the above problem. The existence of the two subgroups is characterized by the observed baseline covariates, which have nothing to do with the treatment:

$P(\delta_i=1)=\pi(X_i^\intercal\gamma),$

where $\delta_i$ is the unobserved membership index. And the observed response follows a normal mixture model

$Y_i=Z_i^\intercal(\beta_1+\beta_2\delta_i)+\epsilon_i,$

with different means $Z_i^\intercal\beta_1$ and $Z_i^\intercal(\beta_1+\beta_2)$, where $Z_i$ usually contains $X_i$ but also includes the treatment indicator as well as any post-treatment measurements. Given that there is two subgroups characterized by the baseline covariates (which makes the test problem regular), they tried to test whether the two groups respond to the treatment differently, that is testing the component of $\beta_2$ which corresponds to the treatment indicator.

Nice work to demonstrate how to come up with a statistical model to study some interesting and practical problems!

But the above part has nothing to do with the title, EM algorithm. Actually you could imagine that they will use EM as a basic tool to study the above mixture model. That’s why I came back to revisit this great idea in Statistics.

Given complete random vector $Y=(X, Z)$ with $X$ observed and $Z$ unobserved, we have the likelihood function $p(y;\theta)$. Then the log marginal likelihood has the following property:

$\log p(x;\theta)=\log \int p(x,z;\theta) dz=\log \int \frac{p(x,z;\theta)}{f(z)}f(z)dz$

$\geq\int \log\Big\{\frac{p(x,z;\theta)}{f(z)}\Big\}f(z)dz,$

where the last inequality is from Jensen’s inequality, and $f(\cdot)$ is any density function put on $Z$. In order to make the bound tight, i.e. to make the above inequality as equality, one possible way is $f(z)\propto p(x,z;\theta)$, which leads to

$f(z)=p(x,z;\theta)/\int p(x,z;\theta) dx=p(z|x;\theta).$

Then we have

$\hat\theta=\arg\max\log p(x;\theta)=\arg\max E_f\Big(\log\{p(x,z;\theta)/f(z)\} \Big)$

$=\arg\max E_f\Big(\log p(x,z;\theta)\Big).$

In summary, we have the following EM procedure:

1. E step: get the conditional distribution $f(z)=p(z|x;\theta)$;
2. M step: $\hat\theta=\arg\max E_f\Big(\log p(x,z;\theta)\Big)$

And the corresponding EM algorithm can be described as the following iterative procedure:

1. E step: get the conditional distribution $f^{(k)}(z)=p(z|x;\hat\theta^{(k)})$;
2. M step: $\hat\theta^{(k+1)}=\arg\max E_{f^{(k)}}\Big(\log p(x,z;\theta)\Big)$

In order to make this procedure effective, in the M step, the condition expectation should be easy to calculate. In fact, usually, since the expectation will be taken under the current $\hat\theta^{(k)}$, which will not produce any new $\theta$, we usually get $\hat\theta(x,z)=\arg\max\log p(x,z;\theta)$ first and then by plugging we have $\hat\theta=\hat\theta\big(x,E_{f^{(k)}}(z)\big)$.

And this procedure guarantees the following to make sure the convergence

$\ell(\hat\theta^{(k+1)})\geq \int \log\Big\{\frac{p(x,z;\hat\theta^{(k+1)})}{f^{(k)}(z)}\Big\}f^{(k)}(z)dz$

$\geq \int \log\Big\{\frac{p(x,z;\hat\theta^{(k)})}{f^{(k)}(z)}\Big\}f^{(k)}(z)dz =\ell(\hat\theta^{(k)}).$

In summary, EM algorithm is useful when the marginal problem $\arg\max\log p(x;\theta)$ is difficult while the joint problem $\arg\max\log p(x, z;\theta)$ is easy. However $Z$ is unobservable, and the EM algorithm attempts to maximize $\log p(x, z;\theta)$ iteratively, by replacing it with its conditional expectation given the observed data. This expectation is computed with respect to the distribution of the complete data evaluated at the current estimate of $\theta$.

In the talk given by Professor Xuming He, he mentioned a rule of thumb from practice experience that the EM algorithm produces a good enough estimator in the first few steps.

The core idea of Empirical Likelihood (EL) is to use a maximum entropy discrete distribution supported on the data points and constrained by estimating equations related with the parameters of interest. As such, it is a non-parametric approach in the sense that the distribution of the data does not need to be specified, only some of its characteristics usually via moments. In short, it’s a non-parametric likelihood, which is fundamental for the likelihood-based statistical methodology.

Bayesian Analysis is a very popular and useful method in applications. As we discussed in the last post, it’s essentially an belief updating procedure through data, which is very natural in modeling. Last time, I said I did not get why there is a severe debate between Frequentist and Bayesian. Yesterday, I had a nice talk with Professor Xuming He from University of Michigan. When we talked about the Bayesian analysis, he made a nice point that in Frequentist analysis, the model mis-specification can be addressed in a very rigorous way to conduct valid statistical inference; while in Bayesian analysis, it is very sensitive to the likelihood as well as the prior, but how to do the adjustment is a big problem (here is a paper discussing model misspecification problems under Bayesian framework).

Before the discussion with Dr. Xuming He, intuitively, I thought it’s very natural and potentially very useful to combine empirical likelihood with Bayesian analysis by regarding empirical likelihood as the likelihood used in the Bayesian framework. But now I got to understand the importance of why Professor Nicole Lazar from University of Georgia had a paper on “Bayesian Empirical Likelihood” to discuss the validity of posterior inference: “…can likelihoods other than the density from which the data are assumed to be generated be used as the likelihood portion in a Bayesian analysis?” And the paper concluded that “…while they indicate that it is feasible to consider a Bayesian inferential procedure based on replacing the data likelihood with empirical likelihood, the validity of the posterior inference needs to be established for each case individually.”

But Professor Xuming He made a nice comment that Bayesian framework can be used to avoid the calculation of maximum empirical likelihood estimator by proving of the asymptotically normal posterior distribution with mean around the maximum empirical likelihood estimator. The original idea of their AOS paper was indeed to use the computational advantage from Bayesian side to solve the optimization difficulty in getting maximum empirical likelihood estimator. This reminded me of another paper about “Approximate Bayesian Computation (ABC) via Empirical Likelihood“, which used empirical likelihood to get improvement in the approximation at an overall computing cost that is negligible against ABC.

We know that for general Bayesian analysis, the goal is to be able to simulate from the posterior distribution by using MCMC for example. But in order to use MCMC to simulate from the posterior distribution, we need to be able to evaluate the likelihood. But sometimes, it’s hard to evaluate the likelihood due to the complexity of the model. For example, recently laundry socks problem is a hit online. Since it’s not that trivial to figure out the the likelihood of the process although we have a simple generative model from which we can easily simulate samples, Professor Rasmus Bååth presented a Bayesian analysis by using ABC. Later Professor Christian Robert presented exact ptobability calculations pointing out that Feller had posed a similar problem. And here is another post from Professor Saunak Sen. The basic idea for ABC approximation is to accept values provided the simulated sample is sufficiently close to the observed data point:

1. Simulate $\theta\sim\pi(\cdot)$, where $\pi$ is the prior;
2. Simulate $x$ from the generative model;
3. If $\|x-x^*\|$ is small, keep $\theta$, where $x^*$ is observed data point. Otherwise reject.

Now how to use empirical likelihood to help ABC? Actually although the original motivation is the same, that is to approximate the likelihood (ABC approximate the likelihood via simulation and empirical likelihood version of ABC is to use empirical likelihood to approximate the true likelihood), it’s more natural to start from the original Bayesian computation (this is also why Professor Christian Robert changed the title of their paper). For the posterior sample, we can generate as the following from the importance sampling perspective:

1. Simulate $\theta\sim\pi(\cdot)$, where $\pi$ is the prior;
2. Get the corresponding importance weight as $w=f(\theta|x)$ where $f(\theta|x)$ is the likelihood.

Now if we do not know the likelihood, we can do the following:

1. Simulate $\theta\sim\pi(\cdot)$, where $\pi$ is the prior;
2. Get the corresponding importance weight as $w=EL(\theta|x)$ where $EL(\theta|x)$ is the empirical likelihood.

This is the way of doing Bayesian computation via empirical likelihood.

The main difference between Bayesian computation via empirical likelihood and Empirical likelihood Bayesian is that the first one use empirical likelihood to approximate the likelihood in the Bayesian computation and followed by Bayesian inference, while the second one is that use Bayesian computation to overcome the optimization difficulty and followed by studying of the frequentist property.

updated[4/28/2015]: Here is a nice post talking about the issue for the Bayesian, especially for the model misspecification.

Last night, I had a discussion about the integrative data analysis (closely related with the discussion of AOAS 2014 paper from Dr Xihong Lin’s group and JASA 2014 paper from Dr. Hongzhe Li’s group) with my friend. If some biologist gave you the genetic variants (e.g. SNP) data and the phenotype (e.g. some trait) data, you were asked to do the association analysis to identify the genetic variants which is significantly associated with the trait. One year later, the biologist got some additional data such as gene expression data which are related with the two data sets given before, and you are now asked to calibrate your analysis to detect the association more efficiently and powerfully by integrating the three data sources. In this data rich age, it’s quite natural to get into this situation in practice. The question is how to come up with a natural and useful statistical framework to deal with such data integration.

For simplicity, we considered the problem that if you are first given two random variables, $X, Y$ to study the association between them. Later on you are given another random variable $Z$ to help to detect the significance association between $X$ and $Y$. We assume the following true model:

$Y=\beta X+\epsilon,$

where $X$ is independent with $\epsilon$. Now the question is what is the characteristic for $Z$ to be helpful to raise the power for the detection.

• What if $X$ and $Z$ are uncorrelated? If they are uncorrelated, then what if $Y$ and $Z$ are uncorrelated?
• What if $X$ and $Z$ are correlated?

After thinking about these, you will find that for $Z$ to be useful, it’s ideal that $Z$ is uncorrelated with $X$ and is highly correlated with $Y$, i.e. highly correlated with the error term $\epsilon$ so that it can be used to explain more variation contained in $Y$ to reduce the noise level.

In order to see why, first notice that the problem exactly depends on how to understand the following multiple linear regression problem:

$Y=\alpha X+ \gamma Z+\varepsilon.$

Now from the multiple linear regression knowledge, we have

$\beta=\alpha+\gamma\times\delta$

where $Z=\delta X+\eta$ (see below for the proof). Thus in order to raise the signal to noise ratio, we hope that $\alpha=\beta$, i.e. $\gamma=0$ or $\delta=0$, which can keep the signal large. But in order to reduce the noise, we need $\gamma\neq 0$. In summary, we need to have $\delta=0$, which means that $X$ and $Z$ are uncorrelated, and $\gamma\neq 0$, which means that $Z$ can be used to explain some variability contained in the noise.

What is the difference between doing univariate regression one by one and doing multiple linear regression all at once?

Here is some hint: first we regress $Y$ and $Z$ both onto $X$,

$E(Y|X)=\alpha X+\gamma\delta X, E(Z|X)=\delta X.$

And then on one hand we find that $\beta=\alpha+\gamma\delta$, and on the other hand we regress the residual $Y-E(Y|X)=\gamma\eta+\varepsilon$ onto the residual $Z-E(Z|X)=\eta$ to get $\gamma$ via

$Y-E(Y|X)=\gamma [Z-E(Z|X)]+\varepsilon.$

This procedure actually is explaining what is the multiple linear regression and what is the meaning for the coefficients (think about the meaning of $\gamma$ from the above explanation).

p-value and Bayes are the two hottest words in Statistics. Actually I still can not get why the debate between frequentist  statistics and Bayesian statistics can last so long. What is the essence arguments behind it? (Any one can help me with this?) In my point of view, they are just two ways for solving practical problems. Frequentist people are using the random version of proof-by-contradiction argument (i.e. small p-value indicates less likeliness for the null hypothesis to be true), while Bayesian people are using learning argument  to update their believes through data. Besides, mathematician are using partial differential equations (PDE) to model the real underlying process for the analysis. These are just different methodologies for dealing with practical problems. What’s the point for the long-last debate between frequentist  statistics and Bayesian statistics then?

Although my current research area is mostly in frequentist statistics domain, I am becoming more and more Bayesian lover, since it’s so natural. When I was teaching introductory statistics courses for undergraduate students at Michigan State University, I divided the whole course into three parts: Exploratory Data Analysis (EDA) by using R software, Bayesian Reasoning and Frequentist Statistics. I found that at the end of the semester, the most impressive example in my students mind was the one from the second section (Bayesian Reasoning).  That is the Monty Hall problem,  which was mentioned in the article that just came out in the NYT. (Note that about the argument from Professor Andrew Gelman, please also check out the response from Professor Gelman.) “Mr. Hall, longtime host of the game show “Let’s Make a Deal,” hides a car behind one of three doors and a goat behind each of the other two. The contestant picks Door No. 1, but before opening it, Mr. Hall opens Door No. 2 to reveal a goat. Should the contestant stick with No. 1 or switch to No. 3, or does it matter?” And the Bayesian approach to this problem “would start with one-third odds that any given door hides the car, then update that knowledge with the new data: Door No. 2 had a goat. The odds that the contestant guessed right — that the car is behind No. 1 — remain one in three. Thus, the odds that she guessed wrong are two in three. And if she guessed wrong, the car must be behind Door No. 3. So she should indeed switch.” What a natural argument! Bayesian babies and Google untrained search for youtube cats (the methods of deep learning) are all excellent examples proving that Bayesian Statistics IS a remarkable way for solving problems.

What about the p-values? This random version of proof-by-contradiction argument is also a great way for solving problems from the fact that it have been helping solve so many problems from various scientific areas, especially in bio-world. Check out today’s post from Simply Statistics: “You think P-values are bad? I say show me the data,” and also the early one: On the scalability of statistical procedures: why the p-value bashers just don’t get it.

The classical p-value does exactly what it says. But it is a statement about what would happen if there were no true effect. That can’t tell you about your long-term probability of making a fool of yourself, simply because sometimes there really is an effect. You make a fool of yourself if you declare that you have discovered something, when all you are observing is random chance. From this point of view, what matters is the probability that, when you find that a result is “statistically significant”, there is actually a real effect. If you find a “significant” result when there is nothing but chance at play, your result is a false positive, and the chance of getting a false positive is often alarmingly high. This probability will be called “false discovery rate” (or error rate), which is different with the concept in the multiple comparison. One possible misinterpretation of p-value is regarding p-value as the false discovery rate, which may be much higher than p-value. Think about the Bayes formula and the tree diagram you learned in introductory course to statistics to figure out the relationship between p-value and the “false discovery rate”.

I collected the following series on applying for faculty positions in 2011, when I was in my second year PhD. Now it’s my turn to apply for jobs. I will share the following useful materials with all you who want to apply for jobs this year.

My academic homepage just has been launched. Welcome to visit: Honglang Wang’s Homepage.