This Tuesday, Professor Xuming He presented his recent work on subgroup analysis, which is both interesting and practically useful. Consider the following very practical problem (relevant when a drug is expensive or carries side effects):

Given the drug response, some baseline covariates that are unrelated to the treatment, the treatment indicator, and some post-treatment measurements, how would you build a statistical model to tell whether there exist subgroups that respond to the treatment differently?

Think about it for 5 minutes, then read on!

Dr. He borrowed a classical model in statistics, the logistic-normal mixture model, to study this problem. Membership in the two subgroups is characterized by the observed baseline covariates, which are unrelated to the treatment:

$P(\delta_i=1)=\pi(X_i^\intercal\gamma),$

where $\delta_i$ is the unobserved membership indicator. The observed response then follows a normal mixture model

$Y_i=Z_i^\intercal(\beta_1+\beta_2\delta_i)+\epsilon_i,$

with subgroup means $Z_i^\intercal\beta_1$ and $Z_i^\intercal(\beta_1+\beta_2)$, where $Z_i$ usually contains $X_i$ but also includes the treatment indicator and any post-treatment measurements. Given that the two subgroups are characterized by the baseline covariates (which makes the testing problem regular), they tested whether the two groups respond to the treatment differently, that is, they tested the component of $\beta_2$ corresponding to the treatment indicator.
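To make the model concrete, here is a minimal sketch of how one might simulate data from this logistic-normal mixture; the covariate layout and all parameter values are illustrative choices of mine, not taken from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Baseline covariates X_i (with intercept) and a treatment indicator;
# Z_i contains X_i plus the treatment, as in the model description.
x = np.column_stack([np.ones(n), rng.normal(size=n)])
treat = rng.integers(0, 2, size=n)
z = np.column_stack([x, treat])

gamma = np.array([-0.5, 1.0])       # mixing coefficients (illustrative)
beta1 = np.array([1.0, 0.5, 0.0])   # baseline mean effects
beta2 = np.array([0.0, 0.0, 2.0])   # subgroup-specific treatment effect

# P(delta_i = 1) = pi(X_i' gamma) with a logistic link
pi = 1.0 / (1.0 + np.exp(-x @ gamma))
delta = rng.random(n) < pi          # latent subgroup membership

# Y_i = Z_i'(beta1 + beta2 * delta_i) + eps_i
y = z @ beta1 + (z @ beta2) * delta + rng.normal(size=n)
```

Testing the treatment component of $\beta_2$ then amounts to asking whether the last coordinate of `beta2` is zero.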

This is nice work demonstrating how to build a statistical model to study an interesting and practical problem!

But the above has nothing to do with the title, the EM algorithm. As you might imagine, they use EM as the basic tool to fit the above mixture model. That is why I came back to revisit this great idea in statistics.

Given a complete random vector $Y=(X, Z)$ with $X$ observed and $Z$ unobserved, we have the complete-data likelihood $p(y;\theta)$. The log marginal likelihood then satisfies

$\log p(x;\theta)=\log \int p(x,z;\theta) dz=\log \int \frac{p(x,z;\theta)}{f(z)}f(z)dz$

$\geq\int \log\Big\{\frac{p(x,z;\theta)}{f(z)}\Big\}f(z)dz,$

where the last inequality follows from Jensen's inequality, and $f(\cdot)$ is any density placed on $Z$. To make the bound tight, i.e. to turn the inequality into an equality, one choice is $f(z)\propto p(x,z;\theta)$, which leads to

$f(z)=p(x,z;\theta)/\int p(x,z;\theta)\, dz=p(z|x;\theta).$

Then we have

$\hat\theta=\arg\max\log p(x;\theta)=\arg\max E_f\Big(\log\{p(x,z;\theta)/f(z)\} \Big)$

$=\arg\max E_f\Big(\log p(x,z;\theta)\Big).$

In summary, we have the following EM procedure:

1. E step: get the conditional distribution $f(z)=p(z|x;\theta)$;
2. M step: $\hat\theta=\arg\max E_f\Big(\log p(x,z;\theta)\Big)$

And the corresponding EM algorithm can be described as the following iterative procedure:

1. E step: get the conditional distribution $f^{(k)}(z)=p(z|x;\hat\theta^{(k)})$;
2. M step: $\hat\theta^{(k+1)}=\arg\max E_{f^{(k)}}\Big(\log p(x,z;\theta)\Big)$

For this procedure to be effective, the conditional expectation in the M step should be easy to compute. Since the expectation is taken under the current $\hat\theta^{(k)}$, it introduces no new $\theta$; in many models (for instance, when the complete-data log-likelihood is linear in $z$) we can first obtain $\hat\theta(x,z)=\arg\max\log p(x,z;\theta)$ and then plug in the conditional expectation to get $\hat\theta=\hat\theta\big(x,E_{f^{(k)}}(z)\big)$.
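The iterative procedure can be sketched for a simple case. The function below is a minimal illustration, assuming a two-component univariate Gaussian mixture with unit variance (both simplifying assumptions of mine, not restrictions of EM itself):

```python
import numpy as np

def em_gmm(x, n_iter=50):
    """EM for a two-component univariate Gaussian mixture with unit variance.

    A minimal sketch of the E/M iteration above; theta = (pi, mu[0], mu[1]).
    """
    # crude initialization from the data range
    pi, mu = 0.5, np.array([x.min(), x.max()])
    for _ in range(n_iter):
        # E step: posterior membership f^{(k)}(z) = p(z | x; theta^{(k)})
        d0 = np.exp(-0.5 * (x - mu[0]) ** 2) * (1 - pi)
        d1 = np.exp(-0.5 * (x - mu[1]) ** 2) * pi
        w = d1 / (d0 + d1)                 # P(z_i = 1 | x_i; theta^{(k)})
        # M step: maximize E_f log p(x, z; theta), available in closed form
        pi = w.mean()
        mu = np.array([np.sum((1 - w) * x) / np.sum(1 - w),
                       np.sum(w * x) / np.sum(w)])
    return pi, mu
```

Note that the M step here is exactly the plug-in shortcut described above: the complete-data log-likelihood is linear in the membership indicator, so the complete-data estimators are evaluated at the posterior weights `w`.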

This procedure guarantees the following ascent property, which ensures that the log-likelihood values converge:

$\ell(\hat\theta^{(k+1)})\geq \int \log\Big\{\frac{p(x,z;\hat\theta^{(k+1)})}{f^{(k)}(z)}\Big\}f^{(k)}(z)dz$

$\geq \int \log\Big\{\frac{p(x,z;\hat\theta^{(k)})}{f^{(k)}(z)}\Big\}f^{(k)}(z)dz =\ell(\hat\theta^{(k)}),$

where the final equality holds because the bound is tight at $f^{(k)}(z)=p(z|x;\hat\theta^{(k)})$, and the middle inequality holds because $\hat\theta^{(k+1)}$ maximizes the lower bound over $\theta$.
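The ascent property can also be checked numerically. A minimal sketch, again assuming a two-component unit-variance Gaussian mixture with illustrative data of my own choosing, tracks $\log p(x;\hat\theta^{(k)})$ across iterations:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

# EM for a two-component unit-variance Gaussian mixture, recording the
# observed-data log-likelihood at each iterate theta^{(k)}.
pi, mu = 0.5, np.array([x.min(), x.max()])
ll = []
for _ in range(20):
    d0 = np.exp(-0.5 * (x - mu[0]) ** 2) / np.sqrt(2 * np.pi) * (1 - pi)
    d1 = np.exp(-0.5 * (x - mu[1]) ** 2) / np.sqrt(2 * np.pi) * pi
    ll.append(np.sum(np.log(d0 + d1)))   # log p(x; theta^{(k)})
    w = d1 / (d0 + d1)                   # E step
    pi = w.mean()                        # M step
    mu = np.array([np.sum((1 - w) * x) / np.sum(1 - w),
                   np.sum(w * x) / np.sum(w)])

# each EM update can only increase log p(x; theta)
assert all(b >= a - 1e-9 for a, b in zip(ll, ll[1:]))
```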

In summary, the EM algorithm is useful when the marginal problem $\arg\max\log p(x;\theta)$ is difficult while the joint problem $\arg\max\log p(x, z;\theta)$ is easy. Since $Z$ is unobservable, the EM algorithm maximizes $\log p(x, z;\theta)$ iteratively, replacing it with its conditional expectation given the observed data, computed under the conditional distribution of the unobserved data at the current estimate of $\theta$.

In his talk, Professor Xuming He mentioned a rule of thumb from practical experience: the EM algorithm often produces a good enough estimator within the first few steps.