The first colloquium speaker at this semester, professor Wei Zheng from IUPUI, will give a talk on “Universally optimal designs for two interference models“. In this data explosive age, people are easy to get big data set, which renders people difficult to make inferences from such massive data. Since people usually think that with more data, they have more chance to get more useful information from them, lots of researchers are struggling to achieve methodological advancements under this setup. This is a very challenging research area and of course very important, which in my opinion needs the resurgence of mathematical statistics by borrowing great ideas from various mathematical fields. However, another great and classical statistical research area should come back again to help statistical inference procedures from the beginning stage of data analysis, collecting data by design of experiments so that we can control the data quality, usefulness and size. Thus it’s necessary for us to know what is optimal design of experiments. Here is an introduction to this interesting topic.

In statistics, we have to organize an experiment in order to gain some information about an object of interest. Fragments of this information can be obtained by making observations within some elementary experiments called trials. The set of all trials which can be incorporated in a prepared experiment will be denoted by $\mathcal{X}$ , which we shall call the design space. The problem to be solved in experimental design is how to choose, say $N$ trials $x_i\in\mathcal{X} , i = 1, \cdots, N$, called the support points of the design, or eventually how to choose the size $N$ of the design, to gather enough information about the object of interest. Optimum experimental design corresponds to the maximization, in some sense, of this information. In specific, the optimality of a design depends on the statistical model and is assessed with respect to a statistical criterion, which is related to the variance-matrix of the estimator. Specifying an appropriate model and specifying a suitable criterion function both require understanding of statistical theory and practical knowledge with designing experiments.

We shall restrict our attention to the parametric situation in the case of a regression model, the mean response function is then parameterized as

$E(Y)=\eta(x, \theta)$

specifying for a particular $x\in\mathcal{X}$ with unknown parameter $\theta\in{R}^p$.

A design is specified by an initially arbitrary measure $\xi(\cdot)$ assigning $n$ design points to estimate the parameter vector. Here $\xi$ can be written as

$\xi=\Big\{(x_1,w_1), (x_2,w_2), \cdots, (x_n, w_n)\Big\}$

where the $n$ design support points $x_1, x_2, \cdots, x_n$ are elements of the design space $\mathcal{X}$, and the associated weights $w_1, w_2, \cdots, w_n$ are nonnegative real numbers which sum to one. We make the usual second moment error assumptions leading to the use of least squares estimates. Then the corresponding Fisher information matrix associated with $\theta$ is given by

$M=M(\xi,\theta)=\sum_{i=1}^nw_i\frac{\partial\eta(x_i)}{\partial\theta}\frac{\partial\eta(x_i)}{\partial\theta^\intercal}=V^\intercal\Omega V$

where $V=\partial\eta/\partial\theta$ and $\Omega=diag\{w_1, w_2, \cdots, w_n\}$.

Now we have to propose the statistical criteria for the optimum. It is known that the least squares estimator minimizes the variance of mean-unbiased estimators (under the conditions of the Gauss–Markov theorem). In the estimation theory for statistical models with one real parameter, the reciprocal of the variance of an (“efficient”) estimator is called the “Fisher information” for that estimator. Because of this reciprocity, minimizing the variance corresponds to maximizing the information. When the statistical model has several parameters, however, the mean of the parameter-estimator is a vector and its variance is a matrix. The inverse matrix of the variance-matrix is called the “information matrix”. Because the variance of the estimator of a parameter vector is a matrix, the problem of “minimizing the variance” is complicated. Using statistical theory, statisticians compress the information-matrix using real-valued summary statistics; being real-valued functions, these “information criteria” can be maximized. The traditional optimality-criteria are invariants of the information matrix; algebraically, the traditional optimality-criteria are functionals of the eigenvalues of the information matrix.

• A-optimality (“average” or trace)
• One criterion is A-optimality, which seeks to minimize the trace of the inverse of the information matrix. This criterion results in minimizing the average variance of the estimates of the regression coefficients.
• D-optimality (determinant)
• A popular criterion is D-optimality, which seeks to maximize the determinant of the information matrix of the design. This criterion results in maximizing the differential Shannon information content of the parameter estimates.
• E-optimality (eigenvalue)
• Another design is E-optimality, which maximizes the minimum eigenvalue of the information matrix.
• T-optimality
• This criterion maximizes the trace of the information matrix.

Other optimality-criteria are concerned with the variance of predictions:

• G-optimality
• A popular criterion is G-optimality, which seeks to minimize the maximum entry in the diagonal of the hat matrix. This has the effect of minimizing the maximum variance of the predicted values.
• I-optimality (integrated)
• A second criterion on prediction variance is I-optimality, which seeks to minimize the average prediction variance over the design space.
• V-optimality (variance)
• A third criterion on prediction variance is V-optimality, which seeks to minimize the average prediction variance over a set of m specific points.

Now back to our example, because the asymptotic covariance matrix associated with the LSE of $\theta$ is proportional to $M^{-1}$, the most popular regression design criterion is D-optimality, where designs are sought to minimize the determinant of $M^{-1}$. And the standardized predicted variance function, corresponding to the G-optimality, is

$d(x,\xi,\theta)=V^\intercal(x)M^{-1}(\xi,\theta)V(x)$

and G-optimality seeks to minimize $\delta(\xi,\theta)=\max_{x\in\mathcal{X}}d(x,\xi,\theta)$.

A central result in the theory of optimal design, the General Equivalence Theorem, asserts that the design $\xi^*$ that is D-optimal is also G-optimal and that

$\delta(\xi^*,\theta)=p$

the number of parameters.

Now the optimal design for an interference model, professor Wei Zheng will talk about, considers the following model in the block designs with neighbor effects:

$y_{i,j}=\mu+\tau_{d(i,j)}+\lambda_{d(i,j-1)}+\rho_{d(i,j+1)}+\beta_i+e_{i,j}$

where $d(i,j)\in{1, 2, \cdots, t}$ is the treatment assigned to the plot $(i,j)$ in the $j$-th position of the $i$-th block, and

1. $\mu$ is the general mean;
2. $\tau_{d(i,j)}$ is the direct effect of treatment $d(i,j)$;
3. $\lambda_{d(i,j-1)}$ and $\rho_{d(i,j+1)}$ are respectively the left and right neighbor effects; that’s the interference effect of the treatment assigned to, respectively, the left and right neighbor plots $(i,j-1)$ and $(i,j+1)$.
4. $\beta_i$ is the effect of the $i$-th block; and
5. $e_{i,j}$ is the random error, $1\leq i\leq b, 1\leq j\leq k$.

We seed the optimal design among designs $d\in\Omega_{t,b,k}$, the set of all designs with $b$ blocks of size $k$ and with $t$ treatments.

I am not going into the details of the derivation of the optimal design for the above interference model. I just sketch the outline here. First of all we can write down the information matrix for the direct treatment effect $\tau=(\tau_1,\tau_2,\cdots, \tau_t)^\intercal$, say $C_d$. Let $S$ be the set of all possible $t^k$ block sequences with replacement, which is the design space. Then we try to find the optimal measure $\xi$ among the set $P=\{p_s, s\in S, \sum_sp_s=1, p_s\geq 0\}$ to maximize $\Phi(C_{\xi})$ for a given function $\Phi$ satisfying the following three conditions:

1. $\Phi$ is concave;
2. $\Phi(M^\intercal CM)=\Phi(C)$ for any permutation matrix $M$;
3. $\Phi(bC)$ is nondecreasing in the scalar $b>0$.

A measure $\xi$ which achieves the maximum of $\Phi(C_{\xi})$ among $P$ for any $\Phi$ satisfying the above three conditions is said to be universally optimal. Such measure is optimal under criteria of A, D, E, T, etc. Thus we could imagine that all of the analysis is just linear algebra.

I am graduating as a fifth year PhD student and I really agree with Professor David Karger from MIT about the qualities characterizing a great PhD student, especially about the point on “discipline and productivity”. Professor Karger also distinguished the difference between a successful PhD for industry and a successful PhD for academic. Here I just cite the whole article to share with you as well as to keep these principles in my own mind:

As a CS prof at MIT, I have had the privilege of working with some of the very best PhD students anywhere.  But even here there are some PhDs that clearly stand out as *great*.   I’m going to give two answers, depending on your interpretation of “great”.

For my first answer I’d select four indispensable qualities:
0. intelligence
1. curiosity
2. creativity
3. discipline and productivity
(interestingly, I’d say the same four qualities characterize great artists).

In the “nice to have but not essential” category, I would add
4. ability to teach/communicate with an audience
5. ability to communicate with peers

The primary purpose of PhD work is to advance human knowledge.  Since you’re working at the edge of what we know, the material you’re working with is hard—you have to be smart enough to master it (intelligence).  This is what qualifying exams are about.   But you only need to be smart *enough*—I’ve met a few spectacularly brilliant PhD students, and plenty of others who were just smart enough.  This didn’t really make a difference in the quality of their PhDs (though it does effect their choice of area—more of the truly brilliant go into the theoretical areas).

These two qualities are critical for a great PhD, but also lead to one of the most common failure modes: students who love asking questions and thinking about cool ways to answer them, but never actually *do* the work necessary to try out the answer.  Instead, they flutter off to the next cool idea.  So this is where discipline comes in: you need to be willing to bang your head against the wall for months (theoretician) or spend months hacking code (practitioner), in order to flesh out your creative idea and validate it.  You need a long-term view that reminds you why you are doing this even when the fun parts (brainstorming and curiosity-satisfying) aren’t happening.

Communication skills are really valuable but sometimes dispensable.  Your work can have a lot more impact if you are able to spread it to others who can incorporate it in their work.  And many times you can achieve more by collaborating with others who bring different skills and insights to a problem.  On the other hand, some of the greatest work (especially theoretical work) has been done by lone figures locked in their offices who publish obscure hard to read papers; when that work is great enough, it eventually spreads into the community even if the originator isn’t trying to make it do so.

My second answer is more cynical.  If you think about it, someone coming to do a PhD is entering an environment filled with people who excel at items 0-5 in my list.  And most of those items are talents that faculty can continue to exercise as faculty, because really curiosity, creativity, and communication don’t take that much time to do well.  The one place where faculty really need help is on productivity: they’re trying to advance a huge number of projects simultaneously and really don’t have the cycles to carry out the necessary work.   So another way to characterize what makes a great PhD student is

0. intelligence
1. discipline and productivity

If you are off the scale in your productivity (producing code, running interviews, or working at a lab bench) and smart enough to understand the work you get asked to do, then you can be the extra pair of productive hands that the faculty member desperately needs.  Your advisor can generate questions and creative ways to answer them, and you can execute.  After a few years of this, they’ll thank you with a PhD.

If all you want is the PhD, this second approach is a fine one.  But you should recognize that in this case that advisor is *not* going to write a recommendation letter that will get you a faculty position (though they’ll be happy to praise you to Google).  There’s only 1 way to be a successful *faculty member*, and that’s my first answer above.

Update: Here is another article from Professors Mark Dredze (Johns Hopkins University) and Hanna M. Wallach (University of Massachusetts Amherst).

There has been a Machine Learning (ML) reading list of books in hacker news for a while, where Professor Michael I. Jordan recommend some books to start on ML for people who are going to devote many decades of their lives to the field, and who want to get to the research frontier fairly quickly. Recently he articulated the relationship between CS and Stats amazingly well in his recent reddit AMA, in which he also added some books that dig still further into foundational topics. I just list them here for people’s convenience and my own reference.

• Frequentist Statistics
1. Casella, G. and Berger, R.L. (2001). “Statistical Inference” Duxbury Press.—Intermediate-level statistics book.
2. Ferguson, T. (1996). “A Course in Large Sample Theory” Chapman & Hall/CRC.—For a slightly more advanced book that’s quite clear on mathematical techniques.
3. Lehmann, E. (2004). “Elements of Large-Sample Theory” Springer.—About asymptotics which is a good starting place.
4. Vaart, A.W. van der (1998). “Asymptotic Statistics” Cambridge.—A book that shows how many ideas in inference (M estimation, the bootstrap, semiparametrics, etc) repose on top of empirical process theory.
5. Tsybakov, Alexandre B. (2008) “Introduction to Nonparametric Estimation” Springer.—Tools for obtaining lower bounds on estimators.
6. B. Efron (2010) “Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction” Cambridge,.—A thought-provoking book.
• Bayesian Statistics
1. Gelman, A. et al. (2003). “Bayesian Data Analysis” Chapman & Hall/CRC.—About Bayesian.
2. Robert, C. and Casella, G. (2005). “Monte Carlo Statistical Methods” Springer.—about Bayesian computation.
• Probability Theory
1. Grimmett, G. and Stirzaker, D. (2001). “Probability and Random Processes” Oxford.—Intermediate-level probability book.
2. Pollard, D. (2001). “A User’s Guide to Measure Theoretic Probability” Cambridge.—More advanced level probability book.
3. Durrett, R. (2005). “Probability: Theory and Examples” Duxbury.—Standard advanced probability book.
• Optimization
1. Bertsimas, D. and Tsitsiklis, J. (1997). “Introduction to Linear Optimization” Athena.—A good starting book on linear optimization that will prepare you for convex optimization.
2. Boyd, S. and Vandenberghe, L. (2004). “Convex Optimization” Cambridge.
3. Y. Nesterov and Iu E. Nesterov (2003). “Introductory Lectures on Convex Optimization” Springer.—A start to understand lower bounds in optimization.
• Linear Algebra
1. Golub, G., and Van Loan, C. (1996). “Matrix Computations” Johns Hopkins.—Getting a full understanding of algorithmic linear algebra is also important.
• Information Theory
1. Cover, T. and Thomas, J. “Elements of Information Theory” Wiley.—Classic information theory.
• Functional Analysis
1. Kreyszig, E. (1989). “Introductory Functional Analysis with Applications” Wiley.—Functional analysis is essentially linear algebra in infinite dimensions, and it’s necessary for kernel methods, for nonparametric Bayesian methods, and for various other topics.

Remarks from Professor Jordan: “not only do I think that you should eventually read all of these books (or some similar list that reflects your own view of foundations), but I think that you should read all of them three times—the first time you barely understand, the second time you start to get it, and the third time it all seems obvious.”

In mathematics, a general principle for studying an object is always from the study of the object itself to the study of the relationship between objects. In functional data analysis, the most important part for studying of the object itself, i.e. one functional data set, is functional principal component analysis (FPCA). And for the study of the relationship between two functional data sets, one popular way is various types of regression analysis. For this post, I only focus on the FPCA. The central idea of FPCA is dimension reduction by means of a spectral decomposition of the covariance operator, which yields functional principal components as coefficient vectors to represent the random curves in the sample.

First of all, let’s define what’s the FPCA. Suppose we observe functions $X_1(\cdot), X_2(\cdot), \cdots, X_n(\cdot)$. We want to find an orthonormal basis $\phi_1(\cdot), \cdots, \phi_K(\cdot)$ such that

$\sum_{i=1}^n\|X_i-\sum_{k=1}^K\phi_k\|^2$

is minimized. Once such a basis is found, we can replace each curve $X_i$ by $\sum_{k=1}^K\phi_k$ to a good approximation. This means instead of working with infinitely dimensional curves $X_i$, we can work with $K-$dimensional vectors $(, \cdots, )^\intercal$. And the functions $\phi_k$ are called collectively optimal empirical orthonormal basis, or empirical functional principal components. Note that once we got the functional principal components, we can get the so called FPC scores $$ to approximate the curves.

For FPCA, we usually adopt the so called “smooth-first-then-estimate” approach, namely,we first pre-process the discrete observations to get smoothed functional data by smoothing and then use the empirical estimators of the mean and covariance based on the smoothed functional data to conduct FPCA.

For the smoothing step, we have to consider individual by individual. For each realization, we can use basis expansion (Polynomial basis is unstable; Fourier basis is suitable for periodic functions; B-spline basis is flexible and useful), smoothing penalties (which lead to smoothing splines by the Smoothing Spline Theorem), as well as local polynomial smoothing:

• Basis expansion: by assuming one realization of the underlying true process $X_i(t)=\sum_{k=1}^Kc_{ik}B_k(t)$, where $\{B_k(\cdot), k=1,2,\cdots, K\}$ are the basis functions, we have the estimation

$min_{\{c_{ik}\}_{k=1}^K}\sum_{j=1}^{m_i}\big(Y_{ij}-\sum_{k=1}^Kc_{ik}B_k(t_{ij})\big)^2.$

• Smoothing penalties: $min\sum_{j=1}^{m_i}\big(Y_{ij}-X_i(t_{ij})\big)^2+\lambda J(X_i(\cdot))$, where $J(\cdot)$ is a measure for the roughness of functions.
• Local linear smoothing: assume at point $t$, we have $X_i(t)\approx X_i(t_0)+X_i'(t_0)(t-t_0)$, then we have $X_i(t_0)$ estimated by the following $\hat{a}$

$min_{a,b}\sum_{j=1}^{m_i}\big(Y_{ij}-a-b(t_{ij}-t_0)\big)^2K_h(t_{ij}-t_0).$

Once we have the smoothed functional data, denoted as $X_i(t), i=1,2,\cdots,n$, we can have the empirical estimator of the mean and covariance as

$\bar{X}(t)=\frac{1}{n}\sum_{i=1}^nX_i(t), \hat{C}(s,t)=\frac{1}{n}\sum_{i=1}^n\big(X_i(t)-\bar{X}(t)\big)\big(X_i(s)-\bar{X}(s)\big).$

And then we can have the empirical functional principal components as the eigenfunctions of the above sample covariance operator $\hat{C}$ (for the proof, refer to the book “Inference for Functional Data with Applications” page 39). Note that the above estimation procedure of the mean function and the covariance function need to have more dense functional data, since otherwise the smoothing step will be not stable. Thus people are also proposing some other estimators of mean function and covariance function, such as the local linear estimator for the mean function and the covariance function proposed by Professor Yehua Li from ISU, which has an advantage that they can cover all types of functional data, sparse (i.e. longitudinal), dense, or in-between. Now the problem is that how to conduct FPCA based on $\hat{C}(s,t)$ in practice. Actually it’s  the following classic mathematical problem:

$\int \hat{C}(s,t)\xi(s)ds=\lambda\xi(t), i.e. \hat{C}\xi=\lambda\xi,$

where $\hat{C}$ is the integral operator with a symmetric kernel $\hat{C}(s,t)$. This is a well-studied problem in computing the eigenvalues and eigenfunctions of an integral operator with a symmetric kernel in applied mathematics. So people can refer to those numerical methods to solve the above problem.

However, two common methods used in Statistics are described in Section 8.4 in the fundamental functional data analysis book written by Professors J. O. Ramsay and  B. W. Silverman. One is the discretizing method and the other is the basis function method. For the discretizing method, essentially, we just discretize the smoothed functions $X_i(t)$ to a fine grid of $N$ equally spaced values that span the interval, and then use the traditional PCA, followed by some interpolation method for other points not belonging to be selected grid points. Now for the basis function method, we illustrate it by assuming the mean function equal to 0:

1. Basis expansion: $X_i(t)=\sum_{k=1}^Ka_{ik}\phi_k(t)$, and then $X=(X_1,X_2,\cdots, X_n)^\intercal=A\phi$, where $A=((a_{ik}))\in{R}^{n\times K}$ and $\phi=(\phi_1,\phi_2,\cdots,\phi_K)^\intercal\in{R}^{K}$;
2. Covariance function: $\hat{C}(s,t)=\frac{1}{n}X^\intercal(s) X(t)=\frac{1}{n}\phi(s)^\intercal A^\intercal A\phi(t)$;
3. Eigenfunction expansion: assume the eigenfunction $\xi(t)=\sum_{k=1}^Kb_k\phi_k(t)$;
4. Problem Simplification: The above basis expansions lead to

$\hat{C}\xi=\int\frac{1}{n}\phi(s)^\intercal A^\intercal A\phi(t)\xi(t)dt= \frac{1}{n}\phi(s)^\intercal A^\intercal A\int\phi(t)\xi(t)dt$

$=\frac{1}{n}\phi(s)^\intercal A^\intercal A\sum_{k=1}^Kb_k\int\phi(t)\phi_k(t)dt=\frac{1}{n}\phi(s)^\intercal A^\intercal AWb$,

where $W=\int\phi(t)\phi^\intercal(t)dt\in{R}^{K\times K}$ and $b=(b_1, b_2, \cdots, b_K)^\intercal\in{R}^{K}$. Hence the eigen problem boils down to $\frac{1}{n}\phi(s)^\intercal A^\intercal AWb=\lambda\phi(s)^\intercal b, \forall s$, which is equivalent to

$\frac{1}{n}A^\intercal AWb=\lambda b.$

Note that the assumptions for the eigenfunctions to be orthonormal are equivalent to $b_i^\intercal W b_i=1, b_i^\intercal W b_j=0, i\neq j$. Let $u=W^{1/2}b$, and then we have the above problem as

$n^{-1}W^{1/2}A^\intercal AW^{1/2}W^{1/2}b=\lambda W^{1/2}b, i.e. n^{-1}W^{1/2}A^\intercal AW^{1/2}u=\lambda u$

which is a traditional eigen problem for symmetric matrix $n^{-1}W^{1/2}A^\intercal AW^{1/2}$.

Two special cases deserve particular attention. One is orthonormal basis which leads to $W=I$. And the other is taking the smoothed functional data $X_i(t)$ as the basis function which leads to $A=I$.

Note that the empirical functional principal components are proved to be the eigenfunctions of the sample covariance operator. This fact connects the FPCA with the so called Karhunen-Lo$\grave{e}$ve expansion:

$X_i(t)=\mu(t)+\sum_{k}\xi_{ik}\phi_k(t), Cov(X_i(t), X_i(s))=\sum_k\lambda_k\phi_k(s)\phi_k(s)$

where $\xi_{ik}$ are uncorrelated random variables with mean 0 and variance $\lambda_k$ where $\sum_k\lambda_k<\infty, \lambda_1\geq\lambda_2\geq\cdots$. For simplicity we assume $\mu(t)=0$. Then we can easily see the connection between KL expansion and FPCA. $\{\phi_k(\cdot)\}_{k}$ is the series of orthonormal basis functions, and $\{=\xi_{ik}, i=1, 2, \cdots, n, k=1, 2, \cdots\}$ are those FPC scores.

So far, we only have discussed how to get the empirical functional principal components, i.e. eigenfunctions/orthonormal basis functions. But to represent the functional data, we have to get those coefficients, which are called FPC scores $\{=\xi_{ik}, i=1, 2, \cdots, n, k=1, 2, \cdots\}$. The simplest way is by numerical integration:

$\hat\xi_{ik}=\int X_i(t)\hat\phi_k(t)dt.$

Note that for the above estimation of the FPC scores via numerical integration, we first need the smoothed functional data $X_i(t)$. So if we only have sparsely observed functional data, this method will not provide reasonable approximations. Professor Fang Yao et al. proposed the so called PACE (principal component analysis through conditional expectation) to deal with such longitudinal data.

Degrees of freedom and information criteria are two fundamental concepts in statistical modeling, which are also taught in introductory statistics courses. But what are the exact abstract definitions for them which can be used to derive specific calculation formula in different situations.

I often use fit criteria like AIC and BIC to choose between models. I know that they try to balance good fit with parsimony, but beyond that I’m not sure what exactly they mean. What are they really doing? Which is better? What does it mean if they disagree?Signed, Adrift on the IC’s

Intuitively, the degrees of freedom of a fitting procedure reflects the effective number of parameters used by the fitting procedure. Thus to most applied statisticians, a fitting procedure’s degrees of freedom is synonymous with its model complexity, or its capacity for overfitting to data. Is this really true? Regularization aims to improve prediction performance by trading an increase in training error for better agreement between training and prediction errors, which is often captured through decreased degrees of freedom. Is this always the case? When does more regularization imply fewer degrees of freedom?

For the above two questions, I think the most important thing is based on the following what-type question:

What are AIC and BIC? What is degrees of freedom?

Akaike’s Information Criterion (AIC) estimates the relative Kullback-Leibler (KL) distance of the likelihood function specified by a fitted candidate model, from the unknown true likelihood function that generated the data:

$D_{KL}(L(y)||L_0(y))=\int L_0(y)\log \frac{L(y)}{L_0(y)}dy=E_0(l(y))-E_0(l_0(y))$

where $L(y)$ is the likelihood function specified by a fitted candidate model, $L_0(y)$ is the unknown true likelihood function, and the expectation $E_0$ is taking under the true model. Note that the fitted model closest to the truth in the KL sense would not necessarily be the model which best fits the observed sample since the observed sample can often be fit arbitrary well by making the model more and more complex. Since $E_0(l_0(y))$ will be the same for all models being considered, KL is minimized by choosing the model with highest $E_0(l(y))$, which can be estimated by an approximately unbiased estimator (up to a constant)

$l-tr(\hat{J}^{-1}\hat{K})$

where $\hat{J}$ is an estimator for the covariance matrix of the parameters based on the second derivatives matrix of $l$ in the parameters and $\hat{K}$ is an estimator based on the cross products of the first derivatives.  Akaike showed that $\hat{J}$ and $\hat{K}$ are asymptotically equal for the true model, so that $tr(\hat{J}^{-1}\hat{K})=tr(I)$, which is the number of parameters. This results in the usual definition for AIC

$AIC=-2l+2p.$

Schwarz’s Bayesian Information Criterion (BIC) is just comparing the posterior probability with the same prior and hence just comparing the likelihoods under different models:

$B=\frac{Pr(M_1|y)}{Pr(M_2|y)}=\frac{Pr(y|M_1)}{Pr(y|M_2)}$

which is just the Bayes factor. Schwarz showed that in many kinds of models $B$ can be roughly approximated by $\exp(l_1-\frac{1}{2}ln(n)p_1-l_2+\frac{1}{2}ln(n)p_2)$

which leads to the definition of BIC

$BIC=-2l+ln(n) p.$

In summary, AIC and BIC are both penalized-likelihood criteria. AIC is an estimate of a constant plus the relative distance between the unknown true likelihood function of the data and the fitted likelihood function of the model, so that a lower AIC means a model is considered to be closer to the truth. BIC is an estimate of a function of the posterior probability of a model being true, under a certain Bayesian setup, so that a lower BIC means that a model is considered to be more likely to be the true model. Both criteria are based on various assumptions and asymptotic approximations. Despite various subtle theoretical differences, their only difference in practice is the size of the penalty; BIC penalizes model complexity more heavily. The only way they should disagree is when AIC chooses a larger model than BIC. Thus, AIC always has a chance of choosing too big a model, regardless of n. BIC has very little chance of choosing too big a model if n is sufficient, but it has a larger chance than AIC, for any given n, of choosing too small a model.

The effective degrees of freedom for an arbitrary modelling approach is defined based on the concept of expected optimism:

$df(\mu, \sigma^2, FIT_{\lambda})=\frac{1}{2\sigma^2}\Big\{E(\|y^*-\hat{y}^{(FIT_{\lambda})}\|_2^2)-E(\|y-\hat{y}^{(FIT_{\lambda})}\|^2_2)\Big\}$

where $\sigma^2$ is the variance of the error term, $y^*$ is an independent copy of data vector $y$ with mean $\mu$, and $FIT_{\lambda}$ is a fitting procedure with tuning parameter $\lambda$. Note that the expected optimism is defined as $w:=\frac{1}{n}\Big\{E(\|y^*-\hat{y}^{(FIT_{\lambda})}\|_2^2)-E(\|y-\hat{y}^{(FIT_{\lambda})}\|^2_2)\Big\}$. And by the optimism theorem, we have that

$df(\mu, \sigma^2, FIT_{\lambda})=\frac{1}{\sigma^2}\sum_{i=1}^n cov(\hat\mu_i, y_i).$

Why does this definition make sense? In fact, under some regularity conditions, Stein proved that

$df=E(\sum_{i=1}^n\frac{\partial\hat\mu_i}{\partial y_i})$

which can be regarded as a sensitivity measure of the fitted values to the observations.

In the linear model, we know that (Mallows) the relationship between the expected prediction error (EPE) and the residual sum of squares (RSS)  follows

$EPE=E(RSS)+2\sigma^2 p,$

which leads to $df=p$.

Here are some references on this topic:

1. Dziak, John J., et al. “Sensitivity and specificity of information criteria.” The Methodology Center and Department of Statistics, Penn State, The Pennsylvania State University (2012).
2. Janson, Lucas, Will Fithian, and Trevor Hastie. “Effective degrees of freedom: A flawed metaphor.” arXiv preprint arXiv:1312.7851 (2013).

Recently some papers discussed in our journal club  are focused on integrative clustering of multiple omics data sets. I found that they are all originated from factor analysis and make use of the advantage of factor analysis over principal component analysis.

Let’s recall the model for factor analysis:

$X=\mu+LF+\epsilon,$

where $X\in{R}^p, \mu\in{R}^p, L\in{R}^{p\times r}, F\in{R}^r$ ($r) and $\epsilon\in{R}^p$, with mean $\mu$ and loading matrix $L$ fixed, and factors $F\sim\text{N}(0, I_r), \epsilon\sim\text{N}(0, \Psi)$ with $\Psi$ diagonal. And we also assume that $F$ and $\epsilon$ are uncorrelated. Note that this model is just characterizing the covariance structure of the Gaussian random vector $X\sim\text{N}(\mu, LL^\intercal+\Psi)$. Now we need to think about the roles of the loading matrix and factors. In fact, we can think about this model in the following way: if we are given such a random vector, then $X$ is just what you see of this random vector in one coordinate system under which the components of $X$ are correlated; but if you look at the random vector in another coordinate system which is a linear transformation ($L$) of the original coordinate system, then you will see $F$, whose components are uncorrelated. That is, the randomness in $X$ and $F$ is the same but with different views.  With the observed sample with sample size $n$,

$X_i=\mu+LF_i+\epsilon_i, i=1, 2, \cdots, n,$

we can use the EM algorithm to get the MLE of the parameters $\mu, L, \Psi$ (note that you will find maximizing the likelihood directly is hard).

Now for principal component analysis, for clarity, we are going to differentiate the classical (non-probabilistic) principal component analysis and probabilistic principal component analysis. The classical principal component analysis actually has no statistical model. And the probabilistic principal component model is defined as the above factor analysis model with $\Psi=\sigma^2 I_p$ and $L$ is orthonormal. And people can show that as $\sigma^2\to 0$, it becomes the classical principal component analysis. Actually we know that PCA maximizes data variance captured by the low dimensional projection, or equivalently minimizes the reconstruction error under the $L_2$ norm of the projected data points with the original data, namely

$min\|X-LZ^\intercal\|_F^2, \text{ subject to } L\in{R}^{p\times r} \text{ orthonormal}$

where $X\in{R}^{p\times n}$ here is the data matrix, and $Z\in{R}^{n\times r}$. And we know that solution to this problem is through SVD of the sample covariance: $\hat{L}$  contains the $r$ eigenvectors corresponding to the largest $r$ eigenvalues. And $\hat{Z}^\intercal=\hat{L}^\intercal X$ are the projected data points. From this analysis, we could see that the difference between factor analysis and the classical principal component analysis is that PCA treats covariance and variance identically, while factor analysis is trying to model to covariance and variance separately. In fact, the $r$ principal components are chosen to capture as much variance as possible, but the $r$ latent variables in a factor analysis model are chosen to explain as much covariance as possible. (Note that all the correlations amongst the variables must be explained by the common factors; if we assume joint normality the observed variables will be conditionally independent given $F$.)

In applications, we just deal with the data matrix, $X\in{R}^{n\times p}$. And the loadings of the first principal component, as a vector, denoted as $\phi_1=(\phi_{11},\phi_{12},\cdots,\phi_{1p})^\intercal$, is a normalized vector, i.e. $\|\phi_1\|_2=1$, which makes $X\phi_1\in{R}^{n}$ have the largest variance. And we call $Z_1=X\phi_1\in{R}^{n}$ as the first principal component score vector. In the common R package “prcomp” for the principal component analysis, the following command out=prcomp(X, scale=TRUE), can give us the $p\times r$ loading matrix by referring to out$rotation, and the $n\times r$ score matrix by referring to out$x.  That is the loading matrix is $\phi=(\phi_1,\phi_2,\cdots,\phi_r)\in{R}^{p\times r}$ and the score matrix is $Z=X\phi\in{R}^{n\times r}$, which is the $n$ realizations of the factors $F$.

Now think about what is the difference between the factor analysis and the probabilistic principal component analysis (PPCA). From the above definition, we see that the main difference is that factor analysis allow individual characteristics through the error term by $\epsilon\sim\text{N}(0, \Psi)$ instead of $\Psi=\sigma^2 I_p$. In this perspective, we have

$X=\mu+LF+\epsilon,$

with common structure $\mu+LF$ across all components of $X$ and individual characteristics $\epsilon_j\sim\text{N}(0, \psi_j)$. While PPCA does not allow any individual characteristics by assuming $\psi_j=\sigma^2$ for all $j$. This essential difference will make factor analysis more useful in integrative data analysis since it has more flexibility.

The AOAS 2013 paper is exactly using the above idea for modeling the integrative clustering:

$X_t=L_tF+\epsilon_t, t=1, 2, \cdots, T,$

where $X_t\in{R}^{p_t}$ with $T$ data sources. By stacking all the data sources together, we have

$X=LF+\epsilon.$

which is exactly a simple factor analysis. And this factor analysis model is more useful than PCA in this data integration setup just due to the allowing of individual characteristics for different data sources through $\epsilon$. And their paper is also dealing with sparsity in $L_t$.

The 2014 arXived paper is just generalizing the above paper by allowing another layer of individual characteristis:

$X_t=L_tF+W_tZ_t+\epsilon_t, t=1, 2, \cdots, T,$

But the problem for this one is how to do the estimation. Instead of using EM algorithm as used in the AOAS 2013 paper, they used the one as in the PCA by minimizing the reconstruction error.

On this Tuesday, Professor Xuming He presented their recent work on subgroup analysis, which is very interesting and useful in reality. Think about the following very much practical problem (since the drug is expensive or has certain amount of side effect):

If you are given the drug response, some baseline covariates which have nothing to do with the treatment, and the treatment indicator as well as some post-treatment measurements, how could you come up with a statistical model to tell whether there exist subgroups which respond to the treatment differently?

Think about 5 minutes and continue the following!

Dr He borrowed a very traditional model in Statistics, logistic-normal mixture model to study the above problem. The existence of the two subgroups is characterized by the observed baseline covariates, which have nothing to do with the treatment:

$P(\delta_i=1)=\pi(X_i^\intercal\gamma),$

where $\delta_i$ is the unobserved membership index. And the observed response follows a normal mixture model

$Y_i=Z_i^\intercal(\beta_1+\beta_2\delta_i)+\epsilon_i,$

with different means $Z_i^\intercal\beta_1$ and $Z_i^\intercal(\beta_1+\beta_2)$, where $Z_i$ usually contains $X_i$ but also includes the treatment indicator as well as any post-treatment measurements. Given that there is two subgroups characterized by the baseline covariates (which makes the test problem regular), they tried to test whether the two groups respond to the treatment differently, that is testing the component of $\beta_2$ which corresponds to the treatment indicator.

Nice work to demonstrate how to come up with a statistical model to study some interesting and practical problems!

But the above part has nothing to do with the title, EM algorithm. Actually you could imagine that they will use EM as a basic tool to study the above mixture model. That’s why I came back to revisit this great idea in Statistics.

Given complete random vector $Y=(X, Z)$ with $X$ observed and $Z$ unobserved, we have the likelihood function $p(y;\theta)$. Then the log marginal likelihood has the following property:

$\log p(x;\theta)=\log \int p(x,z;\theta) dz=\log \int \frac{p(x,z;\theta)}{f(z)}f(z)dz$

$\geq\int \log\Big\{\frac{p(x,z;\theta)}{f(z)}\Big\}f(z)dz,$

where the last inequality is from Jensen’s inequality, and $f(\cdot)$ is any density function put on $Z$. In order to make the bound tight, i.e. to make the above inequality as equality, one possible way is $f(z)\propto p(x,z;\theta)$, which leads to

$f(z)=p(x,z;\theta)/\int p(x,z;\theta) dx=p(z|x;\theta).$

Then we have

$\hat\theta=\arg\max\log p(x;\theta)=\arg\max E_f\Big(\log\{p(x,z;\theta)/f(z)\} \Big)$

$=\arg\max E_f\Big(\log p(x,z;\theta)\Big).$

In summary, we have the following EM procedure:

1. E step: get the conditional distribution $f(z)=p(z|x;\theta)$;
2. M step: $\hat\theta=\arg\max E_f\Big(\log p(x,z;\theta)\Big)$

And the corresponding EM algorithm can be described as the following iterative procedure:

1. E step: get the conditional distribution $f^{(k)}(z)=p(z|x;\hat\theta^{(k)})$;
2. M step: $\hat\theta^{(k+1)}=\arg\max E_{f^{(k)}}\Big(\log p(x,z;\theta)\Big)$

In order to make this procedure effective, in the M step, the condition expectation should be easy to calculate. In fact, usually, since the expectation will be taken under the current $\hat\theta^{(k)}$, which will not produce any new $\theta$, we usually get $\hat\theta(x,z)=\arg\max\log p(x,z;\theta)$ first and then by plugging we have $\hat\theta=\hat\theta\big(x,E_{f^{(k)}}(z)\big)$.

And this procedure guarantees the following to make sure the convergence

$\ell(\hat\theta^{(k+1)})\geq \int \log\Big\{\frac{p(x,z;\hat\theta^{(k+1)})}{f^{(k)}(z)}\Big\}f^{(k)}(z)dz$

$\geq \int \log\Big\{\frac{p(x,z;\hat\theta^{(k)})}{f^{(k)}(z)}\Big\}f^{(k)}(z)dz =\ell(\hat\theta^{(k)}).$

In summary, EM algorithm is useful when the marginal problem $\arg\max\log p(x;\theta)$ is difficult while the joint problem $\arg\max\log p(x, z;\theta)$ is easy. However $Z$ is unobservable, and the EM algorithm attempts to maximize $\log p(x, z;\theta)$ iteratively, by replacing it with its conditional expectation given the observed data. This expectation is computed with respect to the distribution of the complete data evaluated at the current estimate of $\theta$.

In the talk given by Professor Xuming He, he mentioned a rule of thumb from practice experience that the EM algorithm produces a good enough estimator in the first few steps.

The core idea of Empirical Likelihood (EL) is to use a maximum entropy discrete distribution supported on the data points and constrained by estimating equations related with the parameters of interest. As such, it is a non-parametric approach in the sense that the distribution of the data does not need to be specified, only some of its characteristics usually via moments. In short, it’s a non-parametric likelihood, which is fundamental for the likelihood-based statistical methodology.

Bayesian Analysis is a very popular and useful method in applications. As we discussed in the last post, it’s essentially an belief updating procedure through data, which is very natural in modeling. Last time, I said I did not get why there is a severe debate between Frequentist and Bayesian. Yesterday, I had a nice talk with Professor Xuming He from University of Michigan. When we talked about the Bayesian analysis, he made a nice point that in Frequentist analysis, the model mis-specification can be addressed in a very rigorous way to conduct valid statistical inference; while in Bayesian analysis, it is very sensitive to the likelihood as well as the prior, but how to do the adjustment is a big problem (here is a paper discussing model misspecification problems under Bayesian framework).

Before the discussion with Dr. Xuming He, intuitively, I thought it’s very natural and potentially very useful to combine empirical likelihood with Bayesian analysis by regarding empirical likelihood as the likelihood used in the Bayesian framework. But now I got to understand the importance of why Professor Nicole Lazar from University of Georgia had a paper on “Bayesian Empirical Likelihood” to discuss the validity of posterior inference: “…can likelihoods other than the density from which the data are assumed to be generated be used as the likelihood portion in a Bayesian analysis?” And the paper concluded that “…while they indicate that it is feasible to consider a Bayesian inferential procedure based on replacing the data likelihood with empirical likelihood, the validity of the posterior inference needs to be established for each case individually.”

But Professor Xuming He made a nice comment that Bayesian framework can be used to avoid the calculation of maximum empirical likelihood estimator by proving of the asymptotically normal posterior distribution with mean around the maximum empirical likelihood estimator. The original idea of their AOS paper was indeed to use the computational advantage from Bayesian side to solve the optimization difficulty in getting maximum empirical likelihood estimator. This reminded me of another paper about “Approximate Bayesian Computation (ABC) via Empirical Likelihood“, which used empirical likelihood to get improvement in the approximation at an overall computing cost that is negligible against ABC.

We know that for general Bayesian analysis, the goal is to be able to simulate from the posterior distribution by using MCMC for example. But in order to use MCMC to simulate from the posterior distribution, we need to be able to evaluate the likelihood. But sometimes, it’s hard to evaluate the likelihood due to the complexity of the model. For example, recently laundry socks problem is a hit online. Since it’s not that trivial to figure out the the likelihood of the process although we have a simple generative model from which we can easily simulate samples, Professor Rasmus Bååth presented a Bayesian analysis by using ABC. Later Professor Christian Robert presented exact ptobability calculations pointing out that Feller had posed a similar problem. And here is another post from Professor Saunak Sen. The basic idea for ABC approximation is to accept values provided the simulated sample is sufficiently close to the observed data point:

1. Simulate $\theta\sim\pi(\cdot)$, where $\pi$ is the prior;
2. Simulate $x$ from the generative model;
3. If $\|x-x^*\|$ is small, keep $\theta$, where $x^*$ is observed data point. Otherwise reject.

Now how to use empirical likelihood to help ABC? Actually although the original motivation is the same, that is to approximate the likelihood (ABC approximate the likelihood via simulation and empirical likelihood version of ABC is to use empirical likelihood to approximate the true likelihood), it’s more natural to start from the original Bayesian computation (this is also why Professor Christian Robert changed the title of their paper). For the posterior sample, we can generate as the following from the importance sampling perspective:

1. Simulate $\theta\sim\pi(\cdot)$, where $\pi$ is the prior;
2. Get the corresponding importance weight as $w=f(\theta|x)$ where $f(\theta|x)$ is the likelihood.

Now if we do not know the likelihood, we can do the following:

1. Simulate $\theta\sim\pi(\cdot)$, where $\pi$ is the prior;
2. Get the corresponding importance weight as $w=EL(\theta|x)$ where $EL(\theta|x)$ is the empirical likelihood.

This is the way of doing Bayesian computation via empirical likelihood.

The main difference between Bayesian computation via empirical likelihood and Empirical likelihood Bayesian is that the first one use empirical likelihood to approximate the likelihood in the Bayesian computation and followed by Bayesian inference, while the second one is that use Bayesian computation to overcome the optimization difficulty and followed by studying of the frequentist property.