“A population consisting of an unknown number of distinct species is searched by selecting one member at a time. No a priori information is available concerning the probability that an object selected from this population will represent a particular species. Based on the information available after an n-stage search it is desired to predict the conditional probability that the next selection will represent a species not represented in the n-stage sample.”
Searcher: “I am contemplating extending my initial search an additional m stages, and will so do if the expected number of individuals I will select in the second search who are new species is large. What do you recommend?”
Statistician: “Make one more search and then I will tell you.”
Refer to the Annals of Statistics paper:
[1] Starr, Norman. “Linear estimation of the probability of discovering a new species.” The Annals of Statistics (1979): 644-652.
In statistics, we have to organize an experiment in order to gain some information about an object of interest. Fragments of this information can be obtained by making observations within some elementary experiments called trials. The set of all trials which can be incorporated in a prepared experiment will be denoted by $\mathcal{X}$, which we shall call the design space. The problem to be solved in experimental design is how to choose, say, $N$ trials $x_i \in \mathcal{X}$, $i = 1, \dots, N$, called the support points of the design, or eventually how to choose the size $N$ of the design, to gather enough information about the object of interest. Optimum experimental design corresponds to the maximization, in some sense, of this information. Specifically, the optimality of a design depends on the statistical model and is assessed with respect to a statistical criterion, which is related to the variance matrix of the estimator. Specifying an appropriate model and specifying a suitable criterion function both require an understanding of statistical theory and practical experience with designing experiments.
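We shall restrict our attention to the parametric situation. In the case of a regression model, the mean response function is then parameterized as
$$E(Y) = \eta(x, \theta),$$
specifying $E(Y)$ for a particular $x \in \mathcal{X}$ with unknown parameter vector $\theta \in \mathbb{R}^p$.
A design is specified by an initially arbitrary measure $\xi$ assigning $n$ design points to estimate the parameter vector. Here $\xi$ can be written as
$$\xi = \begin{Bmatrix} x_1 & x_2 & \cdots & x_n \\ w_1 & w_2 & \cdots & w_n \end{Bmatrix},$$
where the $n$ design support points $x_1, \dots, x_n$ are elements of the design space $\mathcal{X}$, and the associated weights $w_1, \dots, w_n$ are nonnegative real numbers which sum to one. We make the usual second moment error assumptions leading to the use of least squares estimates. Then the corresponding Fisher information matrix associated with $\theta$ is given by
$$M(\xi, \theta) = \sum_{i=1}^n w_i \frac{\partial \eta(x_i)}{\partial \theta} \frac{\partial \eta(x_i)}{\partial \theta^T} = V^T \Omega V,$$
where $V = \partial \eta / \partial \theta$ is the $n \times p$ Jacobian matrix whose $i$-th row is $\partial \eta(x_i) / \partial \theta^T$, and $\Omega = \mathrm{diag}(w_1, \dots, w_n)$.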
Now we have to propose the statistical criteria for the optimum. It is known that the least squares estimator minimizes the variance of mean-unbiased estimators (under the conditions of the Gauss–Markov theorem). In the estimation theory for statistical models with one real parameter, the reciprocal of the variance of an (“efficient”) estimator is called the “Fisher information” for that estimator. Because of this reciprocity, minimizing the variance corresponds to maximizing the information. When the statistical model has several parameters, however, the mean of the parameter-estimator is a vector and its variance is a matrix. The inverse matrix of the variance-matrix is called the “information matrix”. Because the variance of the estimator of a parameter vector is a matrix, the problem of “minimizing the variance” is complicated. Using statistical theory, statisticians compress the information-matrix using real-valued summary statistics; being real-valued functions, these “information criteria” can be maximized. The traditional optimality-criteria are invariants of the information matrix; algebraically, the traditional optimality-criteria are functionals of the eigenvalues of the information matrix.
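As a small illustration of how an information matrix is compressed into a real-valued criterion, here is a minimal Python sketch. The function name `optimality_criteria` and the 2×2 matrix `M` are made up for this example; the four summaries are the usual A-, D-, E- and T-criteria written as functions of the eigenvalues.

```python
import numpy as np

def optimality_criteria(M):
    """Classical criteria computed from the eigenvalues of an information matrix M."""
    eig = np.linalg.eigvalsh(M)      # real eigenvalues of a symmetric matrix, ascending
    return {
        "A": np.sum(1.0 / eig),      # trace of M^{-1}: smaller is better
        "D": np.prod(eig),           # determinant of M: larger is better
        "E": eig[0],                 # smallest eigenvalue of M: larger is better
        "T": np.sum(eig),            # trace of M: larger is better
    }

# Hypothetical positive definite information matrix, for illustration only.
M = np.array([[2.0, 0.5],
              [0.5, 1.0]])
crit = optimality_criteria(M)
print(crit)
```

Each entry depends on `M` only through its eigenvalues, which is why these criteria are invariants of the information matrix.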
Other optimality-criteria, such as G-optimality, are concerned with the variance of predictions.
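Now back to our example. Because the asymptotic covariance matrix associated with the LSE of $\theta$ is proportional to $M^{-1}(\xi, \theta)$, the most popular regression design criterion is D-optimality, where designs are sought to minimize the determinant of $M^{-1}(\xi, \theta)$. The standardized predicted variance function, corresponding to G-optimality, is
$$d(x, \xi, \theta) = \frac{\partial \eta(x)}{\partial \theta^T} M^{-1}(\xi, \theta) \frac{\partial \eta(x)}{\partial \theta},$$
and G-optimality seeks to minimize $\delta(\xi, \theta) = \max_{x \in \mathcal{X}} d(x, \xi, \theta)$.
A central result in the theory of optimal design, the General Equivalence Theorem, asserts that the design $\xi_D$ that is D-optimal is also G-optimal and that
$$\delta(\xi_D, \theta) = p,$$
the number of parameters.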
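The equivalence theorem can be checked numerically on a textbook case. The sketch below assumes a quadratic regression with mean $\theta_0 + \theta_1 x + \theta_2 x^2$ on the design space $[-1, 1]$ (so $p = 3$), and uses the well-known D-optimal design that puts weight 1/3 on each of $-1, 0, 1$. The function names and the comparison design are hypothetical choices for this illustration.

```python
import numpy as np

def f(x):
    """Gradient of the mean function w.r.t. theta: rows (1, x, x^2)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return np.column_stack([np.ones_like(x), x, x**2])

def information_matrix(points, weights):
    """M(xi) = sum_i w_i f(x_i) f(x_i)^T."""
    F = f(points)
    return F.T @ (np.asarray(weights, dtype=float)[:, None] * F)

def standardized_variance(x, points, weights):
    """d(x, xi) = f(x)^T M(xi)^{-1} f(x), evaluated at each x."""
    M_inv = np.linalg.inv(information_matrix(points, weights))
    F = f(x)
    return np.einsum("ij,jk,ik->i", F, M_inv, F)

points = np.array([-1.0, 0.0, 1.0])
grid = np.linspace(-1.0, 1.0, 2001)

w_opt = np.full(3, 1.0 / 3.0)            # D-optimal weights
w_alt = np.array([0.25, 0.5, 0.25])      # a non-optimal comparison design

delta_opt = standardized_variance(grid, points, w_opt).max()
delta_alt = standardized_variance(grid, points, w_alt).max()
det_opt = np.linalg.det(information_matrix(points, w_opt))
det_alt = np.linalg.det(information_matrix(points, w_alt))
print(delta_opt, delta_alt, det_opt, det_alt)
```

For the equal-weight design the maximum of the standardized variance equals the number of parameters, while the comparison design has both a larger maximum and a smaller determinant.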
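Now the optimal design for an interference model, which Professor Wei Zheng will talk about, considers the following model for block designs with neighbor effects:
$$y_{dij} = \mu + \tau_{d(i,j)} + \lambda_{d(i,j-1)} + \rho_{d(i,j+1)} + \beta_i + e_{dij},$$
where $d(i,j)$ is the treatment assigned to the plot in the $j$-th position of the $i$-th block, and $\mu$ is the general mean, $\tau_{d(i,j)}$ is the direct effect of treatment $d(i,j)$, $\lambda_{d(i,j-1)}$ and $\rho_{d(i,j+1)}$ are the left and right neighbor effects, $\beta_i$ is the effect of the $i$-th block, and $e_{dij}$ is the random error.
We seek the optimal design among designs $d \in \Omega_{n,k,t}$, the set of all designs with $n$ blocks of size $k$ and with $t$ treatments.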
I am not going into the details of the derivation of the optimal design for the above interference model; I just sketch the outline here. First of all, we can write down the information matrix for the direct treatment effect $\tau$, say $C_d$. Let $\mathcal{S}$ be the set of all possible block sequences with replacement, which is the design space. Then we try to find the optimal measure $\xi$ among the set $\mathcal{P}$ of probability measures on $\mathcal{S}$ to maximize $\Phi(C_\xi)$ for a given function $\Phi$ satisfying the following three conditions:
1. $\Phi$ is concave;
2. $\Phi(S^T C S) = \Phi(C)$ for any permutation matrix $S$;
3. $\Phi(bC)$ is nondecreasing in the scalar $b > 0$.
A measure which achieves the maximum of $\Phi(C_\xi)$ among $\mathcal{P}$ for any $\Phi$ satisfying the above three conditions is said to be universally optimal. Such a measure is optimal under the A, D, E, T, etc. criteria. Thus we could imagine that all of the analysis is just linear algebra.
As a CS prof at MIT, I have had the privilege of working with some of the very best PhD students anywhere. But even here there are some PhDs that clearly stand out as *great*. I’m going to give two answers, depending on your interpretation of “great”.
For my first answer I’d select four indispensable qualities:
0. intelligence
1. curiosity
2. creativity
3. discipline and productivity
(interestingly, I’d say the same four qualities characterize great artists).
In the “nice to have but not essential” category, I would add
4. ability to teach/communicate with an audience
5. ability to communicate with peers
The primary purpose of PhD work is to advance human knowledge. Since you’re working at the edge of what we know, the material you’re working with is hard: you have to be smart enough to master it (intelligence). This is what qualifying exams are about. But you only need to be smart *enough*; I’ve met a few spectacularly brilliant PhD students, and plenty of others who were just smart enough. This didn’t really make a difference in the quality of their PhDs (though it does affect their choice of area, since more of the truly brilliant go into the theoretical areas).
But intelligence is just a starting point. The first thing you actually have to *do* to advance human knowledge is ask questions about why things are the way they are and how they could be made better (curiosity). PhD students spend lots of time asking questions to which they don’t know the answer, so you’d better really enjoy this. Obviously, after you ask the questions you have to come up with the answers. And you have to be able to think in new directions to answer those questions (creativity). For if you can answer those questions using tried and true techniques, then they really aren’t research questions—they’re just things we already know for which we just haven’t gotten around to filling in the detail.
These two qualities are critical for a great PhD, but also lead to one of the most common failure modes: students who love asking questions and thinking about cool ways to answer them, but never actually *do* the work necessary to try out the answer. Instead, they flutter off to the next cool idea. So this is where discipline comes in: you need to be willing to bang your head against the wall for months (theoretician) or spend months hacking code (practitioner), in order to flesh out your creative idea and validate it. You need a long-term view that reminds you why you are doing this even when the fun parts (brainstorming and curiosity-satisfying) aren’t happening.
Communication skills are really valuable but sometimes dispensable. Your work can have a lot more impact if you are able to spread it to others who can incorporate it in their work. And many times you can achieve more by collaborating with others who bring different skills and insights to a problem. On the other hand, some of the greatest work (especially theoretical work) has been done by lone figures locked in their offices who publish obscure, hard-to-read papers; when that work is great enough, it eventually spreads into the community even if the originator isn’t trying to make it do so.
My second answer is more cynical. If you think about it, someone coming to do a PhD is entering an environment filled with people who excel at items 0-5 in my list. And most of those items are talents that faculty can continue to exercise as faculty, because really curiosity, creativity, and communication don’t take that much time to do well. The one place where faculty really need help is on productivity: they’re trying to advance a huge number of projects simultaneously and really don’t have the cycles to carry out the necessary work. So another way to characterize what makes a great PhD student is
0. intelligence
1. discipline and productivity
If you are off the scale in your productivity (producing code, running interviews, or working at a lab bench) and smart enough to understand the work you get asked to do, then you can be the extra pair of productive hands that the faculty member desperately needs. Your advisor can generate questions and creative ways to answer them, and you can execute. After a few years of this, they’ll thank you with a PhD.
If all you want is the PhD, this second approach is a fine one. But you should recognize that in this case your advisor is *not* going to write a recommendation letter that will get you a faculty position (though they’ll be happy to praise you to Google). There’s only one way to be a successful *faculty member*, and that’s my first answer above.
Update: Here is another article from Professors Mark Dredze (Johns Hopkins University) and Hanna M. Wallach (University of Massachusetts Amherst).
Remarks from Professor Jordan: “not only do I think that you should eventually read all of these books (or some similar list that reflects your own view of foundations), but I think that you should read all of them three times—the first time you barely understand, the second time you start to get it, and the third time it all seems obvious.”
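First of all, let’s define what FPCA is. Suppose we observe functions $X_1(t), \dots, X_N(t)$. We want to find an orthonormal basis $u_1, \dots, u_K$ such that
$$\hat{S}^2 = \sum_{i=1}^N \Big\| X_i - \sum_{k=1}^K \langle X_i, u_k \rangle u_k \Big\|^2$$
is minimized. Once such a basis is found, we can replace each curve $X_i$ by $\sum_{k=1}^K \langle X_i, u_k \rangle u_k$ to a good approximation. This means that instead of working with infinite-dimensional curves $X_i$, we can work with $K$-dimensional vectors $(\langle X_i, u_1 \rangle, \dots, \langle X_i, u_K \rangle)^T$. The functions $u_k$ are collectively called the optimal empirical orthonormal basis, or the empirical functional principal components. Note that once we have the functional principal components, we can get the so-called FPC scores $\langle X_i, u_k \rangle$ to approximate the curves.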
For FPCA, we usually adopt the so-called “smooth-first-then-estimate” approach: we first pre-process the discrete observations by smoothing to get smoothed functional data, and then use the empirical estimators of the mean and covariance based on the smoothed functional data to conduct FPCA.
For the smoothing step, we have to proceed individual by individual. For each realization, we can use a basis expansion (the polynomial basis is unstable; the Fourier basis is suitable for periodic functions; the B-spline basis is flexible and useful), smoothing penalties (which lead to smoothing splines by the Smoothing Spline Theorem), or local polynomial smoothing.
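To make the smoothing step concrete, here is a minimal sketch of local polynomial smoothing of degree one (local linear smoothing) for a single curve. The Gaussian kernel, the bandwidth `h`, the function name `local_linear`, and the simulated noisy observations are all assumptions of this example, not part of any particular package.

```python
import numpy as np

def local_linear(t_obs, y_obs, t_eval, h):
    """Local linear smoother with a Gaussian kernel and bandwidth h."""
    fitted = np.empty(len(t_eval))
    for m, t0 in enumerate(t_eval):
        d = t_obs - t0
        w = np.exp(-0.5 * (d / h) ** 2)              # kernel weights around t0
        Z = np.column_stack([np.ones_like(d), d])    # local design: intercept and slope
        WZ = Z * w[:, None]
        beta = np.linalg.solve(Z.T @ WZ, WZ.T @ y_obs)  # weighted least squares
        fitted[m] = beta[0]                          # local intercept = fit at t0
    return fitted

# Hypothetical noisy observations of one curve on [0, 1].
rng = np.random.default_rng(0)
t_obs = np.sort(rng.uniform(0.0, 1.0, 80))
y_obs = np.sin(2 * np.pi * t_obs) + rng.normal(0.0, 0.2, 80)
t_eval = np.linspace(0.0, 1.0, 101)
smoothed = local_linear(t_obs, y_obs, t_eval, h=0.08)
```

The bandwidth controls the trade-off between bias and variance; in practice it would be chosen by cross-validation or a plug-in rule rather than fixed by hand as here.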
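Once we have the smoothed functional data, denoted as $X_i(t)$, $i = 1, \dots, N$, we can have the empirical estimators of the mean and covariance as
$$\hat{\mu}(t) = \frac{1}{N} \sum_{i=1}^N X_i(t), \qquad \hat{c}(t, s) = \frac{1}{N} \sum_{i=1}^N \big(X_i(t) - \hat{\mu}(t)\big)\big(X_i(s) - \hat{\mu}(s)\big).$$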
And then we can have the empirical functional principal components as the eigenfunctions of the above sample covariance operator (for the proof, refer to the book “Inference for Functional Data with Applications”, page 39). Note that the above estimation procedure for the mean function and the covariance function requires densely observed functional data, since otherwise the smoothing step will not be stable. Thus people have also proposed other estimators of the mean function and covariance function, such as the local linear estimators proposed by Professor Yehua Li from ISU, which have the advantage that they can cover all types of functional data: sparse (i.e. longitudinal), dense, or in-between. Now the problem is how to conduct FPCA based on $\hat{c}(t, s)$ in practice. Actually, it is the following classic mathematical problem:
$$\int \hat{c}(t, s)\, u(s)\, ds = \lambda u(t), \quad \text{i.e.} \quad \hat{C} u = \lambda u,$$
where $\hat{C}$ is the integral operator with the symmetric kernel $\hat{c}(t, s)$. This is a well-studied problem in applied mathematics: computing the eigenvalues and eigenfunctions of an integral operator with a symmetric kernel. So people can refer to those numerical methods to solve the above problem.
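However, two common methods used in Statistics are described in Section 8.4 of the fundamental functional data analysis book written by Professors J. O. Ramsay and B. W. Silverman. One is the discretizing method and the other is the basis function method. For the discretizing method, essentially, we just discretize the smoothed functions $X_i(t)$ to a fine grid of equally spaced values that span the interval, and then use traditional PCA, followed by some interpolation method for points not belonging to the selected grid. For the basis function method, we illustrate it by assuming the mean function is equal to 0 and expanding each curve and each eigenfunction in a set of basis functions:
$$X_i(t) = \sum_{k=1}^K a_{ik} \phi_k(t), \qquad \hat{c}(t, s) = \frac{1}{N} \phi(t)^T A^T A\, \phi(s), \qquad u(s) = \phi(s)^T b,$$
where $\phi(t) = (\phi_1(t), \dots, \phi_K(t))^T$ and $A = (a_{ik})$ is the $N \times K$ coefficient matrix. Hence, with $W = \int \phi(t) \phi(t)^T dt$, the eigen problem boils down to $N^{-1} \phi(t)^T A^T A W b = \lambda \phi(t)^T b$ for all $t$, which is equivalent to
$$\frac{1}{N} A^T A W b = \lambda b.$$
Note that the assumptions for the eigenfunctions to be orthonormal are equivalent to $b_j^T W b_j = 1$ and $b_j^T W b_k = 0$ for $j \neq k$. Let $\tilde{b} = W^{1/2} b$, and then we have the above problem as
$$\frac{1}{N} W^{1/2} A^T A W^{1/2} \tilde{b} = \lambda \tilde{b},$$
which is a traditional eigen problem for the symmetric matrix $N^{-1} W^{1/2} A^T A W^{1/2}$.
Two special cases deserve particular attention. One is an orthonormal basis, which leads to $W = I$. The other is taking the smoothed functional data themselves as the basis functions, which leads to $A = I$.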
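The basis function method can be sketched in a few lines of linear algebra. In the sketch below, $A$ is the $N \times K$ matrix of basis coefficients of the curves, $W$ is the $K \times K$ matrix of inner products of the basis functions, and $b$ is the coefficient vector of an eigenfunction. As an assumption for illustration, the basis is the monomial basis $1, t, \dots, t^{K-1}$ on $[0, 1]$ (so $W$ is a Hilbert matrix, known in closed form) and the coefficients are simulated; a real analysis would more likely use B-splines or a Fourier basis.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 200, 4

# Inner-product matrix of the monomial basis on [0, 1]: W[j, k] = 1 / (j + k + 1).
idx = np.arange(K)
W = 1.0 / (idx[:, None] + idx[None, :] + 1.0)

# Hypothetical coefficient matrix, centered so that the mean function is zero.
A = rng.standard_normal((N, K))
A = A - A.mean(axis=0)

# Symmetric square root of W and its inverse via the spectral decomposition.
w_val, w_vec = np.linalg.eigh(W)
W_half = (w_vec * np.sqrt(w_val)) @ w_vec.T
W_half_inv = (w_vec / np.sqrt(w_val)) @ w_vec.T

# Ordinary symmetric eigenproblem for W^{1/2} (A^T A / N) W^{1/2}.
S = W_half @ (A.T @ A / N) @ W_half
lam, U = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]          # largest eigenvalue first
lam, U = lam[order], U[:, order]

# Back-transform: columns of B are the coefficient vectors b_k of the eigenfunctions.
B = W_half_inv @ U

def eigenfunction(k, t):
    """Evaluate the k-th eigenfunction phi(t)^T b_k on points t."""
    t = np.asarray(t, dtype=float)
    Phi = t[:, None] ** idx[None, :]
    return Phi @ B[:, k]
```

The back-transformed vectors satisfy the original non-symmetric problem and the orthonormality constraint with respect to $W$, which is what makes the detour through $W^{1/2}$ worthwhile.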
Note that the empirical functional principal components are proved to be the eigenfunctions of the sample covariance operator. This fact connects FPCA with the so-called Karhunen-Loève expansion:
$$X(t) = \mu(t) + \sum_{k=1}^{\infty} \xi_k u_k(t),$$
where $\xi_k$ are uncorrelated random variables with mean 0 and variance $\lambda_k$, where $\lambda_1 \geq \lambda_2 \geq \cdots \geq 0$. For simplicity we assume $\mu(t) = 0$. Then we can easily see the connection between the KL expansion and FPCA: $\{u_k\}$ is the series of orthonormal basis functions, and $\xi_k$ are the FPC scores.
So far, we have only discussed how to get the empirical functional principal components, i.e. the eigenfunctions/orthonormal basis functions. But to represent the functional data, we have to get the coefficients, which are called FPC scores $\xi_{ik}$. The simplest way is by numerical integration:
$$\hat{\xi}_{ik} = \int X_i(t)\, \hat{u}_k(t)\, dt.$$
Note that for the above estimation of the FPC scores via numerical integration, we first need the smoothed functional data $X_i(t)$. So if we only have sparsely observed functional data, this method will not provide reasonable approximations. Professor Fang Yao et al. proposed the so-called PACE (principal component analysis through conditional expectation) to deal with such longitudinal data.
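The discretizing method and the numerical-integration scores can be sketched together. The example assumes densely observed, already smoothed curves on an equally spaced midpoint grid of $[0, 1]$; the curves are simulated from two orthonormal components with hypothetical score variances 4 and 1, and integrals are approximated by Riemann sums with step `h`.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 300, 101
t = (np.arange(T) + 0.5) / T       # midpoint grid on [0, 1]
h = 1.0 / T                        # grid spacing used as quadrature weight

# Hypothetical smooth curves built from two orthonormal components.
true_scores = rng.standard_normal((N, 2)) * np.array([2.0, 1.0])
components = np.vstack([np.sqrt(2) * np.sin(2 * np.pi * t),
                        np.sqrt(2) * np.cos(2 * np.pi * t)])
X = true_scores @ components       # N x T matrix of curve values

# Empirical mean and covariance on the grid.
mu_hat = X.mean(axis=0)
Xc = X - mu_hat
c_hat = Xc.T @ Xc / N              # T x T matrix of c_hat(t_j, t_l)

# Discretized eigenproblem: the integral operator becomes the matrix c_hat * h.
eigval, eigvec = np.linalg.eigh(c_hat * h)
order = np.argsort(eigval)[::-1][:2]
lam_hat = eigval[order]
u_hat = eigvec[:, order] / np.sqrt(h)   # rescale so that sum(u^2) * h = 1

# FPC scores by numerical integration: sum_j Xc_i(t_j) * u_k(t_j) * h.
fpc_scores = (Xc @ u_hat) * h
```

With this normalization the sample second moment of each score column equals the corresponding estimated eigenvalue, mirroring the variance statement in the Karhunen-Loève expansion.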
I often use fit criteria like AIC and BIC to choose between models. I know that they try to balance good fit with parsimony, but beyond that I’m not sure what exactly they mean. What are they really doing? Which is better? What does it mean if they disagree? — Signed, Adrift on the IC’s
Intuitively, the degrees of freedom of a fitting procedure reflects the effective number of parameters used by the fitting procedure. Thus to most applied statisticians, a fitting procedure’s degrees of freedom is synonymous with its model complexity, or its capacity for overfitting to data. Is this really true? Regularization aims to improve prediction performance by trading an increase in training error for better agreement between training and prediction errors, which is often captured through decreased degrees of freedom. Is this always the case? When does more regularization imply fewer degrees of freedom?
For the above two questions, I think the most important thing is to first answer the following “what” questions:
What are AIC and BIC? What is degrees of freedom?
Akaike’s Information Criterion (AIC) estimates the relative Kullback-Leibler (KL) distance of the likelihood function specified by a fitted candidate model from the unknown true likelihood function that generated the data:
$$KL = E_t[\ell_t(y)] - E_t[\ell(y)],$$
where $\ell$ is the log-likelihood function specified by a fitted candidate model, $\ell_t$ is the unknown true log-likelihood function, and the expectation $E_t$ is taken under the true model. Note that the fitted model closest to the truth in the KL sense would not necessarily be the model which best fits the observed sample, since the observed sample can often be fit arbitrarily well by making the model more and more complex. Since $E_t[\ell_t(y)]$ will be the same for all models being considered, KL is minimized by choosing the model with the highest $E_t[\ell(y)]$, which can be estimated by an approximately unbiased estimator (up to a constant)
$$\ell - \mathrm{tr}(\hat{J}^{-1} \hat{K}),$$
where $\hat{J}$ is an estimator for the covariance matrix of the parameters based on the second derivatives matrix of $\ell$ in the parameters and $\hat{K}$ is an estimator based on the cross products of the first derivatives. Akaike showed that $\hat{J}$ and $\hat{K}$ are asymptotically equal for the true model, so that $\mathrm{tr}(\hat{J}^{-1} \hat{K}) \approx p$, which is the number of parameters. This results in the usual definition for AIC
$$AIC = -2\ell + 2p.$$
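Schwarz’s Bayesian Information Criterion (BIC) compares the posterior probabilities of candidate models under the same prior, and hence just compares the likelihoods under the different models:
$$B_{12} = \frac{P(y \mid M_1)}{P(y \mid M_2)},$$
which is just the Bayes factor. Schwarz showed that in many kinds of models $B_{12}$ can be roughly approximated by
$$\exp\Big(-\frac{1}{2} BIC_1 + \frac{1}{2} BIC_2\Big),$$
which leads to the definition of BIC
$$BIC = -2\ell + p \log n,$$
where $\ell$ is the maximized log-likelihood of the model, $p$ is its number of parameters, and $n$ is the sample size.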
In summary, AIC and BIC are both penalized-likelihood criteria. AIC is an estimate of a constant plus the relative distance between the unknown true likelihood function of the data and the fitted likelihood function of the model, so that a lower AIC means a model is considered to be closer to the truth. BIC is an estimate of a function of the posterior probability of a model being true, under a certain Bayesian setup, so that a lower BIC means that a model is considered to be more likely to be the true model. Both criteria are based on various assumptions and asymptotic approximations. Despite various subtle theoretical differences, their only difference in practice is the size of the penalty; BIC penalizes model complexity more heavily. The only way they should disagree is when AIC chooses a larger model than BIC. Thus, AIC always has a chance of choosing too big a model, regardless of n. BIC has very little chance of choosing too big a model if n is sufficient, but it has a larger chance than AIC, for any given n, of choosing too small a model.
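To see the two penalties side by side, here is a minimal sketch that selects a polynomial degree with both criteria, using AIC $= -2\ell + 2p$ and BIC $= -2\ell + p \log n$ with the maximized Gaussian log-likelihood $\ell$. The simulated data, the true quadratic model, and the helper names are assumptions made for this illustration.

```python
import numpy as np

def gaussian_loglik(y, fitted):
    """Maximized Gaussian log-likelihood, with the error variance at its MLE."""
    n = len(y)
    sigma2 = np.mean((y - fitted) ** 2)
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)

def aic(loglik, p):
    return -2.0 * loglik + 2.0 * p

def bic(loglik, p, n):
    return -2.0 * loglik + p * np.log(n)

# Hypothetical data from a quadratic model with Gaussian noise.
rng = np.random.default_rng(2)
n = 200
x = rng.uniform(-1.0, 1.0, n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0.0, 0.5, n)

aic_vals, bic_vals = [], []
for deg in range(0, 9):
    coefs = np.polyfit(x, y, deg)
    fitted = np.polyval(coefs, x)
    p = deg + 2                     # polynomial coefficients plus the error variance
    ll = gaussian_loglik(y, fitted)
    aic_vals.append(aic(ll, p))
    bic_vals.append(bic(ll, p, n))

deg_aic = int(np.argmin(aic_vals))  # degree chosen by AIC
deg_bic = int(np.argmin(bic_vals))  # degree chosen by BIC
print(deg_aic, deg_bic)
```

Because the BIC penalty per parameter, $\log n$, exceeds the AIC penalty of 2 whenever $n > e^2 \approx 7.4$, the degree chosen by BIC can never exceed the one chosen by AIC on the same set of candidate models.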
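The effective degrees of freedom for an arbitrary modelling approach is defined based on the concept of expected optimism:
$$df(\hat{\mu}_\lambda) = \frac{1}{2\sigma^2} E\Big[\|y^* - \hat{\mu}_\lambda(y)\|^2 - \|y - \hat{\mu}_\lambda(y)\|^2\Big],$$
where $\sigma^2$ is the variance of the error term, $y^*$ is an independent copy of the data vector $y$ with mean $\mu$, and $\hat{\mu}_\lambda$ is a fitting procedure with tuning parameter $\lambda$. Note that the expected optimism is defined as $\omega = E\|y^* - \hat{\mu}_\lambda(y)\|^2 - E\|y - \hat{\mu}_\lambda(y)\|^2$. And by the optimism theorem, we have that
$$df(\hat{\mu}_\lambda) = \frac{1}{\sigma^2} \sum_{i=1}^n \mathrm{Cov}\big(\hat{\mu}_{\lambda, i}(y), y_i\big).$$
Why does this definition make sense? In fact, under some regularity conditions, Stein proved that
$$df(\hat{\mu}_\lambda) = E\Big[\sum_{i=1}^n \frac{\partial \hat{\mu}_{\lambda, i}(y)}{\partial y_i}\Big],$$
which can be regarded as a sensitivity measure of the fitted values to the observations.
In the linear model, we know (Mallows) that the relationship between the expected prediction error (EPE) and the residual sum of squares (RSS) follows
$$EPE = E[RSS] + 2\sigma^2 p,$$
which leads to $df = p$.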
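The covariance form of the degrees of freedom, $\sigma^{-2} \sum_i \mathrm{Cov}(\hat{\mu}_i, y_i)$, can be checked by simulation. The sketch below uses ridge regression as the fitting procedure, an assumption chosen because it is a linear smoother $\hat{\mu} = Hy$ whose degrees of freedom reduce to $\mathrm{tr}(H)$; the design matrix, true mean, and penalty values are hypothetical.

```python
import numpy as np

def ridge_hat_matrix(X, lam):
    """Hat matrix H of ridge regression, so that fitted values are H @ y."""
    p = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

def df_trace(X, lam):
    """Degrees of freedom of a linear smoother: trace of its hat matrix."""
    return float(np.trace(ridge_hat_matrix(X, lam)))

def df_monte_carlo(X, lam, mu, sigma, reps, rng):
    """Estimate (1 / sigma^2) * sum_i Cov(mu_hat_i, y_i) from simulated data sets."""
    H = ridge_hat_matrix(X, lam)
    Y = mu + sigma * rng.standard_normal((reps, len(mu)))  # each row is one data vector
    fits = Y @ H.T                                         # each row is one fitted vector
    Yc = Y - Y.mean(axis=0)
    Fc = fits - fits.mean(axis=0)
    cov_sum = np.sum(Yc * Fc) / (reps - 1)                 # sum of per-coordinate covariances
    return float(cov_sum / sigma**2)

rng = np.random.default_rng(3)
n, p, sigma = 50, 5, 1.0
X = rng.standard_normal((n, p))
mu = X @ np.ones(p)

df_ols = df_trace(X, 0.0)          # no penalty: ordinary least squares
df_ridge = df_trace(X, 10.0)       # penalized fit
df_ridge_mc = df_monte_carlo(X, 10.0, mu, sigma, reps=5000, rng=rng)
print(df_ols, df_ridge, df_ridge_mc)
```

With no penalty the trace equals the number of regressors, in line with the linear-model result above; for ridge regression more regularization does mean fewer degrees of freedom, though this monotonicity is not guaranteed for every fitting procedure.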
Here are some references on this topic: