
1. Is a PhD worth it in machine learning?

This one is about the dilemma of a statistics graduate who is considering a degree in machine learning.

2. What is the use of manifold learning?

Why do we need manifold learning? What is its biggest advantage?

3. Is the p-value a good measure of evidence?

“Statistics abounds criteria for assessing quality of estimators, tests, forecasting rules, classification algorithms, but besides the likelihood principle discussions, it seems to be almost silent on what criteria should a good measure of evidence satisfy.” (M. Grendár)

4. What’s the relationship between Principal Component Analysis (PCA) and Ordinary Least Squares (OLS)?

Today I got a good sense of two important things in statistics: one is the linear mixed model (LMM) and the other is Markov chain Monte Carlo (MCMC).

LMM:

Linear mixed models (LMM) handle data where observations are not independent. That is, LMM correctly models correlated errors, whereas procedures in the general linear model family (GLM, which includes t-tests, analysis of variance, correlation, regression, and factor analysis) usually do not. LMM is a further generalization of GLM that better supports analysis of a continuous dependent variable with random effects, hierarchical effects, and repeated measures.
LMM is required whenever the OLS regression assumption of independent error is violated, as it often is whenever data cluster by some grouping variable (ex., scores nested within schools) or by some repeated measure (ex., yearly scores nested by student id).
There are many varieties of LMM, involving diverse labels: random intercept models, random coefficients models, hierarchical linear models, variance components models, covariance components models, and multilevel models, to name a few. While most multilevel modeling is univariate (one dependent variable), multivariate multilevel modeling for two or more dependent variables is available also.
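As a concrete illustration (a minimal sketch of my own, not part of the original notes), a random-intercept model for the scores-nested-within-schools example could be fit in Python with statsmodels; the data frame and column names below are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: students' scores nested within schools.
df = pd.DataFrame({
    "score":  [72, 68, 75, 80, 77, 83, 60, 65, 63, 70, 74, 71],
    "ses":    [0.1, -0.3, 0.5, 0.8, 0.2, 1.0, -0.9, -0.4, -0.6, 0.0, 0.3, 0.1],
    "school": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"],
})

# OLS would assume independent errors; the random intercept for 'school'
# absorbs the correlation among students from the same school.
model = smf.mixedlm("score ~ ses", data=df, groups=df["school"])
result = model.fit()
print(result.summary())
```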

MCMC:

The big idea behind MCMC is this: rather than sampling from a hard-to-sample target distribution p(.), we will sample from a Markov chain that has p(.) as its stationary distribution. The transition probability of this Markov chain should be a distribution that is easier to sample from. Intuitively, if we generate enough samples from the Markov chain, we will effectively sample from the target distribution.
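A minimal sketch of this idea (my own illustration, with an arbitrary Gaussian proposal scale and a toy target) is the random-walk Metropolis sampler below, whose Markov chain has the target density as its stationary distribution.

```python
import numpy as np

def metropolis(log_target, x0, n_samples=10000, step=1.0, seed=0):
    """Random-walk Metropolis: sample from a density known only up to a
    constant by simulating a Markov chain whose stationary distribution
    is the target."""
    rng = np.random.default_rng(seed)
    x = x0
    samples = np.empty(n_samples)
    for i in range(n_samples):
        proposal = x + step * rng.normal()          # easy-to-sample proposal
        log_alpha = log_target(proposal) - log_target(x)
        if np.log(rng.uniform()) < log_alpha:       # accept with prob min(1, alpha)
            x = proposal
        samples[i] = x
    return samples

# Example target: standard normal log-density (up to a constant).
draws = metropolis(lambda x: -0.5 * x**2, x0=0.0)
print(draws[1000:].mean(), draws[1000:].std())      # roughly 0 and 1 after burn-in
```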

A haplotype is a group of genes within an organism that was inherited together from a single parent. The word “haplotype” is derived from the word “haploid,” which describes cells with only one set of chromosomes, and from the word “genotype,” which refers to the genetic makeup of an organism. A haplotype can describe a pair of genes inherited together from one parent on one chromosome, or it can describe all of the genes on a chromosome that were inherited together from a single parent. This group of genes was inherited together because of genetic linkage, or the phenomenon by which genes that are close to each other on the same chromosome are often inherited together. In addition, the term “haplotype” can also refer to the inheritance of a cluster of single nucleotide polymorphisms (SNPs), which are variations at single positions in the DNA sequence among individuals.

By examining haplotypes, scientists can identify patterns of genetic variation that are associated with health and disease states. For instance, if a haplotype is associated with a certain disease, then scientists can examine stretches of DNA near the SNP cluster to try to identify the gene or genes responsible for causing the disease.

Over the course of many generations, segments of the ancestral chromosomes in an interbreeding population are shuffled through repeated recombination events. Some of the segments of the ancestral chromosomes occur as regions of DNA sequences that are shared by multiple individuals (Figure 1). These segments are regions of chromosomes that have not been broken up by recombination, and they are separated by places where recombination has occurred. These segments are the haplotypes that enable geneticists to search for genes involved in diseases and other medically important traits.


What causes loss of power in hypothesis testing?

Several days ago, I asked the above question on StackExchange. The following are the answers I received:

There is an enormous literature on this subject; I’ll just give you a quick thumbnail sketch. Let’s say you’re testing groups’ mean differences, as in T-tests. Power will be reduced if…

  1. …variables are measured unreliably. This will in essence “fuzz over” the differences.
  2. …variability within groups is high. This will render between-group differences less noteworthy.
  3. …your criterion for statistical significance is strict, e.g., .001 as opposed to the more common .05.
  4. …you are using a one-tailed test, hypothesizing that if there is a difference, a certain group will have the higher mean. This cuts down on opportunistically significant findings that occur with two-tailed tests.
  5. …you are dealing with (or expecting) very slim mean differences in the first place. Small effect sizes decrease power.

If you google “power calculator” you’ll find some nifty sites that demonstrate how these factors interact, depending on the values you input. And if you download the free G*Power 3 program you’ll have a nice tool for calculating power and exploring the effects of different conditions, for a wide variety of statistical procedures.
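In the same spirit, here is a small sketch of my own using statsmodels (rather than G*Power) to show how effect size and the significance criterion trade off against power in a two-sample t-test; the numbers are purely illustrative.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power of a two-sample t-test for several effect sizes and alpha levels,
# with 50 observations per group.
for effect_size in (0.2, 0.5, 0.8):      # small, medium, large (Cohen's d)
    for alpha in (0.05, 0.001):          # common vs. strict criterion
        power = analysis.power(effect_size=effect_size, nobs1=50,
                               alpha=alpha, ratio=1.0, alternative="two-sided")
        print(f"d={effect_size}, alpha={alpha}: power={power:.2f}")
```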

Another answer is:

Think of power as your ability to recognize which of two competing data-generating processes is the true one. This will obviously be easier when:

  1. The data have much signal with little noise (good SNR).
  2. The competing processes are very different one from the other (null and alternative are separated).
  3. You have ample amounts of data.
  4. You are not too concerned with mistakes (large type I error rate).

If the above do not hold, it will be harder to choose the “true” process. I.e., you will lose power.
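The same intuition can be checked with a small simulation (a sketch of my own, with hypothetical values for the separation, noise level, sample size, and alpha): estimate power as the fraction of simulated experiments in which the null is rejected.

```python
import numpy as np
from scipy import stats

def simulated_power(delta, sigma, n, alpha, n_reps=5000, seed=0):
    """Fraction of simulated two-sample experiments that reject the null."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_reps):
        x = rng.normal(0.0, sigma, n)        # first group
        y = rng.normal(delta, sigma, n)      # second group, shifted by delta
        _, p = stats.ttest_ind(x, y)
        rejections += p < alpha
    return rejections / n_reps

# More separation, less noise, more data, or a larger alpha -> more power.
print(simulated_power(delta=0.5, sigma=1.0, n=30, alpha=0.05))
print(simulated_power(delta=0.5, sigma=2.0, n=30, alpha=0.05))    # more noise
print(simulated_power(delta=0.5, sigma=1.0, n=100, alpha=0.05))   # more data
print(simulated_power(delta=0.5, sigma=1.0, n=30, alpha=0.001))   # stricter alpha
```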

These days I have seen some discussion about simplicity. Simplicity is relative and requires context: something that is simple to you may not be simple to others, and, more extremely, what human beings regard as simple might be very complicated to Martians. Why do human beings find certain things simple? It is a result of natural selection; people can only manage simple things. But simplicity does not imply a lack of depth, nor a lack of goodness. On the contrary, simple is good. The simple things we can perceive are often the deepest and most important ones, even if they would be very hard and complicated for Martians. Our ability to grasp what is simple is itself a product of natural selection, somewhat like the outcome of reinforcement learning: human beings have paid a great deal to acquire the ability to capture the good and essential things, which now appear simple to us.

 For example, consider this mnemonic for pi:

How I want a drink, alcoholic of course, after the heavy lectures involving quantum mechanics.

This sentence is easier for most people to remember than 3.14159265358979.  But the sentence is also more complex. A computer can represent the number in 8 bytes but the sentence takes 94 bytes of ASCII, more in Unicode.
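Both claims are easy to check in a few lines (a small sketch of my own):

```python
import struct

mnemonic = ("How I want a drink, alcoholic of course, "
            "after the heavy lectures involving quantum mechanics.")

# Each word length gives one digit of pi.
digits = "".join(str(len(w.strip(",."))) for w in mnemonic.split())
print(digits[0] + "." + digits[1:])      # 3.14159265358979

print(len(mnemonic.encode("ascii")))     # 94 bytes of ASCII
print(struct.calcsize("d"))              # 8 bytes for a double-precision float
```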

We need to reevaluate what we believe is simple. Maybe what we think is simple is really complex but familiar. And maybe something new that is objectively simpler would become even easier once we’re used to it.
