Last night, I had a discussion with my friend about integrative data analysis (closely related to the AOAS 2014 paper from Dr. Xihong Lin’s group and the JASA 2014 paper from Dr. Hongzhe Li’s group). Suppose a biologist gives you genetic variant data (e.g. SNPs) and phenotype data (e.g. some trait), and asks you to do an association analysis to identify the genetic variants that are significantly associated with the trait. One year later, the biologist obtains some additional data, such as gene expression data, related to the two data sets given before, and you are now asked to calibrate your analysis to detect the association more efficiently and powerfully by integrating the three data sources. In this data-rich age, it’s quite natural to run into this situation in practice. The question is how to come up with a natural and useful statistical framework to deal with such data integration.
For simplicity, we consider the following problem: you are first given two random variables $X$ and $Y$ and asked to study the association between them. Later on you are given another random variable $Z$ to help detect the significance of the association between $X$ and $Y$. We assume the following true model:

$$Y = \beta X + \epsilon,$$

where $\epsilon$ is independent of $X$. Now the question is: what characteristics should $Z$ have in order to be helpful for raising the power of the detection?
- What if $X$ and $Z$ are uncorrelated? If they are uncorrelated, then what if $Z$ and $Y$ are uncorrelated?
- What if $X$ and $Z$ are correlated?
After thinking about these, you will find that for $Z$ to be useful, it’s ideal that $Z$ is uncorrelated with $X$ and highly correlated with $Y$, i.e. highly correlated with the error term $\epsilon$, so that it can be used to explain more of the variation contained in $Y$ and thus reduce the noise level.
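To make this concrete, here is a minimal simulation sketch (the sample size, effect size, and correlation strengths are my own choices, not from the original discussion): we draw from the true model with a $Z$ that is independent of $X$ but correlated with $\epsilon$, and compare the t-statistic for the coefficient of $X$ with and without $Z$ in the regression.

```python
import numpy as np

# True model Y = 0.2*X + eps, where eps is independent of X
# but correlated with Z (illustrative numbers).
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=n)
Z = rng.normal(size=n)                     # uncorrelated with X
eps = 0.8 * Z + 0.6 * rng.normal(size=n)   # correlated with Z, not with X
Y = 0.2 * X + eps

def fit(y, design):
    """OLS fit; returns (t-statistic of last coefficient, residual variance)."""
    beta = np.linalg.lstsq(design, y, rcond=None)[0]
    resid = y - design @ beta
    s2 = resid @ resid / (len(y) - design.shape[1])
    cov = s2 * np.linalg.inv(design.T @ design)
    return beta[-1] / np.sqrt(cov[-1, -1]), s2

ones = np.ones(n)
t_single, s2_single = fit(Y, np.column_stack([ones, X]))        # Y ~ X
t_joint, s2_joint = fit(Y, np.column_stack([ones, Z, X]))       # Y ~ Z + X
print(t_single, t_joint)  # the t-statistic for X grows once Z absorbs noise
```

Because $Z$ is uncorrelated with $X$, the estimate of the coefficient of $X$ is essentially unchanged, while the residual variance drops, so the test for the $X$–$Y$ association becomes more powerful.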
In order to see why, first notice that the problem depends exactly on how to understand the following multiple linear regression problem:

$$Y = \beta_1 X + \beta_2 Z + \eta.$$
Now from the multiple linear regression knowledge, we have

$$\beta_1 = \frac{\sigma_Z^2\,\sigma_{XY} - \sigma_{XZ}\,\sigma_{ZY}}{\sigma_X^2\,\sigma_Z^2 - \sigma_{XZ}^2},$$

where $\sigma_{XY} = \mathrm{Cov}(X, Y)$, $\sigma_X^2 = \mathrm{Var}(X)$, and so on (see below for the proof). Thus, in order to raise the signal-to-noise ratio, we hope that $\beta_1 = \beta$, i.e. $\sigma_{XZ} = 0$, or equivalently $\mathrm{Cov}(X, Z) = 0$, which keeps the signal large. But in order to reduce the noise, we need $\mathrm{Var}(\eta) < \mathrm{Var}(\epsilon)$, which requires $\mathrm{Cov}(Z, \epsilon) \neq 0$. In summary, we need $\mathrm{Cov}(X, Z) = 0$, which means that $X$ and $Z$ are uncorrelated, and $\mathrm{Cov}(Z, \epsilon) \neq 0$, which means that $Z$ can be used to explain some of the variability contained in the noise.
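The displayed expression for $\beta_1$ is just the population version of the OLS normal equations, so it can be checked numerically by plugging sample covariances into the formula and comparing against a fitted regression (a small sketch; the data-generating numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=n)
Z = 0.5 * X + rng.normal(size=n)           # X and Z deliberately correlated
Y = 0.5 * X + 0.8 * Z + rng.normal(size=n)

# Sample covariances standing in for the population quantities.
sXX, sZZ = np.var(X, ddof=1), np.var(Z, ddof=1)
sXZ = np.cov(X, Z)[0, 1]
sXY = np.cov(X, Y)[0, 1]
sZY = np.cov(Z, Y)[0, 1]
b1_formula = (sZZ * sXY - sXZ * sZY) / (sXX * sZZ - sXZ**2)

# OLS fit of Y on (1, X, Z); the coefficient of X matches the formula.
design = np.column_stack([np.ones(n), X, Z])
b1_ols = np.linalg.lstsq(design, Y, rcond=None)[0][1]
print(b1_formula, b1_ols)
```

The two numbers agree up to floating-point error, since with sample covariances the formula is an algebraic identity for the OLS slope.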
Now please think about the question:
What is the difference between doing univariate regression one by one and doing multiple linear regression all at once?
Here is some hint: first we regress $Y$ and $X$ both onto $Z$,

$$\hat{Y} = \frac{\sigma_{ZY}}{\sigma_Z^2} Z, \qquad \hat{X} = \frac{\sigma_{ZX}}{\sigma_Z^2} Z.$$

And then on one hand we find that $Y - \hat{Y} = \beta_1 (X - \hat{X}) + \eta$, and on the other hand we regress the residual $Y - \hat{Y}$ onto the residual $X - \hat{X}$ to get $\beta_1$ via

$$\beta_1 = \frac{\mathrm{Cov}(Y - \hat{Y},\, X - \hat{X})}{\mathrm{Var}(X - \hat{X})}.$$
This procedure actually explains what multiple linear regression is and what its coefficients mean (think about the meaning of $\beta_1$ from the above explanation).
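This two-stage residual-on-residual regression is the Frisch–Waugh–Lovell theorem, and it is easy to verify numerically that it reproduces the coefficient of $X$ from the joint multiple regression exactly (a sketch with arbitrary simulated data):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
X = rng.normal(size=n)
Z = 0.4 * X + rng.normal(size=n)
Y = 0.5 * X + 0.8 * Z + rng.normal(size=n)

def ols(y, *cols):
    """OLS with an intercept; returns (coefficients, residuals)."""
    design = np.column_stack([np.ones(len(y))] + list(cols))
    beta = np.linalg.lstsq(design, y, rcond=None)[0]
    return beta, y - design @ beta

# Step 1: regress Y and X separately onto Z, keep the residuals.
_, resid_Y = ols(Y, Z)
_, resid_X = ols(X, Z)

# Step 2: regress the Y-residual onto the X-residual.
beta_fwl, _ = ols(resid_Y, resid_X)

# Joint multiple regression of Y on (X, Z) for comparison.
beta_joint, _ = ols(Y, X, Z)
print(beta_fwl[1], beta_joint[1])  # the two estimates of beta_1 coincide
```

So the multiple regression coefficient of $X$ is the association between $Y$ and $X$ *after* both have been adjusted for $Z$, which is exactly the interpretation the hint is driving at.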