Last night I had a discussion with a friend about integrative data analysis (closely related to an AOAS 2014 paper from Dr. Xihong Lin’s group and a JASA 2014 paper from Dr. Hongzhe Li’s group). Suppose a biologist gives you genetic variant data (e.g. SNPs) and phenotype data (e.g. some trait), and asks you to run an association analysis to identify the genetic variants that are significantly associated with the trait. A year later, the biologist obtains additional data, such as gene expression data, related to the two earlier data sets, and you are now asked to calibrate your analysis so that, by integrating the three data sources, you can detect the association more efficiently and powerfully. In this data-rich age, it is quite natural to run into such situations in practice. The question is how to come up with a natural and useful statistical framework for such data integration.

For simplicity, consider the following problem: you are first given two random variables $X, Y$ and asked to study the association between them. Later on you are given another random variable $Z$ to help detect a significant association between $X$ and $Y$. We assume the following true model:

$Y=\beta X+\epsilon,$

where $X$ is independent of $\epsilon$. The question now is: what properties must $Z$ have in order to raise the power of the detection?

• What if $X$ and $Z$ are uncorrelated? And in that case, what if $Y$ and $Z$ are also uncorrelated?
• What if $X$ and $Z$ are correlated?

After thinking these through, you will find that for $Z$ to be useful, ideally $Z$ should be uncorrelated with $X$ and highly correlated with $Y$, i.e. highly correlated with the error term $\epsilon$, so that it can explain more of the variation in $Y$ and thereby reduce the noise level.
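This intuition can be checked with a quick simulation. The sketch below (with hypothetical parameter values; the helper `t_stat` is my own, not from the original discussion) generates $Z$ uncorrelated with $X$ but correlated with the error, and compares the power to detect $\beta$ in the model $Y \sim X$ versus $Y \sim X + Z$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta, gamma, reps = 200, 0.1, 1.0, 500

def t_stat(D, y):
    """t-statistic for the first coefficient in a no-intercept OLS fit."""
    XtX_inv = np.linalg.inv(D.T @ D)
    bhat = XtX_inv @ D.T @ y
    resid = y - D @ bhat
    sigma2 = resid @ resid / (len(y) - D.shape[1])
    return bhat[0] / np.sqrt(sigma2 * XtX_inv[0, 0])

rej_uni = rej_multi = 0
for _ in range(reps):
    X = rng.normal(size=n)
    Z = rng.normal(size=n)                           # uncorrelated with X
    eps = gamma * Z + rng.normal(scale=0.5, size=n)  # noise driven partly by Z
    Y = beta * X + eps
    rej_uni += abs(t_stat(X[:, None], Y)) > 1.96
    rej_multi += abs(t_stat(np.column_stack([X, Z]), Y)) > 1.96

power_uni, power_multi = rej_uni / reps, rej_multi / reps
```

Under this setup the multiple regression rejects far more often, since including $Z$ shrinks the residual variance without touching the coefficient on $X$.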

To see why, first notice that the problem reduces to understanding the following multiple linear regression:

$Y=\alpha X+ \gamma Z+\varepsilon.$

Standard multiple linear regression theory then gives

$\beta=\alpha+\gamma\delta,$

where $Z=\delta X+\eta$ (see below for the proof). Thus in order to raise the signal-to-noise ratio, we hope that $\alpha=\beta$, i.e. $\gamma=0$ or $\delta=0$, which keeps the signal large. But in order to reduce the noise, we need $\gamma\neq 0$. In summary, we need $\delta=0$, which means that $X$ and $Z$ are uncorrelated, and $\gamma\neq 0$, which means that $Z$ explains some of the variability contained in the noise.
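The identity $\beta=\alpha+\gamma\delta$ can be verified numerically. A minimal sketch, assuming hypothetical parameter values and mean-zero variables so the no-intercept slope formula applies:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
alpha, gamma, delta = 0.5, 0.8, 0.6
X = rng.normal(size=n)
Z = delta * X + rng.normal(size=n)              # Z = delta*X + eta
Y = alpha * X + gamma * Z + rng.normal(size=n)  # Y = alpha*X + gamma*Z + eps

# Univariate OLS slope of Y on X (no intercept; all variables are mean zero)
beta_hat = (X @ Y) / (X @ X)
# beta_hat should be close to alpha + gamma * delta
```

With a large sample, `beta_hat` lands close to $\alpha+\gamma\delta = 0.98$ rather than $\alpha = 0.5$: the univariate slope absorbs the part of $Z$'s effect that flows through $X$.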

What is the difference between doing univariate regression one by one and doing multiple linear regression all at once?

Here is a hint: first regress both $Y$ and $Z$ onto $X$,

$E(Y|X)=(\alpha+\gamma\delta) X, \quad E(Z|X)=\delta X.$

On one hand, this shows that $\beta=\alpha+\gamma\delta$; on the other hand, regressing the residual $Y-E(Y|X)=\gamma\eta+\varepsilon$ onto the residual $Z-E(Z|X)=\eta$ recovers $\gamma$ via

$Y-E(Y|X)=\gamma [Z-E(Z|X)]+\varepsilon.$

This procedure in fact explains what multiple linear regression is and what the coefficients mean (think about the meaning of $\gamma$ in light of the above explanation).
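The residual-on-residual step above is exactly the Frisch–Waugh–Lovell theorem: partialling $X$ out of both $Y$ and $Z$ and then regressing the residuals reproduces the coefficient on $Z$ from the full multiple regression. A sketch with hypothetical parameter values:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
alpha, gamma, delta = 0.5, 0.8, 0.6
X = rng.normal(size=n)
Z = delta * X + rng.normal(size=n)
Y = alpha * X + gamma * Z + rng.normal(size=n)

def slope(x, y):
    return (x @ y) / (x @ x)    # no-intercept OLS slope (mean-zero variables)

# Step 1: regress Y and Z on X, keep the residuals
rY = Y - slope(X, Y) * X
rZ = Z - slope(X, Z) * X
# Step 2: the residual-on-residual slope is the multiple-regression gamma
gamma_fwl = slope(rZ, rY)

# Full multiple regression of Y on (X, Z) for comparison
coef = np.linalg.lstsq(np.column_stack([X, Z]), Y, rcond=None)[0]
```

Here `gamma_fwl` agrees with `coef[1]` up to numerical precision, since the equality is an exact algebraic identity of least squares, not just a large-sample approximation.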