This vignette introduces the important topic of using data-driven
covariance matries in mashr
. You should read the introductory vignette before this.
Recall the four major steps in a mash analysis:
Read in the data
Set up the covariance matrices to be used
Fit the model
Extract posterior summaries and other quantities of interest
Here we pay more attention to Step 2, which can be split into two steps:
Set up the “canonical” covariance matrices.
Set up the “data-driven” covariance matrices.
The first of these steps (“canonical” covariance matrices) is
straightforward using cov_canonical
, as illustrated in the introductory vignette.
Setting up the data-driven matrices is slightly more complex, and there are multiple possible approaches. This vignette illustrates one approach.
First we simulate some data for illustration (see the the introductory vignette for more details):
Although the user is free to set up data driven covariance matrices using any method that they would like, typically we suggest the following three-step strategy:
Locate some strong signals. For example, here we do this within R
by running a “condition-by-condition” using mash_1by1
, but
you could do this as a preprocessing step in other software - for
example in eQTL studies you might instead put together a separate
dataset containing the test results for only the “top” eQTL or eQTLs in
each gene.
Use methods such as PCA or factor analysis on these top signals
to compute some initial data-driven covariance matrices (e.g. using
cov_pca
).
Use these data-driven covariance matrices as initializations for
the extreme deconvolution (ED) algorithm (using cov_ed
), to
get some refined data-driven covariance matrix estimates.
The ED step is helpful primarily because the second step will often estimate the covariances of the observed data (Bhat) whereas what is required is the covariances of the actual underlying effects (B), and this is what ED estimates. However, ED can be quite sensitive to initialization, and so the goal of the second step is to provide a good initialization.
If your entire data set (matrix of all tests in all
conditions) is not too big (eg <100k tests say) then you can simply
do this by setting up the entire data set as a mash data object (using
mash_set_data
), and then running a condition-by-condition
(1by1) analysis on all the data. For example, here we select the strong
signals as those with lfsr<0.05 in any condition in the 1by1
analysis.
data = mash_set_data(simdata$Bhat, simdata$Shat)
m.1by1 = mash_1by1(data)
strong = get_significant_results(m.1by1,0.05)
This sets up a vector strong
containing the indices
corresponding to the “significant” rows of the test results in
data
. This vector is used in subsequent functions below to
specify which rows of data
to use.
There are various approaches to this; here we just illustrate the
simplest and quickest (but probably not the best), which is to use PCA.
Specifically here we use the function cov_pca
to produce
covariance matrices based on the top 5 PCs of the strong signals. The
result is a list of 6 covariance matrices: one based on all 5 PCs, and
the others each based on one PC.
After all this we are ready to run mash
using the data
driven matrices. Remember the Crucial Rule that we must fit mash to
all tests - do not use only the strong
subset
here!!
From the fit we see that the fit using the data-driven covariances is not as good as when we used the canonical covariances (which was -16120.32; from the introductory vignette). This is expected for this simulation because the simulation actually used the canonical covariances!
In general we recommend running mash
with both
data-driven and canonical covariances. You could do this by combining
the data-driven and canonical covariances as in this code:
For an example with simulations that do not follow the standard canonical matrices see here.