---
title: "Empirical Bayes matrix factorization for data driven prior"
author: "Gao Wang" 
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{EBMN for prior}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,comment = "#",fig.width = 5,
                      fig.height = 4,fig.align = "center",
                      eval = FALSE)
```

# Introduction

This is continuation of the [eQTL analysis vignette][eqtl].
In that vignette we have used PCA to compute data driven covariances. Here we demonstrate
the use of additional data driven covariance, via [flash](https://github.com/stephenslab/flashr) decomposition.

# Dataset simulation

Same as the [eQTL analysis vignette][eqtl] we simulate a toy data-set,

```{r}
library(ashr)
library(mashr)
set.seed(1)
simdata = simple_sims(10000,5,1) # simulates data on 40k tests

# identify a subset of strong tests
m.1by1 = mash_1by1(mash_set_data(simdata$Bhat,simdata$Shat))
strong.subset = get_significant_results(m.1by1,0.05)

# identify a random subset of 5000 tests
random.subset = sample(1:nrow(simdata$Bhat),5000)
```

and create `random` and `strong` sets,

```{r}
data.temp = mash_set_data(simdata$Bhat[random.subset,],simdata$Shat[random.subset,])
Vhat = estimate_null_correlation_simple(data.temp)
rm(data.temp)
data.random = mash_set_data(simdata$Bhat[random.subset,],simdata$Shat[random.subset,],V=Vhat)
data.strong = mash_set_data(simdata$Bhat[strong.subset,],simdata$Shat[strong.subset,], V=Vhat)
```

# FLASH analysis

We first perform the empirical Bayes matrix factorization via `flashr` using its default settings,

```{r}
U.f = cov_flash(data.strong)
U.f
```

Alternatively, as [suggested by Jason Willwerscheid for multi-tissue QTL studies][willwerscheid],
constrain factors to non-negative values. We can use `tag` parameter to customize the names in the output list:

```{r}
U.f = cov_flash(data.strong, factors="nonneg", tag="non_neg")
U.f
```

# Finalize covariances

```{r}
U.pca = cov_pca(data.strong, 5)
U.ed = cov_ed(data.strong, c(U.f, U.pca))
U.c = cov_canonical(data.random)
```
    
# Fit mash model (estimate mixture proportions)

Now we fit mash to the random tests using both data-driven and canonical covariances. 
```{r}
m = mash(data.random, Ulist = c(U.ed,U.c), outputlevel = 1)
```

# Compute posterior summaries

Now we can compute posterior summaries etc for any subset of tests using the above mash fit. Here we do this for the `strong` tests.
```{r}
m2 = mash(data.strong, g=get_fitted_g(m), fixg=TRUE)
head(get_lfsr(m2))
```
[eqtl]: https://stephenslab.github.io/mashr/articles/eQTL_outline.html
[willwerscheid]: https://willwerscheid.github.io/MASHvFLASH/MASHvFLASHnn2.html