---
title: "Fine-mapping with SuSiE-ash and SuSiE-inf"
author: "Alex McCreight"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Fine-mapping with SuSiE-ash and SuSiE-inf}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE,comment = "#",fig.width = 5,
                      fig.height = 3,fig.align = "center",
                      dpi = 120)
```

This vignette demonstrates how to use the SuSiE-ash and SuSiE-inf models.
We use a simulated data set with n = 1000 individuals, p = 5000 variants, and
a complex genetic architecture combining 3 sparse, 5 oligogenic, and 15 polygenic effects.

## Data

```{r}
library(susieR)
data(unmappable_data)

X <- unmappable_data$X
y <- as.vector(unmappable_data$y)
b <- unmappable_data$beta

plot(abs(b), ylab = "Absolute Effect Size", pch = 16)
points(which(b != 0), abs(b[b != 0]), col = 2, pch = 16)
```

## Summary Statistics and Z-Scores

Before fitting the models, we can examine the marginal association statistics.
We use `univariate_regression()` to compute effect sizes and standard errors,
then derive z-scores:

```{r}
sumstats <- univariate_regression(X, y)
z_scores <- sumstats$betahat / sumstats$sebetahat
```

The z-scores show the strength of marginal association for each variant.
Red points indicate non-zero effect sizes:

```{r}
susie_plot(z_scores, y = "z", b = b, add_legend = TRUE)
```

Here we can see the signal landscape before fine-mapping. Note that 
some causal variants have strong z-scores while others may be weaker
or masked by LD with nearby variants.

Notably, variant 2714 has the largest true effect size in the simulation:

```{r}
strongest_idx <- which.max(abs(b))
cat("Strongest effect variant:", strongest_idx, "\n")
cat("True effect size:", round(b[strongest_idx], 3), "\n")
cat("Marginal z-score:", round(z_scores[strongest_idx], 3), "\n")
cat("Marginal p-value:", format.pval(2 * pnorm(-abs(z_scores[strongest_idx]))), "\n")
```

Despite having the largest true effect, this variant has a very small marginal z-score and 
a large p-value. This illustrates a fundamental challenge in fine-mapping: the marginal 
association of a large-effect causal variant can be masked by other variants in LD, 
while smaller-effect variants may show stronger marginal signals. 
This masking makes it difficult to identify the true causal variants from marginal statistics alone.

## Step 1: Standard SuSiE and False Positives

We first fit standard SuSiE:

```{r}
t0 <- proc.time()
fit_susie <- susie(X, y, L = 10)
t1 <- proc.time()
t1 - t0
susie_plot(fit_susie, y = "PIP", b = b, main = "SuSiE (standard)", add_legend = TRUE)
```

We set `L = 10` to allow SuSiE to capture up to 10 sparse effects. 
However, given the complex architecture with 23 true causal variants, this may be insufficient.

To see which true effects the credible sets capture, we plot the CS on
the true effect sizes:

```{r}
plot_cs_effects <- function(fit, b, main = "") {
  colors <- c("dodgerblue2", "green4", "#6A3D9A", "#FF7F00", "gold1", "firebrick2")
  plot(abs(b), pch = 16, ylab = "Absolute Effect Size", main = main)
  if (!is.null(fit$sets$cs)) {
    for (i in rev(seq_along(fit$sets$cs))) {
      cs_idx <- fit$sets$cs[[i]]
      points(cs_idx, abs(b[cs_idx]), col = colors[(i-1) %% 6 + 1], pch = 16, cex = 1.5)
    }
  }
  
  cat(sprintf("True causals: %s\n", paste(which(b != 0), collapse=", ")))
  for (i in seq_along(fit$sets$cs)) {
    cs_idx <- fit$sets$cs[[i]]
    sentinel <- cs_idx[which.max(fit$pip[cs_idx])]
    tp <- any(b[cs_idx] != 0)
    cat(sprintf("  CS%d: %d %s\n", i, sentinel, ifelse(tp, "TP", "FP")))
  }
}

plot_cs_effects(fit_susie, b, main = "SuSiE CS on true effects")
```

SuSiE identifies 5 credible sets, but examining them more closely reveals a problem.
Many of these credible sets appear to be false positives arising from
synthetic associations.

A synthetic association occurs when a non-causal variant shows an association
with the phenotype because it is in LD with true causal variants. The non-causal
variant "borrows" signal from correlated effect variants, and when it is
correlated with multiple effect variants, these contributions can accumulate
to create an inflated signal.

Let's examine one of the false positive credible sets to see this in action:

```{r}
nonzero_idx <- which(b != 0)
fp_cs <- fit_susie$sets$cs[["L4"]]
top_var <- fp_cs[which.max(fit_susie$pip[fp_cs])]

cat("False positive CS top variant:", top_var, "\n")
cat("True effect (beta):", b[top_var], "\n")
cat("Z-score:", round(z_scores[top_var], 2), "\n\n")

cat("LD with true effect variants and their contributions:\n")
contributions <- data.frame(
  variant = nonzero_idx,
  r = round(sapply(nonzero_idx, function(v) cor(X[, top_var], X[, v])), 2),
  beta = round(b[nonzero_idx], 2)
)
contributions$r_times_beta <- round(contributions$r * contributions$beta, 2)
contributions <- contributions[order(-abs(contributions$r_times_beta)), ]
print(head(contributions[abs(contributions$r) > 0.1, ], 6), row.names = FALSE)
```

Variant `r top_var` has **no true effect** (beta = 0), yet it has a z-score of
`r round(z_scores[top_var], 2)`. This synthetic signal arises because it is
correlated with multiple effect variants. Notice that:

- It has **negative LD** with negative-effect variants (2714, 2939, 2943), giving **positive** contributions
- It has **positive LD** with a positive-effect variant (2903), also giving a **positive** contribution

These contributions accumulate to create a synthetic signal at the non-causal
variant, which SuSiE then incorrectly identifies as a distinct effect. The other
false positive credible sets arise from the same artifact.

## Step 2: Increasing Purity to Reduce False Positives

One approach to reduce false positives is to increase the purity threshold.
By default, SuSiE uses `min_abs_corr = 0.5`. Let's try `min_abs_corr = 0.8`:

```{r}
cs_pure <- susie_get_cs(fit_susie, X = X, min_abs_corr = 0.8)
cat("Number of CSs with purity >= 0.8:", length(cs_pure$cs), "\n")
```

Raising the purity threshold removes some false positives, but not all of them.
Some false positive credible sets have high purity because the non-causal
variants within them are highly correlated with each other. These sets pass the
purity filter yet still fail to contain any true causal variants.

## Step 3: Fitting SuSiE-inf

SuSiE-inf models an infinitesimal component to account for unmappable effects:

```{r}
t0 <- proc.time()
fit_inf <- susie(X, y, L = 10, unmappable_effects = "inf")
t1 <- proc.time()
t1 - t0
susie_plot(fit_inf, y = "PIP", b = b, main = "SuSiE-inf", add_legend = TRUE)
```

(Note that it may take several minutes to fit the SuSiE-Inf model.)

```{r}
plot_cs_effects(fit_inf, b, main = "SuSiE-inf CS on true effects")
```

SuSiE-inf is more conservative and finds only 1 credible set, eliminating the
false positives. However, it also loses the true signal around position 3500
that standard SuSiE correctly identified.

Remarkably, SuSiE-inf identifies the variant with the strongest true effect, the 
same variant we noted earlier has a very small marginal z-score and large p-value:

```{r}
if (length(fit_inf$sets$cs) > 0) {
  inf_cs <- fit_inf$sets$cs[[1]]
  cat("SuSiE-inf CS contains variant", strongest_idx, ":", strongest_idx %in% inf_cs, "\n")
  cat("This variant has the largest true effect (beta =", round(b[strongest_idx], 3), 
      ") but marginal z-score of only", round(z_scores[strongest_idx], 3), "\n")
}
```

This is a striking, as SuSiE-inf recovers the strongest causal signal that is completely 
invisible in marginal statistics. The intuition is that by modeling background polygenic 
effects, SuSiE-inf effectively conditions on other variants, revealing signals that are otherwise masked.

However, for other signals, even if we lower the coverage threshold, we cannot recover 
them, potentially because SuSiE-inf was too aggressive removing them early-on in the SuSiE fit:

```{r}
for (cov in c(0.9, 0.8, 0.7, 0.5)) {
  cs <- susie_get_cs(fit_inf, X = X, coverage = cov)
  cat(sprintf("Coverage=%.1f: %d credible sets\n", cov, length(cs$cs)))
}
```

## Step 4: SuSiE-ash Achieves the Middle Ground

SuSiE-ash uses adaptive shrinkage to model the unmappable effects, providing
a middle ground between standard SuSiE and SuSiE-inf:

```{r}
t0 <- proc.time()
fit_ash <- susie(X, y, L = 10, unmappable_effects = "ash", verbose = TRUE)
t1 <- proc.time()
t1 - t0
susie_plot(fit_ash, y = "PIP", b = b, main = "SuSiE-ash", add_legend = TRUE)
```

(Note that it may take several minutes to fit the SuSiE-ash model.)

```{r}
plot_cs_effects(fit_ash, b, main = "SuSiE-ash CS on true effects")
```

SuSiE-ash finds 3 correct credible sets.

Still, it does not discover all 23 causal variants, nor does it recover the strongest effect (variant 2714) that SuSiE-inf found.
However, the adaptive shrinkage approach allows SuSiE-ash to distinguish between true
sparse signals and the polygenic background more effectively than either
standard SuSiE or SuSiE-inf alone.


SuSiE-ash can also be used with summary statistics via `susie_ss()`, using
`mr.ash.rss` as the backend for the adaptive shrinkage component. This enables
fine-mapping with unmappable effects when only sufficient statistics (X'X, X'y,
y'y) are available.

```{r}
XtX <- crossprod(X)
Xty <- crossprod(X, y)
yty <- sum(y^2)
n <- nrow(X)

t0 <- proc.time()
fit_ash_ss <- susie_ss(XtX = XtX, Xty = Xty, yty = yty, n = n,
                        L = 10, unmappable_effects = "ash", verbose = TRUE)
t1 <- proc.time()
t1 - t0
susie_plot(fit_ash_ss, y = "PIP", b = b, main = "SuSiE-ash (SS)", add_legend = TRUE)
```

```{r}
plot_cs_effects(fit_ash_ss, b, main = "SuSiE-ash (SS) CS on true effects")
```

We can verify the agreement between the two approaches:

```{r}
pip_ind <- susie_get_pip(fit_ash)
pip_ss <- susie_get_pip(fit_ash_ss)
cat("Max |PIP difference|:", max(abs(pip_ind - pip_ss)), "\n")
cat("PIP correlation:", cor(pip_ind, pip_ss), "\n")
cat("Max |theta difference|:", max(abs(fit_ash$theta - fit_ash_ss$theta)), "\n")
```

When only GWAS summary statistics are available, `susie_rss()` can be used.
For best agreement with individual-level analysis, we recommend providing
effect sizes (`bhat`), standard errors (`shat`), and the phenotypic variance
(`var_y`) along with the in-sample LD matrix (`R`) and sample size (`n`).
This preserves the original data scale, allowing the adaptive shrinkage
component to calibrate correctly:

```{r}
R <- cor(X)
bhat <- sumstats$betahat
shat <- sumstats$sebetahat

t0 <- proc.time()
fit_ash_rss <- susie_rss(bhat = bhat, shat = shat, R = R,
                          var_y = var(y), n = n,
                          L = 10, unmappable_effects = "ash",
                          estimate_residual_variance = TRUE, verbose = TRUE)
t1 <- proc.time()
t1 - t0
susie_plot(fit_ash_rss, y = "PIP", b = b, main = "SuSiE-ash (RSS)", add_legend = TRUE)
```

```{r}
plot_cs_effects(fit_ash_rss, b, main = "SuSiE-ash (RSS) CS on true effects")
```

With `bhat`, `shat`, `var_y`, and in-sample LD, `susie_rss` results match
`susie_ss` closely:

```{r}
pip_rss <- susie_get_pip(fit_ash_rss)
cat("Max |PIP difference| (SS vs RSS):", max(abs(pip_ss - pip_rss)), "\n")
cat("PIP correlation (SS vs RSS):", cor(pip_ss, pip_rss), "\n")
```

When only z-scores and an LD matrix are available (without `bhat`, `shat`,
or `var_y`), `susie_rss` operates on a standardized scale where `var(y) = 1`.
The credible sets are typically still correct, but the estimated `sigma2`
will be on the standardized scale rather than the original scale.
Note that when the LD matrix is not computed from the same sample as the
summary statistics (out-of-sample LD), setting
`estimate_residual_variance = FALSE` may be more appropriate to avoid
bias from LD mismatch.

## Summary

| Method | Credible Sets | False Positives |
|--------|---------------|-----------------|
| SuSiE (purity=0.5) | 5 | 4 |
| SuSiE (purity=0.8) | 3 | 2 |
| SuSiE-inf | 1 | 0 |
| SuSiE-ash | 3 | 0 |
| SuSiE-ash (SS) | 3 | 0 |
| SuSiE-ash (RSS) | 3 | 0 |

## What if we increase L for standard SuSiE?

Since the true simulation has 23 causal variants, 
one might ask: what if we simply increase `L` to give SuSiE more capacity? 
Let's try `L = 40`:

```{r}
t0 <- proc.time()
fit_susie_L40 <- susie(X, y, L = 40)
t1 <- proc.time()
t1 - t0
susie_plot(fit_susie_L40, y = "PIP", b = b, main = "SuSiE (L=40)", add_legend = TRUE)
plot_cs_effects(fit_susie_L40, b, main = "SuSiE L=40 CS on true effects")
```

With `L = 40`, standard SuSiE does improve! Now it captures 4 CS with two of them true positives. 
However, it still produces false positives and takes considerably longer to converge.

The rationale for SuSiE-ash is to avoid this concern: rather than specifying a large `L` to account 
for all potential effects, we use a reasonable `L` for sparse signals and let the adaptive shrinkage 
component absorb effects that cannot be mapped due to insufficiently specified `L`. 
This provides a more principled and computationally efficient approach.

Naturally as a result, SuSiE-ash is more robust to the choice of `L` compared to SuSiE. 
For this example, setting `L` anywhere from 5 to 40 yields similar results, 
unlike standard SuSiE where performance varies substantially with `L`. 
(Readers can verify this on their own with this data-set)

## Session Information

```{r}
sessionInfo()
```