--- title: "Fine-mapping with SuSiE-ash and SuSiE-inf" author: "Alex McCreight" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Fine-mapping with SuSiE-ash and SuSiE-inf} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(collapse = TRUE,comment = "#",fig.width = 5, fig.height = 3,fig.align = "center", dpi = 120) ``` This vignette demonstrates how to use the SuSiE-ash and SuSiE-inf models. We use a simulated data set with n = 1000 individuals, p = 5000 variants, and a complex genetic architecture combining 3 sparse, 5 oligogenic, and 15 polygenic effects. ## Data ```{r} library(susieR) data(unmappable_data) X <- unmappable_data$X y <- as.vector(unmappable_data$y) b <- unmappable_data$beta plot(abs(b), ylab = "Absolute Effect Size", pch = 16) points(which(b != 0), abs(b[b != 0]), col = 2, pch = 16) ``` ## Summary Statistics and Z-Scores Before fitting the models, we can examine the marginal association statistics. We use `univariate_regression()` to compute effect sizes and standard errors, then derive z-scores: ```{r} sumstats <- univariate_regression(X, y) z_scores <- sumstats$betahat / sumstats$sebetahat ``` The z-scores show the strength of marginal association for each variant. Red points indicate non-zero effect sizes: ```{r} susie_plot(z_scores, y = "z", b = b, add_legend = TRUE) ``` Here we can see the signal landscape before fine-mapping. Note that some causal variants have strong z-scores while others may be weaker or masked by LD with nearby variants. Notably, variant 2714 has the largest true effect size in the simulation: ```{r} strongest_idx <- which.max(abs(b)) cat("Strongest effect variant:", strongest_idx, "\n") cat("True effect size:", round(b[strongest_idx], 3), "\n") cat("Marginal z-score:", round(z_scores[strongest_idx], 3), "\n") cat("Marginal p-value:", format.pval(2 * pnorm(-abs(z_scores[strongest_idx]))), "\n") ``` Despite having the largest true effect, this variant has a very small marginal z-score and a large p-value. This illustrates a fundamental challenge in fine-mapping: the marginal association of a large-effect causal variant can be masked by other variants in LD, while smaller-effect variants may show stronger marginal signals. This masking makes it difficult to identify the true causal variants from marginal statistics alone. ## Step 1: Standard SuSiE and False Positives We first fit standard SuSiE: ```{r} t0 <- proc.time() fit_susie <- susie(X, y, L = 10) t1 <- proc.time() t1 - t0 susie_plot(fit_susie, y = "PIP", b = b, main = "SuSiE (standard)", add_legend = TRUE) ``` We set `L = 10` to allow SuSiE to capture up to 10 sparse effects. However, given the complex architecture with 23 true causal variants, this may be insufficient. To see which true effects the credible sets capture, we plot the CS on the true effect sizes: ```{r} plot_cs_effects <- function(fit, b, main = "") { colors <- c("dodgerblue2", "green4", "#6A3D9A", "#FF7F00", "gold1", "firebrick2") plot(abs(b), pch = 16, ylab = "Absolute Effect Size", main = main) if (!is.null(fit$sets$cs)) { for (i in rev(seq_along(fit$sets$cs))) { cs_idx <- fit$sets$cs[[i]] points(cs_idx, abs(b[cs_idx]), col = colors[(i-1) %% 6 + 1], pch = 16, cex = 1.5) } } cat(sprintf("True causals: %s\n", paste(which(b != 0), collapse=", "))) for (i in seq_along(fit$sets$cs)) { cs_idx <- fit$sets$cs[[i]] sentinel <- cs_idx[which.max(fit$pip[cs_idx])] tp <- any(b[cs_idx] != 0) cat(sprintf(" CS%d: %d %s\n", i, sentinel, ifelse(tp, "TP", "FP"))) } } plot_cs_effects(fit_susie, b, main = "SuSiE CS on true effects") ``` SuSiE identifies 5 credible sets, but examining them more closely reveals a problem. Many of these credible sets appear to be false positives arising from synthetic associations. A synthetic association occurs when a non-causal variant shows an association with the phenotype because it is in LD with true causal variants. The non-causal variant "borrows" signal from correlated effect variants, and when it is correlated with multiple effect variants, these contributions can accumulate to create an inflated signal. Let's examine one of the false positive credible sets to see this in action: ```{r} nonzero_idx <- which(b != 0) fp_cs <- fit_susie$sets$cs[["L4"]] top_var <- fp_cs[which.max(fit_susie$pip[fp_cs])] cat("False positive CS top variant:", top_var, "\n") cat("True effect (beta):", b[top_var], "\n") cat("Z-score:", round(z_scores[top_var], 2), "\n\n") cat("LD with true effect variants and their contributions:\n") contributions <- data.frame( variant = nonzero_idx, r = round(sapply(nonzero_idx, function(v) cor(X[, top_var], X[, v])), 2), beta = round(b[nonzero_idx], 2) ) contributions$r_times_beta <- round(contributions$r * contributions$beta, 2) contributions <- contributions[order(-abs(contributions$r_times_beta)), ] print(head(contributions[abs(contributions$r) > 0.1, ], 6), row.names = FALSE) ``` Variant `r top_var` has **no true effect** (beta = 0), yet it has a z-score of `r round(z_scores[top_var], 2)`. This synthetic signal arises because it is correlated with multiple effect variants. Notice that: - It has **negative LD** with negative-effect variants (2714, 2939, 2943), giving **positive** contributions - It has **positive LD** with a positive-effect variant (2903), also giving a **positive** contribution These contributions accumulate to create a synthetic signal at the non-causal variant, which SuSiE then incorrectly identifies as a distinct effect. The other false positive credible sets arise from the same artifact. ## Step 2: Increasing Purity to Reduce False Positives One approach to reduce false positives is to increase the purity threshold. By default, SuSiE uses `min_abs_corr = 0.5`. Let's try `min_abs_corr = 0.8`: ```{r} cs_pure <- susie_get_cs(fit_susie, X = X, min_abs_corr = 0.8) cat("Number of CSs with purity >= 0.8:", length(cs_pure$cs), "\n") ``` Raising the purity threshold removes some false positives, but not all of them. Some false positive credible sets have high purity because the non-causal variants within them are highly correlated with each other. These sets pass the purity filter yet still fail to contain any true causal variants. ## Step 3: Fitting SuSiE-inf SuSiE-inf models an infinitesimal component to account for unmappable effects: ```{r} t0 <- proc.time() fit_inf <- susie(X, y, L = 10, unmappable_effects = "inf") t1 <- proc.time() t1 - t0 susie_plot(fit_inf, y = "PIP", b = b, main = "SuSiE-inf", add_legend = TRUE) ``` (Note that it may take several minutes to fit the SuSiE-Inf model.) ```{r} plot_cs_effects(fit_inf, b, main = "SuSiE-inf CS on true effects") ``` SuSiE-inf is more conservative and finds only 1 credible set, eliminating the false positives. However, it also loses the true signal around position 3500 that standard SuSiE correctly identified. Remarkably, SuSiE-inf identifies the variant with the strongest true effect, the same variant we noted earlier has a very small marginal z-score and large p-value: ```{r} if (length(fit_inf$sets$cs) > 0) { inf_cs <- fit_inf$sets$cs[[1]] cat("SuSiE-inf CS contains variant", strongest_idx, ":", strongest_idx %in% inf_cs, "\n") cat("This variant has the largest true effect (beta =", round(b[strongest_idx], 3), ") but marginal z-score of only", round(z_scores[strongest_idx], 3), "\n") } ``` This is a striking, as SuSiE-inf recovers the strongest causal signal that is completely invisible in marginal statistics. The intuition is that by modeling background polygenic effects, SuSiE-inf effectively conditions on other variants, revealing signals that are otherwise masked. However, for other signals, even if we lower the coverage threshold, we cannot recover them, potentially because SuSiE-inf was too aggressive removing them early-on in the SuSiE fit: ```{r} for (cov in c(0.9, 0.8, 0.7, 0.5)) { cs <- susie_get_cs(fit_inf, X = X, coverage = cov) cat(sprintf("Coverage=%.1f: %d credible sets\n", cov, length(cs$cs))) } ``` ## Step 4: SuSiE-ash Achieves the Middle Ground SuSiE-ash uses adaptive shrinkage to model the unmappable effects, providing a middle ground between standard SuSiE and SuSiE-inf: ```{r} t0 <- proc.time() fit_ash <- susie(X, y, L = 10, unmappable_effects = "ash", verbose = TRUE) t1 <- proc.time() t1 - t0 susie_plot(fit_ash, y = "PIP", b = b, main = "SuSiE-ash", add_legend = TRUE) ``` (Note that it may take several minutes to fit the SuSiE-ash model.) ```{r} plot_cs_effects(fit_ash, b, main = "SuSiE-ash CS on true effects") ``` SuSiE-ash finds 3 correct credible sets. Still, it does not discover all 23 causal variants, nor does it recover the strongest effect (variant 2714) that SuSiE-inf found. However, the adaptive shrinkage approach allows SuSiE-ash to distinguish between true sparse signals and the polygenic background more effectively than either standard SuSiE or SuSiE-inf alone. SuSiE-ash can also be used with summary statistics via `susie_ss()`, using `mr.ash.rss` as the backend for the adaptive shrinkage component. This enables fine-mapping with unmappable effects when only sufficient statistics (X'X, X'y, y'y) are available. ```{r} XtX <- crossprod(X) Xty <- crossprod(X, y) yty <- sum(y^2) n <- nrow(X) t0 <- proc.time() fit_ash_ss <- susie_ss(XtX = XtX, Xty = Xty, yty = yty, n = n, L = 10, unmappable_effects = "ash", verbose = TRUE) t1 <- proc.time() t1 - t0 susie_plot(fit_ash_ss, y = "PIP", b = b, main = "SuSiE-ash (SS)", add_legend = TRUE) ``` ```{r} plot_cs_effects(fit_ash_ss, b, main = "SuSiE-ash (SS) CS on true effects") ``` We can verify the agreement between the two approaches: ```{r} pip_ind <- susie_get_pip(fit_ash) pip_ss <- susie_get_pip(fit_ash_ss) cat("Max |PIP difference|:", max(abs(pip_ind - pip_ss)), "\n") cat("PIP correlation:", cor(pip_ind, pip_ss), "\n") cat("Max |theta difference|:", max(abs(fit_ash$theta - fit_ash_ss$theta)), "\n") ``` When only GWAS summary statistics are available, `susie_rss()` can be used. For best agreement with individual-level analysis, we recommend providing effect sizes (`bhat`), standard errors (`shat`), and the phenotypic variance (`var_y`) along with the in-sample LD matrix (`R`) and sample size (`n`). This preserves the original data scale, allowing the adaptive shrinkage component to calibrate correctly: ```{r} R <- cor(X) bhat <- sumstats$betahat shat <- sumstats$sebetahat t0 <- proc.time() fit_ash_rss <- susie_rss(bhat = bhat, shat = shat, R = R, var_y = var(y), n = n, L = 10, unmappable_effects = "ash", estimate_residual_variance = TRUE, verbose = TRUE) t1 <- proc.time() t1 - t0 susie_plot(fit_ash_rss, y = "PIP", b = b, main = "SuSiE-ash (RSS)", add_legend = TRUE) ``` ```{r} plot_cs_effects(fit_ash_rss, b, main = "SuSiE-ash (RSS) CS on true effects") ``` With `bhat`, `shat`, `var_y`, and in-sample LD, `susie_rss` results match `susie_ss` closely: ```{r} pip_rss <- susie_get_pip(fit_ash_rss) cat("Max |PIP difference| (SS vs RSS):", max(abs(pip_ss - pip_rss)), "\n") cat("PIP correlation (SS vs RSS):", cor(pip_ss, pip_rss), "\n") ``` When only z-scores and an LD matrix are available (without `bhat`, `shat`, or `var_y`), `susie_rss` operates on a standardized scale where `var(y) = 1`. The credible sets are typically still correct, but the estimated `sigma2` will be on the standardized scale rather than the original scale. Note that when the LD matrix is not computed from the same sample as the summary statistics (out-of-sample LD), setting `estimate_residual_variance = FALSE` may be more appropriate to avoid bias from LD mismatch. ## Summary | Method | Credible Sets | False Positives | |--------|---------------|-----------------| | SuSiE (purity=0.5) | 5 | 4 | | SuSiE (purity=0.8) | 3 | 2 | | SuSiE-inf | 1 | 0 | | SuSiE-ash | 3 | 0 | | SuSiE-ash (SS) | 3 | 0 | | SuSiE-ash (RSS) | 3 | 0 | ## What if we increase L for standard SuSiE? Since the true simulation has 23 causal variants, one might ask: what if we simply increase `L` to give SuSiE more capacity? Let's try `L = 40`: ```{r} t0 <- proc.time() fit_susie_L40 <- susie(X, y, L = 40) t1 <- proc.time() t1 - t0 susie_plot(fit_susie_L40, y = "PIP", b = b, main = "SuSiE (L=40)", add_legend = TRUE) plot_cs_effects(fit_susie_L40, b, main = "SuSiE L=40 CS on true effects") ``` With `L = 40`, standard SuSiE does improve! Now it captures 4 CS with two of them true positives. However, it still produces false positives and takes considerably longer to converge. The rationale for SuSiE-ash is to avoid this concern: rather than specifying a large `L` to account for all potential effects, we use a reasonable `L` for sparse signals and let the adaptive shrinkage component absorb effects that cannot be mapped due to insufficiently specified `L`. This provides a more principled and computationally efficient approach. Naturally as a result, SuSiE-ash is more robust to the choice of `L` compared to SuSiE. For this example, setting `L` anywhere from 5 to 40 yields similar results, unlike standard SuSiE where performance varies substantially with `L`. (Readers can verify this on their own with this data-set) ## Session Information ```{r} sessionInfo() ```