Title: | Sum of Single Effects Linear Regression |
---|---|
Description: | Implements methods for variable selection in linear regression based on the "Sum of Single Effects" (SuSiE) model, as described in Wang et al (2020) <DOI:10.1101/501114> and Zou et al (2021) <DOI:10.1101/2021.11.03.467167>. These methods provide simple summaries, called "Credible Sets", for accurately quantifying uncertainty in which variables should be selected. The methods are motivated by genetic fine-mapping applications, and are particularly well-suited to settings where variables are highly correlated and detectable effects are sparse. The fitting algorithm, a Bayesian analogue of stepwise selection methods called "Iterative Bayesian Stepwise Selection" (IBSS), is simple and fast, allowing the SuSiE model be fit to large data sets (thousands of samples and hundreds of thousands of variables). |
Authors: | Gao Wang [aut], Yuxin Zou [aut], Kaiqian Zhang [aut], Peter Carbonetto [aut, cre], Matthew Stephens [aut] |
Maintainer: | Peter Carbonetto <[email protected]> |
License: | BSD_3_clause + file LICENSE |
Version: | 0.12.42 |
Built: | 2024-11-18 03:23:25 UTC |
Source: | https://github.com/stephenslab/susier |
Extract regression coefficients from susie fit
## S3 method for class 'susie' coef(object, ...)
## S3 method for class 'susie' coef(object, ...)
object |
A susie fit. |
... |
Additional arguments passed to the generic |
A p+1 vector, the first element being an intercept, and the remaining p elements being estimated regression coefficients.
susie_suff_stat
This is a synonym for compute_suff_stat
included for historical reasons (deprecated).
compute_ss(X, y, standardize = FALSE)
compute_ss(X, y, standardize = FALSE)
X |
An n by p matrix of covariates. |
y |
An n vector. |
standardize |
Logical flag indicating whether to standardize columns of X to unit variance prior to computing summary data. |
A list of sufficient statistics (X'X, X'y, y'y and n)
data(N2finemapping) ss = compute_ss(N2finemapping$X, N2finemapping$Y[,1])
data(N2finemapping) ss = compute_ss(N2finemapping$X, N2finemapping$Y[,1])
susie_suff_stat
Computes the sufficient statistics
and
after centering (and possibly standardizing) the
columns of
and centering
to have mean zero. We also
store the column means of
and mean of
.
compute_suff_stat(X, y, standardize = FALSE)
compute_suff_stat(X, y, standardize = FALSE)
X |
An n by p matrix of covariates. |
y |
An n vector. |
standardize |
Logical flag indicating whether to standardize columns of X to unit variance prior to computing summary data |
A list of sufficient statistics (XtX, Xty, yty, n
)
and X_colmeans
, y_mean
.
data(N2finemapping) ss = compute_suff_stat(N2finemapping$X, N2finemapping$Y[,1])
data(N2finemapping) ss = compute_suff_stat(N2finemapping$X, N2finemapping$Y[,1])
susie_rss
Model Using Regularized LDThe estimated s gives information about the
consistency between the z scores and LD matrix. A larger
means there is a strong inconsistency between z scores and LD
matrix. The “null-mle” method obtains mle of
under
,
. The
“null-partialmle” method obtains mle of
under
, in which
is a matrix containing
the of eigenvectors that span the null space of R; that is, the
eigenvectors corresponding to zero eigenvalues of R. The estimated
from “null-partialmle” could be greater than 1. The
“null-pseudomle” method obtains mle of
under
pseudolikelihood
,
.
estimate_s_rss(z, R, n, r_tol = 1e-08, method = "null-mle")
estimate_s_rss(z, R, n, r_tol = 1e-08, method = "null-mle")
z |
A p-vector of z scores. |
R |
A p by p symmetric, positive semidefinite correlation matrix. |
n |
The sample size. (Optional, but highly recommended.) |
r_tol |
Tolerance level for eigenvalue check of positive semidefinite matrix of R. |
method |
a string specifies the method to estimate |
A number between 0 and 1.
set.seed(1) n = 500 p = 1000 beta = rep(0,p) beta[1:4] = 0.01 X = matrix(rnorm(n*p),nrow = n,ncol = p) X = scale(X,center = TRUE,scale = TRUE) y = drop(X %*% beta + rnorm(n)) input_ss = compute_suff_stat(X,y,standardize = TRUE) ss = univariate_regression(X,y) R = cor(X) attr(R,"eigen") = eigen(R,symmetric = TRUE) zhat = with(ss,betahat/sebetahat) # Estimate s using the unadjusted z-scores. s0 = estimate_s_rss(zhat,R) # Estimate s using the adjusted z-scores. s1 = estimate_s_rss(zhat,R,n)
set.seed(1) n = 500 p = 1000 beta = rep(0,p) beta[1:4] = 0.01 X = matrix(rnorm(n*p),nrow = n,ncol = p) X = scale(X,center = TRUE,scale = TRUE) y = drop(X %*% beta + rnorm(n)) input_ss = compute_suff_stat(X,y,standardize = TRUE) ss = univariate_regression(X,y) R = cor(X) attr(R,"eigen") = eigen(R,symmetric = TRUE) zhat = with(ss,betahat/sebetahat) # Estimate s using the unadjusted z-scores. s0 = estimate_s_rss(zhat,R) # Estimate s using the adjusted z-scores. s1 = estimate_s_rss(zhat,R,n)
Data simulated using real genotypes from 50,000 individuals and 200 SNPs. Two of the SNPs have non-zero effects on the multivariate response. The response data are generated under a linear regression model. The simulated response and the columns of the genotype matrix are centered.
FinemappingConvergence
is a list with the following
elements:
Summary statistics computed using the centered and scaled genotype matrix.
Summary statistics computed using the centered and scaled genotype data, and the centered simulated response.
yty is computed using the centered simulated response.
The sample size (50,000).
The coefficients used to simulate the responses.
z-scores from a simple (single-SNP) linear regression.
A similar data set with more SNPs is used in the “Refine SuSiE model” vignette.
data(FinemappingConvergence)
data(FinemappingConvergence)
This function evaluates the correlation between single effect CSs. It is not part of the SuSiE inference. Rather, it is designed as a diagnostic tool to assess how correlated the reported CS are.
get_cs_correlation(model, X = NULL, Xcorr = NULL, max = FALSE)
get_cs_correlation(model, X = NULL, Xcorr = NULL, max = FALSE)
model |
A SuSiE fit, typically an output from
|
X |
n by p matrix of values of the p variables (covariates) in
n samples. When provided, correlation between variables will be
computed and used to remove CSs whose minimum correlation among
variables is smaller than |
Xcorr |
p by p matrix of correlations between variables
(covariates). When provided, it will be used to remove CSs whose
minimum correlation among variables is smaller than
|
max |
When |
A matrix of correlations between CSs, or the maximum
absolute correlation when max = TRUE
.
Under the null, the rss model with regularized LD
matrix is . We use a mixture of
normals to model the conditional distribution of z_j given other z
scores,
,
,
is a grid of fixed positive numbers. We estimate the mixture
weights
We detect the possible allele switch issue
using likelihood ratio for each variant.
kriging_rss( z, R, n, r_tol = 1e-08, s = estimate_s_rss(z, R, n, r_tol, method = "null-mle") )
kriging_rss( z, R, n, r_tol = 1e-08, s = estimate_s_rss(z, R, n, r_tol, method = "null-mle") )
z |
A p-vector of z scores. |
R |
A p by p symmetric, positive semidefinite correlation matrix. |
n |
The sample size. (Optional, but highly recommended.) |
r_tol |
Tolerance level for eigenvalue check of positive semidefinite matrix of R. |
s |
an estimated s from |
a list containing a ggplot2 plot object and a table. The plot compares observed z score vs the expected value. The possible allele switched variants are labeled as red points (log LR > 2 and abs(z) > 2). The table summarizes the conditional distribution for each variant and the likelihood ratio test. The table has the following columns: the observed z scores, the conditional expectation, the conditional variance, the standardized differences between the observed z score and expected value, the log likelihood ratio statistics.
# See also the vignette, "Diagnostic for fine-mapping with summary # statistics." set.seed(1) n = 500 p = 1000 beta = rep(0,p) beta[1:4] = 0.01 X = matrix(rnorm(n*p),nrow = n,ncol = p) X = scale(X,center = TRUE,scale = TRUE) y = drop(X %*% beta + rnorm(n)) ss = univariate_regression(X,y) R = cor(X) attr(R,"eigen") = eigen(R,symmetric = TRUE) zhat = with(ss,betahat/sebetahat) cond_dist = kriging_rss(zhat,R,n = n) cond_dist$plot
# See also the vignette, "Diagnostic for fine-mapping with summary # statistics." set.seed(1) n = 500 p = 1000 beta = rep(0,p) beta[1:4] = 0.01 X = matrix(rnorm(n*p),nrow = n,ncol = p) X = scale(X,center = TRUE,scale = TRUE) y = drop(X %*% beta + rnorm(n)) ss = univariate_regression(X,y) R = cor(X) attr(R,"eigen") = eigen(R,symmetric = TRUE) zhat = with(ss,betahat/sebetahat) cond_dist = kriging_rss(zhat,R,n = n) cond_dist$plot
This data set contains a genotype matrix for 574 individuals and 1,002 variables. The variables are genotypes after centering and scaling, and therefore retain the correlation structure of the original genotype data. Two of the variables have non-zero effects on the multivariate response. The response data are generated under a multivariate linear regression model. See Wang et al (2020) for details.
N2finemapping
is a list with the following elements:
Centered and scaled genotype data.
Chromomsome of the original data, in hg38 coordinates.
Chromomosomal position of the original data, in hg38
coordinates. The information can be used to compare impact of using
other genotype references of the same variables in susie_rss
application.
Simulated effect sizes.
Simulated residual covariance matrix.
Simulated multivariate response.
Allele frequencies based on the original genotype data.
Suggested prior covariance matrix for effect sizes of the two non-zero effect variables.
G. Wang, A. Sarkar, P. Carbonetto and M. Stephens (2020). A simple new approach to variable selection in regression, with application to genetic fine-mapping. Journal of the Royal Statistical Society, Series B doi:10.1101/501114.
data(N2finemapping)
data(N2finemapping)
The data-set contains a matrix of 574 individuals and 1,001 variables. These variables are real-world genotypes centered and scaled, and therefore retains the correlation structure of variables in the original genotype data. 3 out of the variables have non-zero effects. The response data is generated under a multivariate linear regression model. See Wang et al (2020) for more details.
N3finemapping
is a list with the following elements:
N by P variable matrix of centered and scaled genotype data.
Chromomsome of the original data, in hg38 coordinate.
Chromomosomal positoin of the original data, in hg38 coordinate. The information can be used to compare impact of using other genotype references of the same variables in susie_rss application.
The simulated effect sizes.
The simulated residual covariance matrix.
The simulated response variables.
Allele frequency of the original genotype data.
Prior covariance matrix for effect size of the three non-zero effect variables.
G. Wang, A. Sarkar, P. Carbonetto and M. Stephens (2020). A simple new approach to variable selection in regression, with application to genetic fine-mapping. Journal of the Royal Statistical Society, Series B doi:10.1101/501114.
data(N3finemapping)
data(N3finemapping)
Predict outcomes or extract coefficients from susie fit.
## S3 method for class 'susie' predict(object, newx = NULL, type = c("response", "coefficients"), ...)
## S3 method for class 'susie' predict(object, newx = NULL, type = c("response", "coefficients"), ...)
object |
A susie fit. |
newx |
A new value for X at which to do predictions. |
type |
The type of output. For |
... |
Other arguments used by generic predict function. These extra arguments are not used here. |
For type = "response"
, predicted or fitted outcomes
are returned; for type = "coefficients"
, the estimated
coefficients are returned. If the susie fit has intercept =
NA
(which is common when using susie_suff_stat
) then
predictions are computed using an intercept of 0, and a warning is
emitted.
summary
method for the “susie” class.
## S3 method for class 'susie' summary(object, ...) ## S3 method for class 'summary.susie' print(x, ...)
## S3 method for class 'susie' summary(object, ...) ## S3 method for class 'summary.susie' print(x, ...)
object |
A susie fit. |
... |
Additional arguments passed to the generic |
x |
A susie summary. |
summary.susie
returns a list containing a data frame
of variables and a data frame of credible sets.
Data simulated using real genotypes from 10,000 individuals and 200 SNPs. One SNP have non-zero effect on the multivariate response. The response data are generated under a linear regression model. There is also one SNP with flipped allele between summary statistics and the reference panel.
SummaryConsistency
is a list with the following
elements:
z-scores computed by fitting univariate simple regression variable-by-variable.
LD matrix estimated from the reference panel.
The index of the SNP with the flipped allele.
The index of the SNP with the non-zero effect.
A similar data set with more samples is used in the “Diagnostic for fine-mapping with summary statistics” vignette.
data(SummaryConsistency)
data(SummaryConsistency)
Performs a sparse Bayesian multiple linear regression
of y on X, using the "Sum of Single Effects" model from Wang et al
(2020). In brief, this function fits the regression model , where elements of
are i.i.d. normal
with zero mean and variance
residual_variance
, is
an intercept term and
is a vector of length p representing
the effects to be estimated. The “susie assumption” is that
where each
is a vector of
length p with exactly one non-zero element. The prior on the
non-zero element is normal with zero mean and variance
var(y)
* scaled_prior_variance
. The value of L
is fixed, and
should be chosen to provide a reasonable upper bound on the number
of non-zero effects to be detected. Typically, the hyperparameters
residual_variance
and scaled_prior_variance
will be
estimated during model fitting, although they can also be fixed as
specified by the user. See functions susie_get_cs
and
other functions of form susie_get_*
to extract the most
commonly-used results from a susie fit.
susie( X, y, L = min(10, ncol(X)), scaled_prior_variance = 0.2, residual_variance = NULL, prior_weights = NULL, null_weight = 0, standardize = TRUE, intercept = TRUE, estimate_residual_variance = TRUE, estimate_prior_variance = TRUE, estimate_prior_method = c("optim", "EM", "simple"), check_null_threshold = 0, prior_tol = 1e-09, residual_variance_upperbound = Inf, s_init = NULL, coverage = 0.95, min_abs_corr = 0.5, compute_univariate_zscore = FALSE, na.rm = FALSE, max_iter = 100, tol = 0.001, verbose = FALSE, track_fit = FALSE, residual_variance_lowerbound = var(drop(y))/10000, refine = FALSE, n_purity = 100 ) susie_suff_stat( XtX, Xty, yty, n, X_colmeans = NA, y_mean = NA, maf = NULL, maf_thresh = 0, L = 10, scaled_prior_variance = 0.2, residual_variance = NULL, estimate_residual_variance = TRUE, estimate_prior_variance = TRUE, estimate_prior_method = c("optim", "EM", "simple"), check_null_threshold = 0, prior_tol = 1e-09, r_tol = 1e-08, prior_weights = NULL, null_weight = 0, standardize = TRUE, max_iter = 100, s_init = NULL, coverage = 0.95, min_abs_corr = 0.5, tol = 0.001, verbose = FALSE, track_fit = FALSE, check_input = FALSE, refine = FALSE, check_prior = FALSE, n_purity = 100, ... )
susie( X, y, L = min(10, ncol(X)), scaled_prior_variance = 0.2, residual_variance = NULL, prior_weights = NULL, null_weight = 0, standardize = TRUE, intercept = TRUE, estimate_residual_variance = TRUE, estimate_prior_variance = TRUE, estimate_prior_method = c("optim", "EM", "simple"), check_null_threshold = 0, prior_tol = 1e-09, residual_variance_upperbound = Inf, s_init = NULL, coverage = 0.95, min_abs_corr = 0.5, compute_univariate_zscore = FALSE, na.rm = FALSE, max_iter = 100, tol = 0.001, verbose = FALSE, track_fit = FALSE, residual_variance_lowerbound = var(drop(y))/10000, refine = FALSE, n_purity = 100 ) susie_suff_stat( XtX, Xty, yty, n, X_colmeans = NA, y_mean = NA, maf = NULL, maf_thresh = 0, L = 10, scaled_prior_variance = 0.2, residual_variance = NULL, estimate_residual_variance = TRUE, estimate_prior_variance = TRUE, estimate_prior_method = c("optim", "EM", "simple"), check_null_threshold = 0, prior_tol = 1e-09, r_tol = 1e-08, prior_weights = NULL, null_weight = 0, standardize = TRUE, max_iter = 100, s_init = NULL, coverage = 0.95, min_abs_corr = 0.5, tol = 0.001, verbose = FALSE, track_fit = FALSE, check_input = FALSE, refine = FALSE, check_prior = FALSE, n_purity = 100, ... )
X |
An n by p matrix of covariates. |
y |
The observed responses, a vector of length n. |
L |
Maximum number of non-zero effects in the susie regression model. If L is larger than the number of covariates, p, L is set to p. |
scaled_prior_variance |
The prior variance, divided by
|
residual_variance |
Variance of the residual. If
|
prior_weights |
A vector of length p, in which each entry gives the prior probability that corresponding column of X has a nonzero effect on the outcome, y. |
null_weight |
Prior probability of no effect (a number between 0 and 1, and cannot be exactly 1). |
standardize |
If |
intercept |
If |
estimate_residual_variance |
If
|
estimate_prior_variance |
If |
estimate_prior_method |
The method used for estimating prior
variance. When |
check_null_threshold |
When the prior variance is estimated,
compare the estimate with the null, and set the prior variance to
zero unless the log-likelihood using the estimate is larger by this
threshold amount. For example, if you set
|
prior_tol |
When the prior variance is estimated, compare the
estimated value to |
residual_variance_upperbound |
Upper limit on the estimated
residual variance. It is only relevant when
|
s_init |
A previous susie fit with which to initialize. |
coverage |
A number between 0 and 1 specifying the “coverage” of the estimated confidence sets. |
min_abs_corr |
Minimum absolute correlation allowed in a credible set. The default, 0.5, corresponds to a squared correlation of 0.25, which is a commonly used threshold for genotype data in genetic studies. |
compute_univariate_zscore |
If |
na.rm |
Drop any missing values in y from both X and y. |
max_iter |
Maximum number of IBSS iterations to perform. |
tol |
A small, non-negative number specifying the convergence
tolerance for the IBSS fitting procedure. The fitting procedure
will halt when the difference in the variational lower bound, or
“ELBO” (the objective function to be maximized), is
less than |
verbose |
If |
track_fit |
If |
residual_variance_lowerbound |
Lower limit on the estimated
residual variance. It is only relevant when
|
refine |
If |
n_purity |
Passed as argument |
XtX |
A p by p matrix |
Xty |
A p-vector |
yty |
A scalar |
n |
The sample size. |
X_colmeans |
A p-vector of column means of |
y_mean |
A scalar containing the mean of |
maf |
Minor allele frequency; to be used along with
|
maf_thresh |
Variants having a minor allele frequency smaller than this threshold are not used. |
r_tol |
Tolerance level for eigenvalue check of positive semidefinite matrix of R. |
check_input |
If |
check_prior |
If |
... |
Additional arguments to provide backward compatibility
with earlier versions of |
The function susie
implements the IBSS algorithm
from Wang et al (2020). The option refine = TRUE
implements
an additional step to help reduce problems caused by convergence of
the IBSS algorithm to poor local optima (which is rare in our
experience, but can provide misleading results when it occurs). The
refinement step incurs additional computational expense that
increases with the number of CSs found in the initial run.
The function susie_suff_stat
implements essentially the same
algorithms, but using sufficient statistics. (The statistics are
sufficient for the regression coefficients , but not for the
intercept
; see below for how the intercept is treated.)
If the sufficient statistics are computed correctly then the
results from
susie_suff_stat
should be the same as (or very
similar to) susie
, although runtimes will differ as
discussed below. The sufficient statistics are the sample
size n
, and then the p by p matrix , the p-vector
, and the sum of squared y values
, all computed
after centering the columns of
and the vector
to
have mean 0; these can be computed using
compute_suff_stat
.
The handling of the intercept term in susie_suff_stat
needs
some additional explanation. Computing the summary data after
centering X
and y
effectively ensures that the
resulting posterior quantities for allow for an intercept
in the model; however, the actual value of the intercept cannot be
estimated from these centered data. To estimate the intercept term
the user must also provide the column means of
and the mean
of
(
X_colmeans
and y_mean
). If these are not
provided, they are treated as NA
, which results in the
intercept being NA
. If for some reason you prefer to have
the intercept be 0 instead of NA
then set
X_colmeans = 0,y_mean = 0
.
For completeness, we note that if susie_suff_stat
is run on
computed without centering
and
, and with
X_colmeans = 0,y_mean = 0
, this is
equivalent to susie
applied to with
intercept = FALSE
(although results may differ due to
different initializations of residual_variance
and
scaled_prior_variance
). However, this usage is not
recommended for for most situations.
The computational complexity of susie
is per
iteration, whereas
susie_suff_stat
is per
iteration (not including the cost of computing the sufficient
statistics, which is dominated by the
cost of
computing
). Because of the cost of computing
,
susie
will usually be faster. However, if ,
and/or if
is already computed, then
susie_suff_stat
may be faster.
A "susie"
object with some or all of the following
elements:
alpha |
An L by p matrix of posterior inclusion probabilites. |
mu |
An L by p matrix of posterior means, conditional on inclusion. |
mu2 |
An L by p matrix of posterior second moments, conditional on inclusion. |
Xr |
A vector of length n, equal to |
lbf |
log-Bayes Factor for each single effect. |
lbf_variable |
log-Bayes Factor for each variable and single effect. |
intercept |
Intercept (fixed or estimated). |
sigma2 |
Residual variance (fixed or estimated). |
V |
Prior variance of the non-zero elements of b, equal to
|
elbo |
The value of the variational lower bound, or “ELBO” (objective function to be maximized), achieved at each iteration of the IBSS fitting procedure. |
fitted |
Vector of length n containing the fitted values of the outcome. |
sets |
Credible sets estimated from model fit; see
|
pip |
A vector of length p giving the (marginal) posterior inclusion probabilities for all p covariates. |
z |
A vector of univariate z-scores. |
niter |
Number of IBSS iterations that were performed. |
converged |
|
susie_suff_stat
returns also outputs:
XtXr |
A p-vector of |
G. Wang, A. Sarkar, P. Carbonetto and M. Stephens (2020). A simple new approach to variable selection in regression, with application to genetic fine-mapping. Journal of the Royal Statistical Society, Series B 82, 1273-1300 doi:10.1101/501114.
Y. Zou, P. Carbonetto, G. Wang, G and M. Stephens (2022). Fine-mapping from summary data with the “Sum of Single Effects” model. PLoS Genetics 18, e1010299. doi:10.1371/journal.pgen.1010299.
susie_get_cs
and other susie_get_*
functions for extracting results; susie_trendfilter
for
applying the SuSiE model to non-parametric regression, particularly
changepoint problems, and susie_rss
for applying the
SuSiE model when one only has access to limited summary statistics
related to and
(typically in genetic applications).
# susie example set.seed(1) n = 1000 p = 1000 beta = rep(0,p) beta[1:4] = 1 X = matrix(rnorm(n*p),nrow = n,ncol = p) X = scale(X,center = TRUE,scale = TRUE) y = drop(X %*% beta + rnorm(n)) res1 = susie(X,y,L = 10) susie_get_cs(res1) # extract credible sets from fit plot(beta,coef(res1)[-1]) abline(a = 0,b = 1,col = "skyblue",lty = "dashed") plot(y,predict(res1)) abline(a = 0,b = 1,col = "skyblue",lty = "dashed") # susie_suff_stat example input_ss = compute_suff_stat(X,y) res2 = with(input_ss, susie_suff_stat(XtX = XtX,Xty = Xty,yty = yty,n = n, X_colmeans = X_colmeans,y_mean = y_mean,L = 10)) plot(coef(res1),coef(res2)) abline(a = 0,b = 1,col = "skyblue",lty = "dashed")
# susie example set.seed(1) n = 1000 p = 1000 beta = rep(0,p) beta[1:4] = 1 X = matrix(rnorm(n*p),nrow = n,ncol = p) X = scale(X,center = TRUE,scale = TRUE) y = drop(X %*% beta + rnorm(n)) res1 = susie(X,y,L = 10) susie_get_cs(res1) # extract credible sets from fit plot(beta,coef(res1)[-1]) abline(a = 0,b = 1,col = "skyblue",lty = "dashed") plot(y,predict(res1)) abline(a = 0,b = 1,col = "skyblue",lty = "dashed") # susie_suff_stat example input_ss = compute_suff_stat(X,y) res2 = with(input_ss, susie_suff_stat(XtX = XtX,Xty = Xty,yty = yty,n = n, X_colmeans = X_colmeans,y_mean = y_mean,L = 10)) plot(coef(res1),coef(res2)) abline(a = 0,b = 1,col = "skyblue",lty = "dashed")
susie_auto
is an attempt to automate reliable
running of susie even on hard problems. It implements a three-stage
strategy for each L: first, fit susie with very small residual
error; next, estimate residual error; finally, estimate the prior
variance. If the last step estimates some prior variances to be
zero, stop. Otherwise, double L, and repeat. Initial runs are
performed with relaxed tolerance; the final run is performed using
the default susie tolerance.
susie_auto( X, y, L_init = 1, L_max = 512, verbose = FALSE, init_tol = 1, standardize = TRUE, intercept = TRUE, max_iter = 100, tol = 0.01, ... )
susie_auto( X, y, L_init = 1, L_max = 512, verbose = FALSE, init_tol = 1, standardize = TRUE, intercept = TRUE, max_iter = 100, tol = 0.01, ... )
X |
An n by p matrix of covariates. |
y |
The observed responses, a vector of length n. |
L_init |
The initial value of L. |
L_max |
The largest value of L to consider. |
verbose |
If |
init_tol |
The tolerance to passed to |
standardize |
If |
intercept |
If |
max_iter |
Maximum number of IBSS iterations to perform. |
tol |
A small, non-negative number specifying the convergence
tolerance for the IBSS fitting procedure. The fitting procedure
will halt when the difference in the variational lower bound, or
“ELBO” (the objective function to be maximized), is
less than |
... |
Additional arguments passed to |
See susie
for a description of return values.
set.seed(1) n = 1000 p = 1000 beta = rep(0,p) beta[1:4] = 1 X = matrix(rnorm(n*p),nrow = n,ncol = p) X = scale(X,center = TRUE,scale = TRUE) y = drop(X %*% beta + rnorm(n)) res = susie_auto(X,y) plot(beta,coef(res)[-1]) abline(a = 0,b = 1,col = "skyblue",lty = "dashed") plot(y,predict(res)) abline(a = 0,b = 1,col = "skyblue",lty = "dashed")
set.seed(1) n = 1000 p = 1000 beta = rep(0,p) beta[1:4] = 1 X = matrix(rnorm(n*p),nrow = n,ncol = p) X = scale(X,center = TRUE,scale = TRUE) y = drop(X %*% beta + rnorm(n)) res = susie_auto(X,y) plot(beta,coef(res)[-1]) abline(a = 0,b = 1,col = "skyblue",lty = "dashed") plot(y,predict(res)) abline(a = 0,b = 1,col = "skyblue",lty = "dashed")
These functions access basic properties or draw inferences from a fitted susie model.
susie_get_objective(res, last_only = TRUE, warning_tol = 1e-06) susie_get_posterior_mean(res, prior_tol = 1e-09) susie_get_posterior_sd(res, prior_tol = 1e-09) susie_get_niter(res) susie_get_prior_variance(res) susie_get_residual_variance(res) susie_get_lfsr(res) susie_get_posterior_samples(susie_fit, num_samples) susie_get_cs( res, X = NULL, Xcorr = NULL, coverage = 0.95, min_abs_corr = 0.5, dedup = TRUE, squared = FALSE, check_symmetric = TRUE, n_purity = 100, use_rfast ) susie_get_pip(res, prune_by_cs = FALSE, prior_tol = 1e-09)
susie_get_objective(res, last_only = TRUE, warning_tol = 1e-06) susie_get_posterior_mean(res, prior_tol = 1e-09) susie_get_posterior_sd(res, prior_tol = 1e-09) susie_get_niter(res) susie_get_prior_variance(res) susie_get_residual_variance(res) susie_get_lfsr(res) susie_get_posterior_samples(susie_fit, num_samples) susie_get_cs( res, X = NULL, Xcorr = NULL, coverage = 0.95, min_abs_corr = 0.5, dedup = TRUE, squared = FALSE, check_symmetric = TRUE, n_purity = 100, use_rfast ) susie_get_pip(res, prune_by_cs = FALSE, prior_tol = 1e-09)
res |
A susie fit, typically an output from
|
last_only |
If |
warning_tol |
Warn if ELBO is decreasing by this tolerance level. |
prior_tol |
Filter out effects having estimated prior variance smaller than this threshold. |
susie_fit |
A susie fit, an output from |
num_samples |
The number of draws from the posterior distribution. |
X |
n by p matrix of values of the p variables (covariates) in
n samples. When provided, correlation between variables will be
computed and used to remove CSs whose minimum correlation among
variables is smaller than |
Xcorr |
p by p matrix of correlations between variables
(covariates). When provided, it will be used to remove CSs whose
minimum correlation among variables is smaller than
|
coverage |
A number between 0 and 1 specifying desired coverage of each CS. |
min_abs_corr |
A "purity" threshold for the CS. Any CS that contains a pair of variables with correlation less than this threshold will be filtered out and not reported. |
dedup |
If |
squared |
If |
check_symmetric |
If |
n_purity |
The maximum number of credible set (CS) variables used in calculating the correlation (“purity”) statistics. When the number of variables included in the CS is greater than this number, the CS variables are randomly subsampled. |
use_rfast |
Use the Rfast package for the purity calculations.
By default |
prune_by_cs |
Whether or not to ignore single effects not in a reported CS when calculating PIP. |
susie_get_objective
returns the evidence lower bound
(ELBO) achieved by the fitted susie model and, optionally, at each
iteration of the IBSS fitting procedure.
susie_get_residual_variance
returns the (estimated or
fixed) residual variance parameter.
susie_get_prior_variance
returns the (estimated or fixed)
prior variance parameters.
susie_get_posterior_mean
returns the posterior mean for the
regression coefficients of the fitted susie model.
susie_get_posterior_sd
returns the posterior standard
deviation for coefficients of the fitted susie model.
susie_get_niter
returns the number of model fitting
iterations performed.
susie_get_pip
returns a vector containing the posterior
inclusion probabilities (PIPs) for all variables.
susie_get_lfsr
returns a vector containing the average lfsr
across variables for each single-effect, weighted by the posterior
inclusion probability (alpha).
susie_get_posterior_samples
returns a list containing the
effect sizes samples and causal status with two components: b
,
an num_variables
x num_samples
matrix of effect
sizes; gamma
, an num_variables
x num_samples
matrix of causal status random draws.
susie_get_cs
returns credible sets (CSs) from a susie fit,
as well as summaries of correlation among the variables included in
each CS. If desired, one can filter out CSs that do not meet a
specified “purity” threshold; to do this, either X
or
Xcorr
must be supplied. It returns a list with the following
elements:
cs |
A list in which each list element is a vector containing the indices of the variables in the CS. |
coverage |
The nominal coverage specified for each CS. |
purity |
If |
cs_index |
If |
set.seed(1) n = 1000 p = 1000 beta = rep(0,p) beta[1:4] = 1 X = matrix(rnorm(n*p),nrow = n,ncol = p) X = scale(X,center = TRUE,scale = TRUE) y = drop(X %*% beta + rnorm(n)) s = susie(X,y,L = 10) susie_get_objective(s) susie_get_objective(s, last_only=FALSE) susie_get_residual_variance(s) susie_get_prior_variance(s) susie_get_posterior_mean(s) susie_get_posterior_sd(s) susie_get_niter(s) susie_get_pip(s) susie_get_lfsr(s)
set.seed(1) n = 1000 p = 1000 beta = rep(0,p) beta[1:4] = 1 X = matrix(rnorm(n*p),nrow = n,ncol = p) X = scale(X,center = TRUE,scale = TRUE) y = drop(X %*% beta + rnorm(n)) s = susie(X,y,L = 10) susie_get_objective(s) susie_get_objective(s, last_only=FALSE) susie_get_residual_variance(s) susie_get_prior_variance(s) susie_get_posterior_mean(s) susie_get_posterior_sd(s) susie_get_niter(s) susie_get_pip(s) susie_get_lfsr(s)
Initialize a susie object using regression coefficients
susie_init_coef(coef_index, coef_value, p)
susie_init_coef(coef_index, coef_value, p)
coef_index |
An L-vector containing the the indices of the nonzero coefficients. |
coef_value |
An L-vector containing initial coefficient estimates. |
p |
A scalar giving the number of variables. |
A list with elements alpha
, mu
and mu2
to be used by susie
.
set.seed(1) n = 1000 p = 1000 beta = rep(0,p) beta[sample(1:1000,4)] = 1 X = matrix(rnorm(n*p),nrow = n,ncol = p) X = scale(X,center = TRUE,scale = TRUE) y = drop(X %*% beta + rnorm(n)) # Initialize susie to ground-truth coefficients. s = susie_init_coef(which(beta != 0),beta[beta != 0],length(beta)) res = susie(X,y,L = 10,s_init=s)
set.seed(1) n = 1000 p = 1000 beta = rep(0,p) beta[sample(1:1000,4)] = 1 X = matrix(rnorm(n*p),nrow = n,ncol = p) X = scale(X,center = TRUE,scale = TRUE) y = drop(X %*% beta + rnorm(n)) # Initialize susie to ground-truth coefficients. s = susie_init_coef(which(beta != 0),beta[beta != 0],length(beta)) res = susie(X,y,L = 10,s_init=s)
susie_plot
produces a per-variable summary of
the SuSiE credible sets. susie_plot_iteration
produces a
diagnostic plot for the susie model fitting. For
susie_plot_iteration
, several plots will be created if
track_fit = TRUE
when calling susie
.
susie_plot( model, y, add_bar = FALSE, pos = NULL, b = NULL, max_cs = 400, add_legend = NULL, ... ) susie_plot_iteration(model, L, file_prefix, pos = NULL)
susie_plot( model, y, add_bar = FALSE, pos = NULL, b = NULL, max_cs = 400, add_legend = NULL, ... ) susie_plot_iteration(model, L, file_prefix, pos = NULL)
model |
A SuSiE fit, typically an output from
|
y |
A string indicating what to plot: either |
add_bar |
If |
pos |
Indices of variables to plot. If |
b |
For simulated data, set |
max_cs |
The largest credible set to display, either based on
purity (set |
add_legend |
If |
... |
Additional arguments passed to
|
L |
An integer specifying the number of credible sets to plot. |
file_prefix |
Prefix to path of output plot file. If not
specified, the plot, or plots, will be saved to a temporary
directory generated using |
Invisibly returns NULL
.
set.seed(1) n = 1000 p = 1000 beta = rep(0,p) beta[sample(1:1000,4)] = 1 X = matrix(rnorm(n*p),nrow = n,ncol = p) X = scale(X,center = TRUE,scale = TRUE) y = drop(X %*% beta + rnorm(n)) res = susie(X,y,L = 10) susie_plot(res,"PIP") susie_plot(res,"PIP",add_bar = TRUE) susie_plot(res,"PIP",add_legend = TRUE) susie_plot(res,"PIP", pos=1:500, add_legend = TRUE) # Plot selected regions with adjusted x-axis position label res$genomic_position = 1000 + (1:length(res$pip)) susie_plot(res,"PIP",add_legend = TRUE, pos = list(attr = "genomic_position",start = 1000,end = 1500)) # True effects are shown in red. susie_plot(res,"PIP",b = beta,add_legend = TRUE) set.seed(1) n = 1000 p = 1000 beta = rep(0,p) beta[sample(1:1000,4)] = 1 X = matrix(rnorm(n*p),nrow = n,ncol = p) X = scale(X,center = TRUE,scale = TRUE) y = drop(X %*% beta + rnorm(n)) res = susie(X,y,L = 10) susie_plot_iteration(res, L=10)
set.seed(1) n = 1000 p = 1000 beta = rep(0,p) beta[sample(1:1000,4)] = 1 X = matrix(rnorm(n*p),nrow = n,ncol = p) X = scale(X,center = TRUE,scale = TRUE) y = drop(X %*% beta + rnorm(n)) res = susie(X,y,L = 10) susie_plot(res,"PIP") susie_plot(res,"PIP",add_bar = TRUE) susie_plot(res,"PIP",add_legend = TRUE) susie_plot(res,"PIP", pos=1:500, add_legend = TRUE) # Plot selected regions with adjusted x-axis position label res$genomic_position = 1000 + (1:length(res$pip)) susie_plot(res,"PIP",add_legend = TRUE, pos = list(attr = "genomic_position",start = 1000,end = 1500)) # True effects are shown in red. susie_plot(res,"PIP",b = beta,add_legend = TRUE) set.seed(1) n = 1000 p = 1000 beta = rep(0,p) beta[sample(1:1000,4)] = 1 X = matrix(rnorm(n*p),nrow = n,ncol = p) X = scale(X,center = TRUE,scale = TRUE) y = drop(X %*% beta + rnorm(n)) res = susie(X,y,L = 10) susie_plot_iteration(res, L=10)
Plots original data, y, overlaid with line showing susie fitted value and shaded rectangles showing credible sets for changepoint locations.
susie_plot_changepoint( s, y, line_col = "blue", line_size = 1.5, cs_col = "red" )
susie_plot_changepoint( s, y, line_col = "blue", line_size = 1.5, cs_col = "red" )
s |
A susie fit generated by
|
y |
An n-vector of observations that are ordered in time or space (assumed equally-spaced). |
line_col |
Color for the line showing fitted values. |
line_size |
Size of the lines showing fitted values |
cs_col |
Color of the shaded rectangles showing credible sets. |
A ggplot2 plot object.
set.seed(1) mu = c(rep(0,50),rep(1,50),rep(3,50),rep(-2,50),rep(0,300)) y = mu + rnorm(500) # Here we use a less sensitive tolerance so that the example takes # less time; in practice you will likely want to use a more stringent # setting such as tol = 0.001. s = susie_trendfilter(y,tol = 0.1) # Produces ggplot with credible sets for changepoints. susie_plot_changepoint(s,y)
set.seed(1) mu = c(rep(0,50),rep(1,50),rep(3,50),rep(-2,50),rep(0,300)) y = mu + rnorm(500) # Here we use a less sensitive tolerance so that the example takes # less time; in practice you will likely want to use a more stringent # setting such as tol = 0.001. s = susie_trendfilter(y,tol = 0.1) # Produces ggplot with credible sets for changepoints. susie_plot_changepoint(s,y)
susie_rss
performs variable selection under a
sparse Bayesian multiple linear regression of on
using the z-scores from standard univariate regression
of
on each column of
, an estimate,
, of
the correlation matrix for the columns of
, and optionally,
but strongly recommended, the sample size n. See
“Details” for other ways to call
susie_rss
susie_rss( z, R, n, bhat, shat, var_y, z_ld_weight = 0, estimate_residual_variance = FALSE, prior_variance = 50, check_prior = TRUE, ... )
susie_rss( z, R, n, bhat, shat, var_y, z_ld_weight = 0, estimate_residual_variance = FALSE, prior_variance = 50, check_prior = TRUE, ... )
z |
p-vector of z-scores. |
R |
p x p correlation matrix. |
n |
The sample size. |
bhat |
Alternative summary data giving the estimated effects
(a vector of length p). This, together with |
shat |
Alternative summary data giving the standard errors of
the estimated effects (a vector of length p). This, together with
|
var_y |
The sample variance of y, defined as |
z_ld_weight |
This parameter is included for backwards
compatibility with previous versions of the function, but it is no
longer recommended to set this to a non-zero value. When
|
estimate_residual_variance |
The default is FALSE, the
residual variance is fixed to 1 or variance of y. If the in-sample
LD matrix is provided, we recommend setting
|
prior_variance |
The prior variance(s) for the non-zero
noncentrality parameterss |
check_prior |
When |
... |
Other parameters to be passed to
|
In some applications, particularly genetic applications,
it is desired to fit a regression model ( say,
which we refer to as "the original regression model" or ORM)
without access to the actual values of
and
, but
given only some summary statistics.
susie_rss
assumes
availability of z-scores from standard univariate regression of
on each column of
, and an estimate,
, of the
correlation matrix for the columns of
(in genetic
applications
is sometimes called the “LD matrix”).
With the inputs z
, R
and sample size n
,
susie_rss
computes PVE-adjusted z-scores z_tilde
, and
calls susie_suff_stat
with XtX = (n-1)R
, Xty =
,
yty = n-1
, n = n
. The
output effect estimates are on the scale of in the ORM with
standardized
and
. When the LD matrix
R
and the z-scores z
are computed using the same
matrix , the results from
susie_rss
are same as, or
very similar to, susie
with standardized and
.
Alternatively, if the user provides n
, bhat
(the
univariate OLS estimates from regressing on each column of
),
shat
(the standard errors from these OLS
regressions), the in-sample correlation matrix , and the variance of
, the results
from
susie_rss
are same as susie
with and
. The effect estimates are on the same scale as the
coefficients
in the ORM with
and
.
In rare cases in which the sample size, , is unknown,
susie_rss
calls susie_suff_stat
with XtX = R
and Xty = z
, and with residual_variance = 1
. The
underlying assumption of performing the analysis in this way is
that the sample size is large (i.e., infinity), and/or the
effects are small. More formally, this combines the log-likelihood
for the noncentrality parameters, ,
with the “susie prior” on
; see
susie
and Wang et al
(2020) for details. In this case, the effect estimates returned by
susie_rss
are on the noncentrality parameter scale.
The estimate_residual_variance
setting is FALSE
by
default, which is recommended when the LD matrix is estimated from
a reference panel. When the LD matrix R
and the summary
statistics z
(or bhat
, shat
) are computed
using the same matrix , we recommend setting
estimate_residual_variance = TRUE
.
A "susie"
object with the following
elements:
alpha |
An L by p matrix of posterior inclusion probabilites. |
mu |
An L by p matrix of posterior means, conditional on inclusion. |
mu2 |
An L by p matrix of posterior second moments, conditional on inclusion. |
lbf |
log-Bayes Factor for each single effect. |
lbf_variable |
log-Bayes Factor for each variable and single effect. |
V |
Prior variance of the non-zero elements of b. |
elbo |
The value of the variational lower bound, or “ELBO” (objective function to be maximized), achieved at each iteration of the IBSS fitting procedure. |
sets |
Credible sets estimated from model fit; see
|
pip |
A vector of length p giving the (marginal) posterior inclusion probabilities for all p covariates. |
niter |
Number of IBSS iterations that were performed. |
converged |
|
G. Wang, A. Sarkar, P. Carbonetto and M. Stephens (2020). A simple new approach to variable selection in regression, with application to genetic fine-mapping. Journal of the Royal Statistical Society, Series B 82, 1273-1300 doi:10.1101/501114.
Y. Zou, P. Carbonetto, G. Wang, G and M. Stephens (2022). Fine-mapping from summary data with the “Sum of Single Effects” model. PLoS Genetics 18, e1010299. doi:10.1371/journal.pgen.1010299.
set.seed(1) n = 1000 p = 1000 beta = rep(0,p) beta[1:4] = 1 X = matrix(rnorm(n*p),nrow = n,ncol = p) X = scale(X,center = TRUE,scale = TRUE) y = drop(X %*% beta + rnorm(n)) input_ss = compute_suff_stat(X,y,standardize = TRUE) ss = univariate_regression(X,y) R = with(input_ss,cov2cor(XtX)) zhat = with(ss,betahat/sebetahat) res = susie_rss(zhat,R, n=n) # Toy example illustrating behaviour susie_rss when the z-scores # are mostly consistent with a non-invertible correlation matrix. # Here the CS should contain both variables, and two PIPs should # be nearly the same. z = c(6,6.01) R = matrix(1,2,2) fit = susie_rss(z,R) print(fit$sets$cs) print(fit$pip) # In this second toy example, the only difference is that one # z-score is much larger than the other. Here we expect that the # second PIP will be much larger than the first. z = c(6,7) R = matrix(1,2,2) fit = susie_rss(z,R) print(fit$sets$cs) print(fit$pip)
set.seed(1) n = 1000 p = 1000 beta = rep(0,p) beta[1:4] = 1 X = matrix(rnorm(n*p),nrow = n,ncol = p) X = scale(X,center = TRUE,scale = TRUE) y = drop(X %*% beta + rnorm(n)) input_ss = compute_suff_stat(X,y,standardize = TRUE) ss = univariate_regression(X,y) R = with(input_ss,cov2cor(XtX)) zhat = with(ss,betahat/sebetahat) res = susie_rss(zhat,R, n=n) # Toy example illustrating behaviour susie_rss when the z-scores # are mostly consistent with a non-invertible correlation matrix. # Here the CS should contain both variables, and two PIPs should # be nearly the same. z = c(6,6.01) R = matrix(1,2,2) fit = susie_rss(z,R) print(fit$sets$cs) print(fit$pip) # In this second toy example, the only difference is that one # z-score is much larger than the other. Here we expect that the # second PIP will be much larger than the first. z = c(6,7) R = matrix(1,2,2) fit = susie_rss(z,R) print(fit$sets$cs) print(fit$pip)
Fits the non-parametric Gaussian regression model
, where the mean
is modelled as
, X is a matrix with columns containing an appropriate basis,
and b is vector with a (sparse) SuSiE prior. In particular, when
order = 0
, the jth column of X is a vector with the first j
elements equal to zero, and the remaining elements equal to 1, so
that corresponds to the change in the mean of y between
indices j and j+1. For background on trend filtering, see
Tibshirani (2014). See also the "Trend filtering" vignette,
vignette("trend_filtering")
.
susie_trendfilter(y, order = 0, standardize = FALSE, use_mad = TRUE, ...)
susie_trendfilter(y, order = 0, standardize = FALSE, use_mad = TRUE, ...)
y |
An n-vector of observations ordered in time or space (assumed to be equally spaced). |
order |
An integer specifying the order of trend filtering.
The default, |
standardize |
Logical indicating whether to standardize the X
variables ("basis functions"); |
use_mad |
Logical indicating whether to use the "median
absolute deviation" (MAD) method to the estimate residual
variance. If |
... |
Other arguments passed to |
This implementation exploits the special structure of X,
which means that the matrix-vector product is fast to
compute; in particular, the computation time is
rather
than
if
X
were formed explicitly. For
implementation details, see the "Implementation of SuSiE trend
filtering" vignette by running
vignette("trendfiltering_derivations")
.
A "susie" fit; see susie
for details.
R. J. Tibshirani (2014). Adaptive piecewise polynomial estimation via trend filtering. Annals of Statistics 42, 285-323.
set.seed(1) mu = c(rep(0,50),rep(1,50),rep(3,50),rep(-2,50),rep(0,200)) y = mu + rnorm(400) s = susie_trendfilter(y) plot(y) lines(mu,col = 1,lwd = 3) lines(predict(s),col = 2,lwd = 2) # Calculate credible sets (indices of y that occur just before # changepoints). susie_get_cs(s) # Plot with credible sets for changepoints. susie_plot_changepoint(s,y)
set.seed(1) mu = c(rep(0,50),rep(1,50),rep(3,50),rep(-2,50),rep(0,200)) y = mu + rnorm(400) s = susie_trendfilter(y) plot(y) lines(mu,col = 1,lwd = 3) lines(predict(s),col = 2,lwd = 2) # Calculate credible sets (indices of y that occur just before # changepoints). susie_get_cs(s) # Plot with credible sets for changepoints. susie_plot_changepoint(s,y)
This function performs the univariate linear
regression y ~ x separately for each column x of X. Each regression
is implemented using .lm.fit()
. The estimated effect size
and stardard error for each variable are outputted.
univariate_regression( X, y, Z = NULL, center = TRUE, scale = FALSE, return_residuals = FALSE )
univariate_regression( X, y, Z = NULL, center = TRUE, scale = FALSE, return_residuals = FALSE )
X |
n by p matrix of regressors. |
y |
n-vector of response variables. |
Z |
Optional n by k matrix of covariates to be included in all
regresions. If Z is not |
center |
If |
scale |
If |
return_residuals |
Whether or not to output the residuals if Z
is not |
A list with two vectors containing the least-squares
estimates of the coefficients (betahat
) and their standard
errors (sebetahat
). Optionally, and only when a matrix of
covariates Z
is provided, a third vector residuals
containing the residuals is returned.
set.seed(1) n = 1000 p = 1000 beta = rep(0,p) beta[1:4] = 1 X = matrix(rnorm(n*p),nrow = n,ncol = p) X = scale(X,center = TRUE,scale = TRUE) y = drop(X %*% beta + rnorm(n)) res = univariate_regression(X,y) plot(res$betahat/res$sebetahat)
set.seed(1) n = 1000 p = 1000 beta = rep(0,p) beta[1:4] = 1 X = matrix(rnorm(n*p),nrow = n,ncol = p) X = scale(X,center = TRUE,scale = TRUE) y = drop(X %*% beta + rnorm(n)) res = univariate_regression(X,y) plot(res$betahat/res$sebetahat)