Title: | Fast Algorithms for Generalized Principal Component Analysis |
---|---|
Description: | Implements fast, scalable optimization algorithms for fitting generalized principal components analysis (GLM-PCA) models, as described in "A Generalization of Principal Components Analysis to the Exponential Family" Collins M, Dasgupta S, Schapire RE (2002, ISBN:9780262271738), and subsequently "Feature Selection and Dimension Reduction for Single-Cell RNA-Seq Based on a Multinomial Model" Townes FW, Hicks SC, Aryee MJ, Irizarry RA (2019) <doi:10.1186/s13059-019-1861-6>. |
Authors: | Eric Weine [aut, cre], Peter Carbonetto [aut], Matthew Stephens [aut] |
Maintainer: | Eric Weine <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.1-106 |
Built: | 2024-11-21 04:52:07 UTC |
Source: | https://github.com/stephenslab/fastglmpca |
Fit a Poisson GLM-PCA model by maximum-likelihood.
fit_glmpca_pois( Y, K, fit0 = init_glmpca_pois(Y, K), verbose = TRUE, control = list() ) fit_glmpca_pois_control_default() init_glmpca_pois( Y, K, U, V, X = numeric(0), Z = numeric(0), B = numeric(0), W = numeric(0), fixed_b_cols = numeric(0), fixed_w_cols = numeric(0), col_size_factor = TRUE, row_intercept = TRUE )
fit_glmpca_pois( Y, K, fit0 = init_glmpca_pois(Y, K), verbose = TRUE, control = list() ) fit_glmpca_pois_control_default() init_glmpca_pois( Y, K, U, V, X = numeric(0), Z = numeric(0), B = numeric(0), W = numeric(0), fixed_b_cols = numeric(0), fixed_w_cols = numeric(0), col_size_factor = TRUE, row_intercept = TRUE )
Y |
The n x m matrix of counts; all entries of |
K |
Integer 1 or greater specifying the rank of the matrix
factorization. This should only be provided if the initial fit
( |
fit0 |
Initial model fit. It should be an object of class
“glmpca_fit_pois”, such as an output from
|
verbose |
If |
control |
List of control parameters to modify behavior of the optimization algorithm; see “Details”. |
U |
An optional argument giving the initial estimate of the
loadings matrix. It should be an n x K matrix, where n is the
number of rows in the counts matrix |
V |
An optional argument giving is the initial estimate of the
factors matrix. It should be a m x K matrix, where m is the number
of columns in the counts matrix |
X |
Optional argument giving row covariates of the count
matrix |
Z |
Optional argument giving column covariates of the count
matrix |
B |
Optional argument giving the initial estimates for the coefficients of the row covariates. It should be an m x nx matrix, where nx is the number of row covariates. This argument is ignored if X is not provided. |
W |
Optional argument giving the initial estimates for the coefficients of the column covariates. It should be an n x nz matrix, where nz is the number of column covariates. This argument is ignored if Z is not provided. |
fixed_b_cols |
Optional numeric vector specifying which
columns of |
fixed_w_cols |
Optional numeric vector specifying which
columns of |
col_size_factor |
If |
row_intercept |
If |
In generalized principal component analysis (GLM-PCA)
based on a Poisson likelihood, the counts stored in an
matrix
are modeled as
in which the logarithm of each rate
parameter is defined as a linear combination of
rank-K matrices to be estimated from the data:
where and
are
orthogonal matrices of dimension
and
, respectively, and
is a diagonal
matrix in which the entries along its diagonal are
positive and decreasing.
is a tuning parameter specifying
the rank of the matrix factorization. This is the same as the
low-rank matrix decomposition underlying PCA (that is, the singular
value decomposition), but because we are not using a linear
(Gaussian) model, this is called “generalized PCA” or
“GLM PCA”.
To allow for additional components that may be fixed,
fit_glmpca_pois
can also fit the more general model
in which
,
are fixed matrices of dimension
and
, respectively, and
,
are matrices of dimension
and
to be estimated from the data.
fit_glmpca_pois
computes maximum-likelihood estimates (MLEs)
of ,
,
,
and
satistifying the
orthogonality constraints for
and
and the
additional constraints on
that the entries are positive and
decreasing. This is accomplished by iteratively fitting a series of
Poisson GLMs, where each of these individual Poissons GLMs is fitted
using a fast “cyclic co-ordinate descent” (CCD) algorithm.
The control
argument is a list in which any of the following
named components will override the default optimization algorithm
settings (as they are defined by
fit_glmpca_pois_control_default
). Additional control
arguments not listed here can be used to control the behaviour of
fpiter
or daarem
; see
the help accompanying these functions for details.
use_daarem
If use_daarem = TRUE
, the updates
are accelerated using DAAREM; see daarem
for
details.
tol
This is the value of the “tol” control
argument for fpiter
or
daarem
that determines when to stop the
optimization. In brief, the optimization stops when the change in
the estimates or in the log-likelihood between two successive
updates is less than “tol”.
maxiter
This is the value of the “maxiter”
control argument for fpiter
or
daarem
. In brief, it sets the upper limit on
the number of CCD updates.
convtype
This is the value of the “convtype”
control argument for daarem
. It determines
whether the stopping criterion is based on the change in the
estimates or the change in the log-likelihood between two
successive updates.
mon.tol
This is the value of the “mon.tol”
control argument for daarem
. This setting
determines to what extent the monotonicity condition can be
violated.
training_frac
Fraction of the columns of input data Y
to fit initial model on. If set to 1
(default), the model is fit
by optimizing the parameters on the entire dataset. If set between 0
and 1
, the model is optimized by first fitting a model on a randomly
selected fraction of the columns of Y
, and then projecting the
remaining columns of Y
onto the solution. Setting this to a smaller
value will increase speed but decrease accuracy.
num_projection_ccd_iter
Number of co-ordinate descent updates
be made to elements of V
if and when a subset of Y
is
projected onto U
. Only used if training_frac
is less than
1
.
num_ccd_iter
Number of co-ordinate descent updates to be made to parameters at each iteration of the algorithm.
line_search
If line_search = TRUE
, a
backtracking line search is performed at each iteration of CCD to
guarantee improvement in the objective (the log-likelihood).
ls_alpha
alpha parameter for backtracking line search. (Should be a number between 0 and 0.5, typically a number near zero.)
ls_beta
beta parameter for backtracking line search controlling the rate at which the step size is decreased. (Should be a number between 0 and 0.5.)
calc_deriv
If calc_deriv = TRUE
, the maximum
gradient of and
is calculated and stored after each
update. This may be useful for assessing convergence of the
optimization, though increases overhead.
calc_max_diff
If calc_max_diff = TRUE
, the
largest change in and
after each update is
calculated and stored. This may be useful for monitoring progress
of the optimization algorithm.
orthonormalize
If orthonormalize = TRUE
, the
matrices and
are made to be orthogonal after each
update step. This improves the speed of convergence without the
DAAREM acceleration; however, should not be used when
use_daarem = TRUE
.
You may use function set_fastglmpca_threads
to adjust
the number of threads used in performing the updates.
An object capturing the state of the model fit. It contains
estimates of ,
and
(stored as matrices
U
, V
and a vector of diagonal entries d
,
analogous to the svd
return value); the other
parameters (,
,
,
); the log-likelihood
achieved (
loglik
); information about which columns of
and
are fixed (
fixed_b_cols
,
fixed_w_cols
); and a data frame progress
storing
information about the algorithm's progress after each update.
Townes, F. W., Hicks, S. C., Aryee, M. J. and Irizarry, R. A. (2019). Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model. Genome Biology 20, 295. doi:10.1186/s13059-019-1861-6
Collins, M., Dasgupta, S. and Schapire, R. E. (2002). A generalization of principal components analysis to the exponential family. In Advances in Neural Information Processing Systems 14.
set.seed(1) n <- 200 p <- 100 K <- 3 dat <- generate_glmpca_data_pois(n,p,K) fit0 <- init_glmpca_pois(dat$Y,K) fit <- fit_glmpca_pois(dat$Y,fit0 = fit0)
set.seed(1) n <- 200 p <- 100 K <- 3 dat <- generate_glmpca_data_pois(n,p,K) fit0 <- init_glmpca_pois(dat$Y,K) fit <- fit_glmpca_pois(dat$Y,fit0 = fit0)
fitted
method for the
“glmpca_pois_fit” class.
## S3 method for class 'glmpca_pois_fit' fitted(object, ...)
## S3 method for class 'glmpca_pois_fit' fitted(object, ...)
object |
An object of class “glmpca_fit”,
typically the result of calling |
... |
Additional arguments passed to the generic
|
An n x p matrix of fitted means. Calculated as
using the fit
object.
Generate data from a GLM-PCA model with a specified rank.
generate_glmpca_data_pois(n, p, K, link = c("log", "log1p"))
generate_glmpca_data_pois(n, p, K, link = c("log", "log1p"))
n |
Number of rows (genes). |
p |
Number of columns (cells). |
K |
Rank of the underlying mean structure. |
link |
Character vector describing the link between the product of the loading and factors and the mean of the data. |
This function assumes that each column of the data is generated from a multinomial distribution. Let
denote column j of the generated data matrix. First, we set
equal to a value generated from a
distribution. Then, we generate
and
from mixture distributions, and calculate
. Then, we generate the individual elements of
from a multinomial model where the probability for each individual element is just
normalized.
list with the following components
LL - loadings of underlying mean structure. A K x n matrix
FF - factors of underlying mean structure. A K x p matrix
Y - n x p matrix of generated data.
set.seed(1) sim_data <- generate_glmpca_data_pois(1000, 500, 1)
set.seed(1) sim_data <- generate_glmpca_data_pois(1000, 500, 1)
These data are a selection of the reference transcriptome profiles generated via single-cell RNA sequencing (RNA-seq) of 10 bead-enriched subpopulations of PBMCs (Donor A), described in Zheng et al (2017). The data are unique molecular identifier (UMI) counts for 16,791 genes in 3,774 cells. (Genes with no expression in any of the cells were removed.) Since the majority of the UMI counts are zero, they are efficiently stored as a 16,791 x 3774 sparse matrix. These data are used in the vignette illustrating how ‘fastglmpca’ can be used to analyze single-cell RNA-seq data. Data for a separate set of 1,000 cells is provided as a “test set” to evaluate out-of-sample predictions.
pbmc_facs
is a list with the following elements:
16,791 x 3,774 sparse matrix of UMI counts, with
rows corresponding to genes and columns corresponding to
cells (samples). It is an object of class "dgCMatrix"
).
UMI counts for an additional test set of 100 cells.
Data frame containing information about the samples, including cell barcode and source FACS population (“celltype” and “facs_subpop”).
Sample information for the additional test set of 100 cells.
Data frame containing information and the genes, including gene symbol and Ensembl identifier.
GLM-PCA model that was fit to the UMI count data in the vignette.
https://www.10xgenomics.com/resources/datasets
G. X. Y. Zheng et al (2017). Massively parallel digital transcriptional profiling of single cells. Nature Communications 8, 14049. doi:10.1038/ncomms14049
library(Matrix) data(pbmc_facs) cat(sprintf("Number of genes: %d\n",nrow(pbmc_facs$counts))) cat(sprintf("Number of cells: %d\n",ncol(pbmc_facs$counts))) cat(sprintf("Proportion of counts that are non-zero: %0.1f%%.\n", 100*mean(pbmc_facs$counts > 0)))
library(Matrix) data(pbmc_facs) cat(sprintf("Number of genes: %d\n",nrow(pbmc_facs$counts))) cat(sprintf("Number of cells: %d\n",ncol(pbmc_facs$counts))) cat(sprintf("Proportion of counts that are non-zero: %0.1f%%.\n", 100*mean(pbmc_facs$counts > 0)))
Initialize RcppParallel multithreading using a
pre-specified number of threads, or using the default number of
threads when n
is not specified or is NA.
set_fastglmpca_threads(n)
set_fastglmpca_threads(n)
n |
The requested number of threads. |
The number of threads to be used.
summary
method for objects of class
“glmpcan_fit_pois”.
## S3 method for class 'glmpca_pois_fit' summary(object, ...) ## S3 method for class 'summary.glmpca_pois_fit' print(x, ...)
## S3 method for class 'glmpca_pois_fit' summary(object, ...) ## S3 method for class 'summary.glmpca_pois_fit' print(x, ...)
object |
An object of class “glmpca_fit”,
typically the result of calling |
... |
Additional arguments passed to the generic
|
x |
An object of class “summary.glmpca_fit”,
usually the result of a call to |
summary
returns a vector of basic statistics
summarizing the model fit.