Title: | Fast Algorithms for Fitting Topic Models and Non-Negative Matrix Factorizations to Count Data |
---|---|
Description: | Implements fast, scalable optimization algorithms for fitting topic models ("grade of membership" models) and non-negative matrix factorizations to count data. The methods exploit the special relationship between the multinomial topic model (also, "probabilistic latent semantic indexing") and Poisson non-negative matrix factorization. The package provides tools to compare, annotate and visualize model fits, including functions to efficiently create "structure plots" and identify key features in topics. The 'fastTopics' package is a successor to the 'CountClust' package. For more information, see <doi:10.48550/arXiv.2105.13440> and <doi:10.1186/s13059-023-03067-9>. Please also see the GitHub repository for additional vignettes not included in the package on CRAN. |
Authors: | Peter Carbonetto [aut, cre], Kevin Luo [aut], Kushal Dey [aut], Joyce Hsiao [ctb], Abhishek Sarkar [ctb], Anthony Hung [ctb], Xihui Lin [ctb], Paul C. Boutros [ctb], Minzhe Wang [ctb], Tracy Ke [ctb], Eric Weine [ctb], Matthew Stephens [aut] |
Maintainer: | Peter Carbonetto <[email protected]> |
License: | BSD_2_clause + file LICENSE |
Version: | 0.6-193 |
Built: | 2024-11-05 16:29:42 UTC |
Source: | https://github.com/stephenslab/fasttopics |
Create a table summarizing the results of fitting one or more Poisson non-negative matrix factorizations or multinomial topic models.
compare_fits(fits)
compare_fits(fits)
fits |
An object of class |
A data frame with one row per element of fits
, and
with the following columns:
k |
The rank of the matrix factorization. |
loglik |
The log-likelihood (either Poisson NMF or multinomial topic model likelihood) achieved at the last model fitting update. |
dev |
For Poisson NMF model fits only, the deviance achieved at the last model fitting update. |
res |
The maximum residual of the Karush-Kuhn-Tucker (KKT) system achieved at the last model fitting update; small values indicate that the solution is close to a local maximum, or stationary point, of the likelihood. |
loglik.diff |
The improvement in the log-likelihood relative to the model fit with the smallest log-likelihood. |
dev.diff |
The improvement in the deviance relative to the model fit with the largest deviance. |
nonzeros.f |
The rate of nonzeros in the factors matrix, as
determined by |
nonzeros.l |
The rate of nonzeros in the loadings matrix, as
determined by |
numiter |
The number of loadings and/or factor updates performed. |
runtime |
The total runtime (in s) of the model fitting updates. |
fit_poisson_nmf
,
fit_topic_model
Implements methods for differential expression analysis using a topic model. These methods are motivated by gene expression studies, but could have other uses, such as identifying “key words” for topics.
de_analysis( fit, X, s = rowSums(X), pseudocount = 0.01, fit.method = c("scd", "em", "mu", "ccd", "glm"), shrink.method = c("ash", "none"), lfc.stat = "le", control = list(), verbose = TRUE, ... ) de_analysis_control_default()
de_analysis( fit, X, s = rowSums(X), pseudocount = 0.01, fit.method = c("scd", "em", "mu", "ccd", "glm"), shrink.method = c("ash", "none"), lfc.stat = "le", control = list(), verbose = TRUE, ... ) de_analysis_control_default()
fit |
An object of class “poisson_nmf_fit” or
“multinom_topic_model_fit”, or an n x k matrix of topic
proportions, where k is the number of topics. (The elements in each
row of this matrix should sum to 1.) If a Poisson NMF fit is
provided as input, the corresponding multinomial topic model fit is
automatically recovered using |
X |
The n x m counts matrix. It can be a sparse matrix (class
|
s |
A numeric vector of length n determining how the rates are
scaled in the Poisson models. See “Details” for guidance on
the choice of |
pseudocount |
Observations with this value are added to the counts matrix to stabilize maximum-likelihood estimation. |
fit.method |
Method used to fit the Poisson models. Note that
|
shrink.method |
Method used to stabilize the posterior mean
LFC estimates. When |
lfc.stat |
The log-fold change statistics returned:
|
control |
A list of parameters controlling behaviour of the optimization and Monte Carlo algorithms. See ‘Details’. |
verbose |
When |
... |
When |
The methods are based on the Poisson model
in which the Poisson rates are
the
are the topic proportions and the
are the unknowns to be
estimated. This model is applied separately to each column of
X
. When (specified by input argument
s
) is
equal the total count in row i (this is the default), the Poisson
model will closely approximate a binomial model of the count data,
and the unknowns will approximate binomial
probabilities. (The Poisson approximation to the binomial is most
accurate when the total counts
rowSums(X)
are large and the
unknowns are small.) Other choices for
s
are
possible, and implement different normalization schemes.
To allow for some flexibility, de_analysis
allows for the
log-fold change to be measured in several ways.
One option is to compare against the probability under the null
model: , where
is the single
parameter in the Poisson model
with
rates
. This LFC definition is chosen with
lfc.stat = "vsnull"
.
Another option is to compare against a chosen topic, k: . By definition,
is zero, and
statistics such as z-scores and p-values for topic k are set to
NA
. This LFC definition is selected by setting
lfc.stat = k
.
A final option (which is the default) computes the “least
extreme” LFC, defined as such that
is the topic other than
that gives the ratio
closest to 1. This option is chosen with
lfc.stat = "le"
.
We recommend setting shrink.method = "ash"
, which uses the
“adaptive shrinkage” method (Stephens, 2016) to improve
accuracy of the posterior mean estimates and z-scores. We follow
the settings used in lfcShrink
from the ‘DESeq2’
package, with type = "ashr"
.
Note that all LFC statistics are defined using the base-2 logarithm following the conventioned used in differential expression analysis.
The control
argument is a list in which any of the
following named components will override the default optimization
algorithm settings (as they are defined by
de_analysis_control_default
):
numiter
Maximum number of iterations performed in
fitting the Poisson models. When fit.method = "glm"
, this is
passed as argument maxit
to the glm
function.
minval
A small, positive number. All topic
proportions less than this value and greater than 1 - minval
are set to this value.
tol
Controls the convergence tolerance for fitting
the Poisson models. When fit.method = "glm"
, this is passed
as argument epsilon
to function glm
.
conf.level
The size of the highest posterior density (HPD) intervals. Should be a number greater than 0 and less than 1.
ns
Number of Monte Carlo samples simulated by random-walk MCMC for estimating posterior LFC quantities.
rw
The standard deviation of the normal density used to propose new states in the random-walk MCMC.
eps
A small, non-negative number added to the terms inside the logarithms to avoid computing logarithms of zero.
nc
Number of threads used in the multithreaded
computations. This controls both (a) the number of RcppParallel
threads used to fit the factors in the Poisson models, and (b)
the number of cores used in mclapply
for
the MCMC simulation step. Note that mclapply relies on forking
hence is not available on Windows; will return an error on
Windows unless nc = 1
. Also note that setting nc >
1
copies the contents of memory nc
times, which can lead
to poor performance if the total resident memory required exceeds
available physical memory.
nc.blas
Number of threads used in the numerical linear algebra library (e.g., OpenBLAS), if available. For best performance, we recommend setting this to 1 (i.e., no multithreading).
nsplit
The number of data splits used in the
multithreaded computations (only relevant when nc > 1
). More
splits increase the granularity of the progress bar, but can also
slow down the mutithreaded computations by introducing more
overhead in the call to pblapply
.
A list with the following elements:
est |
The log-fold change maximum-likelihood estimates. |
postmean |
Posterior mean LFC estimates. |
lower |
Lower limits of estimated HPD intervals. Note that these are not updated by the shrinkage step. |
upper |
Upper limits of estimated HPD intervals. Note that these are not updated by the shrinkage step. |
z |
z-scores for posterior mean LFC estimates. |
lpval |
-log10 two-tailed p-values obtained from the
z-scores. When |
svalue |
s-values returned by
|
lfsr |
When |
ash |
When |
F |
Maximum-likelihood estimates of the Poisson model parameters. |
f0 |
Maximum-likelihood estimates of the null model parameters. |
ar |
A vector containing the Metropolis acceptance ratios from each MCMC run. |
Stephens, M. (2016). False discovery rates: a new deal. Biostatistics 18(2), kxw041. doi:10.1093/biostatistics/kxw041
Zhu, A., Ibrahim, J. G. and Love, M. I. (2019). Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences. Bioinformatics 35(12), 2084–2092.
# Perform a differential expression (DE) analysis using the previously # fitted multinomial topic model. Note that the de_analysis call could # take several minutes to complete. set.seed(1) data(pbmc_facs) de <- de_analysis(pbmc_facs$fit,pbmc_facs$counts) # Compile the DE analysis results for topic 4 into a table, and # rank the genes by the posterior mean log-fold change, limiting to # DE genes identified with low lfsr ("local false sign rate"). k <- 4 dat <- data.frame(postmean = de$postmean[,k], z = de$z[,k], lfsr = de$lfsr[,k]) rownames(dat) <- with(pbmc_facs$genes,paste(symbol,ensembl,sep = "_")) dat <- subset(dat,lfsr < 0.01) dat <- dat[order(dat$postmean,decreasing = TRUE),] # Genes at the top of this ranking are genes that are much more # highly expressed in the topic compared to other topics. head(dat,n = 10) # The genes at the bottom of the ranking are genes that are much less # expressed in the topic. tail(dat,n = 10) # Create a volcano plot from the DE results for topic 4. volcano_plot(de,k = k,ymax = 50,labels = pbmc_facs$genes$symbol)
# Perform a differential expression (DE) analysis using the previously # fitted multinomial topic model. Note that the de_analysis call could # take several minutes to complete. set.seed(1) data(pbmc_facs) de <- de_analysis(pbmc_facs$fit,pbmc_facs$counts) # Compile the DE analysis results for topic 4 into a table, and # rank the genes by the posterior mean log-fold change, limiting to # DE genes identified with low lfsr ("local false sign rate"). k <- 4 dat <- data.frame(postmean = de$postmean[,k], z = de$z[,k], lfsr = de$lfsr[,k]) rownames(dat) <- with(pbmc_facs$genes,paste(symbol,ensembl,sep = "_")) dat <- subset(dat,lfsr < 0.01) dat <- dat[order(dat$postmean,decreasing = TRUE),] # Genes at the top of this ranking are genes that are much more # highly expressed in the topic compared to other topics. head(dat,n = 10) # The genes at the bottom of the ranking are genes that are much less # expressed in the topic. tail(dat,n = 10) # Create a volcano plot from the DE results for topic 4. volcano_plot(de,k = k,ymax = 50,labels = pbmc_facs$genes$symbol)
Visualize the structure of the Poisson NMF loadings or
the multinomial topic model topic proportions by projection onto
a 2-d surface. plot_hexbin_plot
is most useful for
visualizing the PCs of a data set with thousands of samples or
more.
embedding_plot_2d( fit, Y, fill = "loading", k, fill.label, ggplot_call = embedding_plot_2d_ggplot_call, plot_grid_call = function(plots) do.call(plot_grid, plots) ) embedding_plot_2d_ggplot_call( Y, fill, fill.type = c("loading", "numeric", "factor", "none"), fill.label, font.size = 9 ) pca_plot( fit, Y, pcs = 1:2, n = 10000, fill = "loading", k, fill.label, ggplot_call = embedding_plot_2d_ggplot_call, plot_grid_call = function(plots) do.call(plot_grid, plots), ... ) tsne_plot( fit, Y, n = 2000, fill = "loading", k, fill.label, ggplot_call = embedding_plot_2d_ggplot_call, plot_grid_call = function(plots) do.call(plot_grid, plots), ... ) umap_plot( fit, Y, n = 2000, fill = "loading", k, fill.label, ggplot_call = embedding_plot_2d_ggplot_call, plot_grid_call = function(plots) do.call(plot_grid, plots), ... ) pca_hexbin_plot( fit, Y, pcs = 1:2, bins = 40, breaks = c(0, 1, 10, 100, 1000, Inf), ggplot_call = pca_hexbin_plot_ggplot_call, ... ) pca_hexbin_plot_ggplot_call(Y, bins, breaks, font.size = 9)
embedding_plot_2d( fit, Y, fill = "loading", k, fill.label, ggplot_call = embedding_plot_2d_ggplot_call, plot_grid_call = function(plots) do.call(plot_grid, plots) ) embedding_plot_2d_ggplot_call( Y, fill, fill.type = c("loading", "numeric", "factor", "none"), fill.label, font.size = 9 ) pca_plot( fit, Y, pcs = 1:2, n = 10000, fill = "loading", k, fill.label, ggplot_call = embedding_plot_2d_ggplot_call, plot_grid_call = function(plots) do.call(plot_grid, plots), ... ) tsne_plot( fit, Y, n = 2000, fill = "loading", k, fill.label, ggplot_call = embedding_plot_2d_ggplot_call, plot_grid_call = function(plots) do.call(plot_grid, plots), ... ) umap_plot( fit, Y, n = 2000, fill = "loading", k, fill.label, ggplot_call = embedding_plot_2d_ggplot_call, plot_grid_call = function(plots) do.call(plot_grid, plots), ... ) pca_hexbin_plot( fit, Y, pcs = 1:2, bins = 40, breaks = c(0, 1, 10, 100, 1000, Inf), ggplot_call = pca_hexbin_plot_ggplot_call, ... ) pca_hexbin_plot_ggplot_call(Y, bins, breaks, font.size = 9)
fit |
An object of class “poisson_nmf_fit” or “multinom_topic_model_fit”. |
Y |
The n x 2 matrix containing the 2-d embedding, where n is
the number of rows in |
fill |
The quantity to map onto the fill colour of the points
in the PCA plot. Set |
k |
The dimensions or topics selected by number or name. When
|
fill.label |
The label used for the fill colour legend. |
ggplot_call |
The function used to create the plot. Replace
|
plot_grid_call |
When |
fill.type |
The type of variable mapped to fill colour. The
fill colour is not varied when |
font.size |
Font size used in plot. |
pcs |
The two principal components (PCs) to be plotted, specified by name or number. |
n |
The maximum number of points to plot. If |
... |
Additional arguments passed to
|
bins |
Number of bins used to create hexagonal 2-d
histogram. Passed as the “bins” argument to
|
breaks |
To produce the hexagonal histogram, the counts are
subdivided into intervals based on |
This is a lightweight interface primarily intended to
expedite creation of plots for visualizing the loadings or topic
proportions; most of the heavy lifting is done by
‘ggplot2’. The 2-d embedding itself is computed by invoking
pca_from_topics
, tsne_from_topics
or
umap_from_topics
. For more control over the plot's
appearance, the plot can be customized by modifying the
ggplot_call
and plot_grid_call
arguments.
An effective 2-d visualization may also require some fine-tunning
of the settings, such as the t-SNE “perplexity”, or the
number of samples included in the plot. The PCA, UMAP, t-SNE
settings can be controlled by the additional arguments
(...). Alternatively, a 2-d embedding may be pre-computed, and
passed as argument Y
.
A ggplot
object.
pca_from_topics
,
tsne_from_topics
,
umap_from_topics
set.seed(1) data(pbmc_facs) # Get the Poisson NMF and multinomial topic models fitted to the # PBMC data. fit1 <- multinom2poisson(pbmc_facs$fit) fit2 <- pbmc_facs$fit # Plot the first two PCs of the loadings matrix (for the # multinomial topic model, "fit2", the loadings are the topic # proportions). subpop <- pbmc_facs$samples$subpop p1 <- pca_plot(fit1,k = 1) p2 <- pca_plot(fit2) p3 <- pca_plot(fit2,fill = "none") p4 <- pca_plot(fit2,pcs = 3:4,fill = "none") p5 <- pca_plot(fit2,fill = fit2$L[,1]) p6 <- pca_plot(fit2,fill = subpop) p7 <- pca_hexbin_plot(fit1) p8 <- pca_hexbin_plot(fit2) # Plot the loadings using t-SNE. p1 <- tsne_plot(fit1,k = 1) p2 <- tsne_plot(fit2) p3 <- tsne_plot(fit2,fill = subpop) # Plot the loadings using UMAP. p1 <- umap_plot(fit1,k = 1) p2 <- umap_plot(fit2) p3 <- umap_plot(fit2,fill = subpop)
set.seed(1) data(pbmc_facs) # Get the Poisson NMF and multinomial topic models fitted to the # PBMC data. fit1 <- multinom2poisson(pbmc_facs$fit) fit2 <- pbmc_facs$fit # Plot the first two PCs of the loadings matrix (for the # multinomial topic model, "fit2", the loadings are the topic # proportions). subpop <- pbmc_facs$samples$subpop p1 <- pca_plot(fit1,k = 1) p2 <- pca_plot(fit2) p3 <- pca_plot(fit2,fill = "none") p4 <- pca_plot(fit2,pcs = 3:4,fill = "none") p5 <- pca_plot(fit2,fill = fit2$L[,1]) p6 <- pca_plot(fit2,fill = subpop) p7 <- pca_hexbin_plot(fit1) p8 <- pca_hexbin_plot(fit2) # Plot the loadings using t-SNE. p1 <- tsne_plot(fit1,k = 1) p2 <- tsne_plot(fit2) p3 <- tsne_plot(fit2,fill = subpop) # Plot the loadings using UMAP. p1 <- umap_plot(fit1,k = 1) p2 <- umap_plot(fit2) p3 <- umap_plot(fit2,fill = subpop)
Fit a simple multinomial model for count data, in
which each sample (i.e., a row of the data matrix X
)
is assigned to a cluster. Under this simple multinomial model,
assigned to cluster
is multinomial with sample
size
and multinomial
probabilities
. This is a special case of
the multinomial topic model in which all the mixture proportions
are either 0 or 1. The maximum-likelihood estimates (MLEs) of the
multinomial probabilities have a closed-form solution; no
iterative algorithm is needed to fit this simple model.
fit_multinom_model(cluster, X, verbose = c("none", "detailed"), ...)
fit_multinom_model(cluster, X, verbose = c("none", "detailed"), ...)
cluster |
A factor specifying a grouping, or clustering, of
the rows of |
X |
The n x m matrix of counts; all entries of X should be
non-negative. It can be a sparse matrix (class |
verbose |
This is passed as the “verbose” argument in
the call to |
... |
Additional arguments passed to
|
A multinomial topic model fit.
Approximate the input matrix X
by the
non-negative matrix factorization tcrossprod(L,F)
, in which
the quality of the approximation is measured by a
“divergence” criterion; equivalently, optimize the
likelihood under a Poisson model of the count data, X
, in
which the Poisson rates are given by tcrossprod(L,F)
.
Function fit_poisson_nmf
runs a specified number of
coordinate-wise updates to fit the L and F matrices.
fit_poisson_nmf( X, k, fit0, numiter = 100, update.factors = seq(1, ncol(X)), update.loadings = seq(1, nrow(X)), method = c("scd", "em", "mu", "ccd"), init.method = c("topicscore", "random"), control = list(), verbose = c("progressbar", "detailed", "none") ) fit_poisson_nmf_control_default() init_poisson_nmf( X, F, L, k, init.method = c("topicscore", "random"), beta = 0.5, betamax = 0.99, control = list(), verbose = c("detailed", "none") ) init_poisson_nmf_from_clustering(X, clusters, ...)
fit_poisson_nmf( X, k, fit0, numiter = 100, update.factors = seq(1, ncol(X)), update.loadings = seq(1, nrow(X)), method = c("scd", "em", "mu", "ccd"), init.method = c("topicscore", "random"), control = list(), verbose = c("progressbar", "detailed", "none") ) fit_poisson_nmf_control_default() init_poisson_nmf( X, F, L, k, init.method = c("topicscore", "random"), beta = 0.5, betamax = 0.99, control = list(), verbose = c("detailed", "none") ) init_poisson_nmf_from_clustering(X, clusters, ...)
X |
The n x m matrix of counts; all entries of X should be
non-negative. It can be a sparse matrix (class |
k |
An integer 2 or greater giving the matrix rank. This
argument should only be specified if the initial fit ( |
fit0 |
The initial model fit. It should be an object of class
“poisson_nmf_fit”, such as an output from
|
numiter |
The maximum number of updates of the factors and loadings to perform. |
update.factors |
A numeric vector specifying which factors
(rows of |
update.loadings |
A numeric vector specifying which loadings
(rows of |
method |
The method to use for updating the factors and
loadings. Four methods are implemented: multiplicative updates,
|
init.method |
The method used to initialize the factors and
loadings. When |
control |
A list of parameters controlling the behaviour of the optimization algorithm (and the Topic SCORE algorithm if it is used to initialize the model parameters). See ‘Details’. |
verbose |
When |
F |
An optional argument giving is the initial estimate of the
factors (also known as “basis vectors”). It should be an m x
k matrix, where m is the number of columns in the counts matrix
|
L |
An optional argument giving the initial estimate of the
loadings (also known as “activations”). It should be an n x k
matrix, where n is the number of rows in the counts matrix
|
beta |
Initial setting of the extrapolation parameter. This is
|
betamax |
Initial setting for the upper bound on the
extrapolation parameter. This is |
clusters |
A factor specifying a grouping, or clustering, of
the rows of |
... |
Additional arguments passed to |
In Poisson non-negative matrix factorization (Lee & Seung,
2001), counts in the
matrix,
,
are modeled by the Poisson distribution:
Each Poisson rate,
, is a linear combination of parameters
to be fitted to the data:
in which
is a user-specified tuning parameter specifying the rank of the
matrix factorization. Function
fit_poisson_nmf
computes
maximum-likelihood estimates (MLEs) of the parameters. For
additional mathematical background, and an explanation of how
Poisson NMF is connected to topic modeling, see the vignette:
vignette(topic = "relationship",package = "fastTopics")
.
Using this function requires some care; only minimal argument checking is performed, and error messages may not be helpful.
The EM and multiplicative updates are simple and fast, but can be
slow to converge to a stationary point. When control$numiter
= 1
, the EM and multiplicative updates are mathematically
equivalent to the multiplicative updates, and therefore share the
same convergence properties. However, the implementation of the EM
updates is quite different; in particular, the EM updates are more
suitable for sparse counts matrices. The implementation of the
multiplicative updates is adapted from the MATLAB code by Daichi
Kitamura http://d-kitamura.net.
Since the multiplicative updates are implemented using standard matrix operations, the speed is heavily dependent on the BLAS/LAPACK numerical libraries used. In particular, using optimized implementations such as OpenBLAS or Intel MKL can result in much improved performance of the multiplcative updates.
The cyclic co-ordinate descent (CCD) and sequential co-ordinate descent (SCD) updates adopt the same optimization strategy, but differ in the implementation details. In practice, we have found that the CCD and SCD updates arrive at the same solution when initialized “sufficiently close” to a stationary point. The CCD implementation is adapted from the C++ code developed by Cho-Jui Hsieh and Inderjit Dhillon, which is available for download at https://www.cs.utexas.edu/~cjhsieh/nmf/. The SCD implementation is based on version 0.4-3 of the ‘NNLM’ package.
An additional re-scaling step is performed after each update to promote numerical stability.
We use three measures of progress for the model fitting: (1)
improvement in the log-likelihood (or deviance), (2) change in the
model parameters, and (3) the residuals of the Karush-Kuhn-Tucker
(KKT) first-order conditions. As the iterates approach a stationary
point of the loss function, the change in the model parameters
should be small, and the residuals of the KKT system should vanish.
Use plot_progress
to plot the improvement in the
solution over time.
See fit_topic_model
for additional guidance on model
fitting, particularly for large or complex data sets.
The control
argument is a list in which any of the
following named components will override the default optimization
algorithm settings (as they are defined by
fit_poisson_nmf_control_default
):
numiter
Number of “inner loop” iterations to
run when performing and update of the factors or loadings. This
must be set to 1 for method = "mu"
and method =
"ccd"
.
nc
Number of RcppParallel threads to use for the
updates. When nc
is NA
, the number of threads is
determined by calling
defaultNumThreads
. This setting is
ignored for the multiplicative upates (method = "mu"
).
nc.blas
Number of threads used in the numerical linear algebra library (e.g., OpenBLAS), if available. For best performance, we recommend setting this to 1 (i.e., no multithreading).
min.delta.loglik
Stop performing updates if the
difference in the Poisson NMF log-likelihood between two successive
updates is less than min.delta.loglik
. This should not be
kept at zero when control$extrapolate = TRUE
because the
extrapolated updates are expected to occasionally keep the
likelihood unchanged. Ignored if min.delta.loglik < 0
.
min.res
Stop performing updates if the maximum KKT
residual is less than min.res
. Ignored if min.res < 0
.
minval
A small, positive constant used to safeguard
the multiplicative updates. The safeguarded updates are implemented
as F <- pmax(F1,minval)
and L <- pmax(L1,minval)
,
where F1
and L1
are the factors and loadings matrices
obtained by applying an update. This is motivated by Theorem 1 of
Gillis & Glineur (2012). Setting minval = 0
is allowed, but
some methods are not guaranteed to converge to a stationary point
without this safeguard, and a warning will be given in this case.
extrapolate
When extrapolate = TRUE
, the
extrapolation scheme of Ang & Gillis (2019) is used.
extrapolate.reset
To promote better numerical stability of the extrapolated updates, they are “reset” every so often. This parameter determines the number of iterations to wait before resetting.
beta.increase
When the extrapolated update improves the solution, scale the extrapolation parameter by this amount.
beta.reduce
When the extrapolaaed update does not improve the solution, scale the extrapolation parameter by this amount.
betamax.increase
When the extrapolated update improves the solution, scale the extrapolation parameter by this amount.
eps
A small, non-negative number that is added to the terms inside the logarithms to sidestep computing logarithms of zero. This prevents numerical problems at the cost of introducing a small inaccuracy in the solution. Increasing this number may lead to faster convergence but possibly a less accurate solution.
zero.threshold
A small, non-negative number used to
determine which entries of the solution are exactly zero. Any
entries that are less than or equal to zero.threshold
are
considered to be exactly zero.
An additional setting, control$init.numiter
, controls the
number of sequential co-ordinate descent (SCD) updates that are
performed to initialize the loadings matrix when init.method
= "topicscore"
.
init_poisson_nmf
and fit_poisson_nmf
both
return an object capturing the optimization algorithm state (for
init_poisson_nmf
, this is the initial state). It is a list
with the following elements:
F |
A matrix containing the current best estimates of the factors. |
L |
A matrix containing the current best estimates of the loadings. |
Fn |
A matrix containing the non-extrapolated factor estimates.
If extrapolation is not used, |
Ln |
A matrix containing the non-extrapolated estimates of the
loadings. If extrapolation is not used, |
Fy |
A matrix containing the extrapolated factor estimates. If
the extrapolation scheme is not used, |
Ly |
A matrix containing the extrapolated estimates of the
loadings. If extrapolation is not used, |
loss |
Value of the objective (“loss”) function computed at the current best estimates of the factors and loadings. |
loss.fnly |
Value of the objective (“loss”) function
computed at the extrapolated solution for the loadings ( |
iter |
The number of the most recently completed iteration. |
beta |
The extrapolation parameter, |
betamax |
Upper bound on the extrapolation parameter. This is
|
beta0 |
The setting of the extrapolation parameter at the last iteration that improved the solution. |
progress |
A data frame containing detailed information about
the algorithm's progress. The data frame should have at most
|
Ang, A. and Gillis, N. (2019). Accelerating nonnegative matrix factorization algorithms using extrapolation. Neural Computation 31, 417–439.
Cichocki, A., Cruces, S. and Amari, S. (2011). Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy 13, 134–170.
Gillis, N. and Glineur, F. (2012). Accelerated multiplicative
updates and hierarchical ALS algorithms for nonnegative matrix
factorization. Neural Computation 24
, 1085–1105.
Hsieh, C.-J. and Dhillon, I. (2011). Fast coordinate descent methods with variable selection for non-negative matrix factorization. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, p. 1064-1072
Lee, D. D. and Seung, H. S. (2001). Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems 13, 556–562.
Lin, X. and Boutros, P. C. (2018). Optimization and expansion of non-negative matrix factorization. BMC Bioinformatics 21, 7.
Ke, Z. & Wang, M. (2017). A new SVD approach to optimal topic estimation. arXiv https://arxiv.org/abs/1704.07016
fit_topic_model
, plot_progress
# Simulate a (sparse) 80 x 100 counts matrix. library(Matrix) set.seed(1) X <- simulate_count_data(80,100,k = 3,sparse = TRUE)$X # Remove columns (words) that do not appear in any row (document). X <- X[,colSums(X > 0) > 0] # Run 10 EM updates to find a good initialization. fit0 <- fit_poisson_nmf(X,k = 3,numiter = 10,method = "em") # Fit the Poisson NMF model by running 50 EM updates. fit_em <- fit_poisson_nmf(X,fit0 = fit0,numiter = 50,method = "em") # Fit the Poisson NMF model by running 50 extrapolated SCD updates. fit_scd <- fit_poisson_nmf(X,fit0 = fit0,numiter = 50,method = "scd", control = list(extrapolate = TRUE)) # Compare the two fits. fits <- list(em = fit_em,scd = fit_scd) compare_fits(fits) plot_progress(fits,y = "loglik") plot_progress(fits,y = "res") # Recover the topic model. After this step, the L matrix contains the # mixture proportions ("loadings"), and the F matrix contains the # word frequencies ("factors"). fit_multinom <- poisson2multinom(fit_scd)
# Simulate a (sparse) 80 x 100 counts matrix. library(Matrix) set.seed(1) X <- simulate_count_data(80,100,k = 3,sparse = TRUE)$X # Remove columns (words) that do not appear in any row (document). X <- X[,colSums(X > 0) > 0] # Run 10 EM updates to find a good initialization. fit0 <- fit_poisson_nmf(X,k = 3,numiter = 10,method = "em") # Fit the Poisson NMF model by running 50 EM updates. fit_em <- fit_poisson_nmf(X,fit0 = fit0,numiter = 50,method = "em") # Fit the Poisson NMF model by running 50 extrapolated SCD updates. fit_scd <- fit_poisson_nmf(X,fit0 = fit0,numiter = 50,method = "scd", control = list(extrapolate = TRUE)) # Compare the two fits. fits <- list(em = fit_em,scd = fit_scd) compare_fits(fits) plot_progress(fits,y = "loglik") plot_progress(fits,y = "res") # Recover the topic model. After this step, the L matrix contains the # mixture proportions ("loadings"), and the F matrix contains the # word frequencies ("factors"). fit_multinom <- poisson2multinom(fit_scd)
Fits a multinomial topic model to the count data,
hiding most of the complexities of model fitting. The default
optimization settings used here are intended to work well in a wide
range of data sets, although some fine-tuning may be needed for
more difficult cases. For full control, use fit_poisson_nmf
.
fit_topic_model( X, k, numiter.main = 100, numiter.refine = 100, method.main = "em", method.refine = "scd", init.method = c("topicscore", "random"), control.init = list(), control.main = list(numiter = 4), control.refine = list(numiter = 4, extrapolate = TRUE), verbose = c("progressbar", "detailed", "none") )
fit_topic_model( X, k, numiter.main = 100, numiter.refine = 100, method.main = "em", method.refine = "scd", init.method = c("topicscore", "random"), control.init = list(), control.main = list(numiter = 4), control.refine = list(numiter = 4, extrapolate = TRUE), verbose = c("progressbar", "detailed", "none") )
X |
The n x m matrix of counts; all entries of X should be
non-negative. It can be a sparse matrix (class |
k |
The number of topics. Must be 2 or greater. |
numiter.main |
Number of updates of the factors and loadings to perform in the main model fitting step. Should be 1 or more. |
numiter.refine |
Number of updates of the factors and loadings to perform in the model refinement step. |
method.main |
The method to use for updating the factors and
loadings in the main model fitting step. Passed as argument
"method" to |
method.refine |
The method to use for updating the factors in
evthe model refinement step. Passed as argument "method"
to |
init.method |
The method used to initialize the factors and
loadings. See |
control.init |
A list of parameters controlling the behaviour
of the optimization and Topic SCORE method in the call to
|
control.main |
A list of parameters controlling the behaviour
of the optimization in the main model fitting step. This is passed
as argument "control" to |
control.refine |
A list of parameters controlling the
behaviour of the of the optimization algorithm in the model
refinement step. This is passed as argument "control" to
|
verbose |
When |
With the default settings, the model fitting is
accomplished in four steps: (1) initialize the Poisson NMF model
fit (init_poisson_nmf
); (2) perform the main model fitting
step by running 100 EM updates using fit_poisson_nmf
; (3)
refine the fit by running 100 extrapolated SCD updates, again using
fit_poisson_nmf
; and (4) recover the multinomial topic model
by calling poisson2multinom
.
This two-stage fitting approach is based on our findings that the EM algorithm initially makes rapid progress toward a solution, but its convergence slows considerably as the iterates approach a solution. Close to a solution, we have found that other algorithms make much more rapid progress than EM; in particularly, we found that the extrapolated SCD updates usually performed best). For larger data sets, more updates in the main model fitting and refinement steps may be needed to obtain a good fit.
For larger data sets, more than 200 updates may be needed to obtain a good fit.
A multinomial topic model fit; see
poisson2multinom
and fit_poisson_nmf
for details. Note that outputted likelihoods and deviances in
progress
are evaluated with respect to the equivalent
Poisson NMF model.
Dey, K. K., Hsiao, C. J. and Stephens, M. (2017). Visualizing the structure of RNA-seq expression data using grade of membership models. PLoS Genetics 13, e1006599.
Blei, D. M., Ng, A. Y. and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993-1022.
Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the 22nd International ACM SIGIR Conference, 50-57. doi:10.1145/312624.312649
init_poisson_nmf
,
fit_poisson_nmf
,
poisson2multinom
,
fit_multinom_model
library(Matrix) set.seed(1) X <- simulate_count_data(80,100,k = 3,sparse = TRUE)$X fit <- fit_topic_model(X,k = 3) print(summary(fit))
library(Matrix) set.seed(1) X <- simulate_count_data(80,100,k = 3,sparse = TRUE)$X fit <- fit_topic_model(X,k = 3) print(summary(fit))
Generate one or more barcharts to visualize the relationship between the loadings or mixture proportions and a selected categorical variable (a factor).
loadings_plot( fit, x, k, ggplot_call = loadings_plot_ggplot_call, plot_grid_call = function(plots) do.call(plot_grid, plots) ) loadings_plot_ggplot_call(dat, topic.label, font.size = 9)
loadings_plot( fit, x, k, ggplot_call = loadings_plot_ggplot_call, plot_grid_call = function(plots) do.call(plot_grid, plots) ) loadings_plot_ggplot_call(dat, topic.label, font.size = 9)
fit |
An object of class “poisson_nmf_fit” or “multinom_topic_model_fit”. |
x |
A categorical variable represented as a
|
k |
The topic, or topics, selected by number or name. When not specified, all topics are plotted. |
ggplot_call |
The function used to create the plot. Replace
|
plot_grid_call |
When multiple topics are selected, this is
the function used to arrange the plots into a grid using
|
dat |
A data frame passed as input to
|
topic.label |
The name or number of the topic being plotted. Only used to determine the plot title. |
font.size |
Font size used in plot. |
This is a lightweight interface primarily intended to
expedite creation of boxplots for investigating relationships
between topics and a categorical variables of interest without
having to spend a great deal of time worrying about the plotting
settings; most of the “heavy lifting” is done by
‘ggplot2’ (specifically, function
geom_boxplot
in the ‘ggplot2’
package). For more control over the plot's appearance, the plot can
be customized by modifying the ggplot_call
and
plot_grid_call
arguments.
A ggplot
object.
Compute log-likelihoods and deviances for assessing fit of a topic model or a non-negative matrix factorization (NMF).
loglik_poisson_nmf(X, fit, e = 1e-08) loglik_multinom_topic_model(X, fit, e = 1e-08) deviance_poisson_nmf(X, fit, e = 1e-08) cost(X, A, B, e = 1e-08, family = c("poisson", "multinom"), version)
loglik_poisson_nmf(X, fit, e = 1e-08) loglik_multinom_topic_model(X, fit, e = 1e-08) deviance_poisson_nmf(X, fit, e = 1e-08) cost(X, A, B, e = 1e-08, family = c("poisson", "multinom"), version)
X |
The n x m matrix of counts or pseudocounts. It can be a
sparse matrix (class |
fit |
A Poisson NMF or multinomial topic model fit, such as an
output from |
e |
A small, non-negative number added to the terms inside the logarithms to avoid computing logarithms of zero. This prevents numerical problems at the cost of introducing a very small inaccuracy in the computation. |
A |
The n x k matrix of loadings. It should be a dense matrix. |
B |
The k x m matrix of factors. It should be a dense matrix. |
family |
If |
version |
When |
Function cost
computes loss functions proportional
to the negative log-likelihoods, and is mainly for internal use to
quickly compute log-likelihoods and deviances; it should not be
used directly unless you know what you are doing. In particular,
little argument checking is performed by cost
.
A numeric vector with one entry per row of X
.
# Generate a small counts matrix. set.seed(1) out <- simulate_count_data(10,20,3) X <- out$X fit <- out[c("F","L")] class(fit) <- c("poisson_nmf_fit","list") # Compute the Poisson log-likelihoods and deviances. data.frame(loglik = loglik_poisson_nmf(X,fit), deviance = deviance_poisson_nmf(X,fit)) # Compute multinomial log-likelihoods. loglik_multinom_topic_model(X,fit)
# Generate a small counts matrix. set.seed(1) out <- simulate_count_data(10,20,3) X <- out$X fit <- out[c("F","L")] class(fit) <- c("poisson_nmf_fit","list") # Compute the Poisson log-likelihoods and deviances. data.frame(loglik = loglik_poisson_nmf(X,fit), deviance = deviance_poisson_nmf(X,fit)) # Compute multinomial log-likelihoods. loglik_multinom_topic_model(X,fit)
Combine two or more topics in a multinomial topic model fit.
merge_topics(fit, k)
merge_topics(fit, k)
fit |
A multinomial topic model fit. |
k |
The names or numbers of the topics to be combined. Two or more topics should be chosen. |
Mixture proportions are combined by summation, and factors are combined by averaging.
A multinomial topic model fit.
This function recovers parameter estimates of the Poisson non-negative matrix factorization (NMF) given parameter estimates for a multinomial topic model.
multinom2poisson(fit, X)
multinom2poisson(fit, X)
fit |
An object of class “multinom_topic_model_fit”,
such as an output from |
X |
Optional n x m matrix of counts, or pseudocounts. It can
be a sparse matrix (class |
The return value is the list fit
, in which matrices
fit$F
and fit$L
specify the factors and loadings in
the Poisson non-negative matrix factorization; specifically,
the counts matrix is modeled by the low-rank matrix product
tcrossprod(fit$L,fit$F)
.
These data are a selection of the reference transcriptome profiles generated via single-cell RNA sequencing (RNA-seq) of 10 bead-enriched subpopulations of PBMCs (Donor A), described in Zheng et al (2017). The data are unique molecular identifier (UMI) counts for 16,791 genes in 3,774 cells. (Genes with no expression in any of the cells were removed.) Since the majority of the UMI counts are zero, they are efficiently stored as a 3,774 x 16,791 sparse matrix. These data are used in the vignette illustrating how 'fastTopics' can be used to analyze to single-cell RNA-seq data. Data for a separate set of 1,000 cells is provided as a “test set” to evaluate out-of-sample predictions.
pbmc_facs
is a list with the following elements:
3,774 x 16,791 sparse matrix of UMI counts, with
rows corresponding to samples (cells) and columns corresponding to
genes. It is an object of class "dgCMatrix"
).
UMI counts for an additional test set of 100 cells.
Data frame containing information about the samples, including cell barcode and source FACS population (“celltype” and “facs_subpop”).
Sample information for the additional test set of 100 cells.
Data frame containing information and the genes, including gene symbol and Ensembl identifier.
Poisson non-negative matrix factorization (NMF) fitted
to the UMI count data counts
, with rank k = 6
. See
the vignette how the Poisson NMF model fitting was performed.
https://www.10xgenomics.com/resources/datasets
G. X. Y. Zheng et al (2017). Massively parallel digital transcriptional profiling of single cells. Nature Communications 8, 14049. doi:10.1038/ncomms14049
library(Matrix) data(pbmc_facs) cat(sprintf("Number of cells: %d\n",nrow(pbmc_facs$counts))) cat(sprintf("Number of genes: %d\n",ncol(pbmc_facs$counts))) cat(sprintf("Proportion of counts that are non-zero: %0.1f%%.\n", 100*mean(pbmc_facs$counts > 0)))
library(Matrix) data(pbmc_facs) cat(sprintf("Number of cells: %d\n",nrow(pbmc_facs$counts))) cat(sprintf("Number of genes: %d\n",ncol(pbmc_facs$counts))) cat(sprintf("Proportion of counts that are non-zero: %0.1f%%.\n", 100*mean(pbmc_facs$counts > 0)))
Lightweight interface for rapidly producing low-dimensional embeddings from matrix factorizations or multinomial topic models. The defaults used are more suitable for producing embeddings from matrix factorizations or topic models.
pca_from_topics(fit, dims = 2, center = TRUE, scale. = FALSE, ...) tsne_from_topics( fit, dims = 2, pca = FALSE, normalize = FALSE, perplexity = 100, theta = 0.1, max_iter = 1000, eta = 200, check_duplicates = FALSE, verbose = TRUE, ... ) umap_from_topics( fit, dims = 2, n_neighbors = 30, metric = "euclidean", scale = "none", pca = NULL, verbose = TRUE, ... )
pca_from_topics(fit, dims = 2, center = TRUE, scale. = FALSE, ...) tsne_from_topics( fit, dims = 2, pca = FALSE, normalize = FALSE, perplexity = 100, theta = 0.1, max_iter = 1000, eta = 200, check_duplicates = FALSE, verbose = TRUE, ... ) umap_from_topics( fit, dims = 2, n_neighbors = 30, metric = "euclidean", scale = "none", pca = NULL, verbose = TRUE, ... )
fit |
An object of class “poisson_nmf_fit” or “multinom_topic_model_fit”. |
dims |
The number of dimensions in the embedding. In
|
center |
A logical value indicating whether columns of
|
scale. |
A logical value indicating whether columns of
|
... |
|
pca |
Whether to perform a PCA processing step in t-SNE or
UMAP; passed as argument “pca” to |
normalize |
Whether to normalize the data prior to running
t-SNE; passed as argument “normalize” to
|
perplexity |
t-SNE perplexity parameter, passed as argument
“perplexity” to |
theta |
t-SNE speed/accuracy trade-off parameter; passed as
argument “theta” to |
max_iter |
Maximum number of t-SNE iterations; passed as
argument “max_iter” to |
eta |
t-SNE learning rate parameter; passed as argument
“eta” to |
check_duplicates |
When |
verbose |
If |
n_neighbors |
Number of nearest neighbours in manifold
approximation; passed as argument “n_neighbors” to
|
metric |
Distance matrix used to find nearest neighbors;
passed as argument “metric” to
|
scale |
Scaling to apply to |
Note that since tsne_from_topics
and
umap_from_topics
use nonlinear transformations of the data,
distances between points are generally less interpretable than a
linear transformation obtained by, say, PCA.
An n x d matrix containing the embedding, where n is the
number of rows of fit$L
, and d = dims
.
Kobak, D. and Berens, P. (2019). The art of using t-SNE for single-cell transcriptomics. Nature Communications 10, 5416. doi:10.1038/s41467-019-13056-x
pca_plot
, tsne_plot
,
umap_plot
, prcomp
,
Rtsne
, umap
library(ggplot2) library(cowplot) set.seed(1) data(pbmc_facs) # Get the Poisson NMF and multinomial topic model fit to the PBMC data. fit1 <- multinom2poisson(pbmc_facs$fit) fit2 <- pbmc_facs$fit fit2 <- poisson2multinom(fit1) # Compute the first two PCs of the loadings matrix (for the topic # model, fit2, the loadings are the topic proportions). Y1 <- pca_from_topics(fit1) Y2 <- pca_from_topics(fit2) subpop <- pbmc_facs$samples$subpop quickplot(Y1[,1],Y1[,2],color = subpop) + theme_cowplot() quickplot(Y2[,1],Y2[,2],color = subpop) + theme_cowplot() # Compute a 2-d embedding of the loadings using t-SNE. Y1 <- tsne_from_topics(fit1) Y2 <- tsne_from_topics(fit2) quickplot(Y1[,1],Y1[,2],color = subpop) + theme_cowplot() quickplot(Y2[,1],Y2[,2],color = subpop) + theme_cowplot() # Compute a 2-d embedding of the loadings using UMAP. Y1 <- umap_from_topics(fit1) Y2 <- umap_from_topics(fit2) quickplot(Y1[,1],Y1[,2],color = subpop) + theme_cowplot() quickplot(Y2[,1],Y2[,2],color = subpop) + theme_cowplot()
library(ggplot2) library(cowplot) set.seed(1) data(pbmc_facs) # Get the Poisson NMF and multinomial topic model fit to the PBMC data. fit1 <- multinom2poisson(pbmc_facs$fit) fit2 <- pbmc_facs$fit fit2 <- poisson2multinom(fit1) # Compute the first two PCs of the loadings matrix (for the topic # model, fit2, the loadings are the topic proportions). Y1 <- pca_from_topics(fit1) Y2 <- pca_from_topics(fit2) subpop <- pbmc_facs$samples$subpop quickplot(Y1[,1],Y1[,2],color = subpop) + theme_cowplot() quickplot(Y2[,1],Y2[,2],color = subpop) + theme_cowplot() # Compute a 2-d embedding of the loadings using t-SNE. Y1 <- tsne_from_topics(fit1) Y2 <- tsne_from_topics(fit2) quickplot(Y1[,1],Y1[,2],color = subpop) + theme_cowplot() quickplot(Y2[,1],Y2[,2],color = subpop) + theme_cowplot() # Compute a 2-d embedding of the loadings using UMAP. Y1 <- umap_from_topics(fit1) Y2 <- umap_from_topics(fit2) quickplot(Y1[,1],Y1[,2],color = subpop) + theme_cowplot() quickplot(Y2[,1],Y2[,2],color = subpop) + theme_cowplot()
Create a plot showing the improvement in the log-likelihood as the rank of the matrix factorization or the number of topics (“k”) increases.
plot_loglik_vs_rank(fits, ggplot_call = loglik_vs_rank_ggplot_call) loglik_vs_rank_ggplot_call(dat, font.size = 9)
plot_loglik_vs_rank(fits, ggplot_call = loglik_vs_rank_ggplot_call) loglik_vs_rank_ggplot_call(dat, font.size = 9)
fits |
A list with 2 more list elements, in which each list
element is an object of class |
ggplot_call |
The function used to create the plot. Replace
|
dat |
A data frame passed as input to
|
font.size |
Font size used in plot. |
A ggplot
object.
Create a plot showing improvement in one or more Poisson NMF or multinomial topic model fits over time.
plot_progress( fits, x = c("timing", "iter"), y = c("loglik", "dev", "res"), add.point.every = 20, colors = c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7"), linetypes = "solid", linesizes = 0.5, shapes = 19, fills = "white", e = 0.01, theme = function() theme_cowplot(12) )
plot_progress( fits, x = c("timing", "iter"), y = c("loglik", "dev", "res"), add.point.every = 20, colors = c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7"), linetypes = "solid", linesizes = 0.5, shapes = 19, fills = "white", e = 0.01, theme = function() theme_cowplot(12) )
fits |
An object of class |
x |
Choose |
y |
Column of the "progress" data frame used to assess
progress of the Poisson NMF optimization method(s). Should be one
of |
add.point.every |
A positive integer giving the iteration
interval for drawing points on the progress curves. Set to
|
colors |
Colours used to draw progress curves; passed as the
|
linetypes |
Line types used to draw progress curves; passed as
the |
linesizes |
Line sizes used to draw progress curves; passed as
the |
shapes |
Shapes used to draw points at the selected
iterations; passed as the |
fills |
Fill colours used to draw points at the selected
iterations; passed as the |
e |
A small, positive number added to the vertical axis (for
|
theme |
The ‘ggplot2’ “theme”. |
The horizontal axis shows the recorded runtime (in s), and the vertical axis shows some quantity measuring the quality of the fit: the log-likelihood, deviance or maximum residual of the Karush-Kuhn-Tucker (KKT) first-order optimality conditions. To better visualize log-likelihoods and deviances, log-likelihood and deviance differences are shown on the logarithmic scale. Differences are calculated with respect to the best value achieved over all the fits compared.
Note that only minimal argument checking is performed.
A ggplot
object.
This function recovers parameter estimates of the multinomial topic model given parameter estimates for a Poisson non-negative matrix factorization (NMF).
poisson2multinom(fit)
poisson2multinom(fit)
fit |
An object of class “poisson_nmf_fit”, such as an
output from |
The return value is the list fit
, in which
fit$F
and fit$L
are the parameters of the multinomial
topic model; specifically, fit$L[i,]
gives the topic
probabilities for sample or document i, and fit$F[,k]
gives
the term probabilities for topic k. An additional vector
fit$s
of length n is returned giving the "size factors".
Predict loadings based on previously fit Poisson NMF,
or predict topic proportions based on previously fit multinomial
topic model. This can be thought of as projecting data points onto
a previously estimated set of factors fit$F
.
## S3 method for class 'poisson_nmf_fit' predict(object, newdata, numiter = 20, ...) ## S3 method for class 'multinom_topic_model_fit' predict(object, newdata, numiter = 20, ...)
## S3 method for class 'poisson_nmf_fit' predict(object, newdata, numiter = 20, ...) ## S3 method for class 'multinom_topic_model_fit' predict(object, newdata, numiter = 20, ...)
object |
An object of class “poisson_nmf_fit” or “multinom_topic_model_fit”. |
newdata |
An optional counts matrix. If omitted, the loadings estimated in the original data are returned. |
numiter |
The number of updates to perform. |
... |
Additional arguments passed to
|
A loadings matrix with one row for each data point and one
column for each topic or factor. For
predict.multinom_topic_model_fit
, the output can also be
interpreted as a matrix of estimated topic proportions, in which
L[i,j]
is the proportional contribution of topic j to data
point i.
# Simulate a 175 x 1,200 counts matrix. set.seed(1) dat <- simulate_count_data(175,1200,k = 3) # Split the data into training and test sets. train <- dat$X[1:100,] test <- dat$X[101:175,] # Fit a Poisson non-negative matrix factorization using the # training data. fit <- init_poisson_nmf(train,F = dat$F,init.method = "random") fit <- fit_poisson_nmf(train,fit0 = fit) # Compare the estimated loadings in the training data against the # loadings used to simulate these data. Ltrain <- predict(fit) plot(dat$L[1:100,],Ltrain,pch = 20,col = "darkblue") abline(a = 0,b = 1,col = "magenta",lty = "dotted", xlab = "true",ylab = "estimated") # Next, predict loadings in unseen (test) data points, and compare # these predictions against the loadings that were used to simulate # the test data. Ltest <- predict(fit,test) plot(dat$L[101:175,],Ltest,pch = 20,col = "darkblue", xlab = "true",ylab = "estimated") abline(a = 0,b = 1,col = "magenta",lty = "dotted") # Simulate a 175 x 1,200 counts matrix. set.seed(1) dat <- simulate_multinom_gene_data(175,1200,k = 3) # Split the data into training and test sets. train <- dat$X[1:100,] test <- dat$X[101:175,] # Fit a topic model using the training data. fit <- init_poisson_nmf(train,F = dat$F,init.method = "random") fit <- fit_poisson_nmf(train,fit0 = fit) fit <- poisson2multinom(fit) # Compare the estimated topic proportions in the training data against # the topic proportions used to simulate these data. Ltrain <- predict(fit) plot(dat$L[1:100,],Ltrain,pch = 20,col = "darkblue") abline(a = 0,b = 1,col = "magenta",lty = "dotted", xlab = "true",ylab = "estimated") # Next, predict loadings in unseen (test) data points, and compare # these predictions against the loadings that were used to simulate # the test data. Ltest <- predict(fit,test) plot(dat$L[101:175,],Ltest,pch = 20,col = "darkblue", xlab = "true",ylab = "estimated") abline(a = 0,b = 1,col = "magenta",lty = "dotted")
# Simulate a 175 x 1,200 counts matrix. set.seed(1) dat <- simulate_count_data(175,1200,k = 3) # Split the data into training and test sets. train <- dat$X[1:100,] test <- dat$X[101:175,] # Fit a Poisson non-negative matrix factorization using the # training data. fit <- init_poisson_nmf(train,F = dat$F,init.method = "random") fit <- fit_poisson_nmf(train,fit0 = fit) # Compare the estimated loadings in the training data against the # loadings used to simulate these data. Ltrain <- predict(fit) plot(dat$L[1:100,],Ltrain,pch = 20,col = "darkblue") abline(a = 0,b = 1,col = "magenta",lty = "dotted", xlab = "true",ylab = "estimated") # Next, predict loadings in unseen (test) data points, and compare # these predictions against the loadings that were used to simulate # the test data. Ltest <- predict(fit,test) plot(dat$L[101:175,],Ltest,pch = 20,col = "darkblue", xlab = "true",ylab = "estimated") abline(a = 0,b = 1,col = "magenta",lty = "dotted") # Simulate a 175 x 1,200 counts matrix. set.seed(1) dat <- simulate_multinom_gene_data(175,1200,k = 3) # Split the data into training and test sets. train <- dat$X[1:100,] test <- dat$X[101:175,] # Fit a topic model using the training data. fit <- init_poisson_nmf(train,F = dat$F,init.method = "random") fit <- fit_poisson_nmf(train,fit0 = fit) fit <- poisson2multinom(fit) # Compare the estimated topic proportions in the training data against # the topic proportions used to simulate these data. Ltrain <- predict(fit) plot(dat$L[1:100,],Ltrain,pch = 20,col = "darkblue") abline(a = 0,b = 1,col = "magenta",lty = "dotted", xlab = "true",ylab = "estimated") # Next, predict loadings in unseen (test) data points, and compare # these predictions against the loadings that were used to simulate # the test data. Ltest <- predict(fit,test) plot(dat$L[101:175,],Ltest,pch = 20,col = "darkblue", xlab = "true",ylab = "estimated") abline(a = 0,b = 1,col = "magenta",lty = "dotted")
Run HOMER motif finding algorithm
(findMotifsGenome.pl
) to identify motifs enriched for
differentially expressed (DE) genomic positions. See
http://homer.ucsd.edu for more information.
run_homer( de, k, positions, genome = "hg19", subset = function(postmean, lpval, lfsr, rank, quantile) lfsr < 0.05, homer.exec = "findMotifsGenome.pl", out.dir = tempdir(), homer.options = "-len 8,10,12 -size 200 -mis 2 -S 25 -p 1 -h", verbose = TRUE )
run_homer( de, k, positions, genome = "hg19", subset = function(postmean, lpval, lfsr, rank, quantile) lfsr < 0.05, homer.exec = "findMotifsGenome.pl", out.dir = tempdir(), homer.options = "-len 8,10,12 -size 200 -mis 2 -S 25 -p 1 -h", verbose = TRUE )
de |
An object of class “topic_model_de_analysis”,
usually the result of running |
k |
Use the DE analysis results for this topic. |
positions |
A table of genomic positions corresponding to rows
of the |
genome |
The genome parameter passed to
|
subset |
Describe input argument "subset" here. |
homer.exec |
The name or file path of the HOMER
|
out.dir |
The positions BED file and HOMER results are written to this directory. |
homer.options |
Character string used to override default
|
verbose |
When |
A data frame containing the motif enrichment results. It
is created from the knownResults.txt
HOMER output.
Heinz, S., Benner, C., Spann, N., Bertolino, E., Lin, Y. C., Laslo, P., Cheng, J. X., Murre, C., Singh, H. and Glass, C. K. (2010). Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Molecular Cell 38, 576-589.
This function can be used to extract estimates for a subset of the count data, or to re-order the rows of the loadings matrix.
## S3 method for class 'poisson_nmf_fit' select(.data, loadings, ...) ## S3 method for class 'multinom_topic_model_fit' select(.data, loadings, ...) select_loadings(.data, loadings, ...)
## S3 method for class 'poisson_nmf_fit' select(.data, loadings, ...) ## S3 method for class 'multinom_topic_model_fit' select(.data, loadings, ...) select_loadings(.data, loadings, ...)
.data |
Poisson NMF or Multinomial Topic Model fit; that is,
an object of class “poisson_nmf_fit” or
“multinom_topic_model_fit”, such as an output from
|
loadings |
Indices (names or numbers) giving data rows to keep. If not specified, all rows are kept. |
... |
Other arguments passed to the generic select function. |
A Poisson NMF or multinomial topic model fit containing the selected data rows only.
Simulate a counts matrix X
such that
X[i,j]
is Poisson with rate (mean) Y[i,j]
, where
Y = tcrossprod(L,F)
, L
is an n x k loadings
(“activations”) matrix, and F
is an m x k factors
(“basis vectors”) matrix. The entries of matrix L
are
drawn uniformly at random between zero and lmax
, and the
entries of matrix F
are drawn uniformly at random between 0
and fmax
.
simulate_count_data(n, m, k, fmax = 1, lmax = 1, sparse = FALSE)
simulate_count_data(n, m, k, fmax = 1, lmax = 1, sparse = FALSE)
n |
Number of rows in simulated count matrix. The number of rows should be at least 2. |
m |
Number of columns in simulated count matrix. The number of columns should be at least 2. |
k |
Number of factors, or “topics”, used to determine Poisson rates. The number of topics should be 1 or more. |
fmax |
Factors are drawn uniformly at random between zero and
|
lmax |
Loadings are drawn uniformly at random between zero and
|
sparse |
If |
Note that only minimal argument checking is performed. This function is mainly used to simulate small data sets for the examples and package tests.
The return value is a list containing the counts matrix
X
and the factorization, F
and L
, used to
generate the counts.
Simulate count data from a Poisson NMF model or multinomial topic model, in which topics represent “gene expression programs”, and gene expression programs are characterized by different rates of expression. The way in which the counts are simulated is modeled after gene expression studies in which expression is measured by single-cell RNA sequencing (“RNA-seq”) techniques: each row of the counts matrix corresponds a gene expression profile, each column corresponds to a gene, and each matrix element is a “read count”, or “UMI count”, measuring expression level. Factors are simulated so as to capture realistic changes in gene expression across different cell types. See “Details” for the procedure used to simulate factors, loadings and counts.
simulate_poisson_gene_data(n, m, k, s, p = 1, sparse = FALSE) simulate_multinom_gene_data(n, m, k, sparse = FALSE)
simulate_poisson_gene_data(n, m, k, s, p = 1, sparse = FALSE) simulate_multinom_gene_data(n, m, k, sparse = FALSE)
n |
Number of rows in the simulated count matrix. Should be at least 2. |
m |
Number of columns in the simulated count matrix. Should be at least 2. |
k |
Number of factors, or “topics”, used to generate the data. Should be 2 or more. |
s |
Vector of “size factors”; each row of the loadings
matrix |
p |
Probability that |
sparse |
If |
Here we describe the process for generating the n x k
loadings matrix L
and the m x k factors matrix F
.
Each row of the L
matrix is generated in the following
manner: (1) the number of nonzero mixture proportions is , with probability proportional to
;
(2) the indices of the nonzero mixture proportions are sampled
uniformly at random; and (3) the nonzero mixture proportions are
sampled from the Dirichlet distribution with
(so
that all topics are equally likely).
Each row of the factors matrix are generated according to the
following procedure: (1) generate , where
; (2) for each topic
, generate the Poisson rates as
, where
, and
. Factors can be interpreted as
Poisson rates or multinomial probabilities, so that individual
counts can be viewed as being generated from a weighted mixture
of “topics” with different rates or probabilities.
Once the loadings and factors have been generated, the counts are
simulated from either the Poisson NMF or multinomial topic model:
for the former, X[i,j]
is Poisson with rate Y[i,j]
,
where Y = tcrossprod(L,F)
; for the latter, X[i,]
is
multinomial with size s[i]
and with class probabilities
P[i,]
, where P = tcrossprod(L,F)
. For the multinomial
model only, the sizes s
are randomly generated as s =
10^rnorm(n,3,0.2)
.
Note that only minimal argument checking is performed; the function is mainly used to test implementation of the topic-model-based differential count analysis.
simulate_poisson_gene_data
returns a list containing
the counts matrix X
, and the size factors s
and
factorization, F
, L
, used to generate the counts.
simulate_multinom_gene_data
returns a list containing the
counts matrix X
, and the mixture proportions L
and
factors (gene probabilities, or relative gene expression levels)
F
used to generate the counts.
Simulate gene expression data (UMI counts) under a toy expression model. Samples (expression profiles) are drawn from a multinomial topic model in which topics are "gene programs".
simulate_toy_gene_data(n, m, k, s)
simulate_toy_gene_data(n, m, k, s)
n |
The number of samples (gene expression profiles) to simulate. |
m |
The number of counts (genes) to simulate. |
k |
The number of topics ("gene programs") used to simulate the data. |
s |
A scalar specifying the total expression of each sample;
it specifies the "size" parameter in the calls to
|
The mixture proportions are generated as follows. With probability 0.9, one proportion is one, or close to one, and the remaining are zero, or close to zero; that is, the counts are primarily generated from a single gene program. Otherwise (wtth probability 0.1), the mixture proportions are roughly equal.
Gene frequencies are drawn uniformly at random from [0,1].
The return value is a list containing the counts matrix
X
, and the gene frequencies F
and mixture proportions
L
used to generate the counts.
Create a “Structure plot” from a multinomial topic model fit or other model with “loadings” or “weights”. The Structure plot represents the estimated topic proportions of each sample in a stacked bar chart, with bars of different colors representing different topics. Consequently, samples that have similar topic proportions have similar amounts of each color.
structure_plot( fit, topics, grouping, loadings_order = "embed", n = 2000, colors, gap = 1, embed_method = structure_plot_default_embed_method, ggplot_call = structure_plot_ggplot_call, ... ) structure_plot_default_embed_method(fit, ...) ## S3 method for class 'poisson_nmf_fit' plot(x, ...) ## S3 method for class 'multinom_topic_model_fit' plot(x, ...) structure_plot_ggplot_call(dat, colors, ticks = NULL, font.size = 9)
structure_plot( fit, topics, grouping, loadings_order = "embed", n = 2000, colors, gap = 1, embed_method = structure_plot_default_embed_method, ggplot_call = structure_plot_ggplot_call, ... ) structure_plot_default_embed_method(fit, ...) ## S3 method for class 'poisson_nmf_fit' plot(x, ...) ## S3 method for class 'multinom_topic_model_fit' plot(x, ...) structure_plot_ggplot_call(dat, colors, ticks = NULL, font.size = 9)
fit |
An object of class “poisson_nmf_fit” or
“multinom_topic_model_fit”, or an n x k matrix of topic
proportions, where k is the number of topics. (The elements in each
row of this matrix should sum to 1.) If a Poisson NMF fit is
provided as input, the corresponding multinomial topic model fit is
automatically recovered using |
topics |
Top-to-bottom ordering of the topics in the Structure
plot; |
grouping |
Optional categorical variable (a factor) with one
entry for each row of the loadings matrix |
loadings_order |
Ordering of the rows of the loadings matrix
|
n |
The maximum number of samples (rows of the loadings matrix
|
colors |
Colors used to draw topics in Structure plot. |
gap |
The horizontal spacing between groups. Ignored if
|
embed_method |
The function used to compute an 1-d embedding
from a loadings matrix |
ggplot_call |
The function used to create the plot. Replace
|
... |
Additional arguments passed to |
x |
An object of class “poisson_nmf_fit” or
“multinom_topic_model_fit”. If a Poisson NMF fit is provided
as input, the corresponding multinomial topic model fit is
automatically recovered using |
dat |
A data frame passed as input to
|
ticks |
The placement of the group labels along the horizontal
axis, and their names. For data that are not grouped, use
|
font.size |
Font size used in plot. |
The name “Structure plot” comes from its widespread use in population genetics to visualize the results of the Structure software (Rosenberg et al, 2002).
For most uses of the Structure plot in population genetics, there is usually some grouping of the samples (e.g., assignment to pre-defined populations) that guides arrangement of the samples along the horizontal axis in the bar chart. In other applications, such as analysis of gene expression data, a pre-defined grouping may not always be available. Therefore, a “smart” arrangement of the samples is, by default, generated automatically by performing a 1-d embedding of the samples.
A ggplot
object.
Dey, K. K., Hsiao, C. J. and Stephens, M. (2017). Visualizing the structure of RNA-seq expression data using grade of membership models. PLoS Genetics 13, e1006599.
Rosenberg, N. A., Pritchard, J. K., Weber, J. L., Cann, H. M., Kidd, K. K., Zhivotovsky, L. A. and Feldman, M. W. (2002). Genetic structure of human populations. Science 298, 2381–2385.
set.seed(1) data(pbmc_facs) # Get the multinomial topic model fitted to the # PBMC data. fit <- pbmc_facs$fit # Create a Structure plot without labels. The samples (rows of L) are # automatically arranged along the x-axis using t-SNE to highlight the # structure in the data. p1a <- structure_plot(fit) # The first argument to structure_plot may also be an "L" matrix. # This call to structure_plot should produce the exact same plot as # the previous call. set.seed(1) p1b <- structure_plot(fit$L) # There is no requirement than the rows of L sum up to 1. To # illustrate, in this next example we have removed topic 5 from the a # structure plot. p2a <- structure_plot(fit$L[,-5]) # This is perhaps a more elegant way to remove topic 5 from the # structure plot: p2b <- structure_plot(fit,topics = c(1:4,6)) # Create a Structure plot with the FACS cell-type labels. Within each # group (cell-type), the cells (rows of L) are automatically arranged # using t-SNE. subpop <- pbmc_facs$samples$subpop p3 <- structure_plot(fit,grouping = subpop) # Next, we apply some customizations to improve the plot: (1) use the # "topics" argument to specify the order in which the topic # proportions are stacked on top of each other; (2) use the "gap" # argrument to increase the whitespace between the groups; (3) use "n" # to decrease the number of rows of L included in the Structure plot; # and (4) use "colors" to change the colors used to draw the topic # proportions. topic_colors <- c("skyblue","forestgreen","darkmagenta", "dodgerblue","gold","darkorange") p4 <- structure_plot(fit,grouping = pbmc_facs$samples$subpop,gap = 20, n = 1500,topics = c(5,6,1,4,2,3),colors = topic_colors) # In this example, we use UMAP instead of t-SNE to arrange the # cells in the Structure plot. Note that this can be accomplished in # a different way by overriding the default setting of # "embed_method". y <- drop(umap_from_topics(fit,dims = 1)) p5 <- structure_plot(fit,loadings_order = order(y),grouping = subpop, gap = 40,colors = topic_colors) # We can also use PCA to arrange the cells. y <- drop(pca_from_topics(fit,dims = 1)) p6 <- structure_plot(fit,loadings_order = order(y),grouping = subpop, gap = 40,colors = topic_colors) # In this final example, we plot a random subset of 400 cells, and # arrange the cells randomly along the horizontal axis of the # Structure plot. p7 <- structure_plot(fit,loadings_order = sample(3744,400),gap = 10, grouping = subpop,colors = topic_colors)
set.seed(1) data(pbmc_facs) # Get the multinomial topic model fitted to the # PBMC data. fit <- pbmc_facs$fit # Create a Structure plot without labels. The samples (rows of L) are # automatically arranged along the x-axis using t-SNE to highlight the # structure in the data. p1a <- structure_plot(fit) # The first argument to structure_plot may also be an "L" matrix. # This call to structure_plot should produce the exact same plot as # the previous call. set.seed(1) p1b <- structure_plot(fit$L) # There is no requirement than the rows of L sum up to 1. To # illustrate, in this next example we have removed topic 5 from the a # structure plot. p2a <- structure_plot(fit$L[,-5]) # This is perhaps a more elegant way to remove topic 5 from the # structure plot: p2b <- structure_plot(fit,topics = c(1:4,6)) # Create a Structure plot with the FACS cell-type labels. Within each # group (cell-type), the cells (rows of L) are automatically arranged # using t-SNE. subpop <- pbmc_facs$samples$subpop p3 <- structure_plot(fit,grouping = subpop) # Next, we apply some customizations to improve the plot: (1) use the # "topics" argument to specify the order in which the topic # proportions are stacked on top of each other; (2) use the "gap" # argrument to increase the whitespace between the groups; (3) use "n" # to decrease the number of rows of L included in the Structure plot; # and (4) use "colors" to change the colors used to draw the topic # proportions. topic_colors <- c("skyblue","forestgreen","darkmagenta", "dodgerblue","gold","darkorange") p4 <- structure_plot(fit,grouping = pbmc_facs$samples$subpop,gap = 20, n = 1500,topics = c(5,6,1,4,2,3),colors = topic_colors) # In this example, we use UMAP instead of t-SNE to arrange the # cells in the Structure plot. Note that this can be accomplished in # a different way by overriding the default setting of # "embed_method". y <- drop(umap_from_topics(fit,dims = 1)) p5 <- structure_plot(fit,loadings_order = order(y),grouping = subpop, gap = 40,colors = topic_colors) # We can also use PCA to arrange the cells. y <- drop(pca_from_topics(fit,dims = 1)) p6 <- structure_plot(fit,loadings_order = order(y),grouping = subpop, gap = 40,colors = topic_colors) # In this final example, we plot a random subset of 400 cells, and # arrange the cells randomly along the horizontal axis of the # Structure plot. p7 <- structure_plot(fit,loadings_order = sample(3744,400),gap = 10, grouping = subpop,colors = topic_colors)
summary
method for the “poisson_nmf_fit”
and “multinom_topic_model_fit” classes.
## S3 method for class 'poisson_nmf_fit' summary(object, ...) ## S3 method for class 'multinom_topic_model_fit' summary(object, ...) ## S3 method for class 'summary.poisson_nmf_fit' print(x, show.mixprops = FALSE, show.topic.reps = FALSE, ...) ## S3 method for class 'summary.multinom_topic_model_fit' print( x, show.size.factors = FALSE, show.mixprops = FALSE, show.topic.reps = FALSE, ... )
## S3 method for class 'poisson_nmf_fit' summary(object, ...) ## S3 method for class 'multinom_topic_model_fit' summary(object, ...) ## S3 method for class 'summary.poisson_nmf_fit' print(x, show.mixprops = FALSE, show.topic.reps = FALSE, ...) ## S3 method for class 'summary.multinom_topic_model_fit' print( x, show.size.factors = FALSE, show.mixprops = FALSE, show.topic.reps = FALSE, ... )
object |
An object of class “poisson_nmf_fit” or
“multinom_topic_model_fit”. The former is usually the result
of calling |
... |
Additional arguments passed to the generic |
x |
An object of class “summary.poisson_nmf_fit”,
usually a result of a call to |
show.mixprops |
If |
show.topic.reps |
If |
show.size.factors |
If |
The functions summary.poisson_nmf_fit
and
summary.multinom_topic_model_fit
compute and return a list
of statistics summarizing the model fit. The returned list
includes some or all of the following elements:
n |
The number of rows in the counts matrix, typically the number of samples. |
m |
The number of columns in the counts matrix, typically the number of observed counts per sample. |
k |
The rank of the Poisson NMF or the number of topics. |
s |
A vector of length n giving the "size factor" estimates; these estimates should be equal, or close to, the total counts in each row of the counts matrix. |
numiter |
The number of loadings and/or factor updates performed. |
loglik |
The Poisson NMF log-likelihood. |
loglik.multinom |
The multinomial topic model log-likelihood. |
dev |
The Poisson NMF deviance. |
res |
The maximum residual of the Karush-Kuhn-Tucker (KKT) first-order optimality conditions. This can be used to assess convergence of the updates to a (local) solution. |
mixprops |
Matrix giving a high-level summary of the mixture proportions, in which rows correspond to topics, and columns are ranges of mixture proportionss. |
topic.reps |
A matrix in which the ith row gives the mixture proportions for the sample "most representative" of topic i; by "most representative", we mean the row (or sample) with the highest proportion of counts drawn from the topic i. |
Create a “volcano” plot to visualize the
results of a differential count analysis using a topic model. Here,
the volcano plot is a scatterplot in which the posterior mean
log-fold change (LFC), estimated by running the methods implemented
in de_analysis
, is plotted against the estimated
z-score. Variations on this volcano plot may also be created, for
example by showing f0 (the null-model estimates) instead of the
z-scores. Use volcano_plotly
to create an interactive
volcano plot.
volcano_plot( de, k, labels, y = c("z", "f0"), do.label = volcano_plot_do_label_default, ymin = 1e-06, ymax = Inf, max.overlaps = Inf, plot.title = paste("topic", k), ggplot_call = volcano_plot_ggplot_call ) ## S3 method for class 'topic_model_de_analysis' plot(x, ...) volcano_plotly( de, k, file, labels, y = c("z", "f0"), ymin = 1e-06, ymax = Inf, width = 500, height = 500, plot.title = paste("topic", k), plot_ly_call = volcano_plot_ly_call ) volcano_plot_do_label_default(lfc, y) volcano_plot_ggplot_call(dat, y, plot.title, max.overlaps = Inf, font.size = 9) volcano_plot_ly_call(dat, y, plot.title, width, height)
volcano_plot( de, k, labels, y = c("z", "f0"), do.label = volcano_plot_do_label_default, ymin = 1e-06, ymax = Inf, max.overlaps = Inf, plot.title = paste("topic", k), ggplot_call = volcano_plot_ggplot_call ) ## S3 method for class 'topic_model_de_analysis' plot(x, ...) volcano_plotly( de, k, file, labels, y = c("z", "f0"), ymin = 1e-06, ymax = Inf, width = 500, height = 500, plot.title = paste("topic", k), plot_ly_call = volcano_plot_ly_call ) volcano_plot_do_label_default(lfc, y) volcano_plot_ggplot_call(dat, y, plot.title, max.overlaps = Inf, font.size = 9) volcano_plot_ly_call(dat, y, plot.title, width, height)
de |
An object of class “topic_model_de_analysis”,
usually an output from |
k |
The topic, selected by number or name. |
labels |
Character vector specifying how the points in the
volcano plot are labeled. This should be a character vector with
one entry per LFC estimate (row of |
y |
A vector of the same length as |
do.label |
The function used to deetermine which LFC estimates
to label. Replace |
ymin |
Y-axis values less than |
ymax |
Y-axis values greater than |
max.overlaps |
Argument passed to
|
plot.title |
The title of the plot. |
ggplot_call |
The function used to create the plot. Replace
|
x |
An object of class “topic_model_de_analysis”,
usually an output from |
... |
Additional arguments passed to |
file |
Save the interactive volcano plot to this HTML
file using |
width |
Width of the plot in pixels. Passed as argument
“width” to |
height |
Height of the plot in pixels. Passed as argument
“height” to |
plot_ly_call |
The function used to create the plot. Replace
|
lfc |
A vector of log-fold change estimates. |
dat |
A data frame passed as input to
|
font.size |
Font size used in plot. |
Interactive volcano plots can be created using the ‘plotly’ package. The “hover text” shows the label and detailed LFC statistics.
A ggplot
object or a plotly
object.
# See help(de_analysis) for examples.
# See help(de_analysis) for examples.