Introduction to the empirical Bayes normal means model via shrinkage estimation

The normal means model and empirical Bayes

Given n observations xi with known standard deviations si > 0, i = 1, …, n, the normal means model (Robbins 1951; Efron and Morris 1972; Stephens 2017; Bhadra et al. 2019; Johnstone 2019; Sun 2020) has

xi | θi ∼ 𝒩(θi, si2),

where the unknown (“true”) means θi are the quantities to be estimated. Here and throughout, we use 𝒩(μ, σ2) to denote the normal distribution with mean μ and variance σ2.

The empirical Bayes (EB) approach to inferring θi attempts to improve upon the maximum-likelihood estimate θ̂i = xi by “borrowing information” across observations, exploiting the fact that each observation contains information not only about its respective mean, but also about how the means are collectively distributed (Robbins 1956; Morris 1983; Efron 2010; Stephens 2017). Specifically, the empirical Bayes normal means (EBNM) approach assumes that

θ1, …, θn ∼ g ∈ 𝒢,

where 𝒢 is some family of distributions that is specified in advance and g ∈ 𝒢 is estimated using the data.

The EBNM model is fit by first using all of the observations to estimate the prior g ∈ 𝒢, and then using the estimated distribution to compute posteriors and/or posterior summaries for the “true” means θi. Commonly, g is estimated via maximum-likelihood and posterior means are used as point estimates for the unknown means. The ebnm package provides a unified interface for efficiently carrying out both steps, with a wide range of available options for the prior family 𝒢.

For a detailed introduction, see our ebnm paper. For further background, see for example John Storey’s book.

An illustration: shrinkage estimation

Our example data set consists of 400 data points simulated from a normal means model in which the true prior g is a mixture of (a) a point-mass at 2 and (b) a normal distribution also centered at 2:

θi ∼ 0.8δ2 + 0.2N(2, 1)

First, we simulate the “true” means θi from this prior:

set.seed(1)
n <- 400
u <- 2 + (runif(n) < 0.2) * rnorm(n)

Next, we simulate the observed means xi as “noisy” estimates of the true means (in this example, the noise is homoskedastic):

xi ∼ N(θi, si2),  si = 1/3,

s <- rep(1/3, n)
x <- u + s * rnorm(n)

Although we know what the true means are in this example, we’ll treat them as quantities we cannot observe.

The maximum-likelihood estimates (MLEs) of the true means are simply θ̂i = xi:

par(mar = c(4, 4, 2, 2))
lims <- c(-0.55, 5.05)
plot(u, x, pch = 4, cex = 0.75, xlim = lims, ylim = lims,
     xlab = "true value", ylab = "estimate", main = "MLE")
abline(a = 0, b = 1, col = "magenta", lty = "dotted")

We can do better than the MLE — and in fact some theory tells us we are guaranteed to do better — by learning a prior using all the observations, then “shrinking” the estimates toward this prior.

Let’s illustrate this idea with a simple normal prior in which the mean and variance of the normal prior are learned from the data. (Note that the normal prior is the wrong prior for this data set! Recall that we simulated the data using a mixture of a normal and a point-mass.)
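To make the idea of “shrinking” concrete: under a normal prior N(μ, τ2), the posterior mean of θi is a precision-weighted average of the observation xi and the prior mean μ. The short sketch below computes this formula directly, using illustrative placeholder values for μ and τ; in the actual analysis, ebnm() estimates these quantities from the data.

# Posterior mean under a normal prior N(mu, tau^2): a weighted average of the
# observation and the prior mean, with weight w = tau^2 / (tau^2 + s^2) on the
# observation. The values of mu and tau here are illustrative placeholders,
# not fitted values.
mu  <- 2
tau <- 0.5
w   <- tau^2 / (tau^2 + s^2)
theta_hat <- w * x + (1 - w) * mu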

First, we fit the prior:

library("ebnm")
fit_normal <- ebnm(x, s, prior_family = "normal", mode = "estimate")

Next we estimate the true means using the posterior means θ̂i = E[θi | xi, ĝ]. We extract these posterior means using the coef() method:

y <- coef(fit_normal)
par(mar = c(4, 4, 2, 2))
plot(u, y, pch = 4, cex = 0.75, xlim = lims, ylim = lims,
     xlab = "true value", ylab = "estimate", main = "normal prior")
abline(a = 0, b = 1, col = "magenta", lty = "dotted")

These “shrunken” estimates are better when the true means θi are near 2, but worse when they are far from 2. Still, they substantially reduce the overall estimation error, as measured by the root mean-squared error (RMSE):

err_mle           <- (x - u)^2
err_shrink_normal <- (y - u)^2
print(round(digits = 4,
            x = c(mle           = sqrt(mean(err_mle)),
                  shrink_normal = sqrt(mean(err_shrink_normal)))))
#           mle shrink_normal 
#        0.3599        0.2868

Here’s a more detailed comparison of the estimation error:

par(mar = c(4, 4, 2, 2))
plot(err_mle, err_shrink_normal, pch = 4, cex = 0.75,
     xlim = c(0, 1.2), ylim = c(0, 1.2))
abline(a = 0, b = 1, col = "magenta", lty = "dotted")

Indeed, the error increases in a few of the estimates and decreases in many of the other estimates, resulting in a lower RMSE over the 400 data points.
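A quick way to summarize this trade-off, using the squared errors computed above, is to count how many estimates improved and by how much on average:

# Fraction of observations whose squared error is smaller under shrinkage than
# under the MLE, and the average change in squared error.
mean(err_shrink_normal < err_mle)
mean(err_shrink_normal - err_mle)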

Let’s now see what happens when we use a family of priors that is better suited to this data set — specifically, the “point-normal” family. Notice that the only change we make in our call to ebnm() is in the prior_family argument:

fit_pn <- ebnm(x, s, prior_family = "point_normal", mode = "estimate")

Now we extract the posterior mean estimates and compare to the true values:

par(mar = c(4, 4, 2, 2))
y <- coef(fit_pn)
plot(u, y, pch = 4, cex = 0.75, xlim = lims, ylim = lims,
     xlab = "true value", ylab = "estimate", main = "point-normal prior")
abline(a = 0, b = 1, col = "magenta", lty = "dotted")

The added flexibility of the point-normal prior improves the accuracy of estimates for means near 2, while estimates for means far from 2 are no worse than the MLEs. The result is that the overall RMSE again sees a substantial improvement:

err_shrink_pn <- (y - u)^2
print(round(digits = 4,
            x = c(mle = sqrt(mean(err_mle)),
                  normal = sqrt(mean(err_shrink_normal)),
                  point_normal = sqrt(mean(err_shrink_pn)))))
#          mle       normal point_normal 
#       0.3599       0.2868       0.2100
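Since each prior g is estimated by maximum likelihood, another way to compare the two prior families is via the log-likelihoods they attain; assuming the fitted objects store this value in their log_likelihood component (see the ebnm documentation for the exact structure of the returned object), a quick comparison is:

# Maximized log likelihood attained by each fitted prior; a higher value
# indicates that the prior family describes the observed data better.
c(normal       = fit_normal$log_likelihood,
  point_normal = fit_pn$log_likelihood)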

Session information

The following R version and packages were used to generate this vignette:

sessionInfo()
# R version 4.4.1 (2024-06-14)
# Platform: x86_64-pc-linux-gnu
# Running under: Ubuntu 24.04.1 LTS
# 
# Matrix products: default
# BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
# LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
# 
# locale:
#  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
# [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
# 
# time zone: Etc/UTC
# tzcode source: system (glibc)
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] mcmc_0.9-8     cowplot_1.1.3  ggplot2_3.5.1  ebnm_1.1-34    rmarkdown_2.28
# 
# loaded via a namespace (and not attached):
#  [1] sass_0.4.9         utf8_1.2.4         generics_0.1.3     ashr_2.2-66       
#  [5] lattice_0.22-6     digest_0.6.37      magrittr_2.0.3     RColorBrewer_1.1-3
#  [9] evaluate_0.24.0    grid_4.4.1         fastmap_1.2.0      jsonlite_1.8.8    
# [13] Matrix_1.7-0       mixsqp_0.3-54      fansi_1.0.6        scales_1.3.0      
# [17] truncnorm_1.0-9    invgamma_1.1       jquerylib_0.1.4    cli_3.6.3         
# [21] rlang_1.1.4        deconvolveR_1.2-1  munsell_0.5.1      splines_4.4.1     
# [25] withr_3.0.1        cachem_1.1.0       yaml_2.3.10        tools_4.4.1       
# [29] SQUAREM_2021.1     dplyr_1.1.4        colorspace_2.1-1   buildtools_1.0.0  
# [33] vctrs_0.6.5        R6_2.5.1           lifecycle_1.0.4    trust_0.1-8       
# [37] irlba_2.3.5.1      pkgconfig_2.0.3    pillar_1.9.0       bslib_0.8.0       
# [41] gtable_0.3.5       glue_1.7.0         Rcpp_1.0.13        highr_0.11        
# [45] xfun_0.47          tibble_3.2.1       tidyselect_1.2.1   sys_3.4.2         
# [49] knitr_1.48         farver_2.1.2       htmltools_0.5.8.1  maketools_1.3.0   
# [53] labeling_0.4.3     compiler_4.4.1     horseshoe_0.2.0

References

Bhadra, Anindya, Jyotishka Datta, Nicholas G. Polson, and Brandon Willard. 2019. “Lasso Meets Horseshoe: A Survey.” Statistical Science 34 (3): 405–27.
Efron, Bradley. 2010. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Vol. 1. Institute of Mathematical Statistics Monographs. Cambridge, UK: Cambridge University Press.
Efron, Bradley, and Carl Morris. 1972. “Limiting the Risk of Bayes and Empirical Bayes Estimators—Part II: The Empirical Bayes Case.” Journal of the American Statistical Association 67 (337): 130–39.
Johnstone, Iain. 2019. “Gaussian Estimation: Sequence and Wavelet Models.” https://imjohnstone.su.domains/.
Morris, Carl N. 1983. “Parametric Empirical Bayes Inference: Theory and Applications.” Journal of the American Statistical Association 78 (381): 47–55.
Robbins, Herbert. 1951. “Asymptotically Subminimax Solutions of Compound Statistical Decision Problems.” In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, 1951, Vol. II, 131–49. University of California Press, Berkeley; Los Angeles, CA.
———. 1956. “An Empirical Bayes Approach to Statistics.” In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1956, Vol. I, 157–63. University of California Press, Berkeley; Los Angeles, CA.
Stephens, Matthew. 2017. “False Discovery Rates: A New Deal.” Biostatistics 18 (2): 275–94.
Sun, Lei. 2020. “Topics on Empirical Bayes Normal Means.” PhD thesis, Chicago, IL: University of Chicago.