Lecture 21
bench

# A tibble: 4 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 d[d$x > 0.5, ] 74.5µs 89.5µs 11091. 236.52KB 36.0
2 d[which(d$x > 0.5), ] 76µs 90.8µs 10827. 271.91KB 65.4
3 subset(d, x > 0.5) 92.6µs 114.1µs 8756. 289.07KB 53.9
4 filter(d, x > 0.5) 249.5µs 292.2µs 3438. 1.48MB 21.3
# A tibble: 4 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 d[d$x > 0.5, ] 7.21ms 7.52ms 133. 13.3MB 79.8
2 d[which(d$x > 0.5), ] 8.64ms 8.97ms 111. 24.8MB 144.
3 subset(d, x > 0.5) 11.16ms 11.55ms 86.0 24.8MB 108.
4 filter(d, x > 0.5) 8.38ms 9.5ms 103. 24.8MB 70.2
bench - relative results

# A tibble: 4 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 d[d$x > 0.5, ] 1 1 1.55 1 1.14
2 d[which(d$x > 0.5), ] 1.20 1.19 1.29 1.86 2.05
3 subset(d, x > 0.5) 1.55 1.54 1 1.86 1.53
4 filter(d, x > 0.5) 1.16 1.26 1.20 1.86 1
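A sketch of how results like these can be generated with bench::mark() (the data frame d and its size here are assumptions; check = FALSE is needed since the four approaches differ in row names):

```r
library(dplyr)

d <- data.frame(x = runif(1e4), y = runif(1e4))  # hypothetical data

b <- bench::mark(
  d[d$x > 0.5, ],
  d[which(d$x > 0.5), ],
  subset(d, x > 0.5),
  filter(d, x > 0.5),
  check = FALSE  # results are equivalent but row names differ
)
b                            # absolute results
summary(b, relative = TRUE)  # relative results
```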
An awful lot of statistics is at its core linear algebra.
For example:
\[ \hat{\beta} = (X^T X)^{-1} X^T y \]
Principal component analysis
Find \(T = XW\) where \(W\) is a matrix whose columns are the eigenvectors of \(X^TX\).
Often solved via SVD - Let \(X = U\Sigma W^T\) then \(T = U\Sigma\).
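Both computations are only a few lines of base R; a quick sketch with synthetic X and y:

```r
set.seed(1)
X <- matrix(rnorm(100 * 3), ncol = 3)
y <- X %*% c(1, -2, 0.5) + rnorm(100)

# Least squares - solving the normal equations is preferable
# to explicitly forming the inverse of X^T X
beta_hat <- solve(crossprod(X), crossprod(X, y))

# PCA via SVD: X = U Sigma W^T, so the scores are T = U Sigma = X W
s <- svd(scale(X, scale = FALSE))  # center the columns first
T_scores <- s$u %*% diag(s$d)      # equivalently scale(X, scale = FALSE) %*% s$v
```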
These problems are not unique to statistics - the same types of problems come up across all areas of numerical computing.
Numerical linear algebra \(\ne\) mathematical linear algebra
Efficiency and stability of numerical algorithms matter
Don’t reinvent the wheel - common core linear algebra tools (well defined API)
BLAS - Basic Linear Algebra Subprograms
Low level algorithms for common linear algebra operations
Copying, scaling, multiplying vectors and matrices
Origins go back to 1979, written in Fortran
LAPACK - Linear Algebra Package
Higher level functionality building on BLAS
Linear solvers, eigenvalues, and matrix decompositions
Origins go back to 1992, mostly Fortran (expanded on LINPACK, EISPACK)
Most default BLAS and LAPACK implementations (like R’s defaults) are somewhat dated
Written in Fortran and designed for a single CPU core
Certain (potentially non-optimal) hard-coded defaults (e.g. block size)
Multithreaded alternatives:
ATLAS - Automatically Tuned Linear Algebra Software
OpenBLAS - fork of GotoBLAS from TACC at UTexas
Intel MKL - Math Kernel Library, part of Intel’s commercial compiler tools
cuBLAS / Magma - GPU libraries from Nvidia and UTK respectively
Accelerate / vecLib - Apple’s framework for GPU and multicore computing
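Which BLAS / LAPACK an R installation is linked against is reported by sessionInfo(); the output below is from the machine used for these slides (note the Accelerate / vecLib BLAS):

```r
sessionInfo()
```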
R version 4.5.2 (2025-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Tahoe 26.1
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
locale:
[1] C.UTF-8/en_US.UTF-8/C.UTF-8/C/C.UTF-8/C.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] parallel stats graphics grDevices utils datasets
[7] methods base
other attached packages:
[1] lubridate_1.9.4 forcats_1.0.1 stringr_1.5.2 dplyr_1.1.4
[5] purrr_1.1.0 readr_2.1.5 tidyr_1.3.1 tibble_3.3.0
[9] ggplot2_4.0.0 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] Matrix_1.7-4 gtable_0.3.6 jsonlite_2.0.0
[4] compiler_4.5.2 Rcpp_1.1.0 tidyselect_1.2.1
[7] scales_1.4.0 yaml_2.3.10 fastmap_1.2.0
[10] lattice_0.22-7 R6_2.6.1 generics_0.1.4
[13] knitr_1.50 pillar_1.11.1 RColorBrewer_1.1-3
[16] tzdb_0.5.0 rlang_1.1.6 utf8_1.2.6
[19] stringi_1.8.7 xfun_0.54 S7_0.2.0
[22] timechange_0.3.0 cli_3.6.5 withr_3.0.2
[25] magrittr_2.0.4 digest_0.6.37 grid_4.5.2
[28] hms_1.1.4 lifecycle_1.0.4 vctrs_0.6.5
[31] bench_1.1.4 evaluate_1.0.5 glue_1.8.0
[34] farver_2.1.2 profmem_0.7.0 rmarkdown_2.30
[37] tools_4.5.2 pkgconfig_2.0.3 htmltools_0.5.8.1
Elapsed time in seconds by problem size n and number of cores:

| n | 1 core | 2 cores | 4 cores | 8 cores | 16 cores |
|---|---|---|---|---|---|
| 100 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 500 | 0.004 | 0.003 | 0.002 | 0.002 | 0.004 |
| 1000 | 0.028 | 0.016 | 0.010 | 0.007 | 0.009 |
| 2000 | 0.207 | 0.110 | 0.058 | 0.035 | 0.039 |
| 3000 | 0.679 | 0.352 | 0.183 | 0.103 | 0.081 |
| 4000 | 1.587 | 0.816 | 0.418 | 0.227 | 0.145 |
| 5000 | 3.104 | 1.583 | 0.807 | 0.453 | 0.266 |
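A sketch of how a table like this might be produced, assuming the timed operation is a matrix multiply and that the RhpcBLASctl package is available to control the number of BLAS threads:

```r
library(RhpcBLASctl)

ns    <- c(100, 500, 1000, 2000, 3000, 4000, 5000)
cores <- c(1, 2, 4, 8, 16)
res   <- matrix(NA_real_, length(ns), length(cores),
                dimnames = list(n = ns, cores = cores))

for (j in seq_along(cores)) {
  blas_set_num_threads(cores[j])  # cap the BLAS thread count
  for (i in seq_along(ns)) {
    X <- matrix(rnorm(ns[i]^2), ns[i], ns[i])
    res[i, j] <- system.time(X %*% X)["elapsed"]
  }
}
res
```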
parallel
Part of the base packages in R
Tools for forking R processes (some functions do not work on Windows)
Some core functions:
detectCores
pvec
mclapply
mcparallel & mccollect
There are many other functions for creating and managing “clusters” of R processes, but we won’t cover those here.
detectCores
Surprisingly, this detects the number of cores of the current system.
Take the reported number with a grain of salt - it is important to know something about the hardware you are using. Modern CPUs are complicated and have many features (e.g. hyperthreading, P vs E cores, etc.) that can make this number only part of the whole story.
This number is also not aware of other users / processes on the system or job scheduling constraints - so be considerate and conservative when choosing how many cores to use for your parallel computations.
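For example (counts will vary by machine):

```r
parallel::detectCores()                 # logical cores - may include hyperthreads
parallel::detectCores(logical = FALSE)  # physical cores only
```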
pvec
Parallelize a vectorized function call. FUN must be a vectorized pure function (i.e. no side effects) that returns a vector the same length as x.
bench::system_time
If you need more precise timing information, bench::system_time is a good alternative.
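A sketch combining the two (forking, so not available on Windows; the choice of mc.cores = 4 is arbitrary):

```r
library(parallel)

x <- runif(1e7)
bench::system_time(sqrt(x))                     # serial
bench::system_time(pvec(x, sqrt, mc.cores = 4)) # split across 4 forked processes
```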
mclapply
Implements a parallelized version of lapply. It relies on forking and hence is not available on Windows unless mc.cores = 1.
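A small illustration - returning the process id shows how the work is split across children (mc.cores = 2 is arbitrary):

```r
library(parallel)

# with 2 cores the 6 elements are handled by 2 forked child processes
unlist(mclapply(1:6, function(i) Sys.getpid(), mc.cores = 2))
```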
mcparallel & mccollect
mcparallel() - Asynchronous evaluation of an R expression in a separate process
mccollect() - Checks mcparallel objects for completion - if finished the values are returned, otherwise the calling R process is blocked until they are done
If you don’t want to block the calling R process, you can use the wait = FALSE argument. This will return NULL if the process is not yet complete.
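A sketch (the Sys.sleep() calls stand in for long-running tasks):

```r
library(parallel)

m1 <- mcparallel({ Sys.sleep(1); rnorm(5) })
m2 <- mcparallel({ Sys.sleep(2); rbeta(5, 1, 1) })

mccollect(list(m1, m2), wait = FALSE)  # NULL - jobs not yet complete
res <- mccollect(list(m1, m2))         # blocks until both jobs finish
str(res)
```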
Random number generation in parallel contexts requires special consideration:
Multiple processes using the same RNG seed produce identical “random” sequences
Introducing correlation into our RNG invalidates statistical results (e.g., bootstrap, Monte Carlo)
For most parallel package functions, the default behavior ensures proper RNG handling, but be aware of this issue when writing custom parallel code.
“Reproducibility” can be a function of both your data and core / worker size.
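A sketch of one reproducible approach on a Unix-alike, using the L'Ecuyer-CMRG generator, which provides independent streams for forked children:

```r
library(parallel)

RNGkind("L'Ecuyer-CMRG")
set.seed(123)
unlist(mclapply(1:4, function(i) rnorm(1), mc.cores = 2))

# rerunning with the same seed and the same mc.cores reproduces the result;
# changing mc.cores can change how streams are assigned to elements
```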
snow (2003) - Simple Network of Workstations, cluster-based parallelization
multicore (2009-2014) - Fork-based parallelization for Unix-like systems
foreach (2009) - Iterative parallel execution with pluggable backends
parallel - Merged snow and multicore into base R (R 2.14.0)
foreach + backends (doParallel, doMC, doFuture) - Still widely used w/ flexible backends
future (2015) - Modern async framework, support for local/remote/cluster computing
mirai (2022) - Lightweight async evaluation, powers modern tidyverse parallelization

mirai
mirai is a modern framework for asynchronous parallel computing in R
Built on nanonext & NNG messaging (IPC, TCP, TLS)
Scales to millions of tasks with minimal overhead
Transparent evaluation with error handling
Works locally, remote (SSH), and on HPC clusters
Core functions:
mirai() - evaluate R expressions asynchronously
daemons() - persistent background processes
mirai_map() - parallel map operations

mirai() usage
Just like mcparallel, we can evaluate R expressions without blocking the session:
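A minimal sketch (the expression is an arbitrary stand-in for real work):

```r
library(mirai)

m <- mirai({
  Sys.sleep(2)  # simulate a long-running computation
  rnorm(5)
})
m  # returns immediately - prints as an unresolved mirai
```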
A mirai is either unresolved if the result has yet to be received, or resolved if it has. unresolved() is a helper function to check the state of a mirai.
The result of a mirai m is available at m$data once it has resolved. Normally this will be the return value of the evaluated expression. If the expression errored, crashed, or timed out then this will be an ‘errorValue’ instead. See the section Error Handling below.
Rather than repeatedly checking unresolved(m), it is more efficient to wait for and collect its value by using m[].
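Continuing the sketch above (recreated here so it stands alone):

```r
library(mirai)

m <- mirai({ Sys.sleep(1); 42 })

unresolved(m)  # TRUE while the task is still running
m[]            # waits for and collects the value (42)
m$data         # the same value, once resolved
```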
…
If execution in a mirai fails, the error message is returned as a character string of class ‘miraiError’ and ‘errorValue’ to facilitate debugging.
is_mirai_error() may be used to test for mirai execution errors.
Failed evaluations return miraiError objects:
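For example:

```r
library(mirai)

m <- mirai(stop("something went wrong"))
m[]                     # a 'miraiError' object, not a thrown error
is_mirai_error(m$data)  # TRUE
```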
The expression passed to mirai() will be evaluated in a separate R process in a clean environment (not the global environment). Any variables, functions or other objects that are needed must be supplied via ... and/or .args.
Use ... for simple arguments that are substituted into the expression before interpretation
Use .args for functions, large objects, etc.
Similarly, if you are using functions from non-base packages you must use namespaced calls (e.g. dplyr::select()), or else the package should be loaded explicitly as part of .expr.
Note that when using daemons, the function everywhere() can be used to evaluate an expression on all connected daemons; this can be useful to load packages or set options.
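A sketch of both mechanisms (the names f, x, and k are hypothetical):

```r
library(mirai)

x <- rnorm(100)
f <- function(v, k) mean(v) + k

# k = 2 is a simple value passed via ...;
# the function f and the vector x are supplied via .args
m <- mirai(f(x, k), k = 2, .args = list(f = f, x = x))
m[]
```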
daemons()
We can create persistent background processes to handle our mirai() requests; this reduces the startup overhead for each task.
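For example (4 daemons is an arbitrary choice):

```r
library(mirai)

daemons(4)                  # start 4 persistent background processes
everywhere(library(dplyr))  # e.g. load a package on every daemon

m <- mirai(nrow(filter(d, x > 0)), d = data.frame(x = rnorm(10)))
m[]

daemons(0)                  # shut the daemons down
```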
call_mirai() vs race_mirai()
call_mirai() waits for all tasks to complete, while race_mirai() waits only until the first task resolves.
mirai_map()
This function is similar to purrr::map(), but instead of a list of values, returns a ‘mirai_map’ object, which is a list of mirai.
A mirai_map() call returns almost immediately. The results of a mirai_map x may be collected using x[] or collect_mirai(x), the same as for a mirai. This waits for all asynchronous operations to complete if still in progress.
Key advantages
Returns immediately with all evaluations taking place asynchronously. Printing a ‘mirai map’ object shows the current completion progress.
The ‘.promise’ argument allows a promise action to be registered against each iteration. This can be used to perform side-effects when each iteration completes (such as checkpointing or recording a progress update).
Returns evaluation errors as ‘miraiError’ or ‘errorValue’ as the case may be, rather than causing the entire map to fail. This allows more efficient recovery from partial failure.
Does not rely on a ‘chunking’ algorithm that attempts to split work into batches according to the number of available daemons, as implemented for instance in the parallel package. Chunking cannot take into account varying or unpredictable compute times over the indices, which mirai scheduling is designed to deal with optimally.
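A sketch (the sleep times simulate uneven task durations):

```r
library(mirai)

daemons(4)

mm <- mirai_map(1:8, function(i) {
  Sys.sleep(runif(1))  # tasks of varying length
  i^2
})
mm            # prints completion progress
unlist(mm[])  # waits for and collects all results

daemons(0)
```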
in_parallel() (experimental)
Enables parallel computation in purrr map functions using mirai:
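A sketch, assuming purrr >= 1.1.0 and a set of mirai daemons:

```r
library(purrr)
library(mirai)

daemons(4)
map_dbl(1:8, in_parallel(\(x) x^2))  # each call is dispatched to a daemon
daemons(0)
```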
Bootstrapping is a resampling scheme where the original data is repeatedly reconstructed by taking samples of size n (with replacement) from the original data, and using each sample to repeat an analysis procedure of interest. Below is an example of fitting a local regression (loess) to some synthetic data; we will construct a bootstrap prediction interval for this model.
Bootstrapping Demo
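A sketch of one way this demo might be parallelized with mirai_map() (the data-generating process, prediction grid, and 95% interval level are assumptions):

```r
library(mirai)

set.seed(1)
d <- data.frame(x = runif(200))
d$y <- sin(2 * pi * d$x) + rnorm(200, sd = 0.25)
grid <- data.frame(x = seq(0.05, 0.95, length.out = 101))

daemons(4)
boot <- mirai_map(1:1000, function(i, d, grid) {
  idx <- sample(nrow(d), replace = TRUE)                 # resample rows w/ replacement
  predict(loess(y ~ x, data = d[idx, ]), newdata = grid) # refit + predict
}, .args = list(d = d, grid = grid))
preds <- do.call(cbind, boot[])  # 101 x 1000 matrix of predictions
daemons(0)

# pointwise 95% bootstrap interval at each grid value
pi_bounds <- apply(preds, 1, quantile, probs = c(0.025, 0.975), na.rm = TRUE)
```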
Optimal use of parallelization / multiple cores is hard, there isn’t one best solution
Don’t underestimate the overhead cost
Experimentation is key
Measure it or it didn’t happen
Be aware of the trade-off between developer time and run time
Sta 523 - Fall 2025