Profiling & Parallelization

Lecture 21

Dr. Colin Rundel

Profiling & Benchmarking

profvis - lm() demo

library(tidyverse)  # provides tibble() and mutate() used below

n = 1e6
d = tibble(
  x1 = rt(n, df = 3),
  x2 = rt(n, df = 3),
  x3 = rt(n, df = 3),
  x4 = rt(n, df = 3),
  x5 = rt(n, df = 3),
) |>
  mutate(y = -2*x1 - 1*x2 + 0*x3 + 1*x4 + 2*x5 + rnorm(n))
profvis::profvis({
  lm(y~., data=d)
})

profvis - vector demo

profvis::profvis({
  data = data.frame(value = runif(5e4))

  # growing a column inside a data.frame copies the data frame
  # on each iteration
  data$sum[1] = data$value[1]
  for (i in seq(2, nrow(data))) {
    data$sum[i] = data$sum[i-1] + data$value[i]
  }
})
profvis::profvis({
  # the same running sum on an atomic vector avoids those copies
  x = runif(5e4)
  sum = x[1]
  for (i in seq(2, length(x))) {
    sum[i] = sum[i-1] + x[i]
  }
})
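
For reference, the idiomatic fix is to drop the element-wise loop entirely - base R’s cumsum() computes the same running total in a single vectorized call:

x = runif(5e4)
s = cumsum(x)   # running total with no per-iteration copies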

Benchmarking - bench

d = tibble(
  x = runif(10000),
  y = runif(10000)
)

(b = bench::mark(
  d[d$x > 0.5, ],
  d[which(d$x > 0.5), ],
  subset(d, x > 0.5),
  filter(d, x > 0.5)
))
# A tibble: 4 × 6
  expression                 min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>            <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 d[d$x > 0.5, ]          74.5µs   89.5µs    11091.  236.52KB     36.0
2 d[which(d$x > 0.5), ]     76µs   90.8µs    10827.  271.91KB     65.4
3 subset(d, x > 0.5)      92.6µs  114.1µs     8756.  289.07KB     53.9
4 filter(d, x > 0.5)     249.5µs  292.2µs     3438.    1.48MB     21.3

Larger n

d = tibble(
  x = runif(1e6),
  y = runif(1e6)
)

(b = bench::mark(
  d[d$x > 0.5, ],
  d[which(d$x > 0.5), ],
  subset(d, x > 0.5),
  filter(d, x > 0.5)
))
# A tibble: 4 × 6
  expression                 min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>            <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 d[d$x > 0.5, ]          7.21ms   7.52ms     133.     13.3MB     79.8
2 d[which(d$x > 0.5), ]   8.64ms   8.97ms     111.     24.8MB    144. 
3 subset(d, x > 0.5)     11.16ms  11.55ms      86.0    24.8MB    108. 
4 filter(d, x > 0.5)      8.38ms    9.5ms     103.     24.8MB     70.2

bench - relative results

summary(b, relative=TRUE)
# A tibble: 4 × 6
  expression              min median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>            <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
1 d[d$x > 0.5, ]         1      1         1.55      1        1.14
2 d[which(d$x > 0.5), ]  1.20   1.19      1.29      1.86     2.05
3 subset(d, x > 0.5)     1.55   1.54      1         1.86     1.53
4 filter(d, x > 0.5)     1.16   1.26      1.20      1.86     1   

Example - BLAS and LAPACK

Statistics and Linear Algebra

An awful lot of statistics is at its core linear algebra.

For example:

  • Linear regression models - find

\[ \hat{\beta} = (X^T X)^{-1} X^Ty \]

  • Principal component analysis

    • Find \(T = XW\) where \(W\) is a matrix whose columns are the eigenvectors of \(X^TX\).

    • Often solved via SVD - Let \(X = U\Sigma W^T\) then \(T = U\Sigma\).
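
As a quick illustration, both quantities can be computed directly in base R (a minimal sketch; note that lm() actually uses a QR decomposition rather than the normal equations):

set.seed(1)
X = cbind(1, matrix(rnorm(100 * 2), ncol = 2))
y = X %*% c(1, -2, 3) + rnorm(100)

# beta-hat via the normal equations (crossprod() and solve() call BLAS / LAPACK)
beta_hat = solve(crossprod(X), crossprod(X, y))

# PCA scores via SVD: X = U Sigma W^T  =>  T = U Sigma
Xc = scale(X[, -1], center = TRUE, scale = FALSE)
s = svd(Xc)
scores = s$u %*% diag(s$d)   # matches prcomp(Xc)$x up to column signs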

Numerical Linear Algebra

These problems are not unique to statistics - they are the types of problems that come up across all areas of numerical computing.

  • Numerical linear algebra \(\ne\) mathematical linear algebra

  • Efficiency and stability of numerical algorithms matter

    • Designing and implementing these algorithms is hard

  • Don’t reinvent the wheel - use the common core linear algebra tools, which have a well-defined API

BLAS and LAPACK

Low level algorithms for common linear algebra operations

BLAS

  • Basic Linear Algebra Subprograms

  • Copying, scaling, multiplying vectors and matrices

  • Origins go back to 1979, written in Fortran

LAPACK

  • Linear Algebra Package

  • Higher level functionality building on BLAS.

  • Linear solvers, eigenvalues, and matrix decompositions

  • Origins go back to 1992, mostly Fortran (expanded on LINPACK, EISPACK)

Modern variants?

Most default BLAS and LAPACK implementations (like R’s defaults) are somewhat dated

  • Written in Fortran and designed for a single CPU core

  • Certain (potentially non-optimal) hard coded defaults (e.g. block size).

Multithreaded alternatives:

  • ATLAS - Automatically Tuned Linear Algebra Software

  • OpenBLAS - fork of GotoBLAS from TACC at UTexas

  • Intel MKL - Math Kernel Library, part of Intel’s commercial compiler tools

  • cuBLAS / Magma - GPU libraries from Nvidia and UTK respectively

  • Accelerate / vecLib - Apple’s framework for GPU and multicore computing

R & BLAS / LAPACK

sessionInfo()
R version 4.5.2 (2025-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Tahoe 26.1

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] C.UTF-8/en_US.UTF-8/C.UTF-8/C/C.UTF-8/C.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets 
[7] methods   base     

other attached packages:
 [1] lubridate_1.9.4 forcats_1.0.1   stringr_1.5.2   dplyr_1.1.4    
 [5] purrr_1.1.0     readr_2.1.5     tidyr_1.3.1     tibble_3.3.0   
 [9] ggplot2_4.0.0   tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] Matrix_1.7-4       gtable_0.3.6       jsonlite_2.0.0    
 [4] compiler_4.5.2     Rcpp_1.1.0         tidyselect_1.2.1  
 [7] scales_1.4.0       yaml_2.3.10        fastmap_1.2.0     
[10] lattice_0.22-7     R6_2.6.1           generics_0.1.4    
[13] knitr_1.50         pillar_1.11.1      RColorBrewer_1.1-3
[16] tzdb_0.5.0         rlang_1.1.6        utf8_1.2.6        
[19] stringi_1.8.7      xfun_0.54          S7_0.2.0          
[22] timechange_0.3.0   cli_3.6.5          withr_3.0.2       
[25] magrittr_2.0.4     digest_0.6.37      grid_4.5.2        
[28] hms_1.1.4          lifecycle_1.0.4    vctrs_0.6.5       
[31] bench_1.1.4        evaluate_1.0.5     glue_1.8.0        
[34] farver_2.1.2       profmem_0.7.0      rmarkdown_2.30    
[37] tools_4.5.2        pkgconfig_2.0.3    htmltools_0.5.8.1 

OpenBLAS Matrix Multiply Performance

x = matrix(runif(5000^2),ncol=5000)

sizes = c(100,500,1000,2000,3000,4000,5000)
cores = c(1,2,4,8,16)

sapply(
  cores, 
  function(n_cores) {
    flexiblas::flexiblas_set_num_threads(n_cores)
    sapply(
      sizes, 
      function(s) {
        y = x[1:s,1:s]
        system.time(y %*% y)[3]
      }
    )
  }
)

Elapsed time (seconds) by matrix size n and number of cores:

    n   1 core   2 cores   4 cores   8 cores   16 cores
  100    0.000     0.000     0.000     0.000      0.000
  500    0.004     0.003     0.002     0.002      0.004
 1000    0.028     0.016     0.010     0.007      0.009
 2000    0.207     0.110     0.058     0.035      0.039
 3000    0.679     0.352     0.183     0.103      0.081
 4000    1.587     0.816     0.418     0.227      0.145
 5000    3.104     1.583     0.807     0.453      0.266
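
The timings above use the flexiblas package to control the number of BLAS threads at runtime; the same interface can also list and switch backends without rebuilding R. A minimal sketch (assuming R was built against FlexiBLAS and an OpenBLAS backend is installed):

library(flexiblas)

flexiblas_list()                           # backends available on this system
idx = flexiblas_load_backend("OPENBLAS")   # load a backend by name
flexiblas_switch(idx)                      # make it the active backend
flexiblas_set_num_threads(8)               # set the thread count, as above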

Parallelization in R

parallel

Part of the base packages in R

  • tools for the forking of R processes (some functions do not work on Windows)

  • Some core functions:

    • detectCores

    • pvec

    • mclapply

    • mcparallel & mccollect

  • There are many other functions for creating and managing “clusters” of R processes, but we won’t cover those here.

detectCores

Surprisingly, this detects the number of cores of the current system.

detectCores()
[1] 14


Take the reported number with a grain of salt - it is important to know something about the hardware you are using. Modern CPUs are complicated and have many features (e.g. hyperthreading, P vs E cores, etc.) that can make this number only part of the whole story.

This number is also not aware of other users / processes on the system or of job scheduling constraints - so be considerate and conservative when choosing how many cores to use for your parallel computations.
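
As a point of comparison, detectCores() can also report only physical cores (documented to return NA if the value cannot be determined), and the parallelly package’s availableCores() is a scheduler- and container-aware alternative:

detectCores()                  # logical cores (the default)
detectCores(logical = FALSE)   # physical cores only; NA if undeterminable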

pvec

Parallelizes a vectorized function call over a vector x - FUN must be a vectorized pure function (i.e. no side effects) that returns a vector of the same length as x.

system.time(pvec(1:1e7, sqrt, mc.cores = 1))
   user  system elapsed 
  0.012   0.010   0.022 
system.time(pvec(1:1e7, sqrt, mc.cores = 4))
   user  system elapsed 
  0.228   0.748   0.899 
system.time(pvec(1:1e7, sqrt, mc.cores = 8))
   user  system elapsed 
  0.085   0.716   0.690 
system.time(sqrt(1:1e7))
   user  system elapsed 
  0.012   0.007   0.019 

pvec - bench::system_time

If you need more precise timing information, bench::system_time is a good alternative.

bench::system_time(Sys.sleep(.5))
process    real 
   36µs   497ms 
system.time(Sys.sleep(.5))
   user  system elapsed 
  0.000   0.001   0.504 
bench::system_time(pvec(1:1e7, sqrt, mc.cores = 1))
process    real 
 25.9ms  25.6ms 
bench::system_time(pvec(1:1e7, sqrt, mc.cores = 4))
process    real 
  425ms   701ms 
bench::system_time(pvec(1:1e7, sqrt, mc.cores = 8))
process    real 
  623ms   899ms 
bench::system_time(sqrt(1:1e7))
process    real 
 23.8ms  23.4ms 

Cores by size

cores = c(1,4,6,8,10)
order = 6:8
f = function(x,y) {
  system.time(
    pvec(1:(10^y), sqrt, mc.cores = x)
  )[3]
}

res = map(
  cores, 
  function(x) {
     map_dbl(order, f, x = x) |>
      setNames(paste0("10^",order))
  }
) |> 
  bind_rows() |>
  mutate(cores = cores) |>
  relocate(cores)
res
# A tibble: 5 × 4
  cores  `10^6` `10^7` `10^8`
  <dbl>   <dbl>  <dbl>  <dbl>
1     1 0.00100 0.0170  0.361
2     4 0.0870  0.708   7.81 
3     6 0.093   0.699   6.90 
4     8 0.0960  0.714   7.14 
5    10 0.110   0.764   7.34 

mclapply

Implements a parallelized version of lapply. It relies on forking and hence is not available on Windows unless mc.cores = 1.

system.time(rnorm(1e7))
   user  system elapsed 
  0.145   0.003   0.149 
system.time(unlist(mclapply(1:10, function(x) rnorm(1e6), mc.cores = 2)))
   user  system elapsed 
  0.190   0.693   0.770 
system.time(unlist(mclapply(1:10, function(x) rnorm(1e6), mc.cores = 4)))
   user  system elapsed 
  0.198   0.693   0.711 
system.time(unlist(mclapply(1:10, function(x) rnorm(1e6), mc.cores = 8)))
   user  system elapsed 
  0.207   0.754   0.727 
system.time(unlist(mclapply(1:10, function(x) rnorm(1e6), mc.cores = 10)))
   user  system elapsed 
  0.212   0.763   0.732 

mcparallel

Asynchronous evaluation of an R expression in a separate process

m = mcparallel(rnorm(1e6))
n = mcparallel(rbeta(1e6,1,1))
o = mcparallel(rgamma(1e6,1,1))
str(m)
List of 2
 $ pid: int 90613
 $ fd : int [1:2] 5 8
 - attr(*, "class")= chr [1:3] "parallelJob" "childProcess" "process"
str(n)
List of 2
 $ pid: int 90614
 $ fd : int [1:2] 6 10
 - attr(*, "class")= chr [1:3] "parallelJob" "childProcess" "process"

mccollect

Checks mcparallel objects for completion - if finished the values are returned, otherwise the calling R process is blocked until they are done.

str(mccollect(list(m,n,o)))
List of 3
 $ 90613: num [1:1000000] -0.4154 0.0169 2.0027 -0.8437 -0.308 ...
 $ 90614: num [1:1000000] 0.346 0.692 0.128 0.36 0.674 ...
 $ 90615: num [1:1000000] 1.715 0.153 0.444 0.778 0.746 ...

mccollect - waiting

If you don’t want to block the calling R process, you can use the wait = FALSE argument. This will return NULL if the process is not yet complete.

f = function(n) {
  Sys.sleep(5)
  mean(rnorm(n))
}
p = mcparallel(f(1e5))
mccollect(p, wait = FALSE)
NULL
mccollect(p, wait = FALSE, timeout = 5)
$`90616`
[1] 0.0006274726
mccollect(p, wait = FALSE)
Warning in selectChildren(jobs, timeout): cannot wait for child 90616
as it does not exist
NULL

Aside - RNG and Parallelization

Random number generation in parallel contexts requires special consideration:

  • Multiple processes using the same RNG seed produce identical “random” sequences

  • Introducing correlation into our RNG invalidates statistical results (e.g., bootstrap, Monte Carlo)

  • For most parallel package functions, the default behavior ensures proper RNG handling, but be aware of this issue when writing custom parallel code.

  • “Reproducibility” can be a function of both your data and the number of cores / workers used.

Example

set.seed(1234)
mclapply(
  1:4, function(x) sd(rnorm(100)), mc.cores = 3
) |>
  unlist()
[1] 0.9617990 0.9984449 1.0023891 0.9206979
set.seed(1234)
mclapply(
  1:4, function(x) sd(rnorm(100)), mc.cores = 3
) |>
  unlist()
[1] 1.0630479 1.0276156 1.0110946 0.9407919
set.seed(1234)
mclapply(
  1:4, function(x) sd(rnorm(100)), 
  mc.cores = 3, mc.set.seed = FALSE
) |>
  unlist()
[1] 1.004405 1.004405 1.004405 1.032187
set.seed(1234)
mclapply(
  1:4, function(x) sd(rnorm(100)), 
  mc.cores = 3, mc.set.seed = FALSE
) |>
  unlist()
[1] 1.004405 1.004405 1.004405 1.032187
RNGkind("L'Ecuyer-CMRG")
set.seed(1234)
mclapply(
  1:4, function(x) sd(rnorm(100)), mc.cores = 3
) |>
  unlist()
[1] 0.9350555 1.1025750 1.0046105 0.9288966
RNGkind("L'Ecuyer-CMRG")
set.seed(1234)
mclapply(
  1:4, function(x) sd(rnorm(100)), mc.cores = 3
) |>
  unlist()
[1] 0.9350555 1.1025750 1.0046105 0.9288966

parallel alternatives & predecessors

Historical landscape

Early packages

  • snow (2003) - Simple Network of Workstations, cluster-based parallelization
  • multicore (2009-2014) - Fork-based parallelization for Unix-like systems
  • foreach (2009) - Iterative parallel execution with pluggable backends

Consolidation

  • parallel - Merged snow and multicore into base R (R 2.14.0)

Modern alternatives

  • foreach + backends (doParallel, doMC, doFuture) - Still widely used w/ flexible backends
  • future (2015) - Modern async framework, support for local/remote/cluster computing
  • mirai (2022) - Lightweight async evaluation, powers modern tidyverse parallelization

mirai

is a modern framework for asynchronous parallel computing in R

  • Built on nanonext & NNG messaging (IPC, TCP, TLS)

  • Scales to millions of tasks with minimal overhead

  • Transparent evaluation with error handling

  • Works locally, remote (SSH), and on HPC clusters

Core functions:

  • mirai() - evaluate R expressions asynchronously
  • daemons() - persistent background processes
  • mirai_map() - parallel map operations

Basic mirai() usage

Just like mcparallel(), we can evaluate R expressions without blocking the session,

library(mirai)
m = mirai(
  {
    Sys.sleep(time)
    rnorm(5L, mean)
  },
  time = 2L,
  mean = 4.5
)
m
< mirai [] >
m$data
'unresolved' logi NA
unresolved(m)
[1] TRUE
m[]
[1] 6.121372 4.118744 5.614244 3.363124 3.560779
m$data
[1] 6.121372 4.118744 5.614244 3.363124 3.560779
m
< mirai [$data] >

mirai objects

A mirai is either unresolved if the result has yet to be received, or resolved if it has. unresolved() is a helper function to check the state of a mirai.

The result of a mirai m is available at m$data once it has resolved. Normally this will be the return value of the evaluated expression. If the expression errored, crashed, or timed out then this will be an ‘errorValue’ instead. See the section Error Handling below.

Rather than repeatedly checking unresolved(m), it is more efficient to wait for and collect its value by using m[].

If execution in a mirai fails, the error message is returned as a character string of class ‘miraiError’ and ‘errorValue’ to facilitate debugging.

is_mirai_error() may be used to test for mirai execution errors.

Error handling

Failed evaluations return miraiError objects,

m = mirai({
  stop("Something went wrong!")
})
m[]
'miraiError' chr Error: Something went wrong!
is_mirai_error(m$data)
[1] TRUE
m$data$message
[1] "Something went wrong!"
m$data$stack.trace
[[1]]
stop("Something went wrong!")
m$data$condition.class
[1] "simpleError" "error"       "condition"  

Notes on evaluation

The expression passed to mirai() will be evaluated in a separate R process in a clean environment (not the global environment). Any variables, functions or other objects that are needed must be supplied via ... and/or .args.

  • Use ... for simple arguments that are substituted into the expression before interpretation

  • Use .args for functions, large objects, etc.

Similarly, if you are using functions from non-base packages you must use namespaced calls (e.g. dplyr::select()), or else the package should be loaded explicitly as part of .expr.

Note that when using daemons, the function everywhere() can be used to evaluate an expression on all connected daemons - this can be useful for loading packages or setting options.
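
A minimal sketch of both patterns (the object names here are purely illustrative):

sq = function(x) x^2
d = data.frame(x = 1:10)

m = mirai(
  {
    d$y = sq(d$x)   # sq and d must be passed in - the environment is clean
    d
  },
  .args = list(sq = sq, d = d)
)
m[]

# with daemons running, load a package (or set options) on all of them
everywhere(library(dplyr))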

daemons()

We can create persistent background processes to handle our mirai() requests, which reduces the startup overhead for each task.

daemons(0)
a = mirai({
  Sys.sleep(3)
})
b = mirai({
  Sys.sleep(2)
})
c = mirai({
  Sys.sleep(1)
})

system.time(
  call_mirai(list(a, b, c)) 
)
   user  system elapsed 
  0.001   0.001   3.016 
daemons(1)
a = mirai({
  Sys.sleep(3)
})
b = mirai({
  Sys.sleep(2)
})
c = mirai({
  Sys.sleep(1)
})

system.time(
  call_mirai(list(a, b, c)) 
)
   user  system elapsed 
  0.000   0.000   5.939 
daemons(3)
a = mirai({
  Sys.sleep(3)
})
b = mirai({
  Sys.sleep(2)
})
c = mirai({
  Sys.sleep(1)
})

system.time(
  call_mirai(list(a, b, c)) 
)
   user  system elapsed 
   0.00    0.00    2.93 

call_mirai() vs race_mirai()

daemons(4)

Call waits for all tasks to complete

a = mirai({
  Sys.sleep(3)
  "A"
})
b = mirai({
  Sys.sleep(2)
  "B"
})
c = mirai({
  Sys.sleep(1)
  "C"
})

call_mirai(list(a, b, c))
c(a$data, b$data, c$data)
[1] "A" "B" "C"

Race waits for the first task to complete

a = mirai({
  Sys.sleep(3)
  "A"
})
b = mirai({
  Sys.sleep(2)
  "B"
})
c = mirai({
  Sys.sleep(1)
  "C"
})

race_mirai(list(a, b, c))
c(a$data, b$data, c$data)
[1] NA  NA  "C"

mirai_map()

This function is similar to purrr::map(), but instead of a list of values, returns a ‘mirai_map’ object, which is a list of mirai.

A mirai_map() call returns almost immediately. The results of a mirai_map x may be collected using x[] or collect_mirai(x), the same as for a mirai. This waits for all asynchronous operations to complete if still in progress.

Key advantages

  1. Returns immediately with all evaluations taking place asynchronously. Printing a ‘mirai_map’ object shows the current completion progress.

  2. The ‘.promise’ argument allows a promise action to be registered against each iteration. This can be used to perform side-effects when each iteration completes (such as checkpointing or recording a progress update).

  3. Returns evaluation errors as ‘miraiError’ or ‘errorValue’ as the case may be, rather than causing the entire map to fail. This allows more efficient recovery from partial failure.

  4. Does not rely on a ‘chunking’ algorithm that attempts to split work into batches according to the number of available daemons, as implemented for instance in the parallel package. Chunking cannot take into account varying or unpredictable compute times over the indices, which mirai scheduling is designed to deal with optimally.
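
A quick sketch of point 3 - a failing iteration returns a ‘miraiError’ in its slot while the remaining iterations still complete:

daemons(2)

res = mirai_map(
  1:4,
  function(x) if (x == 3) stop("bad input") else x^2
)[]

res                        # elements 1, 2, 4 hold results
is_mirai_error(res[[3]])   # element 3 is a 'miraiError'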

Scheduling Examples

daemons(4)
waits = c(1, 1, 4, 4, 1, 1, 1, 4)
system.time(
  mirai_map(waits, Sys.sleep)[]
)
   user  system elapsed 
  0.001   0.001   6.017 
system.time(
  mclapply(waits, Sys.sleep, mc.cores = 4)
)
   user  system elapsed 
  0.002   0.033   8.029 
daemons(4)
n = sample(1:8)
f = function(n) mean(rnorm(10^n))
system.time(
  mirai_map(n, f)[]
)
   user  system elapsed 
  0.000   0.001   2.562 
system.time(
  mclapply(n, f, mc.cores = 2)
)
   user  system elapsed 
  0.028   0.028   3.019 

purrr’s in_parallel() (experimental)

Enables parallel computation in purrr map functions using mirai,

system.time({
  x = mclapply(
    1:10,
    \(x) rnorm(1e6),
    mc.cores = 5
  ) |>
    unlist()
})
   user  system elapsed 
  0.455   0.775   0.937 
mean(x)
[1] 7.349909e-05
sd(x)
[1] 1.000287
system.time({
  daemons(5)
  x = map(
    1:10,
    in_parallel(\(x) rnorm(1e6))
  ) |>
    unlist()
})
   user  system elapsed 
  0.118   0.151   0.576 
mean(x)
[1] 0.0003887193
sd(x)
[1] 1.000248

Example - Bootstrapping

Bootstrapping is a resampling scheme where the original data is repeatedly reconstructed by taking samples of size n (with replacement) from the original data, and using them to repeat an analysis procedure of interest. Below is an example of fitting a local regression (loess) model to some synthetic data; we will construct a bootstrap prediction interval for this model.

set.seed(3212016)
d = data.frame(x = runif(250)) |>
  mutate(y = sin(2*pi*x) + rnorm(length(x)))

l = loess(y ~ x, data=d)
p = predict(l, se=TRUE)

d = d |> mutate(
  pred_y = p$fit,
  pred_y_se = p$se.fit
)

ggplot(d, aes(x,y)) +
  geom_point(color="gray50") +
  geom_ribbon(
    aes(ymin = pred_y - 1.96 * pred_y_se, 
        ymax = pred_y + 1.96 * pred_y_se), 
    fill="red", alpha=0.25
  ) +
  geom_line(aes(y=pred_y)) +
  theme_bw()




Bootstrapping Demo
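
A sketch of one way the demo might proceed - resample the data, refit the loess model, and predict over a fixed grid in parallel with mirai_map() (the helper and object names here are illustrative):

daemons(4)

x_grid = seq(0, 1, length.out = 100)

boot_fit = function(i, d, x_grid) {
  db = d[sample(nrow(d), nrow(d), replace = TRUE), ]
  predict(loess(y ~ x, data = db), newdata = data.frame(x = x_grid))
}

res = mirai_map(
  1:1000, boot_fit,
  .args = list(d = d, x_grid = x_grid)
)[]

preds = do.call(rbind, res)

# pointwise 95% bootstrap interval (loess does not extrapolate, hence na.rm)
ci = apply(preds, 2, quantile, probs = c(0.025, 0.975), na.rm = TRUE)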

What to use when?

Optimal use of parallelization / multiple cores is hard - there isn’t one best solution

  • Don’t underestimate the overhead cost

  • Experimentation is key

  • Measure it or it didn’t happen

  • Be aware of the trade off between developer time and run time