Parallel computing with C++ from R

A. Stamm

Department of Mathematics Jean Leray, UMR CNRS 6629, Nantes University, Ecole Centrale de Nantes, France

2024-09-30

Information about the session

Code copy-pasting

All pieces of code can be copy-pasted into an R session: just hover over the code block and click the “Copy” button that appears in its top-right corner.

R code. Copy-paste into an R script and run.

C++ code. Copy-paste:

  • either into a file XXX.cpp, which you then compile with Rcpp::sourceCpp(file = 'XXX.cpp');
  • or into the code = '' argument of the Rcpp::sourceCpp() function, as sketched below.
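As a minimal sketch of the second option, the C++ code is passed inline as a string to the code argument (the add_one function is a made-up example, not from these slides):

Rcpp::sourceCpp(code = '
#include <Rcpp.h>

// [[Rcpp::export]]
double add_one(double x) { return x + 1.0; }
')
add_one(41) # returns 42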

Benchmarking

All time benchmarks are run with system.time() or the {bench} package on a MacBook Pro 2021 with an Apple M1 Pro chip (10 cores, 32 GB of RAM) under macOS Sonoma 14.5.

Running multi-threaded code in C++

Calling multi-threaded C++ code from R(cpp)

OpenMP

Overview

Preliminary remarks

The magic of OpenMP

  • OpenMP is a set of compiler directives, library routines, and environment variables that influence the behavior of parallelized code.
  • It is supported by most compilers, including g++, clang, and icc.
  • It is a simple and effective way to parallelize code: you start by adding a few directives to your existing sequential code, and the compiler does the rest, as the sketch below illustrates.
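As a minimal standalone sketch (illustrative only, not from these slides), a single directive turns a sequential loop into a parallel one; without OpenMP the pragma is simply ignored and the loop runs sequentially:

#include <cmath>
#include <vector>

std::vector<double> square_roots(const std::vector<double> &x)
{
    std::vector<double> out(x.size());

    // The only change from the sequential version: one compiler directive
    #pragma omp parallel for
    for (std::size_t i = 0;i < x.size();++i)
        out[i] = std::sqrt(x[i]);

    return out;
}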

Enabling OpenMP

OpenMP support on Windows and Linux

  • On Windows and Linux, OpenMP is supported by default in g++ and clang.
  • To enable OpenMP in g++, use the -fopenmp flag.
  • The same -fopenmp flag works for clang (the -Xclang -fopenmp trick is only needed for Apple’s clang, see below).
  • Use the Rcpp::plugins(openmp) attribute in your C++ code to take care of the flags automagically.
  • Summary: you should not need to do anything on Windows and Linux.

Enabling OpenMP

OpenMP support on macOS

Apple has explicitly disabled OpenMP support in compilers that they ship in Xcode:

  $ clang -c omp.c -fopenmp
  clang: error: unsupported option '-fopenmp'

even though clang has had OpenMP support for quite a long time now. In fact, the clang compiler in Xcode can generate all the necessary code for OpenMP. It can be tricked into performing its designed function by using the -Xclang -fopenmp flags.

The unfortunate part is that Apple does not ship the libomp.dylib run-time library needed for OpenMP support. Fortunately, some clever folks made it available for us: see <https://mac.r-project.org/openmp/>.
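Once the run-time library from that page is installed (it unpacks under /usr/local), a possible ~/.R/Makevars configuration (assuming those default installation paths) looks like this:

# ~/.R/Makevars (sketch): enable OpenMP with Apple clang
CPPFLAGS += -Xclang -fopenmp
LDFLAGS += -lomp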

A toy function to play with

//---------------------------------
#include <unistd.h>
#include <Rcpp.h>

// [[Rcpp::export]]
bool wait_k_seconds(unsigned int sec)
{
    for (unsigned int i = 0;i < sec;++i)
        sleep(1);
    
    return EXIT_SUCCESS;
}
  1. The unistd.h header file is needed for the sleep() function.
  2. The [[Rcpp::export]] attribute tells Rcpp that it should generate the necessary R bindings for the function so that it can be called from R.

The previous code defines a simple function that waits for sec seconds and makes it available to R when compiled using Rcpp::sourceCpp(). We can test that it does what it is supposed to do by calling it from R:

system.time(wait_k_seconds(2))[3]
elapsed 
  2.119 

A parallelized version via OpenMP

//---------------------------------
#include <unistd.h>
#include <Rcpp.h>

// // [[Rcpp::plugins(openmp)]] // Uncomment on Windows and Linux

// [[Rcpp::export]]
bool wait_k_seconds_omp(unsigned int sec, unsigned int ncores)
{
#if defined(_OPENMP)
    #pragma omp parallel num_threads(ncores)
    #pragma omp for
#endif
    for (unsigned int i = 0;i < sec;++i)
        sleep(1);
    
    return EXIT_SUCCESS;
}
  1. Includes the correct OpenMP flags during compilation. Must not be included on macOS as OpenMP flags are handled in ~/.R/Makevars.
  2. Checks if OpenMP is available and inserts the following code only if it is.
  3. Indicates the beginning of a parallel section, to be executed on ncores parallel threads.
  4. Tells the compiler that the for loop should be run in parallel.

Benchmarking

  • Run a 3 sec task on 1 thread:
system.time(wait_k_seconds_omp(3, 1))[3]
elapsed 
  3.293 
  • Run a 3 sec task on 3 threads:
system.time(wait_k_seconds_omp(3, 3))[3]
elapsed 
  1.092 

A more realistic example (R)

Say we want to check if all elements of a numeric vector are finite. We can write a first naive function in R:

all_finite_r_v1 <- function(x) {
  all(is.finite(x))
}

We can improve upon this version by summing all elements of the vector and checking if the result is finite, which avoids allocating an intermediate logical vector the size of x:

all_finite_r_v2 <- function(x) {
  is.finite(sum(x))
}

A C++ version

//---------------------------------
#include <Rcpp.h>

// // [[Rcpp::plugins(openmp)]] // Uncomment on Windows and Linux

// [[Rcpp::export]]
bool all_finite_cpp(Rcpp::NumericVector x)
{
  unsigned int nbInputs = x.size();

  double out = 0;
  for (unsigned int i = 0;i < nbInputs;++i)
    out += x[i];

  return R_FINITE(out);
}
  1. The function takes an Rcpp::NumericVector as input. Rcpp automatically converts between R vectors and C++ vectors, which is why we can pass an R vector directly to the function that sourceCpp() generates.
  2. The function returns R_FINITE(out), a macro from the C API of R that checks whether out is finite.

Parallelizing via OpenMP

//---------------------------------
#include <Rcpp.h>

// // [[Rcpp::plugins(openmp)]] // Uncomment on Windows and Linux

// [[Rcpp::export]]
bool all_finite_omp(Rcpp::NumericVector x, unsigned int ncores)
{
  unsigned int nbInputs = x.size();
  double out = 0;

#ifdef _OPENMP
#pragma omp parallel for reduction(+:out) num_threads(ncores)
#endif
  for (unsigned int i = 0;i < nbInputs;++i)
    out += x[i];

  return R_FINITE(out);
}
  1. The function takes an Rcpp::NumericVector as input and an additional argument ncores which specifies the number of threads to use.
  2. The reduction(+:out) clause tells the compiler that the out variable should be private to each thread, with the per-thread copies combined at the end of the loop. Without it, all threads would update the shared out concurrently, producing a data race.

Benchmarking

Code
x <- rnorm(1e8)
bm <- bench::mark(
  all(is.finite(x)),
  is.finite(sum(x)),
  all_finite_cpp(x),
  all_finite_omp(x,  1L), 
  all_finite_omp(x,  2L),
  all_finite_omp(x,  4L), 
  all_finite_omp(x,  8L),
  iterations = bch_runs, 
  time_unit = "ms"
)

Programming language | Expression            | Median computation time | Memory allocation
R                    | all(is.finite(x))     | 209.79 ms               | 400 MB
R                    | is.finite(sum(x))     | 154.48 ms               | 0 B
C++                  | all_finite_cpp(x)     | 93.09 ms                | 0 B
C++ (OpenMP)         | all_finite_omp(x, 1L) | 93.72 ms                | 0 B
C++ (OpenMP)         | all_finite_omp(x, 2L) | 47.54 ms                | 4.7 kB
C++ (OpenMP)         | all_finite_omp(x, 4L) | 24.44 ms                | 0 B
C++ (OpenMP)         | all_finite_omp(x, 8L) | 14.20 ms                | 0 B

Cheatsheet for OpenMP

OpenMP in an R package

  • Add the following lines to your package’s src/Makevars and src/Makevars.win files:

PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS)
PKG_LIBS = $(SHLIB_OPENMP_CXXFLAGS)

  • Create a C++ file src/omp_get_max_threads.cpp and start coding in it with OpenMP as we have seen (omitting the Rcpp::plugins() attributes); a sketch follows below.
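Such a file could look like the following sketch (the exported function name is illustrative, not prescribed by these slides):

//---------------------------------
#include <Rcpp.h>

#ifdef _OPENMP
#include <omp.h>
#endif

// [[Rcpp::export]]
int get_max_threads()
{
#ifdef _OPENMP
    // Number of threads OpenMP would use by default for a parallel region
    return omp_get_max_threads();
#else
    return 1; // OpenMP not available: sequential fallback
#endif
}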

Thread safety

Definition

A piece of code is thread-safe if it functions correctly during simultaneous execution by multiple threads. This is typically achieved by ensuring that shared data is accessed in a manner that avoids conflicts.
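To make the notion concrete, here is a minimal standalone illustration (not from these slides) using plain C++11 threads: two threads increment a shared counter, once without synchronization (a data race, hence not thread-safe) and once through std::atomic (thread-safe):

#include <atomic>
#include <iostream>
#include <thread>

int unsafe_counter = 0;           // plain int: concurrent increments race
std::atomic<int> safe_counter{0}; // atomic: concurrent increments are safe

void work()
{
    for (int i = 0;i < 100000;++i)
    {
        ++unsafe_counter; // data race: the final value is unpredictable
        ++safe_counter;   // thread-safe: always ends up at 200000
    }
}

int main()
{
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    std::cout << unsafe_counter << " vs " << safe_counter << std::endl;
    return 0;
}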

R & Rcpp APIs are not thread-safe

The code that you write within parallel workers should not call the R or Rcpp API in any fashion. This is because R is single-threaded and concurrent interaction with its data structures can cause crashes and other undefined behavior.

Calling any of the R API from threaded code is ‘for experts only’: they will need to read the source code to determine if it is thread-safe. In particular, code which makes use of the stack-checking mechanism must not be called from threaded code. Writing R Extensions (R Core Team, 2021).

Two consequences for R users

Problem 1: Random number generation

R’s C API provides access to the r*() functions for random number generation:

#include <Rcpp.h>

// [[Rcpp::export]]
double rnorm_cpp()
{
    return R::rnorm(0, 1);
}

They are not thread-safe and should not be called from within parallel workers.

Solution to Problem 1

Use a thread-safe generator such as the one provided by the {sitmo} package.

Two consequences for R users

Problem 2: Reading from and writing to R vectors and matrices

Not being able to call the R or Rcpp API creates an obvious challenge: how to read from and write to R vectors and matrices.

Solution to Problem 2

R vectors and matrices are just contiguous arrays of int, double, etc. Hence, they can be accessed using traditional array and pointer offsets. The {RcppParallel} package provides a convenient way to do this.
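The {RcppParallel} wrappers are covered later in this chapter. Meanwhile, here is a minimal sketch (illustrative only, not from these slides) of the raw-pointer idea combined with OpenMP: the pointers are obtained through the Rcpp API on the main thread, before the parallel region, and the workers then only use plain array offsets:

//---------------------------------
#include <Rcpp.h>

// // [[Rcpp::plugins(openmp)]] // Uncomment on Windows and Linux

// [[Rcpp::export]]
Rcpp::NumericVector double_elements(Rcpp::NumericVector x, unsigned int ncores)
{
  std::size_t n = x.size();
  Rcpp::NumericVector out(n);   // allocated on the main thread

  const double *px = x.begin(); // raw pointers obtained on the main
  double *pout = out.begin();   // thread, before any worker starts

#ifdef _OPENMP
#pragma omp parallel for num_threads(ncores)
#endif
  for (std::size_t i = 0;i < n;++i)
    pout[i] = 2.0 * px[i];      // plain pointer offsets: no R or Rcpp API calls

  return out;
}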

Threadsafe random number generation via {sitmo}

Sum of uniform samples (R)

sumunif <- function(n, nstep, seed) {
  withr::with_seed(seed, {
    rowSums(matrix(runif(n*nstep), n, nstep))   
  })
}
  1. Set the seed for reproducibility. Using the {withr} package ensures that the seed is reset to its original value after the function call.

Sum of uniform samples (C++)

//---------------------------------
#include <Rcpp.h>
#include <sitmo.h>

// [[Rcpp::depends(sitmo)]]

// [[Rcpp::export]]
Rcpp::NumericVector sumunif_sitmo(unsigned int n,
                                  unsigned int nstep,
                                  unsigned int seed)
{
  Rcpp::NumericVector out(n);
  sitmo::prng eng(seed);
  double mx = sitmo::prng::max();
  double tmp = 0;

  for (unsigned int i = 0;i < n;++i)
  {
    tmp = 0.0;
    for (unsigned int k = 0;k < nstep;++k)
      tmp += eng() / mx;

    out[i] = tmp;
  }

  return out;
}
  1. Include the {sitmo} package header to access the prng class.
  2. Declare the dependency on the {sitmo} package so that the proper include and link flags are set at compile time. In a package, this would be done in the DESCRIPTION file by adding LinkingTo: sitmo.
  3. Create a prng object with the specified seed.
  4. Get the maximum value that the generator can produce.
  5. Generate a uniform random number between 0 and 1.

Sum of uniform samples (OpenMP)

//---------------------------------
#include <Rcpp.h>
#include <sitmo.h>

// // [[Rcpp::plugins(openmp)]] // Uncomment on Windows and Linux

#ifdef _OPENMP
#include <omp.h>
#endif

// [[Rcpp::depends(sitmo)]]

// [[Rcpp::export]]
Rcpp::NumericVector sumunif_sitmo_omp(unsigned int n,
                                      unsigned int nstep,
                                      Rcpp::IntegerVector seeds)
{
  Rcpp::NumericVector out(n);

  unsigned int ncores = seeds.size();

#ifdef _OPENMP
#pragma omp parallel num_threads(ncores)
{
#endif
  unsigned int seed = seeds[0];

#ifdef _OPENMP
  seed = seeds[omp_get_thread_num()];
#endif

  sitmo::prng eng(seed);
  double mx = sitmo::prng::max();
  double tmp = 0;

#ifdef _OPENMP
#pragma omp for
#endif
  for (unsigned int i = 0;i < n;++i)
  {
    tmp = 0.0;
    for (unsigned int k = 0;k < nstep;++k)
      tmp += eng() / mx;

    out[i] = tmp;
  }

#ifdef _OPENMP
}
#endif

return out;
}
  1. Include the OpenMP header file to access the omp_get_thread_num() function.
  2. Pass a vector of seeds to the function: this allows each thread to have its own seed.
  3. Get the number of cores from the length of the seeds vector.
  4. Define the code section that will be parallelized and specify the number of threads.
  5. Set the seed for the first thread to handle the case where OpenMP is not enabled.
  6. Set the seed for each thread using each thread’s ID obtained from omp_get_thread_num().
  7. Parallelize the outer loop using the #pragma omp for directive.
  8. Close the parallel region.

Benchmarking

Code
n <- 1e6
nstep <- 1e3
seeds <- sample.int(1e6, 8)

bm <- bench::mark(
  rowSums(matrix(runif(n*nstep), n, nstep)),
  sumunif_sitmo(n, nstep, seeds[1]),
  sumunif_sitmo_omp(n, nstep, seeds[1:1]),
  sumunif_sitmo_omp(n, nstep, seeds[1:2]),
  sumunif_sitmo_omp(n, nstep, seeds[1:4]),
  sumunif_sitmo_omp(n, nstep, seeds[1:8]),
  iterations = bch_runs, 
  time_unit = "s", 
  check = FALSE
)

Programming language | Expression                                  | Median computation time
R                    | rowSums(matrix(runif(n * nstep), n, nstep)) | 7.22 s
C++                  | sumunif_sitmo(n, nstep, seeds[1])           | 2.36 s
C++ (OpenMP)         | sumunif_sitmo_omp(n, nstep, seeds[1:1])     | 2.43 s
C++ (OpenMP)         | sumunif_sitmo_omp(n, nstep, seeds[1:2])     | 1.22 s
C++ (OpenMP)         | sumunif_sitmo_omp(n, nstep, seeds[1:4])     | 0.62 s
C++ (OpenMP)         | sumunif_sitmo_omp(n, nstep, seeds[1:8])     | 0.33 s

{RcppParallel}

Overview

{RcppParallel} provides a complete toolkit for creating portable, high-performance parallel algorithms without requiring direct manipulation of operating system threads.

Features

  • Intel TBB, a C++ library for task parallelism with a wide variety of parallel algorithms and data structures (Windows, OS X, Linux, and Solaris x86 only).
  • TinyThread, a C++ library for portable use of operating system threads.
  • RVector and RMatrix wrapper classes for safe and convenient access to R data structures in a multi-threaded environment.
  • High level parallel functions (parallelFor() and parallelReduce()) that use Intel TBB as a back-end if supported and TinyThread otherwise.
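As a preview of the high-level API, here is a minimal sketch (illustrative only, not from these slides) of parallelReduce() computing the sum of a vector. It relies on the RVector accessor class introduced next; parallelFor() is demonstrated at length later in this section.

//---------------------------------
#include <Rcpp.h>
#include <numeric>

// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>

struct SumWorker : public RcppParallel::Worker {
  // Threadsafe wrapper around the input vector
  const RcppParallel::RVector<double> m_InputVector;

  // Partial sum accumulated by this worker
  double m_Value;

  // Standard constructor
  SumWorker(const Rcpp::NumericVector input)
    : m_InputVector(input), m_Value(0.0) {}

  // Split constructor, called when the range is split across threads
  SumWorker(const SumWorker &other, RcppParallel::Split)
    : m_InputVector(other.m_InputVector), m_Value(0.0) {}

  // Accumulate the assigned range into the partial sum
  void operator()(std::size_t begin, std::size_t end) {
    m_Value += std::accumulate(m_InputVector.begin() + begin,
                               m_InputVector.begin() + end, 0.0);
  }

  // Combine the partial sum of another worker into this one
  void join(const SumWorker &other) { m_Value += other.m_Value; }
};

// [[Rcpp::export]]
double parallel_sum(Rcpp::NumericVector x) {
  SumWorker sumWorker(x);
  RcppParallel::parallelReduce(0, x.size(), sumWorker);
  return sumWorker.m_Value;
}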

Vector and matrix accessor classes

//---------------------------------
#include <Rcpp.h>

// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>

// [[Rcpp::export]]
Rcpp::IntegerVector transformVector(Rcpp::IntegerVector x) {
  RcppParallel::RVector<int> input(x);
  Rcpp::IntegerVector y(x.size());
  RcppParallel::RVector<int> output(y);
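
  // ... perform the desired transformation here, possibly in parallel,
  // reading from the input wrapper and writing to the output wrapper ...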



  return y;
}
  1. The Rcpp attribute [[Rcpp::depends(RcppParallel)]] is used to indicate that the code depends on the RcppParallel package. It will ensure that the compilation flags are set correctly to compile the code.
  2. Include the RcppParallel.h header file to use the RVector and RMatrix classes.
  3. Create a threadsafe wrapper to the input Rcpp vector.
  4. Allocate memory for the output vector.
  5. Create a threadsafe wrapper to the output vector.
  6. Perform the desired transformation, possibly in parallel, using inputs stored in the input wrapper input and writing the results to the output wrapper output.
  7. Return the output Rcpp vector.

{RcppParallel} in an R package

DESCRIPTION

Imports: RcppParallel
LinkingTo: Rcpp, RcppParallel
SystemRequirements: GNU make

NAMESPACE

importFrom(RcppParallel, RcppParallelLibs)

If you are using {roxygen2} to generate the NAMESPACE file, you can add the following line to the packagename-package.R file:

#' @importFrom RcppParallel RcppParallelLibs

which will automatically populate the NAMESPACE upon devtools::document().

src/Makevars

PKG_LIBS += $(shell ${R_HOME}/bin/Rscript -e "RcppParallel::RcppParallelLibs()")

src/Makevars.win

PKG_CXXFLAGS += -DRCPP_PARALLEL_USE_TBB=1
PKG_LIBS += $(shell "${R_HOME}/bin${R_ARCH_BIN}/Rscript.exe" \
          -e "RcppParallel::RcppParallelLibs()")

Workflow

Now simply include the main RcppParallel.h header file in source files that need to use it:

#include <RcppParallel.h>

Error function (C++)

We will illustrate the use of {RcppParallel} by implementing the error function in C++.

//---------------------------------
#include <Rcpp.h>

// [[Rcpp::depends(BH)]]
#include <boost/math/special_functions/erf.hpp>

// [[Rcpp::export]]
Rcpp::NumericVector erf_cpp(Rcpp::NumericVector x) {
  Rcpp::NumericVector y(x.size());

  for (int i = 0; i < x.size(); i++) {
    y[i] = boost::math::erf(x[i]);
  }

  return y;
}
  1. The Rcpp attribute [[Rcpp::depends(BH)]] is used to indicate that the code depends on the Boost C++ libraries. It will ensure that the compilation flags are set correctly to compile the code.
  2. Include the boost/math/special_functions/erf.hpp header file to use the boost::math::erf() function.
  3. Compute the error function for each element of the input vector x and store the results in the output vector y.

Error function (R)

Now, let us define the error function in R for comparison:

erf_r <- function(x) {
  2 * pnorm(x * sqrt(2)) - 1
}

Let us check that both R and C++ versions of erf() provide the same results:

x <- rnorm(1e6)
max(abs(erf_r(x) - erf_cpp(x)))
[1] 4.440892e-16

The numerical difference is of the order of the machine precision.

Error function (RcppParallel + OpenMP)

//---------------------------------
#include <Rcpp.h>

// [[Rcpp::depends(BH)]]
#include <boost/math/special_functions/erf.hpp>

// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>

// // [[Rcpp::plugins(openmp)]] // Uncomment on Windows and Linux

// [[Rcpp::export]]
Rcpp::NumericVector erf_omp(Rcpp::NumericVector x, unsigned int ncores)
{
  unsigned int n = x.size();
  Rcpp::NumericVector out(n);

  RcppParallel::RVector<double> wo(out);
  RcppParallel::RVector<double> wx(x);

#ifdef _OPENMP
#pragma omp parallel for num_threads(ncores)
#endif
  for (unsigned int i = 0;i < n;++i)
    wo[i] = boost::math::erf(wx[i]);

  return out;
}
  1. Use the threadsafe wrapper class RVector of {RcppParallel} to manipulate the output vector.
  2. Use the threadsafe wrapper class RVector of {RcppParallel} to manipulate the input vector.
  3. Insert the appropriate OpenMP clauses and directives.
  4. Use the wrapped objects to perform the computation within the workers.

Error function (RcppParallel)

//---------------------------------
#include <Rcpp.h>

// [[Rcpp::depends(BH)]]
#include <boost/math/special_functions/erf.hpp>

// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>

struct ErfFunctor : public RcppParallel::Worker {
  // Threadsafe wrapper around input vector
  const RcppParallel::RVector<double> m_InputVector;

  // Threadsafe wrapper around output vector
  RcppParallel::RVector<double> m_OutputVector;

  // initialize with input and output vectors
  ErfFunctor(const Rcpp::NumericVector input, Rcpp::NumericVector output)
    : m_InputVector(input), m_OutputVector(output) {}

  // function call operator that works over the specified range (begin/end)
  void operator()(std::size_t begin, std::size_t end) {
    for (unsigned int i = begin;i < end;++i) {
      m_OutputVector[i] = boost::math::erf(m_InputVector[i]);
    }
  }
};

// [[Rcpp::export]]
Rcpp::NumericVector erf_parallel_impl(Rcpp::NumericVector x) {
  Rcpp::NumericVector y(x.size());

  ErfFunctor erfFunctor(x, y);
  RcppParallel::parallelFor(0, x.size(), erfFunctor);

  return y;
}
  1. Define a functor class ErfFunctor that inherits from RcppParallel::Worker for later use with RcppParallel::parallelFor().
  2. Define a first attribute m_InputVector for the functor class that is a threadsafe wrapper around an input vector.
  3. Define a second attribute m_OutputVector for the functor class that is a threadsafe wrapper around an output vector.
  4. Define a constructor for the functor class that takes input and output vectors and initializes the two corresponding class attributes.
  5. Define the function call operator operator() that will be called by RcppParallel::parallelFor() for the specific range that the worker will have to process.
  6. Define the main function erf_parallel_impl() that will be exported to R and used to call the parallel computation. It creates an instance of the functor class and calls RcppParallel::parallelFor() to perform the parallel computation.

Error function - Control ncores

The previously defined function erf_parallel_impl() has no way to control the number of cores used for the computation. We can define a wrapper function that will set the number of cores before calling erf_parallel_impl() and reset it afterwards:

erf_parallel <- function(x, ncores) {
  on.exit(RcppParallel::setThreadOptions())
  RcppParallel::setThreadOptions(numThreads = ncores)
  erf_parallel_impl(x)
}

Benchmarking

Code
bm <- bench::mark(
  erf_r(x),
  erf_cpp(x),
  erf_omp(x, 1),
  erf_omp(x, 2),
  erf_omp(x, 4),
  erf_omp(x, 8),
  erf_parallel(x, 1),
  erf_parallel(x, 2),
  erf_parallel(x, 4),
  erf_parallel(x, 8),
  iterations = bch_runs, 
  time_unit = "ms"
)

Programming language | Expression         | Median computation time | Memory allocation
R                    | erf_r(x)           | 27.07 ms                | 16 MB
C++                  | erf_cpp(x)         | 18.40 ms                | 8 MB
C++ (OpenMP)         | erf_omp(x, 1)      | 17.97 ms                | 8 MB
C++ (OpenMP)         | erf_omp(x, 2)      | 9.47 ms                 | 8 MB
C++ (OpenMP)         | erf_omp(x, 4)      | 5.33 ms                 | 8 MB
C++ (OpenMP)         | erf_omp(x, 8)      | 3.01 ms                 | 8 MB
C++ (RcppParallel)   | erf_parallel(x, 1) | 19.33 ms                | 8 MB
C++ (RcppParallel)   | erf_parallel(x, 2) | 10.17 ms                | 8 MB
C++ (RcppParallel)   | erf_parallel(x, 4) | 5.46 ms                 | 8 MB
C++ (RcppParallel)   | erf_parallel(x, 8) | 3.19 ms                 | 8 MB

Progress report

Progress bars via {RcppProgress}

The {RcppProgress} package provides a way to display progress bars in Rcpp code. This is useful when you have long-running C++ code and want to provide feedback to the user. It is also compatible with OpenMP parallel code.

Compatibility with {RcppParallel}

Progress bars generated by the {RcppProgress} package are not readily compatible with the {RcppParallel} framework. See this issue.

Progress bars with OpenMP

//---------------------------------
#include <Rcpp.h>

// // [[Rcpp::plugins(openmp)]] // Uncomment on Windows and Linux

// [[Rcpp::depends(BH)]]
#include <boost/math/special_functions/erf.hpp>

// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>

// [[Rcpp::depends(RcppProgress)]]
#include <progress.hpp>
#include <progress_bar.hpp>

// [[Rcpp::export]]
Rcpp::NumericVector erf_omp_progress(Rcpp::NumericVector x,
                                     unsigned int ncores,
                                     bool display_progress = false)
{
  unsigned int n = x.size();
  Rcpp::NumericVector out(n);
  Progress p(n, display_progress);

  RcppParallel::RVector<double> wo(out);
  RcppParallel::RVector<double> wx(x);

#ifdef _OPENMP
#pragma omp parallel for num_threads(ncores)
#endif
  for (unsigned int i = 0;i < n;++i)
  {
    if (!Progress::check_abort())
    {
      p.increment();
      wo[i] = boost::math::erf(wx[i]);
    }
  }

  return out;
}
  1. The Rcpp attribute Rcpp::depends(RcppProgress) ensures that compilation flags to link to the {RcppProgress} headers are properly set.
  2. Include the necessary headers to provide access to the Progress class.
  3. Add a flag to turn on progress bar display as an optional argument to your function, defaulting to false.
  4. Instantiate a progress bar via the Progress class, which takes as arguments the total number of increments the bar should achieve and whether the progress bar should be displayed.
  5. Within workers, check that the user did not abort the calculation.
  6. Within workers, increment the progress bar.
x <- rnorm(1e7)
y <- erf_omp_progress(x, 4, display_progress = TRUE)

{RcppThread}

Thread safe communication with R

Printing messages to the console

Problem. You might sometimes want to print messages to the R console. {Rcpp} provides Rcpp::Rcout as a replacement for std::cout which correctly places the messages in the R console. It is however not threadsafe.

Solution. {RcppThread} provides RcppThread::Rcout and RcppThread::Rcerr, which are threadsafe.

Interrupting computations

Problem. It is good practice in long-running computations to allow the user to manually interrupt the computation. This needs to be handled on the developer side. {Rcpp} provides the Rcpp::checkUserInterrupt() function for this purpose but, like the rest of the API, it is not threadsafe.

Solution. {RcppThread} provides RcppThread::checkUserInterrupt(), which is threadsafe.

Multithreading with std::thread

The RcppThread::Thread class

{RcppThread}’s Thread class is an R-friendly wrapper to std::thread. Instances of class Thread behave almost like instances of std::thread, with one important difference: whenever child threads are running, the main thread periodically synchronizes with R. In particular, it checks for user interruptions and releases all messages passed to RcppThread::Rcout and RcppThread::Rcerr. When the user interrupts a threaded computation, any thread will stop as soon as it encounters RcppThread::checkUserInterrupt().

Multithreading with std::thread

#include <Rcpp.h>

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(RcppThread)]]
#include <RcppThread.h>

// [[Rcpp::export]]
void pyjamaParty()
{
  auto job = [] (int id) {
    std::this_thread::sleep_for(std::chrono::seconds(1));
    RcppThread::Rcout << id << " slept for one second" << std::endl;
    RcppThread::checkUserInterrupt();
    std::this_thread::sleep_for(std::chrono::seconds(1));
    RcppThread::Rcout << id << " slept for another second" << std::endl;
  };
  RcppThread::Thread t1(job, 1);
  RcppThread::Thread t2(job, 2);
  t1.join();
  t2.join();
}
  1. Rcpp attribute to enable C++11 features.
  2. Rcpp attribute to ensure that the code is linked to the {RcppThread} headers with the proper compilation flags.
  3. Include the necessary header to provide access to the Thread class, RcppThread::Rcout, and RcppThread::checkUserInterrupt(). It also includes the standard library headers required for std::thread and std::chrono.
  4. Define the task to be executed by the threads as a lambda function job which takes an integer id as argument and does the following: sleep for one second, send a message, check for a user interruption, go back to sleep, and send another message.
  5. Spawn two new Threads with this job and different ids. Notice that the argument of the job function is passed to the Thread constructor. More generally, if a job function takes arguments, they must be passed to the constructor as a comma-separated list.
  6. Threads should always be joined before they are destructed. The .join() statements signal the main thread to wait until the jobs have finished. But instead of just waiting, the main thread starts synchronizing with R and checking for user interruptions.

Thread pool - Basic usage

#include <Rcpp.h>

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(RcppThread)]]
#include <RcppThread.h>

// [[Rcpp::export]]
std::vector<unsigned int> rcpp_thread_example1(unsigned int n,
                                               unsigned int ncores)
{
  RcppThread::ThreadPool pool(ncores);
  std::vector<unsigned int> x(n);
  auto task = [&x] (unsigned int i) { x[i] = i; };
  for (unsigned int i = 0;i < x.size();++i)
    pool.push(task, i);
  pool.join();
  return x;
}
  1. Create a thread pool with ncores threads. Useful when the number of tasks is known in advance.
  2. Define the task to be executed by the threads as a lambda function task which takes an integer i as argument and assigns i to the i-th element of x. The lambda captures x by reference so that all tasks write to the same vector rather than to a copy.
  3. Push the task to the thread pool for each element of x. The push() method takes the task and its arguments as arguments.
  4. Wait for all threads to finish. This is necessary because the main thread should not finish before the threads have finished. It also starts synchronizing with R and checking for user interruptions.
rcpp_thread_example1(10, 3)
 [1] 0 1 2 3 4 5 6 7 8 9

Thread pool which returns a value

#include <Rcpp.h>

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(RcppThread)]]
#include <RcppThread.h>

// [[Rcpp::export]]
std::vector<unsigned int> rcpp_thread_example2(unsigned int n,
                                               unsigned int ncores)
{
  RcppThread::ThreadPool pool(ncores);

  std::vector<unsigned int> x(n);
  for (unsigned int i = 0;i < n;++i)
    x[i] = i + 1;

  auto task = [&x] (unsigned int i) {
    return x[i] * x[i];
  };

  std::vector<std::future<unsigned int>> futures(n);
  for (unsigned int i = 0;i < n;++i)
    futures[i] = pool.pushReturn(task, i);

  std::vector<unsigned int> results(n);
  for (unsigned int i = 0;i < n;++i)
    results[i] = futures[i].get();

  pool.join();

  return results;
}
  1. Create a vector x of n elements and initialize it with the integers from 1 to n.
  2. Define the task to be executed by the threads as a lambda function task which takes an integer i as argument and returns the square of the i-th element of x. The lambda captures x by reference so that the tasks read from the original vector rather than from a copy.
  3. Create a vector of std::future objects to store the results of the tasks: the pushReturn() method returns a std::future object which can be used to later retrieve the result of the task.
  4. Retrieve the results of the tasks from the std::future objects and store them in a vector results.
  5. Wait for all threads to finish. This is necessary because the main thread should not finish before the threads have finished. It also starts synchronizing with R and checking for user interruptions.
rcpp_thread_example2(10, 3)
 [1]   1   4   9  16  25  36  49  64  81 100

Parallel for loop

#include <Rcpp.h>

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(RcppThread)]]
#include <RcppThread.h>

// [[Rcpp::export]]
std::vector<unsigned int> parallelfor_example(unsigned int n)
{
  // Index-based
  std::vector<unsigned int> x(n);

  auto task = [&x] (unsigned int i) {
    x[i] = i;
  };

  RcppThread::parallelFor(0, x.size(), task);

  return x;
}
  1. Execute the task for each element of x in parallel. The parallelFor() function takes the start index, end index, and the task as arguments.
parallelfor_example(10)
 [1] 0 1 2 3 4 5 6 7 8 9

Parallel for-each loop

#include <Rcpp.h>

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(RcppThread)]]
#include <RcppThread.h>

// [[Rcpp::export]]
std::vector<unsigned int> parallelforeach_example(unsigned int n)
{
  // Over elements of a vector
  std::vector<unsigned int> x(n);
  for (unsigned int i = 0;i < n;++i)
    x[i] = i;

  auto task = [] (unsigned int &xx) {
    xx *= 2;
  };

  RcppThread::parallelForEach(x, task);

  return x;
}
  1. Define the task to be executed by the threads as a lambda function task which takes an integer xx as argument and multiplies it by 2. The argument is passed by reference because the task modifies the argument.
  2. Execute the task for each element of x in parallel. The parallelForEach() function takes the vector and the task as arguments and applies the task to each element of the vector.
parallelforeach_example(10)
 [1]  0  2  4  6  8 10 12 14 16 18

{RcppThread} in an R package

Using {RcppThread} in an R package is easy:

  1. Add CXX_STD = CXX11 to the src/Makevars(.win) files of your package.
  2. Add RcppThread to the LinkingTo field in the DESCRIPTION file.
  3. Include the headers with #include "RcppThread.h" in your C++ source files within the src/ directory.
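
A minimal sketch of such a setup (file excerpts only; the package itself is hypothetical):

DESCRIPTION

LinkingTo: Rcpp, RcppThread

src/Makevars (and src/Makevars.win)

CXX_STD = CXX11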

Progress report

#include <Rcpp.h>

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(RcppThread)]]
#include <RcppThread.h>

// [[Rcpp::export]]
void pb_example()
{
  // 20 iterations in loop, update progress every 1 sec
  RcppThread::ProgressBar bar(20, 1);
  RcppThread::parallelFor(0, 20, [&] (int i) {
    std::this_thread::sleep_for(std::chrono::milliseconds(200));
    ++bar;
  });
}
  1. Create a progress bar with 20 iterations, updating the display every 1 second.
  2. Execute the task for each iteration in parallel. The task instructs the thread to sleep for 200 milliseconds and then increment the progress bar.
pb_example()

Computing: [======================                  ] 55%  (~0s remaining)       
Computing: [========================================] 100% (done)                         

References

“Beginning OpenMP.” n.d. http://chryswoods.com/beginning_openmp/.
Dagum, Leonardo, and Ramesh Menon. 1998. “OpenMP: An Industry Standard API for Shared-Memory Programming.” IEEE Computational Science and Engineering 5 (1): 46–55.
Nagler, Thomas. 2021. “R-Friendly Multi-Threading in C++.” Journal of Statistical Software, Code Snippets 97 (1): 1–18. https://doi.org/10.18637/jss.v097.c01.
“OpenMP API Reference Guide.” n.d. https://www.openmp.org/wp-content/uploads/OpenMPRefGuide-5.2-Web-2024.pdf.
“OpenMP Reference Sheet for C/C++.” n.d. https://cheat-sheets.org/saved-copy/OpenMP_reference.pdf.
R Core Team. 2021. Writing R Extensions. Vienna, Austria: R Foundation for Statistical Computing.
Reinders, James. 2007. Intel Threading Building Blocks: Outfitting C++ for Multi-Core Processor Parallelism. O’Reilly Media, Inc.
Schäling, Boris. 2014. The Boost C++ Libraries. Vol. 3. Laguna Hills: XML Press.