Boost the speed of R calls from Rcpp

If you are a user who needs to work with Rcpp-based packages, or you are a maintainer of one of such packages, you may be interested in the recent development of the unwind API, which can be leveraged to boost performance since the last Rcpp update. In a nutshell, until R 3.5.0, every R call from C++ code was executed inside a try-catch, which is really slow, to avoid breaking things apart. From v3.5.0 on, this API provides a new and safe fast evaluation path for such calls.

Some motivation

Here is a small comparison of the old and the new APIs. The following toy example just calls an R function N times from C++. A pure R for loop is also provided as a reference.

Rcpp::cppFunction('
  void old_api(Function func, int n) {
    for (int i=0; i<n; i++) func();
  }
')

Rcpp::cppFunction(plugins = "unwindProtect", '
  void new_api(Function func, int n) {
    for (int i=0; i<n; i++) func();
  }
')

reference <- function(func, N) {
  for (i in 1:N) func()
}

func <- function() 1
N <- 1e6

system.time(old_api(func, N))
##    user  system elapsed 
##  17.863   0.006  17.950
system.time(new_api(func, N))
##    user  system elapsed 
##   0.289   0.000   0.290
system.time(reference(func, N))
##    user  system elapsed 
##   0.216   0.000   0.217

Obviously, there is still some penalty compared to not switching between domains, but the performance gain with respect to the old API is outstanding.

A real-world example

This is a quite heavy simulation of an M/M/1 system using simmer:

library(simmer)

system.time({
  mm1 <- trajectory() %>%
    seize("server", 1) %>%
    timeout(function() rexp(1, 66)) %>%
    release("server", 1)

  env <- simmer() %>%
    add_resource("server", 1) %>%
    add_generator("customer", mm1, function() rexp(50, 60), mon=F) %>%
    run(10000, progress=progress::progress_bar$new()$update)
})

In my system, it takes around 17 seconds with the old API. The new API makes it in less than 5 seconds. As a reference, if we avoid R calls in the timeout activity and precompute all the arrivals instead of defining a dynamic generator, i.e.:

system.time({
  input <- data.frame(
    time = rexp(10000*60, 60),
    service = rexp(10000*60, 66)
  )

  mm1 <- trajectory() %>%
    seize("server", 1) %>%
    timeout_from_attribute("service") %>%
    release("server", 1)

  env <- simmer() %>%
    add_resource("server", 1) %>%
    add_dataframe("customer", mm1, input, mon=F, batch=50) %>%
    run(10000, progress=progress::progress_bar$new()$update)
})

then the simulation takes around 2.5 seconds.

How to start using this feature

First of all, you need R >= 3.5.0 and Rcpp >= 0.12.18 installed. Then, if you are a user, the easiest way to enable this globally is to add CPPFLAGS += -DRCPP_USE_UNWIND_PROTECT to your ~/.R/Makevars. Packages installed or re-installed, as well as functions compiled with Rcpp::sourceCpp and Rcpp::cppFunction, will benefit from this performance gains. If you are a package maintainer, you can add -DRCPP_USE_UNWIND_PROTECT to your package’s PKG_CPPFLAGS in src/Makevars. Alternatively, there is a plugin available, so this flag can be enabled by adding [[Rcpp::plugins(unwindProtect)]] to one of your source files.

Note that this is fairly safe according to reverse dependency checks, but there might be still issues in some packages. But the sooner we start testing this feature and reporting possible issues, the sooner it will be enabled by default in Rcpp.

simmer 4.0.0

The 4.0.0 release of simmer, the Discrete-Event Simulator for R, is on CRAN under a new license: we decided to switch to GPL >= 2. Most notably in this major release, the C++ core has been refactorised and exposed under inst/include. This is not a big deal for most users, but it enables extensions. As an example of this, simmer.mon is an experimental package that links to simmer and extends its monitoring facilities to provide a new DBI-based backend. Not a very efficient one, but it demonstrates how to extend the simmer core from another package.

Exception handling has been remarkably improved. In previous releases, errors were reported to happen in the run() method, which is… everything that can happen, obviously. In this version, errors are catched and more information is provided, particularly about the simulation time, the arrival and the activity involved:

library(simmer)

bad.traj <- trajectory() %>%
  timeout(function() NA)

simmer() %>%
  add_generator("dummy", bad.traj, at(pi)) %>%
  run()
## Error: 'dummy0' at 3.14 in 'Timeout':
##  missing value (NA or NaN returned)

Another improvement has to do with attributes. These are commonly used to build incremental indices, but some boilerplate was needed to initialise them. Now this is automatic (and configurable):

index.traj <- trajectory() %>%
  set_global("index", 1, mod="+", init=10)

simmer() %>%
  add_generator("dummy", index.traj, at(1:3), mon=2) %>%
  run() %>%
  get_mon_attributes()
##   time name   key value replication
## 1    1      index    11           1
## 2    2      index    12           1
## 3    3      index    13           1

Finally, the log_ activity was created for occasional debugging, but we noticed that simmer users use it a lot more to know what is happening when they build models, but so much output is annoying when a model is complete. Therefore, we have implemented simulation-scoped logging levels to be able to turn on and off specific messages on demand:

log.traj <- trajectory() %>%
  log_("This will be always printed") %>% # level=0
  log_("This can be disabled", level=1)

simmer(log_level=1) %>%
  add_generator("dummy", log.traj, at(pi)) %>%
  run() %>% invisible()
## 3.14159: dummy0: This will be always printed
## 3.14159: dummy0: This can be disabled
simmer() %>% # log_level=0
  add_generator("dummy", log.traj, at(pi)) %>%
  run() %>% invisible()
## 3.14159: dummy0: This will be always printed

See below for a comprehensive list of changes.

New features:

  • The C++ core has been refactorised into a header-only library under inst/include (#147 closing #145). Therefore, from now on it is possible to extend the C++ API from another package by listing simmer under the LinkingTo field in the DESCRIPTION file.
  • New generic monitor constructor enables the development of new monitoring backends in other packages (179f656, as part of #147).
  • New simulation-scoped logging levels. The log_ activity has a new argument level which determines whether the message is printed depending on a global log_level defined in the simmer constructor (#152).
  • set_attribute and set_global gain a new argument to automatically initialise new attributes (#157). Useful to update counters and indexes in a single line, without initialisation boilerplate.

Minor changes and fixes:

  • Enhanced exception handling, with more informative error messages (#148).
  • Refactorisation of the printing methods and associated code (#149).
  • Allow empty trajectories in sources and activities with sub-trajectories (#151 closing #150).
  • Enable -DRCPP_PROTECTED_EVAL (Rcpp >= 0.12.17.3), which provides fast evaluation of R expressions by leveraging the new stack unwinding protection API (R >= 3.5.0).
  • Replace backspace usage in vector’s ostream method (2b2f43e).
  • Fix namespace clashes with rlang and purrr (#154).

simmer 3.8.0

The 3.8.0 release of simmer, the Discrete-Event Simulator for R, hit CRAN almost a week ago, and Windows binaries are already available. This version includes two highly requested new features that justify this second consecutive minor release.

Attachment of precomputed data

Until v3.7.0, the generator was the only means to attach data to trajectories, and it was primarily intended for dynamic generation of arrivals:

library(simmer)
set.seed(42)

hello_sayer <- trajectory() %>%
  log_("hello!")

simmer() %>%
  add_generator("dummy", hello_sayer, function() rexp(1, 1)) %>%
  run(until=2)
## 0.198337: dummy0: hello!
## 0.859232: dummy1: hello!
## 1.14272: dummy2: hello!
## 1.18091: dummy3: hello!
## 1.65409: dummy4: hello!
## simmer environment: anonymous | now: 2 | next: 3.11771876826972
## { Monitor: in memory }
## { Source: dummy | monitored: 1 | n_generated: 6 }

Although it may be used to attach precomputed data too, especially using the at() adaptor:

simmer() %>%
  add_generator("dummy", hello_sayer, at(seq(0, 10, 0.5))) %>%
  run(until=2)
## 0: dummy0: hello!
## 0.5: dummy1: hello!
## 1: dummy2: hello!
## 1.5: dummy3: hello!
## simmer environment: anonymous | now: 2 | next: 2
## { Monitor: in memory }
## { Source: dummy | monitored: 1 | n_generated: 21 }

Now, let’s say that we want to attach some empirical data, and our observations not only include arrival times, but also priorities and some attributes (e.g., measured service times), as in this question on StackOverflow:

myData <- data.frame(
  time = c(1:10,1:5), 
  priority = 1:3, 
  duration = rnorm(15, 50, 5)) %>%
  dplyr::arrange(time)

This is indeed possible using generators, but it requires some trickery; more specifically, the clever usage of a consumer function as follows:

consume <- function(x, prio=FALSE) {
  i <- 0
  function() {
    i <<- i + 1
    if (prio) c(x[[i]], x[[i]], FALSE)
    else x[[i]]
  }
}

activityTraj <- trajectory() %>%
  seize("worker") %>%
  timeout_from_attribute("duration") %>%
  release("worker")

initialization <- trajectory() %>%
  set_prioritization(consume(myData$priority, TRUE)) %>%
  set_attribute("duration", consume(myData$duration)) %>%
  join(activityTraj)

arrivals_gen <- simmer() %>%
  add_resource("worker", 2, preemptive=TRUE) %>%
  add_generator("dummy_", initialization, at(myData$time)) %>%
  run() %>%
  get_mon_arrivals()

# check the resulting duration times
activity_time <- arrivals_gen %>%
  tidyr::separate(name, c("prefix", "n"), convert=TRUE) %>%
  dplyr::arrange(n) %>%
  dplyr::pull(activity_time)

all(activity_time == myData$duration)
## [1] TRUE

Since this v3.8.0, the new data source add_dataframe greatly simplifies this process:

arrivals_df <- simmer() %>%
  add_resource("worker", 2, preemptive=TRUE) %>%
  add_dataframe("dummy_", activityTraj, myData, time="absolute") %>%
  run() %>%
  get_mon_arrivals()

identical(arrivals_gen, arrivals_df)
## [1] TRUE

On-disk monitoring

As some users noted (see 12), the default in-memory monitoring capabilities can turn problematic for very long simulations. To address this issue, the simmer() constructor gains a new argument, mon, to provide different types of monitors. Monitoring is still performed in-memory by default, but as of v3.8.0, it can be offloaded to disk through monitor_delim() and monitor_csv(), which produce flat delimited files.

mon <- monitor_csv()
mon
## simmer monitor: to disk (delimited files)
## { arrivals: /tmp/RtmpAlQH2g/file6933ce99281_arrivals.csv }
## { releases: /tmp/RtmpAlQH2g/file6933ce99281_releases.csv }
## { attributes: /tmp/RtmpAlQH2g/file6933ce99281_attributes.csv }
## { resources: /tmp/RtmpAlQH2g/file6933ce99281_resources.csv }
env <- simmer(mon=mon) %>%
  add_generator("dummy", hello_sayer, function() rexp(1, 1)) %>%
  run(until=2)
## 0.26309: dummy0: hello!
## 0.982183: dummy1: hello!
env
## simmer environment: anonymous | now: 2 | next: 2.29067480322535
## { Monitor: to disk (delimited files) }
##   { arrivals: /tmp/RtmpAlQH2g/file6933ce99281_arrivals.csv }
##   { releases: /tmp/RtmpAlQH2g/file6933ce99281_releases.csv }
##   { attributes: /tmp/RtmpAlQH2g/file6933ce99281_attributes.csv }
##   { resources: /tmp/RtmpAlQH2g/file6933ce99281_resources.csv }
## { Source: dummy | monitored: 1 | n_generated: 3 }
read.csv(mon$handlers["arrivals"]) # direct access
##     name start_time  end_time activity_time finished
## 1 dummy0  0.2630904 0.2630904             0        1
## 2 dummy1  0.9821828 0.9821828             0        1
get_mon_arrivals(env)              # adds the "replication" column
##     name start_time  end_time activity_time finished replication
## 1 dummy0  0.2630904 0.2630904             0        1           1
## 2 dummy1  0.9821828 0.9821828             0        1           1

See below for a comprehensive list of changes.

New features:

  • New data source add_dataframe enables the attachment of precomputed data, in the form of a data frame, to a trajectory. It can be used instead of (or along with) add_generator. The most notable advantage over the latter is that add_dataframe is able to automatically set attributes and prioritisation values per arrival based on columns of the provided data frame (#140 closing #123).
  • New set_source activity deprecates set_distribution(). It works both for generators and data sources (275a09c, as part of #140).
  • New monitoring interface allows for disk offloading. The simmer() constructor gains a new argument mon to provide different types of monitors. By default, monitoring is performed in-memory, as usual. Additionally, monitoring can be offloaded to disk through monitor_delim and monitor_csv, which produce flat delimited files. But more importantly, the C++ interface has been refactorised to enable the development of new monitoring backends (#146 closing #119).

Minor changes and fixes:

  • Some documentation improvements (1e14ed7, 194ed05).
  • New default until=Inf for the run method (3e6aae9, as part of #140).
  • branch and clone now accept lists of trajectories, in the same way as join, so that there is no need to use do.call (#142).
  • The argument continue (present in seize and branch) is recycled if only one value is provided but several sub-trajectories are defined (#143).
  • Fix process reset: sources are reset in strict order of creation (e7d909b).
  • Fix infinite timeouts (#144).

simmer 3.7.0

The 3.7.0 release of simmer, the Discrete-Event Simulator for R, is on CRAN. It includes several API improvements and bug fixes. Among the former, the new timeout_from_attribute() activity makes it easier and more efficient the common task of placing a timeout based on a previously set attribute. Another common task is to increment or decrement a given attribute. To this end, set_attribute and other setters get a new argument mod which, if set to "+" or "*", modifies the value correspondingly instead of substituting it.

This minor release also includes some minor breaking changes. In particular, all deprecations from the v3.6.x series have been finally removed, which should come as no surprise. Besides, get_mon_resources() loses the data argument, which was there for historical reasons and probably nobody was using it.

Finally, there are two additional vignettes:

  • “simmer: Discrete-Event Simulation for R” describes the internal design, the R API, provides some modelling examples and a performance evaluation. We are very proud to officially announce that it has been accepted for publication in the Journal of Statistical Software.
  • “Design and Analysis of 5G Scenarios” contains supplementary materials for a homonymous paper that has been accepted for publication in the IEEE Communications Magazine.

See the citation information for further details.

New features:

  • New timeout_from_attribute() activity makes it easier to set a timeout based on an attribute (#129).
  • The activities set_attribute()set_prioritization()set_capacity() and set_queue_size() get a new argument mod which, if set to "+" or "*", modifies the corresponding value instead of substituting it. This makes it easier to increment, decrement or scale one of these values (#130).
  • New *_selected() versions for the already available resource getters: get_capacity()get_queue seize()get_server_count()and get_queue_count() (#134).

Minor changes and fixes:

  • Broadcast signals with higher priority to prevent an arrival to catch its own signal with a trap() after a send() (#135).
  • Generate new arrivals with minimum priority to avoid wrong interactions with simultaneous activities (#136).
  • Remove v3.6.x deprecations: the old attribute retrieval system (see notes for v3.6.3), as well as methods create_trajectory() and onestep() (#117).
  • Remove get_mon_resources()’s data argument. It was there for historical reasons and probably nobody was using it (851d34b).
  • New vignette, “simmer: Discrete-Event Simuation for R”, paper accepted for publication in the Journal of Statistical Software. Remove “Terminology” vignette (#127).
  • New vignette, “Design and Analysis of 5G Scenarios”, supplementary materials for a paper accepted for publication in the IEEE Communications Magazine (#137).

Documenting R packages: roxygen2 vs. direct Rd input

As the reader may know,

R objects are documented in files written in “R documentation” (Rd) format, a simple markup language much of which closely resembles (La)TeX, which can be processed into a variety of formats, including LaTeX, HTML and plain text.

This LaTeX-like syntax, combined with the fact that the actual R objects live in a separate place, feels burdensome for many developers. As a consequence, there are a handful of tools aimed at improving the documentation process, one of which is roxygen2. We may say that the R community nowadays is divided between those who use roxygen2 and those who don’t.

The roxygen2 package allows us to write documentation right next to the code that is being described with decorated comments. The advantages are the following:

  • Code and documentation are adjacent so when you modify your code, it’s easy to remember that you need to update the documentation.
  • Roxygen2 dynamically inspects the objects that it’s documenting, so it can automatically add data that you’d otherwise have to write by hand.
  • It abstracts over the differences in documenting S3 and S4 methods, generics and classes so you need to learn fewer details.

Although both roxygenists and non-roxygenists surely agree that documentation is one of the most important aspects of good code, the alleged benefits of roxygen2 could turn into a disadvantage. In the words of Duncan Murdoch,

This isn’t the fashionable point of view, but I think it is easier to get good documentation [by directly editing Rd files] than using Roxygen. […]

The reason I think this is that good documentation requires work and thought. You need to think about the markup that will get your point across, you need to think about putting together good examples, etc. This is harder in Roxygen than if you are writing Rd files, because Roxygen is a thin front end to produce Rd files from comments in your .R files. To get good stuff in the help page, you need just as much work as in writing the .Rd file directly, but then you need to add another layer on top to put in in a comment. Most people don’t bother.

Basically, roxygen2’s point is that you don’t need to work in the syntax, so that you can use that time to write actual documentation. Duncan’s point, instead, is that, if you don’t put effort in the writing process, there’s a chance that you won’t put any effort at all. Although I’m a happy roxygen2 user, I can see there’s a point in there, and an interesting analysis to be done.

In fact, if you happen to have an uncompressed copy of CRAN under, let’s say, ~/cran, you can execute the following script:

## Requires: r-lib/pkgdown, readr
setwd("~/cran")

get_lines <- function(Rd) {
  # render as txt
  txt <- try(capture.output(tools::Rd2txt(Rd)), silent=TRUE)
  if (inherits(txt, "try-error")) # "rcqp" throws an error, why?
    return(c(documentation=NA, examples=NA))
  # remove blank lines
  txt <- txt[!grepl("^[[:space:]]*$", txt)]
  # split documentation and examples
  examples <- grep("_\bE_\bx_\ba_\bm_\bp_\bl_\be_\bs:", txt)
  if (length(examples)) {
    doc <- txt[1:(examples-1)]
    exm <- txt[(examples+1):length(txt)]
  } else {
    doc <- txt
    exm <- NULL
  }
  # remove titles
  doc <- doc[!grepl("_\b", doc)]
  # output
  c(documentation=length(doc), examples=length(exm))
}

do.call(rbind, parallel::mclapply(Sys.glob("*"), function(pkg) {
  message("Parsing ", pkg, "...")
  rds <- Sys.glob(file.path(pkg, "man", "*.[R|r]d"))
  if (!length(rds))
    df <- data.frame(documentation=0, examples=0, functions=0)
  else {
    # get no. lines for documentation & examples
    df <- as.data.frame(t(rowSums(sapply(rds, get_lines), na.rm=TRUE)))
    # get no. exported functions
    df$functions <- sum(sapply(rds, function(rd) {
      rd <- pkgdown:::rd_file(rd)
      length(pkgdown:::usage_funs(pkgdown:::topic_usage(rd)))
    }))
  }
  # RoxygenNote present?
  desc <- file.path(pkg, "DESCRIPTION")
  df$roxygen <- !is.na(read.dcf(desc, fields="RoxygenNote")[[1]])
  df$pkg <- pkg
  df
}, mc.cores=parallel::detectCores())) -> docLines

readr::write_csv(docLines, "docLines.csv")

to get this data frame. For each package on CRAN, we extract the number of lines of documentation and examples under the man directory, as rendered by tools::Rd2txt. We also count how many functions are documented, and we scan the DESCRIPTION file looking for the RoxygenNote, to tell which packages use roxygen2. This is all I need to see what I was looking for:

library(ggplot2)
library(dplyr)
library(tidyr)

docLines <- read.csv("docLines.csv") %>%
  filter(functions > 0) %>%
  gather("type", "lines", documentation, examples)

ggplot(docLines, aes(lines/functions, color=roxygen, fill=roxygen)) + theme_bw() + 
  geom_density(alpha=.3) + facet_wrap(~type) + scale_x_log10()
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 221 rows containing non-finite values (stat_density).

Limitations:

  • This talks about quantity, but not about quality.
  • The method of extraction of documentation and examples is very coarse. For sure there are better ways.
  • The amount of documentation must be weighted in some way. Just dividing it by the number of exported functions and methods may not be the best way.
  • roxygen2 appeared in 2011, but I think it became more popular in recent years. It may be interesting to restrict the analysis to recent packages.
  • Some developers prioritise vignettes over examples. It may be another interesting factor to analyse.

But all in all, I believe that this simple analysis proves Duncan right to some extent. And as a roxygen2 user that very much cares about documentation, this warns me against my own biases. If you care too, make sure that you really take advantage of the time you save with roxygen2.