simmer 4.1.0

The 4.1.0 release of simmer, the Discrete-Event Simulator for R, is on CRAN. As requested on the mailing list, get_global() now works inside a generator function. Moreover, the new add_global() method attaches a global attribute to a simulator.

library(simmer)

env <- simmer()

hello_sayer <- trajectory() %>%
  log_("hello world!") %>%
  set_global("interarrival", 1, mod="+")

generator <- function() get_global(env, "interarrival")

env %>%
  add_global("interarrival", 1) %>%
  add_generator("dummy", hello_sayer, generator) %>%
  run(7) %>%
  get_global("interarrival")
## 1: dummy0: hello world!
## 3: dummy1: hello world!
## 6: dummy2: hello world!
## [1] 4

Compared to plain global variables, these are automatically managed and thus reinitialised if the environment is reset.

env %>%
  reset() %>%
  get_global("interarrival")
## [1] 1
env %>%
  run(7) %>%
  get_global("interarrival")
## 1: dummy0: hello world!
## 3: dummy1: hello world!
## 6: dummy2: hello world!
## [1] 4

There has been a small refactoring of some parts of the C++ core, which motivates the minor version bump, but this shouldn’t be noticeable to the users. Finally, several bug fixes and improvements complete this release. See below for a complete list.

New features:

  • New getter get_selected() retrieves names of selected resources via the select() activity (#172 addressing #171).
  • Source and resource getters have been vectorised to retrieve parameters from multiple entities (as part of #172).
  • Simplify C++ Simulator interface for adding processes and resources (#162). The responsibility of building the objects has been moved to the caller.
  • New add_global() method to attach global attributes to a simulation environment (#174 addressing #158).
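
As a rough illustration of the two getter-related items above, the sketch below logs the resource chosen by select() and then queries two resources with a single call. The resource names, the trajectory and the round-robin policy are assumptions made up for this example, and the exact shape of the returned vector may differ between versions.

```r
library(simmer)

env <- simmer()

# Hypothetical trajectory: pick a resource and log which one was selected
picker <- trajectory() %>%
  select(c("a", "b"), policy = "round-robin") %>%
  log_(function() get_selected(env))

env %>%
  add_resource("a", capacity = 1) %>%
  add_resource("b", capacity = 2) %>%
  add_generator("dummy", picker, at(1, 2)) %>%
  run() %>% invisible()

# Vectorised getter: one call, several resources
caps <- get_capacity(env, c("a", "b"))
caps
```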

Minor changes and fixes:

  • Remove 3.8.0 and 4.0.1 deprecations (#170 addressing #165).
  • Fix get_global() to work outside trajectories (#170 addressing #165).
  • Fix rollback() with an infinite amount (#173).
  • Fix and improve schedules and managers (as part of #174).
  • Fix reset() to avoid overwriting the simulation environment (#175).

simmer 4.0.1

The 4.0.1 release of simmer, the Discrete-Event Simulator for R, has been on CRAN for a couple of weeks. There are few changes, notably new getters (get_sources(), get_resources(), get_trajectory()) for simmer environments and some improvements in resource selection policies (see details in help(select)).

A new convenience function, when_activated, makes it easier to generate arrivals on demand, triggered from trajectories. Let us consider, for instance, a simple restocking pattern:

library(simmer)

restock <- trajectory() %>%
  log_("restock")

serve <- trajectory() %>%
  log_("serve") %>%
  activate("Restock")

env <- simmer() %>%
  add_generator("Customer", serve, at(1, 2, 3)) %>%
  add_generator("Restock", restock, when_activated()) %>%
  run()
## 1: Customer0: serve
## 1: Restock0: restock
## 2: Customer1: serve
## 2: Restock1: restock
## 3: Customer2: serve
## 3: Restock2: restock

Finally, this release leverages the new fast evaluation framework offered by Rcpp (>= 0.12.18) by default, and includes some minor improvements and bug fixes.

New features:

  • New getters (#159):
    • get_sources() and get_resources() retrieve a character vector of source/resource names defined in a simulation environment.
    • get_trajectory() retrieves a trajectory to which a given source is attached.
  • New resource selection policies: shortest-queue-available, round-robin-available, random-available (#156). These are the same as the existing non-available ones, but they exclude unavailable resources (capacity set to zero). Thus, if all resources are unavailable, an error is raised.
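
The new getters and one of the "available" policies can be tried together in a small model. This is only a sketch: the two servers, their capacities and the arrival times are hypothetical choices for the example, with one server deliberately left at zero capacity so that the policy has something to exclude.

```r
library(simmer)

# Hypothetical model: two servers, one of them unavailable (capacity 0)
traj <- trajectory() %>%
  select(c("server_a", "server_b"), policy = "shortest-queue-available") %>%
  seize_selected(1) %>%
  timeout(1) %>%
  release_selected(1)

env <- simmer() %>%
  add_resource("server_a", capacity = 0) %>%
  add_resource("server_b", capacity = 1) %>%
  add_generator("customer", traj, at(0, 1, 2))

get_sources(env)    # names of the sources defined in env
get_resources(env)  # names of the resources defined in env

env %>% run() %>% invisible()

# Resources that actually served customers; server_a should never appear
served <- get_mon_arrivals(env, per_resource = TRUE)
unique(served$resource)
```

With the plain shortest-queue policy, arrivals could end up queueing for the zero-capacity server, which is the situation the "available" variants are meant to avoid.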

Minor changes and fixes:

  • Rename -DRCPP_PROTECTED_EVAL (Rcpp >= 0.12.17.4) as -DRCPP_USE_UNWIND_PROTECT (6d27671).
  • Keep compilation quieter with -DBOOST_NO_AUTO_PTR (70328b6).
  • Improve log_ print (7c2e3b1).
  • Add when_activated() convenience function to easily generate arrivals on demand from trajectories (#161 closing #160).
  • Enhance schedule printing (9c66285).
  • Fix generator-manager name clashing (#163).
  • Deprecate set_attribute(global=TRUE), get_attribute(global=TRUE) and timeout_from_attribute(global=TRUE) (#164); the *_global versions should be used instead.
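
For reference, a minimal sketch of the *_global style that replaces the deprecated global=TRUE arguments. The attribute name "count" and the arrival times are assumptions chosen for the example.

```r
library(simmer)

# Each arrival increments a global counter, then waits for that many time units
counting <- trajectory() %>%
  set_global("count", 1, mod = "+") %>%
  timeout_from_global("count")

env <- simmer() %>%
  add_generator("dummy", counting, at(1, 2, 3)) %>%
  run()

# Final value of the counter after three arrivals
total <- get_global(env, "count")
total
```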

Read the docs before questioning R’s defaults

The latest R tip in Win-Vector Blog encourages you to Use Radix Sort based on a simple benchmark showing a 35x speedup compared to the default method, but with no further explanation. In my opinion, though, the complete tip would be, instead, use radix sort… if you know what you are doing, because a quick benchmark shouldn’t spare you the effort of actually reading the docs. And here is a spoiler: you are already using it.

One may wonder why R’s default sorting algorithm is so bad, and why it was even chosen. The thing is that there is a trick here, and to understand it, first we must understand the benchmark’s data and then read the docs. This is the function from the original code (slightly modified for subsequent reuse) that generates the data:

mk_data <- function(nrow, stringsAsFactors = FALSE) {
  alphabet <- paste("sym", seq_len(max(2, floor(nrow^(1/3)))), sep = "_")
  data.frame(col_a = sample(alphabet, nrow, replace=TRUE),
             col_b = sample(alphabet, nrow, replace=TRUE),
             col_c = sample(alphabet, nrow, replace=TRUE),
             col_x = runif(nrow),
             stringsAsFactors = stringsAsFactors)
}

set.seed(32523)
d <- mk_data(1e+6)

summary(d)
##     col_a              col_b              col_c          
##  Length:1000000     Length:1000000     Length:1000000    
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##      col_x          
##  Min.   :0.0000002  
##  1st Qu.:0.2496717  
##  Median :0.4991010  
##  Mean   :0.4996031  
##  3rd Qu.:0.7494089  
##  Max.   :0.9999999
length(table(d$col_a))
## [1] 99

There are three character columns sampled from 99 symbols (sym_1, sym_2, …, sym_99) and a numeric column sampled from a uniform. The first three columns are thus clearly factors, but they are not treated as such. Let’s see now what help(sort) has to tell us about the sorting method, which by default is method="auto":

The “auto” method selects “radix” for short (less than 2^31 elements) numeric vectors, integer vectors, logical vectors and factors; otherwise, “shell”.

So, as I said in the opening paragraph, you are already using radix sort, except for characters. Let’s see then what happens if we treat such columns as proper factors:

library(microbenchmark)

set.seed(32523)
d <- mk_data(1e+6, stringsAsFactors = TRUE)

timings <- microbenchmark(
  order_default = d[order(d$col_a, d$col_b, d$col_c, d$col_x), , 
                    drop = FALSE],
  order_radix = d[order(d$col_a, d$col_b, d$col_c, d$col_x,
                        method = "radix"), ,
                  drop = FALSE],
  times = 10L)

print(timings)
## Unit: milliseconds
##           expr      min       lq     mean   median       uq      max neval
##  order_default 289.4685 312.0257 388.5259 387.8308 418.2673 584.4771    10
##    order_radix 265.6491 321.8337 421.2072 376.1166 512.0047 667.0545    10
##  cld
##    a
##    a

Unsurprisingly, timings are the same, because R automatically selects "radix" for you when appropriate. But when is it considered appropriate and why isn’t it appropriate in general for character vectors? We should go back to the docs:

The implementation is orders of magnitude faster than shell sort for character vectors, in part thanks to clever use of the internal CHARSXP table.

However, there are some caveats with the radix sort:

  • If x is a character vector, all elements must share the same encoding. Only UTF-8 (including ASCII) and Latin-1 encodings are supported. Collation always follows the “C” locale.
  • Long vectors (with more than 2^32 elements) and complex vectors are not supported yet.

And there it is: R is doing the right thing for you for the general case. So let us round up the tip: enforce method="radix" for character vectors if you know what you are doing. And, please, do read the docs.
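
To make the collation caveat concrete, here is a small sketch with a made-up character vector. With method="radix", the order follows the "C" locale, so uppercase ASCII letters sort before lowercase ones; the default method follows the session locale, so its result depends on where the code runs.

```r
x <- c("banana", "Apple", "cherry", "apple", "Banana")

# Radix sort always collates in the "C" locale: uppercase ASCII letters
# come before lowercase ones, regardless of the session locale
by_radix <- sort(x, method = "radix")
by_radix

# The default method for character vectors follows the session locale,
# so its result may or may not match the one above
by_default <- sort(x)
identical(by_default, by_radix)
```

If the data are plain ASCII identifiers such as the sym_* values in the benchmark, this difference is usually harmless; for natural-language text it may not be.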

Boost the speed of R calls from Rcpp

If you work with Rcpp-based packages, or you maintain one, you may be interested in the recent development of the unwind API, which can be leveraged to boost performance since the last Rcpp update. In a nutshell, until R 3.5.0, every R call from C++ code was executed inside a try-catch, which is really slow, to avoid breaking things apart. From v3.5.0 on, this API provides a new and safe fast evaluation path for such calls.

Some motivation

Here is a small comparison of the old and the new APIs. The following toy example just calls an R function N times from C++. A pure R for loop is also provided as a reference.

Rcpp::cppFunction('
  void old_api(Function func, int n) {
    for (int i=0; i<n; i++) func();
  }
')

Rcpp::cppFunction(plugins = "unwindProtect", '
  void new_api(Function func, int n) {
    for (int i=0; i<n; i++) func();
  }
')

reference <- function(func, N) {
  for (i in 1:N) func()
}

func <- function() 1
N <- 1e6

system.time(old_api(func, N))
##    user  system elapsed 
##  17.863   0.006  17.950
system.time(new_api(func, N))
##    user  system elapsed 
##   0.289   0.000   0.290
system.time(reference(func, N))
##    user  system elapsed 
##   0.216   0.000   0.217

Obviously, there is still some penalty compared to not switching between domains, but the performance gain with respect to the old API is outstanding.

A real-world example

This is a quite heavy simulation of an M/M/1 system using simmer:

library(simmer)

system.time({
  mm1 <- trajectory() %>%
    seize("server", 1) %>%
    timeout(function() rexp(1, 66)) %>%
    release("server", 1)

  env <- simmer() %>%
    add_resource("server", 1) %>%
    add_generator("customer", mm1, function() rexp(50, 60), mon=F) %>%
    run(10000, progress=progress::progress_bar$new()$update)
})

On my system, it takes around 17 seconds with the old API. The new API finishes in less than 5 seconds. As a reference, if we avoid R calls in the timeout activity and precompute all the arrivals instead of defining a dynamic generator, i.e.:

system.time({
  input <- data.frame(
    time = rexp(10000*60, 60),
    service = rexp(10000*60, 66)
  )

  mm1 <- trajectory() %>%
    seize("server", 1) %>%
    timeout_from_attribute("service") %>%
    release("server", 1)

  env <- simmer() %>%
    add_resource("server", 1) %>%
    add_dataframe("customer", mm1, input, mon=F, batch=50) %>%
    run(10000, progress=progress::progress_bar$new()$update)
})

then the simulation takes around 2.5 seconds.

How to start using this feature

First of all, you need R >= 3.5.0 and Rcpp >= 0.12.18 installed. Then, if you are a user, the easiest way to enable this globally is to add CPPFLAGS += -DRCPP_USE_UNWIND_PROTECT to your ~/.R/Makevars. Packages installed or re-installed, as well as functions compiled with Rcpp::sourceCpp and Rcpp::cppFunction, will benefit from these performance gains. If you are a package maintainer, you can add -DRCPP_USE_UNWIND_PROTECT to your package’s PKG_CPPFLAGS in src/Makevars. Alternatively, there is a plugin available, so this flag can be enabled by adding the attribute comment // [[Rcpp::plugins(unwindProtect)]] to one of your source files.
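
Before touching any Makevars file, it may be worth checking the two version requirements from within R. The helper below is a hypothetical convenience function written for this post, not part of Rcpp; it only compares version numbers and does not enable anything.

```r
# Hypothetical helper: report whether the stated prerequisites are met
unwind_ready <- function() {
  r_ok <- getRversion() >= "3.5.0"
  # Rcpp may not be installed at all, so check that first
  rcpp_ok <- requireNamespace("Rcpp", quietly = TRUE) &&
    utils::packageVersion("Rcpp") >= "0.12.18"
  c(R = r_ok, Rcpp = rcpp_ok)
}

ready <- unwind_ready()
ready
```

If both entries are TRUE, the flag or the plugin described above should be usable.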

Note that this is fairly safe according to reverse dependency checks, but there might still be issues in some packages. The sooner we start testing this feature and reporting possible issues, the sooner it will be enabled by default in Rcpp.

simmer 4.0.0

The 4.0.0 release of simmer, the Discrete-Event Simulator for R, is on CRAN under a new license: we decided to switch to GPL >= 2. Most notably in this major release, the C++ core has been refactorised and exposed under inst/include. This is not a big deal for most users, but it enables extensions. As an example of this, simmer.mon is an experimental package that links to simmer and extends its monitoring facilities to provide a new DBI-based backend. Not a very efficient one, but it demonstrates how to extend the simmer core from another package.

Exception handling has been remarkably improved. In previous releases, errors were reported to happen in the run() method, which is… everything that can happen, obviously. In this version, errors are caught and more information is provided, particularly about the simulation time, the arrival and the activity involved:

library(simmer)

bad.traj <- trajectory() %>%
  timeout(function() NA)

simmer() %>%
  add_generator("dummy", bad.traj, at(pi)) %>%
  run()
## Error: 'dummy0' at 3.14 in 'Timeout':
##  missing value (NA or NaN returned)

Another improvement has to do with attributes. These are commonly used to build incremental indices, but some boilerplate was needed to initialise them. Now this is automatic (and configurable):

index.traj <- trajectory() %>%
  set_global("index", 1, mod="+", init=10)

simmer() %>%
  add_generator("dummy", index.traj, at(1:3), mon=2) %>%
  run() %>%
  get_mon_attributes()
##   time name   key value replication
## 1    1      index    11           1
## 2    2      index    12           1
## 3    3      index    13           1

Finally, the log_ activity was created for occasional debugging, but we noticed that simmer users rely on it much more heavily to follow what is happening while they build models. All that output becomes annoying once a model is complete. Therefore, we have implemented simulation-scoped logging levels to turn specific messages on and off on demand:

log.traj <- trajectory() %>%
  log_("This will be always printed") %>% # level=0
  log_("This can be disabled", level=1)

simmer(log_level=1) %>%
  add_generator("dummy", log.traj, at(pi)) %>%
  run() %>% invisible()
## 3.14159: dummy0: This will be always printed
## 3.14159: dummy0: This can be disabled
simmer() %>% # log_level=0
  add_generator("dummy", log.traj, at(pi)) %>%
  run() %>% invisible()
## 3.14159: dummy0: This will be always printed

See below for a comprehensive list of changes.

New features:

  • The C++ core has been refactorised into a header-only library under inst/include (#147 closing #145). Therefore, from now on it is possible to extend the C++ API from another package by listing simmer under the LinkingTo field in the DESCRIPTION file.
  • New generic monitor constructor enables the development of new monitoring backends in other packages (179f656, as part of #147).
  • New simulation-scoped logging levels. The log_ activity has a new argument level which determines whether the message is printed depending on a global log_level defined in the simmer constructor (#152).
  • set_attribute and set_global gain a new argument to automatically initialise new attributes (#157). Useful to update counters and indexes in a single line, without initialisation boilerplate.

Minor changes and fixes:

  • Enhanced exception handling, with more informative error messages (#148).
  • Refactorisation of the printing methods and associated code (#149).
  • Allow empty trajectories in sources and activities with sub-trajectories (#151 closing #150).
  • Enable -DRCPP_PROTECTED_EVAL (Rcpp >= 0.12.17.3), which provides fast evaluation of R expressions by leveraging the new stack unwinding protection API (R >= 3.5.0).
  • Replace backspace usage in vector’s ostream method (2b2f43e).
  • Fix namespace clashes with rlang and purrr (#154).