Read the docs before questioning R’s defaults

The latest R tip in Win-Vector Blog encourages you to Use Radix Sort based on a simple benchmark showing a x35 speedup compared to the default method, but with no further explanation. In my opinion, though, the complete tip would be, instead, use radix sort… if you know what you are doing, because a quick benchmark shouldn’t spare you the effort of actually reading the docs. And here is a spoiler: you are already using it.

One may wonder why R’s default sorting algorithm is so bad, and why was even chosen. The thing is that there is a trick here, and to understand it, first we must understand the benchmark’s data and then read the docs. This is the function from the original code (slightly modified for subsequent reuse) that generates the data:

mk_data <- function(nrow, stringsAsFactors = FALSE) {
  alphabet <- paste("sym", seq_len(max(2, floor(nrow^(1/3)))), sep = "_")
  data.frame(col_a = sample(alphabet, nrow, replace=TRUE),
             col_b = sample(alphabet, nrow, replace=TRUE),
             col_c = sample(alphabet, nrow, replace=TRUE),
             col_x = runif(nrow),
             stringsAsFactors = stringsAsFactors)

d <- mk_data(1e+6)

##     col_a              col_b              col_c          
##  Length:1000000     Length:1000000     Length:1000000    
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##      col_x          
##  Min.   :0.0000002  
##  1st Qu.:0.2496717  
##  Median :0.4991010  
##  Mean   :0.4996031  
##  3rd Qu.:0.7494089  
##  Max.   :0.9999999
## [1] 99

There are three character columns sampled from 99 symbols (sym_1sym_2, …, sym_99) and a numeric column sampled from a uniform. The first three columns are thus clearly factors, but they are not treated as such. Let’s see now what help(sort) has to tell us about the sorting method, which by default is method="auto":

The “auto” method selects “radix” for short (less than 2^31 elements) numeric vectors, integer vectors, logical vectors and factors; otherwise, “shell”.

So, as I said in the opening paragraph, you are already using radix sort, except for characters. Let’s see then what happens if we treat such columns as proper factors:


d <- mk_data(1e+6, stringsAsFactors = TRUE)

timings <- microbenchmark(
  order_default = d[order(d$col_a, d$col_b, d$col_c, d$col_x), , 
                    drop = FALSE],
  order_radix = d[order(d$col_a, d$col_b, d$col_c, d$col_x,
                        method = "radix"), ,
                  drop = FALSE],
  times = 10L)

## Unit: milliseconds
##           expr      min       lq     mean   median       uq      max neval
##  order_default 289.4685 312.0257 388.5259 387.8308 418.2673 584.4771    10
##    order_radix 265.6491 321.8337 421.2072 376.1166 512.0047 667.0545    10
##  cld
##    a
##    a

Unsurprisingly, timings are the same, because R automatically selects "radix" for you when appropriate. But when is it considered appropriate and why isn’t it appropriate in general for character vectors? We should go back to the docs:

The implementation is orders of magnitude faster than shell sort for character vectors, in part thanks to clever use of the internal CHARSXP table.

However, there are some caveats with the radix sort:

  • If x is a character vector, all elements must share the same encoding. Only UTF-8 (including ASCII) and Latin-1 encodings are supported. Collation always follows the “C” locale.
  • Long vectors (with more than 2^32 elements) and complex vectors are not supported yet.

An there it is: R is doing the right thing for you for the general case. So let us round up the tip: enforce method="radix" for character vectors if you know what you are doing. And, please, do read the docs.

Publicada en R

5 comentarios sobre “Read the docs before questioning R’s defaults

  1. John Mount of Win-Vector LLC here. Mostly I agree with the above analysis. The default is the safer setting (which is something I value). In my applications I usually use string sorting to bring identical keysets together, so I don’t care that much about order details. Other users may need more detailed control of collation.

    Also, I (for some technical reasons) tend to use character types instead of factors many places (which is not that bad due to R‘s use of CHARSXP string interning).

  2. Thanks for the comment, John. Indeed, character vs. factor is one of these endless R discussions. I appreciate the fact that factors are basically integers. This an advantage for many operations (such as this one), but at the same time this led to some… peculiar, let’s say, decisions in the design of some base methods. I myself use characters instead many times due to technical reasons, as you said.

    Anyway, I just wanted to provide some more background to your otherwise good tip, because, as you may know, many people just don’t read the docs. :)

  3. And thanks for your correction. «R Tip: Consider radix Sort» would have been a better article. I am going to look into changing the title.

Comentarios cerrados.