Tidyverse and data.table, sitting side by side… and then base R walks in

Of course, I’m paraphrasing Dirk’s fifteenth post in the rarely rational R rambling series: #15: Tidyverse and data.table, sitting side by side … (Part 1). I very much liked it, because, although I’m a happy tidyverse user, I’m always trying not to be tied into that verse too much by replicating certain tasks with other tools (and languages) as an exercise. In this article, I’m going to repeat Dirk’s exercise in base R.

First of all, I would like to clean up the tidyverse version a little, because the original was distributed in chunks and was a little bit too verbose. We can also avoid using lubridate, because readr already parses the end_date column as a date (and that’s why it is significantly slower, among other reasons). This is how I would do it:

## Getting the polls

library(tidyverse)
library(zoo)

polls_2016 <- read_tsv(url("http://elections.huffingtonpost.com/pollster/api/v2/questions/16-US-Pres-GE%20TrumpvClinton/poll-responses-clean.tsv"))

## Wrangling the polls

polls_2016 <- polls_2016 %>%
  filter(sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters")) %>%
  right_join(data.frame(end_date = seq.Date(min(.$end_date), max(.$end_date), by="days")), by="end_date")

## Average the polls

rolling_average <- polls_2016 %>%
  group_by(end_date) %>%
  summarise(Clinton = mean(Clinton), Trump = mean(Trump)) %>%
  mutate(Clinton.Margin = Clinton-Trump,
         Clinton.Avg =  rollapply(Clinton.Margin,width=14,
                                  FUN=function(x){mean(x, na.rm=TRUE)},
                                  by=1, partial=TRUE, fill=NA, align="right"))

ggplot(rolling_average) +
  geom_line(aes(x=end_date, y=Clinton.Avg), col="blue") +
  geom_point(aes(x=end_date, y=Clinton.Margin))

which, by the way, has exactly the very same number of lines of code than the data.table version:

## Getting the polls

library(data.table)
library(zoo)
library(ggplot2)

pollsDT <- fread("http://elections.huffingtonpost.com/pollster/api/v2/questions/16-US-Pres-GE%20TrumpvClinton/poll-responses-clean.tsv")

## Wrangling the polls

pollsDT <- pollsDT[sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"), ]
pollsDT[, end_date := as.IDate(end_date)]
pollsDT <- pollsDT[ data.table(end_date = seq(min(pollsDT[,end_date]),
                                              max(pollsDT[,end_date]), by="days")), on="end_date"]

## Average the polls

pollsDT <- pollsDT[, .(Clinton=mean(Clinton), Trump=mean(Trump)), by=end_date]
pollsDT[, Clinton.Margin := Clinton-Trump]
pollsDT[, Clinton.Avg := rollapply(Clinton.Margin, width=14,
                                   FUN=function(x){mean(x, na.rm=TRUE)},
                                   by=1, partial=TRUE, fill=NA, align="right")]

ggplot(pollsDT) +
  geom_line(aes(x=end_date, y=Clinton.Avg), col="blue") +
  geom_point(aes(x=end_date, y=Clinton.Margin))

Let’s translate this into base R. It is easier to start from the data.table version, mainly because filtering and assigning have a similar look and feel. Unsurprisingly, we have base::merge for the merge operation and stats::aggregate for the aggregation phase. base::as.Date works just fine for these dates and utils::read.csv has the only drawback that you have to specify the separator. Without further ado, this is my version in base R:

## Getting the polls

library(zoo)

pollsB <- read.csv(url("http://elections.huffingtonpost.com/pollster/api/v2/questions/16-US-Pres-GE%20TrumpvClinton/poll-responses-clean.tsv"), sep="\t")

## Wrangling the polls

pollsB <- pollsB[pollsB$sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"), ]
pollsB$end_date <- base::as.Date(pollsB$end_date)
endDate <- data.frame(end_date = seq.Date(min(pollsB$end_date), max(pollsB$end_date), by="days"))
pollsB <- merge(pollsB, endDate, by="end_date", all=TRUE)

## Average the polls

pollsB <- aggregate(cbind(Clinton, Trump) ~ end_date, data=pollsB, mean, na.action=na.pass)
pollsB$Clinton.Margin <- pollsB$Clinton - pollsB$Trump
pollsB$Clinton.Avg <- rollapply(pollsB$Clinton.Margin, width=14,
                                FUN=function(x){mean(x, na.rm=TRUE)},
                                by=1, partial=TRUE, fill=NA, align="right")

plot(pollsB$end_date, pollsB$Clinton.Margin, pch=16)
lines(pollsB$end_date, pollsB$Clinton.Avg, col="blue", lwd=2)

which is the shortest one! Finally, let’s repeat the benchmark too:

library(microbenchmark)

url <- "http://elections.huffingtonpost.com/pollster/api/v2/questions/16-US-Pres-GE%20TrumpvClinton/poll-responses-clean.tsv"
file <- "/tmp/poll-responses-clean.tsv"
download.file(url, destfile=file, quiet=TRUE)
res <- microbenchmark(tidy=suppressMessages(readr::read_tsv(file)),
                      dt=data.table::fread(file, showProgress=FALSE),
                      base=read.csv(file, sep="\t"))
res
## Unit: milliseconds
##  expr       min        lq      mean    median        uq        max neval
##  tidy 13.877036 15.127885 18.549393 15.861311 17.813541 202.389391   100
##    dt  4.084022  4.505943  5.152799  4.845193  5.652579   7.736563   100
##  base 29.029366 30.437742 32.518009 31.449916 33.600937  45.104599   100

Base R is clearly the slowest option for the reading phase. Or, one might say, both readr and data.table have done a great job in improving things! Let’s take a look at the processing part now:

tvin <- suppressMessages(readr::read_tsv(file))
dtin <- data.table::fread(file, showProgress=FALSE)
bsin <- read.csv(file, sep="\t")

library(tidyverse)
library(data.table)
library(zoo)

transformTV <- function(polls_2016) {
  polls_2016 <- polls_2016 %>%
    filter(sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters")) %>%
    right_join(data.frame(end_date = seq.Date(min(.$end_date), max(.$end_date), by="days")), by="end_date")
  
  rolling_average <- polls_2016 %>%
    group_by(end_date) %>%
    summarise(Clinton = mean(Clinton), Trump = mean(Trump)) %>%
    mutate(Clinton.Margin = Clinton-Trump,
           Clinton.Avg =  rollapply(Clinton.Margin,width=14,
                                    FUN=function(x){mean(x, na.rm=TRUE)},
                                    by=1, partial=TRUE, fill=NA, align="right"))
}

transformDT <- function(dtin) {
  pollsDT <- copy(dtin) ## extra work to protect from reference semantics for benchmark
  pollsDT <- pollsDT[sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"), ]
  pollsDT[, end_date := as.IDate(end_date)]
  pollsDT <- pollsDT[ data.table(end_date = seq(min(pollsDT[,end_date]),
                                                max(pollsDT[,end_date]), by="days")), on="end_date"]
  
  pollsDT <- pollsDT[, .(Clinton=mean(Clinton), Trump=mean(Trump)), by=end_date]
  pollsDT[, Clinton.Margin := Clinton-Trump]
  pollsDT[, Clinton.Avg := rollapply(Clinton.Margin, width=14,
                                     FUN=function(x){mean(x, na.rm=TRUE)},
                                     by=1, partial=TRUE, fill=NA, align="right")]
}

transformBS <- function(pollsB) {
  pollsB <- pollsB[pollsB$sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"), ]
  pollsB$end_date <- base::as.Date(pollsB$end_date)
  endDate <- data.frame(end_date = seq.Date(min(pollsB$end_date), max(pollsB$end_date), by="days"))
  pollsB <- merge(pollsB, endDate, by="end_date", all=TRUE)
  
  pollsB <- aggregate(cbind(Clinton, Trump) ~ end_date, data=pollsB, mean, na.action=na.pass)
  pollsB$Clinton.Margin <- pollsB$Clinton - pollsB$Trump
  pollsB$Clinton.Avg <- rollapply(pollsB$Clinton.Margin, width=14,
                                  FUN=function(x){mean(x, na.rm=TRUE)},
                                  by=1, partial=TRUE, fill=NA, align="right")
}

res <- microbenchmark(tidy=transformTV(tvin),
                      dt=transformDT(dtin),
                      base=transformBS(bsin))
res
## Unit: milliseconds
##  expr      min       lq     mean   median       uq       max neval
##  tidy 20.68435 22.58603 26.67459 24.56170 27.85844  84.55077   100
##    dt 17.25547 18.88340 21.43256 20.24450 22.26448  41.65252   100
##  base 28.39796 30.93722 34.94262 32.97987 34.98222 109.14005   100

I don’t see so much difference between the tidyverse and data.table as Dirk showed, perhaps because I’ve simplified the script a bit, and removed some redundant parts. Again, base R is the slowest option, but don’t set it aside: it is the shortest one, and it is always there, out of the box!

simmer.bricks 0.1.0: new add-on for simmer

The new package simmer.bricks has found its way to CRAN. The simmer package provides a rich and flexible API to build discrete-event simulations. However, there are certain recurring patterns that are typed over and over again, higher-level tasks which can be conceptualised in concrete activity sequences. This new package is intended to capture this kind of patterns into usable bricks, i.e., methods that can be used as simmer activities, but return an arrangement of activities implementing higher-level tasks.

For instance, consider an entity visiting a resource:

library(simmer)

trajectory("customer") %>%
  seize("clerk") %>%
  timeout(10) %>%
  release("clerk")
## trajectory: customer, 3 activities
## { Activity: Seize        | resource: clerk, amount: 1 }
## { Activity: Timeout      | delay: 10 }
## { Activity: Release      | resource: clerk, amount: 1 }

The simmer.bricks package wraps this pattern into the visit() brick:

library(simmer.bricks)

trajectory("customer") %>%
  visit("clerk", 10)
## trajectory: customer, 3 activities
## { Activity: Seize        | resource: clerk, amount: 1 }
## { Activity: Timeout      | delay: 10 }
## { Activity: Release      | resource: clerk, amount: 1 }

This is a very naive example though. As a more compelling use case, consider a resource that becomes inoperative for some time after each release; i.e., the clerk above needs to do some paperwork after each customer leaves. There are several ways of programming this with simmer. The most compact implementation requires a clone() activity to let a clone hold the resource for some more time while the original entity continues its way. This package encapsulates all this logic in a very easy-to-use brick called delayed_release():

env <- simmer()

customer <- trajectory("customer") %>%
  log_("waiting") %>%
  seize("clerk") %>%
  log_("being attended") %>%
  timeout(10) %>%
  # inoperative for 5 units of time
  delayed_release(env, "clerk", 5) %>%
  log_("leaving")

env %>%
  add_resource("clerk") %>%
  add_generator("customer", customer, at(0, 1)) %>%
  run() %>% invisible
## 0: customer0: waiting
## 0: customer0: being attended
## 1: customer1: waiting
## 10: customer0: leaving
## 15: customer1: being attended
## 25: customer1: leaving

The reference index lists all the available bricks included in this inital release. The examples included in the help page for each method show the equivalence in plain activities. This is very important if you want to mix bricks with rollbacks to produce loops, since the rollback() activity works in terms of the number of activities. For instance, this is what a delayed_release() does behind the scenes:

customer
## trajectory: customer, 11 activities
## { Activity: Log          | message }
## { Activity: Seize        | resource: clerk, amount: 1 }
## { Activity: Log          | message }
## { Activity: Timeout      | delay: 10 }
## { Activity: Clone        | n: 2 }
##   Fork 1, continue,  trajectory: anonymous, 2 activities
##   { Activity: SetCapacity  | resource: clerk, value: 0x55a7c5b524c0 }
##   { Activity: Release      | resource: clerk, amount: 1 }
##   Fork 2, continue,  trajectory: anonymous, 2 activities
##   { Activity: Timeout      | delay: 5 }
##   { Activity: SetCapacity  | resource: clerk, value: 0x55a7c59ddc18 }
## { Activity: Synchronize  | wait: 0 }
## { Activity: Log          | message }

As always, we are more than happy to receive feedback and suggestions, either via the mailing list or via GitHub issues and PRs. If you identify any pattern that you frequently use in your simulations and you think it could become a useful simmer brick, please don’t hesitate to share it!

simmer 3.6.5

The fifth update of the 3.6.x release of simmer, the Discrete-Event Simulator for R, is on CRAN. This release extends the attributes API by allowing users to set/get multiple attributes at once (a pretty straightforward change as well as useful; I don’t know why it didn’t occurred to me before…). Vectors as attributes and other data types are not supported yet, but they are on the roadmap.

This version also fixes some minor bugs (many thanks to the users of the simmer-devel mailing list for taking their simulations to edge cases, where these bugs arise), deprecates the onestep() function and provides the new stepn() instead. Since onestep() serves primarily for debugging purposes, the transition to the new one may go unnoticed. Finally, there is a new vignette about the Dining Philosophers Problem.

New features:

  • set_attribute() (and set_global() by extension) can set multiple attributes at once by providing vectors of keys and values (or functions returning such keys and/or values). get_attribute() (and get_global() by extension) can retrieve multiple keys (#122).
  • New stepn() method deprecates onestep() (e452975).

Minor changes and fixes:

  • Restore ostream after formatting (9ff11f8).
  • Fix arrival cloning to copy attributes over to the clone (#118).
  • Fix self-induced preemption through set_capacity() (#125).
  • Update “Queueing Systems” vignette (a0409a0, 8f03f4f).
  • Update “Advanced Trajectory Usage” vignette (4501927).
  • Fix print methods to return the object invisibly (#128).
  • New “Dining Philosophers Problem” vignette (ff6137e).

Visualising SSH attacks with R

If you have any machine with an SSH server open to the world and you take a look at your logs, you may be alarmed to see so many login attempts from so many unknown IP addresses. DenyHosts is a pretty neat service for Unix-based systems which works in the background reviewing such logs and appending the offending addresses into the hosts.deny file, thus avoiding brute-force attacks.

The following R snippet may be useful to quickly visualise a hosts.deny file with logs from DenyHosts. Such file may have comments (lines starting with #), and actual records are stored in the form <service>: <IP>. Therefore, read.table is more than enough to load it into R. The rgeolocate package is used to geolocate the IPs, and the counts per country are represented in a world map using rworldmap:

library(dplyr)
library(rgeolocate)
library(rworldmap)
hosts.deny <- "/etc/hosts.deny"
db <- system.file("extdata", "GeoLite2-Country.mmdb", package="rgeolocate")
read.table(hosts.deny, col.names=c("service", "IP")) %>%
  pull(IP) %>%
  maxmind(db, fields="country_code") %>%
  count(country_code) %>%
  as.data.frame() %>%
  joinCountryData2Map(joinCode="ISO2", nameJoinColumn="country_code") %>%
  mapCountryData(nameColumnToPlot="n", catMethod="pretty", mapTitle="Attacks per country")
## 74 codes from your data successfully matched countries in the map
## 2 codes from your data failed to match with a country code in the map
## 168 codes from the map weren't represented in your data

Then, you may consider more specific access restrictions based on IP prefixes…

simmer 3.6.4

The fourth update of the 3.6.x release of simmer, the Discrete-Event Simulator for R, is on CRAN. This release patches several bugs regarding resource priority and preemption management when seized amounts greater than one were involved. Check the examples available in the corresponding issues on GitHub (#114#115#116) to know if you are affected.

It can be noted that we already broke the intended bi-monthly release cycle, but it is for a good reason, since we are preparing a journal publication. Further details to come.

Minor changes and fixes:

  • Fix preemption in non-saturated multi-server resources when seizing amounts > 1 (#114).
  • Fix queue priority in non-saturated finite-queue resources when seizing amounts > 1 (#115).
  • Fix resource seizing: avoid jumping the queue when there is room in the server but other arrivals are waiting (#116).