Last week I finished preparing new teaching material for a workshop on R optimization techniques. Some of the examples were motivated by benchmarking tests I ran on data manipulation methods with base vs dplyr vs data.table. Not surprisingly, data.table was generally the most efficient in terms of speed when running the microbenchmark tests.
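As an aside, a three-way comparison behind that claim might look roughly like the sketch below. This is not the code from the workshop, just an illustration, and it assumes the data.table package is installed in addition to the packages loaded later in this post:
# hypothetical three-way comparison: base vs dplyr vs data.table
library(microbenchmark)
library(dplyr)
library(data.table)
library(nycflights13)
flights_dt <- as.data.table(flights)
microbenchmark(
  base       = flights[which(flights$dep_delay > 0), ],
  dplyr      = filter(flights, dep_delay > 0),
  data.table = flights_dt[dep_delay > 0],
  times = 10
)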
One oddball result did creep up … When comparing dplyr::filter() versus a base method (using bracket indices and which()), I noticed that the former was quite a bit slower for a data frame with 5000 rows. But that result reversed as the size of the data frame increased … in other words, filter() was actually much faster for a larger dataset.
First we can load the packages we’ll be using:
library(microbenchmark)
library(dplyr)
library(ggplot2)
library(nycflights13)
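One note before sampling: sample_n() draws a random sample, so the exact timings below will differ between runs. Setting a seed (not part of the original code; the value 42 is arbitrary) makes the samples reproducible:
# make the random samples reproducible (seed value is arbitrary)
set.seed(42)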
Next, create a sample of n = 5000 rows from the flights dataset in nycflights13:
flights_sample <- sample_n(flights, 5000)
Then run the benchmark on the sample of n = 5000 rows:
# run benchmark on n = 5000
microbenchmark(
  bracket = flights_sample[which(flights_sample$dep_delay > 0),],
  filter = filter(flights_sample, dep_delay > 0),
  times = 10
)
## Unit: microseconds
## expr min lq mean median uq max neval
## bracket 223.183 246.329 310.3268 290.7345 315.310 632.726 10
## filter 265.342 352.911 387.6371 371.1365 382.088 673.066 10
Now run the same benchmark on the full dataset (n = 336776 rows):
microbenchmark(
  bracket = flights[which(flights$dep_delay > 0),],
  filter = filter(flights, dep_delay > 0),
  times = 10
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## bracket 8.844108 9.041099 12.35393 12.58248 13.52227 20.33769 10
## filter 8.001824 8.159017 12.65863 12.27404 13.27080 21.73494 10
My guess is that there’s some overhead to using filter() … but this clearly pays off eventually. So how big (i.e. how many rows) does a data frame need to be for filter() to be faster?
# create a sequence of sample sizes, from 5000 rows up to the full dataset, in steps of 20000
sample_seq <- seq(5000, nrow(flights), by = 20000)
# set up an empty data frame that will store the results
mbm <- data_frame()
# loop through the sample sizes and run the microbenchmark in ms
for(n in sample_seq) {
  flights_sample <- sample_n(flights, n)
  tmpmbm <-
    microbenchmark(
      bracket = flights_sample[which(flights_sample$dep_delay > 0),],
      filter = filter(flights_sample, dep_delay > 0),
      times = 10,
      unit = "ms"
    )
  mbm <- rbind(mbm, summary(tmpmbm))
}
# add the sample n as a column for plotting
mbm$n <- rep(sample_seq, each = 2)
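As an aside, the same results table could be built without the explicit for loop, for example with purrr. This is just a sketch (it assumes purrr is installed; the loop above is what the post actually uses):
# loop-free alternative: map over the sample sizes and row-bind the summaries
library(purrr)
mbm <- map_dfr(sample_seq, function(n) {
  flights_sample <- sample_n(flights, n)
  res <- summary(microbenchmark(
    bracket = flights_sample[which(flights_sample$dep_delay > 0),],
    filter = filter(flights_sample, dep_delay > 0),
    times = 10,
    unit = "ms"
  ))
  res$n <- n  # keep the sample size alongside the timings
  res
})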
The plot below shows the threshold (number of rows on the y axis) at which filter() is faster (ratio < 1 on the x axis). It also demonstrates that the absolute amount of time starts leveling off as the datasets get bigger.
# compute the ratio of filter() time to bracket/which() time at each sample size,
# then plot the ratio against the number of rows (point size = absolute filter() time)
mbm %>%
  select(method = expr,
         nrows = n,
         time = median) %>%
  group_by(nrows) %>%
  mutate(ratio = time / lag(time)) %>%  # bracket comes first, so ratio = filter / bracket
  filter(!is.na(ratio)) %>%
  ggplot(aes(nrows, ratio, size = time)) +
  geom_point() +
  geom_hline(yintercept = 1, col = "firebrick", lwd = 2) +
  scale_x_reverse(breaks = sample_seq) +
  xlab("number of rows") +
  ylab("ratio of time\ndplyr::filter() / base") +
  coord_flip() +
  theme_minimal() +
  guides(size = guide_legend(title = "time\n(milliseconds)"))
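To put a rough number on that crossover (not computed in the original post), we can pull out the smallest sample size at which the median filter() time drops below the median bracket/which() time. The exact value will vary from run to run and machine to machine:
# smallest sample size where filter() beats bracket/which() on median time
mbm %>%
  select(method = expr, nrows = n, time = median) %>%
  group_by(nrows) %>%
  summarise(ratio = time[method == "filter"] / time[method == "bracket"]) %>%
  filter(ratio < 1) %>%
  slice(1)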