Understanding the Performance of Binary Search and Vector Scan in data.table
In this article, we explore the performance of binary search and vector scan subsetting on a data.table object. The question posed by the original poster asks why the “vector scan way” is slower than the native binary search method.
Introduction
The data.table package provides an efficient data structure for storing and manipulating large datasets in R. One of its key features is fast subsetting, either by a plain vector scan or by binary search on keyed columns.
Background on Binary Search and Vector Scan
In data.table, binary search is used when the table has been keyed with setkey(). Keying physically reorders the data by the specified columns, so a subset such as DT[.("a", 5L)] can locate the matching rows by binary search instead of touching every row.
On the other hand, the “vector scan way” uses the == operator to compare every value in a column against the value being searched for, producing a full-length logical vector that is then used to subset. This approach is often more readable, but taken literally it has to scan the entire column.
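As a minimal illustration (a toy table, not part of the benchmark below), the two styles look like this:
library(data.table)
# a small keyed table used only to show the two subsetting styles
toy <- data.table(x = c("a", "a", "b"), y = c(1L, 5L, 5L), val = 1:3)
setkey(toy, x, y)
# binary search: join on the key columns with the .(...) list syntax
toy[.("a", 5L)]
# vector scan: compare every row with == and subset with the logical result
toy[x == "a" & y == 5L]
Both calls return the same row; the difference lies in how the matching rows are found.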
Benchmarking Binary Search and Vector Scan
To compare the performance of these two approaches, we will use the microbenchmark package to run repeated measurements of both functions on a large dataset.
First, let’s create a sample dataset with 10 million rows:
library(data.table)
set.seed(2L)
N = 1e7L
DT = data.table(x = sample(letters, N, TRUE),
                y = sample(1000L, N, TRUE),
                val = runif(N))
setkey(DT, x, y)  # sort DT by x, then y, enabling binary search on these columns
Next, we will define the two functions for binary search and vector scan:
SUBSET1 <- function(){
  # binary search: join on the key columns (x, y)
  a <- DT[.(c("a"), c(5L)), .N, nomatch = NULL]
}
SUBSET2 <- function(){
  # vector scan: logical comparison against every row
  a <- DT[x == "a" & y == 5L, .N, nomatch = NULL]
}
Now, we will run the benchmark:
microbenchmark::microbenchmark(
  times = 500,
  SUBSET1(),
  SUBSET2()
)
# Unit: milliseconds
#      expr    min     lq   mean median     uq      max neval
# SUBSET1() 1.0328 1.2779 1.8784 1.5337 1.8924  20.5789   500
# SUBSET2() 2.4896 3.0667 4.4769 3.5268 4.3682 179.1607   500
Understanding the Results
As we can see, SUBSET1 is roughly twice as fast as SUBSET2 (median ~1.5 ms versus ~3.5 ms). Even so, SUBSET2 is far faster than a literal scan of 10 million rows would suggest.
The reason is that data.table internally optimizes the == expression in i: because the columns involved are keyed (or indexed), the query is converted into the equivalent binary search form. That conversion carries some overhead, which accounts for the extra time in SUBSET2, but it lets data.table reuse its indexing infrastructure instead of scanning every row.
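One way to observe this optimization is to switch on data.table’s verbose output before running the vector scan form (a small sketch; the exact wording of the messages varies across data.table versions):
options(datatable.verbose = TRUE)   # print data.table's internal optimization notes
DT[x == "a" & y == 5L, .N]          # run the vector-scan form on the keyed table
options(datatable.verbose = FALSE)  # switch the messages back off
The verbose output reports how the i expression was handled, which makes the translation to a keyed or indexed lookup visible.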
Conclusion
In conclusion, while the “vector scan way” may be more readable than the keyed binary-search syntax, it remains slower because of the overhead of translating it into the binary search form. That same translation, however, is what lets data.table leverage its indexing infrastructure and stay far faster than a literal row-by-row comparison.
As always, the choice between these two approaches depends on the specific use case and the trade-offs you are willing to make between readability and performance.
Additional Considerations
One additional consideration is the datatable.auto.index option, which is TRUE by default. With auto-indexing enabled, the first time a column is used with == or %in% in i, data.table builds a secondary index on it, and subsequent subsets reuse that index. Setting the option to FALSE prevents indices from being created automatically.
To get a more accurate comparison between binary search and a genuine vector scan, we can set this option to FALSE and remove any index that already exists on the table:
options(datatable.auto.index = FALSE)  # do not build secondary indices automatically
setindex(DT, NULL)                     # drop any indices already created on DT
This removes the influence of automatic indexing, so the two approaches can be compared more fairly.
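A short sketch of re-running the comparison under these settings (assuming DT, SUBSET1() and SUBSET2() are defined as above; no timings are reproduced here):
# re-run the comparison; with auto-indexing disabled and indices dropped,
# the == form can no longer rely on a secondary index
microbenchmark::microbenchmark(
  times = 100,
  SUBSET1(),
  SUBSET2()
)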
References
- The official data.table vignette on secondary indices.
- The data.table documentation for the setkey() and setindex() functions.
Last modified on 2025-02-06