Vroom Benchmarks

This is a revision of vroom’s official benchmarks, found at https://vroom.r-lib.org/articles/benchmarks.html. They have been modified to include several Apache Arrow-based projects for comparison. Additions to the original document’s text are indicated in bold.

Reading delimited files

The following benchmarks all measure reading delimited files of various sizes and data types. Because vroom reads lazily, the benchmarks also do some manipulation of the data afterwards to provide a more realistic performance comparison.

Because the read.delim results are so much slower than the others, they are excluded from the plots but retained in the tables.

Taxi Trip Dataset

This real-world dataset is from the Freedom of Information Law (FOIL) Taxi Trip Data released by the NYC Taxi and Limousine Commission in 2013, originally posted at http://chriswhong.com/open-data/foil_nyc_taxi/. It is also hosted on archive.org.

The first table, trip_fare_1.csv, is 1.55 GB in size.
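
The column summary below can be reproduced with a quick read-and-glimpse, sketched here under the assumption that the file sits in the working directory:

library(vroom)

# read the first fare file and print a column-by-column summary
x <- vroom("trip_fare_1.csv")
dplyr::glimpse(x)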

#> Observations: 14,776,615
#> Variables: 11
#> $ medallion       <chr> "89D227B655E5C82AECF13C3F540D4CF4", "0BD7C8F5B...
#> $ hack_license    <chr> "BA96DE419E711691B9445D6A6307C170", "9FD8F69F0...
#> $ vendor_id       <chr> "CMT", "CMT", "CMT", "CMT", "CMT", "CMT", "CMT...
#> $ pickup_datetime <chr> "2013-01-01 15:11:48", "2013-01-06 00:18:35", ...
#> $ payment_type    <chr> "CSH", "CSH", "CSH", "CSH", "CSH", "CSH", "CSH...
#> $ fare_amount     <dbl> 6.5, 6.0, 5.5, 5.0, 9.5, 9.5, 6.0, 34.0, 5.5, ...
#> $ surcharge       <dbl> 0.0, 0.5, 1.0, 0.5, 0.5, 0.0, 0.0, 0.0, 1.0, 0...
#> $ mta_tax         <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0...
#> $ tip_amount      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
#> $ tolls_amount    <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.8, 0.0, 0...
#> $ total_amount    <dbl> 7.0, 7.0, 7.0, 6.0, 10.5, 10.0, 6.5, 39.3, 7.0...

Taxi Benchmarks

code: bench/taxi

The original benchmarks were run on an Amazon EC2 m5.4xlarge instance with 16 vCPUs and an EBS volume.

In order to test the GPU-based cudf library, the benchmarks here were instead run on an NVIDIA DGX workstation with the following specifications:

CPU:

  • Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
  • 256 GB RAM (2400 MHz), SSD

GPU:

  • 4 NVIDIA V100 GPUs, 5,120 CUDA cores each
  • 32 GB memory per GPU

R and all necessary system dependencies were installed via conda. Benchmarks were run with development versions of arrow and cudf and released versions of all other packages.

The benchmarks labeled vroom_base use vroom to read the file and base functions to manipulate; vroom_dplyr uses vroom to read and dplyr functions to manipulate; data.table uses fread() to read and data.table functions to manipulate; and readr uses readr to read and dplyr to manipulate. By default vroom only uses Altrep for character vectors; these benchmarks are labeled vroom(altrep: normal). The benchmarks labeled vroom(altrep: full) instead use Altrep vectors for all supported types, and vroom(altrep: none) disables Altrep entirely.
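
As a rough sketch, the three vroom variants correspond to different values of vroom()'s altrep argument (the argument name and accepted values follow the vroom 1.2 documentation; treat the exact spellings as an assumption):

library(vroom)

# altrep: normal -- the default; lazy Altrep vectors for character columns only
x <- vroom("trip_fare_1.csv", altrep = "chr")

# altrep: full -- lazy Altrep vectors for all supported column types
x <- vroom("trip_fare_1.csv", altrep = TRUE)

# altrep: none -- disable Altrep entirely and parse everything eagerly
x <- vroom("trip_fare_1.csv", altrep = FALSE)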

The arrow benchmarks are divided into three types: (1) the data is read into an R data.frame using arrow and all subsequent processing happens with dplyr; (2) the same, but using data.table to aggregate in R; (3) the data is held in an Arrow Table and computed on in Arrow to the extent possible (as of this writing, selecting and filtering happen in Arrow, but aggregation requires pulling a window of data into an R data.frame).
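
A condensed sketch of how the three variants read the file (read_csv_arrow() and its as_data_frame argument are from the arrow R package; the degree of dplyr pushdown reflects the state of arrow at the time of writing):

library(arrow)

# variants (1) and (2): materialize a full R data.frame, then manipulate
# with dplyr or data.table as usual
df <- read_csv_arrow("trip_fare_1.csv")

# variant (3): keep the data in an Arrow Table; select/filter can stay in
# Arrow, while aggregation still pulls a window of data into R
tab <- read_csv_arrow("trip_fare_1.csv", as_data_frame = FALSE)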

cudf is a Python package, here called from R using reticulate, and results are pulled into R using the arrow R package’s methods.
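
Roughly, the cudf calls look like the following (a sketch assuming a working reticulate and cudf installation; cudf.read_csv is the Python API):

library(reticulate)

# import the Python cudf package and parse the CSV on the GPU;
# the result is a cudf DataFrame held in GPU memory
cudf <- import("cudf")
gdf <- cudf$read_csv("trip_fare_1.csv")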

The following operations are performed; a dplyr sketch of the filter and aggregate steps follows the list.

  • The data is read
  • print() - N.B. read.delim uses print(head(x, 10)) because printing the whole dataset takes > 10 minutes
  • head()
  • tail()
  • Sampling 100 random rows
  • Filtering for “UNK” payment type, which matches 6,434 rows (0.0435% of the total).
  • Aggregation of mean fare amount per payment type.
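
For example, the filter and aggregate steps correspond roughly to this dplyr pipeline (a sketch, where x stands for the data as read by each package):

library(dplyr)

# filter: trips with the "UNK" payment type
unk <- filter(x, payment_type == "UNK")

# aggregate: mean fare amount per payment type
x %>%
  group_by(payment_type) %>%
  summarise(avg_fare = mean(fare_amount))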

reading package  manipulating package  altrep  memory  read      print  head   tail   sample  filter  aggregate  total
read.delim       base                          13.1GB  1m 22.6s  7ms    1ms    1ms    1ms     1.2s    895ms      1m 24.8s
readr            dplyr                         13.1GB  35.5s     95ms   1ms    1ms    10ms    217ms   498ms      36.3s
vroom            dplyr                 FALSE   12.7GB  17.6s     103ms  1ms    1ms    9ms     946ms   1.2s       19.9s
data.table       data.table                    13.7GB  19.4s     16ms   1ms    1ms    1ms     145ms   226ms      19.8s
vroom            base                  TRUE    14.3GB  676ms     118ms  1ms    1ms    1ms     1.5s    10.6s      12.9s
arrow            dplyr                         25.4GB  6.6s      433ms  657ms  231ms  179ms   382ms   598ms      9.1s
arrow            data.table                    26.7GB  6.5s      14ms   1ms    1ms    1ms     120ms   778ms      7.4s
vroom            dplyr                 TRUE    14.3GB  694ms     127ms  1ms    1ms    10ms    1.7s    4.6s       7.1s
cudf             cudf                          30.4GB  2.8s      229ms  4ms    4ms    7ms     1.1s    130ms      4.2s
arrow            arrow                         24.9GB  723ms     45ms   1ms    2ms    610ms   309ms   1.4s       3.1s

(N.B. Rcpp used in the dplyr implementation fully materializes all the Altrep numeric vectors when using filter() or sample_n(), which is why the first of these cases has additional overhead when using full Altrep.)

All numeric data

All numeric data is really a worst-case scenario for vroom. The index takes about as much memory as the parsed data, and because parsing doubles can be done quickly in parallel and text representations of doubles are at most ~25 characters, there isn't a great deal of savings from delayed parsing.

For these reasons (and because the data.table implementation is very fast), vroom is a bit slower than fread for pure numeric data.

However, because vroom is multi-threaded, it is a bit quicker than readr and read.delim for this type of data.
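
Inputs like these can be simulated with vroom's data generator (a sketch; gen_tbl() is vroom's helper for generating benchmark tables, and the dimensions here are illustrative rather than the exact ones used):

library(vroom)

# a long table of doubles: 10 million rows of 25 double ("d") columns
long <- gen_tbl(1e7, 25, col_types = strrep("d", 25))
vroom_write(long, "all_numeric-long.tsv")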

Long

code: bench/all_numeric-long

reading package  manipulating package  altrep  memory   read      print  head  tail  sample  filter  aggregate  total
read.delim       base                          10.81GB  1m 42.2s  1.3s   1ms   1ms   1ms     4s      44ms       1m 47.5s
readr            dplyr                         8.9GB    15s       117ms  1ms   1ms   10ms    18ms    92ms       15.3s
cudf             cudf                          25.33GB  2.1s      271ms  5ms   4ms   9ms     29ms    35ms       2.4s
vroom            base                  FALSE   8.76GB   1s        127ms  1ms   1ms   3ms     12ms    109ms      1.3s
vroom            dplyr                 FALSE   8.77GB   950ms     138ms  1ms   1ms   11ms    22ms    58ms       1.2s
vroom            dplyr                 TRUE    9.29GB   192ms     234ms  1ms   1ms   12ms    49ms    270ms      756ms
vroom            base                  TRUE    9.05GB   271ms     172ms  1ms   1ms   3ms     37ms    269ms      750ms
arrow            dplyr                         12.89GB  539ms     111ms  1ms   1ms   11ms    18ms    61ms       739ms
arrow            arrow                         12.89GB  345ms     71ms   1ms   2ms   67ms    55ms    158ms      697ms
data.table       data.table                    9.92GB   169ms     14ms   1ms   1ms   3ms     9ms     26ms       220ms

Wide

code: bench/all_numeric-wide

reading package  manipulating package  altrep  memory  read      print  head  tail   sample  filter  aggregate  total
read.delim       base                          20.5GB  7m 49.1s  174ms  9ms   9ms    10ms    90ms    6ms        7m 49.4s
readr            dplyr                         11.4GB  1m 5.3s   163ms  3ms   3ms    31ms    16ms    41ms       1m 5.6s
arrow            dplyr                         18.3GB  12s       155ms  2ms   2ms    20ms    16ms    50ms       12.2s
arrow            arrow                         17.6GB  3.7s      422ms  3ms   146ms  1.6s    4s      128ms      10s
vroom            dplyr                 FALSE   11.3GB  6.5s      191ms  2ms   3ms    27ms    18ms    48ms       6.7s
vroom            base                  FALSE   11.3GB  6.4s      216ms  3ms   3ms    5ms     6ms     8ms        6.6s
cudf             cudf                          29.2GB  3.8s      388ms  39ms  38ms   57ms    89ms    27ms       4.5s
data.table       data.table                    12.7GB  960ms     143ms  7ms   7ms    8ms     8ms     5ms        1.1s
vroom            dplyr                 TRUE    13.3GB  642ms     199ms  4ms   4ms    20ms    23ms    70ms       958ms
vroom            base                  TRUE    13.2GB  663ms     232ms  4ms   4ms    5ms     13ms    36ms       954ms

All character data

All character data is a best-case scenario for vroom when using Altrep, as it takes full advantage of the lazy reading.

Long

code: bench/all_character-long

reading package  manipulating package  altrep  memory   read      print  head  tail  sample  filter  aggregate  total
read.delim       base                          10.49GB  1m 29.8s  9ms    1ms   1ms   2ms     30ms    434ms      1m 30.3s
readr            dplyr                         10.29GB  53.6s     107ms  1ms   1ms   11ms    16ms    371ms      54.1s
vroom            dplyr                 FALSE   10.26GB  45.7s     107ms  1ms   1ms   10ms    16ms    309ms      46.1s
data.table       data.table                    11.47GB  36s       18ms   1ms   1ms   4ms     18ms    290ms      36.4s
arrow            dplyr                         24.77GB  33.1s     110ms  1ms   1ms   11ms    16ms    310ms      33.5s
cudf             cudf                          25.75GB  2.2s      306ms  6ms   5ms   12ms    100ms   53ms       2.7s
vroom            base                  TRUE    8.99GB   278ms     126ms  1ms   1ms   2ms     161ms   2.1s       2.6s
arrow            arrow                         23.18GB  241ms     70ms   1ms   2ms   112ms   64ms    1.3s       1.8s
vroom            dplyr                 TRUE    8.97GB   179ms     150ms  1ms   1ms   9ms     172ms   1s         1.6s

Wide

code: bench/all_character-wide

reading package  manipulating package  altrep  memory  read      print  head  tail   sample  filter  aggregate  total
read.delim       base                          19.1GB  7m 55.5s  192ms  9ms   9ms    24ms    193ms   62ms       7m 56s
readr            dplyr                         18.3GB  4m 19.6s  159ms  2ms   3ms    23ms    31ms    70ms       4m 19.8s
vroom            dplyr                 FALSE   17.8GB  3m 21.4s  160ms  2ms   3ms    23ms    31ms    56ms       3m 21.7s
arrow            dplyr                         25.9GB  2m 27.5s  155ms  2ms   3ms    24ms    31ms    57ms       2m 27.8s
data.table       data.table                    19.2GB  2m 24.5s  200ms  1ms   1ms    25ms    132ms   28ms       2m 24.9s
arrow            arrow                         18.6GB  6.9s      479ms  3ms   123ms  2.5s    4.7s    166ms      14.9s
cudf             cudf                          28.6GB  6.6s      573ms  75ms  76ms   200ms   241ms   33ms       7.8s
vroom            base                  TRUE    12.6GB  647ms     175ms  4ms   4ms    5ms     44ms    180ms      1.1s
vroom            dplyr                 TRUE    12.6GB  559ms     154ms  4ms   4ms    19ms    55ms    115ms      907ms

Reading multiple delimited files

code: bench/taxi_multiple

The benchmark reads all 12 files in the taxi trip fare data, totaling 173,179,759 rows and 11 columns, for a total file size of 18.4 GB.
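
vroom handles this case in a single call, since it accepts a vector of paths and reads them as one data frame (a sketch; the file names are illustrative):

library(vroom)

# read all 12 monthly fare files into one data frame
files <- sprintf("trip_fare_%d.csv", 1:12)
x <- vroom(files)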

reading package  manipulating package  altrep  memory  read      print  head  tail   sample    filter  aggregate  total
readr            dplyr                         63.8GB  7m 56.6s  93ms   1ms   1ms    9ms       3.8s    12.7s      8m 13.1s
data.table       data.table                    66.1GB  5m 6.3s   7ms    1ms   1ms    1ms       1.1s    11s        5m 18.4s
vroom            dplyr                 FALSE   62.4GB  3m 47s    1.7s   1ms   1ms    9ms       10.8s   6.6s       4m 6.2s
arrow            arrow                         96.2GB  72ms      36ms   15ms  10.5s  2m 13.5s  10.3s   18.2s      2m 52.5s
vroom            base                  TRUE    83GB    5.7s      2.5s   1ms   1ms    1ms       21.5s   2m 19.2s   2m 48.9s
vroom            dplyr                 TRUE    82.7GB  5.6s      2.4s   1ms   1ms    8ms       24.2s   58.5s      1m 30.8s

Reading fixed width files

Arrow is not included in these benchmarks because it does not currently support reading fixed-width files.

United States Census 5-Percent Public Use Microdata Sample files

This fixed-width dataset contains individual records of the characteristics of a 5 percent sample of people and housing units from the year 2000 and is freely available at https://www2.census.gov/census_2000/datasets/PUMS/FivePercent/California/all_California.zip. The data is split into files by state; the state of California was used in this benchmark.

The data totals 2,342,339 rows and 37 columns, with a total file size of 677 MB.
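
For reference, reading a fixed-width file with vroom means supplying the column positions explicitly (a sketch; the file name, widths, and column names are placeholders, not the real PUMS layout):

library(vroom)

# vroom_fwf() takes explicit column positions; fwf_widths() builds them
# from a vector of field widths and column names
x <- vroom_fwf(
  "all_California.dat",
  col_positions = fwf_widths(c(2, 7, 4), c("record", "serial", "year"))
)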

Census data benchmarks

code: bench/fwf

reading package  manipulating package  altrep  memory   read       print  head  tail  sample  filter  aggregate  total
read.delim       base                          13.96GB  17m 33.8s  20ms   1ms   2ms   3ms     347ms   93ms       17m 34.3s
readr            dplyr                         11.69GB  30s        124ms  1ms   1ms   14ms    79ms    86ms       30.3s
vroom            dplyr                 FALSE   11.4GB   13.1s      122ms  1ms   1ms   14ms    419ms   84ms       13.7s
vroom            base                  TRUE    9.37GB   149ms      159ms  1ms   1ms   4ms     224ms   1.6s       2.1s
vroom            dplyr                 TRUE    10.14GB  153ms      158ms  1ms   1ms   14ms    237ms   940ms      1.5s

Writing delimited files

Arrow is not included in these benchmarks because it currently only writes the Feather and Parquet formats, not CSV.

code: bench/taxi_writing

The benchmarks write out the taxi trip dataset in a few different ways.
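
A sketch of how the variants can be produced with vroom (assuming vroom_write() accepts connection objects as readr's writers do, and compresses by file extension; the external pigz and zstd commands must be installed):

library(vroom)

# uncompressed tab-separated output
vroom_write(x, "trip_fare.tsv")

# gzip: writing to a .gz path compresses with a single thread
vroom_write(x, "trip_fare.tsv.gz")

# multithreaded gzip and zstandard via external compressors on a pipe
vroom_write(x, pipe("pigz > trip_fare.tsv.gz"))
vroom_write(x, pipe("zstd > trip_fare.tsv.zst"))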

compression         base      data.table  readr     vroom
gzip                4m 18.8s  1m 18.5s    2m 47.4s  1m 27.5s
multithreaded_gzip  2m 0.8s   6.2s        1m 22.8s  6.7s
zstandard           2m 6.2s   NA          1m 22.9s  10s
uncompressed        2m 9.5s   9.1s        1m 30.8s  9.2s

Session and package information

package     version  date        source
base        4.0.2    2020-07-21  local
data.table  1.12.8   2019-12-09  CRAN (R 4.0.0)
dplyr       1.0.0    2020-05-29  CRAN (R 4.0.0)
readr       1.3.1    2018-12-21  CRAN (R 4.0.0)
vroom       1.2.1    2020-05-12  CRAN (R 4.0.0)