This is a revision of vroom’s official benchmarks, found at https://vroom.r-lib.org/articles/benchmarks.html. They have been modified to include several Apache Arrow-based projects for comparison. Additions to the original document’s text are indicated in bold.
The following benchmarks all measure reading delimited files of various sizes and data types. Because vroom delays reading, the benchmarks also do some manipulation of the data afterwards to provide a more realistic performance comparison.
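To make the effect of delayed reading concrete, here is a minimal Python sketch of the general idea: index where each field sits, and parse a value only when it is first accessed. It is not vroom's implementation; `LazyColumn` and `lazy_read` are hypothetical names, and the sketch assumes plain comma-separated text without quoting.

```python
class LazyColumn:
    """Stores field offsets and parses a value only when it is accessed."""

    def __init__(self, text, spans, parser):
        self._text = text
        self._spans = spans      # list of (start, end) character offsets
        self._parser = parser
        self._cache = {}         # row index -> parsed value
        self.parsed = 0          # number of fields parsed so far

    def __len__(self):
        return len(self._spans)

    def __getitem__(self, i):
        if i not in self._cache:
            start, end = self._spans[i]
            self._cache[i] = self._parser(self._text[start:end])
            self.parsed += 1
        return self._cache[i]


def lazy_read(text, column, parser=float, sep=","):
    """Index one column of delimited text without parsing its values."""
    lines = text.splitlines(keepends=True)
    col = lines[0].rstrip("\n").split(sep).index(column)
    spans = []
    pos = len(lines[0])          # offset of the first data row
    for line in lines[1:]:
        start = pos
        for k, field in enumerate(line.rstrip("\n").split(sep)):
            if k == col:
                spans.append((start, start + len(field)))
                break
            start += len(field) + len(sep)
        pos += len(line)
    return LazyColumn(text, spans, parser)


data = "fare_amount,tip_amount\n6.5,0\n6.0,1.5\n34.0,4.8\n"
fares = lazy_read(data, "fare_amount")
print(len(fares), fares.parsed)  # length is known, nothing parsed yet
print(fares[2], fares.parsed)    # one value parsed on demand
```

With this design the "read" step is cheap and the parsing cost moves into whichever later operation touches the values, which is why timing the read alone would flatter a lazy reader.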
Because the read.delim results are so much slower than the others they are excluded from the plots, but are retained in the tables.
This real world dataset is from Freedom of Information Law (FOIL) Taxi Trip Data from the NYC Taxi and Limousine Commission 2013, originally posted at http://chriswhong.com/open-data/foil_nyc_taxi/. It is also hosted on archive.org.
The first table trip_fare_1.csv is 1.55G in size.
#> Observations: 14,776,615
#> Variables: 11
#> $ medallion <chr> "89D227B655E5C82AECF13C3F540D4CF4", "0BD7C8F5B...
#> $ hack_license <chr> "BA96DE419E711691B9445D6A6307C170", "9FD8F69F0...
#> $ vendor_id <chr> "CMT", "CMT", "CMT", "CMT", "CMT", "CMT", "CMT...
#> $ pickup_datetime <chr> "2013-01-01 15:11:48", "2013-01-06 00:18:35", ...
#> $ payment_type <chr> "CSH", "CSH", "CSH", "CSH", "CSH", "CSH", "CSH...
#> $ fare_amount <dbl> 6.5, 6.0, 5.5, 5.0, 9.5, 9.5, 6.0, 34.0, 5.5, ...
#> $ surcharge <dbl> 0.0, 0.5, 1.0, 0.5, 0.5, 0.0, 0.0, 0.0, 1.0, 0...
#> $ mta_tax <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0...
#> $ tip_amount <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
#> $ tolls_amount <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.8, 0.0, 0...
#> $ total_amount <dbl> 7.0, 7.0, 7.0, 6.0, 10.5, 10.0, 6.5, 39.3, 7.0...
code: bench/taxi
All benchmarks were run on an Amazon EC2 m5.4xlarge instance with 16 vCPUs and an EBS volume type.
In order to test the GPU-based cudf library, these benchmarks were run on an NVIDIA DGX workstation with the following specifications:
CPU:
GPU:
R and all necessary system dependencies are installed via conda. Benchmarks were run with development versions of arrow and cudf and released versions of all other packages.
The benchmarks labeled vroom_base use vroom with base functions for manipulation. vroom_dplyr uses vroom to read the file and dplyr functions to manipulate. data.table uses fread() to read the file and data.table functions to manipulate, and readr uses readr to read the file and dplyr to manipulate. By default vroom only uses Altrep for character vectors; these are labeled vroom(altrep: normal). The benchmarks labeled vroom(altrep: full) instead use Altrep vectors for all supported types, and vroom(altrep: none) disables Altrep entirely.
The arrow benchmarks are divided into three types: (1) data is read into an R data.frame using arrow and all subsequent processing happens with dplyr; (2) the same, but using data.table to aggregate in R; (3) data is held in an Arrow Table and computed on in Arrow to the extent possible (as of writing, this means selecting and filtering happen in Arrow but aggregation requires pulling a window of data into an R data.frame).
cudf is a Python package, here called from R using reticulate, and results are pulled into R using the arrow R package’s methods.
The following operations are performed.
- print() - N.B. read.delim uses print(head(x, 10)) because printing the whole dataset takes > 10 minutes
- head()
- tail()
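As a rough illustration of how such a per-step timing could be collected, the sketch below times each operation separately and sums the results into a total. It uses only the Python standard library and a tiny in-memory CSV. The step names follow the operations and table columns in this document, but the concrete choices (sampling up to 100 rows, filtering on payment_type == "CSH", averaging fare_amount per payment type) are assumptions made for the example, as are the names `timed` and `run_benchmark`.

```python
import csv
import io
import random
import time
from collections import defaultdict


def timed(label, fn, timings):
    """Run fn(), record its wall-clock duration in seconds, return its result."""
    start = time.perf_counter()
    result = fn()
    timings[label] = time.perf_counter() - start
    return result


def run_benchmark(csv_text, seed=1):
    timings = {}
    rows = timed("read", lambda: list(csv.DictReader(io.StringIO(csv_text))), timings)
    timed("print", lambda: str(rows[:10]), timings)
    timed("head", lambda: rows[:6], timings)
    timed("tail", lambda: rows[-6:], timings)

    rng = random.Random(seed)
    timed("sample", lambda: rng.sample(rows, min(100, len(rows))), timings)

    # Assumed filter: keep cash payments only.
    cash = timed("filter", lambda: [r for r in rows if r["payment_type"] == "CSH"], timings)

    def aggregate():
        # Assumed aggregation: mean fare per payment type.
        sums, counts = defaultdict(float), defaultdict(int)
        for r in rows:
            sums[r["payment_type"]] += float(r["fare_amount"])
            counts[r["payment_type"]] += 1
        return {k: sums[k] / counts[k] for k in sums}

    means = timed("aggregate", aggregate, timings)
    timings["total"] = sum(timings.values())
    return timings, cash, means


sample_csv = "payment_type,fare_amount\nCSH,6.5\nCRD,9.5\nCSH,5.5\n"
timings, cash, means = run_benchmark(sample_csv)
for step, seconds in timings.items():
    print(f"{step:>9}: {seconds * 1000:.3f} ms")
```

Timing each step on its own is what lets a lazy reader's deferred parsing cost show up in the later steps rather than disappearing from the comparison.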
reading package | manipulating package | altrep | memory | read | print | head | tail | sample | filter | aggregate | total |
---|---|---|---|---|---|---|---|---|---|---|---|
read.delim | base | | 13.1GB | 1m 22.6s | 7ms | 1ms | 1ms | 1ms | 1.2s | 895ms | 1m 24.8s |
readr | dplyr | | 13.1GB | 35.5s | 95ms | 1ms | 1ms | 10ms | 217ms | 498ms | 36.3s |
vroom | dplyr | FALSE | 12.7GB | 17.6s | 103ms | 1ms | 1ms | 9ms | 946ms | 1.2s | 19.9s |
data.table | data.table | | 13.7GB | 19.4s | 16ms | 1ms | 1ms | 1ms | 145ms | 226ms | 19.8s |
vroom | base | TRUE | 14.3GB | 676ms | 118ms | 1ms | 1ms | 1ms | 1.5s | 10.6s | 12.9s |
arrow | dplyr | | 25.4GB | 6.6s | 433ms | 657ms | 231ms | 179ms | 382ms | 598ms | 9.1s |
arrow | data.table | | 26.7GB | 6.5s | 14ms | 1ms | 1ms | 1ms | 120ms | 778ms | 7.4s |
vroom | dplyr | TRUE | 14.3GB | 694ms | 127ms | 1ms | 1ms | 10ms | 1.7s | 4.6s | 7.1s |
cudf | cudf | | 30.4GB | 2.8s | 229ms | 4ms | 4ms | 7ms | 1.1s | 130ms | 4.2s |
arrow | arrow | | 24.9GB | 723ms | 45ms | 1ms | 2ms | 610ms | 309ms | 1.4s | 3.1s |
(N.B. Rcpp used in the dplyr implementation fully materializes all the Altrep numeric vectors when using filter() or sample_n(), which is why the first of these cases has additional overhead when using full Altrep.)
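The last timing in each row of the table above is the sum of the step timings before it, up to rounding. A small Python sketch can cross-check this by converting the timing cells to seconds. The helper names `to_seconds` and `check_row` are hypothetical, and the 0.2 second tolerance is an assumption chosen to absorb the rounding of the individual cells.

```python
import re


def to_seconds(cell):
    """Convert a timing cell such as '1m 22.6s', '895ms' or '1.2s' to seconds."""
    units = {"m": 60.0, "s": 1.0, "ms": 0.001}
    return sum(float(v) * units[u] for v, u in re.findall(r"([\d.]+)\s*(ms|m|s)", cell))


def check_row(row, tolerance=0.2):
    """Return (sum of step timings, reported total, whether they agree)."""
    cells = [c.strip() for c in row.split("|") if c.strip()]
    # Keep only cells that look like durations; names, TRUE/FALSE and memory are skipped.
    timings = [to_seconds(c) for c in cells if re.fullmatch(r"(\d+m )?[\d.]+(ms|s)", c)]
    *steps, total = timings
    return sum(steps), total, abs(sum(steps) - total) <= tolerance


row = "read.delim | base | 13.1GB | 1m 22.6s | 7ms | 1ms | 1ms | 1ms | 1.2s | 895ms | 1m 24.8s"
steps, total, ok = check_row(row)
print(f"steps sum to {steps:.3f}s, reported total {total:.1f}s, consistent: {ok}")
```

For this row the steps add up to about 84.7 seconds against a reported total of 1m 24.8s, so the read step accounts for nearly all of the time.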
All numeric data is really a worst case scenario for vroom. The index takes about as much memory as the parsed data. Also, because parsing doubles can be done quickly in parallel and text representations of doubles are only ~25 characters at most, there isn’t a great deal of savings for delayed parsing.
For these reasons (and because the data.table implementation is very fast), vroom is a bit slower than fread for pure numeric data.
However, because vroom is multi-threaded, it is a bit quicker than readr and read.delim for this type of data.
code: bench/all_numeric-long
reading package | manipulating package | altrep | memory | read | print | head | tail | sample | filter | aggregate | total |
---|---|---|---|---|---|---|---|---|---|---|---|
read.delim | base | | 10.81GB | 1m 42.2s | 1.3s | 1ms | 1ms | 1ms | 4s | 44ms | 1m 47.5s |
readr | dplyr | | 8.9GB | 15s | 117ms | 1ms | 1ms | 10ms | 18ms | 92ms | 15.3s |
cudf | cudf | | 25.33GB | 2.1s | 271ms | 5ms | 4ms | 9ms | 29ms | 35ms | 2.4s |
vroom | base | FALSE | 8.76GB | 1s | 127ms | 1ms | 1ms | 3ms | 12ms | 109ms | 1.3s |
vroom | dplyr | FALSE | 8.77GB | 950ms | 138ms | 1ms | 1ms | 11ms | 22ms | 58ms | 1.2s |
vroom | dplyr | TRUE | 9.29GB | 192ms | 234ms | 1ms | 1ms | 12ms | 49ms | 270ms | 756ms |
vroom | base | TRUE | 9.05GB | 271ms | 172ms | 1ms | 1ms | 3ms | 37ms | 269ms | 750ms |
arrow | dplyr | | 12.89GB | 539ms | 111ms | 1ms | 1ms | 11ms | 18ms | 61ms | 739ms |
arrow | arrow | | 12.89GB | 345ms | 71ms | 1ms | 2ms | 67ms | 55ms | 158ms | 697ms |
data.table | data.table | | 9.92GB | 169ms | 14ms | 1ms | 1ms | 3ms | 9ms | 26ms | 220ms |
code: bench/all_numeric-wide
reading package | manipulating package | altrep | memory | read | print | head | tail | sample | filter | aggregate | total |
---|---|---|---|---|---|---|---|---|---|---|---|
read.delim | base | | 20.5GB | 7m 49.1s | 174ms | 9ms | 9ms | 10ms | 90ms | 6ms | 7m 49.4s |
readr | dplyr | | 11.4GB | 1m 5.3s | 163ms | 3ms | 3ms | 31ms | 16ms | 41ms | 1m 5.6s |
arrow | dplyr | | 18.3GB | 12s | 155ms | 2ms | 2ms | 20ms | 16ms | 50ms | 12.2s |
arrow | arrow | | 17.6GB | 3.7s | 422ms | 3ms | 146ms | 1.6s | 4s | 128ms | 10s |
vroom | dplyr | FALSE | 11.3GB | 6.5s | 191ms | 2ms | 3ms | 27ms | 18ms | 48ms | 6.7s |
vroom | base | FALSE | 11.3GB | 6.4s | 216ms | 3ms | 3ms | 5ms | 6ms | 8ms | 6.6s |
cudf | cudf | | 29.2GB | 3.8s | 388ms | 39ms | 38ms | 57ms | 89ms | 27ms | 4.5s |
data.table | data.table | | 12.7GB | 960ms | 143ms | 7ms | 7ms | 8ms | 8ms | 5ms | 1.1s |
vroom | dplyr | TRUE | 13.3GB | 642ms | 199ms | 4ms | 4ms | 20ms | 23ms | 70ms | 958ms |
vroom | base | TRUE | 13.2GB | 663ms | 232ms | 4ms | 4ms | 5ms | 13ms | 36ms | 954ms |
code: bench/all_character-long
All character data is a best case scenario for vroom when using Altrep, as it takes full advantage of the lazy reading.
reading package | manipulating package | altrep | memory | read | print | head | tail | sample | filter | aggregate | total |
---|---|---|---|---|---|---|---|---|---|---|---|
read.delim | base | | 10.49GB | 1m 29.8s | 9ms | 1ms | 1ms | 2ms | 30ms | 434ms | 1m 30.3s |
readr | dplyr | | 10.29GB | 53.6s | 107ms | 1ms | 1ms | 11ms | 16ms | 371ms | 54.1s |
vroom | dplyr | FALSE | 10.26GB | 45.7s | 107ms | 1ms | 1ms | 10ms | 16ms | 309ms | 46.1s |
data.table | data.table | | 11.47GB | 36s | 18ms | 1ms | 1ms | 4ms | 18ms | 290ms | 36.4s |
arrow | dplyr | | 24.77GB | 33.1s | 110ms | 1ms | 1ms | 11ms | 16ms | 310ms | 33.5s |
cudf | cudf | | 25.75GB | 2.2s | 306ms | 6ms | 5ms | 12ms | 100ms | 53ms | 2.7s |
vroom | base | TRUE | 8.99GB | 278ms | 126ms | 1ms | 1ms | 2ms | 161ms | 2.1s | 2.6s |
arrow | arrow | | 23.18GB | 241ms | 70ms | 1ms | 2ms | 112ms | 64ms | 1.3s | 1.8s |
vroom | dplyr | TRUE | 8.97GB | 179ms | 150ms | 1ms | 1ms | 9ms | 172ms | 1s | 1.6s |
code: bench/all_character-wide
reading package | manipulating package | altrep | memory | read | print | head | tail | sample | filter | aggregate | total |
---|---|---|---|---|---|---|---|---|---|---|---|
read.delim | base | | 19.1GB | 7m 55.5s | 192ms | 9ms | 9ms | 24ms | 193ms | 62ms | 7m 56s |
readr | dplyr | | 18.3GB | 4m 19.6s | 159ms | 2ms | 3ms | 23ms | 31ms | 70ms | 4m 19.8s |
vroom | dplyr | FALSE | 17.8GB | 3m 21.4s | 160ms | 2ms | 3ms | 23ms | 31ms | 56ms | 3m 21.7s |
arrow | dplyr | | 25.9GB | 2m 27.5s | 155ms | 2ms | 3ms | 24ms | 31ms | 57ms | 2m 27.8s |
data.table | data.table | | 19.2GB | 2m 24.5s | 200ms | 1ms | 1ms | 25ms | 132ms | 28ms | 2m 24.9s |
arrow | arrow | | 18.6GB | 6.9s | 479ms | 3ms | 123ms | 2.5s | 4.7s | 166ms | 14.9s |
cudf | cudf | | 28.6GB | 6.6s | 573ms | 75ms | 76ms | 200ms | 241ms | 33ms | 7.8s |
vroom | base | TRUE | 12.6GB | 647ms | 175ms | 4ms | 4ms | 5ms | 44ms | 180ms | 1.1s |
vroom | dplyr | TRUE | 12.6GB | 559ms | 154ms | 4ms | 4ms | 19ms | 55ms | 115ms | 907ms |
code: bench/taxi_multiple
The benchmark reads all 12 files in the taxi trip fare data, totaling 173,179,759 rows and 11 columns for a total file size of 18.4G.
reading package | manipulating package | altrep | memory | read | print | head | tail | sample | filter | aggregate | total |
---|---|---|---|---|---|---|---|---|---|---|---|
readr | dplyr | | 63.8GB | 7m 56.6s | 93ms | 1ms | 1ms | 9ms | 3.8s | 12.7s | 8m 13.1s |
data.table | data.table | | 66.1GB | 5m 6.3s | 7ms | 1ms | 1ms | 1ms | 1.1s | 11s | 5m 18.4s |
vroom | dplyr | FALSE | 62.4GB | 3m 47s | 1.7s | 1ms | 1ms | 9ms | 10.8s | 6.6s | 4m 6.2s |
arrow | arrow | | 96.2GB | 72ms | 36ms | 15ms | 10.5s | 2m 13.5s | 10.3s | 18.2s | 2m 52.5s |
vroom | base | TRUE | 83GB | 5.7s | 2.5s | 1ms | 1ms | 1ms | 21.5s | 2m 19.2s | 2m 48.9s |
vroom | dplyr | TRUE | 82.7GB | 5.6s | 2.4s | 1ms | 1ms | 8ms | 24.2s | 58.5s | 1m 30.8s |
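Reading several delimited files that share one header, as in the multi-file benchmark above, amounts to keeping the header once and appending the data rows of every file. The following Python sketch shows that logic with the standard csv module and in-memory sources. `read_many` is a hypothetical helper, not the API of any package benchmarked here.

```python
import csv
import io


def read_many(sources):
    """Concatenate rows from several CSV sources that share one header."""
    header, rows = None, []
    for src in sources:
        reader = csv.reader(src)
        file_header = next(reader, None)
        if file_header is None:      # empty source, nothing to add
            continue
        if header is None:
            header = file_header
        elif file_header != header:
            raise ValueError("all sources must have the same header")
        rows.extend(reader)
    return header, rows


parts = [
    io.StringIO("payment_type,fare_amount\nCSH,6.5\n"),
    io.StringIO("payment_type,fare_amount\nCRD,9.5\nCSH,5.5\n"),
]
header, rows = read_many(parts)
print(header, len(rows))
```

With real files, each element of `sources` would be an open file handle instead of an `io.StringIO`.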
Arrow is not included in these benchmarks because currently it does not support reading fixed-width files.
This fixed width dataset contains individual records of the characteristics of a 5 percent sample of people and housing units from the year 2000 and is freely available at https://www2.census.gov/census_2000/datasets/PUMS/FivePercent/California/all_California.zip. The data is split into files by state, and the state of California was used in this benchmark.
The data totals 2,342,339 rows and 37 columns with a total file size of 677M.
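In a fixed-width file there are no delimiters. Each column occupies a fixed range of character positions in every record, so parsing reduces to slicing. The Python sketch below illustrates this; the column names and positions in `layout` are hypothetical and do not reflect the real PUMS record layout.

```python
def parse_fixed_width(lines, fields):
    """Slice each record by character position.

    fields maps a column name to (start, end) offsets, with end exclusive.
    """
    records = []
    for line in lines:
        line = line.rstrip("\n")
        records.append({name: line[a:b].strip() for name, (a, b) in fields.items()})
    return records


# Hypothetical layout for illustration only.
layout = {"record_type": (0, 1), "serial": (1, 8), "state": (8, 10)}
sample = ["H000001206\n", "P000001306\n"]

records = parse_fixed_width(sample, layout)
print(records[0])
```

Because every field's position is known in advance, a reader does not need to scan for delimiters before it can locate a value.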
code: bench/fwf
reading package | manipulating package | altrep | memory | read | head | tail | sample | filter | aggregate | total | |
---|---|---|---|---|---|---|---|---|---|---|---|
read.delim | base | 13.96GB | 17m 33.8s | 20ms | 1ms | 2ms | 3ms | 347ms | 93ms | 17m 34.3s | |
readr | dplyr | 11.69GB | 30s | 124ms | 1ms | 1ms | 14ms | 79ms | 86ms | 30.3s | |
vroom | dplyr | FALSE | 11.4GB | 13.1s | 122ms | 1ms | 1ms | 14ms | 419ms | 84ms | 13.7s |
vroom | base | TRUE | 9.37GB | 149ms | 159ms | 1ms | 1ms | 4ms | 224ms | 1.6s | 2.1s |
vroom | dplyr | TRUE | 10.14GB | 153ms | 158ms | 1ms | 1ms | 14ms | 237ms | 940ms | 1.5s |
Arrow is not included in these benchmarks because it currently only writes Feather and Parquet formats, not CSV.
code: bench/taxi_writing
The benchmarks write out the taxi trip dataset in a few different ways.
- gzfile() (readr and vroom do this automatically for files ending in .gz)
- pipe() connection to pigz (for the rest)
compression | base | data.table | readr | vroom |
---|---|---|---|---|
gzip | 4m 18.8s | 1m 18.5s | 2m 47.4s | 1m 27.5s |
multithreaded_gzip | 2m 0.8s | 6.2s | 1m 22.8s | 6.7s |
zstandard | 2m 6.2s | NA | 1m 22.9s | 10s |
uncompressed | 2m 9.5s | 9.1s | 1m 30.8s | 9.2s |
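The difference between the gzip and multithreaded_gzip rows above comes down to where the compression runs: inside the writing process, or in an external multithreaded compressor fed through a pipe. The Python sketch below shows both routes as an analogy to gzfile() and a pipe() connection to pigz. The function names are hypothetical, and the pigz route assumes the pigz executable is on the PATH (it returns False otherwise).

```python
import csv
import gzip
import io
import shutil
import subprocess


def write_csv_gzip(path, header, rows):
    """Single-threaded gzip: compress in-process while writing."""
    with gzip.open(path, "wt", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        writer.writerows(rows)


def write_csv_pigz(path, header, rows):
    """Multithreaded gzip: pipe the CSV text to pigz. Returns False if pigz is missing."""
    if shutil.which("pigz") is None:
        return False
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    with open(path, "wb") as out:
        # pigz -c reads stdin and writes the compressed stream to stdout.
        subprocess.run(["pigz", "-c"], input=buf.getvalue().encode("utf-8"),
                       stdout=out, check=True)
    return True
```

Both functions produce a gzip-compatible file, so either output can be read back with `gzip.open(path, "rt")`. The sketch buffers the whole CSV in memory before piping, which is fine for illustration but would not suit a file of the size benchmarked here.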
package | version | date | source |
---|---|---|---|
base | 4.0.2 | 2020-07-21 | local |
data.table | 1.12.8 | 2019-12-09 | CRAN (R 4.0.0) |
dplyr | 1.0.0 | 2020-05-29 | CRAN (R 4.0.0) |
readr | 1.3.1 | 2018-12-21 | CRAN (R 4.0.0) |
vroom | 1.2.1 | 2020-05-12 | CRAN (R 4.0.0) |