Vroom Benchmarks

This is a revision of vroom’s official benchmarks, found at https://vroom.r-lib.org/articles/benchmarks.html. They have been modified to include several Apache Arrow-based projects for comparison. Additions to the original document’s text are indicated in bold.

Reading delimited files

The following benchmarks all measure reading delimited files of various sizes and data types. Because vroom reads lazily, the benchmarks also do some manipulation of the data afterwards to provide a more realistic performance comparison.

Because the read.delim results are so much slower than the others, they are excluded from the plots but retained in the tables.

Taxi Trip Dataset

This real-world dataset is from the Freedom of Information Law (FOIL) Taxi Trip Data released by the NYC Taxi and Limousine Commission in 2013, originally posted at http://chriswhong.com/open-data/foil_nyc_taxi/. It is also hosted on archive.org.

The first table, trip_fare_1.csv, is 1.55 GB in size.
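
The column summary below can be reproduced with a quick read-and-glimpse, sketched here under the assumption that the file sits in the working directory:

library(vroom)

# read the first fare file and print a column-by-column summary
x <- vroom("trip_fare_1.csv")
dplyr::glimpse(x)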

#> Observations: 14,776,615
#> Variables: 11
#> $ medallion       <chr> "89D227B655E5C82AECF13C3F540D4CF4", "0BD7C8F5B...
#> $ hack_license    <chr> "BA96DE419E711691B9445D6A6307C170", "9FD8F69F0...
#> $ vendor_id       <chr> "CMT", "CMT", "CMT", "CMT", "CMT", "CMT", "CMT...
#> $ pickup_datetime <chr> "2013-01-01 15:11:48", "2013-01-06 00:18:35", ...
#> $ payment_type    <chr> "CSH", "CSH", "CSH", "CSH", "CSH", "CSH", "CSH...
#> $ fare_amount     <dbl> 6.5, 6.0, 5.5, 5.0, 9.5, 9.5, 6.0, 34.0, 5.5, ...
#> $ surcharge       <dbl> 0.0, 0.5, 1.0, 0.5, 0.5, 0.0, 0.0, 0.0, 1.0, 0...
#> $ mta_tax         <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0...
#> $ tip_amount      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
#> $ tolls_amount    <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.8, 0.0, 0...
#> $ total_amount    <dbl> 7.0, 7.0, 7.0, 6.0, 10.5, 10.0, 6.5, 39.3, 7.0...

Taxi Benchmarks

code: bench/taxi

The original benchmarks were run on an Amazon EC2 m5.4xlarge instance with 16 vCPUs and an EBS volume.

In order to test the GPU-based cudf library, the benchmarks here were instead run on an NVIDIA DGX workstation with the following specifications:

CPU:

  • Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
  • 256 GB RAM (2400 MHz), SSD

GPU:

  • 4 NVIDIA V100 GPUs, 5,120 CUDA cores each
  • 32 GB memory per GPU

R and all necessary system dependencies were installed via conda. Benchmarks were run with development versions of arrow and cudf and released versions of all other packages.

The benchmarks labeled vroom_base use vroom to read the file and base functions to manipulate; vroom_dplyr uses vroom to read and dplyr functions to manipulate; data.table uses fread() to read and data.table functions to manipulate; and readr uses readr to read and dplyr to manipulate. By default vroom only uses Altrep for character vectors; these benchmarks are labeled vroom(altrep: normal). The benchmarks labeled vroom(altrep: full) instead use Altrep vectors for all supported types, and vroom(altrep: none) disables Altrep entirely.
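
As a rough sketch, the three vroom variants correspond to different values of vroom()'s altrep argument (the argument name and accepted values follow the vroom 1.2 documentation; treat the exact spellings as an assumption):

library(vroom)

# altrep: normal -- the default; lazy Altrep vectors for character columns only
x <- vroom("trip_fare_1.csv", altrep = "chr")

# altrep: full -- lazy Altrep vectors for all supported column types
x <- vroom("trip_fare_1.csv", altrep = TRUE)

# altrep: none -- disable Altrep entirely and parse everything eagerly
x <- vroom("trip_fare_1.csv", altrep = FALSE)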

The arrow benchmarks are divided into three types: (1) the data is read into an R data.frame using arrow and all subsequent processing happens with dplyr; (2) the same, but using data.table to aggregate in R; (3) the data is held in an Arrow Table and computed on in Arrow to the extent possible (as of this writing, selecting and filtering happen in Arrow, but aggregation requires pulling a window of data into an R data.frame).
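
A condensed sketch of how the three variants read the file (read_csv_arrow() and its as_data_frame argument are from the arrow R package; the degree of dplyr pushdown reflects the state of arrow at the time of writing):

library(arrow)

# variants (1) and (2): materialize a full R data.frame, then manipulate
# with dplyr or data.table as usual
df <- read_csv_arrow("trip_fare_1.csv")

# variant (3): keep the data in an Arrow Table; select/filter can stay in
# Arrow, while aggregation still pulls a window of data into R
tab <- read_csv_arrow("trip_fare_1.csv", as_data_frame = FALSE)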

cudf is a Python package, here called from R using reticulate, and results are pulled into R using the arrow R package’s methods.
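
Roughly, the cudf calls look like the following (a sketch assuming a working reticulate and cudf installation; cudf.read_csv is the Python API):

library(reticulate)

# import the Python cudf package and parse the CSV on the GPU;
# the result is a cudf DataFrame held in GPU memory
cudf <- import("cudf")
gdf <- cudf$read_csv("trip_fare_1.csv")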

The following operations are performed; a dplyr sketch of the filter and aggregate steps follows the list.

  • The data is read
  • print() - N.B. read.delim uses print(head(x, 10)) because printing the whole dataset takes > 10 minutes
  • head()
  • tail()
  • Sampling 100 random rows
  • Filtering for “UNK” payment type, which matches 6,434 rows (0.0435% of the total).
  • Aggregation of mean fare amount per payment type.
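
For example, the filter and aggregate steps correspond roughly to this dplyr pipeline (a sketch, where x stands for the data as read by each package):

library(dplyr)

# filter: trips with the "UNK" payment type
unk <- filter(x, payment_type == "UNK")

# aggregate: mean fare amount per payment type
x %>%
  group_by(payment_type) %>%
  summarise(avg_fare = mean(fare_amount))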

reading package  manipulating package  altrep  memory  read      print  head   tail   sample  filter  aggregate  total
read.delim       base                          13.1GB  1m 22.6s  7ms    1ms    1ms    1ms     1.2s    895ms      1m 24.8s
readr            dplyr                         13.1GB  35.5s     95ms   1ms    1ms    10ms    217ms   498ms      36.3s
vroom            dplyr                 FALSE   12.7GB  17.6s     103ms  1ms    1ms    9ms     946ms   1.2s       19.9s
data.table       data.table                    13.7GB  19.4s     16ms   1ms    1ms    1ms     145ms   226ms      19.8s
vroom            base                  TRUE    14.3GB  676ms     118ms  1ms    1ms    1ms     1.5s    10.6s      12.9s
arrow            dplyr                         25.4GB  6.6s      433ms  657ms  231ms  179ms   382ms   598ms      9.1s
arrow            data.table                    26.7GB  6.5s      14ms   1ms    1ms    1ms     120ms   778ms      7.4s
vroom            dplyr                 TRUE    14.3GB  694ms     127ms  1ms    1ms    10ms    1.7s    4.6s       7.1s
cudf             cudf                          30.4GB  2.8s      229ms  4ms    4ms    7ms     1.1s    130ms      4.2s
arrow            arrow                         24.9GB  723ms     45ms   1ms    2ms    610ms   309ms   1.4s       3.1s

(N.B. Rcpp used in the dplyr implementation fully materializes all the Altrep numeric vectors when using filter() or sample_n(), which is why the first of these cases has additional overhead when using full Altrep.)

All numeric data

All numeric data is really a worst-case scenario for vroom. The index takes about as much memory as the parsed data, and because parsing doubles can be done quickly in parallel and text representations of doubles are at most ~25 characters, there isn't a great deal of savings from delayed parsing.

For these reasons (and because the data.table implementation is very fast), vroom is a bit slower than fread for pure numeric data.

However, because vroom is multi-threaded, it is a bit quicker than readr and read.delim for this type of data.
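
Inputs like these can be simulated with vroom's data generator (a sketch; gen_tbl() is vroom's helper for generating benchmark tables, and the dimensions here are illustrative rather than the exact ones used):

library(vroom)

# a long table of doubles: 10 million rows of 25 double ("d") columns
long <- gen_tbl(1e7, 25, col_types = strrep("d", 25))
vroom_write(long, "all_numeric-long.tsv")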

Long

code: bench/all_numeric-long

reading package  manipulating package  altrep  memory   read      print  head  tail  sample  filter  aggregate  total
read.delim       base                          10.81GB  1m 42.2s  1.3s   1ms   1ms   1ms     4s      44ms       1m 47.5s
readr            dplyr                         8.9GB    15s       117ms  1ms   1ms   10ms    18ms    92ms       15.3s
cudf             cudf                          25.33GB  2.1s      271ms  5ms   4ms   9ms     29ms    35ms       2.4s
vroom            base                  FALSE   8.76GB   1s        127ms  1ms   1ms   3ms     12ms    109ms      1.3s
vroom            dplyr                 FALSE   8.77GB   950ms     138ms  1ms   1ms   11ms    22ms    58ms       1.2s
vroom            dplyr                 TRUE    9.29GB   192ms     234ms  1ms   1ms   12ms    49ms    270ms      756ms
vroom            base                  TRUE    9.05GB   271ms     172ms  1ms   1ms   3ms     37ms    269ms      750ms
arrow            dplyr                         12.89GB  539ms     111ms  1ms   1ms   11ms    18ms    61ms       739ms
arrow            arrow                         12.89GB  345ms     71ms   1ms   2ms   67ms    55ms    158ms      697ms
data.table       data.table                    9.92GB   169ms     14ms   1ms   1ms   3ms     9ms     26ms       220ms

Wide

code: bench/all_numeric-wide

reading package  manipulating package  altrep  memory  read      print  head  tail   sample  filter  aggregate  total
read.delim       base                          20.5GB  7m 49.1s  174ms  9ms   9ms    10ms    90ms    6ms        7m 49.4s
readr            dplyr                         11.4GB  1m 5.3s   163ms  3ms   3ms    31ms    16ms    41ms       1m 5.6s
arrow            dplyr                         18.3GB  12s       155ms  2ms   2ms    20ms    16ms    50ms       12.2s
arrow            arrow                         17.6GB  3.7s      422ms  3ms   146ms  1.6s    4s      128ms      10s
vroom            dplyr                 FALSE   11.3GB  6.5s      191ms  2ms   3ms    27ms    18ms    48ms       6.7s
vroom            base                  FALSE   11.3GB  6.4s      216ms  3ms   3ms    5ms     6ms     8ms        6.6s
cudf             cudf                          29.2GB  3.8s      388ms  39ms  38ms   57ms    89ms    27ms       4.5s
data.table       data.table                    12.7GB  960ms     143ms  7ms   7ms    8ms     8ms     5ms        1.1s
vroom            dplyr                 TRUE    13.3GB  642ms     199ms  4ms   4ms    20ms    23ms    70ms       958ms
vroom            base                  TRUE    13.2GB  663ms     232ms  4ms   4ms    5ms     13ms    36ms       954ms

All character data

All character data is a best-case scenario for vroom when using Altrep, as it takes full advantage of the lazy reading.

Long

code: bench/all_character-long

reading package  manipulating package  altrep  memory   read      print  head  tail  sample  filter  aggregate  total
read.delim       base                          10.49GB  1m 29.8s  9ms    1ms   1ms   2ms     30ms    434ms      1m 30.3s
readr            dplyr                         10.29GB  53.6s     107ms  1ms   1ms   11ms    16ms    371ms      54.1s
vroom            dplyr                 FALSE   10.26GB  45.7s     107ms  1ms   1ms   10ms    16ms    309ms      46.1s
data.table       data.table                    11.47GB  36s       18ms   1ms   1ms   4ms     18ms    290ms      36.4s
arrow            dplyr                         24.77GB  33.1s     110ms  1ms   1ms   11ms    16ms    310ms      33.5s
cudf             cudf                          25.75GB  2.2s      306ms  6ms   5ms   12ms    100ms   53ms       2.7s
vroom            base                  TRUE    8.99GB   278ms     126ms  1ms   1ms   2ms     161ms   2.1s       2.6s
arrow            arrow                         23.18GB  241ms     70ms   1ms   2ms   112ms   64ms    1.3s       1.8s
vroom            dplyr                 TRUE    8.97GB   179ms     150ms  1ms   1ms   9ms     172ms   1s         1.6s

Wide

code: bench/all_character-wide

reading package  manipulating package  altrep  memory  read      print  head  tail   sample  filter  aggregate  total
read.delim       base                          19.1GB  7m 55.5s  192ms  9ms   9ms    24ms    193ms   62ms       7m 56s
readr            dplyr                         18.3GB  4m 19.6s  159ms  2ms   3ms    23ms    31ms    70ms       4m 19.8s
vroom            dplyr                 FALSE   17.8GB  3m 21.4s  160ms  2ms   3ms    23ms    31ms    56ms       3m 21.7s
arrow            dplyr                         25.9GB  2m 27.5s  155ms  2ms   3ms    24ms    31ms    57ms       2m 27.8s
data.table       data.table                    19.2GB  2m 24.5s  200ms  1ms   1ms    25ms    132ms   28ms       2m 24.9s
arrow            arrow                         18.6GB  6.9s      479ms  3ms   123ms  2.5s    4.7s    166ms      14.9s
cudf             cudf                          28.6GB  6.6s      573ms  75ms  76ms   200ms   241ms   33ms       7.8s
vroom            base                  TRUE    12.6GB  647ms     175ms  4ms   4ms    5ms     44ms    180ms      1.1s
vroom            dplyr                 TRUE    12.6GB  559ms     154ms  4ms   4ms    19ms    55ms    115ms      907ms

Reading multiple delimited files

code: bench/taxi_multiple

The benchmark reads all 12 files in the taxi trip fare data, totaling 173,179,759 rows and 11 columns, for a total file size of 18.4 GB.
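
vroom handles this case in a single call, since it accepts a vector of paths and reads them as one data frame (a sketch; the file names are illustrative):

library(vroom)

# read all 12 monthly fare files into one data frame
files <- sprintf("trip_fare_%d.csv", 1:12)
x <- vroom(files)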

reading package  manipulating package  altrep  memory  read      print  head  tail   sample    filter  aggregate  total
readr            dplyr                         63.8GB  7m 56.6s  93ms   1ms   1ms    9ms       3.8s    12.7s      8m 13.1s
data.table       data.table                    66.1GB  5m 6.3s   7ms    1ms   1ms    1ms       1.1s    11s        5m 18.4s
vroom            dplyr                 FALSE   62.4GB  3m 47s    1.7s   1ms   1ms    9ms       10.8s   6.6s       4m 6.2s
arrow            arrow                         96.2GB  72ms      36ms   15ms  10.5s  2m 13.5s  10.3s   18.2s      2m 52.5s
vroom            base                  TRUE    83GB    5.7s      2.5s   1ms   1ms    1ms       21.5s   2m 19.2s   2m 48.9s
vroom            dplyr                 TRUE    82.7GB  5.6s      2.4s   1ms   1ms    8ms       24.2s   58.5s      1m 30.8s

Reading fixed width files

Arrow is not included in these benchmarks because it does not currently support reading fixed-width files.

United States Census 5-Percent Public Use Microdata Sample files

This fixed-width dataset contains individual records of the characteristics of a 5 percent sample of people and housing units from the year 2000 and is freely available at https://www2.census.gov/census_2000/datasets/PUMS/FivePercent/California/all_California.zip. The data is split into files by state; the state of California was used in this benchmark.

The data totals 2,342,339 rows and 37 columns, with a total file size of 677 MB.
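
For reference, reading a fixed-width file with vroom means supplying the column positions explicitly (a sketch; the file name, widths, and column names are placeholders, not the real PUMS layout):

library(vroom)

# vroom_fwf() takes explicit column positions; fwf_widths() builds them
# from a vector of field widths and column names
x <- vroom_fwf(
  "all_California.dat",
  col_positions = fwf_widths(c(2, 7, 4), c("record", "serial", "year"))
)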

Census data benchmarks

code: bench/fwf

reading package  manipulating package  altrep  memory   read       print  head  tail  sample  filter  aggregate  total
read.delim       base                          13.96GB  17m 33.8s  20ms   1ms   2ms   3ms     347ms   93ms       17m 34.3s
readr            dplyr                         11.69GB  30s        124ms  1ms   1ms   14ms    79ms    86ms       30.3s
vroom            dplyr                 FALSE   11.4GB   13.1s      122ms  1ms   1ms   14ms    419ms   84ms       13.7s
vroom            base                  TRUE    9.37GB   149ms      159ms  1ms   1ms   4ms     224ms   1.6s       2.1s
vroom            dplyr                 TRUE    10.14GB  153ms      158ms  1ms   1ms   14ms    237ms   940ms      1.5s

Writing delimited files

Arrow is not included in these benchmarks because it currently only writes the Feather and Parquet formats, not CSV.

code: bench/taxi_writing

The benchmarks write out the taxi trip dataset in a few different ways.
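
A sketch of how the variants can be produced with vroom (assuming vroom_write() accepts connection objects as readr's writers do, and compresses by file extension; the external pigz and zstd commands must be installed):

library(vroom)

# uncompressed tab-separated output
vroom_write(x, "trip_fare.tsv")

# gzip: writing to a .gz path compresses with a single thread
vroom_write(x, "trip_fare.tsv.gz")

# multithreaded gzip and zstandard via external compressors on a pipe
vroom_write(x, pipe("pigz > trip_fare.tsv.gz"))
vroom_write(x, pipe("zstd > trip_fare.tsv.zst"))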

compression         base      data.table  readr     vroom
gzip                4m 18.8s  1m 18.5s    2m 47.4s  1m 27.5s
multithreaded_gzip  2m 0.8s   6.2s        1m 22.8s  6.7s
zstandard           2m 6.2s   NA          1m 22.9s  10s
uncompressed        2m 9.5s   9.1s        1m 30.8s  9.2s

Session and package information

package     version  date        source
base        4.0.2    2020-07-21  local
data.table  1.12.8   2019-12-09  CRAN (R 4.0.0)
dplyr       1.0.0    2020-05-29  CRAN (R 4.0.0)
readr       1.3.1    2018-12-21  CRAN (R 4.0.0)
vroom       1.2.1    2020-05-12  CRAN (R 4.0.0)