with Arrow, Parquet, and DuckDB
useR! Conference 2025
August 10, 2025
US Census Public Use Microdata Sample (PUMS)
53 million rows
311 columns
45 GB in CSVs on disk
~80 GB in memory in R (if it were possible)
With arrow, it is possible
Parquet files for storing data on disk
Dataset: data chunked into meaningful partitions (files)
Arrow's fast query engine, computing in parallel across chunks of data
Open standard file format for storing data on disk
Arrow : Memory :: Parquet : Disk
Widely adopted across big data ecosystem, implemented in many languages
Row vs. column orientation: Parquet is columnar (like R and Arrow)
Columnar layout allows selective reading of data: we usually don't need all columns
Columnar allows better compression and encoding
Types: rich type system in Parquet, including nested and complex types
Types are preserved in the schema, not inferred at read time (see the sketch below)
High fidelity, smaller file sizes, faster reads
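A minimal sketch of that round trip, assuming a small illustrative data frame (flights_small is hypothetical); write_parquet(), read_parquet(), and as_data_frame = FALSE are from the arrow package:

library(arrow)

# Hypothetical example data frame with mixed column types
flights_small <- data.frame(
  carrier = c("AA", "DL"),
  dep_delay = c(4L, -2L),
  dep_time = as.POSIXct(c("2025-01-01 08:04:00", "2025-01-01 09:58:00"))
)

# Write to Parquet: the column types are stored in the file's schema
write_parquet(flights_small, "flights_small.parquet")

# Read back: types come from the stored schema, not from re-inference
flights_back <- read_parquet("flights_small.parquet")

# Inspect the stored schema without converting to a data frame
read_parquet("flights_small.parquet", as_data_frame = FALSE)$schema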
pums_person$files[1:10] |> stringr::str_replace("^.*data/person/", "")
## [1] "year=2005/location=ak/part-0.parquet"
## [2] "year=2005/location=al/part-0.parquet"
## [3] "year=2005/location=ar/part-0.parquet"
## [4] "year=2005/location=az/part-0.parquet"
## [5] "year=2005/location=ca/part-0.parquet"
## [6] "year=2005/location=co/part-0.parquet"
## [7] "year=2005/location=ct/part-0.parquet"
## [8] "year=2005/location=dc/part-0.parquet"
## [9] "year=2005/location=de/part-0.parquet"
## [10] "year=2005/location=fl/part-0.parquet"
"Hive-style partitioning": both column names and values are encoded in the file path
Convention: directory (or bucket in cloud storage) of Parquet files with a common schema
We can split data into files called partitions based on the values of one or more columns
When we query it, we can skip over files that don't match our query (see the sketch below)
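As a sketch, this is one way a dataset like pums_person could be opened, assuming the partitioned directory lives at ./data/person locally; the year and location columns come from the Hive-style file paths:

library(arrow)
library(dplyr)

# Open the partitioned directory as one Dataset (no data is read yet)
pums_person <- open_dataset("./data/person")

# Filtering on a partition column lets arrow skip files that don't match
pums_person |>
  filter(year == 2021) |>
  count() |>
  collect()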
Example query from before:
pums_person |>
  filter(year == 2021, grepl("Subway|Light rail", JWTRNS)) |>
  summarize(n = sum(PWGTP)) |>
  collect()
We didn't have to scan all 884 files; we only had to scan the 52 files that contain data for the year 2021.
# Open the oversized CSV lazily as a Dataset (nothing is read into memory yet)
my_dataset <- open_csv_dataset("./data/my-oversized-csv.csv")

# Derive partition columns, then write a partitioned Parquet dataset to disk
my_dataset |>
  mutate(year = year(datetime), month = month(datetime)) |>
  write_dataset(
    path = "./data/my-much-nicer-parquet-dataset",
    partitioning = c("year", "month")
  )
Can point at a single large file that you can't read into memory: arrow can process it in chunks
Can use a dataset that already has partitions and write it out under new partitions (see the sketch below)
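A minimal sketch of that repartitioning, assuming the year/location dataset from above and a (hypothetical) decision to partition by year only:

library(arrow)

# Open the existing year/location-partitioned dataset
pums_person <- open_dataset("./data/person")

# Rewrite it under a different partitioning scheme
pums_person |>
  write_dataset(
    path = "./data/person-by-year",
    partitioning = "year"
  )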
When choosing partition columns, consider:
What queries you will run
Number of unique values in the partition columns
Too many partitions -> too many files, and you lose the benefits of column orientation and compression
pums_person |>
  filter(year == 2021, grepl("Subway|Light rail", JWTRNS)) |>
  summarize(n = sum(PWGTP)) |>
  collect()
Each step builds up a query, as in dbplyr
No computation happens until collect() (see the sketch below)
This allows optimizations: we only need to read 52 files and 2 columns, not 884 files and 311 columns
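As a sketch, the lazy query can be stored and inspected before anything runs; printing shows the plan, and collect() triggers execution (the variable name is illustrative):

# Build the query: no files are read yet
commuters <- pums_person |>
  filter(year == 2021, grepl("Subway|Light rail", JWTRNS)) |>
  summarize(n = sum(PWGTP))

# Printing shows a FileSystemDataset query plan, not the result
commuters

# Execution happens here, returning a small tibble
collect(commuters)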
arrow has great dplyr support
Currently: 222 R functions, 37 dplyr verbs
All the things you'd expect from dbplyr
Deep support for stringr and lubridate
For example:
pums_person |>
  filter(AGEP > 18) |>
  transmute(
    higher_education = stringr::str_detect(SCHL, "(Bach|Mast|Prof|Doct|college|degree)")
  )
## FileSystemDataset (query)
## higher_education: bool (match_substring_regex(SCHL, {pattern="(Bach|Mast|Prof|Doct|college|degree)", ignore_case=false}))
##
## * Filter: (AGEP > 18)
## See $.data for the source Arrow object
Additional arrow compute functions (280) available
User-defined functions (UDFs): write your own R function and run it in the arrow query engine (see the sketch below)
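A minimal sketch of a UDF registered with register_scalar_function() from the arrow package; the function name, the 45-minute threshold, and the cast of JWMNP are illustrative assumptions:

library(arrow)
library(dplyr)

# Register a scalar UDF; auto_convert = TRUE allows use in dplyr pipelines
register_scalar_function(
  "is_long_commute",
  function(context, minutes) minutes > 45,  # illustrative threshold
  in_type = float64(),
  out_type = bool(),
  auto_convert = TRUE
)

# Use it inside an arrow query like any other function
pums_person |>
  mutate(long_commute = is_long_commute(as.numeric(JWMNP))) |>
  count(long_commute) |>
  collect()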
For many workflows, either one will serve just fine
Historically, arrow has had richer dplyr integration
If you need advanced SQL features, use duckdb (see the sketch after this list)
Mostly a matter of preference
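You can also mix them: to_duckdb() (in the arrow package) hands an Arrow dataset or query to duckdb without copying the data, and to_arrow() hands it back. A minimal sketch, assuming pums_person from above and that duckdb and dbplyr are installed:

library(arrow)
library(dplyr)

pums_person |>
  filter(year == 2021) |>
  to_duckdb() |>    # continue the pipeline in duckdb's engine
  group_by(ST) |>
  summarize(total_weight = sum(PWGTP, na.rm = TRUE)) |>
  to_arrow() |>     # hand the result back to arrow if needed
  collect()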
Use Parquet files
Split large datasets into partitions based on common query patterns
Run modern query engines (arrow, duckdb) on your machine: just install from CRAN, no need for a big compute cluster or service
The arrow package provides a lot of this all in one. But if you need something more lightweight or specialized, you have options (nanoparquet, nanoarrow, duckdb, etc.; see the sketch below)
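For example, nanoparquet reads a single, modestly sized Parquet file straight into a data frame with no Arrow dependency (the file path here is illustrative):

# nanoparquet: lightweight reader/writer for individual Parquet files
library(nanoparquet)

one_file <- read_parquet("data/person/year=2021/location=ca/part-0.parquet")
head(one_file)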
The Arrow format is the key to this modern data stack. It is what allows all of these tools to work together and share data efficiently
# Reference the S3 bucket and open the person dataset stored there
pums_bucket <- s3_bucket("scaling-arrow-pums")
pums_s3 <- open_dataset(pums_bucket$path("person"))

# Query execution with lazy evaluation
pums_s3 |>
  filter(year == 2021, location == "ca", AGEP >= 16) |>
  group_by(year, ST) |>
  summarize(
    mean_commute_time = sum(JWMNP * PWGTP, na.rm = TRUE) / sum(PWGTP),
    count = n()
  ) |>
  collect()