Build cross-language, open libraries for data science
Grow Apache Arrow ecosystem
Funding and employment for full-time developers
Not-for-profit, funded by multiple corporations
Come talk to us at the RStudio Community booth today 3:45pm
- had a dataset bigger than memory?
- had data split across lots of files?
- had complex data types (map columns, struct/data-frame columns, etc.)?
- wanted to use more than 1 CPU (or GPUs)?
https://wesmckinney.com/blog/apache-arrow-pandas-internals/
Of course, these aren't just R problems. Wes McKinney, creator of pandas, talked about these problems several years ago in the context of the Python data ecosystem: the same issues of being memory-bound, handling data types, missing data, parallel processing, and so on.
He and other open-source developers got together, realized they were all trying to solve the same problems in their respective languages, databases, and domains, and decided to join forces.
- Announced 2016
- Feather package: interoperable data frame storage for R and Python, prototype of the Arrow format
- Built on lessons of existing data frame libraries and databases
- Shared foundation for data analysis
- Designed to take advantage of modern hardware
- https://arrow.apache.org/
Format for how data is arranged in memory: columnar, language-independent
Implementations or bindings in 11 languages
... and more
- Wraps the C++ library and lets you work with these data structures efficiently in R with a familiar interface
- On CRAN since August 2019: install.packages("arrow")
- 0.16 release about to reach CRAN
- https://arrow.apache.org/docs/r
- Nightly binaries available:
install.packages("arrow", repos = "https://dl.bintray.com/ursalabs/arrow-r")
- Nightly docs: https://ursalabs.org/arrow-r-nightly/
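To give a flavor of that familiar interface, here is a minimal sketch using the package's Table class; mtcars stands in for your data, and the printed fields are just for illustration.

```r
library(arrow)

# Convert an R data frame into an Arrow Table (columnar memory layout)
tbl <- Table$create(mtcars)
tbl$num_rows
tbl$schema

# Convert back to a regular data frame when you want base R / tidyverse semantics
df <- as.data.frame(tbl)
```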
vignette("dataset", package = "arrow")
- Treat many files as a single entity
- Use file paths to provide partition information
- Select/filter is pushed to individual files, done in parallel
vignette("dataset", package = "arrow")
âĄī¸ Treat many files as a single entity
âĄī¸ Use file paths to provide partition information
âĄī¸ Select/filter is pushed to individual files, done in parallel
- Future development: more file formats, more storage layers (S3, HDFS, GCP, Azure), aggregation in C++
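A minimal sketch of the dataset API described above; the directory path, partition names, and column names are hypothetical (they mimic the NYC taxi example).

```r
library(arrow)
library(dplyr)

# Point at a directory of Parquet files partitioned into year/month folders
ds <- open_dataset("path/to/dataset", partitioning = c("year", "month"))

# select()/filter() are pushed down so only the relevant files are scanned
ds %>%
  filter(year == 2019, month == 12) %>%
  select(passenger_count, tip_amount) %>%
  collect()   # materialize the result as an R data frame
```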
New feature in the 0.16 release
- Popular open standard binary file format for columnar data
- Used for I/O on nearly all modern data warehousing platforms
- Creates small files benefiting from compression and other encodings
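A minimal sketch of reading and writing Parquet with the arrow package; the file and column names are just examples.

```r
library(arrow)

# Write an R data frame to a Parquet file
write_parquet(mtcars, "mtcars.parquet")

# Read it back as a data frame
df <- read_parquet("mtcars.parquet")

# Or read only the columns you need
df_small <- read_parquet("mtcars.parquet", col_select = c(mpg, cyl))
```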
Benchmarking: see https://ursalabs.org/blog/2019-10-columnar-perf/
| Benchmark case               | File size | Average read time |
|------------------------------|-----------|-------------------|
| arrow::read_parquet          | 113 MB    | 4.09s             |
| arrow::read_feather          | 3.96 GB   | 3.09s             |
| fst::read_fst                | 503 MB    | 3.75s             |
| data.table::fread            | 1.52 GB   | 5.09s             |
| feather::read_feather (old)  | 3.96 GB   | 5.21s             |
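If you want a rough comparison on your own data, a sketch like this works; the file names are hypothetical, and the linked post describes the careful methodology used for the numbers above.

```r
# Time a single read with each reader; repeat and average for a real benchmark
system.time(arrow::read_parquet("data.parquet"))
system.time(arrow::read_feather("data.feather"))
system.time(fst::read_fst("data.fst"))
system.time(data.table::fread("data.csv"))
```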
Why not just use packages X, Y, ..., ZZ instead?
Of course, Arrow isn't the only project that tries to solve these problems. There are packages that handle bigger-than-memory data via memory mapping, packages for parallel processing, and you can always put your data in a database and query it with a dplyr-family package.
If you like your tech stack, you can keep your tech stack
That said, there are several qualities of the Arrow project that distinguish it and make it worth considering
- Flight: client-server framework for fast transport of data (https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/)
- Plasma: shared-memory object store (https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/)
- Gandiva: LLVM expression compiler (http://arrow.apache.org/blog/2018/12/05/gandiva-donation/)
- and more
- Interchange format: e.g., get data from Spark more efficiently, rather than writing it out to row-based CSV and then having to read from disk, parse strings, infer types, and transpose to columns
https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/
Up to 40x speedup when pulling data from Spark to R
All you have to do is library(arrow)
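A minimal sketch of what this looks like with sparklyr; a local Spark connection and the nycflights13 package are assumptions for illustration.

```r
library(sparklyr)
library(dplyr)
library(arrow)   # loading arrow enables Arrow-based data transfer in sparklyr

sc <- spark_connect(master = "local")
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")

# collect() now transfers columnar Arrow data instead of serializing row by row
local_df <- flights_tbl %>%
  filter(month == 1) %>%
  collect()
```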
Want more? Stay here for Javier's talk following this one
- Improve collaboration with non-R users
- Lets us access, from R, projects built in other languages
- Coming soon: reticulate support
Something like:
library(arrow)
library(reticulate)
cudf <- import("cudf")
df <- cudf$read_csv("huge_file.csv")
results <- df$groupby(c("year", "month"))$tip_amount$mean()
ggplot(results, ...)
https://rapids.ai/
Arrow is still a pretty young project. And while there are a lot of useful things you can do with it now---read Parquet files, read multi-file datasets, speed up Spark---there's a lot more we're working to build.
Arrow is an open-source, community-driven project, and we depend on contributions from users like you to make it happen.
- install.packages("arrow")
- conda install -c conda-forge r-arrow
- Nightly binaries available:
install.packages("arrow", repos = "https://dl.bintray.com/ursalabs/arrow-r")
- Nightly docs: https://ursalabs.org/arrow-r-nightly/
As of 0.16, it Just Works on Linux with no system dependencies
See vignette("install", package = "arrow")
remotes::install_github("apache/arrow/r")
options(repos = c("https://dl.bintray.com/ursalabs/arrow-r", getOption("repos")))
install.packages("arrow")
First, you can try to use it. arrow is on CRAN, and as of the 0.16 release, installation on Linux platforms should just work without requiring any system dependencies. Binaries for macOS and Windows are also available and work out of the box.
The Apache Arrow project makes official releases every few months, but we're continually adding new features and improvements. Ursa Labs hosts nightly builds in a CRAN-like repository, which you can point install.packages() at.
- Arrow is under active development
- Show us your dirty data and hairy use cases
- https://issues.apache.org/jira/projects/ARROW/issues
Of course, we'd love to hear how Arrow works for you, good and bad.
- We love new contributors!
- Improve docs etc.
Support our dedicated engineering team (C++/Python/R)
Spearhead big projects: datasets API, query engine
Sustain the open-source community: bug triage, code review, CI, coordination, etc.
Contact us: info@ursalabs.org
New: GitHub Sponsors! https://github.com/sponsors/ursa-labs/
RStudio Community booth
Today @ 3:45pm
@ApacheArrow
@enpiar