Accelerating Analytics With Apache Arrow

Neal Richardson
@enpiar

January 29, 2020
Slides: enpiar.com/talks/rstudio-conf-2020/

1 / 30

Ursa Labs

Build cross-language, open libraries for data science
Grow Apache Arrow ecosystem
Funding and employment for full-time developers
Not-for-profit, funded by multiple corporations

Come talk to us at the RStudio Community booth today 3:45pm

https://ursalabs.org

2 / 30

I'm going to talk about Arrow: what it is, why you should use it, and how you can get involved in the community.
But first, very briefly, who I am, so you know where I'm coming from:
I'm the engineering director at Ursa Labs, a not-for-profit dedicated to developing open source data science tools. We are main contributors to and maintainers of the Apache Arrow project.

What is Arrow?

3 / 30

Whenever I tell people I work on Arrow, the response is usually "oh cool, I've heard of that. Wait, what exactly is Arrow?"
Arrow is actually a few different things. The way I like to summarize the project and its goals is:

A foundation for the next generation of data frames

4 / 30

Yeah R is cool but have you ever

⬆️ had a dataset bigger than memory?

↩️ had data split across lots of files?

🔀 had complex data types (map columns, struct/data-frame columns, etc.)?

🔝 wanted to use more than 1 CPU (or GPUs)?

5 / 30

I love R: it's so expressive, so powerful. But it has some limitations, which become more critical as our data gets bigger and more complex.

These are not just limitations of R

https://wesmckinney.com/blog/apache-arrow-pandas-internals/

6 / 30

Of course, these aren't just R problems. Wes McKinney, creator of pandas, talked about the same problems in the Python data ecosystem several years ago: being memory-bound, handling data types and missing data, parallel processing, etc.
He and others got together, realized they were all trying to solve the same problems in their respective languages/databases/domains, and decided to join forces.

Apache Arrow

🎉 Announced 2016

🦚 Feather package: interoperable data frame storage for R and Python, prototype of Arrow format

💡 Built on lessons of existing data frame libraries and databases

🏗️ Shared foundation for data analysis

🤖 Designed to take advantage of modern hardware

🔗 https://arrow.apache.org/

7 / 30

Apache Arrow

Format for how data is arranged in memory: columnar, language-independent

8 / 30

Apache Arrow

Format for how data is arranged in memory: columnar, language-independent

Implementations or bindings in 11 languages

... and more

9 / 30
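
To make the "columnar, language-independent" format a bit more concrete, here is a minimal sketch of building an Arrow Table from plain R vectors and converting it back to a data frame. It assumes a reasonably current release of the arrow package; the column names and values are made up for illustration.

```r
library(arrow)

# Build an Arrow Table from ordinary R vectors.
# Each column is stored in Arrow's columnar memory format.
tab <- Table$create(
  x = 1:3,
  y = c("a", "b", "c")
)

# The schema records each column's name and Arrow type.
print(tab$schema)

# Basic shape information is available without converting to R.
print(tab$num_rows)

# Convert back to a regular R data frame when needed.
df <- as.data.frame(tab)
print(df)
```

Other Arrow implementations are designed to work with the same in-memory layout, which is what makes the format useful as a common ground between languages.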

The arrow R package

📩 Wraps the C++ library and lets you work with these data structures efficiently in R with a familiar interface

➡️ On CRAN since August 2019: install.packages("arrow")

📅 0.16 release about to reach CRAN

🔗 https://arrow.apache.org/docs/r

🌉 Nightly binaries available:

install.packages("arrow",
  repos = "https://dl.bintray.com/ursalabs/arrow-r")

🌃 Nightly docs: https://ursalabs.org/arrow-r-nightly/

10 / 30

Example

11 / 30

Arrow datasets

vignette("dataset", package = "arrow")

➡️ Treat many files as a single entity

➡️ Use file paths to provide partition information

➡️ Select/filter is pushed to individual files, done in parallel

🔜 Future development: more file formats, more storage layers (S3, HDFS, GCP, Azure), aggregation in C++

12 / 30

New feature in the 0.16 release
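
The dataset workflow on the slide above can be sketched roughly as follows. This is an illustration under assumptions, not the example from the talk: it uses the built-in mtcars data and a temporary directory as stand-ins, requires dplyr, and relies on write_dataset(), which comes from arrow releases later than 0.16.

```r
library(arrow)
library(dplyr)

# Illustrative setup: write mtcars as Parquet files partitioned by `cyl`.
# write_dataset() is from later arrow releases than the 0.16 discussed here.
ds_dir <- file.path(tempdir(), "mtcars_ds")
write_dataset(mtcars, ds_dir, format = "parquet", partitioning = "cyl")

# Many files, one entity: the directory names (cyl=4, cyl=6, cyl=8)
# supply the values of the partition column.
ds <- open_dataset(ds_dir)

# filter() and select() are recorded lazily; nothing is pulled into R
# until collect() is called.
result <- ds %>%
  filter(cyl == 4, mpg > 30) %>%
  select(mpg, hp, cyl) %>%
  collect()

print(result)
```

Only the rows and columns that survive the filter and select end up in R memory, which is the point of treating the files as a single dataset rather than reading each one in full.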

Reading/writing Parquet files

🗄️ Popular open standard binary file format for columnar data

💾 Used for I/O on nearly all modern data warehousing platforms

🗜️ Creates small files benefiting from compression and other encodings

13 / 30
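
A minimal sketch of a Parquet round trip with the arrow package, assuming a reasonably current release. The built-in iris data and the temporary file path are placeholders chosen for illustration.

```r
library(arrow)

# Write a built-in data frame to a temporary Parquet file (path is illustrative).
path <- tempfile(fileext = ".parquet")
write_parquet(iris, path)

# Read the whole file back as a data frame.
df <- read_parquet(path)

# Because the format is columnar, a subset of columns can be read
# without loading the rest.
sepal <- read_parquet(path, col_select = c("Sepal.Length", "Species"))

print(dim(df))
print(names(sepal))
```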

Reading/writing Parquet files

Benchmarking: see https://ursalabs.org/blog/2019-10-columnar-perf/

Benchmark case               File size  Average read time
arrow::read_parquet          113 MB     4.09s
arrow::read_feather          3.96 GB    3.09s
fst::read_fst                503 MB     3.75s
data.table::fread            1.52 GB    5.09s
feather::read_feather (old)  3.96 GB    5.21s

14 / 30

Why Arrow?

Why not just use packages
X, Y, ..., ZZ instead?

15 / 30

Of course, Arrow isn't the only project that tries to solve these problems. There are packages like XXX to handle bigger data and memory mapping, and packages to do parallel processing. And you could put your data in a database and query it with a dplyr-family package.

If you like your tech stack, you can keep your tech stack

That said, there are several qualities of the Arrow project that distinguish it and make it worth considering

Arrow is a big, active project 📈

16 / 30

Arrow is a big, active project 📈

17 / 30

Lots of facets

🛫 Flight: client-server framework for fast transport of data https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/

🦠 Plasma: shared-memory object store https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/

🏹 Gandiva: LLVM expression compiler http://arrow.apache.org/blog/2018/12/05/gandiva-donation/

and more

18 / 30

Lots of projects are using Arrow

19 / 30

Lots of projects are using Arrow as an efficient format to work with data and to transfer it.
TODO specific projects to name-check (Tensorflow exchange? Athena federated query?)

Interoperability

20 / 30

↔️ Interchange format: for example, get data from Spark more efficiently, rather than writing it out to CSV, which is row-based, and then having to read from disk, parse strings, infer types, and transpose to columns.

Example: Spark and R

https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/

Up to 40x speedup when pulling data from Spark to R

All you have to do is library(arrow)

Want more? Stay here for Javier's talk following this one

21 / 30

Arrow is language independent

🤝 Improve collaboration with non-R users

🗣 Lets us access projects from other languages in R

22 / 30

Arrow is language independent

🔜 Coming soon: reticulate support 🐍

Something like:

library(arrow)
library(reticulate)
cudf <- import("cudf")
df <- cudf$read_csv("huge_file.csv")
results <- df$groupby(c("year", "month"))$tip_amount$mean()
ggplot(results, ...)

🔗 https://rapids.ai/

23 / 30

Getting involved

24 / 30

Arrow is still a pretty young project. And while there are a lot of useful things you can do with it now (read Parquet files, read multi-file datasets, speed up Spark), there's a lot more we're working to build.

Arrow is an open-source, community-driven project, and we depend on contributions from users like you to make it happen.

1. Try arrow

⬇️ install.packages("arrow")

⬇️ conda install -c conda-forge r-arrow

🌉 Nightly binaries available:

install.packages("arrow",
  repos = "https://dl.bintray.com/ursalabs/arrow-r")

🌃 Nightly docs: https://ursalabs.org/arrow-r-nightly/

As of 0.16 It Just Works on Linux with no system dependencies

See vignette("install", package = "arrow")

25 / 30

1. Try arrow


remotes::install_github("apache/arrow/r")


options(repos = c(
  "https://dl.bintray.com/ursalabs/arrow-r",
  getOption("repos")
))
install.packages("arrow")

26 / 30

First, you can try to use it. arrow is on CRAN, and as of the 0.16 release, installation on Linux platforms should just work without requiring any system dependencies. Binaries for macOS and Windows are also available and work out of the box.
The Apache Arrow project makes official releases every few months, but we're continually adding new features and improvements. Ursa Labs hosts nightly builds at a CRAN-like repository, which you can point install.packages at.

2. Let us know what you think

➡️ Arrow is under active development

🤯 Show us your dirty data and hairy use cases

🔗 https://issues.apache.org/jira/projects/ARROW/issues

27 / 30

Of course, we'd love to hear how Arrow works for you, good and bad.

3. Pitch in

👋 We love new contributors!

✍️ Improve docs etc.

28 / 30

4. Sponsor Ursa Labs

Support our dedicated engineering team (C++/Python/R)
Spearhead big projects: datasets API, query engine
Sustain the open-source community: bug triage, code review, CI, coordination, etc.

New: GitHub Sponsors! https://github.com/sponsors/ursa-labs/

29 / 30

Industry consortium, several sponsors, RStudio first among them
Money goes to supporting our team of 7
We contribute a majority of the C++/Python/R code in the project, and because we are able to focus full-time on the project, we can spearhead big projects like the datasets API and the fast query engine we plan to work on this year.
We also work to sustain the open source community to make it easier for others to participate: make sure bug reports get triaged, pull requests get reviewed and merged, continuous integration, etc.

Thank you!

RStudio Community booth
Today @ 3:45pm

@ApacheArrow
@enpiar

30 / 30
