Accelerating Analytics With Apache Arrow

Neal Richardson
@enpiar

January 29, 2020
Slides: enpiar.com/talks/rstudio-conf-2020/

1 / 30

Ursa Labs

  • Build cross-language, open libraries for data science

  • Grow Apache Arrow ecosystem

  • Funding and employment for full-time developers

  • Not-for-profit, funded by multiple corporations

Come talk to us at the RStudio Community booth today at 3:45pm

https://ursalabs.org

2 / 30
  • Going to talk about Arrow: what it is, why you should use it, and how you can get involved in the community.
  • But first, very briefly who I am so you know where I'm coming from:
  • I'm engineering director at Ursa Labs, not-for-profit dedicated to developing open source data science tools. Main contributors to and maintainers of the Apache Arrow project.

What is Arrow?

3 / 30
  • Whenever I tell people I work on Arrow, the response is usually "oh cool, I've heard of that. Wait, what exactly is Arrow?"
  • Arrow is actually a few different things. The way I like to summarize the project and its goals is:

A foundation for the
next generation of data frames

4 / 30

Yeah R is cool but have you ever

⬆️ had a dataset bigger than memory?

↩️ had data split across lots of files?

🔀 had complex data types (map columns, struct/data-frame columns, etc.)?

🔝 wanted to use more than 1 CPU (or GPUs)?

5 / 30
  • I love R: so expressive, so powerful. But some limitations, which get more critical as our data gets bigger and more complex.

These are not just limitations of R

https://wesmckinney.com/blog/apache-arrow-pandas-internals/

6 / 30
  • Of course, these aren't just R problems. Wes McKinney, creator of pandas, talked several years ago about the same problems in the Python data ecosystem. Same issues: memory-bound, handling data types, missing data, parallel processing, etc.

  • He and other _ got together, realized they were all trying to solve the same problems in their respective languages/databases/domains, and decided to join forces

Apache Arrow

🎉 Announced 2016

🦚 Feather package: interoperable data frame storage for R and Python, prototype of Arrow format

💡 Built on lessons of existing data frame libraries and databases

🏗️ Shared foundation for data analysis

🤖 Designed to take advantage of modern hardware

🔗 https://arrow.apache.org/

7 / 30

Apache Arrow

Format for how data is arranged in memory: columnar, language-independent

Implementations or bindings in 11 languages

... and more

9 / 30
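
To make the "columnar, language-independent" idea concrete, here is a minimal sketch of moving an R data frame into an Arrow Table and back. It assumes an arrow release in which `Table$create()` accepts a data frame; the toy columns `x` and `y` are made up for illustration and do not come from the talk.

```r
library(arrow)

# Toy data frame (hypothetical values, for illustration only)
df <- data.frame(
  x = 1:3,
  y = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Convert to an Arrow Table: each column is stored as Arrow columnar memory
tab <- Table$create(df)

# Inspect the language-independent schema (column names and Arrow types)
tab$schema

# Basic shape information comes from the Table itself
tab$num_rows
tab$num_columns

# Convert back to an R data frame when you need R-native objects
back <- as.data.frame(tab)
```

The same Table layout is what the other language implementations read, so no per-language serialization format should be needed in between.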

The arrow R package

📩 Wraps the C++ library and lets you work with these data structures efficiently in R with a familiar interface

➡️ On CRAN since August 2019: install.packages("arrow")

📅 0.16 release about to reach CRAN

🔗 https://arrow.apache.org/docs/r

🌉 Nightly binaries available:

install.packages("arrow",
  repos = "https://dl.bintray.com/ursalabs/arrow-r")

🌃 Nightly docs: https://ursalabs.org/arrow-r-nightly/

10 / 30

Example

11 / 30

Arrow datasets

vignette("dataset", package = "arrow")

➡️ Treat many files as a single entity

➡️ Use file paths to provide partition information

➡️ Select/filter is pushed to individual files, done in parallel

🔜 Future development: more file formats, more storage layers (S3, HDFS, GCP, Azure), aggregation in C++

12 / 30

New feature in the 0.16 release
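
A minimal sketch of the dataset workflow described on this slide. Two assumptions to flag: it uses `write_dataset()` to create the partitioned files, which is not mentioned in the talk and is only available in later arrow releases, and the column names and values are toy data (the real example in the vignette uses a much larger set of files). dplyr is assumed to be installed.

```r
library(arrow)
library(dplyr)

# Toy data standing in for a large multi-file dataset (hypothetical values)
trips <- data.frame(
  year = c(2018L, 2018L, 2019L, 2019L),
  month = c(1L, 2L, 1L, 2L),
  tip_amount = c(1.5, 2.0, 0.0, 3.25),
  total_amount = c(10, 120, 8, 150)
)

# Write one Parquet file per year/month combination; the partition values
# end up in the directory names (e.g. year=2019/month=2/)
data_dir <- file.path(tempdir(), "trips")
write_dataset(trips, data_dir, partitioning = c("year", "month"))

# Treat the whole directory as a single entity; partition columns are
# recovered from the file paths
ds <- open_dataset(data_dir)

# filter() and select() are recorded lazily and applied per file;
# nothing is pulled into R until collect()
big_2019 <- ds %>%
  filter(year == 2019, total_amount > 100) %>%
  select(tip_amount, total_amount) %>%
  collect()
```

Because the filter on `year` matches the partitioning, files for other years should not need to be read at all.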

Reading/writing Parquet files

đŸ—„ī¸ Popular open standard binary file format for columnar data

💾 Used for I/O on nearly all modern data warehousing platforms

đŸ—œī¸ Creates small files benefiting from compression and other encodings

13 / 30
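
A short sketch of a Parquet round trip with the arrow package, using the built-in `mtcars` data rather than anything from the talk. The `col_select` argument is shown as one way to read only some columns; treat the exact behavior as dependent on the arrow version installed.

```r
library(arrow)

# Write a data frame to a Parquet file in a temporary location
path <- tempfile(fileext = ".parquet")
write_parquet(mtcars, path)

# Read the whole file back into R
cars <- read_parquet(path)

# Because Parquet is columnar, you can read just the columns you need
mpg_cyl <- read_parquet(path, col_select = c("mpg", "cyl"))
```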

Reading/writing Parquet files

Benchmarking: see https://ursalabs.org/blog/2019-10-columnar-perf/

Benchmark Case                 File size   Average read time
arrow::read_parquet            113 MB      4.09s
arrow::read_feather            3.96 GB     3.09s
fst::read_fst                  503 MB      3.75s
data.table::fread              1.52 GB     5.09s
feather::read_feather (old)    3.96 GB     5.21s
14 / 30
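
If you want to run a comparison like this on your own data, something along these lines could be a starting point. This is only a sketch under stated assumptions: the data frame is a small made-up example, so its sizes and timings will not resemble the numbers in the table, and a single `system.time()` call is a much cruder measurement than the methodology in the linked blog post.

```r
library(arrow)

# Small deterministic toy data (hypothetical; the benchmark used far larger files)
df <- data.frame(id = 1:1000, value = sqrt(1:1000))

parquet_path <- tempfile(fileext = ".parquet")
feather_path <- tempfile(fileext = ".feather")
write_parquet(df, parquet_path)
write_feather(df, feather_path)

# Compare file sizes on disk, in bytes
sizes <- c(
  parquet = file.size(parquet_path),
  feather = file.size(feather_path)
)

# Time a single read of each format (crude; repeat and average for real use)
parquet_time <- system.time(from_parquet <- read_parquet(parquet_path))
feather_time <- system.time(from_feather <- read_feather(feather_path))
```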

Why Arrow?

Why not just use packages
X, Y, ..., ZZ instead?

15 / 30

  • Of course, Arrow isn't the only project that tries to solve these problems.
  • Packages like XXX to handle bigger data and memory map
  • Packages to do parallel processing
  • And you could put your data in a database and query it with a dplyr-family package

If you like your tech stack, you can keep your tech stack

That said, there are several qualities of the Arrow project that distinguish it and make it worth considering

Arrow is a big, active project 📈

16 / 30

Lots of facets

🛫 Flight: client-server framework for fast transport of data https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/

🦠 Plasma: shared-memory object store https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/

🏹 Gandiva: LLVM expression compiler http://arrow.apache.org/blog/2018/12/05/gandiva-donation/

and more

18 / 30

Lots of projects are using Arrow

19 / 30
  • Lots of projects using Arrow as an efficient format to work with data and to transfer it
  • TODO specific projects to name-check (Tensorflow exchange? Athena federated query?)

Interoperability

20 / 30

â†”ī¸ Interchange format: e.g. get data from Spark more efficiently. rather than write out to CSV, which is row based, and have to read from disk, parse strings, infer types, transpose to columns.

Example: Spark and R

https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/

Up to 40x speedup when pulling data from Spark to R

All you have to do is library(arrow)

Want more? Stay here for Javier's talk following this one

21 / 30

Arrow is language independent

🤝 Improve collaboration with non-R users

🗣 Lets us access projects from other languages in R

22 / 30

Arrow is language independent

🔜 Coming soon: reticulate support 🐍

Something like:

library(arrow)
library(reticulate)
cudf <- import("cudf")
df <- cudf$read_csv("huge_file.csv")
results <- df$groupby(c("year", "month"))$tip_amount$mean()
ggplot(results, ...)

🔗 https://rapids.ai/

23 / 30

Getting involved

24 / 30

Arrow is still a pretty young project. And while there are a lot of useful things you can do with it now (read Parquet files, read multi-file datasets, speed up Spark), there's a lot more we're working to build.

Arrow is an open-source, community-driven project, and we depend on contributions from users like you to make it happen.

1. Try arrow

⬇️ install.packages("arrow")

⬇️ conda install -c conda-forge r-arrow

🌉 Nightly binaries available:

install.packages("arrow",
  repos = "https://dl.bintray.com/ursalabs/arrow-r")

🌃 Nightly docs: https://ursalabs.org/arrow-r-nightly/

As of 0.16 It Just Works on Linux with no system dependencies

See vignette("install", package = "arrow")

25 / 30

1. Try arrow

remotes::install_github("apache/arrow/r")
options(repos = c(
"https://dl.bintray.com/ursalabs/arrow-r",
getOption("repos")
))
install.packages("arrow")
26 / 30
  • First, you can try to use it. arrow is on CRAN, and as of the 0.16 release, installation on Linux platforms should just work without requiring any system dependencies. Binaries for macOS and Windows are also available and work out of the box.

  • The Apache Arrow project makes official releases every few months, but we're continually adding new features and improvements. Ursa Labs hosts nightly builds at a CRAN-like repository, which you can point install.packages at.

2. Let us know what you think

âžĄī¸ Arrow is under active development

đŸ¤¯ Show us your dirty data and hairy use cases

🔗 https://issues.apache.org/jira/projects/ARROW/issues

27 / 30

Of course, we'd love to hear how Arrow works for you, good and bad.

3. Pitch in

👋 We love new contributors!

âœī¸ Improve docs etc.

28 / 30

4. Sponsor Ursa Labs

  • Support our dedicated engineering team (C++/Python/R)

  • Spearhead big projects: datasets API, query engine

  • Sustain the open-source community: bug triage, code review, CI, coordination, etc.

Contact us: info@ursalabs.org

New: GitHub Sponsors! https://github.com/sponsors/ursa-labs/

29 / 30
  • Industry consortium, several sponsors, RStudio first among them
  • Money goes to supporting our team of 7
  • We contribute a majority of the C++/Python/R code in the project,
  • and because we are able to focus full-time on the project, we can spearhead big projects like the datasets API and the fast query engine we plan to work on this year
  • We also work to sustain the open source community to make it easier for others to participate: make sure bug reports get triaged, pull requests get reviewed and merged, continuous integration, etc.

Thank you!

RStudio Community booth
Today @ 3:45pm

@ApacheArrow
@enpiar

30 / 30
