class: center, middle, inverse, title-slide

# Speeding Up Data Access With Apache Arrow

### Wes McKinney
@wesmckinn
Neal Richardson
@enpiar
### August 15, 2020
Slides:
enpiar.com/talks/nyr-2020/
---
layout: true

<div class="my-footer"><span>https://enpiar.com/talks/nyr-2020/</span></div>

---

# Ursa Labs

.cols[
.fifty[
<img src="img/ursa.png" />

https://ursalabs.org
]
.fifty[
* Build cross-language, open libraries for data science
* Grow **Apache Arrow** ecosystem
* Funding and employment for full-time developers
* **Not-for-profit**, funded by multiple corporations
]
]

---

## Current generation data frame (tabular) computing is highly inefficient

* High fraction of compute spent on **serialization** (converting between data formats)
* Inefficient in-memory computing that **fails to fully utilize modern hardware capabilities**
* Much developer time spent building data connectors and maintaining **glue code**

### *Our mission is to make scalable data processing faster, simpler, and more cost-efficient for the world's data scientists*

---

# <img src="https://arrow.apache.org/img/arrow.png" height="100" />

* Started 2016, 1.0 release July 2020
* Shared foundation for data analysis, built on lessons of existing data frame libraries and databases
* Designed to take advantage of modern hardware

https://arrow.apache.org/

---

# <img src="https://arrow.apache.org/img/arrow.png" height="100" />

.cols[
.fifty[
**Format** for how data is arranged in memory: columnar, language-independent

<img src="img/simd.png" />
]
.fifty[
]
]

---

# <img src="https://arrow.apache.org/img/arrow.png" height="100" />

.cols[
.fifty[
**Format** for how data is arranged in memory: columnar, language-independent

<img src="img/simd.png" />
]
.fifty[
**Implementations** or bindings in 11 languages

<img src="img/language_logos.png" />

... and more
]
]

---

# Thriving open-source community

<img src="img/arrow-contributors.png" />

---

# The arrow R package

### CRAN release

```r
install.packages("arrow")
```

https://arrow.apache.org/docs/r/

### Nightly dev builds

```r
arrow::install_arrow(nightly = TRUE)
```

https://ursalabs.org/arrow-r-nightly/

---

# rstudio::conf, 6 ~~years~~ months ago

Demo of reading a multi-file Parquet dataset

https://enpiar.com/talks/rstudio-conf-2020/demo.html

125 files, ~2 billion rows

--

```r
ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))

system.time(ds %>%
  filter(total_amount > 100, year == 2015) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  group_by(passenger_count) %>%
  collect() %>%
  summarize(
    tip_pct = median(100 * tip_amount / total_amount),
    n = n()
  ) %>%
  print())
```

---

# rstudio::conf, 6 ~~years~~ months ago

```r
## # A tibble: 10 x 3
##    passenger_count tip_pct      n
##              <int>   <dbl>  <int>
##  1               0    9.84    380
##  2               1   16.7  143087
##  3               2   16.6   34418
##  4               3   14.4    8922
##  5               4   11.4    4771
##  6               5   16.7    5806
##  7               6   16.7    3338
##  8               7   16.7      11
##  9               8   16.7      32
## 10               9   16.7      42

##    user  system elapsed
##  26.735   1.159   4.076
```

---

# rstudio::conf, 6 ~~years~~ months ago

```r
## # A tibble: 10 x 3
##    passenger_count tip_pct      n
##              <int>   <dbl>  <int>
##  1               0    9.84    380
##  2               1   16.7  143087
##  3               2   16.6   34418
##  4               3   14.4    8922
##  5               4   11.4    4771
##  6               5   16.7    5806
##  7               6   16.7    3338
##  8               7   16.7      11
##  9               8   16.7      32
## 10               9   16.7      42

*##    user  system elapsed
*##   3.829   3.108   1.842          <----------- 2x faster today
```

---
class: inverse, center, middle

<img src="img/ludicrous-speed.gif" height="400"/>

---
class: inverse, center, middle

# But what if I don't have Parquet files?
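---

# An aside: converting CSV to Parquet

One option is a one-time conversion. A minimal sketch, not from the original demo (`"taxi.csv"` and `"taxi.parquet"` are hypothetical local paths):

```r
library(arrow)

# Read the CSV straight into Arrow memory, skipping the R data.frame
tab <- read_csv_arrow("taxi.csv", as_data_frame = FALSE)

# Write it back out as Parquet; subsequent reads are much faster
write_parquet(tab, "taxi.parquet")
```

But, as the next slides show, you can also query CSV datasets directly.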
---

# 1. Read multi-file CSV datasets

### Included in 1.0

```r
ds <- open_dataset("nyc-taxi/csv/2019", format = "csv", partitioning = "month")
ds

## FileSystemDataset with 6 csv files
## vendor_id: int64
## pickup_at: timestamp[s]
## dropoff_at: timestamp[s]
## passenger_count: int64
## trip_distance: double
## rate_code_id: int64
...
```

---

# 1. Read multi-file CSV datasets

### Included in 1.0

```r
system.time(ds %>%
  filter(payment_type == 3, total_amount > 10) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  group_by(passenger_count) %>%
  collect() %>%
  summarize(
    tip_pct = mean(100 * tip_amount / total_amount),
    n = n()
  ) %>%
  print())
```

---

# 1. Read multi-file CSV datasets

### Included in 1.0

```r
## # A tibble: 7 x 3
##   passenger_count tip_pct     n
##             <int>   <dbl> <int>
## 1               0 0.0275   5588
## 2               1 0.0121  73385
## 3               2 0.0113  15918
## 4               3 0.00626  4041
## 5               4 0.00558  2981
## 6               5 0         107
## 7               6 0          55

##    user  system elapsed
##  27.951  14.728   7.639
```

---

# 2. Write datasets to Parquet or Feather

### Not yet released

```r
feather_dir <- "feather-taxi"

ds %>%
  group_by(payment_type) %>%
  write_dataset(feather_dir, format = "feather")
```

---

# 2. Write datasets to Parquet or Feather

### Not yet released

```r
feather_dir <- "feather-taxi"

ds %>%
*  group_by(payment_type) %>%
  write_dataset(feather_dir, format = "feather")
```

---

# 2. Write datasets to Parquet or Feather

### Not yet released

```r
system("tree feather-taxi")
# feather-taxi
# ├── payment_type=1
# │   ├── dat_0.ipc
# │   ├── dat_1.ipc
# │   ├── dat_2.ipc
# │   ├── dat_3.ipc
# │   ├── dat_4.ipc
# │   └── dat_5.ipc
# ├── payment_type=2
# │   ├── dat_0.ipc
# ...
# └── payment_type=5
#     └── dat_2.ipc
#
# 5 directories, 25 files
```

---

# 2. Write datasets to Parquet or Feather

### Not yet released

```r
ds2 <- open_dataset(feather_dir, format = "feather")

system.time(ds2 %>%
  filter(payment_type == 3, total_amount > 10) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  group_by(passenger_count) %>%
  collect() %>%
  summarize(
    tip_pct = mean(100 * tip_amount / total_amount),
    n = n()
  ) %>%
  print())
```

---

# 2. Write datasets to Parquet or Feather

### Not yet released

```r
## # A tibble: 7 x 3
##   passenger_count tip_pct     n
##             <int>   <dbl> <int>
## 1               0 0.0275   5588
## 2               1 0.0121  73385
## 3               2 0.0113  15918
## 4               3 0.00626  4041
## 5               4 0.00558  2981
## 6               5 0         107
## 7               6 0          55

##    user  system elapsed
*##   0.118   0.062   0.073          <----------- 100x faster
```

---
class: inverse, center, middle

<img src="img/gone-plaid.gif" height="400"/>

---
class: center, middle, inverse

# Your friend, the CSV

---

# What is R's fastest CSV reader?

--

* `base::read.csv()`

--

* `readr::read_csv()`

--

* `data.table::fread()`

--

* `vroom::vroom()`

--

* Something else? (`arrow::read_csv_arrow()`?)

---

# It depends!

* System specs
  * CPU
  * Memory
  * Operating system
  * File system
  * Libraries, compilers, etc.
* Test data features
  * Size
  * Shape
  * Data types
  * Missing/sparseness
  * ...

---

# How do you compare speed?

Run **benchmarks** to systematically compare code head-to-head

Like a scientific experiment:

* Hold everything else constant so we can attribute speed differences to the code
* Explicit and reproducible

Several R packages can help with this: `microbenchmark`, `bench`, etc.

--

Internal vs. external validity: "YMMV" even with the best benchmarks
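---

# How do you compare speed?

A minimal sketch with the `bench` package (not from the talk; `"taxi.csv"` is a hypothetical local file). `check = FALSE` because the readers return different classes (data.frame, tibble, data.table, ...):

```r
library(bench)

results <- bench::mark(
  base       = read.csv("taxi.csv"),
  readr      = readr::read_csv("taxi.csv"),
  data.table = data.table::fread("taxi.csv"),
  vroom      = vroom::vroom("taxi.csv"),
  arrow      = arrow::read_csv_arrow("taxi.csv"),
  iterations = 5,
  check      = FALSE
)
results[, c("expression", "median", "mem_alloc")]
```

Timing only the read misses work that lazy readers defer, which is why benchmarking whole workflows (next slides) matters too.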
---

# vroom's benchmarks

https://vroom.r-lib.org/articles/benchmarks.html

* Tests several different kinds of data files and shapes, including "real" data
* Runs a series of real example workflows
* Honest: `vroom` isn't the fastest at everything
* Reproducible: all scripted and documented

???

* vroom has a great benchmarks vignette. It tests several different kinds of data files and shapes, and also runs a series of real example workflows that you would do against a dataset. That matters because one of the ways vroom gets such good read performance is that it does the parsing and reading into R lazily, which can be expensive later. So "read" time might be tiny, but processing time is higher than if all of the data were in memory in R.
* One of the especially great things about the vroom benchmarks is that they are fully scripted and reproducible, and include instructions for setting up and running them. That makes it easy for Jim to update them with every vroom release. They're *also* really easy to extend by adding additional R scripts.

---

# Extending vroom's benchmarks

I added some variants (sketched after the example workflow below):

--

* `arrow::read_csv_arrow()` to read into an R data.frame, then use dplyr or data.table on that

--

* `arrow::read_csv_arrow(as_data_frame = FALSE)` to return an Arrow Table and compute on that in Arrow (where possible)

--

* `cudf$read_csv()`: Python GPU data frame library built on top of Arrow, called from R via `reticulate`

<img src="img/rapids_logo.png" height="80"/>

---

# Extending vroom's benchmarks

Report: https://enpiar.com/talks/nyr-2020/benchmarks.html

Source: [https://github.com/nealrichardson/vroom/tree/arrow-benchmarks](https://github.com/r-lib/vroom/compare/master...nealrichardson:arrow-benchmarks)

* Environment: NVIDIA DGX workstation
  * CPU: 20-core Intel Xeon @ 2.20GHz, 256 GB RAM (2400 MHz), SSD
  * GPU: 4 NVIDIA V100 GPUs, 5,120 CUDA cores and 32 GB memory per GPU
* Dev version of the arrow package for some post-1.0 release enhancements
* `cudf` 0.15 (pre-release nightly build)
* Release version of all other packages, via `conda`

---

# Extending vroom's benchmarks

Report: https://enpiar.com/talks/nyr-2020/benchmarks.html

Source: [https://github.com/nealrichardson/vroom/tree/arrow-benchmarks](https://github.com/r-lib/vroom/compare/master...nealrichardson:arrow-benchmarks)

Example workflow:

```r
({ library(readr); library(dplyr) })
x <- read_csv(file)
print(x)
a <- head(x)
b <- tail(x)
c <- sample_n(x, 100)
d <- filter(x, payment_type == "UNK")
e <- x %>%
  group_by(payment_type) %>%
  summarise(avg_tip = mean(tip_amount))
```
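---

# Extending vroom's benchmarks

Roughly, the three added variants express pieces of that workflow like this (a sketch, not the actual benchmark scripts; `file` is assumed to point at a local CSV, and the `cudf` variant needs a CUDA GPU plus a Python environment with cudf installed):

```r
library(dplyr)

# (1) arrow to an R data.frame, then dplyr (or data.table) as usual
x <- arrow::read_csv_arrow(file)
e <- x %>% group_by(payment_type) %>% summarise(avg_tip = mean(tip_amount))

# (2) keep the data in Arrow memory: filter in Arrow, collect() results into R
#     (assumes an arrow build where dplyr verbs work on an Arrow Table)
tab <- arrow::read_csv_arrow(file, as_data_frame = FALSE)
d <- tab %>% filter(payment_type == "UNK") %>% collect()

# (3) GPU data frame via reticulate (illustrative cudf calls)
cudf <- reticulate::import("cudf")
gx <- cudf$read_csv(file)
gmeans <- gx$groupby("payment_type")$mean()  # grouped means of numeric columns, on the GPU
```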
---
class: less-padding

<img src="img/taxi-single-1.png" width="864" />

---
class: less-padding

<img src="img/taxi-single-arrow-1.png" width="864" />

---

# NYC taxi trip data, one file

* Pure arrow (keeping the data in Arrow memory) is overall fastest, even though aggregation still happens in R
* `cudf` is super fast at the grouped mean calculation: 100ms over 15 million rows!

---
class: less-padding

<img src="img/taxi-single-df-1.png" width="864" />

---

# NYC taxi trip data, one file

* Pure arrow (keeping the data in Arrow memory) is overall fastest, even though aggregation still happens in R
* `cudf` is super fast at the grouped mean calculation: 100ms over 15 million rows!
* Arrow's CSV reader is significantly faster at producing a `data.frame`: you can use it in combination with your favorite R computation packages

---

# All numeric/character

* Benchmarks with two shapes of all-numeric (random normal distribution) and all-random-string data:
  * 1,000,000 x 25 (long)
  * 100,000 x 1,000 (wide)
* Numeric: Arrow is faster than the rest (except `data.table`) on the "long" data but less outstanding on "wide"
* String: Arrow-in-arrow is about as fast as vroom-with-ALTREP (10-20x faster than `data.table`, which is faster than the rest), thanks to the efficiency of Arrow's string data type

---
class: less-padding

<img src="img/taxi-multiple-1.png" width="864" />

---

# Conclusions

* `arrow` can't do everything yet. But what it can do today is really, really fast:
  * `arrow::read_csv_arrow()` to get an R `data.frame`, then `dplyr`/`data.table`/whatever on it
* Arrow especially stands out when you have (1) string data and/or (2) millions of rows
* The currently available Arrow compute functions, especially for filtering data, are super fast
* More and more is getting implemented in `arrow`, so stay tuned...

---
class: inverse, center, middle
background-image: url(img/plaid.gif)
background-size: cover

## Thank you!

@ApacheArrow<br/>@wesmckinn<br/>@enpiar