enpiar

Should You Add a Package Dependency?

When you're writing an R package, you're writing software, and your decision to depend on another library should be made carefully. Here are some ways to use R itself to get more information about the cost of adding a dependency to your package.

At Crunch, we’ve recently hired an “R Team” to build out our package offerings beyond what I could write in my spare time. It’s been great to have more resources to dedicate to our R tools, and it’s been particularly fun for me because now I get to have package design debates with other smart humans, not just inside my head.

In one such debate, we discussed when it was appropriate to add a new dependency to a package. When you’re scripting and doing data analysis on your computer, and there’s a library out there that solves a problem you’re working on, by all means use it. But, when you’re doing package development, you’re writing software, software that you’re asking other people to install and run on their computers. The standard is different.

In software, adding a new library isn’t free (or, it’s free as in puppy). There’s a cost-benefit calculus to do, and while the benefits of using a function from another package may be clear, the costs are harder to pin down. In addition to installation time and bandwidth, the more dependencies you add, the more you risk exposing your users to version mismatches among packages, and the more you create opportunities for changes in someone else’s code to break yours. While that may seem like hyperbolic risk aversion, crazy things do happen in software development from time to time.

Of course, if you’re trying to do something specialized and there’s a library that’s the standard in that space, you shouldn’t reinvent the wheel. But, much of what we do in R is general data manipulation—data comes in one format and needs to get reshaped into some other structure—and while there are many utility libraries to make that easier, there’s also a lot built into R itself that’s quite powerful.

After our discussion, I found myself wondering a few things. How could I better determine the cost of a new library? Not all dependencies are equal, and the number of new dependencies a library brings might not be just one (itself). It could be greater than one if the new package has dependencies of its own that it brings along. Or, it could actually be zero—you may already depend on it indirectly, through something you explicitly depend on—in which case the debate is moot.

It turns out that you can get quite a bit of information about packages and their dependencies from within R itself by querying CRAN’s package database. By doing a little exploration in that data, you can make a more informed judgment about whether adding a new dependency is potentially costly or really not a big deal.

Analyzing CRAN packages within R

The recently added tools::CRAN_package_db function returns a data.frame of information about the packages on CRAN. It has a row for each package and columns for all of the possible fields in the DESCRIPTION file of a package.

cran <- tools::CRAN_package_db()
dim(cran)
## [1] 11180    65

Parsing package dependency fields

The “Depends” and “Imports” columns are comma-separated strings of package names (including R itself), sometimes with extra spaces or newlines, and sometimes with version requirements attached. As is, they’re not pretty:

head(cran[, c("Package", "Depends", "Imports")])
##       Package                                             Depends
## 1          A3                      R (>= 2.15.0), xtable, pbapply
## 2      abbyyR                                        R (>= 3.2.0)
## 3         abc R (>= 2.10), abc.data, nnet, quantreg, MASS, locfit
## 4 ABCanalysis                                         R (>= 2.10)
## 5    abc.data                                         R (>= 2.10)
## 6    abcdeFBA              Rglpk,rgl,corrplot,lattice,R (>= 2.10)
##                                  Imports
## 1                                   <NA>
## 2 httr, XML, curl, readr, plyr, progress
## 3                                   <NA>
## 4                                plotrix
## 5                                   <NA>
## 6                                   <NA>

In order to collect the set of packages that a given package depends on, we need to turn those strings into lists (vectors) of package names. Here’s a function that does that: it splits each string on commas, then strips whitespace and version specifications from each piece and drops the reference to R itself.

parsePackages <- function (x) {
    lapply(strsplit(x, ","), function (pkgs) {
        ## Strip whitespace
        pkgs <- gsub("[ \n]", "", pkgs)
        ## Remove version specifications, in parentheses
        pkgs <- gsub("\\(.*$", "", pkgs)
        ## Ignore NA for when the field was empty, and exclude "R" because duh
        return(setdiff(na.omit(pkgs), "R"))
    })
}

parsePackages(head(cran$Depends))
## [[1]]
## [1] "xtable"  "pbapply"
## 
## [[2]]
## character(0)
## 
## [[3]]
## [1] "abc.data" "nnet"     "quantreg" "MASS"     "locfit"  
## 
## [[4]]
## character(0)
## 
## [[5]]
## character(0)
## 
## [[6]]
## [1] "Rglpk"    "rgl"      "corrplot" "lattice"

Now, to get a complete sense of the dependencies of a package, we need to combine both the “Depends” and “Imports” fields. We could paste the string columns together and call parsePackages on the result, but this looked like an excellent opportunity to use mapply. Unlike the other *apply functions in R, mapply takes the function as its first argument and maps it over data structures of the same length. In this case, for each row in the package list, we want the union of the packages in “Depends” and “Imports”. So, we can map the union function over them:

deps <- mapply(union,
    x=parsePackages(cran$Depends),
    y=parsePackages(cran$Imports),
    ## SIMPLIFY=FALSE guarantees a list even if mapply could simplify
    SIMPLIFY=FALSE)
names(deps) <- cran$Package

head(deps)
## $A3
## [1] "xtable"  "pbapply"
## 
## $abbyyR
## [1] "httr"     "XML"      "curl"     "readr"    "plyr"     "progress"
## 
## $abc
## [1] "abc.data" "nnet"     "quantreg" "MASS"     "locfit"  
## 
## $ABCanalysis
## [1] "plotrix"
## 
## $abc.data
## character(0)
## 
## $abcdeFBA
## [1] "Rglpk"    "rgl"      "corrplot" "lattice"

Now we have a named list where for each package, the contents are every package listed in either Depends or Imports in the package DESCRIPTION.
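As a quick sanity check on the parsed structure, base R can summarize the list directly (the counts here depend on the CRAN snapshot you fetched):

```r
## Distribution of direct (Depends + Imports) dependency counts per package
summary(lengths(deps))
## How many packages declare no dependencies beyond R itself?
sum(lengths(deps) == 0)
```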

Finding all dependencies

That’s great, but we want to know the complete set of package dependencies. Each direct dependency may have its own dependencies, and so on. We can solve this with a short function that takes a package name, looks up its dependencies, looks up the dependencies of those, and keeps tracing back through dependencies until no new ones are added.

findDependencies <- function (pkg) {
    out <- c()
    ## Start with the package's direct dependencies
    newdeps <- deps[[pkg]]
    while (length(newdeps)) {
        out <- c(out, newdeps)
        ## Pull the dependencies of what we just added, keeping only
        ## packages we haven't seen yet; stop when nothing new turns up
        newdeps <- setdiff(unlist(deps[newdeps]), out)
    }
    return(out)
}

So, for example, the crunch package, the subject of our discussion at work, formally lists these five dependencies

cran[cran$Package == "crunch", c("Depends", "Imports")]
##           Depends
## 1792 R (>= 3.0.0)
##                                                                          Imports
## 1792 httr (>= 1.0.0), httpcache (>= 0.1.4), jsonlite (>= 0.9.15),\nmethods, curl
deps$crunch
## [1] "httr"      "httpcache" "jsonlite"  "methods"   "curl"

but tracing through all of its dependencies’ dependencies actually requires five others.

crunch <- findDependencies("crunch")
crunch
##  [1] "httr"      "httpcache" "jsonlite"  "methods"   "curl"     
##  [6] "mime"      "openssl"   "R6"        "digest"    "tools"

Counting the true number of new dependencies

Already with this information, we can make more informed decisions about whether it is expensive to add a certain dependency. In this example, if a contributor to crunch wanted to import a hashing function from the digest package, or wanted to define a class with R6, there’s no additional cost because we already depend on it—just not directly.
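Checking whether a candidate is already in the closure is just a membership test against the vector computed above:

```r
## Both candidates are already indirect dependencies of crunch
"digest" %in% crunch  # TRUE
"R6" %in% crunch      # TRUE
```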

What about another package? Suppose someone wanted to use a function from purrr instead of one of the *apply functions that are a part of R. What is the impact on our total dependency count?

purrr <- findDependencies("purrr")
setdiff(purrr, crunch)
## [1] "magrittr" "tibble"   "rlang"    "Rcpp"     "utils"
intersect(purrr, crunch)
## [1] "methods"

Adding that one dependency actually adds a total of six packages (itself and five others) to our total dependency load, though one of those, utils, is part of the standard R installation. The only shared dependency between the two packages is methods, also a part of R itself.
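The comparison above can be wrapped into a small helper. This is a hypothetical convenience function (not from any package), and it assumes the deps list and findDependencies function defined earlier:

```r
## Count the packages that adding `new` would introduce to `pkg`'s
## dependency closure (including `new` itself)
countNewDeps <- function (pkg, new) {
    current <- findDependencies(pkg)
    added <- setdiff(c(new, findDependencies(new)), current)
    length(added)
}
```

With the data above, countNewDeps("crunch", "purrr") returns 6.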

Weighing the costs

Not all packages are alike, of course. They can vary in size, complexity, and stability. How should we interpret this number of potential new dependencies?

The CRAN package database accessible in R doesn’t give information on size or complexity of the code, and in terms of stability, you could look at the “Published” field to see how recently updated the packages are. But, a single date alone is hard to interpret: a recent update could mean that the package is undergoing rapid development and is risky to depend on, or it could mean that it is actively being supported and bugs are getting fixed. Or it could mean that you caught the developer on the one day a year they push updates.
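Even so, pulling the dates is easy. This sketch assumes the cran table and the purrr vector built above; base packages like utils and methods won’t appear because they aren’t in the CRAN table, and the dates will vary with the snapshot:

```r
## How recently were purrr and its CRAN dependencies last published?
pubs <- as.Date(cran$Published[cran$Package %in% c("purrr", purrr)])
summary(pubs)
```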

The best proxy for complexity that I see in the CRAN database is the “NeedsCompilation” field. If a package needs to compile C or FORTRAN code in order to install, installation is usually slower, and you run into the possibility of platform and compiler differences across systems. CRAN does a very good job of ensuring that packages can run on a wide range of platforms, but the most common reason a package has failed to install for me is that the C compilation step couldn’t find a header file it needed or expected a different compiler.

So what about these new potential dependencies?

table(cran$NeedsCompilation[cran$Package %in% c("purrr", purrr)])
## 
##  no yes 
##   1   4

It appears that four of the five CRAN packages involved (purrr plus its dependencies that aren’t built into R itself) require compilation, so by this measure, the complexity of the added dependencies is relatively high.
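Relatedly, you can filter the packages that ship with R out of the new-dependency list before weighing it, since they add no installation cost. A sketch using installed.packages(), assuming the purrr and crunch vectors from above:

```r
## Packages that ship with R (base and recommended) add no install cost
base_pkgs <- rownames(installed.packages(priority = c("base", "recommended")))
setdiff(purrr, c(crunch, base_pkgs))
## With the snapshot above, this drops utils, leaving
## magrittr, tibble, rlang, and Rcpp
```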

Prevalence of the new packages

Another question you might want to consider is how likely it is that the new packages you’re adding might already be present on your users’ systems. If you believe that most of your users already have the one you want to add, you might be less concerned about its costs and risks.

I’m not going to dive into that analysis here, but a few options come to mind. First, you could do a similar dependency tracing using the “Reverse depends” and “Reverse imports” columns of the CRAN database. These columns indicate packages that depend on the package in question, rather than packages on which it depends. You could collect those for the prospective package, and if you find that there are lots of others that depend on it, you may think it is more likely to be found on your users’ systems already.
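As a sketch of that first option: because the reverse fields have the same comma-separated format, the parsePackages function from above works on them too. This assumes the cran table from earlier, with the column names as tools::CRAN_package_db returns them; the counts will vary over time:

```r
## Direct reverse dependencies of purrr: packages that Depend on or Import it
rev_fields <- cran[cran$Package == "purrr", c("Reverse depends", "Reverse imports")]
revdeps <- unique(unlist(parsePackages(unlist(rev_fields))))
length(revdeps)
```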

Second, you could look at package download counts and try to infer whether the package you want to depend on is popular. The cranlogs package provides an interface to RStudio’s CRAN mirror download logs. It’s not complete—there are many CRAN mirrors—but it can give you a data point.

What you’d really like to know is how likely it is that someone who has installed your package has also installed the other package, and I don’t know where you’d get that kind of individual-level data. One interesting approach along these lines was in David Robinson’s talk at the R NYC conference this year. He looked at Stack Overflow questions and answers and did some analysis of which packages were most likely to appear together. Something like that could be useful information… if your package is popular enough to be asked about on Stack Overflow a lot.

No dependencies were harmed in the making of this post

You may have noticed that I did not load any external libraries in this post. That’s partly because I’m a dinosaur who did most of his heavy data analysis in R a few years ago, before the emergence of the so-called “tidyverse”. I learned how R works as a language, both the good and the bad, and I’m comfortable working with those tools.

However, I also did it to prove a point: R is great at manipulating data structures and doing simple analyses just using what comes built-in. A few basic regular expressions, some set operations, and a dash of functional programming, and we got the data into a shape where we could easily answer the questions at hand.

If the package on which you want to depend is core to the functionality that your package provides, you should use it. But if you’re adding it just to make your life a little easier as a developer, consider sacrificing your convenience for the good of your users. Reasonable people may disagree on where the threshold is for deciding it is worth adding the new dependency, but with more information on the costs associated, you can make a better decision.

Published in code and tagged R, dependencies and packages