Profiling.rmd

---
title: Profiling and benchmarking
layout: default
---

```{r, echo = FALSE}
source("code/microbenchmark.R")
source("_plugins/png.R")
```

# Optimising code {#profiling}

> "Programmers waste enormous amounts of time thinking about, or worrying 
> about, the speed of noncritical parts of their programs, and these attempts 
> at efficiency actually have a strong negative impact when debugging and 
> maintenance are considered. We should forget about small efficiencies, say 
> about 97% of the time: premature optimization is the root of all evil" 
> --- Donald Knuth.

Optimising code to make it run faster is an iterative process:

1. Find the biggest bottleneck, the slowest part of your code.
1. Eliminate the bottleneck.
1. Repeat until your code is fast enough.

This process is simple, but not easy. Firstly, your intuition for bottlenecks is likely to be bad. Even experienced programmers have a hard time identifying bottlenecks from code because there are so many layers between R code and the processor. Instead of trying to guess where performance bottlenecks are, it's better to __profile__ code, running it on realistic inputs and timining how long each individual operation takes.  This is the subject of the first part of this chapter, [profiling](#measure-perf). Optimising code before you've figured out what's actually slow is called premature optimisation.

Once you've identified a specific bottleneck you need to rewrite so it runs faster. It's difficult to provide general advice on how to do this. In [improving performance](#improve-perf) you'll learn five techniques that can be applied to many different problems. The focus of this chapter is improving performance within R. In [Rcpp](#rcpp), you'll learn another powerful technique for improving performance: re-writing R bottlenecks in C++. You'll also learn a general strategy for tackling bottlenecks that helps ensure you make your code faster without making it incorrect. As computers get faster and R is optimised, your code will get faster all by itself. Your code is never going to automatically become correct or elegant if it is not already.

The bottleneck metaphor is useful because code performance is similar to fluid flowing through a pipe. Constrictions in a pipe cause energy loses and reduce flow. If you want liquid to flow through a pipe faster, you should first widen the narrowest parts. Similarly with code, as soon as you eliminate one bottleneck, a new bottleneck will arise. For this reason, you need to identify how fast the code needs to be before you start. Premature optimisation corresponds to making pipes wider without knowing which are the narrowest.

It's important to differentiate between absolute and relative speed, and fast vs fast enough. Be very wary of only looking at relative differences. One approach may be 10x faster than another, but it might only save 1 ms. Optimisation is also not free. You need to consider the costs of your time vs. computer time. You want to spend hours of your time to save days of computing time, not seconds.

##### Prerequisites

In this chapter we'll be using the `lineprof` package to understand the performance of R code, so make sure you've installed it before continuing:

```{r, eval = FALSE}
devtools::install_github(c("wch/shiny-slickgrid", "hadley/lineprof"))
```

## Measuring performance {#measure-perf}

To understand performance, you use a profiler. While there are a number of different types, R uses a fairly simple sort called a sampling or statistical profiler. Every few milliseconds, a sampling profiler stops the execution of code and records which function is currently being called and which function called that function (and so on). For example, consider the following function `f()`:

```{r, eval = FALSE}
library(lineprof)
f <- function() {
  pause(0.1)
  g()
  h()
}
g <- function() {
  pause(0.1)
  h()
}
h <- function() {
  pause(0.1)
}

tmp <- tempfile()
Rprof(tmp, interval = 0.1)
f()
Rprof(NULL)
```

(Note that we're using a `lineprof::pause()` instead of `Sys.sleep()` because it's implemented in a way that it won't appear in the profiling output.)

Conceptually, the profiler produces output like this:

```
f() 
f() > g()
f() > g() > h()
f() > h()
```

Each line represents one "tick" of the profiler (0.1s in this case). You can see the code spends 0.1 running f, then 0.1 running g (called from inside f).

In the real world, you're unlikely to get a result this nice. That's because profiling is hard to do accurately without slowing your code down by many orders of magnitude. As a compromise, `RProf()` uses a very well-established approximation technique: sampling! Basically, it stops the execution of code every `interval` seconds and inspects the current call. Because there's some variability in both the accuracy of the timer and in the time taken by each operation, each time you profile code you're likely to get a slightly different answer. Fortunately, pinpoint accuracy is not needed to identify the slowest parts of your code. 

Instead of looking at the individual records, we'll aggregate and display them using the `lineprof` package. There are a number of ways to visualise this data including `summaryRprof()`, the proftools or the profr package. But these tools are rather sophisticated. I wrote the `lineprof` package as a simpler way to visualise profiling data. While it's less powerful, it makes getting started with profiling easier because it shows you performance in the context of your code. As the name suggests, the fundamental unit of analysis in `lineprof()` is a line of code. This makes `lineprof` less precise than the alternatives (because a line of code can contain multiple function calls), but it's easier to see the context.

To use `lineprof`, we first save the code in a file. That way, we can easily index the code by line number. We then use `lineprof()` to run our function and capture the timing output. Printing this object shows some basic information. For now, we'll just focus on the time column which estimates how long each line took to run and the ref column which tells us which line of code was run (you'll learn about the other columns in [memory profiling](#memory-profiling)). The estimates aren't perfect, but the ratios look about right.  

```{r, eval = FALSE}
library(lineprof)
source("code/profiling.R")
l <- lineprof(f())
l
#>    time alloc release dups           ref     src
#> 1 0.074 0.001       0    0 profiling.R#2 f/pause
#> 2 0.143 0.002       0    0 profiling.R#3 f/g    
#> 3 0.071 0.000       0    0 profiling.R#4 f/h   
```

'lineprof' does provide some tools to navigate through this data structure (like `focus()`), but they're a bit clumsy. Instead, we'll start an interactive explorer using the `shiny` package. `shine(l)` will open a new web page (or if you're using RStudio, a new pane) that shows your source code annotated with information about how long each line took to run. `shine()` starts a shiny app which "blocks" your R session. To exit, you'll need to stop the process using escape or ctrl + c. 

```{r, echo = FALSE}
png("profiling-lineprof-f.png", dpi = 220)
```

The `t` column visualises how much time is spent on each line. While not precise, it allows you to spot bottlenecks (if you want precise numbers you can hover over the bar). You can see that twice as much time is spent on `g()` as on `h()`, so it would make sense to drill down into `g()` for more details. To do so, click `g()`:

```{r, echo = FALSE}
png("profiling-lineprof-g.png", dpi = 220)
```

Then `h()`:

```{r, echo = FALSE}
png("profiling-lineprof-h.png", dpi = 220)
```

For your own code, this should allow you to quickly identify any bottlenecks in your code.

### Limitations

There are some other limitations to profiling:

* Profiling does not extend to C code - you can see if your R code calls C/C++
  code but not what functions are called inside of your C/C++ code. Unfortunately, 
  tools for profiling compiled code are beyond the scope of this book (i.e., I 
  have no idea how to do it).

* Similarly, you can't profile either primitive functions or byte code compiled 
  code. 

* If you're doing a lot of functional programming with anonymous functions,
  it can be hard to figure out exactly which function is being called.
  The easiest way to work around this is to name your functions.

* Lazy evaluation means that arguments are often evaluated inside another 
  function. For example, in the following code, profiling would make it seem
  like `i()` was called by `j()` because the argument isn't evaluated until it's
  needed by `j().

    ```{r, eval = FALSE}
    i <- function() {
      pause(0.1)
      10
    }
    j <- function(x) {
      x + 10
    }
    j(i())
    ```
    
    If this is confusing, you can create temporary variables that force 
    computation to happen earlier.

### Exercises

1 `Rprof()` doesn't very accurately track time spend in `Sys.sleep()` 
  (presumably because it's not actually doing any computation.)

## Improving performance {#improve-perf}

Once you've used profiling to identify a bottleneck, you need to make it faster. The following sections introduce you to a number of techniques that I've found broadly useful:

1. Look for existing solutions
1. Do less work
1. Vectorise
1. Parallelise
1. Avoid copies
1. Byte-code compile

A final technique is to rewrite in a faster language, like C++. This is a big topic and is covered in the next chapter, [Rcpp](#rcpp).

Before we get into specific techniques, I'll first describe a general strategy and organisation style that’s useful when working on performance. As always, remember that clarity and readability are more important than speed. Your intuition for bottlenecks is likely to be bad, so don't sacrifice readability for performance unless you _know_ it will have a significant impact on run-time.

### Code organisation

There are two traps that are easy to fall into when trying to make your code faster:

1. Writing fast but incorrect code.
1. Assuming your code is fast.

You can avoid these pitfalls by adopting the strategy outlined below. In this trivial example, we'll compare two approaches to computing the mean.

When tackling a bottleneck, you're likely to come up with multiple approaches. Write a function for each approach. Each function should encapsulate all relevant behaviour. This will make it easier to check whether a function returns the correct result. Then, time how long each function takes to run. For our example of computing a mean, two approaches come to mind:

```{r}
mean1 <- function(x) mean(x)
mean2 <- function(x) sum(x) / length(x)
```

I recommend that you keep a record of everything you try, even the failures. If you come back to problem in the future, it'll be useful to see everything you've tried. To do this, I often use an R Markdown file, which makes it easy to intermingle code with detailed comments and notes.

Next, generate a representative test case. The case should be big enough to capture the essence of your problem but small enough that it takes only a few seconds to run. You don't want it to take too long because, to compare approaches, you'll need to run the case many times. That said, you don't want the case to be too small and abstract because the results might not generalise to the real problem. If you're interested in the performance of differently sized inputs, you may need to generate more than one test case.

```{r}
x <- runif(100)
```

Use this test case to quickly check that all variants return the same result. Check this by using `stopifnot()` and `all.equal()`. For real problems that have fewer possible outputs, you may need more tests to make sure that an approach doesn't accidentally return the correct answer.

```{r}
stopifnot(all.equal(mean1(x), mean2(x)))
```

Finally, use the `microbenchmark` package to compare how long each variation takes to run. For bigger problems, reduce the `times` paramater so that it only takes a couple of seconds to run. Focus on the median time, and use the upper and lower quartiles to gauge the variability of the measurement.

```{r}
microbenchmark(
  mean1(x),
  mean2(x)
)
```

(You might be surprised by the results here. `mean(x)` is considerably slower than `sum(x) / length(x)`. This is because, for the sake of accuracy, `mean(x)` makes two passes over the vector.)

Before you start experimenting, you should have a target speed that defines when the bottleneck is no longer a problem. Setting such a goal is important because you don't want to spend valuable time over-optimising your code.

If you'd like to see this strategy in action, I've used it a few times on Stack Overflow. 

* http://stackoverflow.com/questions/22515525#22518603
* http://stackoverflow.com/questions/22515175#22515856
* http://stackoverflow.com/questions/3476015#22511936

### Has someone already solved the problem?

Once you've organised your code and captured all the variations you can think of, it's natural to see what others have done. Remember, you are part of a large community. So it's quite possible that someone has already tried to tackle the same problem. If your bottleneck is a function in a package, it's worth looking at other packages that do the same thing. Two good places to start are:

* [CRAN task views](http://cran.rstudio.com/web/views/). If there's a
  CRAN task view related to your problem's domain, it's worth looking at
  the packages included there.

* Reverse dependencies of Rcpp, as listed on its
  [CRAN page](http://cran.r-project.org/web/packages/Rcpp). Since these
  packages use C++, it's possible to find a solution to your bottleneck
  written in a higher performance language.

Otherwise, the challenge is describing your bottleneck in a way that helps you find related problems and solutions. Knowing the name of the problem or its synonyms will make this search much easier. But because you don't know what it's called, it's hard to search for this knowledge! By reading broadly about statistics and algorithms, you can build up your own knowledge base over time. Alternatively, you can ask others. Talk to your colleagues and brainstorm some possible names, then search on Google and Stack Overflow. Note that it's often helpful to restrict your search to R related pages. For Google, try [rseek](http://www.rseek.org/). For Stack Overflow, restrict your search by including the R tag, `[R]`, in your search.

As discussed above, record all solutions that you find, not just those that immediately appear to be fast. Some solutions might initially be slower, but because they are easier to optimise they can end up being faster. You may also be able to combine the fast parts from different approaches. If you've found a solution that's fast enough, congratulations! Otherwise, read on.

### Exercises

1.  What are the faster available alternatives to `lm`? Which are
    specifically designed to work with larger datasets?

1.  What package implements a version of `match()` that's faster for
    repeated look ups? How much faster is it?

1.  List four functions (not just those in base R) that convert a string into a
    date time object? What are their strengths and weaknesses?

1.  How many different ways can you compute a 1d density estimate in R?

1.  What packages provide the ability to compute a rolling mean?

1.  What are the available alternatives to `optim()`?

### Do as little as possible

Given a function, the easiest way to make it faster is to make it do less work. Sometimes you can replace an existing function with a faster, more specific one. For example:

* `vapply()` is faster than `sapply()` because it pre-specifies the output
  type.

* `rowSums()`, `colSums()`, `rowMeans()`, and `colMeans()` are faster than
  equivalent invocations that use `apply()` because they are vectorised (the 
  topic of the next section).

* If you want to see if a vector contains a single value, `any(x == 10)`
  is much faster than `10 %in% x`. This is because testing equality is simpler 
  than testing inclusion in a set.

Having this knowledge at your fingertips requires knowing that these alternative functions even exist. Becoming literate in R starts with having a good [vocabulary](#vocabulary). The best way to expand your vocabulary is to regularly read R code, e.g., R-help or on [Stack Overflow](http://stackoverflow.com/questions/tagged/r).

Other functions will do less work if you give them more information about the problem. It's always worthwhile to carefully read the documentation and experiment with the different arguments. Some examples that I've discovered in the past include:

* `read.csv()`: specify known the columns types with `colClasses`.

* `factor()`: specify known levels with `levels`.

* `cut()`: don't generate labels with `labels = FALSE` if you don't need them
  (even better, use `findInterval()` as mentioned in the "see also" section of
  the documentation).

* `interaction()`: if you only need combinations that exist in the data, use
  `drop = TRUE`.

Sometimes you can make a function faster by avoiding method dispatch. As we saw in ([Extreme dynamism](#extreme-dynamism)), method dispatch in R can be costly. If you're calling a method in a tight loop, you can avoid some of these costs by doing the method lookup only once. For S3, you can do this by calling `generic.class()` instead of `generic()`. For S4, you can do this by using `findMethod()` to find the method, saving it to a variable, and then calling that function. For example, calling `mean.default()` instead of `mean()` is quite a bit faster for small vectors:

```{r}
x <- runif(1e2)

microbenchmark(
  mean(x),
  mean.default(x)
)
```

Note that this optimisation is a little risky. While `mean.default()` is almost twice as fast, it'll fail in surprising ways if `x` is not a vector. You should only use it if you know that the input will be a numeric vector.

Knowing that you're dealing with a specific type of input can be another way to write faster code. For example, `as.data.frame()` is quite slow because it coerces each element into a data frame and then `rbind()`s them together. But, if you had a named list with vectors of equal length, you could directly transform them into a data frame. In this case, if you're able to make strong assumptions about your input, you can write a method that's about 20x faster than the default.

```{r}
quickdf <- function(l) {
  class(l) <- "data.frame"
  attr(l, "row.names") <- .set_row_names(length(l[[1]]))
  l
}

l <- lapply(1:26, function(i) runif(1e3))
names(l) <- letters

microbenchmark(
  quickdf(l),
  as.data.frame.list(l),
  as.data.frame(l)
)
```

Again, note the tradeoff. This method is fast because it's dangerous. If you give it bad inputs, you'll get a corrupt data frame:

```{r}
quickdf(list(x = 1, y = 1:2))
```

To come up with this minimal method, I carefully read through and then rewrote the source code for `as.data.frame.list()` and `data.frame()`. I made many small changes, each time checking that I hadn't broken existing behaviour. After several hours work, I was able to isolate the minimal code shown above. This is a very useful technique. Most base R functions are written for flexibility and functionality, not performance. Thus, rewriting for your specific need can often yield substantial improvements. To do this, you'll need to read the source code. It can be complex and confusing, but don't give up!

The following example shows a progressive simplification of the `diff()` function for the case of computing differences between adjacent values in a vector. At each step, I replace one argument with a specific case, and then check to see that the function still works. The initial function is long and complicated, but by restricting the arguments I not only make it around twice as fast, I also make it easier to understand.

```{r}
# The original function, reformatted after typing diff
diff1 <- function (x, lag = 1L, differences = 1L) {
  ismat <- is.matrix(x)
  xlen <- if (ismat) dim(x)[1L] else length(x)
  if (length(lag) > 1L || length(differences) > 1L || lag < 1L || differences < 1L)
    stop("'lag' and 'differences' must be integers >= 1")

  if (lag * differences >= xlen) {
    return(x[0L])
  }

  r <- unclass(x)
  i1 <- -seq_len(lag)
  if (ismat) {
    for (i in seq_len(differences)) {
      r <- r[i1, , drop = FALSE] - r[-nrow(r):-(nrow(r) - lag + 1L), ,
        drop = FALSE]
    }
  } else {
    for (i in seq_len(differences)) {
      r <- r[i1] - r[-length(r):-(length(r) - lag + 1L)]
    }
  }
  class(r) <- oldClass(x)
  r
}

# Step 1: Assume vector input. This allows me to remove the is.matrix()
# test and the method that use matrix subsetting.
diff2 <- function (x, lag = 1L, differences = 1L) {
  xlen <- length(x)
  if (length(lag) > 1L || length(differences) > 1L || lag < 1L || differences < 1L)
    stop("'lag' and 'differences' must be integers >= 1")

  if (lag * differences >= xlen) {
    return(x[0L])
  }

  i1 <- -seq_len(lag)
  for (i in seq_len(differences)) {
    x <- x[i1] - x[-length(x):-(length(x) - lag + 1L)]
  }
  x
}
diff2(cumsum(0:10))

# Step 2: assume difference = 1L. This simplifies input checking
# and eliminates the for loop
diff3 <- function (x, lag = 1L) {
  xlen <- length(x)
  if (length(lag) > 1L || lag < 1L)
    stop("'lag' must be integer >= 1")

  if (lag >= xlen) {
    return(x[0L])
  }

  i1 <- -seq_len(lag)
  x[i1] - x[-length(x):-(length(x) - lag + 1L)]
}
diff3(cumsum(0:10))

# Step 3: assume lag = 1L. This eliminates input checking and simplifies
# subsetting.
diff4 <- function (x) {
  xlen <- length(x)
  if (xlen <= 1) return(x[0L])

  x[-1] - x[-xlen]
}
diff4(cumsum(0:10))

x <- runif(100)
microbenchmark(
  diff1(x),
  diff2(x),
  diff3(x),
  diff4(x)
)
```

Once you've read [Rcpp](#rcpp) you'll be able to make `diff()` even faster for this case.

A final example of doing less work is to use simpler data structures. For example, when working with rows from a data frame, it's often much faster to work with row indices than data frames. For instance, if you wanted to compute a bootstrap estimate of the correlation between two columns in a data frame, there are two basic approaches: you can either work with the whole data frame or with the individual vectors. The following example shows that working with vectors is about twice as fast.

```{r}
sample_rows <- function(df, i) sample.int(nrow(df), i, replace = TRUE)

# Generate a new data frame containing randomly selected rows
boot_cor1 <- function(df, i) {
  sub <- df[sample_rows(df, i), , drop = FALSE]
  cor(sub$x, sub$y)
}

# Generate new vectors from random rows
boot_cor2 <- function(df, i ) {
  idx <- sample_rows(df, i)
  cor(df$x[idx], df$y[idx])
}

df <- data.frame(x = runif(100), y = runif(100))
microbenchmark(
  boot_cor1(df, 10),
  boot_cor2(df, 10)
)
```

### Exercises

1.  How do the results change if you compare `mean()` and `mean.default()`
    on 10,000 observations, rather than on 100?

1.  Make a faster version of `chisq.test()` that only computes the Chi-square
    test statistic when the input is two numeric vectors with no missing
    values. You can try by starting with either `chisq.test()`, making it 
    simpler, or with the formal [definition](http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test).

1.  Can you make a faster version of `table()` for the case of an input of
    two integer vectors with no missing values? Can you use it to
    speed up your Chi-square test?

### Vectorise

If you've used R for any length of time, you've probably heard the admonishment to "vectorise your code". But what does that actually mean? Vectorising your code is not just about avoiding for loops (although that's often a step), it's more about using functions that work with a vector as a whole, rather than with the vector's components. There are two key attributes of a vectorised function:

* It makes many problems simpler. Instead of having to think about the 
  components of a vector, you only think about vectors as a single 
  objects.

* Most vectorised functions still use for loops, but they are written in C 
  instead of R. For loops in C are much faster because they have much less 
  overhead.

[Functionals](#functionals) stressed the importance of vectorised code as a higher level abstraction. Vectorisation is also important for writing fast R code. This, however, doesn't simply mean using `apply()` or `lapply()`, or even `Vectorise()`. Those functions just improve the interface of a function by using vectorised syntax. They don't fundamentally change performance. In R, using vectorisation for performance reasons really means finding the existing R function that most closely applies to your problem and that is already implemented in C. 

Vectorised functions that apply to many common performance bottlenecks include:

* `rowSums()`, `colSums()`, `rowMeans()`, and `colMeans()`. These vectorised 
  matrix functions are will always be faster than using `apply()`. You can  
  sometimes use these functions to build other vectorised functions. 
  
    ```{r}
    rowAny <- function(x) rowSums(x) > 0
    rowAll <- function(x) rowSums(x) == ncol(x)
    ```
* Vectorised subsetting can lead to big improvements in speed. Remember the 
  techniques behind lookup tables ([lookup tables](#lookup-tables)) and matching 
  and merging by hand ([matching and merging by hand](#matching-merging)). Also 
  remember that you can use subsetting assignment to replace multiple values in 
  a single step: `x[is.na(x)] <- 0` will replace all missing values in `x` with 
  0 if `x` is a vector, matrix or data frame.

* If you're converting continuous to categorical values make sure you know
  how to use `cut()` and `findInterval()`.

* Be aware of vectorised functions like `cumsum()` and `diff()`.

That said, vectorisation is challenging because it's hard to predict how different operations will scale. The following example measures how long it takes to use character subsetting to lookup 1, 10 and 100 elements from a list. You might expect that looking up 10 elements would take 10x as long as looking up 1, and that looking up 100 elements would take 10x longer than looking up 10. In fact, the following example shows that it only takes about 8 times longer to lookup 100 elements than it does to lookup 1.

```{r}
lookup <- setNames(as.list(sample(100, 26)), letters)

x1 <- "j"
x10 <- sample(letters, 10)
x100 <- sample(letters, 100, replace = TRUE)

microbenchmark(
  lookup[x1],
  lookup[x10],
  lookup[x100]
)
```

A special case of vectorisation is matrix algebra, where the loops are done by highly tuned external libraries like BLAS. If you can figure out a way to use matrix algebra to solve a problem, it will often be very fast. Unfortunately using matrix algebra is usually just a matter of recognising a special trick. If you work a lot in a particular domain, you'll start to recognise common patterns and useful tricks. Until you do, you'll need to rely on others to help you spot the tricks.

Vectorisation won't solve every problem, and rather than torturing an existing algorithm into one that uses a vectorised approach, you're often better off writing your own vectorised function in C++. You'll learn how to do so in [Rcpp](#rcpp). 

### Exercises

* How can you use `crossprod()` to compute a weighted sum? How much faster is
  it than the naive `sum(x * w)`?

### Avoid copies

A pernicious source of slow R code is growing an object with a loop. Whenever you use `c()`, `append()`, `cbind()`, `rbind()` or `paste()` to create a bigger object, R must first allocate space for the new object and then copy the old object to its new home. If you're repeating this many times, as with a for loop, this can be quite expensive. This is known as Circle 2 in the [R inferno](http://www.burns-stat.com/pages/Tutor/R_inferno.pdf).

Here's a little example that benchmarks the difference in execution time. We first generate some random strings, and then combine them either iteratively with a loop using `collapse()`, or in a single pass using `paste()`. Note that the performance of `collapse()` gets relatively worse as the number of strings grows: combining 100 strings takes almost 30 times longer than combining 10 strings.

```{r}
random_string <- function() {
  paste(sample(letters, 50, replace = TRUE), collapse = "")
}
strings10 <- replicate(10, random_string())
strings100 <- replicate(100, random_string())

collapse <- function(xs) {
  out <- ""
  for (x in xs) {
    out <- paste0(out, x)
  }
  out
}

microbenchmark(
  collapse(strings10),
  collapse(strings100),
  paste(strings10, collapse = ""),
  paste(strings100, collapse = "")
)
```

Modifying an object in a loop, e.g. `x[i] <- y`, can also create copies, depending on `x`'s class. [Modification in place]{#modification-in-place} discusses this problem in more depth and gives you tools to determine when it only looks like you're modifying an object in place. Data frames are particularly bad offenders.

Note that vectorised code avoids both of these potential problems by modifying or creating an object just once.

### Byte code compilation

R 2.13.0 introduced a byte code compiler which can increase the speed of certain types of code. Using the compiler is an easy way to get improvements in speed. It's also easy to do. So if it doesn't work well for your function, you won't have invested a lot of time in the effort. The following example shows the pure R version of `lapply()` from [functionals](#lapply). Compiling it gives a considerable speedup, although it's still not quite as fast as the C version provided by base R.

```{r}
lapply2 <- function(x, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], ...)
  }
  out
}

lapply2_c <- compiler::cmpfun(lapply2)

x <- list(1:10, letters, c(F, T), NULL)
microbenchmark(
  lapply2(x, is.null),
  lapply2_c(x, is.null),
  lapply(x, is.null)
)
```

This is a relatively good example of byte code compilation. In most cases, however, you're more likely to get a 5-10 % improvement. This particular example optimises well because it uses a for loop, something that is generally rare in R.

Note that all base R functions are now byte code compiled by default.

### Exercises

1. Imagine you want to compute the bootstrap distribution of a sample
    correlation using `cor_df()` and the data in the example below. Given that you 
    want to run this many times, how can you make this code faster? (Hint: the 
    function has three components that you can speed up.)

    ```{r, eval = FALSE}
    n <- 1e6
    df <- data.frame(a = rnorm(n), b = rnorm(n))

    cor_df <- function(i) {
      i <- sample(seq(n), n * 0.01)
      cor(q[i, , drop = FALSE])[2,1]
    }
    ```

    Is there a way to vectorise this procedure?

### Case study: t-test

The following case study explores how to make t-tests faster by exploiting only vectorised functions. This case study is based on an example in [Computing thousands of test statistics simultaneously in R](http://stat-computing.org/newsletter/issues/scgn-18-1.pdf) by Holger Schwender and Tina Müller. I thoroughly recommend reading the paper in full to see the same idea applied to other tests.

Imagine we have run 1000 experiments, each of which collects data on 50 individuals. The first 25 individuals in each experiment are assigned to group 1 and the rest to group 2. We'll generate some random data to represent this data.

```{r}
m <- 1000
n <- 50
X <- matrix(rnorm(m * n, mean = 10, sd = 3), nrow = m)
grp <- rep(1:2, length = n)
```

For data in this form, there are two basic ways to use `t.test()`. We can either use the formula interface or provide two vectors, one for each group. Timing reveals that the formula interface is considerably slower.

```{r, cache = TRUE}
system.time(for(i in 1:m) t.test(X[i, ] ~ grp)$stat)
system.time(for(i in 1:m) t.test(X[i, grp == 1], X[i, grp == 2])$stat)
```

Of course, a for loop just computes but doesn't save values. So we probably actually want to use `apply()`. This will add a little overhead:

```{r}
compT <- function(x, grp){
  t.test(x[grp == 1], x[grp == 2])$stat
}
system.time(apply(X, 1, compT, grp = grp))
```

How can we make this faster? First, we could try doing less work. If you look at the source code of `stats:::t.test.default()`, you'll see that it does a lot more than just compute the t-statistic. It also computes the p-value and formats the output for printing. We can try to make our code faster by stripping out those pieces.

```{r}
my_t <- function(x, grp) {
  t_stat <- function(x) {
    m <- mean(x)
    length <- length(x)
    var <- sum((x - m) ^ 2) / (n - 1)

    list(m = m, n = n, var = var)
  }

  g1 <- t_stat(x[grp == 1])
  g2 <- t_stat(x[grp == 2])

  pooled_se <- sqrt(g1$var / g1$n + g2$var / g2$n)
  (g1$m - g2$m) / pooled_se
}
system.time(apply(X, 1, my_t, grp = grp))
```

This gives us about a 5x speed improvement.

Now that we have a fairly simple function, we can make it faster still by vectorisation. Instead of looping over the array outside the function, we vectorise the function by modifying `t_stat()` to work with a matrix of values instead of a vector. Thus, `mean()` becomes `rowMeans()`, `length()` becomes `ncol()`, and `sum()` becomes `rowSums()`. The rest of the code stays the same.

```{r}
rowtstat <- function(X, grp){
  t_stat <- function(X) {
    m <- rowMeans(X)
    n <- ncol(X)
    var <- rowSums((X - m) ^ 2) / (n - 1)

    list(m = m, n = n, var = var)
  }

  g1 <- t_stat(X[, grp == 1])
  g2 <- t_stat(X[, grp == 2])

  pooled_se <- sqrt(g1$var / g1$n + g2$var / g2$n)
  (g1$m - g2$m) / pooled_se
}
system.time(rowtstat(X, grp))
```

That's much faster! It's at least 40x faster than our previous effort, and around 1000x faster than where we started.

Finally, we could try byte code compilation. Here we'll need to use `microbenchmark()` instead of `system.time()` in order to get enough accuracy to see a difference:

```{r}
rowtstat_bc <- compiler::cmpfun(rowtstat)

microbenchmark(
  rowtstat(X, grp),
  rowtstat_bc(X, grp)
)
```

In this example, byte code compilation doesn't help at all.

### Parallelise

Parallelising your code doesn't save computer time, it saves your time. Parallelisation works because multiple processors or cores are used to simultaneously work on different parts of you problem. Parallel computing is a complex field, and there's no way to cover it in depth here. Some resources I recommend are:

* [Parallel R](http://amazon.com/B005Z29QT4) by Q. Ethan McCallum and Stephen Weston.

* [Parallel computing for data science](http://heather.cs.ucdavis.edu/paralleldatasci.pdf), by
  Norm Matloff.

What I want to focus on is a simple application of parallel computing to what are called embarrassingly parallel problems. When your problem has many simple parts that can be solved independently, it's very easy to spread computation across the cores on your computer. For example, if you have a for loop, or equivalent `lapply()`, you can easily run each operation in parallel. This is particularly easy in Linux and Mac OS because you simply substitute `mclapply()` for `lapply()`. The following code snippet runs a trivial (but slow) function on all cores of your computer.

```{r}
library(parallel)
cores <- parallel::detectCores()
cores

pause <- function(i) {
  function(x) Sys.sleep(i)
}

system.time(lapply(1:10, pause(0.25)))
system.time(mclapply(1:10, pause(0.25), mc.cores = cores))
```

Life is a bit harder in Windows. You need to first set up a local cluster and then use `parLapply()`:

```{r}
cluster <- parallel::makePSOCKcluster(cores)
system.time(parLapply(cluster, 1:10, function(i) Sys.sleep(1)))
```

The main difference between `mclapply()` and `makePSOCKcluster()` is that the individual processes generated by `mclapply()` inherit from the current process, while those generated by `makePSOCKcluster()` start a fresh session. This means that most real code will need some setup. Use `clusterEvalQ()` to run arbitrary code on each cluster and load needed packages, and `clusterExport()` to copy objects in the current session to the remote sessions.

```{r, error = TRUE}
x <- 10
psock <- parallel::makePSOCKcluster(1L)
clusterEvalQ(psock, x)

clusterExport(psock, "x")
clusterEvalQ(psock, x)
```

Note there is some communication overhead with parallel computing. If the subproblems are very small, then parallelisation might actually hurt rather than help. It's also possible to distribute computation over a network of computers (not just the cores on your local computer) but that's beyond the scope of this book. (It gets increasingly complicated to balance computation against the costs of communication.) A good place to start for more information is the [High performance computing](http://cran.r-project.org/web/views/HighPerformanceComputing.html) CRAN task view.

### Other techniques

Being able to write fast R code is part of being a good R programmer. Beyond the specific hints in this chapter, if you want to write fast R code, you'll need to improve your general programming skills. Some ways to do this are to:

* [Read R blogs](http://www.r-bloggers.com/) to see what performance
  problems other people have struggled with, and how they have made their
  code faster.

* Read other R programming books, like Norm Matloff's
  [The Art of R Programming](http://amazon.com/1593273843). Read the
  [R inferno](http://www.burns-stat.com/documents/books/the-r-inferno/) to
  learn about common traps.

* Take an algorithms and data structure course to learn some theory and
  well known ways of tackling certain classes of problems. I have heard
  good things about Princeton's
  [Algorithms](https://www.coursera.org/course/algs4partI) course offered on
  Coursera.
  
* Read general books about optimisation like 
  [Mature optimisation](http://carlos.bueno.org/optimization/mature-optimization.pdf)
  by Carlos Bueno, or the [Pragmatic Programmer](http://amazon.com/020161622X) 
  by Andrew Hunt and David Thomas.
  
You can also reach out to the community for help. Stack Overflow can be a useful resource. But you'll need to put some effort into creating an easily digestible example that also captures the salient features of your problem. If your example is too complex, few people will have the time and motivation to attempt a solution. If it's too simple, you'll get answers that solve the toy problem but not the real problem. If you try to answer questions on Stack Overflow, you'll quickly get a feel for what makes a good question.