Last week I was taking some colleagues through the code for my COVID19 PDF scraping and afterwards one sent me a message asking about a chunk of code that used the dplyr::case_when() function. In particular they wanted to know why case_when() uses the tilde (~)1, which led to a bit more of a generalised conversation about case_when() and how it works.
In your script, when you assign the values to the entity and position columns inside your case_when() call why do you use ~ rather than = or <-? It felt too specific and potentially obvious to others to ask during your talk. I understand why it’s used in model building for regression etc… but not really in this context.
This is a great question and formulating the answer actually helped crystallise something I knew about R but hadn’t really processed before then.
The documentation for dplyr::case_when() calls it a “A general vectorised if”, which is a good explanation if you’re well-versed in R. But a lot of people aren’t. The best way to describe it is that helps you to avoid nasty nested if conditions. You remember when you used Excel2 and you’d want a cell to display something based on an entry in another cell and so you combined IF() functions together, well case_when() is basically a much nicer and easier to process approach to that. Let’s take our example from the code chunk, we have a set of x and y coordinates and we want to generate the entity given the values of x and y. Let’s also pretend we’re back using Excel (or Sheets, or you spreadsheeting application of choice), so we’ve got a table with our x and y values in columns A and B and we’ve written a formula to provide the entity in column C.
There’s a fair potential for error here. Firstly, it only does one test of y, so if the value in A6 was 268 it would still continue to process the second-half of these nested IF statements, the same is also the case for the final set of x values, no matter the value in B4 you’ll get parks if A4 is 104, and in B7 no matter what values you have in either A7 or B7 you’ll get residential.
We could write this set of IF statements that is more explicit, and will return FALSE in any cell where there are incorrect values:
This does the job and is error-proof in the sense that it only returns valid results for valid inputs. However, if you’re anything like me then your Excel formulae are lazy and so you’d have written the former. Plus you’re not protected against copy/paste, transposition or overwriting errors.
case_when() to the rescue
Base R has the ifelse() function, and {dplyr} has a stricter interpretation if_else(), we could nest a set of calls to either of these just as we do in Excel (ifelse(x==1,"a", ifelse(x==2, "b", "c"))). But we can avoid messy code by using dplyr::case_when().
It turns out I use case_when() a lot in my code. Some 92 of the 211 .R and .Rmd files in my live R project folders contain at least one call to case_when, a whopping 43.6% of my current scripts! Before today I knew that I used it fairly regularly, but if you’d have asked me to guess I’d probably have said that maybe a quarter or a third of my code used it, not almost half!!
So what makes case_when() so great? Well it’s a very handy function for handling multiple conditions. Let’s take a subset of the code from above, we’ve got a tibble that contains the variables x and y, we’re filtering it to three specific values of y and then creating a new variable called entity using case_when().
While you can use case_when() working outside of data.frame operations that’s not something I’ve done and I think more conventional nested base::if() statements are better, case_when() is designed for working with its {dplyr} siblings, especially dplyr::mutate(). The arguments to case_when() are a set of formula pairs, which is where our friend the tilde (~) from my colleague’s question comes in. R’s formula notation is something you’ll mainly come across when constructing a linear model, e.g. y ~ x + z, but it’s not restricted to its use in model building. A formula is a type of object that R explicitly understands to be a pair of items, a left-hand side (LHS) and a right-hand side (RHS) split by a tilde, which we can generalise this as LHS ~ RHS. This two-part nature of a formula pair is what case_when() is exploiting.
case_when() uses the LHS part as the condition to test for and the RHS as the value to return, it works in sequence through the conditions and when it obtains a TRUE result it will return the value associated with that condition. Our first pair, y == 36 ~ "location, tests if y is equal to 36 and if so then it returns “location”. The next pair, y == 104 & x == 36 ~ "retail_recr", tests if y is equal to 104 AND x is equal to 36. This code replicates the second Excel formula written above, but I certainly find it easier to read and follow.
There are a couple of important things to remember when using case_when(). Firstly, it’s crucial that the all of the items in your RHS are of the same type, i.e. they’re all character strings, or they’re all numerics, if they’re not the same type then the call will fail. Secondly, if a row in your data doesn’t match any of the conditions then it’ll get an NA value, but what if you wanted to have a catch-all value for all other cases, well that’s pretty simple. case_when() is testing conditions and looking for a TRUE result, so all you need to do is add a final line of the form TRUE ~ value
df<-tibble(val=c(1,2,1,3,4,5))%>%mutate(new_val=case_when(val==1~"One",val>3~"More than three",TRUE~"Other"))
In this code we look at val and if it’s equal to 1 or greater than 3 then we’ll get the response we’ve specified, but in all other cases we’ll get the value “Other”.
> df
| val|new_val |
|---:|:----------------|
| 1 | One |
| 2 | Other |
| 1 | One |
| 3 | Other |
| 4 | More than three |
| 5 | More than three |
So when’s the case for case_when()? Almost half the time if my code is anything to go by.
This opens up a segue to the best R prank ever, a tweet that has since been deleted: Today we were learning R Coding and I made a function so that whenever someone typed a ~ it automatically added “Swinton” after it. Three hours it took my colleagues to get it, three hours. - Joe (@raptorbaitjoe) May 29, 2019. ↩︎