When I began using R, I noticed that syntax could become quite complex when nesting functions, or assigning values to variables from functions. If you’ve ever perused blog posts abour R online, you’ll quite often see authors utilize the left arrow to assign values from the expression on the right hand side. It’s actually kind of unique to R.
The left arrow stems from S’s influence on R, which was in turn influenced by APL. APL was an acient language (in the eyes of the internet) that stems back from the 1960s. The APL language aptly named “A Programming Language” even allowed programmers to utilize special characters on IBM keyboards like this one. In APL “=” is reserved for equality checks. In R however, “=” can also be utilized for assigning values to variables, so I’m not quite sure why “<-” is still commonly used. Some sources online cite clarity as a reason, since R utilizes “==” for equality checks. Nevertheless, this doesn’t address the clumsy and complex appearance of nesting functions while processing data, or repeatedly applying and assigning return values from functions.
Cue pipe operators. The most popular being the forward pipe operator. Its creation can be traced back to a stackoverflow post, from an anonymous user asking how F#’s forward pipe operator could be implemented in R. Fast forward a couple of years and magrittr was released. Magrittr allowed the use of the forward pipe operator in the form %>%, later on we’ll see how this forward pipe operator can reduce the number of lines in your code, and increase clarity.
In this post I’ll introduce a few other pipe operators which address all of these issues. I’ll demonstrate their practicality, and illustrate how they can simplify your data processing work-flow, and make it more aesthetically pleasing.
The most popular pipe operators in R are: %>%, %<>%, %T>%, and %$%. Which carry the titles of forward operator, compound assignment operator, tee operator and exposition operator respectively. The bulk of my focus here will be on %>% & %<>%.
If you’ve ever stared at R code written online attempting to decipher the order of nested functions and their arguments, you have probably felt at least a tinge of irritation. If you recall composite functions from elementary algebra, forward pipe operators work in a similar manner. Take the equation of the composition of g(x) and f(x) as: \([f \circ g](x)\) = f(g(x)). the left hand side can be understood as applying the function f() to g(x). You can extend this to more than two ‘function applications’, but before we get ahead of ourself let’s take a look at some examples in R. You have to load magrittr first to use these operators:
library(dplyr,warn.conflicts = F)
library(data.table, warn.conflicts = F, verbose = F)
I turned warn conflicts off, but just be aware that dplyr masks some functions in other packages such as ‘filter’ in stats. Let’s start with a very simple example, and then move to some real-world uses. Let’s take a vector z thats the sequence of numbers 1 through 8:
z <- seq(1:8)
Let’s say you need to find the square root of the mean of this vector. You could do it two ways without pipe operators:
print(sqrt(mean(z)))
## [1] 2.12132
Or
y <- mean(z)
print(sqrt(y))
## [1] 2.12132
With pipe operators you could accomplish it with significantly less steps, and more clarity:
z %>% mean() %>% sqrt() %>% print()
## [1] 2.12132
It’s critical to keep in mind the order of operations. In our first example, we utilized a nested expression applying inner functions first and going out for each successive function applied. Using the forward pipe operator, we simply go left to right. To illustrate the importance of this let us take a look at another example, let’s define a vector z:
z <- runif(8,min = 1, max = 5)
print(z)
## [1] 2.471216 4.012321 3.425334 4.584447 2.183128 2.078648 2.900104 3.187657
Let’s round the mean, and then mean the round:
print(round(mean(z)))
## [1] 3
print(mean(round(z)))
## [1] 3
Using the forward pipe operator, the same two values can be obtained thusly
z %>% mean() %>% round() %>% print()
## [1] 3
z %>% round() %>% mean() %>% print()
## [1] 3
What about functions with more than one argument? Let’s look at round, and round z to the nearest hundredth:
print(round(z,2))
## [1] 2.47 4.01 3.43 4.58 2.18 2.08 2.90 3.19
z %>% round(2) %>% print()
## [1] 2.47 4.01 3.43 4.58 2.18 2.08 2.90 3.19
The compound assignment operator isn’t in dplyr, but is available in magrittr.
library(magrittr)
It pipes an object forward into an expression and reassigns the resulting object back. Let’s take a look one of our previous examples:
z %>% mean() %>% round() %>% print()
## [1] 3
The resulting value is displayed, but it isn’t actually assigned to any data object. If we want to forward pipe z into mean and round, and re-assign the resulting value back to z, we use the compound assignment operator first, then the forward pipes for each additional function:
z %<>% mean() %>% round()
print(z)
## [1] 3
To illustrate the robustness of these operators, let’s take dataframe of student attributes and course results. You can find this dataframe, and more information about it here on the UCI repository.
students.df <- read.csv("../../../../Downloads/student-mat.csv", header = T, stringsAsFactors = T,sep = ";")
str(students.df)
## 'data.frame': 395 obs. of 33 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
Let’s imagine you wish to reduce the number of variables in your dataframe, and you only wish to view these variables for a particular school. You can use the forward pipe operator along with the select and filter function from dplyr.
Let’s select the three course grades along with the school, sex, and age.
students.reduced <- students.df %>% select(school, sex, age, G1, G2, G3)
head(students.reduced,12)
## school sex age G1 G2 G3
## 1 GP F 18 5 6 6
## 2 GP F 17 5 5 6
## 3 GP F 15 7 8 10
## 4 GP F 15 15 14 15
## 5 GP F 16 6 10 10
## 6 GP M 16 15 15 15
## 7 GP M 16 12 12 11
## 8 GP F 17 6 5 6
## 9 GP M 15 16 18 19
## 10 GP M 15 14 15 15
## 11 GP F 15 10 8 9
## 12 GP F 15 10 12 12
Now lets assume you only wish to look at male students that are age 17 or older. We can use the compound assignment operator pipe students.reduced into filter, and assign it back to the students.reduced object. Like so:
students.reduced %<>% filter(sex == "M" , age >= 17)
head(students.reduced,12)
## school sex age G1 G2 G3
## 1 GP M 17 6 5 5
## 2 GP M 17 8 8 10
## 3 GP M 17 9 7 8
## 4 GP M 18 7 4 0
## 5 GP M 17 10 0 0
## 6 GP M 17 5 0 0
## 7 GP M 18 6 5 0
## 8 GP M 19 5 0 0
## 9 GP M 17 16 12 13
## 10 GP M 17 7 6 0
## 11 GP M 17 10 10 10
## 12 GP M 17 5 8 7
And the data of interest is immediately reassigned to the students.reduced dataframe object. Let’s do something a little more complex, and find the average of each grade for male students 17 years or older. With forward pipe operators, this is trivial!
students.df %>% filter(sex == "M" | age >=17) %>% select(G1,G2,G3) %>% lapply(mean) %>% print()
## $G1
## [1] 11.14191
##
## $G2
## [1] 10.82838
##
## $G3
## [1] 10.50165
With a Tee pipe operator, you can pipe an object into two functions, granted that the function immediately after the the tee pipe does not return an object. Let’s get a scatter plot of age vs grade 1, and calculate the correlation matrix all with one line of code.
students.df %>% select(age, G1) %T>% plot() %>% cor()
## age G1
## age 1.0000000 -0.0640815
## G1 -0.0640815 1.0000000
The exposition operator allows you to pass only certain components of the left hand object to the right hand side by using the names of the components as arguments. This operator is helpful if you wish to select what columns of a data frame to pipe into a function. Take the following example:
students.df %>% filter(G1 > G2) %$% cor(age, G1)
## [1] 0.02727669
There are countless examples I could use to illustrate the robustness of pipe operators, but I thought I’d just write a quick primer about them with some elementary use cases. Hopefully I have convinced you to start using them in your day-to-day analysis.