Operations:

  1. filtering for rows;
  2. selecting specific columns;
  3. re-ordering rows;
  4. adding new columns;
  5. summarizing data.

To install dplyr:

install.packages("dplyr")

To load dplyr:

library(dplyr)

Important functions

select() select columns
filter() filter rows
arrange() re-order or arrange rows
mutate() create new columns
summarise() summarise values
group_by() allows for group operations in the “split-apply-combine” concept

Examples:

Data set - iris

Iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width from each of 3 species of iris.

dim(iris)
## [1] 150   5
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Selecting Columns Using select()

Select a set of columns: the Sepal.Length and the Sepal.Width columns.

sepalData <- select(iris, Sepal.Length, Sepal.Width)
head(sepalData)
##   Sepal.Length Sepal.Width
## 1          5.1         3.5
## 2          4.9         3.0
## 3          4.7         3.2
## 4          4.6         3.1
## 5          5.0         3.6
## 6          5.4         3.9

To select all the columns except a specific column, use the “-” (subtraction) operator:

head(select(iris, -Sepal.Length))
##   Sepal.Width Petal.Length Petal.Width Species
## 1         3.5          1.4         0.2  setosa
## 2         3.0          1.4         0.2  setosa
## 3         3.2          1.3         0.2  setosa
## 4         3.1          1.5         0.2  setosa
## 5         3.6          1.4         0.2  setosa
## 6         3.9          1.7         0.4  setosa

To select a range of columns by name, use the “:” (colon) operator:

head(select(iris, Petal.Length:Petal.Width))

To select all columns that start with the character string “pe”, use the function starts_with():

head(select(iris, starts_with("pe")))

Additional options to select columns:

ends_with() Select columns that end with a character string
contains() Select columns that contain a character string
matches() Select columns that match a regular expression
one_of() Select column names that are from a group of names

Selecting Rows Using filter()

Filter the rows for iris that have a length of sepal more than 6 centimeters.

head(filter(iris, Sepal.Length >= 6))
##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1          7.0         3.2          4.7         1.4 versicolor
## 2          6.4         3.2          4.5         1.5 versicolor
## 3          6.9         3.1          4.9         1.5 versicolor
## 4          6.5         2.8          4.6         1.5 versicolor
## 5          6.3         3.3          4.7         1.6 versicolor
## 6          6.6         2.9          4.6         1.3 versicolor

Filter the rows for iris that have a length of sepal more than 6 centimeters and have a width of sepal more than 3 centimeters.

filter(iris, Sepal.Length >= 6, Sepal.Width >= 3)

Filter the rows for iris in setosa and versicolor.

filter(iris, Species %in% c("setosa", "versicolor"))

It is possible to use the boolean operators (e.g. >, <, >=, <=, !=, %in%) to create the logical tests.

Pipe Operator: %>%

dplyr imports pipe operator(%>%) from package magrittr. This operator allows to pipe the output from one function to the input of another function.

iris %>% select(Sepal.Length, Sepal.Width) %>% head

Arranging Rows Using arrange()

To arrange rows by a particular column, list the name of the column you want to arrange the rows by:

iris %>% arrange(Sepal.Width) %>% head
##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1          5.0         2.0          3.5         1.0 versicolor
## 2          6.0         2.2          4.0         1.0 versicolor
## 3          6.2         2.2          4.5         1.5 versicolor
## 4          6.0         2.2          5.0         1.5  virginica
## 5          4.5         2.3          1.3         0.3     setosa
## 6          5.5         2.3          4.0         1.3 versicolor

Arranging the rows in the Sepal.Width column in a descending order.

iris %>% arrange(desc(Sepal.Width)) %>% head

Creating New Columns Using mutate()

The mutate() function will add new columns to the data frame. Create a new column called lws_proportion, which is the ratio of Sepal.Length to Sepal.Width.

iris %>% mutate(lws_proportion = Sepal.Length / Sepal.Width) %>% head
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species lws_proportion
## 1          5.1         3.5          1.4         0.2  setosa       1.457143
## 2          4.9         3.0          1.4         0.2  setosa       1.633333
## 3          4.7         3.2          1.3         0.2  setosa       1.468750
## 4          4.6         3.1          1.5         0.2  setosa       1.483871
## 5          5.0         3.6          1.4         0.2  setosa       1.388889
## 6          5.4         3.9          1.7         0.4  setosa       1.384615

You can create many new columns using mutate (separated by commas).

Creating summaries of the data frame using summarise()

The summarise() function will create summary statistics for a given column in the data frame such as finding the mean.

iris %>% summarise(avg_sepal_length = mean(Sepal.Length))
##   avg_sepal_length
## 1         5.843333

There are many other summary statistics: sd(), min(), max(), median(), sum(), n() (returns the length of vector), first() (returns first value in vector), last() (returns last value in vector) and n_distinct() (number of distinct values in vector).

Group operations using group_by()

Splitting the data frame by some variable, applying a function to the individual data frames and then combining the output.

iris %>% group_by(Species) %>% summarise(avg_sepal_length = mean(Sepal.Length), 
              min_sepal_length = min(Sepal.Length), max_sepal_length = max(Sepal.Length),
              total = n())
## # A tibble: 3 × 5
##      Species avg_sepal_length min_sepal_length max_sepal_length total
##       <fctr>            <dbl>            <dbl>            <dbl> <int>
## 1     setosa            5.006              4.3              5.8    50
## 2 versicolor            5.936              4.9              7.0    50
## 3  virginica            6.588              4.9              7.9    50

Further reading:

dplyr