Operations:
-
- filtering for rows;
- selecting specific columns;
- re-ordering rows;
- adding new columns;
- summarizing data.
To install dplyr:
install.packages("dplyr")
To load dplyr:
library(dplyr)
Important functions
select() | select columns |
filter() | filter rows |
arrange() | re-order or arrange rows |
mutate() | create new columns |
summarise() | summarise values |
group_by() | allows for group operations in the “split-apply-combine” concept |
Examples:
Data set - iris
Iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width from each of 3 species of iris.
dim(iris)
## [1] 150 5
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Selecting Columns Using select()
Select a set of columns: the Sepal.Length and the Sepal.Width columns.
sepalData <- select(iris, Sepal.Length, Sepal.Width)
head(sepalData)
## Sepal.Length Sepal.Width
## 1 5.1 3.5
## 2 4.9 3.0
## 3 4.7 3.2
## 4 4.6 3.1
## 5 5.0 3.6
## 6 5.4 3.9
To select all the columns except a specific column, use the “-” (subtraction) operator:
head(select(iris, -Sepal.Length))
## Sepal.Width Petal.Length Petal.Width Species
## 1 3.5 1.4 0.2 setosa
## 2 3.0 1.4 0.2 setosa
## 3 3.2 1.3 0.2 setosa
## 4 3.1 1.5 0.2 setosa
## 5 3.6 1.4 0.2 setosa
## 6 3.9 1.7 0.4 setosa
To select a range of columns by name, use the “:” (colon) operator:
head(select(iris, Petal.Length:Petal.Width))
To select all columns that start with the character string “pe”, use the function starts_with():
head(select(iris, starts_with("pe")))
Additional options to select columns:
ends_with() | Select columns that end with a character string |
contains() | Select columns that contain a character string |
matches() | Select columns that match a regular expression |
one_of() | Select column names that are from a group of names |
Selecting Rows Using filter()
Filter the rows for iris that have a length of sepal more than 6 centimeters.
head(filter(iris, Sepal.Length >= 6))
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 7.0 3.2 4.7 1.4 versicolor
## 2 6.4 3.2 4.5 1.5 versicolor
## 3 6.9 3.1 4.9 1.5 versicolor
## 4 6.5 2.8 4.6 1.5 versicolor
## 5 6.3 3.3 4.7 1.6 versicolor
## 6 6.6 2.9 4.6 1.3 versicolor
Filter the rows for iris that have a length of sepal more than 6 centimeters and have a width of sepal more than 3 centimeters.
filter(iris, Sepal.Length >= 6, Sepal.Width >= 3)
Filter the rows for iris in setosa and versicolor.
filter(iris, Species %in% c("setosa", "versicolor"))
It is possible to use the boolean operators (e.g. >, <, >=, <=, !=, %in%) to create the logical tests.
Pipe Operator: %>%
dplyr imports pipe operator(%>%) from package magrittr. This operator allows to pipe the output from one function to the input of another function.
iris %>% select(Sepal.Length, Sepal.Width) %>% head
Arranging Rows Using arrange()
To arrange rows by a particular column, list the name of the column you want to arrange the rows by:
iris %>% arrange(Sepal.Width) %>% head
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.0 2.0 3.5 1.0 versicolor
## 2 6.0 2.2 4.0 1.0 versicolor
## 3 6.2 2.2 4.5 1.5 versicolor
## 4 6.0 2.2 5.0 1.5 virginica
## 5 4.5 2.3 1.3 0.3 setosa
## 6 5.5 2.3 4.0 1.3 versicolor
Arranging the rows in the Sepal.Width column in a descending order.
iris %>% arrange(desc(Sepal.Width)) %>% head
Creating New Columns Using mutate()
The mutate() function will add new columns to the data frame. Create a new column called lws_proportion, which is the ratio of Sepal.Length to Sepal.Width.
iris %>% mutate(lws_proportion = Sepal.Length / Sepal.Width) %>% head
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species lws_proportion
## 1 5.1 3.5 1.4 0.2 setosa 1.457143
## 2 4.9 3.0 1.4 0.2 setosa 1.633333
## 3 4.7 3.2 1.3 0.2 setosa 1.468750
## 4 4.6 3.1 1.5 0.2 setosa 1.483871
## 5 5.0 3.6 1.4 0.2 setosa 1.388889
## 6 5.4 3.9 1.7 0.4 setosa 1.384615
You can create many new columns using mutate (separated by commas).
Creating summaries of the data frame using summarise()
The summarise() function will create summary statistics for a given column in the data frame such as finding the mean.
iris %>% summarise(avg_sepal_length = mean(Sepal.Length))
## avg_sepal_length
## 1 5.843333
There are many other summary statistics: sd(), min(), max(), median(), sum(), n() (returns the length of vector), first() (returns first value in vector), last() (returns last value in vector) and n_distinct() (number of distinct values in vector).
Group operations using group_by()
Splitting the data frame by some variable, applying a function to the individual data frames and then combining the output.
iris %>% group_by(Species) %>% summarise(avg_sepal_length = mean(Sepal.Length),
min_sepal_length = min(Sepal.Length), max_sepal_length = max(Sepal.Length),
total = n())
## # A tibble: 3 × 5
## Species avg_sepal_length min_sepal_length max_sepal_length total
## <fctr> <dbl> <dbl> <dbl> <int>
## 1 setosa 5.006 4.3 5.8 50
## 2 versicolor 5.936 4.9 7.0 50
## 3 virginica 6.588 4.9 7.9 50
Further reading:
dplyr