Introduction
Within the tidyverse, one of the most useful tools for exploring and manipulating data is dplyr, described as the grammar of data manipulation. Since this course is geared fundamentally towards the process of data analysis and visualization, rather than basic programming in R, this tab provides an overview/review of the core aspects of dplyr, which are great to use in Practical 4 (Linear Regression ++) as well as other practicals and assignments.
It should be noted that this tab is based around dplyr 0.8.3, and not dplyr 1.0.0 (released May 15th 2020). As such, some minor differences exist, but not to the extent of requiring major changes to this tab after that release date.
One useful resource is the dplyr cheat sheet, which provides a really useful overview of the following sections; this tab, however, aims to provide a much more expanded overview! Furthermore, those completely new to dplyr are encouraged to review the data import chapter in R for Data Science.
Installation & Loading
Since dplyr is part of the tidyverse, the easiest method of installing it is by installing the tidyverse, which, given you have completed the previous practicals, should already be in your package library ready to be called.
If you haven't installed the tidyverse yet, run the following:
install.packages("tidyverse")
If you have installed the tidyverse, simply call it into your project:
library(tidyverse)
dplyr, alongside the other packages within the tidyverse, is now available to use!
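As a purely optional sanity check (not something the practicals require), you can confirm which version of dplyr you have installed, since this tab is written around dplyr 0.8.3 as noted in the introduction:

# Optional: report the installed dplyr version
packageVersion("dplyr")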
In order to remain consistent within this tab, manipulations will be made to one specific external dataset. This dataset was introduced and imported in the previous tab, An introduction to the Ins and Outs of Importing Data, and comes from the online data resource The World Bank; an open databank providing global insights into population, life expectancy, GDP and many other indicators of global development. These indicators typically run from the 1960s up to 2019/2020, and are incredibly useful for data analysis as they all follow a similar data file format.
Please note: this dataset should not be used for Assignment 1; however, you are welcome to apply these techniques and create your own dataset by following these steps with different indicators from The World Bank.
Variable Introduction
For those not familiar with the previous tab, which introduced this data: the data used will be the GDP of a large number of countries between 1960 and 2014 (2019 is included, however data is missing between 2015 and 2019). This dataset will be referred to as gdp.dat; how to import it directly from the World Bank can be found in the previous tab.
The Core of dplyr
At the heart of the dplyr package are several core data manipulation functions:

- mutate() - creates new variables as functions of existing variables
- select() - picks variables based upon their names
- filter() - picks cases based on their values
- summarise() - reduces values down to a single summary
- arrange() - changes the ordering of rows

There are plenty of other functions within the package; to read more, check out the dplyr reference manual.
mutate()
In some situations in data analysis, you would like to create a new variable which is the product of one or more other variables. For example, combining two variables (through a typical mathematical operator), or applying another function to a variable (such as the log or mean). This can be done with the mutate() function, which takes one or more existing variables and produces a new variable as a result.
The general syntax of the mutate function is:
mutate(data_frame, new_variable_name = [function of existing variables])
For example, let us consider we would like to determine the average GDP of a nation for some research based purpose, say to see which nation has the highest average GDP over a specific time period. Using the syntax previously highlighted we can observe the following:
# gdp.dat <- ...          : the dataset, and the location of the new variable
# mutate(gdp.dat, ...)    : the mutate function applied to that dataset
# gdp_ave                 : the name of the new variable
# round(rowMeans(...), 3) : the function of existing variables that creates it
gdp.dat <- mutate(gdp.dat, gdp_ave = round(rowMeans(gdp.dat[44:59], na.rm = TRUE), 3))
The function used in this case takes the years 2000 to 2014 and creates an average of their GDP (the rowMeans() function), which is rounded to 3 decimal places (using the round() function). The result is then added to the specified dataset.
As a whole, mutate() is an incredibly useful tool for making the most of the data you have without having to work outside of R in other programs such as Excel, SPSS or SAS, especially for quick operations such as averaging across rows or correcting specific values (such as flipped values).
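As one more small sketch of such a quick operation (the variable name gdp_ave_log is purely illustrative, and this assumes the gdp_ave column created above exists), a log transform of the average GDP can be added in exactly the same way:

# Hypothetical example: a log-scale version of the average GDP created above
gdp.dat <- mutate(gdp.dat, gdp_ave_log = round(log(gdp_ave), 3))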
select()
Very often when first handling a dataset (such as this one) there is only a limited amount of data you actually need for your analysis, whether because some values are not useful or because there are many junk variables which are simply not relevant to the work you are doing. As such, select() allows you to choose variables and either isolate them or reorder them accordingly. It should be noted that there is a huge amount this function can do, so see the specific reference page for more.
Firstly, let's start with the simple selection of variables. Within our current dataset, say you would like to only observe the variables between 2000 and 2014. You would therefore like to keep the following variables: Country Name, the years 2000 to 2014 (inclusive), as well as the newly mutated variable containing the average over this period.
To do this you would use the following code:
gdp.21st <- select(gdp.dat, `Country Name`, `2000`:`2014`, `gdp_ave`)
Breaking down this code, you can see that you initially specify the data (in this case gdp.dat), before listing the other variables you want, in the order you want them displayed. What is great about this function is its simplicity: after specifying your data you simply refer to the variables within that dataset that you want to include. A helpful hint is to surround each variable name with backticks (``) as is done here, to ensure the name runs correctly, since this avoids the confusion caused by spaces or purely numerical names.
Alongside providing specific names of your variables, there are other helpers which can be used within this function. As demonstrated, the operator : can be used, alongside c(). However, there are also some select()-specific functions which are incredibly useful.
For example, say we would simply like to bring a variable to the front of the order. This can be done (for the variable gdp_ave) by moving gdp_ave to the front of the selection and finishing the line with everything(), indicating that you would like all other variables to follow from there. After running this code you will see this variable has moved to the front of the dataset.
gdp.21st <- select(gdp.21st, `Country Name`, `gdp_ave`, everything())
An additional nifty trick is useful if you have a huge number of variables in your list (like we do here); say we are looking to examine GDP only at the beginning of specific decades, or during one specific decade. This can be done using the operator :, but can also be done using the ends_with() or starts_with() functions as shown:
# To select all GDPs from the beginning of each decade
gdp.decades <- select(gdp.dat, `Country Name`, ends_with("0"))
# To select all GDPs from the 1990s
gdp.1990s <- select(gdp.dat, `Country Name`, starts_with("199"))
There are also several other useful helper functions, with more information being found here.
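As one brief, hypothetical illustration of another helper, contains() keeps any variable whose name includes a given string; with this dataset, the string "200" picks out the years 2000 to 2009 (the object name gdp.2000s is simply chosen for this example):

# Select all GDPs from the 2000s, i.e. any year whose name contains "200"
gdp.2000s <- select(gdp.dat, `Country Name`, contains("200"))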
An extension of select() is the rename() function. This, as its name suggests, simply allows the user to rename specific columns within the presented dataset.
For example, let us return to our GDP dataset and say we wish to rename our mutated variable from earlier, from gdp_ave to GDP Average. To do this we would use the rename() function, which requires three main components: the dataset, the new name, and the existing variable name:
gdp.21st <- rename(gdp.21st, "GDP Average" = gdp_ave)
Similar to the select() function, rename() can also use those additional and unique helper functions inside of it, such as:

- starts_with() / ends_with() / contains() - to change only those variables which start with, end with or contain a specific character/number
- everything() - to rename everything
- c() - to specify a selection of items together

As a whole, the select() (and rename()) functions are useful for the direct manipulation of an overall dataset. These changes, however, tend to be very general in nature and, unlike functions such as filter(), only affect whole variables rather than individual observations.
filter()
Unlike select(), the function filter() is more flexible in selecting data, as it aims to select rows (or cases) which meet specified conditions. When using filter() there are several useful functions (and operators) which can be used to select data within the specified dataset:
- == - is exactly equal to X
- > / >= - is greater than (or greater than or equal to) X
- < / <= - is less than (or less than or equal to) X
- & - and (used for combining multiple conditions)
- | - or (used for combining multiple conditions)
- ! - not
- xor() - indicates an element-wise exclusive OR
- is.na() - indicates which values hold the NA / missing value
- between() - indicates a minimum and maximum the value must fall between
- near() - indicates that a value should be near another value

Let's look at some examples, once again using the gdp.21st dataset.
Firstly, let us start generally and consider pulling the data for one specific country, let's say the United Kingdom. As you can see, this pulls the country which is exactly named United Kingdom.
filter(gdp.21st, `Country Name` == "United Kingdom")
## # A tibble: 1 × 17
## `Country Name` `GDP Average` `2000` `2001` `2002` `2003` `2004` `2005`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 United Kingdom 2417615480465 1.67e12 1.65e12 1.79e12 2.06e12 2.42e12 2.54e12
## # … with 9 more variables: `2006` <dbl>, `2007` <dbl>, `2008` <dbl>,
## # `2009` <dbl>, `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>,
## # `2014` <dbl>
Breaking this function down, we can see the following components:

- the dataset: gdp.21st
- the variable to filter on: Country Name
- the operator: == (exactly equal to)
- the value to match: United Kingdom

Let's now consider something more complex: which countries in the year 2010 have an NA value, or a value lower than 1 billion dollars?
filter(gdp.21st, is.na(`2010`) | `2010` < 1000000000)
## # A tibble: 31 × 17
## `Country Name` `GDP Average` `2000` `2001` `2002` `2003` `2004` `2005`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 American Samoa 565769231. NA NA 5.12e8 5.24e8 5.09e8 5 e8
## 2 Channel Islands 7944980687. 6.44e9 6.23e9 6.66e9 7.33e9 8.55e9 8.83e9
## 3 Comoros 743770628. 3.51e8 3.79e8 4.26e8 5.47e8 6.34e8 6.54e8
## 4 Curacao 3011508380. NA NA NA NA NA NA
## 5 Dominica 416977315. 3.33e8 3.40e8 3.33e8 3.43e8 3.67e8 3.64e8
## 6 Micronesia, Fe… 268661875 2.33e8 2.41e8 2.43e8 2.45e8 2.40e8 2.50e8
## 7 Gibraltar NaN NA NA NA NA NA NA
## 8 Guinea-Bissau 689378116. 3.71e8 3.93e8 4.18e8 4.77e8 5.32e8 5.87e8
## 9 Grenada 694177176. 5.20e8 5.20e8 5.40e8 5.91e8 5.99e8 6.96e8
## 10 Not classified NaN NA NA NA NA NA NA
## # … with 21 more rows, and 9 more variables: `2006` <dbl>, `2007` <dbl>,
## # `2008` <dbl>, `2009` <dbl>, `2010` <dbl>, `2011` <dbl>, `2012` <dbl>,
## # `2013` <dbl>, `2014` <dbl>
Once more, if we break this down, it follows the same layout as before, but extends it using the | (or) operator. Similar layouts can be used for multiple conditions combined with & (and).
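As a small, hypothetical sketch of combining conditions with & (the cut-off values here are chosen purely for illustration), the following keeps countries whose 2010 GDP sits between 1 and 100 billion dollars and whose 2014 value is not missing:

# Countries with a 2010 GDP between 1 and 100 billion dollars,
# excluding those with a missing value in 2014
filter(gdp.21st, between(`2010`, 1e9, 1e11) & !is.na(`2014`))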
As a function, filter() is incredibly versatile, as it allows the conditional extraction of observations from your dataset, which can then be used for anything from specific analyses through to visualization.
summarise()
When first investigating a dataset, it is sometimes necessary to summarise your data. This can be done through the summarise()/summarize() functions (they are synonyms, provided for those of you who prefer American-English spelling over British-English).
This (like many tidyverse functions) is really flexible in its usage, and can be used with grouped variables or simply with single variables. Because this function is especially useful on grouped data, the dataset diamonds will be used.
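If you would like a quick feel for this dataset before summarising it (diamonds ships with ggplot2, so it is available once the tidyverse is loaded), glimpse() gives a compact overview; this is simply an optional look, not a required step:

# Compact overview of the diamonds dataset: one line per variable
glimpse(diamonds)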
But in what ways can we summarise the data!!, I hear you call at your computers… (okay… I doubt it, but still, I've got to make this engaging). Within summarise() you can use a typical array of functions in order to collect information on your data as a whole, including, but not limited to:
- mean(), median()
- sd(), IQR(), mad()
- min(), max(), quantile()
- first(), last(), nth()
- n(), n_distinct()
- any(), all()
Let us first consider summarising the dataset as a whole, using the mean()
function:
summarise(diamonds,
mean = mean(price),
n = n())
## # A tibble: 1 × 2
## mean n
## <dbl> <int>
## 1 3933. 53940
Breaking down this function, you can see that we have all the typical components of a tidyverse function:

- the dataset: diamonds
- the new summary variables: mean = mean(price) & n = n()

This, as you can see, produces the mean price of all the diamonds within the dataset, along with the number of diamonds.
But how about we split this into groups, say via the cut of the diamonds within the dataset? For this, a slightly different method can be used, which includes the use of pipes (%>%). Pipes work within R the same way they work in reality, funnelling the data down from one function to the next, like so:
diamonds %>%
group_by(cut) %>%
summarise(mean = mean(price), n = n())
## # A tibble: 5 × 3
## cut mean n
## <ord> <dbl> <int>
## 1 Fair 4359. 1610
## 2 Good 3929. 4906
## 3 Very Good 3982. 12082
## 4 Premium 4584. 13791
## 5 Ideal 3458. 21551
From this, as can be seen, the code groups the dataset diamonds by cut and then summarises the mean price and number of diamonds within each group. This of course can be expanded beyond a single function, such as here, where we also look at the standard deviation (sd()), the minimum (min()) and the maximum (max()) of the diamonds.
diamonds %>%
group_by(cut) %>%
summarise(mean = mean(price), sd = sd(price), min = min(price), max = max(price), n = n())
## # A tibble: 5 × 6
## cut mean sd min max n
## <ord> <dbl> <dbl> <int> <int> <int>
## 1 Fair 4359. 3560. 337 18574 1610
## 2 Good 3929. 3682. 327 18788 4906
## 3 Very Good 3982. 3936. 336 18818 12082
## 4 Premium 4584. 4349. 326 18823 13791
## 5 Ideal 3458. 3808. 326 18806 21551
And finally, these are not limited to functions which are built into R; mathematical operators can also be used within this function. For example:
diamonds %>%
  group_by(cut) %>%
  summarise(mean = mean(price),
            n = n(),
            CI_UB = mean(price) + (sd(price) / sqrt(n())) * 1.96,
            CI_LB = mean(price) - (sd(price) / sqrt(n())) * 1.96)
## # A tibble: 5 × 5
## cut mean n CI_UB CI_LB
## <ord> <dbl> <int> <dbl> <dbl>
## 1 Fair 4359. 1610 4533. 4185.
## 2 Good 3929. 4906 4032. 3826.
## 3 Very Good 3982. 12082 4052. 3912.
## 4 Premium 4584. 13791 4657. 4512.
## 5 Ideal 3458. 21551 3508. 3407.
Here you can see that the operators +, -, /, * and the sqrt() function were used to calculate the confidence intervals for each group.
Overall, within data analysis this function can be incredibly helpful in generating specific summary statistics for your analysis, which are then easily reported. Additionally, the call itself can easily be copied and pasted, or wrapped into your own personal functions, to make your workflow more efficient when conducting repetitive analyses.
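As a minimal sketch of that last idea (the function name summarise_price and its arguments are invented here for illustration, and group_by_at() is used simply because it accepts the grouping variable as a character string), a grouped summary can be wrapped into a small reusable function:

# Hypothetical helper: summarise price for any grouping variable given by name
summarise_price <- function(data, group_var) {
  data %>%
    group_by_at(group_var) %>%
    summarise(mean = mean(price), sd = sd(price), n = n())
}

summarise_price(diamonds, "cut")
summarise_price(diamonds, "color")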
arrange()
Within RStudio it is possible to view your data as a data frame in a separate tab. Within this tab, it is possible to order and examine your data visually. These changes, however, are not permanent, and at times you may want to present data in a specific order; to do this, you can use the arrange() function.
Starting simply, say we would like to order the countries in our gdp.21st data frame by GDP Average, from smallest to largest. This can easily be achieved through the following code:
gdp.21st.arr <- arrange(gdp.21st, `GDP Average`)
This function is extremely simple, as it only requires the dataset you would like to use (in this case gdp.21st) and at least one ordering variable (GDP Average in this case). However, it is also possible to order by multiple variables.
Returning to our diamonds dataset, say we would like to order by price and carat; this simply requires the addition of the second variable, in the order you wish the data to be sorted:
dia.arr.1 <- arrange(diamonds, price, carat)
Adding the - operator reverses the order applied for one or both of the variables listed:
dia.arr.2 <- arrange(diamonds, -price, carat)
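Note that dplyr also provides the desc() helper, which achieves the same descending order as the - operator and is arguably more readable; this alternative is mentioned here as an aside (and the object name dia.arr.2b is simply chosen for the example):

# Equivalent to the previous line, using desc() instead of the - operator
dia.arr.2b <- arrange(diamonds, desc(price), carat)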
Finally, it is also possible to arrange the data within the groups themselves. To do this:
by_cut <- diamonds %>% group_by(cut)

dia.arr.3 <- by_cut %>%
  arrange(price, .by_group = TRUE)
This arranges the rows by price within each group (cut) individually.
Overall, the arrange() function is most useful in combination with the other techniques covered, but is nonetheless a valuable part of data manipulation.
Functions Technically not in dplyr
but are REALLY helpful
Although the tidyverse is an incredibly useful and amazing resource of functions for everything from data importing through manipulation, analysis and visualization, some tasks are just as easily performed using functions which exist outside of the typical tidyverse. This section will cover a small selection that (I personally) think are incredibly useful in the manipulation of data.
These will be:

- c() - for linking individual units
- rbind() / cbind() - for combining R objects by rows (rbind()) or columns (cbind())
- rownames() / colnames() - for naming rows/columns

Let's start off with the function c(). This function is undoubtedly simple, but can be a foundation for many useful and core operations. Put simply, c() combines all values within its parameters into a vector, which can then be combined/converted into datasets later. For example, say we would like to track the number of goals scored by three football teams:
team1 <- c(1, 2, 4, 1, 0, 0, 1)
team2 <- c(0, 3, 0, 1, 0, 1, 2)
team3 <- c(4, 1, 0, 1, 2, 2, 3)
Once run, this produces three data vectors, which contain these values.
Next, let us consider rbind() and cbind(). These work in a similar way to c(), but instead combine data either as rows (rbind()) or columns (cbind()), forming datasets as a result. Let's now produce a pair of datasets using rbind() and cbind().
match.wide <- rbind(team1, team2, team3)
match.long <- cbind(team1, team2, team3)
These can then be converted into dataframes using the function: as.data.frame()
, so that they can be manipulated using the dplyr functions.
match.wide <- as.data.frame(match.wide)
match.long <- as.data.frame(match.long)
Finally, if you have produced a data frame from scratch, you may wish to name your variables or observations accordingly, as this makes them more accessible in the future. To do this, it is possible to use the rename() function, or the rownames()/colnames() functions. The latter can be slightly easier to use if you are changing all or most of a data frame's labels, since you typically list them all anyway.
Firstly, we can rename match.wide, labelling each variable as a specific match and each row as a team.
colnames(match.wide) <- c("Match 1", "Match 2", "Match 3", "Match 4", "Match 5", "Match 6", "Match 7")
rownames(match.wide) <- c("Team 1", "Team 2", "Team 3")
Similarly we can do the same for match.long
but naming the variables as teams and the observations as matches.
colnames(match.long) <- c("Team 1", "Team 2", "Team 3")
rownames(match.long) <- c("Match 1", "Match 2", "Match 3", "Match 4", "Match 5", "Match 6", "Match 7")
At this point, you may be asking why this dataset is being created in two separate ways, and which way is better. To that: you should always examine what you are looking to gain from your data.
For example, let us consider the data we have just created and combined. Say we are looking at the differences between teams over a set number of matches; for this we would most likely use match.long, since we are comparing across the team variables. By contrast, if we were looking for differences between matches, we would consider the match.wide dataset. Although this is only a rule of thumb, the easy way to consider it is: whatever you want to compare between should form the variables (columns) of your data frame.
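As a small, hypothetical illustration of that rule of thumb (using base R's colMeans(), in keeping with this section's focus on functions outside of dplyr), the column means of each data frame give per-team averages from match.long and per-match averages from match.wide:

# One mean per team (teams are the columns of match.long)
colMeans(match.long)

# One mean per match (matches are the columns of match.wide)
colMeans(match.wide)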
To learn more about transforming data frames, explore the other packages within the tidyverse and the ways they can be used effectively alongside these skills.