Introduction
Within the tidyverse, one of the most useful tools for exploring and manipulating data is dplyr, described as the grammar of data manipulation. Since this course is geared fundamentally towards the process of data analysis and visualization, rather than basic programming in R, this tab provides an overview/review of the core aspects of dplyr, which are great to use in Practical 4 (Linear Regression ++) as well as other practicals and assignments.
It should be noted that this tab is based around dplyr 0.8.3, and not dplyr 1.0.0 (released May 15th 2020). As such, some minor differences exist, but not to the extent of requiring major changes to this tab after that release date.
One useful resource is the dplyr cheat sheet, which provides a really useful overview of the following sections; this tab, however, aims to provide a much more expanded overview! Furthermore, those completely new to dplyr are encouraged to review the data import chapter in R for Data Science.
Installation & Loading
Since dplyr is part of the tidyverse, the easiest method of installing it is by installing the tidyverse, which, given you have completed the previous practicals, should already be in your package library ready to be called.
If you haven't installed the tidyverse yet, run the following:
install.packages("tidyverse")
If you have installed the tidyverse, simply call it into your project:
library(tidyverse)
dplyr, alongside the other packages within the tidyverse, is now available to use!
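As a purely optional sanity check (not something the practicals require), you can confirm which version of dplyr you have installed, since this tab is written around dplyr 0.8.3 as noted in the introduction:

# Optional: report the installed dplyr version
packageVersion("dplyr")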
In order to remain consistent within this tab, manipulations will be made to one specific external dataset. This dataset was introduced and imported in the previous tab, An introduction to the Ins and Outs of Importing Data, and comes from the online data resource The World Bank; an open databank providing global insights into population, life expectancy, GDP and many other indicators of global development. These indicators typically run from the 1960s up to 2019/2020, and are incredibly useful for data analysis as they all follow a similar data file format.
Please note: this dataset should not be used for Assignment 1; however, you are welcome to apply these techniques and create your own dataset by following these steps with different indicators from The World Bank.
Variable Introduction
For those not familiar with the previous tab, which introduced this data: the data used will be the GDP of a large number of countries between 1960 and 2014 (2019 is included, however data is missing between 2015 and 2019). This dataset will be referred to as gdp.dat; how to import it directly from the World Bank can be found in the previous tab.
The Core of dplyr
At the heart of the dplyr package are several core data manipulation functions:

- mutate() - creates new variables as functions of existing variables
- select() - picks variables based upon their names
- filter() - picks cases based on their values
- summarise() - reduces values down to a single summary
- arrange() - changes the ordering of rows

There are plenty of other functions within the package; to read more, check out the dplyr reference manual.
mutate()
In some situations in data analysis, you would like to create a new variable which is the product of one or more other variables. For example, combining two variables (through a typical mathematical operator), or applying another function to a variable (such as the log or mean). This can be done with the mutate() function, which takes one or more existing variables and produces a new variable as a result.
The general syntax of the mutate function is:
mutate(data_frame, new_variable_name = [function of existing variables])
For example, let us consider we would like to determine the average GDP of a nation for some research based purpose, say to see which nation has the highest average GDP over a specific time period. Using the syntax previously highlighted we can observe the following:
# gdp.dat <- ...          : the dataset, and the location of the new variable
# mutate(gdp.dat, ...)    : the mutate function applied to that dataset
# gdp_ave                 : the name of the new variable
# round(rowMeans(...), 3) : the function of existing variables that creates it
gdp.dat <- mutate(gdp.dat, gdp_ave = round(rowMeans(gdp.dat[44:59], na.rm = TRUE), 3))
The function used in this case takes the years 2000 to 2014 and creates an average of their GDP (the rowMeans() function), which is rounded to 3 decimal places (using the round() function). The result is then added to the specified dataset.
As a whole, mutate() is an incredibly useful tool for making the most of the data you have without having to work outside of R in other programs such as Excel, SPSS or SAS, especially for quick operations such as averaging across rows or correcting specific values (such as flipped values).
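As one more small sketch of such a quick operation (the variable name gdp_ave_log is purely illustrative, and this assumes the gdp_ave column created above exists), a log transform of the average GDP can be added in exactly the same way:

# Hypothetical example: a log-scale version of the average GDP created above
gdp.dat <- mutate(gdp.dat, gdp_ave_log = round(log(gdp_ave), 3))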
select()
Very often when first handling a dataset (such as this one) there is only a limited amount of data you actually need for your analysis, whether because some values are not useful or because there are many junk variables which are simply not relevant to the work you are doing. As such, select() allows you to choose variables and either isolate them or reorder them accordingly. It should be noted that there is a huge amount this function can do, so see the specific reference page for more.
Firstly, let's start with the simple selection of variables. Within our current dataset, say you would like to only observe the variables between 2000 and 2014. You would therefore like to keep the following variables: Country Name, the years 2000 to 2014 (inclusive), as well as the newly mutated variable containing the average over this period.
To do this you would use the following code:
gdp.21st <- select(gdp.dat, `Country Name`, `2000`:`2014`, `gdp_ave`)
Breaking down this code, you can see that you initially specify the data (in this case gdp.dat), before listing the other variables you want, in the order you want them displayed. What is great about this function is its simplicity: after specifying your data you simply refer to the variables within that dataset that you want to include. A helpful hint is to surround each variable name with backticks (``) as is done here, to ensure the name runs correctly, since this avoids the confusion caused by spaces or purely numerical names.
Alongside providing specific names of your variables, there are other helpers which can be used within this function. As demonstrated, the operator : can be used, alongside c(). However, there are also some select()-specific functions which are incredibly useful.
For example, say we would simply like to bring a variable to the front of the order. This can be done (for the variable gdp_ave) by moving gdp_ave to the front of the selection and finishing the line with everything(), indicating that you would like all other variables to follow from there. After running this code you will see this variable has moved to the front of the dataset.
gdp.21st <- select(gdp.21st, `Country Name`, `gdp_ave`, everything())
An additional nifty trick is useful if you have a huge number of variables in your list (like we do here); say we are looking to examine GDP only at the beginning of specific decades, or during one specific decade. This can be done using the operator :, but can also be done using the ends_with() or starts_with() functions as shown:
# To select all GDPs from the beginning of each decade
gdp.decades <- select(gdp.dat, `Country Name`, ends_with("0"))
# To select all GDPs from the 1990s
gdp.1990s <- select(gdp.dat, `Country Name`, starts_with("199"))
There are also several other useful helper functions, with more information being found here.
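As one brief, hypothetical illustration of another helper, contains() keeps any variable whose name includes a given string; with this dataset, the string "200" picks out the years 2000 to 2009 (the object name gdp.2000s is simply chosen for this example):

# Select all GDPs from the 2000s, i.e. any year whose name contains "200"
gdp.2000s <- select(gdp.dat, `Country Name`, contains("200"))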
An extension of select() is the rename() function. This, as its name suggests, simply allows the user to rename specific columns within the presented dataset.
For example, let us return to our GDP dataset and say we wish to rename our mutated variable from earlier, from gdp_ave to GDP Average. To do this we would use the rename() function, which requires three main components: the dataset, the new name, and the existing variable name:
gdp.21st <- rename(gdp.21st, "GDP Average" = gdp_ave)
Similar to the select() function, rename() can also use those additional and unique helper functions inside of it, such as:

- starts_with() / ends_with() / contains() - to change only those variables which start with, end with or contain a specific character/number
- everything() - to rename everything
- c() - to specify a selection of items together

As a whole, the select() (and rename()) functions are useful for the direct manipulation of an overall dataset. These changes, however, tend to be very general in nature and, unlike functions such as filter(), only affect whole variables rather than individual observations.
filter()
Unlike select(), the function filter() is more flexible in selecting data, as it aims to select rows (or cases) which meet specified conditions. When using filter() there are several useful functions (and operators) which can be used to select data within the specified dataset:
- == - is exactly equal to X
- > / >= - is greater than (or greater than or equal to) X
- < / <= - is less than (or less than or equal to) X
- & - and (used for combining multiple conditions)
- | - or (used for combining multiple conditions)
- ! - not
- xor() - indicates an element-wise exclusive OR
- is.na() - indicates which values hold the NA / missing value
- between() - indicates a minimum and maximum the value must fall between
- near() - indicates that a value should be near another value

Let's look at some examples, once again using the gdp.21st dataset.
Firstly, let us start generally and consider pulling the data for one specific country, let's say the United Kingdom. As you can see, this pulls the country which is exactly named United Kingdom.
filter(gdp.21st, `Country Name` == "United Kingdom")
## # A tibble: 1 × 17
## `Country Name` `GDP Average` `2000` `2001` `2002` `2003` `2004` `2005`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 United Kingdom 2417615480465 1.67e12 1.65e12 1.79e12 2.06e12 2.42e12 2.54e12
## # … with 9 more variables: `2006` <dbl>, `2007` <dbl>, `2008` <dbl>,
## # `2009` <dbl>, `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>,
## # `2014` <dbl>
Breaking this function down, we can see the following components:

- the dataset: gdp.21st
- the variable to filter on: Country Name
- the operator: == (exactly equal to)
- the value to match: United Kingdom

Let's now consider something more complex: which countries in the year 2010 have an NA value, or a value lower than 1 billion dollars?
filter(gdp.21st, is.na(`2010`) | `2010` < 1000000000)
## # A tibble: 31 × 17
## `Country Name` `GDP Average` `2000` `2001` `2002` `2003` `2004` `2005`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 American Samoa 565769231. NA NA 5.12e8 5.24e8 5.09e8 5 e8
## 2 Channel Islands 7944980687. 6.44e9 6.23e9 6.66e9 7.33e9 8.55e9 8.83e9
## 3 Comoros 743770628. 3.51e8 3.79e8 4.26e8 5.47e8 6.34e8 6.54e8
## 4 Curacao 3011508380. NA NA NA NA NA NA
## 5 Dominica 416977315. 3.33e8 3.40e8 3.33e8 3.43e8 3.67e8 3.64e8
## 6 Micronesia, Fe… 268661875 2.33e8 2.41e8 2.43e8 2.45e8 2.40e8 2.50e8
## 7 Gibraltar NaN NA NA NA NA NA NA
## 8 Guinea-Bissau 689378116. 3.71e8 3.93e8 4.18e8 4.77e8 5.32e8 5.87e8
## 9 Grenada 694177176. 5.20e8 5.20e8 5.40e8 5.91e8 5.99e8 6.96e8
## 10 Not classified NaN NA NA NA NA NA NA
## # … with 21 more rows, and 9 more variables: `2006` <dbl>, `2007` <dbl>,
## # `2008` <dbl>, `2009` <dbl>, `2010` <dbl>, `2011` <dbl>, `2012` <dbl>,
## # `2013` <dbl>, `2014` <dbl>
Once more, if we break this down, it follows the same layout as before, but extends it using the | (or) operator. Similar layouts can be used for multiple conditions combined with & (and).
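As a small, hypothetical sketch of combining conditions with & (the cut-off values here are chosen purely for illustration), the following keeps countries whose 2010 GDP sits between 1 and 100 billion dollars and whose 2014 value is not missing:

# Countries with a 2010 GDP between 1 and 100 billion dollars,
# excluding those with a missing value in 2014
filter(gdp.21st, between(`2010`, 1e9, 1e11) & !is.na(`2014`))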
As a function, filter() is incredibly versatile, as it allows the conditional extraction of observations from your dataset, which can then be used for anything from specific analyses through to visualization.
summarise()
When first investigating a dataset, it is sometimes necessary to summarise your data. This can be done through the summarise()/summarize() functions (they are synonyms, provided for those of you who prefer American-English spelling over British-English).
This (like many tidyverse functions) is really flexible in its usage, and can be used with grouped variables or simply with single variables. Because this function is especially useful on grouped data, the dataset diamonds will be used.
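If you would like a quick feel for this dataset before summarising it (diamonds ships with ggplot2, so it is available once the tidyverse is loaded), glimpse() gives a compact overview; this is simply an optional look, not a required step:

# Compact overview of the diamonds dataset: one line per variable
glimpse(diamonds)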
But in what ways can we summarise the data!!, I hear you call at your computers… (okay… I doubt it, but still, I've got to make this engaging). Within summarise() you can use a typical array of functions in order to collect information on your data as a whole, including, but not limited to:
- mean(), median()
- sd(), IQR(), mad()
- min(), max(), quantile()
- first(), last(), nth()
- n(), n_distinct()
- any(), all()
Let us first consider summarising the dataset as a whole, using the mean()
function:
summarise(diamonds,
mean = mean(price),
n = n())
## # A tibble: 1 × 2
## mean n
## <dbl> <int>
## 1 3933. 53940
Breaking down this function, you can see that we have all the typical components of a tidyverse function:

- the dataset: diamonds
- the new summary variables: mean = mean(price) & n = n()

This, as you can see, produces the mean price of all the diamonds within the dataset, along with the number of diamonds.
But how about we split this into groups, say via the cut of the diamonds within the dataset? For this, a slightly different method can be used, which includes the use of pipes (%>%). Pipes work within R the same way they work in reality, funnelling the data down from one function to the next, like so:
diamonds %>%
group_by(cut) %>%
summarise(mean = mean(price), n = n())
## # A tibble: 5 × 3
## cut mean n
## <ord> <dbl> <int>
## 1 Fair 4359. 1610
## 2 Good 3929. 4906
## 3 Very Good 3982. 12082
## 4 Premium 4584. 13791
## 5 Ideal 3458. 21551
From this, as can be seen, the code groups the dataset diamonds by cut and then summarises the mean price and number of diamonds within each group. This of course can be expanded beyond a single function, such as here, where we also look at the standard deviation (sd()), the minimum (min()) and the maximum (max()) of the diamonds.
diamonds %>%
group_by(cut) %>%
summarise(mean = mean(price), sd = sd(price), min = min(price), max = max(price), n = n())
## # A tibble: 5 × 6
## cut mean sd min max n
## <ord> <dbl> <dbl> <int> <int> <int>
## 1 Fair 4359. 3560. 337 18574 1610
## 2 Good 3929. 3682. 327 18788 4906
## 3 Very Good 3982. 3936. 336 18818 12082
## 4 Premium 4584. 4349. 326 18823 13791
## 5 Ideal 3458. 3808. 326 18806 21551
And finally, these are not limited to functions which are built into R; mathematical operators can also be used within this function. For example:
diamonds %>%
  group_by(cut) %>%
  summarise(mean = mean(price),
            n = n(),
            CI_UB = mean(price) + (sd(price) / sqrt(n())) * 1.96,
            CI_LB = mean(price) - (sd(price) / sqrt(n())) * 1.96)
## # A tibble: 5 × 5
## cut mean n CI_UB CI_LB
## <ord> <dbl> <int> <dbl> <dbl>
## 1 Fair 4359. 1610 4533. 4185.
## 2 Good 3929. 4906 4032. 3826.
## 3 Very Good 3982. 12082 4052. 3912.
## 4 Premium 4584. 13791 4657. 4512.
## 5 Ideal 3458. 21551 3508. 3407.
Here you can see that the operators +, -, /, * and the sqrt() function were used to calculate the confidence intervals for each group.
Overall, within data analysis this function can be incredibly helpful in generating specific summary statistics for your analysis, which are then easily reported. Additionally, the call itself can easily be copied and pasted, or wrapped into your own personal functions, to make your workflow more efficient when conducting repetitive analyses.
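As a minimal sketch of that last idea (the function name summarise_price and its arguments are invented here for illustration, and group_by_at() is used simply because it accepts the grouping variable as a character string), a grouped summary can be wrapped into a small reusable function:

# Hypothetical helper: summarise price for any grouping variable given by name
summarise_price <- function(data, group_var) {
  data %>%
    group_by_at(group_var) %>%
    summarise(mean = mean(price), sd = sd(price), n = n())
}

summarise_price(diamonds, "cut")
summarise_price(diamonds, "color")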
arrange()
Within RStudio it is possible to view your data as a data frame in a separate tab. Within this tab, it is possible to order and examine your data visually. These changes, however, are not permanent, and at times you may want to present data in a specific order; to do this, you can use the arrange() function.
Starting simply, say we would like to order the countries in our gdp.21st data frame by GDP Average, from smallest to largest. This can easily be achieved through the following code:
gdp.21st.arr <- arrange(gdp.21st, `GDP Average`)
This function is extremely simple, as it only requires the dataset you would like to use (in this case gdp.21st) and at least one ordering variable (GDP Average in this case). However, it is also possible to order by multiple variables.
Returning to our diamonds dataset, say we would like to order by price and carat; this simply requires the addition of the second variable, in the order you wish the data to be sorted:
dia.arr.1 <- arrange(diamonds, price, carat)
Adding the - operator reverses the order applied for one or both of the variables listed:
dia.arr.2 <- arrange(diamonds, -price, carat)
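Note that dplyr also provides the desc() helper, which achieves the same descending order as the - operator and is arguably more readable; this alternative is mentioned here as an aside (and the object name dia.arr.2b is simply chosen for the example):

# Equivalent to the previous line, using desc() instead of the - operator
dia.arr.2b <- arrange(diamonds, desc(price), carat)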
Finally, it is also possible to arrange the data within the groups themselves. To do this:
by_cut <- diamonds %>% group_by(cut)

dia.arr.3 <- by_cut %>%
  arrange(price, .by_group = TRUE)
This arranges the rows by price within each group (cut) individually.
Overall, the arrange() function is most useful in combination with the other techniques covered, but is nonetheless a valuable part of data manipulation.
Functions Technically not in dplyr
but are REALLY helpful
Although the tidyverse is an incredibly useful and amazing resource of functions for everything from data importing through manipulation, analysis and visualization, some tasks are just as easily performed using functions which exist outside of the typical tidyverse. This section will cover a small selection that (I personally) think are incredibly useful in the manipulation of data.
These will be:

- c() - for linking individual units
- rbind() / cbind() - for combining R objects by rows (rbind()) or columns (cbind())
- rownames() / colnames() - for naming rows/columns

Let's start off with the function c(). This function is undoubtedly simple, but can be a foundation for many useful and core operations. Put simply, c() combines all values within its parameters into a vector, which can then be combined/converted into datasets later. For example, say we would like to track the number of goals scored by three football teams:
team1 <- c(1, 2, 4, 1, 0, 0, 1)
team2 <- c(0, 3, 0, 1, 0, 1, 2)
team3 <- c(4, 1, 0, 1, 2, 2, 3)
Once run, this produces three data vectors, which contain these values.
Next, let us consider rbind() and cbind(). These work in a similar way to c(), but instead combine data either as rows (rbind()) or columns (cbind()), forming datasets as a result. Let's now produce a pair of datasets using rbind() and cbind().
match.wide <- rbind(team1, team2, team3)
match.long <- cbind(team1, team2, team3)
These can then be converted into dataframes using the function: as.data.frame()
, so that they can be manipulated using the dplyr functions.
match.wide <- as.data.frame(match.wide)
match.long <- as.data.frame(match.long)
Finally, if you have produced a data frame from scratch, you may wish to name your variables or observations accordingly, as this makes them more accessible in the future. To do this, it is possible to use the rename() function, or the rownames()/colnames() functions. The latter can be slightly easier to use if you are changing all or most of a data frame's labels, since you typically list them all anyway.
Firstly, we can rename match.wide, labelling each variable as a specific match and each row as a team.
colnames(match.wide) <- c("Match 1", "Match 2", "Match 3", "Match 4", "Match 5", "Match 6", "Match 7")
rownames(match.wide) <- c("Team 1", "Team 2", "Team 3")
Similarly we can do the same for match.long
but naming the variables as teams and the observations as matches.
colnames(match.long) <- c("Team 1", "Team 2", "Team 3")
rownames(match.long) <- c("Match 1", "Match 2", "Match 3", "Match 4", "Match 5", "Match 6", "Match 7")
At this point, you may be asking why this dataset is being created in two separate ways, and which way is better. To that: you should always examine what you are looking to gain from your data.
For example, let us consider the data we have just created and combined. Say we are looking at the differences between teams over a set number of matches; for this we would most likely use match.long, since we are comparing across the team variables. By contrast, if we were looking for differences between matches, we would consider the match.wide dataset. Although this is only a rule of thumb, the easy way to consider it is: whatever you want to compare between should form the variables (columns) of your data frame.
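As a small, hypothetical illustration of that rule of thumb (using base R's colMeans(), in keeping with this section's focus on functions outside of dplyr), the column means of each data frame give per-team averages from match.long and per-match averages from match.wide:

# One mean per team (teams are the columns of match.long)
colMeans(match.long)

# One mean per match (matches are the columns of match.wide)
colMeans(match.wide)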
To learn more about transforming data frames, explore the other packages within the tidyverse and the ways they can be used effectively alongside these skills.