An introduction to the Ins and Outs of Importing Data

Although it is not a taught part of this course, it is useful to review how data can be imported. When visualizing and analyzing data, you will often not be using data that comes bundled with an R package. Instead, you will need to access it externally and manipulate it within your R / RStudio workspace.

Multiple packages exist for importing data, both within the tidyverse and outside of it. Within the tidyverse, an array of smaller packages is devoted to data importing. The three major packages are:

readr, used to import rectangular data files into R (like: csv, tsv & fwf)
haven, used to import data files from other statistical software packages, such as SAS, SPSS & Stata.
readxl, used to import data files from Microsoft Excel into R.

As mentioned, multiple packages outside of the tidyverse can also be used if preferred. These include foreign, which imports data into R from other statistical software including Minitab, S, SAS, SPSS and many others.

For the purposes of this side tab, those within the tidyverse will primarily be used.
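As a minimal sketch of what readr does with rectangular files, the snippet below writes two tiny files to a temporary location and reads them back with read_csv() and read_tsv(). The file contents and the object names (csv_path, tsv_path, csv_dat, tsv_dat) are invented for illustration and are not part of the course data.

```r
library(readr)

# Hypothetical toy files, written to a temporary location
csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("country,gdp", "A,1.5", "B,2.5"), csv_path)
writeLines(c("country\tgdp", "A\t1.5", "B\t2.5"), tsv_path)

# col_types = cols() lets readr guess the column types without printing a message
csv_dat <- read_csv(csv_path, col_types = cols())
tsv_dat <- read_tsv(tsv_path, col_types = cols())

csv_dat
```

Both calls should return the same two-row table; only the delimiter in the file differs.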

Data Import

As a whole, two methods can be used when importing data. The first (and arguably the easier) method is to import the data directly through the RStudio GUI. This is simpler, but it is often not practical for complex projects, for example online applications whose datasets change regularly. The second method is to import the data through R code itself.

Personally, I would always recommend beginning any data import with the first method (using RStudio’s GUI), since this typically makes it easier to diagnose tricky bugs in unfamiliar datasets, before converting it to the second method for deployment in complex applications. This tutorial will walk you through both methods for one particular example.

Importing data using Rstudio GUI

Firstly, let us consider a dataset to use. Say we would like to import data from the online data resource The World Bank. This open databank provides global insights on population, life expectancy, GDP and many other indicators of global development. These indicators typically run from the 1960s up to 2019/2020, and are incredibly useful for data analysis as they all follow a similar data file format. This data can be easily imported using the RStudio GUI in the following steps:

Step 1: Selecting the Data.

Let us first find a dataset we would like to import into R for analysis. In this example, I have selected the GDP indicator. This displays the world’s GDP between 1960 and 2019, both grouped by region and for the world as a whole.

Step 2: Downloading the Data.

From this page, we now download the data. For this example we would like the .csv file format, which can be downloaded from here:

This will download a .zip file containing multiple .csv files. Once downloaded, simply unzip (decompress) the folder to access the files.

Step 3: Moving to your Working Directory.

Note: This is not an essential step, but it is useful for code reproducibility.

Now, you should have an unzipped folder containing your data somewhere on your computer. To interact with this data most easily, it is advised that you move it into your working directory. This is where your R project will first look when you call any data search/loading function.

To determine where your working directory is, either:

1. Run the following code; this will give you the full location of your working directory.

getwd()

2. In the RStudio GUI, on the panel which is topped as so, simply click More, then click Go To Working Directory, which will load the location of your working directory into the panel below.

Once you know the location of your working directory, drag and drop the desired .csv files into it. Do not move the whole folder they are in, as this will typically make it more complicated to access your data later!

If you have done this correctly, you should see your files like so, ready to be used.
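The same check can be made from the console. The sketch below uses only base R; the file name my_data.csv is a placeholder rather than a real file from this tutorial.

```r
# Where will R look for files by default?
wd <- getwd()
wd

# List the .csv files currently in the working directory
csv_files <- list.files(path = wd, pattern = "\\.csv$")
csv_files

# Check for one specific file (the name here is a placeholder)
file.exists(file.path(wd, "my_data.csv"))
```

If the files you moved do not appear in csv_files, they are most likely still inside a sub-folder.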

Step 4: Importing the data.

So, now is the exciting part, importing your data directly. To do this, simply click the file you would like to import (in this case API_NY…), and click Import Dataset on the menu.

This will in turn open a new window in your Rstudio GUI, looking something like this:

In its current state, as you can see, this does not look like a usable format, since it provides no information on the GDP.

To overcome this, you need to change the Skip: value. This defines how many rows the importing function should skip before it starts reading the table. With Skip: set to 3 it looks like this:

This now looks like more usable data. For datasets other than those from the World Bank, you may need to adapt the settings to import the data in the form you need. This may include changing any of the Import Options: as seen.

Within this Import Options: section, you will also see a field labelled Name:. This is automatically filled with the name of the file you wish to import. In some cases this will not be useful (for example when the name is very long, as in this case), so be sure to change it before importing to save yourself time.

An incredibly useful aspect of this import window is the Code Preview: section, which shows the exact code R will run to import the data you have specified. This is incredibly useful if (like me) you like cutting corners to save time in your coding. The code can easily be copied to your clipboard using the copy button in the top right corner of that panel.
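The effect of the Skip: setting can also be reproduced in code. The sketch below builds a made-up file with three lines of metadata above the real header row, then reads it with skip = 3. The file contents and the names toy_path and toy_dat are assumptions for illustration; the real World Bank file is laid out along similar lines but is not identical.

```r
library(readr)

# Hypothetical toy file: three lines of metadata, then the real header row
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Data Source,Toy Indicators",
  "Last Updated Date,2020-01-01",
  "Note,made-up values",
  "Country Name,Country Code,1960,1961",
  "Country A,AAA,1.5,2.5",
  "Country B,BBB,10,20"
), toy_path)

# Skipping the three metadata lines makes line 4 the column header
toy_dat <- read_csv(toy_path, skip = 3, col_types = cols())
names(toy_dat)
dim(toy_dat)
```

Without the skip argument, the first metadata line would be treated as the header, which is the unusable view the import window shows at first.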

Step 5: Reviewing the data.

If you have followed these steps correctly you should now have a dataset within your Global Environment labelled with the name you have specified. In my case:

From here you can now begin to explore, interact and manipulate the data using the techniques covered in other sections and practicals on this course.

Importing Data using R Code alone

Importing data through code is incredibly useful if you already know everything about your data, for example whether you need to make any skips, changes or adaptations to the dataset you are importing. Within this course (and many other university courses) this will typically be the most suitable method, as limited or no adaptations will be required. However, in the real world of data manipulation, visualization and coding, it is likely that a combination of trial and improvement, use of the GUI, and knowledge of the data you are importing will be required.

That being said, for complex applications such as longer projects, online applications and many others, where access to the GUI is limited or not possible, loading your data through R code itself is incredibly useful.

As such, this section will review how to import data using R code alone, using the knowledge we gained from importing the data using the GUI.

Please note: this tutorial will walk through how to code each step of the process. If you already have data within your working directory, you can simply skip ahead.

Step 1: Selecting the Data.

As with the previous method, we must first identify the data we wish to import. This is identical to the GUI process. In this example, I have selected the GDP indicator. This displays the world’s GDP between 1960 and 2019, both grouped by region and for the world as a whole.

Step 2: Downloading the Data.

From this page, we can now consider which data we would like to download. As before, this can be downloaded manually from here:

But as we covered previously, this will download a .zip file containing multiple .csv files. As a result an additional step must be covered before we can access these files:

Firstly, we must create a temporary location to download the .zip file to, using the function tempfile().

temp.gdp <- tempfile()

Next, using the function download.file(), we can download files directly from URLs. These do not have to be .csv, .zip or any specific file type; in this example, however, we will be downloading a .zip file. Breaking down this function, we can observe that it requires:

url: a link to the online location of the data. This is found by right-clicking on the download link (in the case of the World Bank) and selecting copy link address.
destfile: the location you wish the file to be downloaded to, in this case the temp file we have created: temp.gdp.
mode: the mode in which to write the file. This varies slightly and may require some trial and improvement depending on your operating system (this currently works on my macOS machine). For binary files such as .zip, we will be using the mode wb.

download.file(url = "http://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=csv", 
              destfile = temp.gdp, 
              mode = "wb")

This (if successful) will download the .zip file to the temporary location stored in temp.gdp, which is listed under your variables. To access the contents of this file, and the .csv files within it, use the function unzip().

unzip(temp.gdp)

This will unpack your files into your working directory, ready to be imported formally. For more complex arrangements the following sub-steps can be used. For simpler situations, skip ahead to Step 3.
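The download in this step can also be wrapped in a small helper that checks the result before you go on to unzip it. This is only a sketch: fetch_to_temp() is a hypothetical name, not a function from any package, and the checks it performs are one reasonable choice among many.

```r
# Hypothetical helper: download a URL to a temporary file and check the result
fetch_to_temp <- function(url, fileext = ".zip") {
  dest <- tempfile(fileext = fileext)
  # download.file() returns 0 (invisibly) on success
  status <- download.file(url = url, destfile = dest, mode = "wb", quiet = TRUE)
  if (status != 0 || !file.exists(dest) || file.size(dest) == 0) {
    stop("Download failed or produced an empty file: ", url)
  }
  dest
}

# Usage (needs an internet connection):
# temp.gdp <- fetch_to_temp("http://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=csv")
# unzip(temp.gdp)
```

Stopping early with a clear message is usually easier to debug than a later failure inside unzip() or read_csv().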

Sub-step 2a

In some situations (such as automated code), you may want to write code which does not depend on you knowing the names of the files (because they vary, because the data changes frequently, or for other reasons). In that case, the following advanced steps can be taken:

Firstly, when unzipping your data, assign the file names that unzip() returns to a new variable, in this case temp.gdp.names (do this as part of the previous step, when calling unzip()).

temp.gdp.names <- unzip(temp.gdp)

This will now give you a character vector of the paths of the files extracted from the zip file. In the case of the World Bank datasets, the actual names of the files change frequently, but the structure does not. For example, we know that the .zip file contains three files, one labelled API... and two labelled Metadata_..., and that the file beginning API... is the one containing the data we need.

As such we can automate the selection of this variable through the following code:

# Step 1: Determine which file within the zip file begins with "./API" using `startsWith()`.
temp.gdp.names.bol <- startsWith(temp.gdp.names, "./API")

# Step 2: Identify where in the vector this TRUE value is using `which()`.
temp.gdp.names.val <- which(temp.gdp.names.bol)

Since we now know the location of the .csv file we would like within the list of files, this can now be used to import the data (please note this process is the same as Step 3, but requires some additional adaptation).

The function read_csv() reads a .csv file from a specific location, typically your working directory. By using our value (temp.gdp.names.val, which in this case is equal to the numerical value 2) to index the vector of names (temp.gdp.names), and then applying our previous skip knowledge, we can import the .csv file that was extracted from the downloaded .zip file.

library(readr)  # provides read_csv()

gdp.dat <- read_csv(temp.gdp.names[temp.gdp.names.val],
                    skip = 3)

Please note, this sub-step is a long way around to importing your data, which can be done much more easily through the final step here, or through the GUI as explained above. However, as discussed, there are some situations where this method is the most useful and appropriate.
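The selection logic in this sub-step can be tried out without downloading anything. The sketch below uses made-up file names that mimic the structure described above, and matches on basename() so that the result does not depend on the leading "./". The names example.names, is.api and api.file are assumptions for illustration.

```r
# Hypothetical file names, mimicking the structure described above
example.names <- c("./Metadata_Country_example.csv",
                   "./API_example.csv",
                   "./Metadata_Indicator_example.csv")

# basename() drops the leading "./" (or any folder), so the match does not
# depend on where the files were unzipped
is.api <- startsWith(basename(example.names), "API")

# Keep only the matching path; stop early if the structure is not as expected
api.file <- example.names[is.api]
stopifnot(length(api.file) == 1)
api.file
```

Indexing with the logical vector directly gives the same file as the which() approach, and the stopifnot() call guards against a .zip file whose contents have changed shape.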

Step 3: Importing your data.

As seen in the previous walk-through using the GUI, RStudio can generate the code to import your data for you, which (personally) I find quite useful for files with long names. However, it is still worth explaining the steps required to import data in this way. Note that this is specifically for .csv files (a common file format in data analysis); the same technique can be applied (using a different function) to other file formats.

To import a .csv file, the read_csv() function can be used. Within this you are required to specify the name of the file you wish to import, alongside any additional parameters, such as skip.

Typically when reading in data (another name for importing it), you should assign the result to a specific name, to ensure you have full control over the name of the variable. In this case, we name the data frame gdp.dat.

library(readr)  # provides read_csv()

gdp.dat <- read_csv("API_NY.GDP.MKTP.CD_DS2_en_csv_v2_988718.csv",
                    skip = 3)

## New names:
## • `` -> `...65`
## Rows: 264 Columns: 65
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): Country Name, Country Code, Indicator Name, Indicator Code
## dbl (59): 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, ...
## lgl  (2): 2019, ...65
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
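The `...65` name in the output above is what readr assigns to a column that has no header. A plausible cause, assumed here rather than confirmed from the file itself, is a trailing comma at the end of each line. The sketch below reproduces that situation with a made-up file and drops the empty column by position; trailing_path, trailing_dat and trailing_clean are hypothetical names.

```r
library(readr)

# Hypothetical toy file in which every line ends with a trailing comma
trailing_path <- tempfile(fileext = ".csv")
writeLines(c("country,1960,", "A,1.5,", "B,2.5,"), trailing_path)

# readr sees a third, unnamed column; suppressMessages() hides the
# "New names" note that name repair would otherwise print
trailing_dat <- suppressMessages(read_csv(trailing_path, col_types = cols()))
ncol(trailing_dat)

# The extra column holds no data, so it can be dropped by position
trailing_clean <- trailing_dat[, -ncol(trailing_dat)]
names(trailing_clean)
```

Before dropping a column like this in real data, it is worth confirming that it really is empty, for example with all(is.na(...)).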

Now that you have an introduction to importing data, this can be adapted and transferred to importing other data types for analysis and manipulation. In the following tab you will see how to begin to manipulate this data, before conducting analysis on it.