This tutorial covers some basics of programming in R. We hope that you’ll get a feel for what R can do and learn where to find out more so that you can use it on your own (great resources are listed at the end).
Although R is technically a programming language, it was developed specifically for analyzing data. It has many built-in tools for common analysis tasks, and countless more are available as add-on packages developed by the R community. Since it is a language, however, you are not limited to carrying out tasks or analyses that someone has already implemented. R is widely used, free, and open-source, with great community support.
If you haven’t already installed R and RStudio on your computer, please follow these instructions to get set up.
To set up your own computer, the first step is to install R. You can download and install R from the Comprehensive R Archive Network (CRAN). It is relatively straightforward, but if you need further help you can try the following resources:
The next step is to install RStudio, a program for viewing and running R scripts. Technically you can run all the code shown here without installing RStudio, but we highly recommend this integrated development environment (IDE). Instructions are here and for Windows we have special instructions.
Now that you have opened up RStudio, you are ready to start working with data. Whichever approach you are using to interact with R, you should identify the console.
When you type a line of code into the console and hit enter, the command gets executed. For example, try using R as a calculator by typing:
2 + 3
## [1] 5
We can also assign values to variables. Try the following:
x <- 2
y <- 3
x + y
## [1] 5
Note also the window above the console. This is where you can store lines of code to be executed in a particular sequence, and these can be saved in a script (a text file with a “.R” extension) or as an R Markdown notebook (with a “.Rmd” extension) so that you can reproduce your results later or run your code on another dataset.
When you download R from CRAN you get what we call base R. This includes several functions that are considered fundamental for data analysis. It also includes several example datasets, which are particularly useful when we are learning to use the available functions. You can see all the available datasets by executing the function data like this:
data()
Because functions are objects in R, we need the two parentheses to let R know that we want the function to be executed, as opposed to showing us the code for the function. Type the following and note the difference:
data
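Because functions are ordinary objects, they can be inspected and assigned to new names like any other value. The short sketch below illustrates this; the name my_mean is made up for the example.

```r
# A function is an object: asking for its class does not run it
class(data)
## [1] "function"

# Assigning without parentheses copies the function; nothing is executed
my_mean <- mean          # hypothetical name, used only for illustration

# Adding parentheses calls the function
my_mean(c(2, 4, 6))
## [1] 4
```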
To see an example of functions at work, we will use the co2 dataset to illustrate the function plot, one of the base functions. We can plot the Mauna Loa atmospheric CO2 concentration over time like this:
plot(co2)
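Before plotting a dataset it often helps to check what kind of object it is. The sketch below inspects co2 and then repeats the plot with axis labels and a title; the label text is our own wording, not part of the dataset.

```r
# co2 ships with base R as a monthly time series
class(co2)
## [1] "ts"
start(co2)       # first observation: January 1959
## [1] 1959    1
frequency(co2)   # 12 observations per year
## [1] 12

# The same plot with axis labels and a title
plot(co2,
     xlab = "Year",
     ylab = "CO2 concentration (ppm)",
     main = "Mauna Loa atmospheric CO2")
```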
Note that R’s base functionality is bare bones. Data science applications are broad, the statistical toolbox is extensive, and most users need only a small fraction of all the available functionality. Therefore, a better approach is to make specific functionality available on demand, just like apps for smartphones. R does this using packages, also called libraries.
Some packages are important enough that they are included with the base download. This includes, for example, the survival package, which implements core methods for survival analysis in R. To bring that functionality into your current session we type:
library(survival)
There are currently two major repositories for R packages, one maintained by CRAN with over 14,000 packages and a second maintained by the Bioconductor project with over 1,700 packages for biological data analysis in R. Packages on CRAN can be installed using the install.packages function. (Installing packages from Bioconductor requires a couple of additional steps and is not necessary for this hackathon. More details are available on the Bioconductor website.)
To use an add-on package that is not included with base R, you’ll first need to install it with install.packages. Packages can be retrieved from several different repositories. As noted above, the most popular repository is CRAN, where packages are vetted: they are checked for common errors and they must have a dedicated maintainer. There are other repositories, some with more vetting, such as Bioconductor, and some with no vetting, such as GitHub (yes, you can maintain your R package on GitHub!). You can easily install CRAN packages from within R if you know the name of the package. As an example, if you want to install the package dplyr, you would use:
install.packages("dplyr")
This step only needs to be carried out once on your machine. We can then load the package into our R session using the library function:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
This step will need to be carried out during every new R session before using the package. If you ever try to load a package with the library command and get an error, it probably means you need to install it first. Note that there are reasons to reinstall packages that already exist in your library (e.g., to receive updated versions of packages).
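One way to avoid reinstalling a package on every run is to check whether it is already available. The sketch below wraps the base function requireNamespace in a small helper; the helper name is_installed and the non-existent package name are made up for illustration.

```r
# Hypothetical helper: TRUE if a package is installed, without attaching it
is_installed <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

is_installed("stats")                # ships with R, so it is always available
## [1] TRUE
is_installed("notARealPackage123")   # made-up name, so the check fails
## [1] FALSE

# Typical pattern: install only when missing, then load.
# Commented out here because install.packages() needs an internet connection.
# if (!is_installed("dplyr")) install.packages("dplyr")
# library(dplyr)
```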
A key feature you need to know about R is that you can get help for a function using help or ?, like this:
?install.packages
help("install.packages")
These pages are quite detailed about what the function expects as input and what the function produces as output. They also include helpful examples of how to use the function at the end.
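A few other base R utilities complement the help pages. This is a brief sketch, using mean and the read.* functions simply as convenient examples.

```r
# Show a function's arguments without opening the full help page
args(mean)

# Run the examples from the end of a help page
example(mean)

# List objects whose names match a pattern, here everything starting with "read."
apropos("^read\\.")

# To search all help pages by topic rather than by function name,
# type ??"standard deviation" at the console
```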
Although there are different styles and languages of programming, in essence a piece of code is just a very detailed set of instructions. Each language has its own set of rules and syntax. According to Wikipedia, syntax is
“the set of rules that defines the combinations of symbols that are considered to be a correctly structured document or fragment in that language.”
Here are some general tips and pitfalls to avoid that will be useful when writing R code:
1. Case matters: variable names, keywords, functions, and package names are all case-sensitive
x <- 2
X + 8
## Error: object 'X' not found
2. Avoid using spaces: variable names cannot contain spaces
my variable <- 10
## Error: <text>:1:4: unexpected symbol
## 1: my variable
## ^
3. Use comments liberally: your future self and others will thank you
# define scalar variables x and y
x <- 2
y <- 3
# add variables x and y
x + y
## [1] 5
4. Pay attention to classes: character strings, numerics, factors, matrices, lists, data.frames, etc., all behave differently in R
myNumber <- factor(10)
str(myNumber)
## Factor w/ 1 level "10": 1
myNumber^2
## Warning in Ops.factor(myNumber, 2): '^' not meaningful for factors
## [1] NA
as.numeric(myNumber)^2
## [1] 1
as.character(myNumber)^2
## Error in as.character(myNumber)^2: non-numeric argument to binary operator
as.numeric(as.character(myNumber))^2
## [1] 100
5. Search the documentation for answers: when something unexpected happens, try to find out why by reading the documentation
mean(c(3, 4, 5, NA))
## [1] NA
mean(c(3, 4, 5, NA), na.rm = TRUE)
## [1] 4
6. It’s OK to make mistakes: expert R programmers run into (and learn from) errors all the time
Don’t panic about those error messages!
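The factor pitfall from tip 4 is easier to see with more than one value, and tip 5 often starts with counting missing values. The sketch below uses made-up vectors named sizes and values.

```r
sizes <- factor(c("10", "20", "10"))   # made-up example values

# as.numeric() on a factor returns the internal level codes, not the labels
as.numeric(sizes)
## [1] 1 2 1

# Convert to character first to recover the numbers the labels represent
as.numeric(as.character(sizes))
## [1] 10 20 10

# Count missing values before deciding how to handle them
values <- c(3, 4, 5, NA)
sum(is.na(values))
## [1] 1
```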
The process of reading in data and getting it in a format for analysis is often called data “wrangling”. This step may seem simple to an outside observer, but often takes up a significant proportion of the time spent on a data analysis.
The first step when preparing to analyze data is to read the data into R. There are several ways to do this, but we are going to focus on reading in data stored either as an external Comma-Separated Value (CSV) file or as a “serialized” R data (RDS) object. Small datasets are often stored as Excel files. Although there are R packages designed to read Excel (xls/xlsx) format, you generally want to avoid this. In addition to the values stored in each individual cell, Excel spreadsheets often contain information in annotations (e.g. bold, italics, colors). Parsing these additional annotations is messy and imperfect. Many of these additional “features” of Excel spreadsheets can cause profound headaches for downstream data analysis. For these reasons, it is almost always preferable to save data as a plain-text or R object file. Frequently, plain-text files are saved in either comma-delimited (CSV) or tab-delimited (TSV) format.
Plain-text formats are often easiest for sharing, as commercial software is not required for viewing or working with the data. CSV files can be read into R with the help of the read.csv function. Similarly, data.frames can be written to CSV files using the write.csv function.
If your data is a text file but not in CSV format, there are many other helpful functions that will read your data into R, such as read.table and read.delim. If the file lives on the web, download.file can fetch it to your computer first. Check out their help pages to learn more.
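As a minimal sketch of these functions, the example below writes a small made-up table to temporary CSV and TSV files and reads it back. The object and file names (toy, csvFile, tsvFile) and the values are ours, chosen only for illustration.

```r
# A small made-up table, used only for this example
toy <- data.frame(
  drug      = c("17-AAG", "PD-0332991"),
  viability = c(94.1, 62.5),
  stringsAsFactors = FALSE
)

# Write to a temporary CSV file and read it back
csvFile <- tempfile(fileext = ".csv")
write.csv(toy, csvFile, row.names = FALSE)   # skip the row-number column
fromCsv <- read.csv(csvFile, stringsAsFactors = FALSE)

# Tab-delimited files work the same way with write.table() and read.delim()
tsvFile <- tempfile(fileext = ".tsv")
write.table(toy, tsvFile, sep = "\t", row.names = FALSE, quote = FALSE)
fromTsv <- read.delim(tsvFile, stringsAsFactors = FALSE)

# Both round trips should give back the original table
all.equal(toy, fromCsv)
all.equal(toy, fromTsv)
```

Passing row.names = FALSE keeps R from writing its row numbers as an extra, unnamed first column.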
Another common format for storing data in R is the RDS file format. Unlike plain-text files, RDS files are binary files and cannot be opened and inspected using standard text editors. Instead, RDS files must be read into R using the readRDS function. An object in R can be saved as an RDS file using the appropriately named saveRDS function. While RDS files can only be read using R, they are almost always substantially smaller than the corresponding CSV or TSV file.
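A round trip through an RDS file can be sketched as follows. The object scores and the temporary file are made up for the example; unlike CSV, the format preserves any R object, not just tables.

```r
# A made-up list; RDS can store any single R object
scores <- list(id = 1:3, label = c("a", "b", "c"))

rdsFile <- tempfile(fileext = ".rds")   # temporary file for the example
saveRDS(scores, rdsFile)                # write the object to disk
restored <- readRDS(rdsFile)            # read it back under a new name

identical(scores, restored)
## [1] TRUE
```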
Throughout, we will be working with tables stored as RDS files.
If you are reading in a file stored on your computer, the first step is to find the file containing your data and know its path.
When you are working in R it is useful to know your working directory. This is the directory or folder in which R will save or look for files by default. You can see your working directory through the console by typing:
getwd()
You can also change your working directory using the function setwd. Or you can change it through the RStudio menus by clicking on “Session”.
The functions that read and write files (there are several in R) assume you mean to look for files or write files in the working directory. Our recommended approach for beginners will have you reading and writing to the working directory. However, you can also type the full path, which will work independently of the working directory.
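Paths are easiest to build with file.path, which joins the pieces with the correct separator. In this sketch the folder and file names are hypothetical; file.exists lets you check a path before trying to read it.

```r
# Where R currently looks for and saves files
getwd()

# Build a path relative to the working directory (hypothetical names)
dataFile <- file.path("data", "myData.csv")
dataFile
## [1] "data/myData.csv"

# Check that the file is really there before reading it
file.exists(dataFile)

# Show the absolute path of the working directory
normalizePath(".")
```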
As an example, let’s read in one of the datasets we’ll be analyzing today:
rawFile <- file.path("..", "data", "rawPharmacoData.rds")
if (!file.exists(rawFile)) {
source(file.path("..", "downloadData.R"), chdir = TRUE)
}
pharmacoData <- readRDS(rawFile)
Once we have read a dataset into an object (here we called it pharmacoData), we are ready to explore it. What exactly is in pharmacoData? To check a summary of its contents, the str() function (short for ‘structure’) is handy.
str(pharmacoData)
## 'data.frame': 43427 obs. of 6 variables:
## $ cellLine : chr "22RV1" "22RV1" "22RV1" "22RV1" ...
## $ drug : chr "17-AAG" "17-AAG" "17-AAG" "17-AAG" ...
## $ doseID : chr "doses1" "doses2" "doses3" "doses4" ...
## $ concentration: num 0.0025 0.008 0.025 0.08 0.25 0.8 2.53 8 0.0025 0.008 ...
## $ viability : num 94.1 86 99.9 85 62 ...
## $ study : chr "CCLE" "CCLE" "CCLE" "CCLE" ...
Here we see that this object is a data.frame. Data frames are one of the most widely used data types in R and are particularly useful for storing tables. We can also print the top of the data frame using the head() function:
head(pharmacoData)
## cellLine drug doseID concentration viability study
## 1 22RV1 17-AAG doses1 0.0025 94.100 CCLE
## 2 22RV1 17-AAG doses2 0.0080 86.000 CCLE
## 3 22RV1 17-AAG doses3 0.0250 99.932 CCLE
## 4 22RV1 17-AAG doses4 0.0800 85.000 CCLE
## 5 22RV1 17-AAG doses5 0.2500 62.000 CCLE
## 6 22RV1 17-AAG doses6 0.8000 29.000 CCLE
Another option is to open up a ‘spreadsheet’ tab in another RStudio window, which can be done with the View function:
View(pharmacoData)
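A few more base functions are useful for a first look at a table. So that the sketch runs without the RDS file, it rebuilds the six rows printed by head() above by hand; the name firstRows is ours.

```r
# The six rows shown by head(pharmacoData), typed in by hand
firstRows <- data.frame(
  cellLine      = rep("22RV1", 6),
  drug          = rep("17-AAG", 6),
  doseID        = paste0("doses", 1:6),
  concentration = c(0.0025, 0.008, 0.025, 0.08, 0.25, 0.8),
  viability     = c(94.1, 86, 99.932, 85, 62, 29),
  study         = rep("CCLE", 6),
  stringsAsFactors = FALSE
)

dim(firstRows)                 # number of rows and columns
## [1] 6 6
names(firstRows)               # column names
summary(firstRows$viability)   # five-number summary plus the mean
tail(firstRows, 2)             # the last two rows
```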
There are many different data types in R. Some of the more common ones include:
data.frame
vector
matrix
list
factor
character
numeric
integer
double
Each of them has its own properties, and reading up on them will give you a better understanding of the underlying R infrastructure. See the respective help files for additional information. To see the class of an object, use the class function.
class(pharmacoData)
## [1] "data.frame"
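To get a feel for how these types differ, class can be applied to a few simple values. The values below are arbitrary examples.

```r
class(2.5)              # numbers are "numeric" by default
## [1] "numeric"
typeof(2.5)             # and are stored as double-precision values
## [1] "double"
class(2L)               # the L suffix creates an integer
## [1] "integer"
class("CCLE")
## [1] "character"
class(TRUE)
## [1] "logical"
class(factor("CCLE"))
## [1] "factor"
class(list(1, "a"))     # lists can mix types
## [1] "list"
```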
To extract a column from the data.frame we use the $ operator like this (to avoid printing the entire column to the screen, we wrap it in the head function to print just the top):
head(pharmacoData$drug)
## [1] "17-AAG" "17-AAG" "17-AAG" "17-AAG" "17-AAG" "17-AAG"
This gives us a vector. We can access elements of the vector using the [ operator. Here is the 5000th element of the vector:
pharmacoData$drug[5000]
## [1] "PD-0332991"
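The same bracket notation works on whole data frames, with rows before the comma and columns after it. The sketch below uses a small made-up table named toy rather than the full dataset.

```r
# A made-up three-row table, used only for illustration
toy <- data.frame(
  drug      = c("17-AAG", "17-AAG", "PD-0332991"),
  viability = c(94.1, 29, 75),
  stringsAsFactors = FALSE
)

toy[2, "viability"]           # a single cell: row 2, column viability
## [1] 29
toy[toy$viability < 80, ]     # all rows meeting a condition, every column
toy[["drug"]]                 # a whole column, equivalent to toy$drug
unique(toy$drug)              # the distinct values in a column
```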
A vector is a sequence of data elements of the same type (class). Many of the operations used to analyze data are applied to vectors. In R, vectors can be numeric, character, or logical.
The most basic way to create a vector is with the function c.
x <- c(1, 2, 3, 4, 5)
Two very common ways of generating vectors are using : or the seq function.
x <- 1:5
x <- seq(1, 5)
Vectors can have names.
names(x) <- letters[1:5]
x
## a b c d e
## 1 2 3 4 5
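Once a vector has names, elements can be selected by name as well as by position or by a logical condition. The following sketch also shows that arithmetic is applied to every element at once.

```r
x <- 1:5
names(x) <- letters[1:5]

x["c"]               # select by name
x[x > 3]             # select by a logical condition
x * 2                # arithmetic is applied element by element

seq(1, 10, by = 2)   # seq() can also step by an increment
## [1] 1 3 5 7 9
length(x)
## [1] 5
```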
Up to now we have used prebuilt functions. However, we often have to write our own. We can do this in R using the function keyword.
avg <- function(x) {
return(sum(x) / length(x))
}
avg(1:5)
## [1] 3
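Functions can also take arguments with default values. As a sketch, the variation of avg below adds an na.rm argument modelled on the one in mean; this extension is ours and not part of the original example.

```r
# Variation of avg with an optional argument that defaults to FALSE
avg <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]   # drop missing values before averaging
  }
  sum(x) / length(x)    # the last expression is returned automatically
}

avg(c(3, 4, 5, NA))
## [1] NA
avg(c(3, 4, 5, NA), na.rm = TRUE)
## [1] 4
```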
Material in this tutorial was adapted from Rafael Irizarry’s Introduction to Data Science course.
If you want to learn more about R after this event, a great place to start is with the swirl tutorial, which teaches you R programming interactively, at your own pace and in the R console. Once you have R installed, you can install swirl and run it the following way:
install.packages("swirl")
library(swirl)
swirl()
There are also many open and free resources and reference guides for R. Two examples are:
Standard texts
Chambers (2008). Software for Data Analysis, Springer.
Chambers (1998). Programming with Data, Springer.
Venables & Ripley (2002). Modern Applied Statistics with S, Springer.
Venables & Ripley (2000). S Programming, Springer.
Pinheiro & Bates (2000). Mixed-Effects Models in S and S-PLUS, Springer.
Murrell (2005). R Graphics, Chapman & Hall/CRC Press.
Other resources
Springer has a series of books called Use R!.
A longer list of books is at http://www.r-project.org/doc/bib/R-books.html
Comments
The hash character (#) marks a comment: R does not interpret any text that follows it on the same line.
When writing your own R scripts, it is strongly recommended that you write out comments (or include text if using an R Markdown notebook) that explain what each section of code is doing. This is very helpful both for collaborators, and for your future self who may have to review, run, or edit your code.
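A short illustration of both comment styles, using a made-up variable named total:

```r
# A full-line comment: R ignores this line entirely
total <- 2 + 3   # a trailing comment: only the text after the hash is ignored
total
## [1] 5
```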