I’m teaching an R workshop for the Baruch MFE program. This is the first installment of the workshop and focuses on some basics, although we assume you already know how to program. Below are the contents for the full workshop.
PART I: PRELIMINARIES
PART II: STATISTICS
- A. Distributions, Sampling, and Regression
- B. Optimization and Linear Programming
PART III: STRUCTURING CODE
- A. Dispatching Systems
- B. Real World Development
To get the most out of the workshop, you need to have basic programming knowledge. At a minimum you should understand what control structures are and how variable scopes work.
You will also need a working copy of R. As R is open source and popular, it’s available on all major operating systems. To write R code you can use a standard text editor, like vim, or obtain an IDE (e.g. Eclipse or RStudio) if you prefer a visual editor. For the workshop, we will stick with vim.
While R is a language that comes with “batteries included”, there are additional packages that you will need for the workshop. These include:
Note that installing tawny will also install all of its dependencies.
There are a number of ways to get help. The most direct way is to use the R shell. Most functions provide a documentation page that is retrieved by prefixing the function name with a question mark, e.g. ?lm opens a help page on the function lm. If you don’t know the specific function, try the help.search() command. At this point, you should know how to get help for this function!
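The lookups described above can be sketched as follows. Note that apropos() is an addition not mentioned in the text; it is a base R function that matches object names against a pattern. The help calls are wrapped in interactive() only so the snippet also runs cleanly as a script.

```r
# Exact-name lookup: these two calls are equivalent.
if (interactive()) {
  ?lm
  help("lm")
}

# Find functions by name pattern when you only remember part of the name.
hits <- apropos("^read\\.")
print(hits)

# Full-text search of the installed documentation.
if (interactive()) help.search("regression")
```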
Search engines are always an option, but with R they can be problematic because a single letter is such a generic search term. Many people have developed solutions to this problem, with the most popular being rseek or a filtered Google search.
The R community has a number of mailing lists for getting help in addition to special interest groups (e.g. R-SIG-Finance), while the younger generation seems to have opted for online Q&A sites as a primary resource.
The R Shell
There are a number of useful functions for interacting with the R shell. In many ways you can think of it as being a lightweight version of bash. Common operations like ls() and rm() exist to list the objects you’ve created as well as remove them. You can also view your search() path and see which packages are loaded.
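A short session along these lines might look like the sketch below. The object names x and y are placeholders chosen for illustration.

```r
x <- 1:3
y <- "scratch"

objs_before <- ls()   # names of objects in the current workspace
print(objs_before)

rm(y)                 # remove a single object by name
objs_after <- ls()
print(objs_after)

# The search path lists attached environments and packages,
# always starting with the global environment.
print(search())
```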
To install a new package from the shell, use the install.packages() command. R will download and build the package while you wait. Don’t forget to load the package after you build it with the library() command.
In R, everything is a vector. This means that even primitives have a length.
> length(4)
[1] 1
This seemingly strange idea makes translating mathematical notation into code very easy since vector notation is built-in. That means no loops just to add two vectors together.
> 1:5 + c(1,2,3,4,5)
[1]  2  4  6  8 10
It also means mathematical properties are honored by default so operators behave as you expect. As we’ll discuss later, this behavior also extends to matrices.
> c(2,3,4) + 2
[1] 4 5 6
> c(2,3,4) * 2
[1] 4 6 8
From the examples, you can find two ways to create vectors. Other methods include seq(), which creates a sequence of numbers based on a variety of rules.
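A few of the rules seq() supports are sketched below. seq_len() and rep() are additions not covered in the text; both are base R functions that come up often alongside seq().

```r
by_step   <- seq(1, 10, by = 3)          # fixed step size: 1 4 7 10
by_length <- seq(0, 1, length.out = 5)   # fixed number of points: 0, 0.25, ..., 1
first_n   <- seq_len(4)                  # 1 2 3 4
repeated  <- rep(c(1, 2), times = 3)     # 1 2 1 2 1 2

print(by_step)
print(by_length)
print(first_n)
print(repeated)
```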
To access elements within a vector, R provides many handy built-in constructs. The simplest is an indexing notation. More complicated expressions can be applied as well.
> a <- 1:10
> a[4]
[1] 4
> a[a>6]
[1]  7  8  9 10
This works because R evaluates the expression across the whole vector, producing a logical vector of the same length. Any expression that returns logicals (or positions) aligned with the vector can therefore be used as an index. This property is used to apply sorting over a vector.
> b <- sample(1:10)
> b[order(b)]
 [1]  1  2  3  4  5  6  7  8  9 10
What do you suppose is the output of the order() function?
Elements in a vector can also be named. Once defined, these names can be used to access elements.
> names(a) <- strsplit('abcdefghij','', fixed=TRUE)[[1]]
> a
 a  b  c  d  e  f  g  h  i  j 
 1  2  3  4  5  6  7  8  9 10 
> a['c']
c 
3 
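Name-based and condition-based access can be combined freely. The sketch below uses the built-in constant letters as a shortcut for the names; which() and negative indices are additions not covered in the text.

```r
a <- 1:10
names(a) <- letters[1:10]

by_name  <- a[c("b", "d")]      # select several elements by name
evens    <- a[a %% 2 == 0]      # a logical mask keeps the names
pos_gt6  <- which(a > 6)        # positions where the condition holds
last_two <- a[-(1:8)]           # negative indices drop elements

print(by_name)
print(evens)
print(pos_gt6)
print(last_two)
```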
Operators and functions
By default vectors are defined as column vectors. This is true of the internal data structure as well. To create a row vector, use the transpose function,
> t(1:4)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
Other common functions like inner and outer product are defined as operators,
> c(1,2,3) %*% c(3,2,1)
     [,1]
[1,]   10
> c(1,2,3) %o% c(3,2,1)
     [,1] [,2] [,3]
[1,]    3    2    1
[2,]    6    4    2
[3,]    9    6    3
When working with vectors, R tries its best to protect you from any obvious mistakes, like incompatible lengths between the operands. In general, R attempts to do the right thing (for example, recycling a shorter vector to match a longer one) while issuing warnings or errors for any glaring problems.
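The behavior with mismatched lengths can be sketched as below. tryCatch() is used only to capture the condition messages so the snippet runs to completion; the variable names are illustrative.

```r
short <- c(1, 2)

# Length 6 and length 2: the shorter vector is recycled silently.
recycled <- 1:6 + short
print(recycled)

# Length 3 and length 2: not a clean multiple, so R issues a warning.
warn_msg <- tryCatch(c(1, 2, 3) + short,
                     warning = function(w) conditionMessage(w))
print(warn_msg)

# Non-conformable matrices (2x3 times 2x3): R raises an error.
err_msg <- tryCatch(matrix(1:6, nrow = 2) %*% matrix(1:6, nrow = 2),
                    error = function(e) conditionMessage(e))
print(err_msg)
```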
Arrays and Matrices
Arrays are vectors that have a dim(ension) attribute. Matrices are simply two-dimensional arrays. Each of these types has a constructor: array() and matrix(), respectively. When creating a matrix, note that it is constructed along columns. You can override this behavior but be aware that the performance may degrade since the internal representation is based on columns.
> matrix(1:6, nrow=2)
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
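As a sketch of the constructors and of overriding the column-wise fill, assuming the byrow argument of matrix() is the override meant above:

```r
by_col <- matrix(1:6, nrow = 2)                 # default: fill down the columns
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)   # fill across the rows instead
print(by_col)
print(by_row)

# A three-dimensional array built from a plain vector.
arr <- array(1:24, dim = c(2, 3, 4))
print(dim(arr))

# Setting the dim attribute directly turns a vector into a matrix.
v <- 1:6
dim(v) <- c(2, 3)
print(identical(v, by_col))
```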
Since matrices have two dimensions, the names() function will not work to access a matrix’s column or row names. Instead there are rownames() and colnames(). Likewise, length() is not appropriate for matrices; use dim() to get the number of rows and columns.
Accessing specific elements of a matrix is accomplished using similar subsetting notation. Since there are two dimensions, an index can be applied to either dimension, or a full column or row can be accessed. Notice that the printed output of the matrix actually shows you the notation.
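A minimal sketch of matrix subsetting on a small example follows. It uses rownames() and dim(), which are standard base R functions, and deliberately sticks to row names and numeric column indices. The drop = FALSE argument is an addition not covered in the text.

```r
m <- matrix(1:6, nrow = 2)
rownames(m) <- c("r1", "r2")
print(dim(m))                      # number of rows and columns

cell  <- m["r2", 3]                # a single element, row by name
row1  <- m[1, ]                    # a full row: leave the column index empty
col2  <- m[, 2]                    # a full column: leave the row index empty
col2m <- m[, 2, drop = FALSE]      # keep the result as a 2 x 1 matrix

print(cell)
print(row1)
print(col2)
print(col2m)
```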
Exercise: Use rnorm() to generate a 20 x 6 matrix. Add column names to the matrix: C, F, T, A, D, K. Extract column A. How do you extract more than one column?
Another common technique for creating matrices is to use either rbind() or cbind() with existing vectors or arrays.
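Binding vectors into matrices can be sketched as follows; the vector names x and y are illustrative and end up as the column or row names of the result.

```r
x <- 1:3
y <- 4:6

cols <- cbind(x, y)   # 3 x 2: each vector becomes a column
rows <- rbind(x, y)   # 2 x 3: each vector becomes a row
print(cols)
print(rows)

# Binding also works against an existing matrix.
wider <- cbind(matrix(1:4, nrow = 2), c(9, 9))
print(wider)
```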
A list is a general purpose data structure or object that stores named elements. Objects can be stored within a list at multiple levels. Once a list is created, elements can be accessed by name or index. When using an index, typically the special double bracket notation is used unless you want another list back, in which case single brackets are used.
> li <- list(a=rnorm(5), b=1:8, c='label')
> li$b
[1] 1 2 3 4 5 6 7 8
> li[[2]]
[1] 1 2 3 4 5 6 7 8
> li[2]
$b
[1] 1 2 3 4 5 6 7 8
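Storing objects at multiple levels, and the difference between the two bracket forms, can be sketched like this. The element names (inner, label, vals, new) are made up for illustration.

```r
li <- list(a = 1:3, inner = list(label = "x", vals = c(2.5, 3.5)))

# Nested access by name, using either $ or double brackets.
print(li$inner$vals)
print(li[["inner"]][["label"]])

# Single brackets return a list; double brackets return the element itself.
print(class(li["a"]))
print(class(li[["a"]]))

# Assigning to a new name appends an element.
li$new <- TRUE
print(names(li))
```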
While matrices require data to be of a consistent type, a data.frame allows arbitrary types for each column.
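The contrast with matrices can be sketched as below. The column names are illustrative, and stringsAsFactors = FALSE is set explicitly so the result is the same on older and newer versions of R.

```r
df <- data.frame(id   = 1:3,
                 name = c("x", "y", "z"),
                 flag = c(TRUE, FALSE, TRUE),
                 stringsAsFactors = FALSE)

# Each column keeps its own type.
col_types <- sapply(df, class)
print(col_types)

# Rows can be filtered with a logical column, much like a vector.
flagged <- df[df$flag, ]
print(flagged)

# The same data in a matrix is coerced to a single type (character).
m <- cbind(1:3, c("x", "y", "z"))
print(class(m[1, 1]))
```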
A factor is essentially an enum. Performance benefits exist when using factors for grouping or filtering since the comparison is faster than with a string. Be careful, though: before R 4.0.0, functions such as data.frame() and read.csv() converted string data to factors by default (stringsAsFactors=TRUE), which can result in unexpected behavior.
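The enum-like nature of a factor, and one common surprise it causes, can be sketched as follows. The labels are illustrative.

```r
f <- factor(c("buy", "sell", "buy", "hold"))
print(levels(f))        # the distinct labels, sorted: "buy" "hold" "sell"
print(as.integer(f))    # the underlying integer codes: 1 3 1 2
print(table(f))         # counts per level

# A typical pitfall: numbers stored as a factor convert to codes, not values.
num <- factor(c("10", "20", "10"))
codes  <- as.integer(num)                  # 1 2 1
values <- as.numeric(as.character(num))    # 10 20 10
print(codes)
print(values)
```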
Sometimes the data you get needs to be converted to a different format. Most type constructors have corresponding as.* functions to coerce data into the given type. A typical usage is converting a string to a date via as.Date().
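A few coercions from the as.* family are sketched below, with as.Date() as the date example. The dates and format string are arbitrary illustrations.

```r
print(as.numeric("3.14"))
print(as.integer(3.9))      # truncates toward zero: 3
print(as.character(42L))

# ISO-formatted strings parse without a format argument.
d <- as.Date("2012-03-15")
print(class(d))
print(d + 30)               # date arithmetic in days: 2012-04-14

# Other layouts need an explicit format.
us <- as.Date("03/15/2012", format = "%m/%d/%Y")
print(us == d)
```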
Exercise: Given the following data.frame, get the average of the values for label b.
> l <- sample(strsplit('abcdefg','',fixed=TRUE)[[1]],10,replace=TRUE)
> d <- data.frame(cbind(value=rnorm(10), label=l))
Hint: Use anytypes() to see the type for each column in the data.frame.
The most common method for getting data into R is by reading a file. Typically the family of read and write functions is used for general purpose reading and writing of data.frames, while scan is sometimes used directly, for example when reading in all-numeric matrices.
> df <- read.csv(textConnection("a,b,c
+ 1,2,3
+ 4,5,6
+ 7,8,9"))
> write.csv(df, file='dummy.csv')
> dd <- read.csv('dummy.csv')
Notice anything strange when reading this back in?
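For the all-numeric case mentioned earlier, a minimal sketch using scan() might look like this. It reads from an inline string via the text argument rather than a file, and assumes the data is laid out row by row.

```r
# scan() returns a flat numeric vector; quiet = TRUE suppresses the item count.
vals <- scan(text = "1,2,3\n4,5,6\n7,8,9", sep = ",", quiet = TRUE)

# Reshape into a matrix, filling by row to match the layout of the input.
m <- matrix(vals, ncol = 3, byrow = TRUE)
print(m)
```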