This is the second part in a three part series on teaching R to MFE students at CUNY Baruch. The focus of this lesson is on basic statistical methods in R.

### Contents

#### PART I: PRELIMINARIES

#### PART II: STATISTICS

- A. Distributions, Sampling, and Regression
- B. Optimization and Linear Programming

#### PART III: STRUCTURING CODE

- A. Dispatching Systems
- B. Real World Development

### Distributions

As a statistical programming language, R provides many functions for describing distributions. Standard distributions include normal, binomial, uniform, chi-squared, etc. Others are available from add-on packages.

Distribution functions follow a standard pattern in R, which makes it easy to use any of the available variants. The stats package, which is automatically loaded, provides functions for many common distributions. To see what is available, try `?Distributions`

.

Prefix | Description | Example |
---|---|---|

d* | density function (PDF) | `dnorm` |

p* | distribution function (CDF) | `pnorm` |

q* | quantile function (inverse CDF) | `qnorm` |

r* | random generation | `rnorm` |

Let’s look at a plot for each of these functions. Notice the usage of `par`

and `plot`

to generate each chart. When adding multiple charts to a plot, special parameters must be set to configure the layout. When you are done, it’s good practice to reset the configuration by calling `par`

again with the old par object (as returned by the function previously).

opar <- par(mfrow=c(2,2), mar=c(3,5,1,1)+0.1) x <- seq(-5,5, by=0.1) y <- seq(0,1, by=0.02) plot(x, dnorm(x), t='l') plot(x, pnorm(x), t='l') plot(y, qnorm(y), t='l') plot(x, rnorm(x)) par(opar)

The graphics capabilities in R are quite extensive [1] but can be quite cumbersome to work with. There are numerous packages that attempt to improve the way graphics are handled in R including ggplot, lattice [3].

* Exercise*: Write a function plot_dists(name) that generates the above plot for the given distribution.

#### Analyzing Distributions

It is easy to spot check whether a given set of data fits a distribution by using a QQ plot. The `qqnorm`

and `qqline`

functions use a normal distribution as a default. For other distributions, use `qqplot`

to explicitly specify the x data series.

h <- getPortfolioReturns(c('AAPL','XOM','KO','F','GS'), 100) qqplot(h$XOM) qqline(h$XOM)

As expected, the tails do not fit a normal distribution. To align the theoretical quantiles with the sample quantiles, you have two approaches: normalize the sample data or use a different set of theoretical quantiles. [2]

z <- (h$XOM - mean(h$XOM)) / sd(as.vector(h$XOM)) qqnorm(z) qqline(z)

** Exercise**: Generate a QQ plot using a better fit for the theoretical distribution. Hint: Try using

`rnorm`

.#### Parameter Estimation

To fit data to a particular distribution, you can use any of the numerous optimization functions (optim, nlm, optimize, etc) to minimize your log likelihood function. Here is a contrived example of estimating the mean and variance of one of our stocks.

get_estimator <- function(x) { n <- length(x) function(theta) { n/2*log(2*pi) + n/2*log(theta[2]) + 1/(2*theta[2]) * sum((x-theta[1])^2) } }

Let’s execute the function and compare with the built-in functions.

> f <- get_estimator(h$XOM) > optim(c(0,1), f)$par [1] 0.0020516656 0.0002079123 > mean(h$XOM) [1] 0.002052150 > sd(h$XOM)^2 XOM 0.0002100090

Note the use of a closure in this example. Once we have the closure, it is easy to pass the closure to other optimization functions for experimentation. We will discuss optimization in more detail in the next section.

### Sampling Data

R provides many methods for sampling data from an existing data set, depending on what you want to accomplish. The simplest is `sample`

, which selects entries from the specified vector. By default, the probability of selecting a specific element is equally-weighted over the elements. For example, this is appropriate when drawing cards from a deck.

> cards <- sapply(c('H','S','C','D'), function(x) paste(x, 1:13, sep='')) > sample(cards, 5) [1] "C12" "S6" "H10" "S12" "D7"

To bootstrap a data set, set `replace=TRUE`

, which will reuse all elements for each sample selection. This option is appropriate when flipping a coin multiple times.

> sample(c('H','T'), 10, replace=TRUE) [1] "T" "H" "H" "H" "H" "T" "T" "T" "H" "H"

For more sophisticated bootstrapping, see the boot package [4].

### Covariance and Correlation

Not surprisingly, there are built-in functions for calculating the covariance (`cov`

) and correlation (`cor`

) between data. You can even convert between them with `cov2cor`

.

> cov(h$XOM, h$F)

** Exercise**: For portfolio h, write a function to calculate the covariance matrix without using

`cov`

.### Regression

Linear regression is performed by using the `lm`

function, which fits data to linear models. We can use it to estimate the parameters of the CAPM for our toy portfolio. Note we assume the risk free rate is 0 to simplify the example.

> ws <- t(sapply(1:100, function(x) rdirichlet(rep(1,5)) )) > p <- apply(h * ws, 1, sum) > m <- getPortfolioReturns('^GSPC',100) > m$portfolio <- p > colnames(m) <- c('market','portfolio') > out <- lm(portfolio ~ market, m) > summary(out)

In this example we are generating random weights for each day to calculate our portfolio returns. Afterward, we download our returns for our market and run the regression.

> plot(as.vector(m$market), as.vector(m$portfolio)) > abline(out)

A good discussion of doing regression analysis in R is in [5]. Other forms of regression analysis are supported as well but is out of scope for this document.

### References

[1] An Introduction to R Graphics

[2] Fitting Distributions With R

[3] *Lattice* – Multivariate Data Visualization with *R* – Figures and Code

[4] Bootstrapping Regression Models

[5] Practical Regression and Anova using R

swtzang

said:Regarding your code:

ws <- t(sapply(1:100, function(x) rdirichlet(rep(1,5)) ))

The returned message is

Error in length(alpha) : 'alpha' is missing

I think there may be some missing value to generate random draws from this distribution. Any suggestion on this? Thanks for your help.

Brian Lee Yung Rowe

said:Hello, I did forget to mention that you need to have the bayesm package installed and loaded to run that:

install.packages(‘bayesm’)

library(bayesm)

Curiously, even without the loading the library, I wasn’t able to get the error message you’re seeing. What version of R are you running? I’m running 2.14.0.

swtzang

said:I am still have this problem after installing the package you mentioned. I am running 2.14.1.

Brian Lee Yung Rowe

said:Can you provide the console output so I can see exactly the commands you are using? Seems like there’s a disconnect somewhere.