This is the second part in a three-part series on teaching R to MFE students at CUNY Baruch. The focus of this lesson is on basic statistical methods in R.

Contents

PART I: PRELIMINARIES

PART II: STATISTICS

PART III: STRUCTURING CODE

Distributions

As a statistical programming language, R provides many functions for describing distributions. Standard distributions include normal, binomial, uniform, chi-squared, etc. Others are available from add-on packages.

Distribution functions follow a standard pattern in R, which makes it easy to use any of the available variants. The stats package, which is automatically loaded, provides functions for many common distributions. To see what is available, try ?Distributions.

Prefix   Description                       Example
d*       density function (PDF)            dnorm
p*       distribution function (CDF)       pnorm
q*       quantile function (inverse CDF)   qnorm
r*       random generation                 rnorm
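The four families are mutually consistent, which is worth verifying once to build intuition. A quick check (using the normal distribution as an example):

```r
# p* inverts q*: applying the CDF to a quantile recovers the probability
pnorm(qnorm(0.975))     # 0.975

# the density at 0 for the standard normal is 1/sqrt(2*pi)
dnorm(0)                # ~0.3989

# empirical frequencies from r* approximate the p* probabilities
set.seed(1)
mean(rnorm(10000) < 0)  # ~ pnorm(0) = 0.5
```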

Let’s look at a plot for each of these functions. Notice the usage of par and plot to generate each chart. When adding multiple charts to a plot, special parameters must be set to configure the layout. When you are done, it’s good practice to restore the previous configuration by passing the old settings (as returned by the first call to par) back to par.

opar <- par(mfrow=c(2,2), mar=c(3,5,1,1)+0.1)
x <- seq(-5,5, by=0.1)
y <- seq(0,1, by=0.02)

plot(x, dnorm(x), t='l')
plot(x, pnorm(x), t='l')
plot(y, qnorm(y), t='l')
plot(x, rnorm(x))
par(opar)

The graphics capabilities in R are quite extensive [1] but can be cumbersome to work with. There are numerous packages that attempt to improve the way graphics are handled in R, including ggplot2 and lattice [3].

Exercise: Write a function plot_dists(name) that generates the above plot for the given distribution.

Analyzing Distributions

It is easy to spot check whether a given set of data fits a distribution by using a QQ plot. The qqnorm and qqline functions use a normal distribution as a default. For other distributions, use qqplot to explicitly specify the x data series.

h <- getPortfolioReturns(c('AAPL','XOM','KO','F','GS'), 100)
qqnorm(h$XOM)
qqline(h$XOM)


As expected, the tails do not fit a normal distribution. To align the theoretical quantiles with the sample quantiles, there are two approaches: normalize the sample data or use a different set of theoretical quantiles [2].

z <- (h$XOM - mean(h$XOM)) / sd(as.vector(h$XOM))
qqnorm(z)
qqline(z)

Exercise: Generate a QQ plot using a better fit for the theoretical distribution. Hint: Try using rnorm.

Parameter Estimation

To fit data to a particular distribution, you can use any of the numerous optimization functions (optim, nlm, optimize, etc.) to minimize your negative log-likelihood function. Here is a contrived example of estimating the mean and variance of one of our stocks.

# Returns a closure computing the negative log-likelihood of a normal
# distribution, where theta[1] is the mean and theta[2] is the variance
get_estimator <- function(x)
{
  n <- length(x)
  function(theta)
  {
    n/2*log(2*pi) + n/2*log(theta[2]) + 1/(2*theta[2]) * sum((x-theta[1])^2)
  }
}

Let’s execute the function and compare with the built-in functions.

> f <- get_estimator(h$XOM)
> optim(c(0,1), f)$par
[1] 0.0020516656 0.0002079123
> mean(h$XOM)
[1] 0.002052150
> sd(h$XOM)^2
         XOM
0.0002100090

Note the use of a closure in this example. Once we have the closure, it is easy to pass the closure to other optimization functions for experimentation. We will discuss optimization in more detail in the next section.
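To illustrate that last point, here is a self-contained sketch that hands the same closure to nlm instead of optim. A simulated return series stands in for h$XOM, and get_estimator is the function defined above.

```r
# Simulated daily returns stand in for h$XOM
set.seed(1)
x <- rnorm(250, mean=0.002, sd=0.015)

# Build the negative log-likelihood closure and minimize it with nlm
# (nlm may warn if a search step makes theta[2] negative, but typically
# still converges to roughly c(0.002, 0.015^2))
f <- get_estimator(x)
nlm(f, c(0, 1))$estimate
```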

Sampling Data

R provides many methods for sampling data from an existing data set, depending on what you want to accomplish. The simplest is sample, which selects entries from the specified vector. By default, sampling is without replacement and each element is equally likely to be selected. For example, this is appropriate when drawing cards from a deck.

> cards <- sapply(c('H','S','C','D'), function(x) paste(x, 1:13, sep=''))
> sample(cards, 5)
[1] "C12" "S6" "H10" "S12" "D7"

To bootstrap a data set, set replace=TRUE, which samples with replacement so the same element can be drawn more than once. This option is appropriate when flipping a coin multiple times.

> sample(c('H','T'), 10, replace=TRUE)
 [1] "T" "H" "H" "H" "H" "T" "T" "T" "H" "H"

For more sophisticated bootstrapping, see the boot package [4].

Covariance and Correlation

Not surprisingly, there are built-in functions for calculating the covariance (cov) and correlation (cor) between data. You can even convert between them with cov2cor.

> cov(h$XOM, h$F)
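The conversion in the other direction is a simple rescaling, which cov2cor performs for a whole matrix at once. A small sketch with a hypothetical 2x2 covariance matrix:

```r
# Hypothetical covariance matrix: variances 4 and 9, covariance 2
V <- matrix(c(4, 2, 2, 9), 2, 2)

# cov2cor divides each entry by the product of the standard deviations,
# so the off-diagonal becomes 2 / (2*3) = 1/3
cov2cor(V)
```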

Exercise: For portfolio h, write a function to calculate the covariance matrix without using cov.

Regression

Linear regression is performed by using the lm function, which fits data to linear models. We can use it to estimate the parameters of the CAPM for our toy portfolio. Note that we assume the risk-free rate is 0 to simplify the example.

> ws <- t(sapply(1:100, function(x) rdirichlet(1, rep(1,5)) ))
> p <- apply(h * ws, 1, sum)
> m <- getPortfolioReturns('^GSPC',100)
> m$portfolio <- p
> colnames(m) <- c('market','portfolio')
> out <- lm(portfolio ~ market, m)
> summary(out)

In this example we generate random weights for each day to calculate our portfolio returns. Afterward, we download the market returns and run the regression.

> plot(as.vector(m$market), as.vector(m$portfolio))
> abline(out)
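The CAPM beta is the slope coefficient of the fitted model, which coef extracts from the lm object. As a self-contained sketch, simulated series play the roles of the market and portfolio returns above (the true beta of 1.2 is an arbitrary choice):

```r
# Simulated market returns and a portfolio with a known beta of 1.2
set.seed(1)
market <- rnorm(100, 0, 0.01)
portfolio <- 0.001 + 1.2 * market + rnorm(100, 0, 0.005)

# The slope on the market factor is the estimated beta, close to 1.2
fit <- lm(portfolio ~ market)
coef(fit)['market']
```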

A good discussion of doing regression analysis in R can be found in [5]. Other forms of regression analysis are supported as well but are out of scope for this document.

References

[1] An Introduction to R Graphics
[2] Fitting Distributions With R
[3] Lattice - Multivariate Data Visualization with R - Figures and Code
[4] Bootstrapping Regression Models
[5] Practical Regression and Anova using R
