Learning Data Science

This page lists a number of resources for the budding data scientist and also the seasoned practitioner. If you come from applied mathematics, or some other quantitative/computational field, there’s not much separating you from a data scientist. These resources will help fill in the gaps.

I’m just getting started with these resources, so if you think there’s a resource I should add, give me a shout.

Fundamentals

R Programming

  • If you are already familiar with programming, then you might prefer my lecture notes for a workshop on R that I taught at Baruch. While the focus is on quantitative finance, the material is accessible to people from all fields.
  • For intermediate and advanced R users, I recommend taking a look at my book in progress, Modeling Data With Functional Programming In R, which discusses leveraging functional programming concepts to develop your models. The basic argument is that in the age of big data, functional programming has become the de facto paradigm used to parallelize models and systems. Writing in a functional style thus makes it easy to migrate your code to distributed computing frameworks, such as Spark and H2O.
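
To make the parallelization argument concrete, here is a minimal sketch of the idea, written in Python for illustration rather than taken from the book. The names fit_mean_model and fit_all are hypothetical, and the "model" is just a sample mean. Because the per-sample function is pure, the only thing that changes between the sequential and the concurrent run is which map gets passed in.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def fit_mean_model(sample):
    """Pure function (hypothetical stand-in for a model fit): no shared state."""
    return mean(sample)


def fit_all(map_fn, samples):
    """Fit one model per sample; map_fn decides how the work is scheduled."""
    return list(map_fn(fit_mean_model, samples))


samples = [[1.0, 2.0, 3.0], [10.0, 20.0], [5.0]]

# Sequential: the built-in map.
sequential = fit_all(map, samples)

# Concurrent: same call, different map. Executor.map preserves input order.
with ThreadPoolExecutor(max_workers=2) as pool:
    threaded = fit_all(pool.map, samples)

print(sequential)
print(threaded)
```

The same shape carries over to R, where a call to lapply can typically be swapped for parallel::mclapply. Distributed frameworks expose a similar map-style interface, though the details differ by framework.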

Linear Algebra

Many machine learning concepts rely heavily on linear algebra. If the idea of learning a new field of math scares you, don’t worry: math books these days are not as terse as they used to be. When I was at university, you could determine how advanced a math course was based on how small the book was. By my senior year, it took two semesters to complete a 150-page book. Books on linear algebra tend to span this continuum depending on their audience. One consequence of the more accessible, more verbose approach is that different concepts are emphasized (and others de-emphasized). Here are two that I’ve taught from.

  • For the timid, the UC Davis Linear Algebra book by Cherney, Denton, and Waldron is your best option. Not only is it written in an informal style, it has numerous examples as well as its own hosted platform for doing the exercises. One aspect I like about CDW is that it discusses a number of practical examples, including optimization, which is a must for any data scientist.
  • For those looking for a refresher, the Kuttler book, Linear Algebra, Theory and Applications is more to the point. This book is far more comprehensive, so the preliminary chapter covers much of chapters 1-5 of CDW. Thus it dives in with matrices as linear transformations, which isn’t discussed until Chapter 6 in CDW!

Principal Components Analysis

Machine Learning

Somewhat of an umbrella term, machine learning covers numerous statistical learning techniques ranging from vanilla regression to more exotic approaches like differential evolution and neural networks.

  • The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman is considered by many to be a classic in the field. While the authors claim that the book is not comprehensive, at nearly 700 pages, it is quite thorough in both breadth and depth. What’s nice about this book is that there is a lot of discussion of the tradeoffs between different models, which helps to develop an intuition for the workings of various methods. For this book you should be comfortable with vector notation and calculus. Code examples are available in R.
  • Another classic is Applied Predictive Modeling by Kuhn & Johnson. I’ve taught from this book and can say that students generally enjoyed the narrative approach of the text. As in ESL, the authors focus on making the material accessible and discussing the tradeoffs of different methods. To support a clean narrative, code examples are separate from the primary text and appear at the end of each chapter. At first this is a little disorienting, but once you get used to it, the flow is much nicer. Do note that this book is designed for practitioners and assumes a bit more prior knowledge than ESL.

Neural Networks