I’m pleased to share Part I of my new book, “Introduction to Reproducible Science in R”. The purpose of this book is to approach model development and software development holistically, to help make science and research more reproducible. The need for such a book arose from challenges I’ve seen teaching graduate courses in natural language processing and machine learning, as well as training my own staff to become effective data scientists. While quantitative reasoning and mathematics are important, I often found that the primary obstacle to good data science was a lack of reproducibility and repeatability: it’s difficult to quickly reproduce someone else’s results. And this causes myriad headaches:
- It’s difficult to validate a model
- It’s difficult to diagnose a model
- It’s difficult to improve/change a model
- It’s difficult to reuse parts of a model
Ultimately, without repeatability, it’s difficult to trust a model. And if we can’t trust our models, what good are they?
This book therefore focuses on the practical aspects of data science: the tools and workflows necessary to make work repeatable and efficient. Part I introduces a complete Linux-based toolchain for going from basic prototyping to full-fledged operational/production models. It introduces tools like bash, make, git, and docker, and I show how these tools fit together to bring structure and repeatability to your project. These are tools that (some) professional data scientists use, and they will serve you throughout your career.
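To give a flavor of how R plugs into such a toolchain, here is a minimal sketch of my own (not taken from the book): an R script exposed as a command-line entry point that make or docker could invoke. The file names, paths, and the response column are illustrative assumptions.

```r
#!/usr/bin/env Rscript
# Hypothetical entry point (run_model.R) that `make` or `docker run` could call,
# so the same command rebuilds the model anywhere. Paths are illustrative.
args <- commandArgs(trailingOnly = TRUE)
data_path <- if (length(args) >= 1) args[1] else "data/input.csv"

df <- read.csv(data_path)
model <- lm(y ~ ., data = df)            # assumes a response column named y
dir.create("output", showWarnings = FALSE)
saveRDS(model, file.path("output", "model.rds"))
```

A Makefile target or a Dockerfile CMD can then run something like `Rscript run_model.R data/input.csv`, which is what makes the run repeatable rather than dependent on an interactive session.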
With this foundation established, Part II describes common model development workflows, from exploratory analysis to model design through to operationalization and reporting. I walk through these archetypal workflows and discuss the approaches and tools for accomplishing each step. Examples include designing code to compare models, effective approaches to testing code, and setting up a server and scheduling jobs to run models.
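As a taste of the model-comparison workflow, here is a small sketch, entirely my own and not the book’s code; it assumes a data frame `df` with a numeric response `y` and a predictor `x1`.

```r
# A minimal sketch of comparing candidate models side by side.
# `df`, `y`, and `x1` are assumed names, not the book's example.
fit_funs <- list(
  full    = function(d) lm(y ~ .,  data = d),   # all predictors
  reduced = function(d) lm(y ~ x1, data = d)    # a single predictor
)
fits <- lapply(fit_funs, function(f) f(df))     # fit every candidate
rmse <- sapply(fits, function(m) sqrt(mean(residuals(m)^2)))
rmse                                            # compare in-sample error
```

Keeping each candidate as a function in a list makes it trivial to add another model without touching the comparison code.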
Finally, in Part III I dive into programming. Those who have read my blog know that I look at programming for data science differently from systems programming, and this changes the way you program. I’m a strong advocate of functional programming for data science because it fits better with the mathematics. This part introduces functional programming and discusses data structures from this perspective. I also show how to approach common data science problems from this point of view.
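To illustrate what this functional style looks like in R, here is a short sketch of my own (not from the book): pure functions mapped over the columns of a data frame instead of a loop with mutable state.

```r
# Pure functions: no side effects, output depends only on the input.
standardize   <- function(x) (x - mean(x)) / sd(x)
summarize_col <- function(x) c(mean = mean(x), sd = sd(x))

df     <- data.frame(a = rnorm(100), b = runif(100))
scaled <- lapply(df, standardize)        # map over columns
report <- sapply(scaled, summarize_col)  # collapse each column to a summary
report
```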
In short, this book will help you become not only a better programmer but a better scientist. I assume the reader knows how to program and has experience creating models. It is appropriate for practitioners, graduate students, and advanced undergraduates.
Any feedback is appreciated. Feel free to comment here or on Twitter.
What happened to my other book, “Modeling Data With Functional Programming In R”? For those curious, my editor and I decided to postpone publishing it until after this book. I decided that I needed to provide a foundation that readers could use to appreciate that book.
Where can I buy an ebook or PDF of your book?
Hi, thanks for your interest. It will probably be published in the second half of 2019 by CRC Press/Chapman & Hall. I imagine it will be available on Amazon and elsewhere.
Hello, I am very interested in the book. When do you think it will be finished? Let me know the progress; I will promote it on my site Analytixon.com. Thank you, Michael
Hi, thanks for your interest. It will probably be published in the second half of 2019 by CRC Press/Chapman & Hall. I imagine it will be available on Amazon and elsewhere.
Hi Brian — first off, thanks so much for making part of your book openly available! That’s great!
I had a question about this part of your book, which mentions ReproZip: “The facade of simplicity also hides the underlying implementation, creating a certain amount of vendor lock-in.”
I’m curious what you mean by this. ReproZip is 100% open source, openly developed, and maintained by folks dedicated to open source for open scholarship. Anyone can see how ReproZip implements the trace for reproducibility. There’s no vendor.
Furthermore, ReproZip bundles (.rpz) can be read by virtually any container or VM system, effectively *blocking* any vendor lock-in. The developers just added a Singularity unpacker. The ReproZip bundles themselves are fancy tarballs, and can be read even by extracting them to a folder. So…there’s no lock-in.
I am just wondering if you can expand on that thought in the book? What makes you think there’s any lock-in or secrets to the way ReproZip is implemented?
Vicky, thanks for your interest in my book. Let me first say that I don’t mean to malign either ReproZip or Reana. I think it’s great that there’s so much effort going into making science more reproducible, and there are many valid approaches.

When I say “vendor lock-in” I mean it in a colloquial sense. I know that there is no “vendor” per se and that the project is open source. That said, even choosing a CentOS- versus a Debian-based Linux distribution is a form of lock-in, because there is a specific way of doing things that is incompatible with other approaches. Since I’m writing about a philosophy specific to R, I consider the bias towards Python a form of lock-in with a specific way of doing things. My book emphasizes learning and using the basic building blocks of UNIX to accomplish something similar (but with R) rather than adopting new tools/frameworks.

I hope that clarifies what I wrote, and please let me know if you think I’ve mischaracterized ReproZip in any way. Warm regards, Brian