Slides for “Achieving Practical Reproducibility with Transparency and Accessibility” (DSSV 2020)

I was invited to speak at the ASA’s Symposium on Data Science and Statistics as well as the SAMSI/IASC conference on Data Science, Statistics, and Visualization thanks to Jim Harner at WVU. Both talks were on my approach to reproducible science based on my forthcoming book Introduction to Reproducible Science in R.

The talks make two key points:

  • The scientific method can be deconstructed into methodology and environment. Code is your methodology and your workstation is your environment.
  • Reproducibility is a function of transparency and accessibility.

I’ve seen a lot of emphasis on tools to improve reproducibility but less on process. Even if we make it easy to reproduce someone else’s work exactly (same data, same code), the tool cannot interpret the methodology for us.

Here’s a simple example I learned in college. Suppose I have a black box that can compute 16/64. When you run the function you get the correct answer: 1/4. However, the method simply cancels the sixes to yield 1 over 4.

Cancel the sixes
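
To make the black box concrete, here is a minimal Python sketch of the "cancel the sixes" method next to a proper reduction. The function names `cancel_sixes` and `reduce_properly` are hypothetical, chosen only for illustration, and the sketch assumes two-digit inputs that each contain a 6.

```python
from fractions import Fraction


def cancel_sixes(numerator: int, denominator: int) -> Fraction:
    """Flawed 'black box': strike one digit 6 from each number, then divide.

    Assumes each input is a two-digit number containing a 6 (illustration only).
    """
    n = str(numerator).replace("6", "", 1)
    d = str(denominator).replace("6", "", 1)
    return Fraction(int(n), int(d))


def reduce_properly(numerator: int, denominator: int) -> Fraction:
    """Correct method: Fraction reduces by the greatest common divisor."""
    return Fraction(numerator, denominator)


# Same answer for 16/64, so the output alone cannot reveal the flawed method.
print(cancel_sixes(16, 64) == reduce_properly(16, 64))

# A different input exposes it: striking the sixes from 16/63 gives 1/3.
print(cancel_sixes(16, 63) == reduce_properly(16, 63))
```

Checking only the output for 16/64 would pass both functions. Reading the code is what shows that one of them is wrong.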

You may get the desired results but for the wrong reasons. And bad actors will produce desired results by cheating or lying. Transparency ensures others can verify that results are credible. The hydroxychloroquine sham dataset provided by Surgisphere is a poignant example showing how data provenance is one component of transparency. An earlier “study” by the CEO of Surgisphere included doctored images as “results”. Without a transparent method, it’s hard to root out bad actors.

Related to transparency is accessibility: are people of differing ability levels able to reproduce results? In the deep learning realm, many models are not accessible due to their size. GPT-3 is estimated to have cost $4.5 million to train, which means few people can reproduce the results of GPT-3. (That said, the deep learning solution to this problem is to use transfer learning to demonstrate generalizability of the model, thus skirting the issue of strict reproducibility.)

The complete presentation from DSSV goes into a bit more detail. Enjoy!