Testing is an often overlooked yet critical component of any software system. In some ways this is more true of models than traditional software. The reason is that computational systems must function correctly at both the system level and the model level. This article provides some guidelines and tips to increase the certainty around the correctness of your models.


Guiding Principles

One of my mantras is that a good tool extends our ability and never gets in our way. I avoid many libraries and applications because the tool gets in my way more than it helps me. I look at tests the same way. If it takes too long to test, then the relative utility of the test is lower than the utility of my time. When that happens I stop writing tests. To ensure tests are worthwhile, I follow a handful of principles when writing tests.

In general, tests should be:

  • self-contained – don’t rely on code outside of the test block;
  • isolated – independent and not commingled with other test cases;
  • unique – focus on a unique behavior or property of a function that has not been previously tested;
  • useful – focus on edge cases and inputs where the function may behave erratically or wildly;
  • scalable – easy to write variations without a lot of ceremony.

By following these principles, you can maximize the utility of the tests with minimal effort.

Testing as Metropolis-Hastings

I like to think about testing as an application of Markov chain Monte Carlo (MCMC). Think about the function you want to test. If it is written without side effects, then for each input you get an output that you can examine. Covering every single input is typically untenable, so the goal is to optimize this process. That’s where the MCMC concept comes into play, specifically the Metropolis-Hastings algorithm (M-H). This technique produces random samples that follow an arbitrary probability distribution. In other words, more samples land where the distribution is dense than in regions of low probability.
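
To make the sampling idea concrete, here is a minimal sketch of a random-walk M-H sampler in base R. The function name metropolis_hastings, its parameters, and the choice of a standard normal target are assumptions made for illustration only.

```r
# Minimal random-walk Metropolis-Hastings sampler (illustrative sketch).
# `target` is a (possibly unnormalized) density function;
# `step_sd` controls how far each proposal moves from the current point.
metropolis_hastings <- function(target, n, init = 0, step_sd = 1) {
  samples <- numeric(n)
  current <- init
  for (i in seq_len(n)) {
    proposal <- rnorm(1, mean = current, sd = step_sd)
    # The proposal is symmetric, so the acceptance ratio reduces to
    # target(proposal) / target(current).
    accept_prob <- min(1, target(proposal) / target(current))
    if (runif(1) < accept_prob) current <- proposal
    samples[i] <- current
  }
  samples
}

set.seed(42)
draws <- metropolis_hastings(dnorm, n = 5000)

# Share of draws in the dense center versus the sparse tails
mean(abs(draws) < 1)
mean(abs(draws) > 2)
```

The first proportion should come out much larger than the second, which is the "more points where the distribution is dense" behavior described above.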

Now think about testing. Usually we care about edge cases and boundary conditions as this is where a function may behave unexpectedly. We need more tests for these conditions and fewer tests where we know values are well-behaved. In terms of M-H, given a probability distribution of likely inputs, we want to create test cases according to the inverse of this distribution! Following this approach, it’s actually possible to generate test cases randomly with a high degree of certainty that they cover the most critical inputs of the function.
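
As a rough sketch of the inverse-distribution idea, the snippet below draws test inputs with weights proportional to one over the input density. Note that it swaps M-H for plain weighted sampling over a bounded grid, which is simpler when the input range is known. The standard normal input distribution and the [-4, 4] range are assumptions made for illustration.

```r
set.seed(1)

# Assumption: typical inputs follow a standard normal, and the function
# under test accepts values in [-4, 4].
candidates <- seq(-4, 4, by = 0.01)

# Weight each candidate by the inverse of the input density, so that
# rare (boundary) inputs are drawn more often than common ones.
weights <- 1 / dnorm(candidates)
weights <- weights / sum(weights)

test_inputs <- sample(candidates, size = 200, replace = TRUE, prob = weights)

# Share of generated inputs near the boundaries versus near the center
mean(abs(test_inputs) > 3)
mean(abs(test_inputs) < 1)
```

With these weights almost all generated inputs cluster near the boundaries. If that is too extreme, one option is to temper the weights, for example by raising them to a power below one.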

Sane Test Creation and Management

It’s not uncommon for tests to be written at the get-go and then forgotten about. Remember that as code changes or incorrect behavior is found, new tests need to be written or existing tests need to be modified. Possibly worse than having no tests is having a bunch of tests spitting out false positives. This is because humans are prone to habituation and desensitization. It’s easy to become habituated to false positives to the point where we no longer pay attention to them.

Temporarily disabling tests may be acceptable in the short term. A more strategic solution is to optimize your test writing. The easier it is to create and modify tests, the more likely they will be correct and continue to provide value. For my testing, I generally write code to automate a lot of wiring to verify results programmatically.

The following example is from one of my interns. Most of our work at Pez.AI is in natural language processing, so we spend a lot of time constructing mini-models to study different aspects of language. One of our tools splits sentences into smaller phrases based on grammatical markers. The current test looks like this:

df1 <- data.frame(forum="Forums",
                title="My printer is offline, how do I get it back on line?",
                timestamp="2016-05-13 08:50:00",
                text="My printer is always offline, how do I get it back on line?")
                # (any remaining columns are not shown)

output_wds1<- data.frame(thread.id=c(5618300,5618300,5618300),
                        text.id= unlist(lapply(c(1,1,1), as.integer)),
                        start=unlist(lapply(c(1,5,7), as.numeric)),
                        end=unlist(lapply(c(4,6,8), as.numeric)),
                        text=c('how do i get','it back','on line'))


test_that("Test for HP forum", {
  expect_equivalent(mark_sentence(df1), output_wds1)
  expect_equivalent(mark_sentence(df2), output_wds2)
})

The original code actually contains a second test case, which is referenced in the test_that block. There are a number of issues with this construction. The first three principles are violated (can you explain why?), and on top of that it’s difficult to construct new tests. Fixing the violated principles is easy, since it just involves rearranging the code. Making the tests easier to write takes a bit more thought.

test_that("Output of mark_sentence is well-formed", {
  df <- data.frame(forum="Forums",
    title="My printer is offline, how do I get it back on line?",
    timestamp="2016-05-13 08:50:00",
    text="My printer is always offline, how do I get it back on line?")
    # (any remaining columns are not shown)

  # act = actual, exp = expected
  act <- mark_sentence(df)

  expect_equal(act$fragment.pos, c('C','C','P'))
  expect_equal(act$fragment.type, c('how','it','on'))
  expect_equal(act$start, c(1,5,7))
  expect_equal(act$end, c(4,6,8))
})

Now the test is looking better. That said, it can still be a pain to construct the expected results. An alternative approach is to use a data structure that embodies the test conditions and use that directly.

exp <- read.csv(text='
', header=TRUE, stringsAsFactors=FALSE)
fold(colnames(exp), function(col) expect_equal(act[,col], exp[,col]))

(Note that fold is in my package lambda.tools.)

The rationale behind this approach is that it can be tedious to construct complex data structures by hand. Instead, you can produce the result by calling the function to test directly, and then copy it back into the test case. The caveat is that doing this blindly doesn’t test anything, so you need to review the results before “blessing” them.
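
A minimal sketch of that review-then-bless workflow in base R might look like the following. The act data frame is a hypothetical stand-in for the output of mark_sentence, populated with the expected values from the test above, and the CSV text is reconstructed from those same values rather than taken from the original test.

```r
# Hypothetical stand-in for the result of mark_sentence(df).
act <- data.frame(
  fragment.pos = c("C", "C", "P"),
  fragment.type = c("how", "it", "on"),
  start = c(1, 5, 7),
  end = c(4, 6, 8),
  stringsAsFactors = FALSE
)

# Step 1: print the result as CSV, review it by hand,
# then paste the text into the test case.
write.csv(act, row.names = FALSE, quote = FALSE)

# Step 2: the reviewed ("blessed") text becomes the expected value.
exp <- read.csv(text = '
fragment.pos,fragment.type,start,end
C,how,1,4
C,it,5,6
P,on,7,8
', header = TRUE, stringsAsFactors = FALSE)

# Step 3: compare actual and expected column by column.
for (col in colnames(exp)) {
  stopifnot(isTRUE(all.equal(act[[col]], exp[[col]])))
}
```

Inside a test_that block, the stopifnot line would presumably become an expect_equal call so that failures are reported through the test framework.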

Such a structure extends naturally to an input file paired with an expected output file, so many more scenarios can be run in a single shot.
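
One way this file-based variant could look is sketched below. The function count_words and the helper check_scenarios are hypothetical names invented for this example, and temporary files stand in for scenario files that would normally live alongside the tests.

```r
# Hypothetical function under test: counts the words in each row's text.
count_words <- function(df) {
  data.frame(id = df$id,
             n.words = lengths(strsplit(df$text, " +")),
             stringsAsFactors = FALSE)
}

# Run fn on the input file and compare against the expected file,
# column by column. Returns the names of any columns that do not match.
check_scenarios <- function(fn, input_file, expected_file) {
  input <- read.csv(input_file, stringsAsFactors = FALSE)
  exp <- read.csv(expected_file, stringsAsFactors = FALSE)
  act <- fn(input)
  matches <- vapply(colnames(exp),
    function(col) isTRUE(all.equal(act[[col]], exp[[col]])), logical(1))
  names(matches)[!matches]
}

# Each row of the input file is one scenario.
input_file <- tempfile(fileext = ".csv")
expected_file <- tempfile(fileext = ".csv")
writeLines(c("id,text", "1,how do i get", "2,it back", "3,on line"), input_file)
writeLines(c("id,n.words", "1,4", "2,2", "3,2"), expected_file)

failed <- check_scenarios(count_words, input_file, expected_file)
length(failed) == 0
```

Adding a scenario then means adding a row to each file, with no new test code to write.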


For tests to work, we must remain sensitized to them. They need to cause us a bit of anxiety and motivate us to correct them. To maintain their effectiveness, it’s important that tests are easy to create, maintain, and produce correct results. Following the principles above will help you optimize the utility of your tests.
