How to reliably access network resources in R

It’s frustrating when an application unexpectedly dies due to a network timeout or the unavailability of a network resource. Veterans of distributed systems know not to rely on network-based resources, such as web services or databases, since they can be unpredictable. So what are data scientists supposed to do when they must use these resources in their analyses and applications?

When there is a true network partition, there’s not much you can do, since these resources are inaccessible. Most of the time, though, the issue is a timeout due to network latency or an unresponsive server. In these situations, the problem is temporary. It would be nice to recover from the error without adding a bunch of logic that muddies up your model code. Recovery can be as simple as trying again, eventually failing if a resource is truly unavailable.

The new function ntry in lambda.tools 1.0.5 does just this: it calls a function up to n times, returning the result of the first successful call.

Here’s an example of how it works. The following function simulates an unreliable resource that fails 75% of the time. Using ntry, the function is called repeatedly until it either succeeds or the retry limit is reached.

library(lambda.tools)
library(futile.logger)

fn <- function(i) {                # i is the attempt number, passed in by ntry
  x <- sample(1:4, 1)              # draw 1, 2, 3, or 4 uniformly at random
  flog.info("x = %s", x)
  if (x < 4) stop('stop') else x   # fails 75% of the time
}

Calling the function in isolation will most likely fail:

> fn()
INFO [2015-01-21 18:26:21] x = 2
Error in fn() : stop

This is similar to what happens with a timeout, where sometimes a function will fail. To get around this, normally a loop of some sort is introduced to try a few times until the call succeeds. With ntry it’s simply a matter of wrapping a function in a closure and specifying the number of tries.

> ntry(fn, 6)
INFO [2015-01-21 18:39:21] x = 2
INFO [2015-01-21 18:39:21] x = 4
[1] 4
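Under the hood, ntry behaves roughly like the retry loop you would otherwise write by hand. Here is a minimal sketch of that pattern (the actual lambda.tools implementation may differ in its details):

```r
# A minimal sketch of the retry pattern ntry provides: call fn up to n
# times, return the first successful result, and re-raise the last error
# if every attempt fails.
retry <- function(fn, n) {
  for (i in seq(n)) {
    # Capture any error as a condition object instead of aborting
    result <- tryCatch(fn(i), error = function(e) e)
    if (!inherits(result, "error")) return(result)
  }
  stop(result)  # all n attempts failed; re-raise the last error
}
```

Like ntry, this sketch passes the attempt number i to the supplied function, so the wrapped call can log or inspect which try it is on.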

Here’s a real-world example using RPostgreSQL. In a single function, a connection is opened, the query executed, and the connection closed.

library(RPostgreSQL)

db_execute_query <- function(query) {
  drv <- dbDriver("PostgreSQL")
  con <- dbConnect(drv, host=HOST, port=PORT, dbname=DATABASE,
    user=USER, password=PASS)
  # Register the cleanup only once the connection exists, so it always
  # runs -- even when dbGetQuery throws
  on.exit(dbDisconnect(con))
  dbGetQuery(con, statement=query)
}

For this to work with ntry, I use the on.exit function to ensure the connection is closed no matter how the function exits. Normally I’d wrap the call in a tryCatch block, but since ntry will catch the error, I leave this code naked. The call to ntry wraps the DB query in a closure, where the argument i is the attempt number. This is useful if you want to debug the call. The second parameter is simply the number of tries.

df <- ntry(function(i) db_execute_query(query), 3)
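Since the query function is left naked, it’s worth convincing yourself that on.exit really does fire when the body errors. A quick base-R sanity check:

```r
# on.exit expressions run when the function exits for any reason,
# including an error, so a connection registered this way is always
# released before ntry sees the failure and retries.
f <- function() {
  on.exit(cat("cleanup ran\n"))
  stop("boom")
}
try(f())  # "cleanup ran" is printed even though f() errors
```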

Access to the database is now a bit more resilient. To try it out yourself, install the latest version of lambda.tools via devtools.

library(devtools)
install_github('zatonovo/lambda.tools')