Best practices for logging computational systems in R and Python

Tags

, ,

As is the case with most quant software, it’s a bit different from run-of-the-mill software. The somewhat prosaic world of logging is one such place where there are some differences. What’s different about quant systems? First, they have multiple run modes. Particularly in finance, models often run in real-time but also historically. Models may also run incrementally or in batch. These scenarios each have their own logging needs. In particular, real-time processes typically need to show information about every event, whereas batch processes might only log every n events/samples. Alternatively, incremental processes may require a lot of debugging information, whereas the performance of batch processes will suffer with too much logging.

A good approach that balances these different scenarios is to log messages to both the console and a file. For interactive use the console is most convenient since the feedback is immediate. For long batch processes or running as a system, having the information in a log file is ideal. Trying to search for a log event in the console is a good way to waste a lot of time, whereas as a file, you can take advantage of standard UNIX tools like grep and sed. Another nice thing about two loggers is that you can assign different log levels to each one. The console can be verbose, while the file can be more terse to ease searching for information. The same approach works in both Python and R.

Python

Logging comes as “included batteries” in Python. The stock logging package is powerful and flexible. The drawback is that it can be onerous to setup. In large systems, it’s not uncommon to see a dozen logging configurations interspersed in code, each with their own setup, message format, and log levels. It’s much better to standardize the logging and apply it once at the top-level of the application. I usually talk about software being like an onion or a cabbage, with multiple layers of abstraction. The outer layers are tough and designed to protect against the outside world. As you go towards the center, it is clean and pure. The consequence is that the core assumes correct data types and whatnot, while it’s the responsibility of the outer layers to ensure proper data types and formats. This makes testing easier, which I discussed in a prior post. It also makes it easier to re-use functions since the pure core is more readily shared, while the use-case-specific or environment-specific code is wrapped up in the outer layers.

Anyway, back to logging. Following this principle, logging configuration is application specific, so it should be in the outer layers. In Python, there is usually a main function that acts as the entry point of a script, such as

if __name__ == '__main__':
  # Call app entry point

This is a good place to initialize your logging. Then all modules within your package/system assume logging is configured. If it isn’t Python will tell you that logging hasn’t been initialized, so you know when this assumption is faulty.

Python logging demands intialization. The simplest is to use a one-liner such as logging.basicConfig(level=logging.INFO), which sets the ROOT logger level to INFO and prints to stdout. But you’ll grow out of this in a few days, particularly because the default message format is not great. Better to use a configuration file to manage the logging system. The default ini-style log configurations are somewhat arcane and opaque due to the way sections have to be referenced by other variables. The newer YAML syntax is more intuitive. As a markup language YAML is becoming more commonplace, so it’s a reasonable replacement.

Our goals for logging include:

  1. define a useful message format that is both informative and easy to read;
  2. a verbose logger that writes to the console;
  3. a less verbose logger that writes to a file.

The following configuration satisfies the above goals. Typically, this configuration can be placed in a conf directory that contains all configuration files.

version: 1
formatters:
  simpleFormater:
    format: '%(levelname)7s [%(asctime)s] %(name)s: %(message)s'
    datefmt: '%Y-%m-%d %H:%M:%S'
  
handlers:
  console:
    class: logging.StreamHandler
    formatter: simpleFormater
    stream: ext://sys.stdout
  file:
    level: INFO
    class : logging.FileHandler
    formatter: simpleFormater
    filename: example.log

loggers:
  clogger:
    handlers: [console]
  flogger:
    handlers: [file]

root:
  level: DEBUG
  handlers: [console, file]

Let’s call this config file example_log.yaml. Using the logging configuration is easy: open the file and tell the logging system to use it.

import logging.config, yaml
with open('example_log.yaml') as f:
  D = yaml.load(f)
  logging.config.dictConfig(D)

Now in your modules, you create a logger and log messages as regular.

logger = logging.getLogger(__name__)
logger.info("This is an INFO log message")
logger.warning("This is a WARNING message")
logger.debug("This is a DEBUG message")

In this example, all three messages will be output to the console, but only two will be written to disk.

This approach works for most purposes. If you want to change the name of the log file, update the filename property of the file handler from example.log to your preferred log file name. For additional customizations, refer to the Python Logging Cookbook.

R

In R, logging is not a in the included batteries. To fill this void, I wrote the futile.logger package to provide an easy to use logging facility in R. This package has similar semantics to Python’s logging. The primary difference is that a convenient initial configuration is provided by default. In fact, to get console output, no configuration is needed at all! For example, after installing from CRAN or github, you can log immediately to the console with a threshold of INFO

> flog.info("My first log message")
INFO [2016-07-07 15:44:44] My first log message

futile.logger also supports multiple loggers and can write to files. All interaction is done via functions, where a logger is specified. By default, the ROOT logger is used, so if you don’t need more than that, don’t worry about the logger name. For example, to change the log threshold to DEBUG, use flog.threshold.
> flog.threshold(DEBUG)
> flog.debug(“My debug log message”)
DEBUG [2016-07-07 15:52:21] My debug log message
[/sourcecode]

This means during an interactive run, you can quickly turn on DEBUG level messages, while the normal code runs at INFO level. If you want to write to both console and file, specify a new appender using flog.appender. Many appenders come bundled with futile.logger. Here we’ll use the tee appender to write to console and file.

> flog.appender(appender.tee('example.log'))
NULL
> flog.info("Writing to file")
INFO [2016-07-07 15:54:16] Writing to file

Unlike Python logging, futile.logger does not support configuration files. Instead, a code-based configuration must be used. The same principle can be followed though, where the configuration should be at the top level of the code. For R packages, this can be in the .onLoad(libname, pkgname) function. More information is available in the source.

Other Considerations

While logging greatly helps debugging code and understanding what a system is doing, there are a few things to be wary of. First, make sure not to accidentally commit your log files into your repository. The best way to prevent this is to add *.log in your .gitignore file.

Conclusion

With a bit of forethought, logging can be beneficial to your system and development process. By using the recipes above, you can quickly take advantage of logging in your data science projects.

Follow

Get every new post delivered to your Inbox.

Join 567 other followers