Plentiful high-quality data is the key to great machine learning models. But good data doesn’t grow on trees, and that scarcity can impede the development of a model. One way to get around a lack of data is to augment your dataset. Smart approaches to programmatic data augmentation can increase the size of your training set 10-fold or more. Even better, your model will often be more robust (and less prone to overfitting) and can even be simpler thanks to the better training set.

There are many approaches to augmenting data. The simplest approaches include adding noise and applying transformations on existing data. Imputation and dimensional reduction can be used to add samples in sparse areas of the dataset. More advanced approaches include simulation of data based on dynamic systems or evolutionary systems. In this post we’ll focus on the two simplest approaches: adding noise and applying transformations.

Regression Problems

In many datasets we expect unavoidable statistical noise due to sampling and other factors. For regression problems, we can explicitly add noise to our explanatory variables. Doing so can make the model more robust, although we need to take care when constructing the noise term. First, the noise should have zero mean so it doesn’t bias the model. Second, it should be independent of the existing variables and across samples.

In my function approximation example, I demonstrated creating a simple neural network in Torch to approximate a function. Let’s use the same dataset and function but this time add noise as a preprocessing step. Before, I generated the training data directly in Torch. However, in my simple workflow for deep learning, I said I prefer using R for everything but training the model. Hence, I generate and add noise in R and then write out a headerless CSV file. This process expands the training set from 40k samples to ~160k samples.

The following function accomplishes this, accepting an arbitrary noise function as an argument. The default adds uniform noise centered at zero.

perturb_unif <- function(df, mult=4, fr=function(a) a + runif(length(a)) - .5) {
  # Build one perturbed copy of the data: noise goes on the explanatory
  # variables x and y, while the response z is left untouched
  fn <- function(i) data.frame(x=fr(df[,1]), y=fr(df[,2]), z=df[,3])
  # Stack mult perturbed copies into a single data frame
  o <- lapply(1:mult, fn)
  do.call(rbind, o)
}

Exercise: What is the purpose of the -0.5 term?
Exercise: What are some other valid noise functions? Why would you choose one over another?

I then read this file into Lua using a custom function loadTrainingSet, which is part of my deep_learning_ex guide. This function reads the CSV and creates a Torch-compatible table comprising the input and output tensors. The function simply assumes that the last column of the CSV is the output.

Using this approach, it’s possible to create a 20-hidden-node network that performs as well as the 40-node network in the earlier post. Think about this: by adding noise (and increasing the size of the training set), we’ve managed to reduce the complexity of the network 2-fold.

Adding noise makes the model simpler and more robust

Kool-Aid Alert: Depending on the domain and the hyperparameters chosen, this approach may not produce desirable results. As with most deep learning exercises, tuning the hyperparameters is mandatory.

Classification Problems

Noise can be used in classification problems as well. One particularly useful application is balancing a dataset. In a binary classification problem, suppose the data is split 80-20 between the two classes. Such an imbalanced set is a well-known problem for machine learning algorithms: some will simply default to predicting the majority class, since the naïve accuracy is already reasonable. In small datasets, balancing by trimming the majority class can be counterproductive, since it discards already-scarce data. The alternative is to increase the number of samples in the smaller class, and the same noise-augmentation approach works well here.
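As a concrete illustration, here is a minimal numpy sketch of oversampling the minority class with small Gaussian noise until the classes are balanced. The function name and the noise scale are illustrative choices, not from the post.

```python
import numpy as np

def balance_with_noise(X, y, minority=1, scale=0.05, seed=0):
    """Oversample the minority class by adding small zero-mean Gaussian
    noise to copies of its samples until both classes have equal counts."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority]
    # how many extra minority samples we need to reach balance
    n_needed = int(np.sum(y != minority)) - len(X_min)
    # draw rows (with replacement) from the minority class and perturb them
    idx = rng.integers(0, len(X_min), size=n_needed)
    X_new = X_min[idx] + rng.normal(0, scale, size=(n_needed, X.shape[1]))
    X_bal = np.vstack([X, X_new])
    y_bal = np.concatenate([y, np.full(n_needed, minority)])
    return X_bal, y_bal
```

Starting from an 80-20 split of 100 samples, this yields 160 samples split 50-50.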

Another family of classification problems is image classification. Following a similar approach, noise can be added to images to make the model more robust. Another popular technique is transforming the data. This makes sense for images, since changes in perspective can change the apparent shape of an object. Transparent or reflective surfaces can also distort an object, but despite the distortion we recognize it as the same object. Affine transformations provide simple linear transforms that can expand a dataset: shifting, scaling, rotating, flipping, and so on. While a good starting point, some problems may benefit from more complex transformations.
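The simplest of these transforms can be written directly with array indexing. A minimal numpy sketch, treating an image as an H x W array (the helper names are mine):

```python
import numpy as np

def hflip(img):
    """Mirror an H x W image left-to-right."""
    return img[:, ::-1]

def shift(img, dy, dx):
    """Translate by (dy, dx) pixels, filling vacated pixels with zeros."""
    out = np.zeros_like(img)
    h, w = img.shape[:2]
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        img[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return out

def augment(img):
    """Return the original image plus a few affine variants."""
    return [img, hflip(img), np.rot90(img), shift(img, 2, -3)]
```

Each original image becomes four training samples; composing these transforms (and varying their parameters randomly) multiplies the dataset further.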

Still a cat

Transformed images can be generated in a pre-processing step as above. The Torch image toolkit can be used this way. For example, here is how to flip an image horizontally.

img = image.load("cat.jpg", 3, 'byte')  -- load as a 3-channel byte tensor
img1 = image.hflip(img)                 -- mirror left-to-right

Original cat


Flipped cat

Alternatively, these variations can be generated inline during the training process. Keras takes this approach with its ImageDataGenerator class, transforming images on the fly during training.
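The idea behind on-the-fly augmentation can be sketched in a few lines of plain Python. This is a simplified stand-in for what ImageDataGenerator does (random horizontal flips only), not its actual API:

```python
import numpy as np

def augmenting_batches(X, y, batch_size=32, seed=0):
    """Generator yielding training batches from an (N, H, W) image array,
    flipping each image horizontally with probability 0.5 -- so the model
    sees slightly different data every epoch."""
    rng = np.random.default_rng(seed)
    n = len(X)
    while True:  # loop forever; the training loop decides when to stop
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            batch = X[idx].copy()
            flips = rng.random(len(idx)) < 0.5
            batch[flips] = batch[flips][:, :, ::-1]  # flip the width axis
            yield batch, y[idx]
```

Because the flips are drawn fresh each pass, no augmented images are ever stored on disk.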

Exercise: Which approach will produce better results? Why?


Just because you don’t have as much data as Google or Facebook doesn’t mean you should give up on machine learning. By augmenting your dataset, you can get excellent results with small data.

Do you use approaches not mentioned above to augment your data? Share them in the comments below.