Keras is a package, originally written for Python, that makes it easy to construct and train neural networks. Keras sits on top of several other neural network packages (such as TensorFlow) but is easier to use: it lets you write short, consistent code, so you spend less time making mistakes and searching for bugs. In fact, Keras is now the official front-end of TensorFlow for exactly this reason.
Keras entered the Python world in 2015, and really propelled and sustained the use of Python for neural networks and more general machine learning. R, however, did not take long to catch up, with the R Keras package released in 2017. This package essentially translates the familiar style of R to Python, allowing you to easily use the full power of Keras while enjoying the elegance of, for instance, the “tidyverse” style of programming.
While it is worthwhile to work once through the pain of coding a neural network by hand to make sure you fully understand what you’re doing, in everyday life using Keras will save you time, errors, and frustration.
Say we want a fully connected network of a few layers, for a regression problem (N.B. with many regression problems you want to try a much simpler model first!).
We have a data set train_x, which we will feed into the network. First, one installs R and keras. Then type:
library(keras)

# Part 1: build the architecture
model <- keras_model_sequential() %>%
  layer_dense(units = 64,
              activation = "relu",
              input_shape = dim(train_x)[2]) %>%
  layer_dense(units = 64,
              activation = "relu") %>%
  layer_dense(units = 1)

# Part 2: tell the model how to learn
model %>% compile(
  loss = "mse",
  optimizer = optimizer_adam(),
  metrics = list("mean_absolute_error")
)

# Part 3: train the model
model %>% fit(train_x,
              train_y,
              epochs = 20)
and that’s all you need to train your first neural network. When you consider that this is a model with 4609 parameters (the count you get when train_x has five columns), it doesn’t look so difficult to code up!
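To see where a figure like 4609 comes from, here is a minimal sketch in base R that counts the parameters of the three dense layers by hand. It assumes train_x has five columns, which is not stated in the text but is the input width that reproduces the total; the helper dense_params is a made-up name for illustration.

```r
# A dense layer has one weight per input-output pair plus one bias per unit.
dense_params <- function(n_inputs, n_units) {
  n_inputs * n_units + n_units
}

n_features <- 5  # assumed number of columns in train_x

# Layer widths: input, two hidden layers of 64 units, one output unit.
layer_sizes <- c(n_features, 64, 64, 1)

# Pair each layer's input width with its output width.
params_per_layer <- dense_params(head(layer_sizes, -1), tail(layer_sizes, -1))
total_params <- sum(params_per_layer)

print(params_per_layer)  # 384 4160 65
print(total_params)      # 4609
```

With a different number of input columns only the first layer's count changes, by 64 parameters per extra column.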
Let’s unpack this a bit, because even if you’re not familiar with R, you can sort of see what the different bits of this code are doing. Part 1 is building the architecture, part 2 is telling the model how to learn, and part 3 is training the model.
If we want to evaluate our model, it is just as simple:
model %>% evaluate(test_x, test_y)
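The model above was compiled with "mse" as the loss and mean absolute error as a metric. As a rough illustration of what those two quantities measure, here is a hand computation in base R; the vectors y_true and y_pred are invented for the example and do not come from any real model.

```r
# Hypothetical targets and predictions, for illustration only.
y_true <- c(3.0, 5.0, 2.5, 7.0)
y_pred <- c(2.5, 5.0, 4.0, 8.0)

errors <- y_pred - y_true

# Mean squared error: penalizes large errors more heavily.
mse <- mean(errors^2)

# Mean absolute error: the average size of the error, in the units of y.
mae <- mean(abs(errors))

print(mse)  # 0.875
print(mae)  # 0.75
```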
There are a lot of inbuilt layers to choose from: basically all the standard tools are there, including dense, convolutional, recurrent and so forth. If you have some non-standard operation to do in a layer, you can implement a function using layer_lambda():
# Return the element-wise square of the layer's input
square_function <- function(params) {
  params^2
}
model <- keras_model_sequential() %>%
  layer_dense(units = 64,
              activation = "relu",
              input_shape = dim(train_x)[2]) %>%
  layer_lambda(square_function) %>%
  layer_dense(units = 1)
You can also write more complicated custom layers, but take some care when you do: such layers do not always transfer as simply between the various back-ends that Keras offers.
The architecture here is standardized, so that, for instance, if I wanted to write a single-layer gated recurrent unit (GRU) network, it would look almost identical to our dense set-up:
model <- keras_model_sequential() %>%
  layer_gru(units = 32,
            activation = "relu",
            # NULL leaves the number of time steps free;
            # the last dimension of train_x is the number of features
            input_shape = list(NULL, tail(dim(train_x), 1))) %>%
  layer_dense(units = 1)
This makes it especially easy to try out different methods, for instance switching between a GRU and an LSTM, by altering one line of code.
One of the major issues to fight with when it comes to neural networks is overfitting. This is when, instead of learning general rules that work for the data, the model learns rules that fit your training set very closely but do not generalize to the test set or to other new data you have coming in. There are several techniques to handle overfitting; two common ones are regularization and dropout.
Regularization comes in two forms: L1, where the cost added is proportional to the absolute value of the weight coefficients, and L2, where the cost is proportional to the square of the weight coefficients (anyone familiar with lasso and ridge regression will recognise these terms, as they work in exactly the same way). L1 regularization tends to set some of the weights to exactly zero, while L2 regularization shrinks weights towards zero. One can of course always use a combination of the two.
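As a small illustration of the two penalties, the following base R sketch computes the extra cost that each would add to the loss for a given set of weights. The weight values and the factor lambda are invented for the example.

```r
# Hypothetical layer weights and regularization factor, for illustration only.
weights <- c(0.5, -1.5, 0.0, 2.0)
lambda <- 0.001

# L1: cost proportional to the absolute values of the weights.
l1_penalty <- lambda * sum(abs(weights))

# L2: cost proportional to the squares of the weights.
l2_penalty <- lambda * sum(weights^2)

print(l1_penalty)  # 0.004
print(l2_penalty)  # 0.0065
```

Note how the largest weight (2.0) contributes four times as much as the weight of 0.5 under L1, but sixteen times as much under L2, which is why L2 pushes hardest on large weights.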
The other, perhaps more popular, option is dropout. Dropout randomly turns off neurons in a layer during training. This means that while the network is learning, it cannot rely too much on any one neuron, so it has to spread out the learning more and can’t memorize individual data points.
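To make the idea concrete, here is a minimal base R sketch of dropout applied to a vector of activations. It uses the common "inverted dropout" convention, where the surviving activations are scaled up by 1 / (1 - rate) so that their expected sum is unchanged; the function name dropout and the input values are made up for illustration.

```r
set.seed(42)  # for reproducibility of the random mask

# Randomly zero out a fraction `rate` of the activations (on average)
# and rescale the survivors so the expected total stays the same.
dropout <- function(activations, rate) {
  keep <- rbinom(length(activations), size = 1, prob = 1 - rate)
  activations * keep / (1 - rate)
}

activations <- rep(1, 10)  # hypothetical layer output
dropped <- dropout(activations, rate = 0.5)

print(dropped)  # each entry is either 0 (dropped) or 2 (kept and rescaled)
```

A fresh mask is drawn on every training step, and dropout is switched off when the model is used for prediction.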
It’s easy to implement both of these in R keras:
model <- keras_model_sequential() %>%
  layer_dense(units = 64,
              activation = "relu",
              input_shape = dim(train_x)[2]) %>%
  layer_dense(units = 32,
              kernel_regularizer = regularizer_l2(0.001),
              activation = "relu") %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 1)
As you train your model, it will likely begin to overfit eventually, even with these measures in place. Typically, one wants to keep the model as it was just before it started overfitting. One way to do this is to save the model periodically as you train, and then grab the version saved before the overfitting happened. You can implement this automatically with callback_model_checkpoint().
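The selection logic behind "grab the version saved before the overfitting happened" can be sketched in base R without any model at all. The validation losses below are invented; in practice they would come from the training history, and callback_model_checkpoint() can do the equivalent bookkeeping for you through its monitor and save_best_only arguments.

```r
# Hypothetical validation loss after each epoch: it improves at first,
# then rises again as the model starts to overfit.
val_loss <- c(0.90, 0.62, 0.48, 0.41, 0.39, 0.42, 0.47, 0.55)

best_loss <- Inf
best_epoch <- NA

for (epoch in seq_along(val_loss)) {
  if (val_loss[epoch] < best_loss) {
    # This is the point at which a checkpoint would be written to disk.
    best_loss <- val_loss[epoch]
    best_epoch <- epoch
  }
}

print(best_epoch)  # 5
print(best_loss)   # 0.39
```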
Now you’ve trained up your model, how do you store it? You can easily save your model (weights and architecture) with
save_model_hdf5(model, "model.h5")
model <- load_model_hdf5("model.h5")
Often however, you may just want to store the weights as an .rda file.
weights <- get_weights(model)
save(weights, file = "weights.rda")
Then, when you want to restore your model, you can rebuild the same architecture, load your weights, and use
load("weights.rda")
set_weights(model, weights)
This is also useful for testing, since it lets you set the weights of your model to exact values. Note that the weights come back as a list of matrices and vectors, so if you mainly work with data frames, you may have to brush up on subsetting lists and matrices.
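As a refresher on that kind of subsetting, here is a base R sketch that edits a stand-in for the list returned by get_weights(). The list below is built by hand for a single dense layer with three inputs and two units; it is an assumption about the layout (kernel matrix first, bias second), not output from a real model.

```r
# Stand-in for get_weights(model): a kernel matrix and a bias vector.
weights <- list(
  matrix(0, nrow = 3, ncol = 2),  # kernel: rows are inputs, columns are units
  c(0, 0)                         # bias: one value per unit
)

# [[ ]] picks an element of the list; [ , ] then indexes into the matrix.
weights[[1]][2, 1] <- 0.5  # weight from input 2 to unit 1
weights[[1]][, 2] <- 1     # every weight feeding unit 2
weights[[2]][1] <- -1      # bias of unit 1

print(weights[[1]])
print(weights[[2]])
```

The edited list could then be passed back with set_weights(model, weights), provided the shapes match those of the model.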
Under some circumstances you may also wish to initialize the weights differently. This is likewise easy to do:
model <- keras_model_sequential() %>%
  layer_dense(units = 1,
              activation = "relu",
              input_shape = dim(train_x)[2],
              kernel_initializer = "orthogonal",
              bias_initializer = initializer_constant(2.0))
This uses an orthogonal kernel initializer (starting with a random orthogonal matrix), and sets all bias terms to a constant (here, 2).
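For a feel of what a "random orthogonal matrix" is, the base R sketch below builds one from the QR decomposition of a Gaussian random matrix and checks that its columns are orthonormal. This is one common construction, offered as an illustration; the text does not say how Keras generates its matrix internally.

```r
set.seed(1)  # for reproducibility

n <- 4
random_matrix <- matrix(rnorm(n * n), nrow = n)

# The Q factor of a QR decomposition has orthonormal columns.
q <- qr.Q(qr(random_matrix))

# For an orthogonal matrix, t(q) %*% q is the identity matrix.
identity_check <- crossprod(q)
max_deviation <- max(abs(identity_check - diag(n)))

print(round(identity_check, 10))
print(max_deviation < 1e-10)  # TRUE
```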
This is obviously not an extensive overview of Keras, just a small sample of some of the things I’ve found useful. Give it a go for yourself!
There are many more places to read in detail about all sorts of networks:
The tutorials at https://keras.rstudio.com/
The R Keras “bible”: “Deep Learning with R” by François Chollet (Keras author) and J. J. Allaire (R interface writer),
and the accompanying code: https://github.com/jjallaire/deep-learning-with-r-notebooks
Subsetting, closures and super assignment operators: “Advanced R” by Hadley Wickham, https://adv-r.hadley.nz/