How to Create a Custom Deep Learning Model Using TensorFlow 2.x

In this article you’ll learn:

  1. How to write a custom deep learning model in TensorFlow 2.x.
  2. How to use the TensorFlow functional API.
  3. How to create a custom loss function.
  4. How to run custom training using GradientTape with a callback.

Before starting this article, you should know Python and TensorFlow 2.x and have some intuition for artificial neural networks.

For a long time I had been planning to write an article about how to create a custom deep learning model (DNN), or artificial neural network, using TensorFlow 2.x. Sometimes sequential models are not enough. We may find that a more accurate model for our goal was published recently in a research paper, and we can’t code it with the TensorFlow Sequential API. This is when we need to create a custom model using the TensorFlow functional API together with its automatic differentiation tool, GradientTape.

In this article we will create a Joint Neural Collaborative Filtering (J-NCF) model, which you can find in the research paper Joint Neural Collaborative Filtering for Recommender Systems. J-NCF is a collaborative filtering model for recommendation systems.
Let’s start.

Libraries

For this model, we are going to use the following Python libraries.

  1. NumPy (V 1.19.5) for mathematical operations.
  2. Pandas (V 1.1.5) for dataset handling.
  3. Matplotlib (V 3.2.2) for visualizations.
  4. TensorFlow (V 2.4.1) for deep learning.
  5. Scikit-Learn (V 0.22.2) for the train-test splitter and the mean absolute error metric.

Let’s first import all these libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
import warnings # To handle warnings

Note: A GPU is recommended for training this model, but for learning purposes you can train it on a CPU as well.

Dataset

For J-NCF training, we are going to use the MovieLens ml-latest-small dataset, which you can find here or download directly as ml-latest-small.zip. This dataset will be enough for this article, and its description is as follows:

Small: 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users. Last updated 9/2018.

For model training, ratings.csv from the above dataset will be enough, so we are going to use only that file. Let’s first import it using Pandas.

dataset = pd.read_csv('path_to_dataset_directory/ratings.csv')
dataset

This file contains ratings of 9,000+ movies by 610 users, with rating values between 0.5 and 5 in steps of 0.5. Let’s look at the distribution of the ratings given by users.

dataset['rating'].plot.hist()
Ratings distribution in dataset given by users

As we can see here, most movies are rated 3, 4, 3.5 or 5. Next we’ll drop the timestamp column, as we don’t need it, and round the ratings to whole numbers.

dataset = dataset.drop('timestamp', axis=1).round().astype(int)
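
One detail worth checking at this step: pandas (through NumPy) rounds halves to the nearest even integer, so the result is not simply the scale 1 to 5. The small sketch below uses made-up values to show this, together with a half-up variant. The half-up variant is only one possible alternative and is not part of the original pipeline.

```python
import numpy as np
import pandas as pd

# Toy half-star ratings (illustrative values, not taken from the dataset)
ratings = pd.Series([0.5, 1.5, 2.5, 3.5, 4.5])

# Same call chain as above: halves go to the nearest even integer
rounded = ratings.round().astype(int)
print(rounded.tolist())

# A possible alternative: always round halves up
half_up = np.floor(ratings + 0.5).astype(int)
print(half_up.tolist())
```

With the first variant a rating of 0.5 becomes 0, which is the same value the matrices later use for "no rating", so it may be worth deciding which behavior you want.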

Let’s say U is the number of users and I is the number of items. Next, we are going to split the dataset into training and test sets, with the test set being 20% of the whole data.

train_set, test_set = train_test_split(dataset, test_size=0.2, random_state=None)

Next, we need to do some feature engineering on the data. J-NCF consists of two neural networks: the Deep Feature (DF) model and the Deep Interaction (DI) model. In DF, we’ll run two networks in parallel: NetUser, which takes the ratings a user has given to items, and NetItems, which takes the ratings an item has received from users. At the end of both networks we’ll concatenate their outputs, and this concatenated layer will serve as the input of the DI network. So we need two input datasets. You don’t need to understand this part in depth, because the main goal is to learn how to code a custom neural network. For the two inputs, we’ll transform our dataset into two matrices:

  1. Users in rows and items in columns.
  2. Items in rows and users in columns, i.e. the transpose of the first.

First, define a function to transform our data into a matrix:

def to_matrix_df(data):
    dataset_array = data.pivot_table(index='userId', columns='movieId', values='rating')
    users = dataset_array.index.to_numpy()
    items = dataset_array.columns.to_numpy()
    dataset_array = dataset_array.to_numpy()
    dataset_array[np.isnan(dataset_array)] = 0
    return pd.DataFrame(dataset_array.astype(np.int8), index=users, columns=items)

Now create the matrices for both inputs:

# Matrices for input
R_ui = to_matrix_df(train_set)
R_iu = R_ui.T
# Sizes of Users and Items
U, I = R_ui.shape

Let’s look at R_ui

R_ui with 610 users in rows and 8931 items in columns

A 0 means the user has not rated the item. For example, R_ui.loc[1,3] = 4 means that the user with id 1 gave a rating of 4 to the movie with id 3.
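
To make the two matrices concrete, here is a tiny sketch on a made-up ratings table. The frame `toy` and the names `toy_R_ui` and `toy_R_iu` are hypothetical; only the column names follow ratings.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical mini ratings table with the same column names as ratings.csv
toy = pd.DataFrame({
    'userId':  [1, 1, 2, 3],
    'movieId': [10, 20, 10, 30],
    'rating':  [4, 5, 3, 2],
})

# Users in rows, items in columns; missing ratings become 0
toy_R_ui = (toy.pivot_table(index='userId', columns='movieId', values='rating')
               .fillna(0)
               .astype(np.int8))
toy_R_iu = toy_R_ui.T  # items in rows, users in columns

print(toy_R_ui)
print(toy_R_ui.shape, toy_R_iu.shape)
```

A row of `toy_R_ui` is what NetUser would receive for one user, and a row of `toy_R_iu` is what NetItems would receive for one item.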

J-NCF Model Architecture

Now, we are going to build our J-NCF model. Let’s first look at the model architecture.

The image is cropped from the original J-NCF research paper mentioned above. Black arrows indicate forward propagation and red arrows indicate backpropagation.

Let me introduce it briefly. I won’t describe the model in depth; if you’re interested, you can read the full research paper mentioned above. J-NCF is a collaborative filtering model that learns the interactions between users and items and then recommends items that users have not yet seen, viewed or rated, or whatever our criterion is. In J-NCF, two models are connected:

  1. Deep Features (DF) model, where we send the input data.
  2. Deep Interactions (DI) model, where we get the output of J-NCF.

The two networks are connected: the output of the DF model serves as the input of the DI model. The DF model is based on two parallel neural networks (NN):

  1. NetUser, whose input size is the number of items, i.e. I. To this network we send the array of a user’s movie ratings, where an unrated movie has the value 0.
  2. NetItems, whose input size is the number of users, i.e. U. This network gets the array of ratings an item received from each user, where the value is 0 if a user has not rated the movie.

If you’re familiar with the TensorFlow Sequential API, you will notice that we can’t create the above model with it, and this is where the functional API comes in. The TensorFlow functional API is more flexible: it gives more control over layers, lets us concatenate them at any position instead of only in sequence, connect layers anywhere, and so on. If you want to create your own kind of layers, have a look at the documentation on implementing custom layers. Now let’s code this model.
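
Before the real model, a minimal sketch of the idea may help: two inputs, one Dense branch each, merged by concatenation. All sizes and names here (`in_a`, `in_b`, `toy_model`) are made up for illustration and are much smaller than in J-NCF.

```python
import numpy as np
import tensorflow as tf

# Two separate inputs with hypothetical sizes (5 and 3 features)
in_a = tf.keras.Input(shape=(5,), name='branch_a_input')
in_b = tf.keras.Input(shape=(3,), name='branch_b_input')

# Each input gets its own Dense branch
branch_a = tf.keras.layers.Dense(4, activation='relu')(in_a)
branch_b = tf.keras.layers.Dense(4, activation='relu')(in_b)

# The branches are merged, which a plain Sequential stack cannot express
merged = tf.keras.layers.Concatenate()([branch_a, branch_b])
out = tf.keras.layers.Dense(1, activation='sigmoid')(merged)

toy_model = tf.keras.Model(inputs=[in_a, in_b], outputs=out)

# Forward pass on a batch of two random samples
preds = toy_model([np.random.rand(2, 5).astype('float32'),
                   np.random.rand(2, 3).astype('float32')])
print(preds.shape)
```

Each layer is called on the tensor it should consume, so the wiring of the network is simply the way these calls are chained.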

DF Model

First, we will create the DF model. Here, we create two networks in parallel, NetUser and NetItems, with the same depth and width. We’ll use L2 regularization without any dropout. The activation function used throughout the whole network is relu, except for the output layer. In this article, we’ll add two hidden layers of sizes 256 and 128 to the DF model.

regularization = 'l2'
layers = [256, 128]
layer_name = 'NetUser'
nu_input = tf.keras.Input(shape=(I,), name=f'{layer_name}_Input_{I}')
NetUser = tf.keras.layers.Dense(units=layers[0], activation='relu', kernel_regularizer=regularization, name=f'{layer_name}_layer_1_{layers[0]}')(nu_input)
NetUser = tf.keras.layers.Dense(units=layers[1], activation='relu', kernel_regularizer=regularization, name=f'{layer_name}_layer_2_{layers[1]}')(NetUser)
layer_name = 'NetItems'
ni_input = tf.keras.Input(shape=(U,), name=f'{layer_name}_Input_{U}')
NetItems = tf.keras.layers.Dense(units=layers[0], activation='relu', kernel_regularizer=regularization, name=f'{layer_name}_layer_1_{layers[0]}')(ni_input)
NetItems = tf.keras.layers.Dense(units=layers[1], activation='relu', kernel_regularizer=regularization, name=f'{layer_name}_layer_2_{layers[1]}')(NetItems)

DI Model

In this model, we first concatenate the outputs of NetUser and NetItems, which then serve as the input of the DI model. There are other ways to combine two networks, such as multiplication, but we’ll concatenate them.

DI = tf.keras.layers.Concatenate(name='DI_Network_Input')([NetUser, NetItems])

Since the output of each network above is of size 128, the concatenated layer will have size (neurons) 256.

Next, we’ll create the DI model’s hidden layer and output layer. There will be only one DI hidden layer, of size 64, configured like the DF model layers. The DI output layer uses a sigmoid activation, and we’ll not use a bias for this layer.

layer_name = 'DI_Net'
DI = tf.keras.layers.Dense(units=64, activation='relu', kernel_regularizer=regularization, name=f'{layer_name}_layer_1_64')(DI)
DI = tf.keras.layers.Dense(units=1, activation='sigmoid', kernel_regularizer=regularization, use_bias=False, name='Sigmoid_Output')(DI)

Are you wondering why sigmoid, when ratings are integer values and the sigmoid output lies between 0 and 1?
Ans: We’ll normalize the ratings with the mapping f(x)=x/5, which transforms rating values from the range 0–5 into 0–1. This way we can use sigmoid as the output layer, where 1 means a rating of 5.
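
As a quick illustration of this mapping and its inverse, here is a small sketch with made-up ratings; the variable names are hypothetical.

```python
import numpy as np

ratings = np.array([1, 3, 5])                  # illustrative integer ratings
targets = ratings / 5                          # f(x) = x / 5, targets in [0, 1]
restored = np.round(targets * 5).astype(int)   # back to the rating scale
print(targets, restored)
```

The forward mapping is what the training targets go through, and the inverse is what a prediction function would apply to the sigmoid output.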

Finally, define model instance

model = tf.keras.Model(inputs=[nu_input, ni_input], outputs=DI)

Let’s check the summary of the model

model.summary()
J-NCF model summary

Let’s visualize it to check that the model is constructed as we intended

tf.keras.utils.plot_model(model)
J-NCF model plot

Great, we have created the model as we wanted. The forward propagation part is now complete. Next, we have to work on the loss function and then on backpropagation to update the weights.

Loss Function

In this NN, we’ll use a mixed Top1 and BCE loss function, which is not predefined in TensorFlow, so we’ll create our own. The loss function here is defined as
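
The formula image is not reproduced in this text. Reading it off the loss_top1 code later in this section, the combined loss appears to take the following form, where alpha weights the pairwise term and lambda scales the regularization. The symbol names are assumptions chosen to match the code arguments alpha and lamda.

```latex
\mathcal{L} = \alpha \, \mathcal{L}_{\text{pair}} + (1 - \alpha) \, \mathcal{L}_{\text{point}} + \lambda \, R(\Omega)
```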

Here we combine two different kinds of loss functions: a pairwise loss and a pointwise loss. The third term in the above loss function is the L2 regularization that we used in the layers.

In this loss function, the pointwise loss is a binary cross-entropy loss, while the pairwise loss is a top1 loss, which is defined as follows
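
Again the formula image is missing here. Based on how the pairwise term is computed in the loss_top1 code, it seems to be the following, averaged over the N (user, positive item, negative item) triples of a batch, with sigma the sigmoid function. Treat this as a reconstruction from the code rather than a quote from the paper.

```latex
\mathcal{L}_{\text{top1}} = \frac{1}{N} \sum_{(u,i,j)} \Big[ \sigma\big(y_{uj} - y_{ui}\big) + \sigma\big(y_{uj}^{2}\big) \Big]
```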

Here, y_uj are the predicted values for negative items (items that have not been rated by the user), while y_ui are the predicted values for positive items (items that have been rated by the user). You don’t need to go deep into positive and negative items; just look at how I code this loss function and then use it in custom training to update the weights.

Let’s code it.

def loss_top1(model, inputs_pos, inputs_neg, outputs_pos, alpha, lamda=0.1, sample_weight=[1, 1]):
    preds_pos = model(inputs_pos)
    preds_neg = model(inputs_neg)
    point_loss = tf.keras.losses.BinaryCrossentropy()
    # Pairwise top1 term
    pair_term = tf.math.reduce_mean(tf.keras.activations.sigmoid(preds_neg - preds_pos) + tf.keras.activations.sigmoid(tf.square(preds_neg)))
    # Pointwise binary cross-entropy term
    point_term = point_loss(tf.reshape(outputs_pos, shape=(1, -1)), tf.reshape(preds_pos, shape=(1, -1)), sample_weight=sample_weight)
    # sum(model.losses) is the L2 penalty collected from the regularized layers
    losses = alpha * pair_term + (1 - alpha) * point_term + lamda * sum(model.losses)
    return losses

We have now defined our loss function and named it loss_top1 because we use top1 as the pairwise loss. The research paper discusses other pairwise losses as well, but in this article we’ll use top1.
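
To get a feeling for how the two terms behave, here is a NumPy-only sketch of the same combination on made-up numbers. The function `mixed_loss` is a hypothetical stand-in: it takes predictions directly instead of a model, and the regularization is passed as a plain number `reg`.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mixed_loss(preds_pos, preds_neg, targets_pos, alpha, reg=0.0, lamda=0.1):
    """NumPy sketch: alpha * top1 + (1 - alpha) * BCE + lamda * reg."""
    eps = 1e-7  # keeps log() away from 0
    p = np.clip(preds_pos, eps, 1 - eps)
    top1 = np.mean(sigmoid(preds_neg - preds_pos) + sigmoid(preds_neg ** 2))
    bce = -np.mean(targets_pos * np.log(p) + (1 - targets_pos) * np.log(1 - p))
    return alpha * top1 + (1 - alpha) * bce + lamda * reg

# Made-up predictions for three (user, positive item, negative item) triples
preds_pos = np.array([0.9, 0.8, 0.7])
preds_neg = np.array([0.2, 0.1, 0.3])
targets = np.array([1.0, 0.8, 0.6])

good = mixed_loss(preds_pos, preds_neg, targets, alpha=0.1)
bad = mixed_loss(preds_neg, preds_pos, targets, alpha=0.1)  # scores swapped
print(good, bad)
```

When the rated items score higher than the sampled unrated ones and the scores are close to the normalized ratings, the loss is smaller than with the scores swapped.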

Functions for Negative Items Sampling

Here I’m going to write some functions for negative sampling. You don’t need to understand them in detail, because our main goal is to learn how custom training works.

def get_users_neg_items(data):
    U = data['userId'].unique()
    I = data['movieId'].unique()
    df = pd.DataFrame(index=U, columns=['pos_counts', 'neg_items'])
    for user_i in U:
        user_pos_items = data[data['userId'] == user_i]['movieId'].values
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')
            df.loc[user_i, :] = [user_pos_items.shape[0], np.setdiff1d(I, user_pos_items)]
        del user_pos_items
    return df

users_neg_items = get_users_neg_items(train_set)

def negative_sampling(users_neg_items):
    neg_items_sample = list()
    for _user in users_neg_items.values:
        neg_items_sample.append(np.random.choice(_user[1], _user[0]))
    return np.concatenate(neg_items_sample)
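
Note that negative_sampling returns its samples grouped user by user, in the order of users_neg_items, whereas the rows of train_set come out of train_test_split shuffled. It may be worth checking that each sampled negative lines up with the user in the same row. One possible row-aligned variant is sketched below; `sample_negatives_aligned` and the toy frame are hypothetical and not part of the original code.

```python
import numpy as np
import pandas as pd

def sample_negatives_aligned(data, rng):
    """Return one unrated movieId per row, in the same order as the rows.

    Assumes every user has at least one unrated item.
    """
    all_items = data['movieId'].unique()
    movie_ids = data['movieId'].values
    neg = np.empty(len(data), dtype=all_items.dtype)
    # .indices maps each user to the positions of that user's rows
    for user, pos in data.groupby('userId').indices.items():
        candidates = np.setdiff1d(all_items, movie_ids[pos])
        neg[pos] = rng.choice(candidates, size=len(pos))
    return neg

toy = pd.DataFrame({'userId': [2, 1, 2, 1, 3],
                    'movieId': [10, 20, 30, 40, 10]})
rng = np.random.default_rng(0)
toy['neg0'] = sample_negatives_aligned(toy, rng)
print(toy)
```

Because the samples are written back by row position, the result can be assigned as a column directly, whatever the row order is.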

Test Set

During the train-test split, some movies or users may end up only in the test set and be missing from the train set. If that happens, the test set contains items that were never part of the NetUser input, so the NetUser input size does not match the number of items. So we have to drop those items.

Are you wondering how we’ll find the rating for an item in production if we drop it during training?
Ans: In production, we’ll use the whole dataset as the train set, so we won’t drop any item, and NetUser will have as many input neurons as there are items. For model testing, however, we may need to drop some items/users that have very few ratings and whose ratings are not available in the train set.

Let’s drop such items and users

true_test_set = test_set[test_set['userId'].isin(R_ui.index) & test_set['movieId'].isin(R_iu.index)]

Prediction Function

Let’s code a function to make predictions. Our model predicts values between 0 and 1, and we need to convert them back to the 0–5 rating scale.

def predict(model, data, users_interactions=R_ui, items_interactions=R_iu, batch_size=20000, rating_format=True):
    preds = list()
    for start, end in zip(np.arange(0, data.shape[0], batch_size), np.arange(batch_size, data.shape[0]+batch_size, batch_size)):
        batch = data.iloc[start:end]
        _users = users_interactions.loc[batch['userId'].values].to_numpy()
        _items = items_interactions.loc[batch['movieId'].values].to_numpy()
        preds.append(model([_users, _items]))
        del _users, _items
    preds = np.concatenate(preds)
    if rating_format:
        preds = np.round(preds*5).astype(int)
    else:
        preds = np.round(preds, 2).astype(float)
    return preds.reshape(-1,)

Finally, we are going to train our model.

Custom Training

I hope you are familiar with the concept of backpropagation and have some intuition about GradientTape. GradientTape is the automatic differentiation tool provided by TensorFlow to calculate gradients/derivatives without much effort. In backpropagation, we go backward from the loss function, calculating the derivatives with respect to each neuron’s weights and biases. In TensorFlow we don’t need to do all this work ourselves; GradientTape does the whole job for us.
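
If GradientTape is new to you, this minimal sketch shows it on a one-variable function instead of a network. The function y = x² + 2x is just an illustrative choice.

```python
import tensorflow as tf

x = tf.Variable(3.0)

# Operations executed inside the context are recorded on the tape
with tf.GradientTape() as tape:
    y = x ** 2 + 2.0 * x

# dy/dx = 2x + 2, evaluated at x = 3
dy_dx = tape.gradient(y, x)
print(dy_dx.numpy())
```

For a model, the same call is made with the loss in place of y and the list of trainable variables in place of x.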

In training, we’ll also add a callback that terminates training if the loss stops improving. After termination, it sets the best weights found so far on J-NCF.
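
The logic of such a callback can be sketched in plain Python, independent of TensorFlow. The class `EarlyStopper` and the loss values are hypothetical; in real training the stored state would be something like model.get_weights().

```python
class EarlyStopper:
    """Tracks the best loss and signals when to stop (hypothetical helper)."""

    def __init__(self, patience):
        self.patience = patience
        self.best_loss = float('inf')
        self.best_state = None
        self.bad_epochs = 0

    def update(self, loss, state):
        """Record one epoch; return True when training should stop."""
        if loss < self.best_loss:
            self.best_loss = loss
            self.best_state = state   # e.g. model.get_weights()
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopper(patience=2)
stopped_at = None
# Made-up per-epoch losses
for epoch, loss in enumerate([0.9, 0.7, 0.8, 0.75, 0.6]):
    if stopper.update(loss, state=f'weights_after_epoch_{epoch}'):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_loss, stopper.best_state)
```

Here training stops after two epochs without improvement, and the state saved at the best epoch is the one to restore.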

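Let’s first evaluate our model before training and see what the mean absolute errors are. Note that the lists named train_accuracy and test_accuracy store MAE values, so lower is better.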

loss_at_epoch = list()
train_accuracy = list()
test_accuracy = list()
train_accuracy.append(mean_absolute_error(train_set['rating'], predict(model, train_set)))
test_accuracy.append(mean_absolute_error(true_test_set['rating'], predict(model, true_test_set)))
best_train_accuracy = train_accuracy[-1]
best_test_accuracy = test_accuracy[-1]
print(f'''
======================= Initially =================================
Evaluation on
Train: {train_accuracy[-1]}
Test: {test_accuracy[-1]}
''')

Out:

Model evaluation before training

Let’s add some settings for training the model

epochs = 3
ng_samples_ns = 1
batch_size = 1000
termination_epochs = 2 ## To terminate training while accuracy/loss is not improving over train and test sets.

As you can see, I’ve defined a variable batch_size, because I’m going to train in batches.

For optimization, we’ll use the Adam algorithm

optimizer = tf.keras.optimizers.Adam(learning_rate=0.00005, beta_1=0.99)

Let’s define some variables that are needed and will be updated during training

batches_indices = np.arange(0,train_set.shape[0]+batch_size, batch_size)
indices = list(train_set.index)
not_improving_epochs = 0
first_time = True
best_weights = None
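
To see how batches_indices is meant to be used, here is a small sketch with toy sizes. Consecutive entries act as start and end positions of each batch; the numbers are made up.

```python
import numpy as np

n_rows, batch_size = 10, 4  # toy sizes for illustration
batches_indices = np.arange(0, n_rows + batch_size, batch_size)

# Consecutive pairs act as (start, end) slice bounds
bounds = list(zip(batches_indices[:-1], batches_indices[1:]))
indices = list(range(n_rows))
batches = [indices[s:e] for s, e in bounds]
print(batches)
```

The last end position may lie beyond the data, which is fine because slicing simply stops at the final row, so the last batch is just smaller.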

Finally, let’s create the training loop

for epoch in range(epochs):
    print(f'****** Epoch: {epoch+1} ******')
    print('\n')
    np.random.shuffle(indices)
    for ng_n in range(ng_samples_ns):
        train_set[f'neg{ng_n}'] = negative_sampling(users_neg_items)
    start_point = 0
    for j, end_point in enumerate(batches_indices[1:]):
        batch = train_set.loc[indices[start_point:end_point]]
        user_i = R_ui.loc[batch['userId'].values, :].values
        items_pos = R_iu.loc[batch['movieId'].values, :].values
        outputs = (batch['rating'] / 5).values
        if j%np.ceil(batches_indices.shape[0]/50) == 0:
            print('\b'*len(str(j-1)) + f'={j}', end='')
        for ng_n in range(ng_samples_ns):
            items_neg = R_iu.loc[batch[f'neg{ng_n}'].values, :].values
            with tf.GradientTape() as tape:
                losses = loss_top1(model=model, inputs_pos=[user_i, items_pos], outputs_pos=outputs, inputs_neg=[user_i, items_neg], alpha=0.1, lamda=0.05)
            if first_time:
                loss_at_epoch.append(losses.numpy())
                best_loss = loss_at_epoch[-1]
                first_time = False
            grads = tape.gradient(losses, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
        del batch, user_i, items_pos, items_neg
        start_point = end_point
    loss_at_epoch.append(losses.numpy())
    train_accuracy.append(mean_absolute_error(train_set['rating'], predict(model, train_set)))
    test_accuracy.append(mean_absolute_error(true_test_set['rating'], predict(model, true_test_set)))
    best = False
    if (loss_at_epoch[-1] < best_loss):
        best_weights = model.get_weights()
        best_train_accuracy = train_accuracy[-1]
        best_test_accuracy = test_accuracy[-1]
        best_loss = loss_at_epoch[-1]
        not_improving_epochs = 0
        best = True
    else:
        not_improving_epochs += 1
    if not_improving_epochs == termination_epochs:
        if best_weights is not None:
            model.set_weights(best_weights)
        break
    print(f'''
Loss: {loss_at_epoch[-1]}
Evaluation on
Train: {train_accuracy[-1]}
Test: {test_accuracy[-1]}
Best: {'Yes' if best else 'No'}
''')

Let me explain the loop step by step.

for epoch in range(epochs):
    print(f'****** Epoch: {epoch+1} ******')
    print('\n')

First of all, we start a loop over the epochs and simply print the epoch number.

    np.random.shuffle(indices)
    for ng_n in range(ng_samples_ns):
        train_set[f'neg{ng_n}'] = negative_sampling(users_neg_items)

Next, we shuffle the train-set indices so that in each epoch the training data is drawn in random order instead of sequentially.
Then we do the negative sampling: whenever we send a user’s rating history together with a rated movie’s history to the DF model, we also randomly select a movie the user has not rated, for the loss function. This means that in the training data each user has as many non-rated items as rated items.

    start_point = 0
    for j, end_point in enumerate(batches_indices[1:]):
        batch = train_set.loc[indices[start_point:end_point]]
        user_i = R_ui.loc[batch['userId'].values, :].values
        items_pos = R_iu.loc[batch['movieId'].values, :].values
        outputs = (batch['rating'] / 5).values

Since we are training our model in batches, the batch loop starts here. We select user_i for the NetUser NN and items_pos for the NetItems NN, while outputs are the ratings given by user_i to items_pos. Note that the ratings are normalized here.

In J-NCF, we can select multiple negative samples, i.e. for each positive item it is possible to send several different negative items for a user. But in this article we select only one sample, i.e. for each positive (rated) item of a user we select only one negative (non-rated) item.

        for ng_n in range(ng_samples_ns):
            items_neg = R_iu.loc[batch[f'neg{ng_n}'].values, :].values

So this loop will only run once.

            with tf.GradientTape() as tape:
                losses = loss_top1(model=model, inputs_pos=[user_i, items_pos], outputs_pos=outputs, inputs_neg=[user_i, items_neg], alpha=0.1, lamda=0.05)

Next, we open a GradientTape. The forward pass and the loss calculation must both happen inside the GradientTape context, because the tape only records the operations that are executed while it is active. These recorded operations are what TensorFlow walks through in reverse to calculate the gradients. If you run the model outside the tape and only bring the resulting numerical values inside, the tape has no record of how those values depend on the weights, so it cannot calculate the backward gradients (derivatives). So remember: whenever you create your custom model, don’t run the forward pass ahead of time. Define the architecture and the loss (i.e. everything inside def), and then call them inside GradientTape.
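
The effect of running the forward pass outside the tape can be seen on a toy example. The single weight `w` and input `x` below stand in for a whole network and are made up for illustration.

```python
import tensorflow as tf

w = tf.Variable(2.0)
x = tf.constant(3.0)

# Forward pass run before the tape is opened: nothing is recorded for w
pred_outside = w * x
with tf.GradientTape() as tape:
    loss_outside = pred_outside ** 2
g_outside = tape.gradient(loss_outside, w)
print(g_outside)  # no path from the loss back to w was recorded

# Forward pass run inside the tape: the gradient is available
with tf.GradientTape() as tape:
    pred_inside = w * x
    loss_inside = pred_inside ** 2
g_inside = tape.gradient(loss_inside, w)
print(g_inside.numpy())  # d/dw (w*x)^2 = 2*w*x^2
```

In the first case the tape never saw how the prediction depends on w, so there is nothing to differentiate through.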
Note that here we have chosen alpha = 0.1 and lambda = 0.05.
Now that we have run our NN and calculated the loss inside GradientTape, we can calculate the gradients of the loss w.r.t. each neuron’s weights and biases.

            grads = tape.gradient(losses, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))

We calculate the gradients using the GradientTape defined above and then apply them using the Adam optimizer. After this step, the weights of our whole NN have been updated. Once the model has gone through all batches, we add a callback that terminates NN training if the loss has not improved at the end of an epoch.
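
The same two calls can be tried on a toy problem with a single weight. This sketch minimizes (w - 1)², a made-up loss, with a learning rate chosen only so that the effect is visible in a few steps.

```python
import tensorflow as tf

w = tf.Variable(5.0)
optimizer = tf.keras.optimizers.Adam(learning_rate=0.1)

def loss_fn():
    return (w - 1.0) ** 2  # minimum at w = 1

before = loss_fn().numpy()
for _ in range(50):
    with tf.GradientTape() as tape:
        loss = loss_fn()
    grads = tape.gradient(loss, [w])
    # Pair each gradient with its variable, as in the training loop
    optimizer.apply_gradients(zip(grads, [w]))
after = loss_fn().numpy()
print(before, after)
```

Each pass records the loss, computes the gradients and lets the optimizer move the weight, which is what happens per batch in the training loop.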

    loss_at_epoch.append(losses.numpy())
    train_accuracy.append(mean_absolute_error(train_set['rating'], predict(model, train_set)))
    test_accuracy.append(mean_absolute_error(true_test_set['rating'], predict(model, true_test_set)))

We calculate the mean absolute errors (MAE) on the train and test sets after predicting the ratings with our trained model.

    best = False
    if (loss_at_epoch[-1] < best_loss):
        best_weights = model.get_weights()
        best_train_accuracy = train_accuracy[-1]
        best_test_accuracy = test_accuracy[-1]
        best_loss = loss_at_epoch[-1]
        not_improving_epochs = 0
        best = True
    else:
        not_improving_epochs += 1
    if not_improving_epochs == termination_epochs:
        if best_weights is not None:
            model.set_weights(best_weights)
        break

At the end, we define the callback that terminates the training. Note that we test for improvement in the loss value, i.e. if the loss does not improve for 2 consecutive epochs, the whole training terminates and the best weights are set on the model.

Let’s start the training and look at the output

Training J-NCF model in 3 epochs

If you scroll up, you’ll see that before training the train and test MAE were 1.94 and 1.96 respectively. During training, the loss and the MAE improved in each epoch. At the end of training, the train and test MAE improved to 0.67, which is great.

Our model has learned well and reduced the MAE from 1.96 to 0.67. We’ve also learned how to create a custom NN model, whatever the NN architecture is.

I hope you enjoyed this article and learned from it. If anything is unclear, just leave a comment and I’ll explain. Don’t forget to clap. Thanks!

Syed Muhammad Asad

MS (Computational Mathematics), Data Scientist and Machine Learning Engineer, Mathematician, Programmer, Research Scientist, Writer.