# How to Create a Custom Deep Learning Model Using TensorFlow 2.x

In this article you’ll learn:

- How to write a custom deep learning model in TensorFlow 2.x.
- How to use the TensorFlow functional API.
- How to create a custom loss function.
- How to run custom training using `GradientTape` with a callback.

Before starting this article, you should have working knowledge of Python and TensorFlow 2.x, and some intuition for artificial neural networks.

For a long time, I have wanted to write an article in my leisure time about how to create a custom deep learning model (DNN), or artificial neural network, using TensorFlow 2.x. Sometimes sequential models are not enough. Or we find that a more accurate model for our goal was recently published in a research paper, and we can't code it with the TensorFlow sequential API. That is when we need to create a custom model using the TensorFlow functional API together with its automatic gradient tape.

In this article we will create a Joint Neural Collaborative Filtering (J-NCF) model, which you can find in the research paper Joint Neural Collaborative Filtering for Recommender Systems. J-NCF is a collaborative filtering model for recommendation systems.

Let’s get started.

# Libraries

For this model, we are going to use the following Python libraries:

- NumPy (v1.19.5) for mathematical operations.
- Pandas (v1.1.5) for dataset handling.
- Matplotlib (v3.2.2) for visualizations.
- TensorFlow (v2.4.1) for deep learning.
- Scikit-Learn (v0.22.2) for the train-test splitter and the mean absolute error metric.

Let's first import all these libraries:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
import warnings  # To handle warnings
```

*Note: A GPU is recommended for training this model, but for learning purposes you can train it on a CPU as well.*

# Dataset

For J-NCF training, we are going to use the MovieLens *ml-latest-small* dataset, which you can find here or download directly as ml-latest-small.zip. This dataset is enough for this article, and its description is as follows:

Small: 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users. Last updated 9/2018.

For model training, the *ratings.csv* file from this dataset is enough, so we are going to use only that file. Let's first load it using Pandas:

```python
dataset = pd.read_csv('path_to_dataset_directory/ratings.csv')
```

This file contains ratings of 9,000+ movies by 610 users, with rating values from 0.5 to 5 in steps of 0.5. Let's look at the distribution of the ratings:

```python
dataset['rating'].plot.hist()
```

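As we can see, most ratings are 3, 3.5, 4 and 5. Next we'll drop the *timestamp* column, since we don't need it, and round the ratings to whole numbers. Note that pandas rounds exact halves to the nearest even integer, so 0.5 becomes 0 and 2.5 becomes 2, while 1.5 becomes 2 and 3.5 becomes 4.

```python
dataset = dataset.drop('timestamp', axis=1).round().astype(int)
```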
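As a quick check of how this rounding behaves, here is a small sketch on a hypothetical array of half-star ratings (the name `half_star_ratings` is illustrative and not part of the article's code). pandas' `round()` follows NumPy's rule of rounding exact halves to the nearest even integer:

```python
import numpy as np

# Hypothetical half-star ratings, chosen to show the tie-breaking rule.
half_star_ratings = np.array([0.5, 1.5, 2.5, 3.5, 4.5, 5.0])

# NumPy rounds exact halves to the nearest even integer.
rounded = np.round(half_star_ratings).astype(int)
print(rounded)  # [0 2 2 4 4 5]
```

If you would rather keep every rating in the range 1 to 5, one option (an assumption, not what this article does) is to use `np.ceil` instead of rounding.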

Let's say *U* is the number of users and *I* is the number of items. Next, we are going to split the *dataset* into training and test sets, with the test set being 20% of the whole data.

```python
train_set, test_set = train_test_split(dataset, test_size=0.2, random_state=None)
```

Next, we need to do some feature engineering on the data. J-NCF consists of two neural networks: a Deep Feature (DF) model and a Deep Interaction (DI) model. The DF model runs two networks in parallel: *NetUser*, which takes the ratings a user has given to all items, and *NetItems*, which takes the ratings an item has received from all users. The outputs of these two networks are concatenated, and the concatenated layer serves as the input of the DI network. So we need two input datasets. You don't need to understand this part in depth, because our main goal is to learn how to code a custom neural network. To build the inputs, we'll transform our *dataset* into two matrices:

- Users in rows and items in columns.
- Items in rows and users in columns, i.e. the transpose of the first.

First, define a function to transform our data into a matrix:

```python
def to_matrix_df(data):
    # Pivot to a (users x items) table of ratings
    dataset_array = data.pivot_table(index='userId', columns='movieId', values='rating')
    users = dataset_array.index.to_numpy()
    items = dataset_array.columns.to_numpy()
    dataset_array = dataset_array.to_numpy()
    # Unrated (user, item) pairs are encoded as 0
    dataset_array[np.isnan(dataset_array)] = 0
    return pd.DataFrame(dataset_array.astype(np.int8), index=users, columns=items)
```

Now create the input matrices:

```python
# Matrices for input
R_ui = to_matrix_df(train_set)
R_iu = R_ui.T

# Sizes of Users and Items
U, I = R_ui.shape
```

Let's look at *R_ui*.

A value of 0 means the user has not rated the item. For example, *R_ui.loc[1, 3] = 4* means that the user with id 1 gave a rating of 4 to the movie with id 3.
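To make the two matrices concrete, here is a minimal sketch on a hypothetical toy ratings table (the values and the names `toy`, `toy_R_ui` and `toy_R_iu` are made up for illustration and are not MovieLens data). It uses `fillna(0)` as a shorthand for the NaN replacement done in `to_matrix_df`:

```python
import pandas as pd

# Hypothetical toy ratings, not taken from MovieLens.
toy = pd.DataFrame({
    'userId':  [1, 1, 2, 3, 4],
    'movieId': [10, 30, 10, 20, 30],
    'rating':  [4, 5, 3, 2, 1],
})

# Users in rows, items in columns; unrated pairs become 0.
toy_R_ui = (
    toy.pivot_table(index='userId', columns='movieId', values='rating')
       .fillna(0)
       .astype('int8')
)

# Items in rows, users in columns: simply the transpose.
toy_R_iu = toy_R_ui.T

print(toy_R_ui)
print(toy_R_iu.shape)
```

Each row of `toy_R_ui` is what NetUser would receive for one user, and each row of `toy_R_iu` is what NetItems would receive for one item.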

# J-NCF Model Architecture

Now we are going to build our J-NCF model. Let's first look at the model architecture.

I'll introduce it only briefly; if you're interested in the details, you can read the full research paper mentioned above. J-NCF is a collaborative filtering model that learns the interactions between users and items and then recommends items that users have not yet seen, viewed or rated, or whatever our criterion is. J-NCF consists of two connected models:

- The Deep Features (DF) model, where we send the input data.
- The Deep Interactions (DI) model, where we get the output of J-NCF.

The two networks are connected: the output of the DF model serves as the input of the DI model. The DF model is based on two parallel neural networks (NN):

- NetUser, whose input size is the number of items, i.e. *I*. This network receives the array of a user's movie ratings, with 0 for every unrated movie.
- NetItems, whose input size is the number of users, i.e. *U*. This network receives the array of ratings an item got from each user, with 0 wherever a user has not rated the movie.

If you're familiar with the TensorFlow sequential API, you'll notice that we can't build the model above with it, and this is where the functional API comes in. The functional API is more flexible: it gives you direct access to layers, lets you concatenate them at any position rather than only in sequence, and lets you connect layers anywhere. If you want to create your own kind of layers, see the TensorFlow documentation Implementing custom layers. Now let's code this model.

**DF Model**

First, we will create the DF model. Here we create two networks in parallel, *NetUser* and *NetItems*, with the same depth and width. We'll use *l2* regularization without any dropout. The activation function throughout the network is *relu*, except for the output layer. In this article, the DF model has two hidden layers of sizes 256 and 128.

```python
regularization = 'l2'
layers = [256, 128]

# NetUser: input is one row of R_ui (a user's ratings over all I items)
layer_name = 'NetUser'
nu_input = tf.keras.Input(shape=(I,), name=f'{layer_name}_Input_{I}')
NetUser = tf.keras.layers.Dense(units=layers[0], activation='relu', kernel_regularizer=regularization, name=f'{layer_name}_layer_1_{layers[0]}')(nu_input)
NetUser = tf.keras.layers.Dense(units=layers[1], activation='relu', kernel_regularizer=regularization, name=f'{layer_name}_layer_2_{layers[1]}')(NetUser)

# NetItems: input is one row of R_iu (an item's ratings from all U users)
layer_name = 'NetItems'
ni_input = tf.keras.Input(shape=(U,), name=f'{layer_name}_Input_{U}')
NetItems = tf.keras.layers.Dense(units=layers[0], activation='relu', kernel_regularizer=regularization, name=f'{layer_name}_layer_1_{layers[0]}')(ni_input)
NetItems = tf.keras.layers.Dense(units=layers[1], activation='relu', kernel_regularizer=regularization, name=f'{layer_name}_layer_2_{layers[1]}')(NetItems)
```

**DI Model**

In this model, we first concatenate the outputs of *NetUser* and *NetItems*; the result serves as the input of the DI model. There are other ways to combine two networks, such as multiplication, but we'll concatenate them.

```python
DI = tf.keras.layers.Concatenate(name='DI_Network_Input')([NetUser, NetItems])
```

Since the output of each of the two networks above has size 128, the concatenated layer has size 256 (neurons).
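The size of the concatenated layer can be checked with a stripped-down functional model. This is only a sketch: the two inputs stand in for the 128-unit outputs of NetUser and NetItems, and the names `branch_a`, `branch_b` and `toy_model` are hypothetical.

```python
import numpy as np
import tensorflow as tf

# Two stand-in inputs of width 128, playing the role of the two DF branches.
branch_a = tf.keras.Input(shape=(128,))
branch_b = tf.keras.Input(shape=(128,))

# Concatenate along the feature axis, as the DI input layer does.
merged = tf.keras.layers.Concatenate()([branch_a, branch_b])
toy_model = tf.keras.Model(inputs=[branch_a, branch_b], outputs=merged)

# Push a dummy batch of 4 rows through the model and inspect the shape.
out = toy_model([np.zeros((4, 128), dtype='float32'),
                 np.ones((4, 128), dtype='float32')])
concat_shape = tuple(out.shape)
print(concat_shape)  # (4, 256)
```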

Next, we'll create the DI model's hidden layer and output layer. The DI model has a single hidden layer of size 64, configured like the DF model layers. The output layer uses a sigmoid activation, and we won't use a bias for this layer.

```python
layer_name = 'DI_Net'
DI = tf.keras.layers.Dense(units=64, activation='relu', kernel_regularizer=regularization, name=f'{layer_name}_layer_1_64')(DI)
DI = tf.keras.layers.Dense(units=1, activation='sigmoid', kernel_regularizer=regularization, use_bias=False, name='Sigmoid_Output')(DI)
```

*Are you wondering why sigmoid, when ratings are integer values and the sigmoid output lies between 0 and 1?*

*Ans: We'll normalize the ratings with the mapping f(x) = x/5, which transforms rating values from the range 0–5 to 0–1. This way we can use sigmoid as the output layer, where 1 means a rating of 5.*
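The mapping and its inverse can be sketched in a few lines of NumPy. The `sigmoid_outputs` values below are hypothetical model outputs chosen for illustration:

```python
import numpy as np

# Integer ratings are scaled into the sigmoid's output range with f(x) = x / 5.
ratings = np.array([1, 3, 5])
targets = ratings / 5  # 0.2, 0.6, 1.0

# Hypothetical sigmoid outputs, mapped back to the rating scale.
sigmoid_outputs = np.array([0.18, 0.64, 0.97])
predicted_ratings = np.round(sigmoid_outputs * 5).astype(int)
print(predicted_ratings)  # [1 3 5]
```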

Finally, define the model instance:

```python
model = tf.keras.Model(inputs=[nu_input, ni_input], outputs=DI)
```

Let's check the summary of the model:

```python
model.summary()
```

Let's visualize it to check that the model is constructed as we imagined:

```python
tf.keras.utils.plot_model(model)
```

Great, we have created the model as we wished. The forward propagation part is now complete. Next, we have to work on the loss function and then on backpropagation to update the weights.

**Loss Function**

In this NN we'll use a mixed Top1 and BCE loss function, which TensorFlow doesn't provide out of the box, so we'll create our own.
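Written out so that it matches the `loss_top1` implementation later in this section, the combined loss has the following form. This is a reconstruction from that code: the symbols $L_{\text{pair}}$, $L_{\text{point}}$ and $R(\Theta)$ are notation assumed here for the pairwise loss, the pointwise loss and the sum of the layers' L2 penalties, while $\alpha$ and $\lambda$ correspond to the `alpha` and `lamda` arguments.

$$
L = \alpha \, L_{\text{pair}} + (1 - \alpha) \, L_{\text{point}} + \lambda \, R(\Theta)
$$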

It combines two different kinds of loss functions, a pairwise loss and a pointwise loss, plus a third term: the L2 regularization that we used in the layers.

In this loss function, the pointwise loss is a binary cross-entropy loss, while the pairwise loss is the Top1 loss.
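As computed in the `loss_top1` code in this section, the Top1 term averages two sigmoid terms over the $N$ (positive, negative) pairs of a batch. The averaging over $N$ and the symbol $\sigma$ for the sigmoid are read off that code rather than quoted from the paper:

$$
L_{\text{Top1}} = \frac{1}{N} \sum_{(u,\,i,\,j)} \Big[ \sigma\big(y_{uj} - y_{ui}\big) + \sigma\big(y_{uj}^{2}\big) \Big]
$$

The first term is small when a rated item scores higher than a non-rated one, and the second term pushes the scores of non-rated items toward zero.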

In the Top1 loss, *y_uj* are the predicted values for negative items (items the user has not rated), while *y_ui* are the predicted values for positive items (items the user has rated). You don't need to go deep into positive and negative items; just look at how I code this loss function and then use it in custom training to update the weights.

Let's code it.

```python
def loss_top1(model, inputs_pos, inputs_neg, outputs_pos, alpha, lamda=0.1, sample_weight=[1, 1]):
    # Forward passes for the rated (positive) and sampled non-rated (negative) items
    preds_pos = model(inputs_pos)
    preds_neg = model(inputs_neg)
    point_loss = tf.keras.losses.BinaryCrossentropy()
    # Pairwise Top1 term
    pair_term = tf.math.reduce_mean(tf.keras.activations.sigmoid(preds_neg - preds_pos) + tf.keras.activations.sigmoid(tf.square(preds_neg)))
    # Pointwise binary cross-entropy term
    point_term = point_loss(tf.reshape(outputs_pos, shape=(1, -1)), tf.reshape(preds_pos, shape=(1, -1)), sample_weight=sample_weight)
    # model.losses holds the L2 penalties of the regularized layers
    losses = (alpha * pair_term + (1 - alpha) * point_term) + lamda * sum(model.losses)
    return losses
```

We have now defined our loss function and named it *loss_top1*, because it uses Top1 as the pairwise loss. The research paper discusses other pairwise functions, but in this article we'll use Top1.
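To see how the two terms behave, here is a NumPy sketch of the same arithmetic on hypothetical predictions. It leaves out the regularization term, and the helper names (`top1_pairwise`, `binary_cross_entropy`, `mixed_loss`) are illustrative, not part of the article's code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def top1_pairwise(preds_pos, preds_neg):
    # Mean of sigmoid(neg - pos) + sigmoid(neg^2), mirroring loss_top1.
    return float(np.mean(sigmoid(preds_neg - preds_pos) + sigmoid(preds_neg ** 2)))

def binary_cross_entropy(targets, preds, eps=1e-7):
    # Clip predictions to avoid log(0).
    preds = np.clip(preds, eps, 1.0 - eps)
    return float(-np.mean(targets * np.log(preds) + (1.0 - targets) * np.log(1.0 - preds)))

def mixed_loss(targets, preds_pos, preds_neg, alpha):
    # Weighted sum of the pairwise and pointwise terms (no regularization here).
    return (alpha * top1_pairwise(preds_pos, preds_neg)
            + (1.0 - alpha) * binary_cross_entropy(targets, preds_pos))

# Hypothetical normalized ratings and predictions.
targets = np.array([0.8, 1.0])
good_pos, good_neg = np.array([0.9, 0.8]), np.array([0.1, 0.2])

# Rated items scored high and non-rated items scored low: lower loss.
loss_good = mixed_loss(targets, good_pos, good_neg, alpha=0.1)
# Scores swapped: the ranking is wrong, so the loss goes up.
loss_bad = mixed_loss(targets, good_neg, good_pos, alpha=0.1)
print(loss_good < loss_bad)  # True
```

With `alpha=1.0` only the pairwise term remains, and with `alpha=0.0` only the binary cross-entropy term remains.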

**Functions for Negative Items Sampling**

Here I'm going to write some functions for negative sampling. You don't need to understand them in detail, because our main goal is to learn how custom training works.

```python
def get_users_neg_items(data):
    U = data['userId'].unique()
    I = data['movieId'].unique()
    df = pd.DataFrame(index=U, columns=['pos_counts', 'neg_items'])
    for user_i in U:
        user_pos_items = data[data['userId'] == user_i]['movieId'].values
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')
            # Store the number of rated items and the array of non-rated items
            df.loc[user_i, :] = [user_pos_items.shape[0], np.setdiff1d(I, user_pos_items)]
        del user_pos_items
    return df

users_neg_items = get_users_neg_items(train_set)

def negative_sampling(users_neg_items):
    neg_items_sample = list()
    for _user in users_neg_items.values:
        # Draw as many non-rated items as the user has rated items
        neg_items_sample.append(np.random.choice(_user[1], _user[0]))
    return np.concatenate(neg_items_sample)
```
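The core of these helpers is two NumPy calls, shown here for a single hypothetical user. The item ids and the seeded generator `rng` are assumptions made for a reproducible illustration; the functions above use `np.random.choice` directly:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible

# Hypothetical catalogue and one user's rated (positive) items.
all_items = np.array([10, 20, 30, 40, 50])
user_pos_items = np.array([10, 30])

# Negative candidates are all items the user has not rated.
neg_candidates = np.setdiff1d(all_items, user_pos_items)

# Draw one negative item per positive item (sampling with replacement).
sampled_neg = rng.choice(neg_candidates, size=user_pos_items.shape[0])
print(neg_candidates, sampled_neg)
```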

**Test Set**

During the train-test split, some movie or user may end up missing from the train set. If that happens, the test set contains items that the NetUser input does not cover, so the input size of NetUser no longer matches the number of items. We therefore have to drop those items.

*Are you wondering how we'll find the rating for an item in production if we drop it during training?*

*Ans: In production we'll use the whole dataset as the train set, so no item is dropped and NetUser has as many input neurons as there are items. For model testing, however, we may need to drop items or users with very few ratings whose ratings are not available in the train set.*

Let's drop such items or users. Boolean indexing keeps only the rows whose user and movie both appear in the train matrices:

```python
true_test_set = test_set[test_set['userId'].isin(R_ui.index) & test_set['movieId'].isin(R_iu.index)]
```

**Prediction Function**

Let's code the prediction function. Our model predicts values between 0 and 1, and we need to convert them back to the 0–5 rating scale.

```python
def predict(model, data, users_interactions=R_ui, items_interactions=R_iu, batch_size=20000, rating_format=True):
    preds = list()
    # Predict in batches to keep memory usage bounded
    for start, end in zip(np.arange(0, data.shape[0], batch_size), np.arange(batch_size, data.shape[0]+batch_size, batch_size)):
        batch = data.iloc[start:end]
        _users = users_interactions.loc[batch['userId'].values].to_numpy()
        _items = items_interactions.loc[batch['movieId'].values].to_numpy()
        preds.append(model([_users, _items]))
        del _users, _items
    preds = np.concatenate(preds)
    if rating_format:
        # Map sigmoid outputs back to integer ratings
        preds = np.round(preds*5).astype(int)
    else:
        preds = np.round(preds, 2).astype(float)
    return preds.reshape(-1,)
```

Finally, we are going to train our model.

# Custom Training

I hope you are familiar with the concept of backpropagation and have some intuition about *GradientTape*. *GradientTape* is TensorFlow's automatic differentiation tool: it computes gradients (derivatives) for us without much effort. In backpropagation we move backward from the loss function, computing the derivative of the loss with respect to each neuron's weights and biases. In TensorFlow we don't have to do this by hand; *GradientTape* does the whole job.
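As a minimal sketch of what *GradientTape* does, consider the derivative of y = x² at x = 3, which should be 2x = 6. The variable names are illustrative:

```python
import tensorflow as tf

x = tf.Variable(3.0)

# Operations on trainable variables inside the context are recorded.
with tf.GradientTape() as tape:
    y = x * x

# dy/dx = 2 * x, evaluated at x = 3.
grad = tape.gradient(y, x)
grad_value = float(grad.numpy())
print(grad_value)  # 6.0
```

The training loop in this section follows the same pattern, with the model's loss in place of `y` and `model.trainable_variables` in place of `x`.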

In training, we'll also add a callback that terminates training if the loss is not improving. After termination, it restores the best weights to J-NCF.

Let's first evaluate our model before training and see what the mean absolute errors are:

```python
loss_at_epoch = list()
train_accuracy = list()
test_accuracy = list()

train_accuracy.append(mean_absolute_error(train_set['rating'], predict(model, train_set)))
test_accuracy.append(mean_absolute_error(true_test_set['rating'], predict(model, true_test_set)))

best_train_accuracy = train_accuracy[-1]
best_test_accuracy = test_accuracy[-1]

print(f'''
======================= Initially =================================
Evaluation on
Train: {train_accuracy[-1]}
Test: {test_accuracy[-1]}
''')
```

Out:

Let's add some settings for training the model:

```python
epochs = 3
ng_samples_ns = 1
batch_size = 1000
termination_epochs = 2  # Terminate training when the loss has not improved for this many epochs
```

As you can see, I've defined a variable *batch_size*, because I'm going to use batch training.

For optimization, we'll use the Adam algorithm:

```python
optimizer = tf.keras.optimizers.Adam(learning_rate=0.00005, beta_1=0.99)
```

Let's define some variables that are needed and will be updated during training:

```python
batches_indices = np.arange(0, train_set.shape[0]+batch_size, batch_size)
indices = list(train_set.index)
not_improving_epochs = 0
first_time = True
best_weights = None
```
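To see how `batches_indices` cuts the data into batches, here is a sketch with a hypothetical 2,500-row dataset. The names `n_rows`, `boundaries` and `batch_sizes` are illustrative:

```python
import numpy as np

n_rows = 2500      # hypothetical number of training rows
batch_size = 1000

# Boundaries run one batch past the end, so the last partial batch is included.
boundaries = np.arange(0, n_rows + batch_size, batch_size)

rows = list(range(n_rows))
batch_sizes = []
start = 0
for end in boundaries[1:]:
    # Slicing past the end of a list simply returns the remaining rows.
    batch_sizes.append(len(rows[start:end]))
    start = end

print(boundaries.tolist(), batch_sizes)  # [0, 1000, 2000, 3000] [1000, 1000, 500]
```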

Finally, let's create the training loop:

```python
for epoch in range(epochs):
    print(f'****** Epoch: {epoch+1} ******')
    print('\n')

    # Shuffle the row order and draw fresh negative items for this epoch
    np.random.shuffle(indices)
    for ng_n in range(ng_samples_ns):
        train_set[f'neg{ng_n}'] = negative_sampling(users_neg_items)

    start_point = 0
    for j, end_point in enumerate(batches_indices[1:]):
        batch = train_set.loc[indices[start_point:end_point]]
        user_i = R_ui.loc[batch['userId'].values, :].values
        items_pos = R_iu.loc[batch['movieId'].values, :].values
        outputs = (batch['rating'] / 5).values  # normalize ratings to 0-1

        # Lightweight progress indicator
        if j%np.ceil(batches_indices.shape[0]/50) == 0:
            print('\b'*len(str(j-1)) + f'={j}', end='')

        for ng_n in range(ng_samples_ns):
            items_neg = R_iu.loc[batch[f'neg{ng_n}'].values, :].values

            # The forward pass and the loss must be computed inside the tape
            with tf.GradientTape() as tape:
                losses = loss_top1(model=model, inputs_pos=[user_i, items_pos], outputs_pos=outputs, inputs_neg=[user_i, items_neg], alpha=0.1, lamda=0.05)

            # Use the very first batch loss as the initial best loss
            if first_time:
                loss_at_epoch.append(losses.numpy())
                best_loss = loss_at_epoch[-1]
                first_time = False

            grads = tape.gradient(losses, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))

        del batch, user_i, items_pos, items_neg
        start_point = end_point

    loss_at_epoch.append(losses.numpy())
    train_accuracy.append(mean_absolute_error(train_set['rating'], predict(model, train_set)))
    test_accuracy.append(mean_absolute_error(true_test_set['rating'], predict(model, true_test_set)))

    best = False
    if (loss_at_epoch[-1] < best_loss):
        best_weights = model.get_weights()
        best_train_accuracy = train_accuracy[-1]
        best_test_accuracy = test_accuracy[-1]
        best_loss = loss_at_epoch[-1]
        not_improving_epochs = 0
        best = True
    else:
        not_improving_epochs += 1

    if not_improving_epochs == termination_epochs:
        # Restore the best weights, if any epoch improved on the initial loss
        if best_weights is not None:
            model.set_weights(best_weights)
        break

    print(f'''
    Loss: {loss_at_epoch[-1]}
    Evaluation on
    Train: {train_accuracy[-1]}
    Test: {test_accuracy[-1]}
    Best: {'Yes' if best else 'No'}
    ''')
```

Let me explain it step by step.

```python
for epoch in range(epochs):
    print(f'****** Epoch: {epoch+1} ******')
    print('\n')
```

First of all, we started a loop over the epochs and printed the epoch number.

```python
np.random.shuffle(indices)
for ng_n in range(ng_samples_ns):
    train_set[f'neg{ng_n}'] = negative_sampling(users_neg_items)
```

Next, we shuffled the *train-set* indices so that in each epoch the training data is fed to the model in random order rather than in sequence.

Then we performed negative sampling: whenever we send a user's history together with a movie that user rated to the DF model, we also randomly select a movie that the user has not rated, for use in the loss function. This means that, for each user, the training data contains as many non-rated items as rated items.

```python
start_point = 0
for j, end_point in enumerate(batches_indices[1:]):
    batch = train_set.loc[indices[start_point:end_point]]
    user_i = R_ui.loc[batch['userId'].values, :].values
    items_pos = R_iu.loc[batch['movieId'].values, :].values
    outputs = (batch['rating'] / 5).values
```

Since we are training in batches, this is where the batch loop starts. We selected *user_i* for the NetUser NN and *items_pos* for the NetItems NN, while *outputs* are the ratings given by *user_i* to *items_pos*. Note that the ratings are normalized here.

In J-NCF, we can select multiple samples of negative items, i.e. for each positive item it is also possible to send several different negative items for a user. For this article we selected only one sample, i.e. for each positive (rated) item of a user we select only one negative (non-rated) item.

```python
for ng_n in range(ng_samples_ns):
    items_neg = R_iu.loc[batch[f'neg{ng_n}'].values, :].values
```

So this loop runs only once.

```python
with tf.GradientTape() as tape:
    losses = loss_top1(model=model, inputs_pos=[user_i, items_pos], outputs_pos=outputs, inputs_neg=[user_i, items_neg], alpha=0.1, lamda=0.05)
```

Next, we opened a *GradientTape* context. The forward pass of the NN and the loss calculation must both happen inside it, because the tape records the operations executed within the context and later walks them backward to compute the gradients (derivatives). If you run the model before entering the tape (earlier), all you have are numerical output values with no recorded operations behind them, so *GradientTape* cannot compute gradients with respect to the weights. So remember: whenever you create your own custom model, define the computation first (everything inside *def*) and then execute it inside *GradientTape*.
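The difference is easy to demonstrate on a single hypothetical weight `w` (a sketch, not the J-NCF model). When the forward computation happens before the tape is opened, the tape has nothing to trace back to `w` and returns `None`:

```python
import tensorflow as tf

w = tf.Variable(2.0)

# Forward computation done BEFORE the tape: nothing about w is recorded.
y_outside = w * 3.0
with tf.GradientTape() as tape:
    loss_outside = y_outside * 1.0
grad_outside = tape.gradient(loss_outside, w)

# Forward computation done INSIDE the tape: d/dw (3w)^2 = 18w = 36 at w = 2.
with tf.GradientTape() as tape:
    loss_inside = (w * 3.0) ** 2
grad_inside = tape.gradient(loss_inside, w)

print(grad_outside, float(grad_inside.numpy()))  # None 36.0
```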

Note that we have chosen the values *alpha = 0.1* and *lamda = 0.05* here.

Since we ran our NN and calculated the loss inside *GradientTape*, we can now calculate the gradients of the loss w.r.t. the weights and biases of every layer.

```python
grads = tape.gradient(losses, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
```

We calculated the gradients using the *GradientTape* defined above and then applied them with the Adam optimizer. At this point the weights of the whole NN have been updated. Once the model has gone through all batches, we add a callback that terminates training if the NN is not improving at the end of an epoch.

```python
loss_at_epoch.append(losses.numpy())
train_accuracy.append(mean_absolute_error(train_set['rating'], predict(model, train_set)))
test_accuracy.append(mean_absolute_error(true_test_set['rating'], predict(model, true_test_set)))
```

Here we calculated the mean absolute error (MAE) on the train and test sets after predicting the ratings with our trained model.
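MAE is simply the average absolute difference between the true and predicted ratings. A hand-rolled NumPy sketch on hypothetical ratings, equivalent in spirit to scikit-learn's `mean_absolute_error`:

```python
import numpy as np

# Hypothetical true and predicted ratings.
true_ratings = np.array([5, 3, 4, 1])
pred_ratings = np.array([4, 3, 2, 1])

# Absolute errors are 1, 0, 2, 0, so the mean is 0.75.
mae = float(np.mean(np.abs(true_ratings - pred_ratings)))
print(mae)  # 0.75
```

An MAE of 0.75 means the predictions are off by three quarters of a star on average.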

```python
best = False
if (loss_at_epoch[-1] < best_loss):
    best_weights = model.get_weights()
    best_train_accuracy = train_accuracy[-1]
    best_test_accuracy = test_accuracy[-1]
    best_loss = loss_at_epoch[-1]
    not_improving_epochs = 0
    best = True
else:
    not_improving_epochs += 1

if not_improving_epochs == termination_epochs:
    # Restore the best weights, if any epoch improved on the initial loss
    if best_weights is not None:
        model.set_weights(best_weights)
    break
```

Finally, we defined the callback that terminates training. Note that we monitor the loss value: if the loss does not improve for 2 consecutive epochs, training terminates and the best weights are set on the model.
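The stopping rule can be isolated into a few lines of plain Python. This is a sketch on a hypothetical list of per-epoch losses; the function name `early_stopping_epoch` is made up here, and `patience` plays the role of `termination_epochs`:

```python
def early_stopping_epoch(losses, patience):
    """Return (epoch at which training stops, epoch with the best loss)."""
    best_loss = float('inf')
    best_epoch = None
    not_improving = 0
    for epoch, loss in enumerate(losses):
        if loss < best_loss:
            # Improvement: remember it and reset the counter.
            best_loss, best_epoch = loss, epoch
            not_improving = 0
        else:
            not_improving += 1
        if not_improving == patience:
            return epoch, best_epoch
    return len(losses) - 1, best_epoch

# Hypothetical per-epoch losses: epochs 2 and 3 fail to beat epoch 1.
stop_epoch, best_epoch = early_stopping_epoch([0.9, 0.7, 0.8, 0.75, 0.6], patience=2)
print(stop_epoch, best_epoch)  # 3 1
```

Training stops at epoch 3 and never reaches the lower loss at epoch 4, which is the usual trade-off of a small patience value.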

Let's run the whole training and look at the output.

If you scroll up, you'll see that before training the train and test MAE were 1.94 and 1.96 respectively. During training, the loss and MAE improved in each epoch. By the end of training, the train and test MAE had improved to 0.67, which is great.

Our model has learned well and reduced the MAE from 1.96 to 0.67. We've also learned how to create a custom NN model, whatever the NN architecture is.

I hope you enjoyed this article and learned from it. If anything is unclear, just leave a comment and I'll explain. Don't forget to clap. Thanks!