By Alex Ough

In the previous blog (How To Build A Serverless Alert Email Notification System Using Machine Learning), I shared how to use Machine Learning, specifically Unsupervised Learning, to detect which incoming emails require an alert and to send notifications when such emails are found.

With Unsupervised Learning, the input data (emails) are grouped under topics that the algorithm generates automatically, and there is no way to correct the grouping when specific emails are assigned to the wrong topic. To address this limitation, I'd like to show how to improve the project using Supervised Learning.

There are many useful algorithms for training a model with Supervised Learning, but here I'd like to use the pre-trained 'Universal Sentence Encoder' model from 'TensorFlow Hub', which lets you train a model with a smaller dataset and speeds up training. 'TensorFlow Hub' is a library for the publication, discovery, and consumption of reusable parts of machine learning models, and 'Universal Sentence Encoder' encodes text into high-dimensional vectors for various NLP (Natural Language Processing) tasks.

The model is trained and optimized for greater-than-word-length text, such as sentences, phrases, or short paragraphs. It is trained on a variety of data sources and tasks, with the aim of dynamically accommodating a wide range of natural language understanding tasks. You can get more detail about it here.

To use the model, you first need to install 'TensorFlow Hub' using pip, as shown below (TensorFlow itself is also required).

pip install tensorflow-hub

Now we're going to allocate the hub module instance and see how it converts texts of different lengths into fixed-size vectors.

import tensorflow as tf
import numpy as np
import tensorflow_hub as hub

module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/3"
embed = hub.Module(module_url)

Let's try three texts of different lengths and see what we get as results.

word = "Elephant"
sentence = "I am a sentence for which I would like to get its embedding."
paragraph = (
    "Universal Sentence Encoder embeddings also support short paragraphs. "
    "There is no hard limit on how long the paragraph is. Roughly, the longer "
    "the more 'diluted' the embedding will be.")
messages = [word, sentence, paragraph]

print("\n")

with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    message_embeddings = session.run(embed(messages))
    for i, message_embedding in enumerate(np.array(message_embeddings).tolist()):
        print("Message: {}".format(messages[i]))
        print("Embedding size: {}".format(len(message_embedding)))
        message_embedding_snippet = ", ".join((str(x) for x in message_embedding[:3]))
        print("Embedding: [{}, ...]\n".format(message_embedding_snippet))

As you can see in the result, each text is encoded into an array of 512 floating-point numbers, which will be used as the input to our neural network. You can also get the embedding size directly from the module, like this:

embed_size = embed.get_output_info_dict()['default'].get_shape()[1].value
embed_size

To proceed with training, let's take the same dataset we used in the previous blog. That dataset was automatically grouped by the Unsupervised Learning algorithm, which produced a number of topics with their proportion values (probabilities). Before training a model, we need to label each record with a pre-defined classification. You can use the topics the Unsupervised Learning algorithm generated, or adjust them manually to make them more accurate. I made some adjustments and labeled each email with one of three labels: 'Billing_Alert', 'Service_Alert', or 'Non_Alert'.

import pandas as pd

df = pd.read_csv('./AS.US.AWScto.labeled.csv')
df

You can see how many records fall under each label, like this:

df['label'].value_counts()

Next, convert the 'label' column to the category type and verify the number of labels (categories) along with their titles.

df.label = df.label.astype('category')
category_counts = len(df.label.cat.categories)
category_counts

df.label.cat.categories

Now it is time to build the train and test datasets from the labeled dataset so we can feed them into the neural network model.
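One note: the split below simply takes the first 90% of rows for training and the rest for testing, so if your CSV happens to be ordered (for example, by label), you may want to shuffle the dataframe first. A minimal, optional way to do that with pandas:

df = df.sample(frac=1, random_state=42).reset_index(drop=True)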

idx = int(df.shape[0] * 0.9)
df_train = df.loc[:idx, :]
df_test = df.loc[idx+1:, :]

train_text = df_train['text'].tolist()
train_text = np.array(train_text, dtype=object)[:, np.newaxis]
train_label = np.asarray(pd.get_dummies(df_train.label), dtype=np.int8)

test_text = df_test['text'].tolist()
test_text = np.array(test_text, dtype=object)[:, np.newaxis]
test_label = np.asarray(pd.get_dummies(df_test.label), dtype=np.int8)

Once both datasets are ready, let's build a neural network model to train, using the allocated hub instance for the embedding layer. The input size is 512 (embed_size) and the output size is 3 (category_counts). Here the neural network has three hidden layers, but you can change the number of layers as you prefer.

import keras.layers as layers
from keras.models import Model

def UniversalEmbedding(x):
    return embed(tf.squeeze(tf.cast(x, tf.string)), signature="default", as_dict=True)["default"]

input_text = layers.Input(shape=(1,), dtype=tf.string)
embedding = layers.Lambda(UniversalEmbedding, output_shape=(embed_size,))(input_text)
dense = layers.Dense(256, activation='relu')(embedding)
dense_2 = layers.Dense(128, activation='relu')(dense)
dense_3 = layers.Dense(64, activation='relu')(dense_2)
pred = layers.Dense(category_counts, activation='softmax')(dense_3)

model = Model(inputs=[input_text], outputs=pred)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

After successfully creating the model, we can start training it with the train and test datasets. Here, I set the batch size to 32 and the number of epochs to 20, but you can choose any values you think are appropriate.

from keras import backend as K

model_path = './model.h5'
epochs = 20
batch_size = 32

with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    history = model.fit(train_text,
                        train_label,
                        validation_data=(test_text, test_label),
                        epochs=epochs,
                        batch_size=batch_size)
    model.save_weights(model_path)
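If you'd like to see how the accuracy evolved over the epochs, the 'history' object returned by model.fit keeps the per-epoch metrics. A quick way to inspect them (the exact key names, e.g. 'val_acc' vs. 'val_accuracy', depend on your Keras version):

print(history.history.keys())
print(history.history.get('val_acc', history.history.get('val_accuracy')))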

When the training is completed, the model weights will be saved at 'model_path', and we can use them to predict the labels of new emails. (I removed the actual email contents below for security purposes.)

 

text_1 = """
"""
text_2 = """
"""
text_3 = """
"""
text_4 = """
"""

new_text = [text_1, text_2, text_3, text_4]
new_text = np.array(new_text, dtype=object)[:, np.newaxis]

with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    model.load_weights(model_path)
    predicts = model.predict(new_text, batch_size=batch_size)

predicts

The prediction results are stored in the 'predicts' variable. As you can see, each input text has a probability for each label, but the label titles are not shown. Here is how to get the label with the highest probability along with its title.

categories = df_train.label.cat.categories.tolist()
predict_logits = predicts.argmax(axis=1)
predict_labels = [categories[logit] for logit in predict_logits]
predict_labels

Now we can see which text belongs to which label.
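For example, you could print each text alongside its predicted label like this (a minimal sketch based on the variables defined above):

# Print a short preview of each text with its predicted label.
for text, label in zip(new_text[:, 0], predict_labels):
    print("{} -> {}".format(text[:80], label))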

So far, I have presented how to train a neural network model using the 'Universal Sentence Encoder' from 'TensorFlow Hub' to improve prediction accuracy on new emails compared to the results from Unsupervised Learning. One thing to note is that the 'TensorFlow' and 'TensorFlow Hub' libraries are quite large, so it is not recommended to build this version of the alert system with a Lambda function as we did in the previous blog. Instead, it is more appropriate to containerize the model and provide API endpoints for prediction. There are a few ways to implement this; using SageMaker is one you should try if you don't have much knowledge of containers.
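As a rough illustration of the API-endpoint approach, here is a minimal sketch of a prediction service using Flask. The route name, request format, and port are my own assumptions for illustration, not part of the original project; in practice you would package this script, the saved weights, and the label categories into a container image (or let SageMaker handle the hosting).

# A hypothetical prediction endpoint; assumes the model, embed, categories, and
# model_path objects defined earlier are available in this process.
from flask import Flask, request, jsonify

app = Flask(__name__)

# Keep one TensorFlow session alive for the lifetime of the service.
session = tf.Session()
K.set_session(session)
session.run(tf.global_variables_initializer())
session.run(tf.tables_initializer())
model.load_weights(model_path)
graph = tf.get_default_graph()

@app.route('/predict', methods=['POST'])
def predict():
    # Expects a JSON body like {"texts": ["email body 1", "email body 2"]}.
    texts = request.get_json()['texts']
    batch = np.array(texts, dtype=object)[:, np.newaxis]
    with graph.as_default():
        K.set_session(session)
        probs = model.predict(batch)
    labels = [categories[i] for i in probs.argmax(axis=1)]
    return jsonify({'labels': labels, 'probabilities': probs.tolist()})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)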
