# How to Classify Emails using k-NN (k-nearest neighbors) and WordNet

August 30, 2019

As a leading cloud & disaster recovery service provider, my company receives request emails related to IT operations daily. To support these requests, our Operations team members need to sort the request emails into several categories (which vary by customer) based on their content and create service tickets accordingly. But it is all manual work, which is error-prone and time-consuming. In this blog post, I’ll outline how I automated this process using the k-NN classification algorithm and the path similarity method of synsets in WordNet, which will be explained in more detail later.

Before digging into the automated system, let me explain the k-NN algorithm. The k-nearest neighbors algorithm (k-NN) is used for classification and regression. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.

k-NN is a type of instance-based learning, or “lazy learning,” where the function is only approximated locally and all computation is deferred until classification. The neighbors are taken from a set of objects. This can be thought of as the training set for the algorithm, though no explicit training step is required.

Now, let’s go over the system. I’m using a Jupyter Notebook for code presentation, and this is the code to import the libraries necessary to begin.

To build training data (as mentioned above, k-NN is a lazy learner that requires no explicit training step, but for convenience let’s call the labeled data “training data”), let’s collect a certain amount of old emails. Here we will use only the ‘subject’ text of each email as the feature data for classification.

For better and more efficient classification, we need to clean up the ‘subject’ texts to remove anything that does not affect our classification. The ‘subject’ texts are usually not normal sentences with a subject / verb / object, but rather contracted ones that don’t contain ordinary stop words, so we’ll just remove non-alphanumeric characters and change all words to lowercase. You can see the final pre-processed text after the steps are completed.
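The pre-processing cell isn’t reproduced here; a simple sketch (the function name `clean_subject` and the sample subject are my own) that keeps only alphanumeric characters and lowercases everything:

```python
import re

def clean_subject(subject):
    """Replace every run of non-alphanumeric characters with a single
    space, then lowercase and trim the result."""
    return re.sub(r'[^A-Za-z0-9]+', ' ', subject).strip().lower()

print(clean_subject('RE: [Notice] Scheduled Maintenance - Feb 10!'))
# 're notice scheduled maintenance feb 10'
```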

To determine the closeness among the objects, this system uses the path similarity method of synsets in WordNet. WordNet is a lexical database for English in which nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. A synset is a set of synonyms that share a common meaning, and synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated along paths, which are used to measure the closeness of words. For more detail, please see https://wordnet.princeton.edu/.

To build the training data, we’ll generate a synset dictionary from all pre-processed email ‘subject’ texts using the NLTK (Natural Language Toolkit) library. The synsets of each pre-processed text in the dictionary will be used to measure closeness to the data we want to classify.

This is how to generate WordNet synsets of a specific text using NLTK:

First, tokenize the cleaned text (we’ll use the above pre-processed subject text as the input text):
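Since the cleaned text contains only lowercase alphanumeric words separated by spaces, splitting on whitespace is enough here (a full tokenizer such as `nltk.word_tokenize` would also work; the sample text is illustrative):

```python
# An example cleaned subject from the pre-processing step.
text = 're notice scheduled maintenance affecting storage feb 10'

tokens = text.split()  # whitespace split doubles as a tokenizer here
print(tokens)
# ['re', 'notice', 'scheduled', 'maintenance', 'affecting', 'storage', 'feb', '10']
```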

Select one of the tokens (I chose ‘maintenance’) and generate its synset list. A synset is identified with a 3-part name of the form ‘word.pos.nn’, where ‘pos’ (part of speech) is the grammatical class of the word, such as noun, adjective, or verb, and ‘nn’ is a sense index, because a word can have multiple meanings. You can see the list of synonyms in the 3-part name form:

And you can see the meaning of each synset using the ‘definition’ method:

If we try another token -- ‘affecting’ -- we can see the list of synsets whose ‘pos’ values are ‘v’ instead of ‘n’:

When we classify the emails, we want to remove the names of the months (‘January’, ‘February’, etc.) and potentially other words, because they do not affect the classification. However, we may need to compare each name against a few different forms because, for example, either ‘Feb’ or ‘February’ can appear, which is not a simple task when you consider that we have to check all 12 months. It becomes a little easier when we use their synsets.

For example, compare what we get for ‘Feb’ with what we get for ‘February’.

As you can see above, the lemmatized forms of both are the same, so we can compare lemma names when checking for the words of each month.

Now, let’s see how to check the similarity of two different texts to measure the closeness needed for classification. You might think of simple similarity methods that compare just the letters of both texts, but we also want to consider their meaning in addition to the location of words, and synsets are the best fit for that. One of the available methods is ‘path_similarity’. It scores two synsets based on the shortest path that connects them. The score is in the range of 0 to 1; it returns ‘None’ if no path is found, and a score of 1 represents identity. After calculating a path score for each word in a text, we will combine them into a percentage value representing the level of closeness. Please see http://www.nltk.org/howto/wordnet.html for more similarity methods.

Let’s try with sample words.

Now, to process all of the training data, these are the functions that generate the synsets and similarity scores.

One thing to notice in the ‘calculate_score’ function is that we’d like to take the sequence of words into account instead of comparing words at arbitrary positions. That’s why ‘idx_2’ in the second for-loop starts at the location where the previous word was found instead of at the beginning.

Try these functions with two example texts to see how they work. They are basically the same notification with different time frames, so we’d like to consider them identical.

As we expected, the synset lists of the two texts are identical, and we get a score of 1.

This time, let’s try with 2 texts that have different words but a similar meaning. One text has ‘Rescheduled’ and the other has ‘Delayed’, but we’d like to classify them into the same category.

One of the generated synsets differs between the two (obviously), but we still get a score of 0.82, which indicates relatively high similarity, so we can safely classify them into the same category.

Once all texts of the training data are converted to synset lists, we need to label each item with the proper category; these labels are what will be assigned to input data during classification.

Each input item to classify is compared against the training data using synset path similarity, and the top k training items with the highest similarity scores are returned along with their labels. You can then decide how to categorize the input depending on the returned labels and scores.
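Given similarity scores against every labeled training item (computed with a text-similarity function like the one sketched earlier), the final k-NN step is an ordinary top-k majority vote. The `classify` function and the sample scores below are illustrative, not the post’s exact code:

```python
from collections import Counter

def classify(scored_neighbors, k=3):
    """scored_neighbors: list of (similarity_score, label) pairs, one per
    training item. Returns (winning_label, top_k_pairs)."""
    top_k = sorted(scored_neighbors, key=lambda p: p[0], reverse=True)[:k]
    votes = Counter(label for _, label in top_k)
    return votes.most_common(1)[0][0], top_k

# Hypothetical scores of one incoming subject against four training items.
neighbors = [
    (0.95, 'maintenance'),
    (0.82, 'maintenance'),
    (0.40, 'incident'),
    (0.35, 'billing'),
]
label, top = classify(neighbors, k=3)
print(label)  # 'maintenance'
```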

Here I’ve presented how to implement the k-NN algorithm with the synset path similarity method of WordNet to classify customer emails. Once a prediction model is created using this approach, it is recommended to provide an API endpoint through which classification requests can be sent for easy access. AWS SageMaker is one good way to accomplish this because it provides a simplified solution, including automation of the overall pipeline from training to deployment.
