LIBSVM TUTORIAL PART 2 – Formatting the Data

In part one of this tutorial, I created 10 fake emails: 5 Spam and 5 Not Spam.  The goal is to have the Support Vector Machine (SVM) learn from these 10 emails so that it can identify new emails as Spam or Not Spam.  The next step in this process is to get the data into a format that LibSVM can understand and learn from.

To format the data, we need to understand what LibSVM is actually going to look at and try to learn from.  In machine learning lingo, this is referred to as the “Feature Set”.  In the case of document classification (or our simple Spam Detection use case) we are going to use the words contained in each email as the feature set.  If a certain word like “Viagra” is found in a lot of Spam emails, but not found in legitimate emails, then the algorithm should learn that this indicates that an email is likely Spam.

Each feature (word) that the SVM learns from needs to have a value.  In our case it will be a simple binary indicator.  If the word is contained in the email, the value will be true (1), and if the word is not found in the email, it will be false (0).

To represent each email, we will create a vector with the true/false values for every word in our universe (all the words in the 10 emails), but first, we need to identify every word that could possibly be in the email.  We will combine all the words from our data, and create a long list…

buy, viagra, cheap, drugs, with, no, prescription, by, mail, cialis, ed, others, like, and, hi, james, you, are, great, here, is, a, picture, of, my, dog, adding, to, the, email, list, send, me, your, we, going, give, you, raise

So above, you can see that we created a list of all the words in our emails.  The next step is to create a vector for each email, showing which words were in the email. So for example, we would take the first email:

“Buy Viagra cheap”

And we would format it as so:

buy=1, viagra=1, cheap=1, drugs=0, with=0, no=0, prescription=0, by=0, mail=0, cialis=0, ed=0, others=0, like=0, and=0, hi=0, james=0, you=0, are=0, great=0, here=0, is=0, a=0, picture=0, of=0, my=0, dog=0, adding=0, to=0, the=0, email=0, list=0, send=0, me=0, your=0, we=0, going=0, give=0, you=0, raise=0

Now, you might say that this was pretty tedious.  There are a lot of “=0”, or missing, features.  The good news is that we can use the idea of sparse vectors, or a sparse matrix, and only worry about the features (words) that are present.  So the above email could be simplified down to just:

buy=1, viagra=1, cheap=1

By not including the other words in our list, it is assumed that they were not in the email.
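To make the difference between the full vector and the sparse version concrete, here is a minimal Python sketch.  The word list is copied from above; the helper names (`tokenize`, `dense_vector`, `sparse_features`) and the very simple punctuation stripping are my own assumptions for illustration, not part of LibSVM.

```python
# Word list copied from the article, in the same order (39 entries).
WORDS = ("buy, viagra, cheap, drugs, with, no, prescription, by, mail, cialis, "
         "ed, others, like, and, hi, james, you, are, great, here, is, a, "
         "picture, of, my, dog, adding, to, the, email, list, send, me, your, "
         "we, going, give, you, raise").split(", ")


def tokenize(text):
    """Lowercase the text, split on whitespace, and strip basic punctuation."""
    words = (raw.strip(".,!?:;\"'").lower() for raw in text.split())
    return [w for w in words if w]


def dense_vector(text):
    """One 0/1 entry per word in WORDS, in list order."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in WORDS]


def sparse_features(text):
    """Keep only the words from WORDS that actually occur in the email."""
    present = set(tokenize(text))
    return {word: 1 for word in WORDS if word in present}


dense = dense_vector("Buy Viagra cheap")
sparse = sparse_features("Buy Viagra cheap")

print(dense[:5])   # [1, 1, 1, 0, 0]
print(sparse)      # {'buy': 1, 'viagra': 1, 'cheap': 1}
```

The dense vector always has one entry per word in the list, while the sparse version grows only with the number of words actually in the email.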

The next step in simplifying this data is to use an index for each feature instead of the whole word.  To do this, we take our list of words and use an integer to represent each word:  buy=1, viagra=2, cheap=3, drugs=4, with=5, etc.

1 = buy
2 = viagra
3 = cheap
4 = drugs
5 = with
6 = no
7 = prescription
8 = by
9 = mail
10 = cialis
11 = ed
12 = others
13 = like
14 = and
15 = hi
16 = james
17 = you
18 = are
19 = great
20 = here
21 = is
22 = a
23 = picture
24 = of
25 = my
26 = dog
27 = adding
28 = to
29 = the
30 = email
31 = list
32 = send
33 = me
34 = your
35 = we
36 = going
37 = give
38 = you
39 = raise

So the above email representation would be:

1=1 2=1 3=1

Where 1, 2, 3 are the indexes of the words in the email, and “=1” means that the word was found.
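The word-to-index lookup can be sketched in a few lines of Python.  This is only an illustration under my own assumptions: the names `word_to_index` and `to_indexed` are made up, and because “you” appears twice in the word list (positions 17 and 38), the sketch keeps the first position for lookups, so position 38 is simply never produced.

```python
# Word list copied from the article, in the same order.
WORDS = ("buy, viagra, cheap, drugs, with, no, prescription, by, mail, cialis, "
         "ed, others, like, and, hi, james, you, are, great, here, is, a, "
         "picture, of, my, dog, adding, to, the, email, list, send, me, your, "
         "we, going, give, you, raise").split(", ")

# Assign 1-based indexes; a repeated word keeps the first index it was given.
word_to_index = {}
for position, word in enumerate(WORDS, start=1):
    word_to_index.setdefault(word, position)


def to_indexed(words):
    """Turn a list of known words into 'index=1' pairs, sorted by index."""
    indexes = sorted({word_to_index[w] for w in words})
    return " ".join(f"{i}=1" for i in indexes)


print(to_indexed(["buy", "viagra", "cheap"]))  # 1=1 2=1 3=1
```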

Finally, to train the SVM, we need to tell the algorithm which “class” each instance belongs to.  The different classes in our case are “Spam” and “Not Spam”.  Since the format requires a single word for each class, from here on we’ll refer to “Not Spam” as “Ham”.  The format also requires us to use “:” (colon) instead of “=”.  The properly formatted email then looks like:

spam 1:1 2:1 3:1

And to build the entire training set data in the proper format, we would do this for each email on a new line in our input file.  For example, the second email in our list is spam, and has the following text:

“Cheap drugs, with no prescriptions”

This would translate to a new line containing:

spam 3:1 4:1 5:1 6:1 7:1
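Building one of these lines by hand is easy to get wrong, so here is a small Python sketch that assembles a line from a label and a list of word indexes.  The function name `libsvm_line` is hypothetical, and the indexes are the ones from the numbered word list above.  Sorting the indexes is my own choice here; it does not change which words are marked as present.

```python
def libsvm_line(label, indexes):
    """Build '<label> <index>:1 <index>:1 ...' with sorted, de-duplicated indexes."""
    features = " ".join(f"{i}:1" for i in sorted(set(indexes)))
    return f"{label} {features}"


# "Buy Viagra cheap" -> buy=1, viagra=2, cheap=3
first = libsvm_line("spam", [1, 2, 3])

# "Cheap drugs, with no prescriptions" -> cheap=3, drugs=4, with=5, no=6, prescription=7
second = libsvm_line("spam", [3, 4, 5, 6, 7])

print(first)   # spam 1:1 2:1 3:1
print(second)  # spam 3:1 4:1 5:1 6:1 7:1
```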

Finally, we would combine this all into one file that contains a new line for each email:

spam 1:1 2:1 3:1
spam 3:1 4:1 5:1 6:1 7:1
spam 1:1 4:1 8:1 9:1
spam 1:1 10:1 11:1 12:1
spam 1:1 7:1 4:1 13:1 2:1 10:1 14:1 12:1
ham 15:1 16:1 17:1 18:1 19:1
ham 16:1 20:1 21:1 22:1 23:1 24:1 25:1 26:1
ham 27:1 16:1 28:1 29:1 30:1 31:1
ham 32:1 33:1 22:1 23:1 24:1 34:1 26:1
ham 16:1 35:1 18:1 36:1 28:1 37:1 38:1 22:1 39:1
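One caveat before feeding this file to a tool: as far as I know, the stock LIBSVM command-line programs expect a numeric class label and feature indexes in ascending order on each line, and several of the lines above list the indexes in word order instead.  The Python sketch below rewrites the ten lines accordingly.  The mapping of “spam” to 1 and “ham” to -1 is an assumption I picked for illustration, and `normalize` is a hypothetical helper name; skip this step if your tooling accepts the lines as they are.

```python
# The ten training lines from the article, unchanged.
TRAINING_LINES = """\
spam 1:1 2:1 3:1
spam 3:1 4:1 5:1 6:1 7:1
spam 1:1 4:1 8:1 9:1
spam 1:1 10:1 11:1 12:1
spam 1:1 7:1 4:1 13:1 2:1 10:1 14:1 12:1
ham 15:1 16:1 17:1 18:1 19:1
ham 16:1 20:1 21:1 22:1 23:1 24:1 25:1 26:1
ham 27:1 16:1 28:1 29:1 30:1 31:1
ham 32:1 33:1 22:1 23:1 24:1 34:1 26:1
ham 16:1 35:1 18:1 36:1 28:1 37:1 38:1 22:1 39:1
""".splitlines()

# Assumed numeric labels; any two distinct integers would do.
LABELS = {"spam": 1, "ham": -1}


def normalize(line):
    """Replace the word label with a number and sort features by index."""
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = int(value)
    body = " ".join(f"{i}:{features[i]}" for i in sorted(features))
    return f"{LABELS[label]} {body}"


normalized = [normalize(line) for line in TRAINING_LINES]
print(normalized[4])  # 1 1:1 2:1 4:1 7:1 10:1 12:1 13:1 14:1
```

Each email still marks exactly the same words as present; only the label spelling and the order of the pairs change.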

So, now we have completed the data formatting.  In the next step, we can take this data and feed it into the learning algorithm of our SVM.  This will then produce a model that can be used to classify future emails and demonstrate the awesomeness of Support Vector Machines.

