The whole purpose of using a Support Vector Machine is to predict whether new instances belong to a certain group: whether a document is sensitive, whether a stock is a “buy”, or whether it will rain. In our case, we are trying to determine whether emails are SPAM or not. To test this, we need to come up with more instances of emails.
For the first test, we will use a sample email:
James, can you pick up the dog?
To translate this to the proper input format for testing, we once again need to take each word and map it to its index in our previous vector of words:
james=16
can=???
you=17
pick=???
up=???
the=29
dog=26
Words marked ??? did not appear during the training phase, so we will ignore them for now. Converted to the proper input format, with the feature indices listed in ascending order as LIBSVM expects, the new email looks like:
0 16:1 17:1 26:1 29:1
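The same conversion can be sketched in a few lines of Python. This is only an illustration, not part of LIBSVM: the function name to_libsvm_line and the simple regular-expression tokenizer are assumptions, and VOCAB holds only the four word indices listed above rather than the full training vocabulary.

```python
import re

# Partial vocabulary: only the word-to-index pairs shown in this post.
# The real training vocabulary is larger.
VOCAB = {"james": 16, "you": 17, "dog": 26, "the": 29}

def to_libsvm_line(label, text, vocab):
    """Build one LIBSVM input line: '<label> <index>:1 ...'.

    Words missing from vocab are skipped, and indices are emitted
    in ascending order.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    indices = sorted({vocab[w] for w in words if w in vocab})
    features = " ".join(f"{i}:1" for i in indices)
    return f"{label} {features}".strip()

line = to_libsvm_line(0, "James, can you pick up the dog?", VOCAB)
print(line)  # 0 16:1 17:1 26:1 29:1
```

Using a set before sorting means a word that appears twice still produces a single index:1 entry, which matches the presence/absence encoding used here.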
Since we know that this email should not be classified as SPAM, we start the line with the label “0”, then add an index:1 pair for every word we already know about. We then save this text in a file named “sample.1.txt” and run the following command line to test it:
c:\Program Files\LibSVM\windows>svm-predict.exe sample.1.txt spam.train.model sample.1.predicted.txt
Accuracy = 100% (1/1) (classification)
The output on the command line tells us that the algorithm predicted the email was HAM and not SPAM: an accuracy of 100% means the predicted label matched the “0” label we supplied in the input file. If you open sample.1.predicted.txt, you will also see a single entry, “0”, indicating that the first line in the input file was assigned to class “0”, or HAM.
Now let’s add some new sample emails to test:
Cheap viagra by mail!
James, you need viagra.
The first email is obviously SPAM, but the second one is a little more interesting. If a stranger sent it, it might be SPAM; however, if my wife sent it, it may not be 🙂
So let’s add them to our input file along with the first one. It would look like:
0 16:1 17:1 26:1 29:1
1 2:1 3:1 8:1 9:1
0 2:1 16:1 17:1
And then, let’s run the algorithm to see what it thinks:
c:\Program Files\LibSVM\windows>svm-predict.exe sample.1.txt spam.train.model sample.1.predicted.txt
Accuracy = 100% (3/3) (classification)
As you can see, the algorithm did very well. It thought that the first email was HAM, the second was SPAM, and the third was HAM. The output file shows 0, 1, 0.
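The accuracy figure is simply the share of input lines whose predicted label matches the label at the start of the line. A minimal Python sketch of that comparison follows, using the three input lines and the 0, 1, 0 output from above. The helper names labels_from_input and accuracy are made up for this illustration and are not LIBSVM functions.

```python
def labels_from_input(lines):
    """Return the leading label of each non-empty LIBSVM-format line."""
    return [float(line.split()[0]) for line in lines if line.strip()]

def accuracy(input_lines, predicted_lines):
    """Count how many predicted labels match the labels in the input."""
    expected = labels_from_input(input_lines)
    predicted = [float(line) for line in predicted_lines if line.strip()]
    if len(expected) != len(predicted):
        raise ValueError("input and prediction line counts differ")
    correct = sum(e == p for e, p in zip(expected, predicted))
    return correct, len(expected)

# Contents of the input file and of the prediction file shown above.
input_lines = [
    "0 16:1 17:1 26:1 29:1",
    "1 2:1 3:1 8:1 9:1",
    "0 2:1 16:1 17:1",
]
predicted_lines = ["0", "1", "0"]

correct, total = accuracy(input_lines, predicted_lines)
print(f"Accuracy = {100 * correct / total:g}% ({correct}/{total})")
# Accuracy = 100% (3/3)
```

To check real files instead, the two lists could be replaced with the lines read from the input file and the prediction file.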
While this was a trivial, made-up example, I hope that I met the overall goal, which was to show how to use LIBSVM for classification problems. By building an input training set, generating a predictive model, and testing it against new inputs, we showed that Support Vector Machines can be powerful and easy-to-use tools in Machine Learning.