Monthly Archives: January 2013

LIBSVM TUTORIAL PART 4 – Testing the Model

Part 1
Part 2
Part 3
Part 4

The whole purpose of using a Support Vector Machine is to be able to predict whether new instances of an object belong to a certain group.  This could be detecting whether documents are sensitive, stocks are a “buy”, or whether it will rain.  In our case, we are trying to determine whether emails are SPAM or not.  So to test this, we need to come up with more instances of emails.

For the first test, we will use a sample email:

James, can you pick up the dog?

To translate this into the proper input format for testing, we once again need to take each word and map it to its index in our previous vector of words:

james=16

can=???

you=17

pick=???

up=???

the=29

dog=26

Any word marked ??? was not seen during the training phase, so we will ignore it for now.  Converted to the proper input format, with the known indices in ascending order, the new email looks like:

0 16:1 17:1 26:1 29:1

Since we know that this email should not be classified as SPAM, we start it off with a “0”, then include a value for any words we already know about.  We will then save this text in a file named “sample.1.txt” and run the following command line to test it:

c:\Program Files\LibSVM\windows>svm-predict.exe sample.1.txt spam.train.model sample.1.predicted.txt
Accuracy = 100% (1/1) (classification)

The output on the command line tells us that the prediction matched the label we supplied: an accuracy of 100% (1/1) means the algorithm predicted HAM, not SPAM, for the one email in the file.  Also, if you open sample.1.predicted.txt, you will see a single entry, “0”, indicating that the first line in the input file was assigned to class “0”, or HAM.
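The hand conversion from email text to an input line can also be scripted. The following is a minimal Python sketch, not part of LIBSVM: the function name `email_to_libsvm` is hypothetical, the tokenizing rule is an assumption, and `VOCAB` holds only the four word indices shown in this post rather than the full vocabulary.

```python
import re

# Partial vocabulary (assumption): only the word-to-index pairs shown in this
# post. The full word vector was built earlier in the tutorial.
VOCAB = {"james": 16, "you": 17, "dog": 26, "the": 29}


def email_to_libsvm(text, label, vocab=VOCAB):
    """Return one LIBSVM-format line ("label index:1 ...") for an email."""
    # Lowercase the text and split it into words, dropping punctuation.
    words = re.findall(r"[a-z']+", text.lower())
    # Keep only words seen in training; the set removes repeated words and
    # sorting puts the indices in ascending order.
    indices = sorted({vocab[w] for w in words if w in vocab})
    features = " ".join(f"{i}:1" for i in indices)
    return f"{label} {features}".strip()


print(email_to_libsvm("James, can you pick up the dog?", 0))
# prints: 0 16:1 17:1 26:1 29:1
```

Words that are not in the vocabulary ("can", "pick", "up") are simply skipped, which matches the ??? entries above.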

Now let’s add some new sample emails to test:

Cheap viagra by mail!

James, you need viagra.

The first email is obviously a SPAM type email, but the second one is a little more interesting.  If that was sent from a stranger, then it might be SPAM, however if my wife sent it, it may not 🙂

So let’s add them to our input file along with the first one.  It would look like:

0 16:1 17:1 26:1 29:1
1 2:1 3:1 8:1 9:1
0 2:1 16:1 17:1

And then, let’s run the algorithm to see what it thinks:

c:\Program Files\LibSVM\windows>svm-predict.exe sample.1.txt spam.train.model sample.1.predicted.txt
Accuracy = 100% (3/3) (classification)

As you can see, the algorithm agreed with all three of the labels we supplied.  It classified the first email as HAM, the second as SPAM, and the third as HAM.  The output file shows 0, 1, 0.
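The accuracy figure is just the share of predictions that match the label at the start of each test line. Here is a small Python sketch of that comparison; the `accuracy` helper is hypothetical (it is not a LIBSVM function), and the predictions are typed in from the output file described above rather than read from disk.

```python
def accuracy(test_lines, predicted_lines):
    """Count predictions that match the label at the start of each test line."""
    # The label is the first token on each non-empty line of the test file.
    expected = [float(line.split()[0]) for line in test_lines if line.strip()]
    # The prediction file holds one predicted label per line.
    predicted = [float(line) for line in predicted_lines if line.strip()]
    if len(expected) != len(predicted):
        raise ValueError("test file and prediction file differ in length")
    correct = sum(e == p for e, p in zip(expected, predicted))
    return correct, len(expected)


test_lines = [
    "0 16:1 17:1 26:1 29:1",
    "1 2:1 3:1 8:1 9:1",
    "0 2:1 16:1 17:1",
]
predicted_lines = ["0", "1", "0"]  # contents of the output file, as described

correct, total = accuracy(test_lines, predicted_lines)
print(f"Accuracy = {100 * correct / total:g}% ({correct}/{total})")
# prints: Accuracy = 100% (3/3)
```

Keep in mind that the score only measures agreement with the labels we typed into the test file, so a mislabeled line would show up as an error.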

While this was a trivial, made-up example, I hope that I met the overall goal, which was to show how to use LIBSVM for classification problems.  By building an input training set, generating a predictive model, and testing it against new inputs, we showed that Support Vector Machines can be powerful and easy-to-use tools in Machine Learning.

LIBSVM TUTORIAL PART 3 – Training the Model

Now that we have data to feed into the SVM as a training set, there are a couple more tweaks that need to be made.  First, the classes need numeric labels (1 = SPAM and 0 = HAM).  Second, the feature indices on each line need to be in ascending order.  So the final input training file should look like:

1 1:1 2:1 3:1
1 3:1 4:1 5:1 6:1 7:1
1 1:1 4:1 8:1 9:1
1 1:1 10:1 11:1 12:1
1 1:1 2:1 4:1 7:1 10:1 12:1 13:1 14:1
0 15:1 16:1 17:1 18:1 19:1
0 16:1 20:1 21:1 22:1 23:1 24:1 25:1 26:1
0 16:1 27:1 28:1 29:1 30:1 31:1
0 22:1 23:1 24:1 26:1 32:1 33:1 34:1
0 16:1 18:1 22:1 28:1 35:1 36:1 37:1 38:1 39:1

As you can see, each line represents one of our sample emails.  If the line starts with 1, it represents a SPAM email, and if it starts with 0, it represents a HAM email (not spam).  Each number:number pair after the label represents a word found in the email: the first number is the word’s index in our vector of words, and the second is its value (1 means the word is present).  For example, any line with “1:1” means that the word “buy” was found in that email.
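To make the format concrete, here is a short Python sketch that splits one training line into its label and its index:value pairs, and also checks that the indices are ascending. The `parse_libsvm_line` function is a hypothetical helper written for this illustration, not something LIBSVM provides.

```python
def parse_libsvm_line(line):
    """Split 'label index:value ...' into (label, {index: value})."""
    parts = line.split()
    label = int(parts[0])  # 1 = SPAM, 0 = HAM
    features = {}
    previous = 0
    for item in parts[1:]:
        index_text, value_text = item.split(":")
        index = int(index_text)
        # Indices on a line must be in ascending order.
        if index <= previous:
            raise ValueError(f"indices must be ascending: {line!r}")
        features[index] = float(value_text)
        previous = index
    return label, features


label, features = parse_libsvm_line("1 1:1 2:1 3:1")
print(label, sorted(features))
# prints: 1 [1, 2, 3]
print(1 in features)  # index 1 is the word "buy"
# prints: True
```

Running a check like this over every line before training is a cheap way to catch an out-of-order index or a stray character in a hand-built file.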

Now that we have the input file, save it as a file named “spam.train”.

To create the predictive model, run the following command line (on Windows):

C:\Program Files\LibSVM\windows>svm-train.exe spam.train

*
optimization finished, #iter = 5
nu = 1.000000
obj = -7.583904, rho = 0.229345
nSV = 10, nBSV = 10
Total nSV = 10

After running this command, there will be a “spam.train.model” file that will be used as input when classifying any new emails.  We will see that in the next part.