The whole purpose of using a Support Vector Machine is to be able to predict whether new instances of an object belong to a certain group. This could be detecting whether documents are sensitive, stocks are a “buy”, or whether it will rain. In our case, we are trying to determine whether emails are SPAM or not. So to test this, we need to come up with more instances of emails.
For the first test, we will use a sample email:
James, can you pick up the dog?
To translate this into the proper input format for testing, we once again need to map each word to its index in our previous vector of words:
james=16
can=???
you=17
pick=???
up=???
the=29
dog=26
Words marked ??? were not seen during the training phase, so we will ignore them for now. Converted to the proper input format, the new email looks like:
0 16:1 17:1 26:1 29:1
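As a rough illustration, the word-to-index mapping above can be sketched in Python. The `vocabulary` dictionary holds only the four indices shown in this example, and `to_libsvm_line` is a hypothetical helper name, not part of LIBSVM. Words missing from the vocabulary are skipped, and indices are written in ascending order, which LIBSVM's sparse format expects.

```python
import re

# Word indices taken from the example above; a real vocabulary would
# contain every word seen during training.
vocabulary = {"james": 16, "you": 17, "dog": 26, "the": 29}

def to_libsvm_line(text, label, vocab):
    """Convert an email into one line of LIBSVM's sparse input format."""
    words = re.findall(r"[a-z']+", text.lower())
    # Keep each known word once; words not seen in training are ignored.
    indices = sorted({vocab[w] for w in words if w in vocab})
    features = " ".join(f"{i}:1" for i in indices)
    return f"{label} {features}".strip()

line = to_libsvm_line("James, can you pick up the dog?", 0, vocabulary)
print(line)  # 0 16:1 17:1 26:1 29:1
```

The leading number is the class label for the line: 0 for HAM, 1 for SPAM.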
Since we know that this email should not be classified as SPAM, we start it off with a “0”, then include a value for any words we already know about. We will then save this text in a file named “sample.1.txt” and run the following command line to test it:
c:\Program Files\LibSVM\windows>svm-predict.exe sample.1.txt spam.train.model sample.1.predicted.txt
Accuracy = 100% (1/1) (classification)
The output on the command line tells us that the prediction matched the label we supplied: an accuracy of 100% (1/1) means the algorithm classified the email as HAM and not SPAM. Also, if you open sample.1.predicted.txt, you will see a single entry with “0”, indicating that it predicted the first line in the input file belonged to class “0”, or HAM.
Now let's add some new sample emails to test:
Cheap viagra by mail!
James, you need viagra.
The first email is obviously SPAM, but the second one is a little more interesting. If it was sent by a stranger, it might be SPAM; however, if my wife sent it, it may not be 🙂
So let's add them to our input file along with the first one. It would look like:
0 16:1 17:1 26:1 29:1
1 2:1 3:1 8:1 9:1
0 2:1 16:1 17:1
And then let's run the algorithm to see what it thinks:
c:\Program Files\LibSVM\windows>svm-predict.exe sample.1.txt spam.train.model sample.1.predicted.txt
Accuracy = 100% (3/3) (classification)
As you can see, the algorithm did very well. It thought that the first email was HAM, the second was SPAM, and the third was HAM. The output file shows 0, 1, 0.
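The accuracy figure reported by svm-predict is the fraction of predicted labels that match the labels in the first column of the input file. Here is a minimal Python sketch of that comparison, using the labels and predictions from this example; the function name `accuracy` is made up for illustration and is not a LIBSVM call.

```python
def accuracy(expected, predicted):
    """Return (correct, total, percent) for two equal-length label lists."""
    if len(expected) != len(predicted):
        raise ValueError("label lists must have the same length")
    correct = sum(1 for e, p in zip(expected, predicted) if e == p)
    total = len(expected)
    percent = 100.0 * correct / total if total else 0.0
    return correct, total, percent

# First column of the three-line input file, and the contents of
# sample.1.predicted.txt as described above.
expected = [0, 1, 0]
predicted = [0, 1, 0]

correct, total, percent = accuracy(expected, predicted)
print(f"Accuracy = {percent:g}% ({correct}/{total})")  # Accuracy = 100% (3/3)
```

Had we labeled the third email as 1 instead, the same predictions would score 2/3, so the reported accuracy only measures agreement with the labels we supplied.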
While this was a trivial, made-up example, I hope I met the overall goal, which was to show how to use LIBSVM for classification problems. By building an input training set, generating a predictive model, and testing it against new inputs, we showed that Support Vector Machines can be powerful and easy-to-use tools in Machine Learning.
Found your blog while searching for a good LibSVM tutorial. Very Easy to understand and fun to read. I almost laughed when I read the third email for testing. Hope you didn’t need it. Thank you for the nice tutorial!
Glad it was helpful… and hopefully I won’t get that email from my wife any time soon 🙂
Thank you, Mr. James. I have just started my research work in opinion mining, and this was very helpful; I got a clear understanding of creating training and testing data using libsvm. But for long texts and a larger number of documents, is there any way to create the training and test data automatically?
If you are interested in running some machine learning algorithms against larger data sets and are just getting started, I would recommend Weka. This tool has a UI to interact with and integrates with LibSVM and others.
Also, Weka can read in test files and get them into the correct format for testing.
http://www.cs.waikato.ac.nz/ml/weka/
Let me know if you would like to have me do a tutorial on it, and I can see what I can do.
Hello James
Thanks for the tutorial, it has been helpful. I will be using libsvm, but one set of my data is a measurement of two dependent parameters against time. So it's more like time series data. Is this type of data learnable by SVM? I actually want to detect outliers.
Thanks in advance
Hi Peace,
Time series data can be a little trickier, and I haven’t actually done any work that would relate exactly. Support Vector Machines are pretty good at not letting dependent parameters skew the results, so you may have luck.
Also, if what you're really looking for is outlier detection, I would suggest investigating a clustering algorithm. Check out k-Means. The idea is to arrange the data in clusters, and then find the data points that are furthest from the centers of their clusters.
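To make the clustering idea a bit more concrete, here is a small pure-Python sketch. It is only an illustration, not a recommendation for time series data specifically: the sample points are made up, the function names are hypothetical, and seeding the centers from the first k points is a simplification (real implementations usually pick starting centers more carefully).

```python
import math

def kmeans(points, k, iterations=20):
    """Very small k-means; the first k points seed the centers (a simplification)."""
    centers = [points[i] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest center.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centers, assignment

def rank_by_distance(points, centers, assignment):
    """Return (distance, point) pairs, furthest from their own center first."""
    scored = [(math.dist(p, centers[a]), p) for p, a in zip(points, assignment)]
    return sorted(scored, reverse=True)

# Made-up 2-D measurements: two tight groups plus one stray point.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.9), (0.8, 1.1),
          (8.1, 7.9), (7.9, 8.2), (4.5, 12.0)]
centers, assignment = kmeans(points, k=2)
ranked = rank_by_distance(points, centers, assignment)
print(ranked[0][1])  # the point furthest from its cluster center
```

With this data the stray point (4.5, 12.0) ranks first. How far is "too far" is a threshold you would have to choose for your own data.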
Hello James
Very Easy to understand and fun to read.
Thanks for the tutorial.
Can you write a tutorial for LIBLINEAR?
Thanks! very helpful!:)
I want to classify the words in a sentence into different part-of-speech categories. I want to write features like the words occurring before and after a particular word, prefixes, suffixes, etc. How should I write the train file? Moreover, this is a multi-class classification problem. Please help me.
You, Sir, are just AWESOME!!! Thanks a bunch for this tutorial! Found it just in time 🙂
Thank You very much!
A tip for MatLab Users: It has another interface with different syntax.
See: https://ece.uwaterloo.ca/~nnikvand/Coderep/libsvm-3.16/matlab/libsvm-3.pdf
Usage
=====
matlab> model = svmtrain(training_label_vector, training_instance_matrix [, 'libsvm_options']);
    -training_label_vector:
        An m by 1 vector of training labels (type must be double).
    -training_instance_matrix:
        An m by n matrix of m training instances with n features. It can be dense or sparse (type must be double).
    -libsvm_options:
        A string of training options in the same format as that of LIBSVM.
matlab> [predicted_label, accuracy, decision_values/prob_estimates] = svmpredict(testing_label_vector, testing_instance_matrix, model [, 'libsvm_options']);
    -testing_label_vector:
        An m by 1 vector of prediction labels. If labels of test data are unknown, simply use any random values. (type must be double)
    -testing_instance_matrix:
        An m by n matrix of m testing instances with n features. It can be dense or sparse. (type must be double)
    -model:
        The output of svmtrain.
    -libsvm_options:
        A string of testing options in the same format as that of LIBSVM.
hi…
Can u explain , how kernel boundaries are generated ??
also. Please explain what is there in sample.train.model file ???
Thanks in advance …. 😀
Very nice tutorial James.
I am using libsvm java API for document classification of resumes. I am able to run prediction on test data and get the accuracy. How do I get the label for that particular prediction?
How do I use libsvm for multiple classes. Please help.
Thank you very much James… 🙂
Hi James,
I am trying to implement image classification. My feature set will be histograms. Is there anyway I can do this using libsvm?
Because libsvm assumes inputs as vectors, but I will have a histogram as my featureset. How do I work it, any idea?
Thanks!
Dear all,
I have a question regarding LibSVM. Can we use libsvm testing without labels. I mean first train the system with labels (e.g 1 or -1) and in testing, do not label and use the data to see which data row show which class and then compared with the original one.
I want to do testing without labeling and then compare with original output to find difference.
Waiting for your kind reply.
Thanks in advance
I think you may be describing “unsupervised learning”. Check that out and see if it makes sense.
hello james,kindly tell that how libsvm can be used for regression
Thanks for this funny but illustrative tutorial! Had fun reading it 🙂
Funny and easiest way to learn libsvm 🙂
thanks for the tutorial. if some one need more deep knowledge in machine learning i recommended:
https://www.coursera.org/learn/machine-learning
amine.b
I have text data descriptions.Using those descriptions I want to produce training set for further descriptions to be predicted whether they are valid or not.I am trying to achieve it using weka libsvm,but not getting desired output as desired. I want output as : 1 1:10 2:3 3:11 for positive one and -1 1:7 2:5 5:6 where 1 indicate positive -1 indicate negative and in 1:7 indicates 1st word occurred 7 times.How can I achieve this using libsvm ? and in what format I should build my text file or csv file ?
Hi James! I have a question.
You put zero (0) or one (1) in the prediction file.
But I want to classify, and I don't know the label for the prediction file.
I understood that after training the model, I run the prediction and get a file with zero or one, and this is the response.
P.S. Sorry for my bad English.
Thanks James
I need to test a text data email. the email have 10 words . if only 1 word is spam and 9 words extant are not of train data… >> the email is spam or no spam
Hello James,
this is a great tutorials, can you share the code with me, please ,my email : odysius.anwar@gmail.com
It will help me so much,
Thank you
Hi James! Thanks for such a great explanation.Now I am very happy that I got to know how to work with lib SVM. But, I have one question in my mind which is not letting me to sleep.Could you please suggest me on how to achieve it.The scenario is below.
You kept zero 0 or one 1 in prediction file.
But I want classify, and I don’t know the label for the prediction file.How to come across this.Your help is greatly appreciated.