Overview
Machine learning is a complex topic. Many articles have been written about it online, but most of them are hard to follow. I would like this post to serve as a starting point for understanding the basics, learning how to use LibSVM, and applying it to machine learning use cases.
Some background on LibSVM: it is a free library that is available here. Essentially, it lets you take some historical data, train a support vector machine (SVM) on it to build a model, and then use that model to predict the outcome for new instances of your data.
The Data
For this tutorial, I’m going to use the standard use case of SPAM detection. If we look at past emails that have been marked as SPAM or Not SPAM, can we accurately predict whether a new email is SPAM? While the data used in this tutorial is obviously contrived, it demonstrates how the same logic could be applied to non-trivial cases.
Here we go…
Here are the sample emails we will use for our training set. The first set of emails will be our SPAM set and the second will be valid, Not SPAM emails.
SPAM
Email1
“Buy Viagra cheap”
Email2
“Cheap drugs, with no prescriptions”
Email3
“Buy drugs by mail”
Email4
“Viagra, Cialis, ED, others”
Email5
“Buy prescriptions drugs like viagra, cialis, and others.”
Not SPAM
Email6
“Hi James you are great”
Email7
“James, here is a picture of my dog”
Email8
“Adding James to the email list”
Email9
“Send me a picture of your dog”
Email10
“James we are going to give you a raise”
There you have it. The initial data set is 10 emails: 5 SPAM and 5 Not SPAM. In the next step, we will pre-process these emails into a format that LibSVM understands, so that we can train our model.
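To make the data concrete, here is a minimal Python sketch of how the ten emails could be held in code and turned into LibSVM's sparse text format, where each line is a label followed by ascending index:value pairs. This is only one possible approach and not necessarily the pre-processing used later in this tutorial. The names EMAILS, tokenize, build_vocabulary, and to_libsvm_line are made up for illustration, and the +1/-1 labels and the simple word-count features are assumptions.

```python
import re

# The ten training emails from above.
# Assumed label convention: +1 = SPAM, -1 = Not SPAM.
EMAILS = [
    (+1, "Buy Viagra cheap"),
    (+1, "Cheap drugs, with no prescriptions"),
    (+1, "Buy drugs by mail"),
    (+1, "Viagra, Cialis, ED, others"),
    (+1, "Buy prescriptions drugs like viagra, cialis, and others."),
    (-1, "Hi James you are great"),
    (-1, "James, here is a picture of my dog"),
    (-1, "Adding James to the email list"),
    (-1, "Send me a picture of your dog"),
    (-1, "James we are going to give you a raise"),
]


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(emails):
    """Map every distinct word to a feature index, starting at 1."""
    words = sorted({w for _, text in emails for w in tokenize(text)})
    return {word: i for i, word in enumerate(words, start=1)}


def to_libsvm_line(label, text, vocab):
    """Format one email as '<label> <index>:<count> ...' with ascending indices."""
    counts = {}
    for word in tokenize(text):
        if word in vocab:  # words outside the vocabulary are ignored
            idx = vocab[word]
            counts[idx] = counts.get(idx, 0) + 1
    features = " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))
    return f"{label:+d} {features}"


VOCAB = build_vocabulary(EMAILS)
LINES = [to_libsvm_line(label, text, VOCAB) for label, text in EMAILS]

if __name__ == "__main__":
    print("\n".join(LINES))
```

Running the script prints one line per email. Only the words that actually appear in an email are listed on its line, which is what keeps the format compact when the vocabulary grows.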