LIBSVM TUTORIAL PART 2 – Formatting the Data

In part one of this tutorial, I created 10 fake emails with 5 being Spam and 5 being not Spam.  The goal is to take these 10 emails, have the Support Vector Machine (SVM) learn from them, and be able to identify new emails as Spam or Not Spam.  The next step in this process is to get the data into a format that LibSVM can understand and learn from.

To format the data, we need to understand what LibSVM is actually going to look at and try to learn from.  In machine learning lingo, this is referred to as the “Feature Set”.  In the case of document classification (or our simple Spam Detection use case) we are going to use the words contained in each email as the feature set.  If a certain word like “Viagra” is found in a lot of Spam emails, but not found in legitimate emails, then the algorithm should learn that this indicates that an email is likely Spam.

Each feature (word) that the SVM learns from needs to have a value.  In our case it will be a simple binary value.  If the word is contained in the email, the value will be true (1), and if the word is not found in the email, it will be false (0).

To represent each email, we will create a vector with the true/false values for every word in our universe (all the words in the 10 emails), but first, we need to identify every word that could possibly be in the email.  We will combine all the words from our data, and create a long list…

buy, viagra, cheap, drugs, with, no, prescription, by, mail, cialis, ed, others, like, and, hi, james, you, are, great, here, is, a, picture, of, my, dog, adding, to, the, email, list, send, me, your, we, going, give, you, raise
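If you would rather not build that list by hand, here is a minimal Python sketch of the idea. The names `tokenize` and `build_vocabulary` are my own, and I only feed it two of the emails from part one, so treat it as an illustration rather than the exact process used for the list above. It also does no stemming, so "prescriptions" stays plural here, while the list above uses "prescription".

```python
import re

def tokenize(text):
    """Lowercase the text and return its words in order of appearance."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocabulary(emails):
    """Return the distinct words across all emails, in first-seen order."""
    vocabulary = []
    seen = set()
    for text in emails:
        for word in tokenize(text):
            if word not in seen:  # keep each word only once
                seen.add(word)
                vocabulary.append(word)
    return vocabulary

# Two sample emails only; the full set of ten is in part one.
emails = ["Buy Viagra cheap", "Cheap drugs, with no prescriptions"]
vocabulary = build_vocabulary(emails)
print(vocabulary)
```

Tracking the words already seen keeps every word in the list exactly once, which matters later when each word is mapped to a single index.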

So above, you can see that we created a list of all the words in our emails.  The next step is to create a vector for each email, showing which words were in the email. So for example, we would take the first email:

“Buy Viagra cheap”

And we would format it as so:

buy=1, viagra=1, cheap=1, drugs=0, with=0, no=0, prescription=0, by=0, mail=0, cialis=0, ed=0, others=0, like=0, and=0, hi=0, james=0, you=0, are=0, great=0, here=0, is=0, a=0, picture=0, of=0, my=0, dog=0, adding=0, to=0, the=0, email=0, list=0, send=0, me=0, your=0, we=0, going=0, give=0, you=0, raise=0
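The same full (dense) vector can be produced in a few lines of Python. This is only a sketch: `to_dense_vector` is a name I made up, and I use just the first seven words of the list to keep the output short.

```python
def to_dense_vector(text, vocabulary):
    """Return one 0/1 flag per vocabulary word: 1 if the word is in the email."""
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

# First seven words of the list above, shortened for illustration.
vocabulary = ["buy", "viagra", "cheap", "drugs", "with", "no", "prescription"]
vector = to_dense_vector("Buy Viagra cheap", vocabulary)

# Pair each word with its flag, e.g. buy=1, viagra=1, cheap=1, drugs=0, ...
print(", ".join(f"{word}={flag}" for word, flag in zip(vocabulary, vector)))
```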

Now, you might say that this was pretty tedious.  There are a lot of “=0”, or missing, features.  The good news is that we can use the idea of sparse vectors, or a sparse matrix, and only worry about the features (words) that are present.  So the above email could be simplified down to just:

buy=1, viagra=1, cheap=1

By not including the other words in our list, it is assumed that they were not in the email.
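In code, a sparse vector is often just a dictionary that holds the features that are present and treats everything else as 0. Here is a small sketch under that assumption; `to_sparse_features` and `lookup` are hypothetical helper names.

```python
def to_sparse_features(text, vocabulary):
    """Return {word: 1} for vocabulary words found in the email; absent words are omitted."""
    words = set(text.lower().split())
    return {word: 1 for word in vocabulary if word in words}

def lookup(features, word):
    """Read a feature value, treating any missing word as 0."""
    return features.get(word, 0)

vocabulary = ["buy", "viagra", "cheap", "drugs", "with", "no", "prescription"]
sparse = to_sparse_features("Buy Viagra cheap", vocabulary)
print(sparse)
print(lookup(sparse, "drugs"))
```

With a vocabulary of thousands of words, storing three entries instead of thousands of zeros is where the savings come from.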

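The next step in simplifying this data is to use an index for each feature instead of the whole word.  To do this, we take our list of words and use an integer to represent each word: buy=1, viagra=2, cheap=3, drugs=4, with=5, etc.  (Note that “you” is listed twice, as 17 and 38.  A real vocabulary should hold each word only once, so that a word always maps to the same index.)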

1 = buy
2 = viagra
3 = cheap
4 = drugs
5 = with
6 = no
7 = prescription
8 = by
9 = mail
10 = cialis
11 = ed
12 = others
13 = like
14 = and
15 = hi
16 = james
17 = you
18 = are
19 = great
20 = here
21 = is
22 = a
23 = picture
24 = of
25 = my
26 = dog
27 = adding
28 = to
29 = the
30 = email
31 = list
32 = send
33 = me
34 = your
35 = we
36 = going
37 = give
38 = you
39 = raise

So the above email representation would be:

1=1 2=1 3=1

Here 1, 2 and 3 are the indexes of the words in the email, and “=1” means that the word was found.
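Here is a rough Python sketch of the word-to-index step. The helper name `to_index_features` is my own, and the vocabulary is cut down to the first seven words of the list above. The numbering starts at 1 to match that list.

```python
vocabulary = ["buy", "viagra", "cheap", "drugs", "with", "no", "prescription"]

# enumerate(..., start=1) gives 1-based indexes: buy=1, viagra=2, cheap=3, ...
word_to_index = {word: index for index, word in enumerate(vocabulary, start=1)}

def to_index_features(text, word_to_index):
    """Return the sorted indexes of the known words found in the email."""
    words = set(text.lower().split())
    return sorted(word_to_index[word] for word in words if word in word_to_index)

indexes = to_index_features("Buy Viagra cheap", word_to_index)
print(" ".join(f"{index}=1" for index in indexes))
```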

Finally, to train the SVM, we need to tell the algorithm which “class” each instance belongs to.  The classes in our case are “Spam” and “Not Spam”.  Since the format requires a single word for each class, from here on we’ll refer to “Not Spam” as “Ham”.  The format also requires us to use “:” (colon) instead of “=”.  The properly formatted email then looks like:

spam 1:1 2:1 3:1

And to build the entire training set data in the proper format, we would do this for each email on a new line in our input file.  For example, the second email in our list is spam, and has the following text:

“Cheap drugs, with no prescriptions”

This would translate to a new line containing:

spam 3:1 4:1 5:1 6:1 7:1
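Building such a line in code could look something like the sketch below. `format_line` is a hypothetical helper, and it takes the word indexes as already worked out. It sorts them because the LibSVM data format expects indexes in ascending order. One caveat: the stock LibSVM command-line programs read the label as a number, so if your tool rejects text labels you would likely swap in something like 1 and -1 for spam and ham.

```python
def format_line(label, indexes):
    """Build one training line: the label, then index:1 pairs in ascending order."""
    pairs = " ".join(f"{index}:1" for index in sorted(set(indexes)))
    return f"{label} {pairs}"

# The first two emails, using the indexes from the list above.
print(format_line("spam", [1, 2, 3]))
print(format_line("spam", [3, 4, 5, 6, 7]))
```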

Finally, we would combine all of this into one file that contains one line for each email.  LibSVM expects the feature indexes on each line to be in ascending order, so they are sorted here:

spam 1:1 2:1 3:1
spam 3:1 4:1 5:1 6:1 7:1
spam 1:1 4:1 8:1 9:1
spam 1:1 10:1 11:1 12:1
spam 1:1 2:1 4:1 7:1 10:1 12:1 13:1 14:1
ham 15:1 16:1 17:1 18:1 19:1
ham 16:1 20:1 21:1 22:1 23:1 24:1 25:1 26:1
ham 16:1 27:1 28:1 29:1 30:1 31:1
ham 22:1 23:1 24:1 26:1 32:1 33:1 34:1
ham 16:1 18:1 22:1 28:1 35:1 36:1 37:1 38:1 39:1
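To write a file like this from code and sanity-check it, something along these lines should work. It is only a sketch: `write_training_file`, `check_line` and the file name `emails.train` are all made up for illustration, and only three of the ten emails are included. The check looks for two things the LibSVM data format expects: indexes that start at 1 or higher, and indexes in ascending order on each line.

```python
import os
import tempfile

def write_training_file(rows, path):
    """Write one 'label index:1 ...' line per email, indexes ascending, no blank lines."""
    with open(path, "w", encoding="utf-8") as handle:
        for label, indexes in rows:
            pairs = " ".join(f"{index}:1" for index in sorted(set(indexes)))
            handle.write(f"{label} {pairs}\n")

def check_line(line):
    """Return True if every index is at least 1 and the indexes strictly ascend."""
    parts = line.split()
    indexes = [int(pair.split(":")[0]) for pair in parts[1:]]
    positive = all(index >= 1 for index in indexes)
    ascending = all(a < b for a, b in zip(indexes, indexes[1:]))
    return positive and ascending

rows = [
    ("spam", [1, 2, 3]),
    ("spam", [3, 4, 5, 6, 7]),
    ("ham", [15, 16, 17, 18, 19]),
]
path = os.path.join(tempfile.mkdtemp(), "emails.train")  # hypothetical file name
write_training_file(rows, path)

with open(path, encoding="utf-8") as handle:
    lines = handle.read().splitlines()
print(lines)
print(all(check_line(line) for line in lines))
```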

So, now we have completed the data formatting.  In the next step, we can take this data and feed it into the learning algorithm of our SVM.  This will produce a model that can be used to classify future emails and demonstrate the awesomeness of Support Vector Machines.


18 thoughts on “LIBSVM TUTORIAL PART 2 – Formatting the Data”

  1. Hardik says:

    Thanks for such a useful blog on LIBSVM.
    Can you please show one example for multi-class classification for “ONE-AGAINST-ONE”,”ONE-AGAINST-ALL”


  2. Siva Bhaskaran says:

    It is a really nice blog which explains the problem really well

    0=buy is not valid. The words have to be represented by a non-zero integer. I made this mistake while creating the .libsvm file and I got fried for days.

  3. piyush says:

    Hi this is such a great website bro
    I am working on svm to detect malacious executables
    so my feature set gonna b in lakhs
    can you help me understand how to make it???
    cuz manually doing what you did above looks impossible to me
