LIBSVM TUTORIAL PART 2 – Formatting the Data


In part one of this tutorial, I created 10 fake emails with 5 being Spam and 5 being not Spam.  The goal is to take these 10 emails, have the Support Vector Machine (SVM) learn from them, and be able to identify new emails as Spam or Not Spam.  The next step in this process is to get the data into a format that LibSVM can understand and learn from.

To format the data, we need to understand what LibSVM is actually going to look at and try to learn from.  In machine learning lingo, this is referred to as the “Feature Set”.  In the case of document classification (or our simple Spam Detection use case) we are going to use the words contained in each email as the feature set.  If a certain word like “Viagra” is found in a lot of Spam emails, but not found in legitimate emails, then the algorithm should learn that this indicates that an email is likely Spam.

Each feature (word) that the SVM learns from needs to have a value.  In our case it will be a simple binary value.  If the word is contained in the email, it will be true (1), and if the word is not found in the email, it will be false (0).

To represent each email, we will create a vector with the true/false values for every word in our universe (all the words in the 10 emails), but first, we need to identify every word that could possibly be in the email.  We will combine all the words from our data, and create a long list…

buy, viagra, cheap, drugs, with, no, prescription, by, mail, cialis, ed, others, like, and, hi, james, you, are, great, here, is, a, picture, of, my, dog, adding, to, the, email, list, send, me, your, we, going, give, you, raise
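A list like this does not have to be built by hand. Here is a minimal sketch of one way to do it in Ruby; the `tokenize` and `build_vocabulary` helpers are names I made up for illustration, and only the first three emails are used to keep it short. Note that `uniq` keeps each word exactly once, and that the words are kept exactly as they appear in the emails (so "prescriptions" stays plural).

```ruby
# Hypothetical helper: split an email into lowercase words, dropping punctuation.
def tokenize(text)
  text.downcase.scan(/[a-z]+/)
end

# Hypothetical helper: ordered list of the distinct words across all emails.
def build_vocabulary(emails)
  emails.flat_map { |email| tokenize(email) }.uniq
end

emails = [
  "Buy Viagra cheap",
  "Cheap drugs, with no prescriptions",
  "Buy drugs by mail"
]

vocabulary = build_vocabulary(emails)
puts vocabulary.join(", ")
# prints: buy, viagra, cheap, drugs, with, no, prescriptions, by, mail
```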

So above, you can see that we created a list of all the words in our emails.  The next step is to create a vector for each email, showing which words were in the email. So for example, we would take the first email:

“Buy Viagra cheap”

And we would format it as so:

buy=1, viagra=1, cheap=1, drugs=0, with=0, no=0, prescription=0, by=0, mail=0, cialis=0, ed=0, others=0, like=0, and=0, hi=0, james=0, you=0, are=0, great=0, here=0, is=0, a=0, picture=0, of=0, my=0, dog=0, adding=0, to=0, the=0, email=0, list=0, send=0, me=0, your=0, we=0, going=0, give=0, you=0, raise=0

Now, you might say that this was pretty tedious.  There are a lot of “=0”, or missing, features.  The good news is that we can use the idea of sparse vectors, or a sparse matrix, and only worry about the features (words) that are present.  So the above email could be simplified down to just:

buy=1, viagra=1, cheap=1

By not including the other words in our list, it is assumed that they were not in the email.
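To make the sparse idea concrete, here is a small Ruby sketch. It uses a shortened word list, and the `value_of` helper is a hypothetical name for illustration: any word missing from the sparse vector is read back as 0.

```ruby
# Full (dense) vector for "Buy Viagra cheap" over a shortened word list.
dense = { "buy" => 1, "viagra" => 1, "cheap" => 1, "drugs" => 0, "with" => 0 }

# Sparse form: keep only the words that are present (value 1).
sparse = dense.select { |_word, value| value == 1 }

# Hypothetical helper: a word that is missing from a sparse vector counts as 0.
def value_of(sparse_vector, word)
  sparse_vector.fetch(word, 0)
end

puts sparse.map { |word, value| "#{word}=#{value}" }.join(", ")
# prints: buy=1, viagra=1, cheap=1
puts value_of(sparse, "drugs")
# prints: 0
```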

The next step in simplifying this data is to use indexes for the features, instead of the whole words.  To do this, we take our list of words and use an integer to represent each word:  buy=1, viagra=2, cheap=3, drugs=4, with=5, etc.

1 = buy
2 = viagra
3 = cheap
4 = drugs
5 = with
6 = no
7 = prescription
8 = by
9 = mail
10 = cialis
11 = ed
12 = others
13 = like
14 = and
15 = hi
16 = james
17 = you
18 = are
19 = great
20 = here
21 = is
22 = a
23 = picture
24 = of
25 = my
26 = dog
27 = adding
28 = to
29 = the
30 = email
31 = list
32 = send
33 = me
34 = your
35 = we
36 = going
37 = give
38 = you
39 = raise

So the above email representation would be:

1=1 2=1 3=1

Where 1, 2, 3 are the indexes of the words in the email, and “=1” means that the word was found.

Finally, to train the SVM, we need to tell the algorithm which “class” each instance belongs to.  The different classes in our case are “Spam” and “Not Spam”.  Since the format requires a single word for each class, from here on we’ll refer to “Not Spam” as “Ham”.  The format also requires us to use “:” (colon) instead of “=”.  The properly formatted email then looks like:

spam 1:1 2:1 3:1

And to build the entire training set data in the proper format, we would do this for each email on a new line in our input file.  For example, the second email in our list is spam, and has the following text:

“Cheap drugs, with no prescriptions”

This would translate to a new line containing:

spam 3:1 4:1 5:1 6:1 7:1
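The translation from text to a training line can be sketched in a few lines of Ruby. This is only an illustration under some assumptions: `WORDS` is the first seven entries of the word list from this post, and `format_line` is a hypothetical helper name. It sorts the indexes because the LibSVM README asks for them in ascending order.

```ruby
# First seven entries of the word list from this post; index = position + 1.
WORDS = %w[buy viagra cheap drugs with no prescription]
INDEX = WORDS.each_with_index.map { |word, i| [word, i + 1] }.to_h

# Hypothetical helper: turn a label and an email body into one training line.
def format_line(label, text)
  words = text.downcase.scan(/[a-z]+/)
  # Skip unknown words, count repeated words once, and sort ascending.
  indexes = words.map { |word| INDEX[word] }.compact.uniq.sort
  ([label] + indexes.map { |i| "#{i}:1" }).join(" ")
end

puts format_line("spam", "Buy Viagra cheap")
# prints: spam 1:1 2:1 3:1
puts format_line("spam", "Cheap drugs, with no prescription")
# prints: spam 3:1 4:1 5:1 6:1 7:1
```

The lookup is an exact match, so with this word list the plural "prescriptions" would be skipped unless the words are normalized first.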

Finally, we would combine this all into one file that contains a new line for each email:

spam 1:1 2:1 3:1
spam 3:1 4:1 5:1 6:1 7:1
spam 1:1 4:1 8:1 9:1
spam 2:1 10:1 11:1 12:1
spam 1:1 2:1 4:1 7:1 10:1 12:1 13:1 14:1
ham 15:1 16:1 17:1 18:1 19:1
ham 16:1 20:1 21:1 22:1 23:1 24:1 25:1 26:1
ham 16:1 27:1 28:1 29:1 30:1 31:1
ham 22:1 23:1 24:1 26:1 32:1 33:1 34:1
ham 16:1 17:1 18:1 22:1 28:1 35:1 36:1 37:1 39:1

So, now we have completed the data formatting.  In the next step, we can take this data and feed it into the learning algorithm of our SVM.  This will then produce a model that can be used to classify future emails and demonstrate the awesomeness of Support Vector Machines.



LibSVM Tutorial Part 1 – Overview


Machine learning is a pretty complex topic that many articles online have been written about, but most of them are pretty hard to understand.  I would like to create an artifact on the web that might serve as a starting point to understanding the basics and figuring out how to use LibSVM and apply it to machine learning use cases.

Just some background about LibSVM… it is a “free” library that is available here.  Essentially, this library allows you to take some historical data, train your SVM to build a model, and then use this model to predict the outcome of new instances of your data.

The Data

For this tutorial, I’m going to be using the pretty standard use case of SPAM detection.  If we are able to look at past emails that have been marked as SPAM/Not SPAM, can we accurately predict whether a new email is SPAM or not?  While the data being used in this tutorial is obviously contrived, it will demonstrate how the same logic could be used for non-trivial cases.

Here we go…

Here are the sample emails we will use for our training set.  The first set of emails will be our SPAM set and the second will be valid, Not SPAM emails.



“Buy Viagra cheap”


“Cheap drugs, with no prescriptions”


“Buy drugs by mail”


“Viagra, Cialis, ED, others”


“Buy prescriptions drugs like viagra, cialis, and others.”


“Hi James you are great”


“James, here is a picture of my dog”


“Adding James to the email list”


“Send me a picture of your dog”


“James  we are going to give you a raise”


There you have it.  The initial data is 10 emails.  In the next step, we will pre-process these emails to a format that LibSVM understands, so that we can train our model.


Getting Weather Using Javascript and JQuery

I had a simple task to get a 7 day forecast of the high and low temperatures. To start off with, I looked at a number of sources to get this information from. Here is a good discussion about weather APIs, mostly geared toward iPhone.

I decided to go with the NOAA Rest service, and get the data from

Also, I ended up publishing it on Github if you want the full source.

The first step I did was use jQuery to hit the NOAA URI with the appropriate parameters:

$.get('' + zip +'&product=time-series&maxt=maxt&mint=mint&Submit=Submit'

This would return a huge XML structure which then needs to be parsed.

// Parse the XML response to get out the values we want
// Iterate over the day values and assign each one to the right array
array[count] = $(this).text();
count = count + 1;

I guess the overall takeaway is that the NOAA provides a free interface to getting facts about the weather, though it seems like that interface may have been designed a while back, since it is a little cumbersome to use.

If you want more data than just the highs and lows for temperature, check out this website which shows you all the different options you can query for.


Free The Patents

I recently invested in a company called Vringo, whose sole reason to exist right now is to sue Google. The only reason I invested in it was for the money (hopefully), since it was the first patent troll that I’ve heard of that is a publicly traded company.

This has led to some weird thoughts rolling around in my head, relating to how I think the patent system is flawed. Should this company really be able to sue Google, just because they came up with an idea 10 years ago relating to showing relevant ads for users? Obviously, it wasn’t just the single idea that made Google all the money, but instead it was the fact that they had the best search engine. Their single idea that the patent is based on certainly didn’t keep Lycos in business.

There are some super smart people out there who can dream up ideas all day long. What if they patented all those ideas? Would they be a billionaire 10 years from now? I’m not sure.

The one thing I can think of to combat this weird use of patents is to allow people to throw out ideas into a “public” space. Once the idea hits the public, no one else can patent it, right?

If we built an open database where people could just submit random ideas that they want someone to build “royalty free”, would there be any incentive to do so? Would people really spend 15 minutes to write out an intelligent description of an algorithm or process, that could be used to kill a patent lawsuit?

I’m going to think about this for a while. Maybe there could be some alternative incentives to get people to share…

Saving Omniauth Provider Data in the Database

I spent some time searching for a way to not need the App ID and Secret Key for OmniAuth in an initializer. Luckily I ran across a post mentioning this link:

By following these instructions, you can wait and set the App ID and Secret key at request time. In my case, I’m going to pull it out of a database table which stores config items.

My Current Code looks like this:

unless Rails.env.nil?
  CONFIG = YAML.load_file(Rails.root.join("config/secrets.yml"))[Rails.env]

  Rails.application.config.middleware.use OmniAuth::Builder do
    provider :facebook, CONFIG['fb_app_id'], CONFIG['fb_secret_key'], :scope => 'email,user_about_me,user_activities,user_birthday,user_groups', :display => 'popup'
  end
end

And the future code will look like this:

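SETUP_PROC = lambda do |env|
  config = Config.find(:first)
  # The Facebook strategy is OAuth2-based, so its options are named
  # client_id/client_secret (consumer_key/consumer_secret are the OAuth 1 names)
  env['omniauth.strategy'].options[:client_id] = config.facebook_key
  env['omniauth.strategy'].options[:client_secret] = config.facebook_secret
end

Rails.application.config.middleware.use OmniAuth::Builder do
  provider :facebook, :setup => SETUP_PROC
end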
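The reason a lambda helps here is that its body runs when it is called, not when it is defined. Here is a stripped-down sketch of that idea in plain Ruby, with no Rails or OmniAuth involved; `SETTINGS` and its values are hypothetical stand-ins for the config table.

```ruby
# Stand-in for the database table of config items (hypothetical values).
SETTINGS = { facebook_key: "initial-key" }

# Read once, when this line runs (like an initializer at boot).
eager_key = SETTINGS[:facebook_key]

# Read later, each time the lambda is called (like a setup proc per request).
lazy_key = lambda { SETTINGS[:facebook_key] }

# Simulate someone changing the stored key after boot.
SETTINGS[:facebook_key] = "rotated-key"

puts eager_key       # prints: initial-key
puts lazy_key.call   # prints: rotated-key
```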



The next step in the sequence for loading Rails objects in javascript is to dynamically get them from the script. To do this, I’m going to use the JQuery “Get” method. I’ll be hitting the same URL that I set up in the last post.

Once I get the object, I’ll just add a simple alert method to the script, so I can be sure that I loaded the data correctly (or you could use a javascript debugger…).

    $.get("/restaurants/recommendations", function(restaurant) {
      alert("successfully loaded: " + JSON.stringify(restaurant));
    });

Once I’ve confirmed that I have the correct object, I want to get the latitude and longitude values from the object and create a marker on the map that I have already set up.

  $.get("/restaurants/recommendations", function(restaurant) {  
    var myMarkerLatLng = new CM.LatLng(restaurant.latitude, restaurant.longitude);
    var myMarker = new CM.Marker(myMarkerLatLng, {

When I put it all together the entire javascript function looks like this:

  var cloudmade = new CM.Tiles.CloudMade.Web({key: 'my_key'});
  var map = new CM.Map('map_canvas', cloudmade);

  map.setCenter(new CM.LatLng(35.998743, -78.90723), 13);
  map.addControl(new CM.LargeMapControl());

  $.get("/restaurants/recommendations", function(restaurant) {  
    var myMarkerLatLng = new CM.LatLng(restaurant.latitude, restaurant.longitude);
    var myMarker = new CM.Marker(myMarkerLatLng, {

And I get to see my “Watts Grocery” marker on the map.


How to get domain objects in javascript with rails (Part 1)

In continuing with my side project relating to maps and restaurants, I wanted to get things a little more interactive.  The next step is to take all the restaurants I have in the database and put them on the map.  To do this though, I needed to get the Restaurant Model objects into javascript so I could dynamically add them as markers on the map.

The first step was to define a new controller action in my “app/controllers/restaurants_controller.rb” file.  For now, this will just find the first item in the DB and render it as a JSON object.  In the long run, this will return a list of recommended restaurants:

  def recommendations
    @restaurant = Restaurant.find(:first)
    render :json => @restaurant
  end

Next, I needed to make that controller accessible via a URL.  My initial idea is to open up the URL like “/restaurants/recommendations” to return the list.  I modified the “config/routes.rb” file, and restarted the server:

  match "/restaurants/recommendations" => "restaurants#recommendations"

If I point the browser to this URL, I see the JSON I was expecting to see.  In this case, it is returning the restaurant “Watts Grocery”:

"name":"Watts Grocery",

Next up, I’ll show how to make a call in javascript to get this JSON string and parse it into an object so it can be displayed on the map.



How to Parse XML with Ruby

With one of my side projects, I’ve been investigating how to integrate maps with a Rails application. One of the tasks for a proof of concept was to mock up some reviews for restaurants, and I came across the Overpass API, which exposes a nice interface for getting lists of map items.

To get an XML document with the list of restaurants in your area, you simply submit an HTTP GET request to the properly formed URL… like so:[bbox=7.1,51.2,7.2,51.3][amenity=restaurant]

The bbox attribute specifies the “bounding box” of the area you want to search, given as longitude and latitude coordinates (in this example: west, south, east, north).
The output from the query will give you XML that looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6" generator="Overpass API">
<meta osm_base="2012-03-11T22:12:02Z"/>
  <node id="266302295" lat="36.0164193" lon="-78.9189592">
    <tag k="amenity" v="restaurant"/>
    <tag k="created_by" v="Potlatch 0.9a"/>
    <tag k="name" v="Watts Grocery"/>
  </node>
  <node id="266814066" lat="36.0139023" lon="-78.9215805">
    <tag k="amenity" v="restaurant"/>
    <tag k="created_by" v="Potlatch 0.9a"/>
    <tag k="name" v="Magnolia Grill"/>
  </node>
  <node id="266814217" lat="36.0106114" lon="-78.9222843">
    <tag k="amenity" v="restaurant"/>
    <tag k="created_by" v="Potlatch 0.9a"/>
    <tag k="name" v="Vin Rouge"/>
  </node>
</osm>

I took this XML and saved it to a file named restaurants.xml

After you have the XML, you still need to parse it. To do that, I used REXML, which ships with Ruby. For this example, I created a “Restaurant” object for each of the XML nodes, and parsed it using XPath expressions to get the values I cared about. Then I saved off the values and made sure each object was saved to my DB.

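require 'rexml/document'

# Read the saved Overpass response and parse it into a REXML document
xml = File.read('./lib/restaurants.xml')
doc = REXML::Document.new(xml)

# Create and save one Restaurant per <node> element
doc.elements.each('osm/node') {|x|
   r = Restaurant.new
   r.name = x.elements["tag[@k='name']"].attributes["v"]
   r.openmap_id = x.attributes["id"].to_i
   r.latitude = x.attributes["lat"]
   r.longitude = x.attributes["lon"]
   r.save
}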
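The same XPath lookups can be tried without a Rails app or a database. Here is a self-contained sketch that parses an inline XML string in the same shape as the Overpass output and collects plain hashes instead of model objects; the hash keys are just illustrative names.

```ruby
require 'rexml/document'

# Two nodes in the same shape as the Overpass output.
xml = <<~XML
  <osm version="0.6">
    <node id="266302295" lat="36.0164193" lon="-78.9189592">
      <tag k="amenity" v="restaurant"/>
      <tag k="name" v="Watts Grocery"/>
    </node>
    <node id="266814217" lat="36.0106114" lon="-78.9222843">
      <tag k="amenity" v="restaurant"/>
      <tag k="name" v="Vin Rouge"/>
    </node>
  </osm>
XML

doc = REXML::Document.new(xml)

# Collect one plain hash per <node> instead of a database-backed object.
restaurants = []
doc.elements.each('osm/node') do |node|
  restaurants << {
    name: node.elements["tag[@k='name']"].attributes["v"],
    openmap_id: node.attributes["id"].to_i,
    latitude: node.attributes["lat"].to_f,
    longitude: node.attributes["lon"].to_f
  }
end

restaurants.each { |r| puts "#{r[:name]} (#{r[:latitude]}, #{r[:longitude]})" }
```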


Playing with OpenStreetMap and Cloudmade

I’ve been doing some research into using maps in a web application, and I wanted to check out the different providers that are available.  Since hearing that Apple was moving away from the Google Maps API in iPhoto, I was curious to give it a try myself.

Since OpenStreetMap is a “free” alternative for mapping data, I looked around, and CloudMade was the first provider with an API that I could see an easy way to use.  It works very similarly to Google Maps, and has a similar javascript API.

Here’s how I added it to a simple Rails 3 app…

First, I created a new controller “app/controllers/maps_controller.rb” and put an index function in:

class MapsController < ApplicationController
  def index
  end
end

Next, I created a separate javascript doc “app/assets/javascripts/maps.js”:

$(document).ready(function(){
    var cloudmade = new CM.Tiles.CloudMade.Web({key: 'my_key'});
    var map = new CM.Map('map_canvas', cloudmade);

    // Look up the city and zoom the map to its bounding box
    var geocoder = new CM.Geocoder('my_key');
    geocoder.getLocations('Durham, NC', function(response) {
      var southWest = new CM.LatLng(response.bounds[0][0], response.bounds[0][1]),
          northEast = new CM.LatLng(response.bounds[1][0], response.bounds[1][1]);
      map.zoomToBounds(new CM.LatLngBounds(southWest, northEast));
    });
    map.addControl(new CM.LargeMapControl());
});

Next, I created the maps index view “app/views/maps/index.html.erb” with a link to the open maps javascript library and the maps javascript doc:

<% content_for(:head) do %>
<%= javascript_include_tag "", :maps %>
<% end %>

  user = <%= current_user %>

<br />
<br />
<div id="map_sidebar">Top Restaurants </div>
<div id="map_canvas"></div>

Finally, modify the routes.rb file to allow localhost:8080/maps to be directed to my new controller and view:

  match "/maps" => "maps#index"

Add this all up and we get a nice map showing on our page:

The only weird thing I noticed is that there are some issues when zooming in.  After a certain point, the tiles stop loading:

If you’re interested in learning more, the CloudMade tutorials show pretty much how to do everything in easy steps here.

