Gesture Recognition System

Introduction

The article below presents the results of one part of my master's thesis. The purpose of the thesis was to research methods applicable to movement tracking and gesture recognition, and to attempt to build a working system. The initial idea was to apply the gathered knowledge to sign-language translation. Unfortunately, the lack of time (4 months) and hardware limitations allowed only for a proof-of-concept implementation. Nevertheless, the system is good enough for human-computer interaction.

Assumptions

Unfortunately the real world is too complicated to be described well within a computer application. A common technique when implementing complex systems is to make assumptions that reduce the real problem to a simpler one. In this case too, I made some simplifications that limited the user's behaviour but allowed me to build a working system within a reasonable time. The main assumptions I made were:
– Only one object moves at a time.
– Only moves in a 2D plane are considered.
– The source of light is constant.
– The camera doesn't move.

The first assumption is self-explanatory. To be precise, having just one moving object removes the need to decide which object actually has to be traced, or to trace several objects at once. The second one says that depth is not considered. A very important assumption was the one about constant light. As the segmentation was colour based and I was working with a very colour-sensitive web camera, I could not afford to work in sunlight. Even small changes of the light angle or saturation change the colour perceived by the camera too much.

The assumptions are just abstractions used mainly to simplify the image-processing part, and they could be chosen freely as long as they don't violate the system specification. Their main point is to skip problems that are not the crucial part of the system and to concentrate on what really matters.

Segmentation

The method I decided to use for segmentation is based on a colour histogram. Before tracing starts, there is one extra preprocessing step (you may check it on the video) where I select an area to take a sample of colours to trace. A colour value appearing more often in the sample gets a larger value in the histogram. Then I back-project the histogram onto the image, this time in grayscale: the bigger the value of a colour, the whiter it becomes in the output image (the other window on the left). It is worth mentioning that for the original image I used the HSV model rather than RGB.
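To make the idea more concrete, here is a minimal sketch of how such a back-projection step could look in Python with OpenCV. The hue-only histogram, the function names and the variable names are illustrative assumptions, not the exact code used in the thesis.

import cv2
import numpy as np

def build_histogram(frame, roi_rect):
    # Take a colour sample from the selected rectangle (x, y, w, h).
    x, y, w, h = roi_rect
    roi = frame[y:y + h, x:x + w]
    hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
    # Histogram over the hue channel only; 180 bins cover OpenCV's hue range.
    hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    return hist

def back_project(frame, hist):
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    # Pixels whose hue is frequent in the sample become bright in the output.
    return cv2.calcBackProject([hsv], [0], hist, [0, 180], scale=1)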

Picture 1. Click to see the movie presenting segmentation.


The second step of the segmentation is to discard elements that are not interesting for the application. Even if we assumed that only one object moves at a time, it is almost impossible to keep your head perfectly still. This introduces problems, as the head has a colour histogram similar to the hands and may confuse the tracing algorithm. Using complex head/face detection based on boosting did not make sense either, as such approaches are computationally expensive and I simply did not need them. Nevertheless, the head was not the only problem: there could be other objects with a matching colour histogram in the area. Luckily the solution described below solved the issue.

What was done here is based on the observation that the head moves much more slowly than the hands (in the best case it doesn't move at all). Thus consecutive frames are compared and pixels that don't change much are discarded. The comparison is done by subtracting pixels coming from different frames and comparing the result to a threshold chosen experimentally.
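A small sketch of this frame-differencing step, again in Python/OpenCV; the threshold value is an assumed constant for illustration, whereas in the thesis it was tuned experimentally.

import cv2

DIFF_THRESHOLD = 25  # assumed value, chosen for illustration

def motion_mask(prev_gray, curr_gray):
    # Pixels that barely change between consecutive frames are suppressed.
    diff = cv2.absdiff(prev_gray, curr_gray)
    _, mask = cv2.threshold(diff, DIFF_THRESHOLD, 255, cv2.THRESH_BINARY)
    return mask

# The mask can then be AND-ed with the back-projection image so that
# slow-moving regions (like the head) are discarded:
# moving_hand = cv2.bitwise_and(back_projection, back_projection, mask=mask)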

Tracing

The tracing algorithm has to meet some requirements that come directly from the nature of the moves: they do not have to be continuous or linear. This makes the tracing problem a little more complicated, as in case of an instant change of direction the tracing algorithm has to notice it quickly and not lose the followed object. Another issue is that the traced object might be occluded by other objects. In our case this happens when the hand moves above the head. We partially eliminated this problem in the previous step, but it was not sufficient: we managed not to trace the head, but the hand was sometimes lost when it appeared over the head.

Picture 2. Click to see the movie presenting tracing.


After the SLR (Systematic Literature Review) that formed the first part of the thesis, I decided to apply two strategies described in the literature: the Meanshift algorithm for tracing and the Kalman Filter for predicting the position of the hand in the next frame.

The Meanshift algorithm has been applied successfully to tracking problems since 1998. The formal basis of the method is described by Fukunaga [1] and Comaniciu [3]. The main idea may be described as follows:

1. Choose the window:
a) set the initial position of the window,
b) choose the type of the density function,
c) choose the shape of the window,
d) choose the size of the window.
2. Compute the center of mass.
3. Move the window to the center of mass.
4. Return to step 2.

As you might expect, I combined the back-projection method with the Meanshift algorithm: the window is shifted towards the centre of mass of the back-projected image, i.e. towards the region that matches the colour histogram best.
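A rough sketch of how the two pieces could be combined using OpenCV's built-in mean shift; the hue-only back-projection, the termination criteria and the variable names are assumptions made for illustration.

import cv2

term_criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

def track_step(frame, hist, track_window):
    # track_window is (x, y, w, h) around the hand from the previous frame.
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    back_proj = cv2.calcBackProject([hsv], [0], hist, [0, 180], scale=1)
    # Shift the window towards the local maximum of the back-projection.
    _, track_window = cv2.meanShift(back_proj, track_window, term_criteria)
    return track_window

OpenCV also provides CamShift, which additionally adapts the window size, though only plain Meanshift is discussed here.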

The Kalman Filter was first described in 1960. A precise explanation of the method can be found in the paper by Welch and Bishop [2]. The point of the approach is to keep information about previous measurements and to maximize the a-posteriori probability of the predicted position. But instead of keeping a long list of previous measurements, we maintain a model that is updated iteratively.
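Below is a minimal sketch of a constant-velocity Kalman filter for predicting the hand position, written with OpenCV's KalmanFilter class. The state layout and the noise covariances are illustrative assumptions, not the configuration used in the thesis.

import numpy as np
import cv2

# State: (x, y, dx, dy); measurement: (x, y).
kalman = cv2.KalmanFilter(4, 2)
kalman.transitionMatrix = np.array([[1, 0, 1, 0],
                                    [0, 1, 0, 1],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], dtype=np.float32)
kalman.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 1, 0, 0]], dtype=np.float32)
kalman.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-3
kalman.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1

def predict_and_correct(measured_x, measured_y):
    prediction = kalman.predict()                  # where the hand should be next
    measurement = np.array([[measured_x], [measured_y]], dtype=np.float32)
    kalman.correct(measurement)                    # update the model with the new measurement
    return float(prediction[0, 0]), float(prediction[1, 0])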

Gesture representation

Picture 3. The shape recognition step. Click to watch the movie.


The representation of a gesture is a sequence of directions and shapes. A direction is represented using the idea of the compass rose with eight principal winds, except that the winds are vectors going out from the centre. Each direction is assigned a number from 1 to 8 and describes the direction of the last move.
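A tiny sketch of how the last move could be quantized into one of the eight directions; the exact numbering used in the thesis may differ.

import math

def direction_code(dx, dy):
    # Angle of the last move, in degrees, folded into [0, 360).
    angle = math.degrees(math.atan2(dy, dx)) % 360
    # Each of the eight sectors spans 45 degrees, centred on the principal winds.
    return int(((angle + 22.5) % 360) // 45) + 1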

The shape is one of the predefined shapes of the hand. The system gathers the contours of the hand and uses an SVM (Support Vector Machine) classifier to obtain the related label. The contours are represented using Freeman Chain Codes.

Finally, a gesture could look like a sequence of pairs [1,’A’], [1,’B’], [2,’B’], [, ’STOP’]. Notice that when the ’STOP’ shape appears, the direction doesn’t matter.

Recognition

A Hidden Markov Model was applied for gesture recognition. It was trained on a set of predefined actions that were then translated to a language word. Every sequence of movements ended with a special move meaning “end of gesture”. It simplified the processing a lot, but it is not a very good practice, so “don’t try it at home” 😉 The results can be seen on the last video.
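For illustration, here is a minimal sketch of how recognition could work: each gesture gets its own HMM and a new observation sequence is scored against every model with the forward algorithm. The model parameters and the symbol encoding are placeholders, not the trained values from the thesis, and no numerical scaling is applied.

import numpy as np

def log_likelihood(obs, start_prob, trans_prob, emit_prob):
    # obs: sequence of observation indices (encoded direction/shape symbols).
    alpha = start_prob * emit_prob[:, obs[0]]
    for symbol in obs[1:]:
        # Forward recursion: propagate through transitions, then weight by emission.
        alpha = (alpha @ trans_prob) * emit_prob[:, symbol]
    return np.log(alpha.sum())

def recognize(obs, models):
    # models: dict mapping gesture name -> (start_prob, trans_prob, emit_prob).
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))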

Training

The training was split into two parts. First I trained the SVM just for shape recognition. When that was done, I approached the second step of training the HMMs, using the outputs of the SVM to describe the shape.

Conclusion

Even if the system I implemented was very limited, it proved that the chosen features and the tracing algorithms are strong enough for human-computer interaction. Nevertheless, the segmentation method has to be improved, as the current one is a home-made solution and the assumptions it makes are too strong. Secondly, the language processing shouldn’t require the “stop word” and should be done fully dynamically.

References

[1] K. Fukunaga, “Introduction to Statistical Pattern Recognition”, Academic Press, Boston, 1990.
[2] G. Welch, G. Bishop, “An Introduction to the Kalman Filter” (Technical Report TR 95-041), University of North Carolina at Chapel Hill, NC, 1995.
[3] D. Comaniciu, P. Meer, “Mean Shift Analysis and Applications”, IEEE International Conference on Computer Vision (vol. 2, p. 1197), 1999.


Spam Identification in Social Networks

Introduction

Picture 1. The term spam comes from the Monty Python “Spam” sketch.


Spam is a term that has been around for ages and has been experienced by countless users browsing the Internet, accessing email boxes, or checking their social network accounts. The first two cases are covered very well in the literature and there is not much left to say. The last one, however, is a very interesting issue, and the countermeasures used in social networks extend the classical factors known from text spam or spurious-page detection.

Spam in social networks

Social networks seem to be a fancier way of communicating than email. Extra features like walls, timelines, photo albums and nicer interfaces attract especially young people. Email is still used, but it seems to have become a more formal medium reserved for serious conversations.

The idea of Facebook is a network of trust, where the user is the one who decides what he wants to share and with whom. This can be achieved through privacy settings, sending friend requests only to known people and confirming requests only from people who are known. Unfortunately, the social phenomenon of gaining popularity and self-promotion via social platforms has skewed the assumptions of the network of trust. The results presented in [1] show that 41% of Facebook users accept any invitation they receive, even if they have never met the sender before. Moreover, 45% of users click on links posted by their connections even if they don’t know the person in real life.

Most spammers work in an automated way, using bots that execute commissioned tasks. One of the most widely used tactics against bots is asking the user to provide input that is easy for a human to produce but causes lots of problems for machines. The classical example of such a technique is CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart), which every Internet user has met at least once.

However, asking a user to solve a CAPTCHA every time he takes an action can be really frustrating and leads to a decrease in the number of users using the service. Therefore, there has to be a clever way of deciding when the extra test should be run. The purpose of this article is to give some examples of countermeasures that might be used to determine whether the user is a human or a spamming bot. When these features are applied with the support of machine learning, we should be able to build a fairly strong spam filter. Originally this article was much longer, but I have decided to split it into smaller parts.

Spammers groups

Usually the first action taken by a spammer is to send a friend request, so that the spammer has access to the wall and contacts of the target as soon as the request is accepted. The same result can be obtained when a user allows a Facebook application to access his profile. Moreover, there are webpages that require access to user information, and according to my personal observation this strategy appears to be the most effective and successful.

When the first condition is met and the spammer obtains the access he needs, he may start his campaign. Several categories of spammer behaviour have been distinguished:

displayer: does not send messages, but only displays them on its own page (e.g. the about section).
bragger: posts messages to its own feed so they are visible to friends.
poster: publishes messages on the victims’ walls.
whisperer: sends private messages.

The most popular strategy on Facebook seems to be the poster, as it reaches the widest audience. There is also another way of grouping spammers, related to their spamming activity:

greedy: spam is contained in every piece of content that is published.
stealthy: spam is published only once in a while, while most of the messages are legitimate.

It seems obvious that the second type of spammer is much harder to detect. Luckily, spammers usually apply simple strategies and stealthy bots are rarely met. I guess this might be related to the cost of maintaining a stealthy mechanism, as some extra work has to be done to generate legitimate content. Nevertheless, the stealthy approach is more dangerous than the greedy one, as the user might not be aware of the fake identity of the sender. Secondly, the sender builds trust because most of the messages are legitimate, which may lead to phishing attempts.

Countermeasures for spam detection

The countermeasures used for spam detection strictly depend on the nature of the service. However, there are some common factors that can be applied to all platforms. One of them is the Followers/Following Ratio: a comparison between the number of friends (followers) a user has and the number of requests he has sent. A legitimate user is only exceptionally rejected, as he usually sends requests to people who know him. The opposite happens for a spam account, where most of its requests are rejected. As I mentioned in the previous paragraph, this idea can be extended to the problem of applications and webpages asking for access.
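As a toy illustration, the ratio could be computed from a crawled profile record like this; the field names are assumptions, not an actual API.

def ff_ratio(profile):
    # Friend requests that were accepted vs. friend requests that were sent.
    accepted = profile["accepted_requests"]
    sent = profile["sent_requests"]
    return accepted / sent if sent else 1.0

# A legitimate user's ratio stays close to 1, while a spamming account that
# mass-mails requests ends up with a ratio close to 0.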

A very common technique is to consider how many accounts are accessed from one host within a time frame. As already mentioned, spammers usually do not run all the actions manually, as that would be too repetitive and wearisome for a human being; instead they run bots that do all the actions for them. A bot may try to access one account and, after finishing its job, switch to another one. If the platform identifies attempts to access numerous accounts within a short period of time, the behaviour looks suspicious. However, if the bot is slow enough and doesn’t switch between accounts too fast, this measure might not be strong enough.

Another time-based observation is the activity of the users. Obviously, every user accesses the service at different times and takes different types of actions. But it is quite uncommon to observe legitimate users logging in every night at 2 am and sending hundreds of similar messages, or repeating the same or similar action at equal intervals. This kind of behaviour should be flagged for further investigation. If there is only a finite number of actions that can be taken by a user, we may build a probabilistic model describing typical user behaviour during a session. This technique is called anomaly detection and is extremely useful when it is hard to define what kind of behaviour we consider suspicious, but we have plenty of examples of users using the service in a normal way. I will try to take a closer look at this method soon.
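A minimal sketch of such an anomaly detector, assuming a finite action set and a simple multinomial model of normal sessions; the action names and the threshold are illustrative assumptions.

import numpy as np

ACTIONS = ["login", "post", "comment", "message", "friend_request"]

def fit_action_model(normal_sessions):
    # Estimate action probabilities from sessions of legitimate users,
    # with Laplace smoothing so no action gets probability zero.
    counts = np.ones(len(ACTIONS))
    for session in normal_sessions:
        for action in session:
            counts[ACTIONS.index(action)] += 1
    return counts / counts.sum()

def is_suspicious(session, probs, threshold=-3.0):
    # Average per-action log-probability; low values mean unusual behaviour.
    score = np.mean([np.log(probs[ACTIONS.index(a)]) for a in session])
    return score < threshold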

We have already mentioned message similarity. Spammers post messages or comments that tend to be very similar to each other. If the platform identifies a user sending a set of messages that differ only slightly, or do not differ at all, it might be considered spam. When numbers of similar messages are sent from different profiles, it even has a professional name: a Spam Campaign. Here too we may use a probabilistic model to describe the language of legitimate messages and check new messages against it. I will try to describe this as well in one of the next posts.
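A small sketch of a similarity check based on word 3-gram shingles and the Jaccard index; the threshold and the campaign heuristic are assumptions for illustration, not a description of any platform's actual filter.

def shingles(text, n=3):
    # Split the message into overlapping word n-grams.
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a, b):
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def looks_like_campaign(messages, threshold=0.8):
    # Many near-duplicate messages from one account is a strong spam signal.
    pairs = [(m1, m2) for i, m1 in enumerate(messages) for m2 in messages[i + 1:]]
    return sum(jaccard(m1, m2) >= threshold for m1, m2 in pairs) > len(messages) // 2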

The ratio of sent messages that contain URLs to the total number of sent messages can also be considered as a measure. Additionally, spammers use “link shortener” services like tinyurl.com to hide the real identity of the webpage.

Spammers need to obtain the names of the users they are going to target in some way. One very common technique is to target users from a prepared list, e.g. add people with the first name “Clint” or “Joe” and the last name “Kidd” or “Eastwood”. This approach can be extended by analyzing the social graph. People mostly have clusters of friends related to school, work or common interests, and it is very likely that a person they are going to add will be associated with one of these clusters in some way, e.g. through common friends, a common set of interests, or place of residence. When a user tries to follow numbers of people with a very low association score to any of his clusters, it should be considered a potential anomaly. The latter measure seems applicable to Facebook, but not so much to services like Soundcloud, as they do not provide a rich enough social graph.

A very interesting measure that could also be applied is the interaction history. Although an account may establish a large number of social links in the social graph, it usually interacts with only a small subset of its friends. A sudden burst in which the account starts to interact with friends it did not interact with before may become an indicator of spamming activity.

Spammers are reactive

Unfortunately, there is no way of building an ultimate spam filter that would solve the issue of spam forever. Spammers are reactive and they challenge every successful anti-spam action. As I have already mentioned, when there is a limit on the number of different accounts accessed from one host within a time range, it can be circumvented by introducing a delay before switching from one account to another. Another example would be a text-based filter that can be attacked by obfuscating words, which makes them harder to tokenize, e.g. W4TCH3S instead of WATCHES.

Summary

The factors mentioned here are definitely not exhaustive and there are surely others that could be applied. Secondly, better results can be obtained by choosing several features and combining them, as it then becomes harder to confuse the spam filter by manipulating one of the features.

I hope that after reading this article you have a general overview of how the identification of suspicious behaviour may look and what kind of countermeasures are used to detect spurious behaviour.

References and further reading

[1] “Detecting Spammers on Social Networks”, G. Stringhini, C. Kruegel, G. Vigna
[2] “Towards Online Spam Filtering in Social Networks”, H. Gao, Y. Chen, K. Lee, D. Palsetia, A. Choudhary
[3] “Survey on Web Spam Detection: Principles and Algorithms”, N. Spirin, J. Han
[4] “A Survey of Learning-Based Techniques of Email Spam Filtering”, E. Blanzieri, A. Bryl

Support Vector Machine (SVM)

Please feel free to comment if you can see a mistake or the explanation is not clear enough, so that I can try to improve the article.

Introduction

This post is going to be slightly different from the others. I am not going to present a problem and an example solution, but rather focus on one of the most widely used approaches in machine learning. The reason why I have decided to describe the Support Vector Machine is the lack of a simple tutorial on the Internet that explains the theory in a way that is understandable for everyone. The article is based mainly on two resources: the MIT lecture about SVM and a Stanford professor’s materials. I strongly recommend taking a look at those materials if you need a deeper understanding of the matter. This article, however, is going to give you a general intuition and explain the math without going too much into details. After reading it, you should be able to make your own implementation of SVM.

Some machine learning review

As one example is worth more than a thousand words, let us assume we have a set of text messages (e.g. emails) and we face the problem of distinguishing between spam and ham. In other words, we try to establish which messages are unwanted and which messages are expected. Every message is represented by some set of features. We will not go into details of what features should be chosen, nor how to extract them. At the moment it is enough to say that each message is represented by a feature vector \vec{x}. A label y is associated with every feature vector, telling us whether the message is spam or ham. The set of pairs {(\vec{x}_i,y_i)} is called the training set and it is used to learn the classifier.

We may consider every message as a point in a space, described by its coordinates (features). Obviously we have multidimensional vectors of coordinates (features), but for simplicity let us assume we have only two-dimensional vectors. We can depict them in the Cartesian coordinate system. As we can see in figure 1, the samples representing ham and spam are located in different regions of the space. Is there a way to determine a line (more generally, a hyperplane) that divides the sample space into regions representing positive and negative samples? Yes, there is, and it is obtained by using a learning algorithm. By the way, this line is called the decision boundary.

Figure 1. The decision boundary separating the set of spam and ham.


In the general case, learning works by creating a parametrized function called the hypothesis. While learning, we try to optimize the parameters so that the error is the lowest. Yes, learning is an optimization problem. But what is the error and how do we measure it? It depends on the learning technique we use. It might be the least squares error on the training set (used e.g. in linear regression), or something more sophisticated (as in the SVM approach).

The intuition of SVM

SVM tries to maximize the margin between the hyperplane and the samples. Thanks to that, the decisions it makes are more accurate: a large distance between the decision boundary and the samples makes the probability of misclassification lower. Classifiers of this kind are called large margin classifiers.

Fig. 2. The picture describes the concept of the road and its size.


Another way of thinking about margins and their maximization is to introduce the concept of the road. Let us imagine we have a road that separates the positive and the negative samples (as you can see in figure 2). We try to build the road so that it becomes the widest possible; in other words, to put the gutters (solid lines in figure 2) as far as possible from the center of the road (dashed line), so that the positive examples are located on one side of the road and the negative examples on the other.

How it really works

Hmm… but how do we know whether a sample is located on the left or on the right side of the road? Is there a numerical way of checking it? Let us assume that we are given a sample point u, as in figure 3. We can make a vector \vec{u}, and we want to establish on which side of the road the point is located. For this we need another vector that is perpendicular to the center of the road. Why we need the second vector will become obvious soon.

Figure 3. The green element is an element to be classified.


There is a wide choice of vectors perpendicular to the center of the road. For the moment, just assume that \vec{w} represents one of them. The dot product makes a projection of \vec{u} on \vec{w}: the bigger the projection, the further we move in the direction of the road center (the projection vector becomes longer). We may use this property to build our decision rule, i.e. to establish on which side of the road the considered element lies. We may say, quite abstractly, that the rule looks like \vec{w} \cdot \vec{u} >= c , where c is a certain constant.

When we set c = -b, the above rule takes the form \vec{w} \cdot \vec{u} + b >= 0 . We can make the rule even stronger, as we want to maximize the distance, and write it in the following way:

\vec{w} \cdot \vec{u} + b >= 1 for positive examples
\vec{w} \cdot \vec{u} + b <= -1 for negative examples

OK, we have two classes we want to classify into, so the value of y is either 1 or -1. This allows us to multiply the above expressions by y. Then we get:

y(\vec{w} \cdot \vec{u} + b) >= 1 for y = 1
y(\vec{w} \cdot \vec{u} + b) >= 1 for y = -1

Cool, now we have one rule for making decisions. Let’s move the “one” to the left side, so we obtain the following formula:

y(\vec{w} \cdot \vec{u} + b) - 1 >= 0 .

Let’s say that for the elements located in the gutter the following condition is met:

Equation 1.

y(\vec{w} \cdot \vec{u} + b) - 1 = 0  

What we have just obtained is extremely handy, as now we can measure the width of the road! Let us take a look at figure 4. We have two elements located on opposite gutters. The difference between the vectors representing them will give us the width of the road. Hmm… not exactly: we also need to project the difference on the perpendicular vector.

Figure 4. Two elements located in the gutters of the road. The black vector represents the difference.


After the considerations from the previous paragraph, here is what we get. Please note that in this case we are using the normalized version of \vec{w}.

width = (\vec{x_1} - \vec{x_2}) \cdot \frac{\vec{w}}{|\vec{w}|}

But wait, we already know what the vectors \vec{x_1} and \vec{x_2} are: they are gutter elements satisfying equation 1 with y = 1 and y = -1 respectively. So we can express \vec{w} \cdot \vec{x_1} and \vec{w} \cdot \vec{x_2} from equation 1 and insert them into the formula above. Here is the result of this operation. If you don’t trust me (and you shouldn’t!), please check it yourself.

width = \frac {2} {|\vec{w}|}

As you remember, we want to maximize this width. But in its current form it is not very convenient, so we may transform the problem into a simpler one.

Equation 2. 

max \frac{1}{|\vec{w}|} \Leftrightarrow min |\vec{w}| \Leftrightarrow min \frac{1}{2} |\vec{w}|^2 

Just to recap, here is what we have obtained so far:

Equation 3.

min \frac{1}{2} |\vec{w}|^2 
y_i(\vec{w} \cdot \vec{x_i} + b) - 1 >= 0 for every training sample i

OK, so we have an optimization problem with some constraints. How do we solve it then? It is actually quite simple if we recall the idea of the Lagrangian. Not to scare anyone, we will just say that this is a way of solving optimization problems with constraints. For more desperate and math-hungry readers, please take a look at the wiki pages. By using the Lagrangian we may reformulate the problem in the following way and stop being scared of the constraints.

Equation 4.

L = \frac{1}{2}|\vec{w}|^2 - \sum_{i=1}^{n} \alpha_i (y_i (\vec{w} \cdot \vec{x_i} + b) - 1 ) 

Before you go further, let me make two remarks. Firstly, the \alpha_i are called Lagrange multipliers. Secondly, in the sum we are iterating over every element of the training set, which contains pairs (y_i, \vec{x_i}), where the first element is the label of the i-th sample and the second element is its feature vector. Obviously n stands for the size of the training set.

Now the problem becomes much simpler, as what we need to do is calculate partial derivatives. Most of us did this in high school (at least in my country), so it should not be a problem. Ah yes, as we are optimizing, we need to find the values where the derivatives are equal to zero. Let’s see then.

Equation 5.

\frac{\partial L}{\partial \vec{w}} = \vec{w} - \sum_{i=1}^{n} \alpha_i y_i \vec{x_i} = 0 

Equation 6.

\frac{\partial L}{\partial b} = \sum_{i=1}^{n} \alpha_i y_i = 0 

Equation 5 directly gives us the formula for \vec{w} .

Equation 7.

\vec{w} = \sum_{i=1}^{n} \alpha_i y_i \vec{x_i} 

Now, we can rewrite our decision rule, using the values we have just calculated:

Equation 8.

\sum_{i=1}^{n} \alpha_i y_i (\vec{x_i} \cdot \vec{u}) + b >= 0 

OK, we got something. But the problem of finding the values of \alpha_i still requires an efficient algorithm. Before we dive into it, I would like to take the values we have just calculated and put them back into the original Lagrangian, as this is going to bring us another interesting concept.

Equation 9.

L = \frac{1}{2}|\vec{w}|^2 - \sum_{i=1}^{n} \alpha_i (y_i (\vec{w} \cdot \vec{x_i} + b) - 1 ) 
L = \frac{1}{2}(\sum_{i=1}^{n} \alpha_i y_i \vec{x_i}) \cdot (\sum_{j=1}^{n} \alpha_j y_j \vec{x_j}) - (\sum_{i=1}^{n} \alpha_i y_i \vec{x_i}) \cdot (\sum_{j=1}^{n} \alpha_j y_j \vec{x_j}) - b \sum_{i=1}^{n} \alpha_i y_i + \sum_{i=1}^{n} \alpha_i 
L = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i \alpha_j  y_i y_j (\vec{x_i} \cdot \vec{x_j})

As we can see in the expression above, the optimization problem depends on the samples only through their dot products. We will use this observation in the next paragraph, as it leads to a very useful property.
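As a quick sanity check of that observation, the dual objective can be evaluated from the Gram matrix of pairwise dot products alone, as in this small sketch (X, y and alpha are assumed NumPy arrays):

import numpy as np

def dual_objective(alpha, X, y):
    # gram[i, j] = x_i . x_j is all the information about X the dual needs.
    gram = X @ X.T
    return alpha.sum() - 0.5 * np.sum(np.outer(alpha * y, alpha * y) * gram)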

Kernels

SVMs are very cool when the problem is linearly separable, which means that there exists a hyperplane that divides the space into two parts, one containing the negative and the other the positive samples. Unfortunately, in the cruel real world not all problems have this property.

Figure 5. A set of samples that is not linear separable.


Is there a transformation of the original problem into a different space where it becomes linearly separable? Yes, there is. We may denote this transformation as a function \phi(\vec{x}) . Hmm… but take a look at equation 9 from the last paragraph. Didn’t we agree that the problem depends only on the dot product of two sample vectors? I think we did. So instead of using the dot product of two feature vectors, we may use the dot product in the transformed space, \phi(\vec{x_i})\cdot \phi(\vec{x_j}) . For convenience we may introduce a function K(\vec{x_i},\vec{x_j}) = \phi(\vec{x_i})\cdot \phi(\vec{x_j}), called the kernel function. We do not even have to know the exact transformation \phi(\vec{x}) ; we just need a kernel function that gives us the result of the dot product in the transformed space.

One of the most popular kernel functions looks kind of like a Gaussian. Let us take a look.

Equation 10.

K(\vec{x_i},\vec{x_j}) = e^{-\frac{|\vec{x_i} - \vec{x_j}|}{\sigma}} 

Depending on the choice of \sigma we may control the spread of the values of our kernel. There are also other kernel functions; please google them if you want to get a broader picture.
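A small sketch of using a kernelised SVM in practice with scikit-learn. The custom kernel mirrors the exponential formula above, while scikit-learn's built-in 'rbf' kernel uses the squared distance; the toy XOR data and the value of C are assumptions made for illustration.

import numpy as np
from sklearn.svm import SVC

def exp_kernel(X1, X2, sigma=1.0):
    # Kernel matrix K[i, j] = exp(-|x_i - x_j| / sigma).
    dists = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=2)
    return np.exp(-dists / sigma)

# XOR-like points: not linearly separable in the original 2D space.
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([1, 1, -1, -1])

clf = SVC(kernel=exp_kernel, C=100.0)   # a callable kernel is accepted by SVC
clf.fit(X, y)
print(clf.predict(X))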

Regularization

There is one more thing left that I should mention. While training the classifier, we may fall into the trap of overfitting: a classifier may obtain a very good result on the training set but be quite weak at generalization (classifying samples that were not present during training). There is a little bit more to say about it, and if you are not familiar with the concept of regularization, please review it, as it is a very important matter in machine learning.

In the case of SVM we may use L1 regularization and change our equations to look slightly different.

Equation 11.

min \frac{1}{2}|\vec{w}|^2 + B \sum_{i=1}^{n} \delta_i 
y_i (\vec{w} \cdot \vec{x_i} + b) >= 1 - \delta_i, \quad \delta_i >= 0 

Now there is some math that we need to apply, but let me skip this stage and present the result directly. What we get is:

Equation 12. 

0 <= \alpha_i <= B 

How do we choose B? Hmm… it looks like an opportunity to write another article about it. Do Cross Validation and Learning Curves sound familiar to you? If yes, then you are done; if not, you may google them, or I will try to explain them soon 🙂

Learning using Sequential Minimal Optimization (SMO)

Even though the concept of SVM has been known for a long time, it did not see wide use at first because of the lack of an efficient learning algorithm. Thankfully, the idea of SMO came along.

The idea is based on a learning method called coordinate ascent, which can be described by the pseudocode below. Let us consider an unconstrained optimization problem: max W (\alpha_1, \alpha_2, ..., \alpha_n).

Listing 1.

repeat until convergence
{
   for i in [1..n] do
    \alpha_i = argmax_{\alpha_i} W (\alpha_1, ...,\alpha_{i-1}, \alpha_{i}, \alpha_{i+1}, ..., \alpha_n)  
}

What does the code above do? It chooses a parameter \alpha_i and treats the rest of the parameters as constants. Then W is optimized with respect to the parameter chosen in the previous step. The choice is usually made using some heuristic, e.g. the parameter that will bring the biggest optimization step is chosen. In the example above, however, we simply iterate over all the parameters [1..n].
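To see coordinate ascent in action, here is a toy example on a simple concave function of two variables (unrelated to the SVM dual itself), where each step optimizes one coordinate exactly while the other stays fixed:

# Maximise W(a1, a2) = -(a1^2 + a2^2 + a1*a2 - a1 - a2); the optimum is a1 = a2 = 1/3.
a1, a2 = 0.0, 0.0
for _ in range(20):                 # "repeat until convergence"
    a1 = (1.0 - a2) / 2.0           # argmax over a1 with a2 fixed
    a2 = (1.0 - a1) / 2.0           # argmax over a2 with a1 fixed

print(a1, a2)                       # both approach 1/3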

It looks good, but in our case it is not going to be that simple. As you remember, our problem has some constraints that make the use of coordinate ascent less straightforward. Do you remember this formula?

\sum_{i=1}^{n} \alpha_i y_i = 0

This is what blocks us from using coordinate ascent directly. Let me show you why.

Let us assume we want to keep \alpha_2, \alpha_3, ..., \alpha_n fixed and we try to optimize with respect to \alpha_1 . We may rewrite the previous equation in the following way.

\alpha_1 y_1 = -\sum_{i=2}^{n} \alpha_i y_i
or, when we multiply by y_1 (keep in mind it takes values 1 and -1),
\alpha_1 = - y_1 \sum_{i=2}^{n} \alpha_i y_i

So as we can see, \alpha_1 is determined by the other \alpha_i‘s. Now it becomes clear why we can’t use coordinate ascent as it is. Thus we need to update at least two \alpha_i at the same time. And here we are: this is the SMO algorithm.

Listing 2.

repeat until convergence
{
   1. Select some pair \alpha_i , \alpha_j  (using some heuristic).
   2. Reoptimize W(\alpha) with respect to \alpha_i , \alpha_j 
      keeping the rest of the parameters fixed.
}

And here comes the best part: the SMO algorithm is extremely efficient. Let me explain. Let us say that we choose \alpha_1 , \alpha_2 to update and keep the rest fixed. We get the following equation.

\alpha_1 y_1 + \alpha_2 y_2 = - \sum_{i=3}^{n} \alpha_i y_i

As the value on the right-hand side is constant, we may rewrite it as follows:

\alpha_1 y_1 + \alpha_2 y_2 = C

This gives us the possibility of expressing \alpha_1 as a function of \alpha_2.

\alpha_1 = (C - \alpha_2 y_2 )y_1

Now our optimization problem becomes way simpler than previously, as we are going to optimize with respect to only one parameter.

W(\alpha_1, \alpha_2, ..., \alpha_n) = W((C - \alpha_2 y_2 )y_1, \alpha_2, ..., \alpha_n)

When you keep in mind that our W is actually our L (Lagrangian), we get a quadratic function of \alpha_2. We can easily calculate the derivative and set it to zero to obtain the optimum.

There is one more thing to mention. Do you remember the extra constraint from the paragraph about regularization, 0 <= \alpha_i <= B? It tells us that the values of \alpha_1, \alpha_2 have to be located within the box [0,B] x [0,B]. We have just agreed that \alpha_1 y_1 + \alpha_2 y_2 = C. This introduces another constraint, L <= \alpha_2 <= H , as otherwise \alpha_1, \alpha_2 cannot simultaneously satisfy both the box and the straight-line constraint.

Figure 6. The graphical explanation of the constraints and clipping.


Now we have everything we need to formulate the update rule, which looks as follows:

\alpha_{2 new} = argmax  W((C - \alpha_2 y_2 )y_1, \alpha_2, ..., \alpha_n) 
if \alpha_{2 new} > H then
    \alpha_{2 new} = H
else if  \alpha_{2 new} < L then
    \alpha_{2 new} = L
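For completeness, here is a sketch of the clipping step together with the standard bounds L and H implied by the box constraint 0 <= \alpha_i <= B and the line constraint \alpha_1 y_1 + \alpha_2 y_2 = C (following Platt's SMO derivation; the variable names are illustrative):

def clip_alpha2(alpha2_new, alpha1_old, alpha2_old, y1, y2, B):
    # Bounds on alpha_2 so that both alphas stay inside the [0, B] box
    # while remaining on the line alpha_1*y_1 + alpha_2*y_2 = C.
    if y1 != y2:
        low = max(0.0, alpha2_old - alpha1_old)
        high = min(B, B + alpha2_old - alpha1_old)
    else:
        low = max(0.0, alpha1_old + alpha2_old - B)
        high = min(B, alpha1_old + alpha2_old)
    return min(max(alpha2_new, low), high)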

Summary

As I mentioned at the beginning, the SVM is a really powerful tool used in numbers of machine-learning applications. The large-margin approach makes our decisions more accurate, and thanks to SMO we can train the classifier very efficiently. It is also worth remembering that by introducing the idea of kernels we can handle problems that are not linearly separable, which greatly increases the number of problems we may solve. It also gives us a motto, “When you get stuck, try to change the perspective”, that applies not only to machine-learning problems.

I hope that the article gave you some intuition and explained the basic concepts used in developing the idea of SVM.

Thanks for reading!

Joe Kidd

References

1. Artificial Intelligence by Patrick Winston, http://ocw.mit.edu/6-034F10
2. Machine Learning by Andrew Ng, http://cs229.stanford.edu/materials.html