Spam Identification in Social Networks


Picture 1. The term spam comes from the sketch of Monty Python.


Spam is a term that has been around for ages and has been experienced by countless users browsing the Internet, checking their email boxes, or visiting their social network accounts. The first two cases have been covered in the literature very well and there is not much left to say. The last one, however, is a very interesting issue, and the countermeasures used in social networks extend the classical factors known from text spam or spurious page detection.

Spam in social networks

Social networks seem to be a fancier way of communicating than email. Extra features like walls, timelines, photo albums and nicer interfaces attract young people in particular. Email is still used, but it seems to have become a more formal medium reserved for serious conversations.

The idea of Facebook is a network of trust, where the user is the one who decides what he wants to share and with whom. It might be achieved through privacy settings, sending friend requests only to known people, and confirming requests only from people who are known. Unfortunately, the social phenomenon of gaining popularity and self-promotion via social platforms has skewed the assumptions of the network of trust. The results presented in [1] show that 41% of Facebook users accept any invitation they receive, even if they have never met the sender before. Moreover, 45% of users click on links posted by their connections even if they don’t know the person in real life.

Most spammers work in an automated way, using bots to execute commissioned tasks. One of the most widely used tactics against bots is asking the user to provide an input that is easy for a human but causes lots of problems for machines. The classic example of such a technique is CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart), which virtually every Internet user has encountered at least once.

However, asking a user to solve a CAPTCHA every time he takes an action might be really frustrating and leads to a decrease in the number of users using the service. Therefore, there has to be a clever way of deciding when the extra test should be run. The purpose of this article is to give some examples of countermeasures that might be used to determine whether the user is a human or a spamming bot. When these features are applied with the support of machine learning, we should be able to build a quite strong spam filter. Originally this article was way longer, but I have decided to split it into smaller parts.

Spammers groups

Usually the first action taken by a spammer is to send a friend request, so that the spammer gains access to the wall and contacts of the target as soon as the request is accepted. The same result might be obtained when a user allows a Facebook application to access his profile. Moreover, there are webpages that require access to user information, and this strategy appears to be the most effective and successful according to my personal observation.

When the first condition is met and the spammer obtains the access he needs, he may start his campaign. Several categories of spammer behaviour have been distinguished:

displayer: does not send messages, but only displays them on his own page (e.g. in the About section).
bragger: posts messages to his own feed so they are visible to friends.
poster: publishes messages on the victims’ walls.
whisperer: sends private messages.

The poster strategy seems to be the most popular on Facebook, as it reaches the widest audience. There is also another way of grouping spammers, related to their spamming activity:

greedy: spam is contained in every piece of content that is published.
stealthy: spam is published only once in a while, while most of the messages are legitimate.

It seems obvious that the second type of spammer is way harder to detect. Luckily, spammers usually apply simple strategies and stealthy bots are rarely met. I guess it might be related to the cost of maintaining a stealthy mechanism, as some extra work has to be done to generate legitimate content. Nevertheless, the stealthy approach is more dangerous than the greedy one, as the user might not be aware of the fake identity of the sender. Secondly, the sender builds trust as most of the messages are legitimate, which may open the door to phishing attempts.

Countermeasures for spam detection

The countermeasures used for spam detection strictly depend on the nature of the service. However, there are some common factors that might be applied across all platforms. One of them is the Followers/Following Ratio: a comparison between the number of friends (followers) a user actually has and the number of requests he has sent. A legitimate user is rejected only exceptionally, as he usually sends requests to people who know him. The opposite happens for a spam account, where most of its requests are rejected. As I mentioned in the previous paragraph, this idea might be extended to the problem of applications and webpages asking for access.

A very common technique is to consider how many accounts are accessed from one host within a time frame. As has already been mentioned, spammers usually do not perform all the actions manually, as it would be too repetitive and wearisome for a human being; instead they run bots that do all the actions for them. A bot may access one account and, after finishing its job, switch to another one. If the platform identifies an attempt to access numerous accounts within a short period of time, the behaviour looks suspicious. However, if the bot is slow enough and doesn’t switch between accounts too fast, this measure might not be strong enough.

Another time-based observation is the activity of the users. Obviously, every user accesses the service at different times and takes different types of actions. But it is quite uncommon to observe legitimate users logging in every night at 2 am and sending hundreds of similar messages, or repeating the same or a similar action at equal intervals. This kind of activity might be flagged as requiring further investigation. If there is only a finite number of actions that might be taken by a user, we may build a probabilistic model describing typical user behaviour during a session. This technique is called anomaly detection and is extremely useful when it is hard to define what kind of behaviour we are going to consider suspicious, but we have plenty of examples of users using the service in a normal way. I will try to take a closer look at this method soon.

We have already mentioned message similarity. A spammer tends to post messages or comments that are very similar to each other. If the platform identifies a user sending a set of messages that differ only slightly, or do not differ at all, it might be considered a spam approach. When there are numbers of similar messages sent from different profiles, it has a professional name: a Spam Campaign. Here too we may use a probabilistic model to describe the language of legitimate messages and check new messages against it. This is also something I will try to describe in one of the next posts.

The ratio of sent messages containing URLs to the total number of sent messages might also be considered as a measure. Secondly, spammers use “link shortener” services to hide the real identity of the webpage.

Spammers need to obtain the names of the users they are going to target in some way. One very common technique is to target users coming from a prepared list, e.g. adding people with the first name “Clint” or “Joe” and the last name “Kidd” or “Eastwood”. This approach could be extended by analyzing the social graph. People mostly have clusters of friends related to school, work, or common interests, and it is very likely that a person they are going to add will be associated with one of those clusters in some way, e.g. through common friends, a common set of interests, or the place of living. When a user is trying to follow numbers of people with a very low association score to any of his clusters, it should be considered a potential anomaly. However, the latter measure seems applicable to Facebook, but not so much to services like Soundcloud, as they do not provide a rich enough social graph.

A very interesting measure that could also be applied is the interaction history. Although one account may establish a large number of social links in the social graph, it usually interacts with only a small subset of its friends. A sudden burst of interactions with friends the account did not interact with before may become an indicator of spamming activity.

Spammers are reactive

Unfortunately, there is no way of building an ultimate spam filter that would solve the issue of spam forever. Spammers are reactive and they challenge every successful anti-spam action. As I have already mentioned, when there is a limit on the number of different accounts accessed from one host within a time range, it might be circumvented by introducing a delay before switching from one account to another. Another example would be a text-based filter that might be attacked by obfuscating words to make them harder to tokenize, e.g. W4TCH3S instead of WATCHES.


The mentioned factors are definitely not exhaustive and there are surely others that might be applied. Secondly, better results might be obtained by choosing several features and combining them, as it then becomes harder to confuse the spam filter by manipulating one of the features.

I hope that after reading this article you have a general overview of how the identification of suspicious behaviour may look and what kind of countermeasures are used to detect spurious activity.

References and further reading

[1] “Detecting Spammers on Social Networks”, G. Stringhini, C. Kruegel, G. Vigna
[2] “Towards Online Spam Filtering in Social Networks”, H. Gao, Y. Chen, K. Lee, D. Palsetia, A. Choudhary
[3] “Survey on Web Spam Detection: Principles and Algorithms”, N. Spirin, J. Han
[4] “A Survey of Learning-Based Techniques of Email Spam Filtering”, E. Blanzieri, A. Bryl


Please feel free to comment if you can see a mistake or the explanation is not clear enough, so that I can try to improve the article.


Support Vector Machine (SVM)


This post is going to be slightly different from the others. I am not going to present a problem and an example solution, but instead focus on one of the most widely used approaches in machine learning. The reason why I have decided to describe the Support Vector Machine is the lack of a simple tutorial on the Internet that would explain the theory in a way that is understandable for everyone. The article is based mainly on two resources: the MIT lecture about SVM, and Stanford professor’s materials. I strongly recommend you take a look at those materials if you need a deeper understanding of the matter. However, this article is going to give you a general intuition and explain the math without going too much into detail. After reading it, you should be able to make your own implementation of SVM.

Some machine learning review

As one example is worth more than a thousand words, let us assume we have a set of text messages (e.g. emails) and we face the problem of distinguishing between spam and ham. In other words, we try to establish which messages are unwanted and which are expected. Every message is represented by some set of features. We will not go too much into detail about which features should be chosen, nor how to extract them. At the moment it is enough to say that each message is represented by a feature vector \vec{x}. A label y is associated with every feature vector, telling us whether the message is spam or ham. The set of pairs {(\vec{x}_i, y_i)} is called the training set and it is used to learn the classifier.

We may consider every message as a point in a space, described by its coordinates (features). In reality we have multidimensional feature vectors, but for simplicity let us assume we have only two-dimensional ones. We can depict them in the Cartesian coordinate system. As we can see in figure 1, the samples representing ham and spam are located in different regions of the space. Is there a way to determine a line (more generally, a hyperplane) that would divide the sample space into regions representing positive and negative samples? Yes, there is, and it is obtained by using a learning algorithm. By the way, this line is called the decision boundary.

Figure 1. The decision boundary separating the set of spam and ham.


In the general case, learning works by creating a parametrized function called the hypothesis. While learning, we try to optimize the parameters so that the error is the lowest. Yes, learning is an optimization problem. But what is the error and how do we measure it? It depends on the learning technique we use. It might be the least squares error over the training set (used e.g. in linear regression), or something more sophisticated (as in the SVM approach).

The intuition of SVM

SVM tries to maximize the margin between the hyperplane and the samples. Thanks to that, the decisions it makes are more accurate: a large distance between the decision boundary and the samples makes the probability of misclassification lower. Classifiers of this kind are called large margin classifiers.

Fig. 2. The picture describes the concept of the road and its size.


Another way of thinking about margins and their maximization is to introduce the concept of the road. Let us imagine we have a road that separates the positive and the negative samples (as you can see in figure 2). We try to build the road so that it becomes as wide as possible. In other words, we want to put the gutters (the solid lines in figure 2) as far as possible from the center of the road (the dashed line), so that the positive examples are located on one side of the road and the negative examples on the other.

How it really works

Hmm… but how do we know whether a sample is located on the left or on the right side of the road? Is there a numerical way of checking it? Let us assume that we are given a sample point u, as in figure 3. We can make a vector \vec{u} , and we want to establish on which side of the road the point is located. For this we need another vector that is perpendicular to the center of the road. Why we need the second vector will become obvious soon.

Figure 3. The green element is an element to be classified.


The choice of vectors perpendicular to the center of the road is quite wide. For the moment, just assume that \vec{w} represents one of them. The dot product makes a projection of \vec{u} onto \vec{w} . The bigger the projection is, the further we move in the direction of the road center (the projection becomes longer). We may use this property to make our decision rule, i.e. to establish on which side of the road the considered element lies. We may say, quite abstractly, that the rule looks like \vec{w} \cdot \vec{u} >= c , where c is a certain constant.

When we take c = -b, the above rule takes the form \vec{w} \cdot \vec{u} + b >= 0 . We can make the rule even stronger, as we want to maximize the distance, and write it in the following way.

\vec{w} \cdot \vec{u} + b >= 1 for positive examples
\vec{w} \cdot \vec{u} + b <= -1 for negative examples

But ok, we have two classes we want to classify into, so the value of y might be 1 or -1. This allows us to multiply the above expressions by y. Then we get:

y(\vec{w} \cdot \vec{u} + b) >= 1 for y = 1
y(\vec{w} \cdot \vec{u} + b) >= 1 for y = -1

Wow, it’s cool: now we have one rule for taking decisions. Let’s move the “one” to the left side, so we obtain the following formula:

y(\vec{w} \cdot \vec{u} + b) - 1 >= 0 .

Let’s say that for the elements that are located in the gutter the following condition will be met:

Equation 1.

y(\vec{w} \cdot \vec{u} + b) - 1 = 0  

What we just obtained is extremely handy, as now we can measure the width of the road! Let us take a look at figure 4. We have two elements located on the opposite gutters. The difference between the vectors representing them will give us the width of the road. Hmm… not exactly: we also need to project the difference onto the perpendicular vector.

Figure 4. Two elements located in the gutters of the road. The black vector represents the difference.


After the considerations from the previous paragraph, here is what we get. Please note that in this case we are using the normalized version of \vec{w}.

width = (\vec{x_1} - \vec{x_2}) \cdot \frac{\vec{w}}{|\vec{w}|}

But wait, we already know something about the vectors \vec{x_1} and \vec{x_2} : they are gutter elements, so they satisfy equation 1. So we can calculate \vec{w} \cdot \vec{x_1} and \vec{w} \cdot \vec{x_2} from equation 1 and insert them into the formula above. Here is the result of this operation. If you don’t trust me (and you shouldn’t!), please check it yourself.

width = \frac {2} {|\vec{w}|}
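If you do check it, the substitution goes as follows: applying equation 1 to a positive gutter sample (y = 1) and to a negative one (y = -1) gives the two dot products we need.

```latex
% From equation 1: \vec{w} \cdot \vec{x_1} = 1 - b   (positive sample, y = 1)
%                  \vec{w} \cdot \vec{x_2} = -1 - b  (negative sample, y = -1)
width = \frac{\vec{w} \cdot \vec{x_1} - \vec{w} \cdot \vec{x_2}}{|\vec{w}|}
      = \frac{(1 - b) - (-1 - b)}{|\vec{w}|}
      = \frac{2}{|\vec{w}|}
```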

As you remember, we are expected to maximize this width. But in the form that we currently have, it is not very convenient. However, we may transform the problem into a simpler one.

Equation 2. 

max \frac {1} {|\vec{w}|} = min |\vec{w}| = min \frac{1}{2} |\vec{w}|^2 

Just to recap, here is what we have obtained so far:

Equation 3.

min \frac{1}{2} |\vec{w}|^2 
y_i(\vec{w} \cdot \vec{x_i} + b) - 1 = 0 

Ok, so we have an optimization problem with some constraints. How do we solve it then? It is actually quite simple if we recall the idea of the Lagrangian. Not to scare anyone, we will just say that this is a way of solving optimization problems with constraints. More desperate and math-hungry readers may take a look at the wiki pages. By using the Lagrangian we may reformulate the problem in the following way and stop being scared of the constraints.

Equation 4.

L = \frac{1}{2}|\vec{w}|^2 - \sum_{i=1}^{n} \alpha_i (y_i (\vec{w} \cdot \vec{x_i} + b) - 1 ) 

Before you go further, let me make two remarks. Firstly, the \alpha_i are called Lagrange multipliers. Secondly, in the sum we are iterating over every element of the training set, which contains pairs (y_i, \vec{x_i}) , where the first element stands for the label of the i-th sample and the second is its feature vector. Obviously, n stands for the size of the training set.

Now the problem becomes way simpler, as what we need to do is calculate partial derivatives. Most of us did that in high school (at least in my country), so it should not be a problem. Ah yes, as we are optimizing, we need to find the values where the derivatives are equal to zero. Let’s see then.

Equation 5.

\frac{\partial L}{\partial \vec{w}} = \vec{w} - \sum_{i=1}^{n} \alpha_i y_i \vec{x_i} = 0  

Equation 6.

\frac{\partial L}{\partial b} = \sum_{i=1}^{n} \alpha_i y_i = 0 

Equation 5 directly gives us the formula for \vec{w} .

Equation 7.

\vec{w} = \sum_{i=1}^{n} \alpha_i y_i x_i 

Now, we can rewrite our decision rule, using the values we have just calculated:

Equation 8.

\sum_{i=1}^{n} \alpha_i y_i (\vec{x_i} \cdot \vec{u}) + b >= 0 

Ok, we got something. But the problem of finding the values of \alpha_i still requires an efficient algorithm. Before we dive into it, I would like to take the values we have just calculated and put them back into the original Lagrangian formula, as it is going to bring us another interesting concept.

Equation 9.

L = \frac{1}{2}|\vec{w}|^2 - \sum_{i=1}^{n} \alpha_i (y_i (\vec{w} \cdot \vec{x_i} + b) - 1 ) 
L = \frac{1}{2}(\sum_{i=1}^{n} \alpha_i y_i \vec{x_i}) \cdot (\sum_{j=1}^{n} \alpha_j y_j \vec{x_j}) - (\sum_{i=1}^{n} \alpha_i y_i \vec{x_i}) \cdot (\sum_{j=1}^{n} \alpha_j y_j \vec{x_j}) - \sum_{i=1}^{n} \alpha_i y_i b + \sum_{i=1}^{n} \alpha_i 
L = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i \alpha_j  y_i y_j (\vec{x_i} \cdot \vec{x_j})

As we can see in the expression above, the optimization problem depends on the dot products of the samples. We will use this observation in the next paragraph, as it brings a very useful feature.


SVMs are very cool when the problem is linearly separable, meaning there exists a hyperplane that divides the space into two parts, one containing the negative and the other the positive samples. Unfortunately, in the cruel world not all problems have this property.

Figure 5. A set of samples that is not linearly separable.


Is there a transformation of the original problem to a different space, where it becomes linearly separable? Yes, there is. We may denote this transformation as a function \phi(\vec{x}) . Hmmm… but take a look at equation 9 from the last paragraph. Didn’t we agree that the problem depends on the dot product of two sample vectors? I think we did. So instead of using the dot product of two feature vectors, we may use the dot product of their transformed forms \phi(\vec{x_i}) \cdot \phi(\vec{x_j}) . For convenience we may introduce a function K(\vec{x_i},\vec{x_j}) = \phi(\vec{x_i}) \cdot \phi(\vec{x_j}). This function is called a kernel function. We do not have to know the exact transformation \phi(\vec{x}) ; we just need a kernel function that provides us the result of the dot product in the transformed space.

One of the most popular kernel functions is a kernel that looks kind of like a Gaussian. Let us take a look.

Equation 10.

K(\vec{x_i},\vec{x_j}) = e^{-\frac{|\vec{x_i} - \vec{x_j}|}{\sigma}} 

Depending on the choice of \sigma we may control the spread of the values of our kernel. There are also other kernel functions; please google them if you want a wider picture.


There is one more thing left that I should mention. While training the classifier, we may fall into the trap of overfitting: a classifier may obtain a very good result for the training set, but be quite weak at generalization (classifying samples that were not present during training). There is a little bit more to say about it, and if you are not familiar with the concept of regularization, please review it, as it is a very important matter in machine learning.

In the case of SVM we may use L1 regularization and change our equations to look slightly different.

Equation 11.

min \frac{1}{2}|\vec{w}|^2 + B \sum_{i=1}^{n} \delta_i 
y_i (\vec{w} \cdot \vec{x_i} + b) >= 1 - \delta_i, \quad \delta_i >= 0 

Now there is some math, that we need to apply. But let me skip this stage and present the result directly. What we get is:

Equation 12. 

0 <= \alpha_i <= B 

How do we choose B? Hmm… it looks like an opportunity to write another article about it. Do Cross Validation and Learning Curves sound familiar to you? If yes, then you are done; if not, you may google them, or I will try to explain them soon 🙂

Learning using Sequential Minimal Optimization (SMO)

Even though the concept of SVM has been known for a long time, it did not get a wide scope of usage because of the lack of an efficient learning algorithm. Thankfully, the idea of SMO came along.

The idea is based on a learning method called coordinate ascent, which might be described by the pseudocode below. Let us consider an unconstrained optimization problem: max W (\alpha_1, \alpha_2, ..., \alpha_n).

Listing 1.

repeat until convergence
   for i in [1..n] do
    \alpha_i = argmax_{\alpha_i} W (\alpha_1, ...,\alpha_{i-1}, \alpha_{i}, \alpha_{i+1}, ..., \alpha_n)  

What does the code above do? It chooses a parameter \alpha_i and considers the rest of the parameters as constants. Then W is optimized with respect to the chosen parameter. Usually the choice is made using some heuristic, e.g. the parameter that will bring the biggest optimization step is chosen. However, in the example above we simply iterate over all the parameters [1..n].

It looks good, but in our case it is not going to be that simple. As you remember, we have some constraints in our problem that make the usage of coordinate ascent less straightforward. Do you remember this formula?

\sum_{i=1}^{n} \alpha_i y_i = 0

This is what blocks us from using coordinate ascent directly. Let me show you why.

Let us assume we want to keep \alpha_2, \alpha_3, ..., \alpha_n fixed and we try to optimize with respect to \alpha_1 . We may rewrite the previous equation in the following way.

\alpha_1 y_1 = -\sum_{i=2}^{n} \alpha_i y_i
or, when we multiply both sides by y_1 (keep in mind it takes values 1 and -1, so y_1^2 = 1):
\alpha_1 = - y_1 \sum_{i=2}^{n} \alpha_i y_i

So as we can see, \alpha_1 is determined by the other \alpha_i‘s. Now it becomes clear why we can’t use coordinate ascent as it is given. Thus we need to update at least two \alpha_i at the same time. And here we are: this is the SMO algorithm.

Listing 2.

repeat until convergence
   1. Select some pair \alpha_i , \alpha_j  (using some heuristic).
   2. Reoptimize W(\alpha) with respect to \alpha_i , \alpha_j 
      and keep the rest of the parameters fixed.

And here comes the best part: the SMO algorithm is extremely efficient. Let me explain. Let us say that we choose \alpha_1 , \alpha_2 to be updated. We get the following equation.

\alpha_1 y_1 + \alpha_2 y_2 = - \sum_{i=3}^{n} \alpha_i y_i

As the value on the right-hand side is constant, we may rewrite it as follows:

\alpha_1 y_1 + \alpha_2 y_2 = C

This gives us the possibility of redefining \alpha_1 as a function of \alpha_2.

\alpha_1 = (C - \alpha_2 y_2 )y_1

Now our optimization problem becomes way simpler than previously, as we are going to optimize with respect to only one parameter.

W(\alpha_1, \alpha_2, ..., \alpha_n) = W((C - \alpha_2 y_2 )y_1, \alpha_2, ..., \alpha_n)

When you keep in mind that our W is actually our L (the Lagrangian), we get a quadratic function of \alpha_2. We can easily calculate the derivative and set it to zero to obtain the optimum.

There is one more thing to mention. Do you remember the extra constraint from the paragraph about regularization? I mean this one: 0 <= \alpha_i <= B. It tells us that the values of \alpha_1, \alpha_2 have to be located within the box [0,B] x [0,B]. We have also just agreed that \alpha_1 y_1 + \alpha_2 y_2 = C. This introduces another constraint, L <= \alpha_2 <= H , since otherwise \alpha_1, \alpha_2 cannot simultaneously satisfy both the box and the straight line constraint.
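For completeness, the bounds L and H can be written down explicitly. This is quoted from the standard SMO derivation (with our B in the role usually played by C), not derived here, so treat it as a reference result:

```latex
% If y_1 \neq y_2, the line \alpha_1 y_1 + \alpha_2 y_2 = C has slope 1:
L = \max(0, \alpha_2 - \alpha_1), \qquad H = \min(B, B + \alpha_2 - \alpha_1)
% If y_1 = y_2, the line has slope -1:
L = \max(0, \alpha_1 + \alpha_2 - B), \qquad H = \min(B, \alpha_1 + \alpha_2)
```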

Figure 6. The graphical explanation of the constraints and clipping.


Now we have everything we need to formulate the update rule, which looks as follows:

\alpha_{2 new} = argmax  W((C - \alpha_2 y_2 )y_1, \alpha_2, ..., \alpha_n) 
if \alpha_{2 new} > H then
    \alpha_{2 new} = H
else if  \alpha_{2 new} < L then
    \alpha_{2 new} = L


As I mentioned in the beginning, SVM is a really powerful tool used in numbers of machine learning applications. The large margin approach makes our decisions more accurate, and thanks to SMO we can train the classifier very efficiently. It is also worth remembering that by introducing the idea of kernels we mitigate the problem of data not being linearly separable, which greatly increases the number of problems we may solve. It also gives us a motto, “When you get stuck, try to change the perspective”, which might be applied not only to machine learning problems.

I hope that the article gave you some intuition and explained the basic concepts used in developing the idea of SVM.

Thanks for reading!

Joe Kidd


References and further reading

1. Artificial Intelligence by Patrick Winston,
2. Machine Learning by Andrew Ng.

K-way merge

There is a certain number of sorted arrays, each containing numbers. We want to take one element from every array in such a way that the difference between the chosen elements is the lowest.

For simplicity let’s say N=3 and the arrays are like:

A = {4, 10, 15, 20}
B = {1, 13, 29}
C = {5, 14, 28}

Now we need to choose one element from each array, let’s call them a, b, c, such that |a – b| + |b – c| + |c – a| is minimal. For this case the solution is a = 15, b = 13, c = 14. But how do we find it?

To solve the problem we are going to use a min-heap data structure of constant size. In our case the size of the heap is equal to the number of arrays we process, so size = 3. In the first step we put the first (smallest) element of every array on the heap and store the current difference. Then we remove the element from the top of the heap and replace it with the next element from the array the removed element came from. Inserting a new element into the heap rearranges the elements so that the minimal one is always on top. We store the new difference if it is smaller than the previous one, and we repeat until one of the arrays becomes empty. Let’s take a look at an example:

H = {1, 4, 5}, diff = 8
we remove 1 and put 13 as 1 belonged to the same array
H = {4, 5, 13}, diff = 18
we remove 4 and put 10 as 4 belonged to the same array
H = {5, 10, 13}, diff = 16
we remove 5 and put 14 as 5 belonged to the same array
H = {10, 13, 14}, diff = 8
we remove 10 and we put 15 as 10 belonged to the same array
H = {13, 14, 15}, diff = 4

I was inspired to describe this solution, which is called a k-way merge, after meeting several questions related to it on careercup, but without a very easy explanation. I hope it helps a bit. Actually it helps at least me 🙂

Best Regards

Joe Kidd

Dutch flag problem – variation

We are given an array and we need to sort it in such a way that all negative numbers go to the beginning, all positive numbers to the end, and the relative order of the numbers is not changed. For example: [-1, 4, 0, -2, 2] => [-1, -2, 0, 4, 2].

Basically I promised myself to stop updating this blog for a while, as I am kinda busy recently, but this question surprised me: I couldn’t imagine there would be such a long discussion related to it.

The problem has been described elsewhere, and I can’t believe no one has noticed such a simple solution. So basically we are given an array of numbers, where some of them are negative and some positive. We need to sort the array so that negative numbers go first and positive ones after, but the relative order is not changed. This means that if we have an array [-1,3,-3,2], the answer should be [-1,-3,3,2]. So what we need is a stable sort. However, that alone would still not be enough. But what if we don’t perceive the numbers as numerical values, but assign them one of three types: positive, null, negative? Then we are actually done.

In other words, let’s treat all negative numbers as the value -1 and all positive numbers as the value 1. Let 0 be 0 😉 So we may write the following implementation in C++:

#include <algorithm>
#include <vector>

bool cmp(int a, int b)
{
    // consider all negative numbers as the same class
    if (a < 0 && b < 0) return false;
    // consider all positive numbers as the same class
    else if (a > 0 && b > 0) return false;
    // otherwise a normal comparison will do; this also covers
    // the case when a == 0 or b == 0
    else return a < b;
}

void special_sort(std::vector<int>& special_vector)
{
    // stable_sort keeps the relative order of equivalent elements,
    // which is exactly what the problem requires
    std::stable_sort(special_vector.begin(), special_vector.end(), cmp);
}

So what we did: we solved this much-discussed problem using nothing more than a stable sort from the STL, getting a complexity of O(n log n) that matches the best solutions discussed there 🙂

Best Regards

Joe Kidd

Useful sums and formulas

In algorithmic challenges you may very often meet problems that are somehow related to mathematical series, distributions, or combinatorics. As I do not always remember all the formulas, I have created a quick reference.

1. Arithmetic Progression and Series

a) Sum
S_n = \frac{n}{2}(a_1 + a_n)
b) N-th element of the progression
a_n = a_1 + (n-1)d
Where d stands for the common difference.

2. Geometric Progression and Series

a) Sum
S_n = a_1\frac{(1 - r^n)}{1-r}
b) N-th element of progression
a_n = a_1 r^{n-1}
Where r stands for the common ratio.

3. Gauss sum
\sum\limits_{k=1}^n k = \frac{n(n + 1)}{2}

4. Gauss series
\sum\limits_{k=0}^{\infty} (k+1)z^k = \frac{1}{(1 - z)^2}, \quad |z| < 1

5. Sum of squares
\sum\limits_{k=1}^{n} k^2 = \frac{n(n+1)(2n+1)}{6}

6. Factorials sum
\sum\limits_{k=0}^{n} k(k+1) = \frac{n(n+1)(n+2)}{3}

7. Binomial theorem
\sum\limits_{k=0}^{n}  \binom {n} {k}x^ky^{n-k} = (x+y)^n

8. Logarithmic series
\ln\left(\frac{1}{1-z}\right) = \sum\limits_{k=1}^{\infty} \frac{z^k}{k}, \quad |z| < 1

In the next article we will summarise probability.

Best Regards

Joe Kidd

Check if for a given matrix size, you are able to visit every cell

This problem appeared in TopCoder SRM 594; here I describe a simplified version of it. To sum up, we are given a matrix of size N x M and we need to check whether it's feasible to visit every cell in the matrix, traveling from cell to cell according to a simple rule: from position [i, j] we may go only to [(i+1) mod N, (j+1) mod M], i.e. the moves wrap around the edges. This problem is REALLY cool and I strongly encourage you to think about it before reading the solution!

The simplest solution would be just to create such a matrix and then start visiting it, until we either realize we can't visit it completely, or we do. What should the starting position of such a traversal be? As we need to visit all the cells, we may choose any of them as the starting position. Although this solution would work, it's not satisfactory, as the complexity is O(M*N), which sucks.

Instead of jumping to an algorithm, it would be good to stop here for a while and wonder when the matrix can be fully visited at all.

Did you really think? OK, so then surely you know the solution. Anyway, I will provide it here for those who cheated. The answer is: when the GCD of M and N equals 1. Why? Because in that case the least common multiple of M and N equals M*N. Keep this fact in mind. What does it mean for us? It means the first cell that is going to be visited a second time will be visited after M*N moves. As the size of the matrix is M*N, all the cells of the matrix will have been visited before we reach any cell a second time. Otherwise, when we meet an already visited cell in fewer than M*N steps, we are sure that we skipped some cells and we will never reach them, as the traversal rule remains the same.

Best Regards

Joe Kidd

GCD of two numbers when one is very big

On the standard input, we are given two numbers, A and B, where 0 <= B <= 40000 and B <= A < 10^250. We need to find their greatest common divisor.

The statement of this problem originally comes from CodeChef. The problem itself is very simple, if you know the trick. The well-known GCD algorithm is:

gcd(A, B) = gcd(B, A % B)
gcd(A, 0) = A

Yeah, but keep in mind that one number is pretty big and will not fit in a standard integer type, so the code above is not directly usable. We could go for a type like BigInteger, but it's not really necessary. Let's imagine that the big number is given as a string. We can then construct its value modulo B from the string iteratively. When we recall the following property of modular arithmetic, we are practically home:

(A + B + C + D) \% N = (A\%N + B\%N + C\%N + D\%N)\%N

This leads us to the following conversion code:

unsigned int convert(const string& s, unsigned int b)
{
    unsigned int answer = 0;
    for (size_t i = 0; i < s.size(); i++) {
        answer *= 10;
        answer += s[i] - '0';
        answer %= b; // keep the partial value reduced modulo b
    }
    return answer;
}

Now, if we have the result of the convert method, we may use it as the second argument of the GCD algorithm, since gcd(A, B) = gcd(B, A % B) = gcd(B, convert(A, B)).

Best Regards

Joe Kidd

Greatest Common Divisor of a sequence of numbers

It's well known how to calculate the greatest common divisor of two numbers. But what if we are given a sequence of them and are expected to return the GCD of the whole sequence?

This is probably not necessary, but just to keep things in order, the greatest common divisor algorithm might be defined in the following way:

gcd(A, B) = gcd(B, A % B)
gcd(A, 0) = A

The solution of the problem lies in a property of the greatest common divisor: gcd is associative, which might be expressed as:

gcd([A,B,C,D, ..., Z]) = gcd(gcd(gcd(gcd(gcd(A,B), C) D),...), Z)

where A, B, C, D, … are some integers. Done 🙂

Best Regards

Joe Kidd

Convert a mathematical expression into RPN

This is a very common technique and it has been described in thousands of places. We are given a string representing a mathematical expression, and we are expected to convert it into Reverse Polish Notation, which makes evaluation way easier. As an example, let's consider the expression 2 + 5 * 4, which in RPN has the form 2 5 4 * +.

So we have an expression given in infix notation and we need to convert it to RPN. Not to make the problem too difficult, we assume there are no function operators, no argument separators and no unary minus. So basically we have only non-negative numbers, parentheses and the basic arithmetic operations +, -, *, /. And what is most important, we assume the expression is correct, so there is no need to validate it (even though that would be pretty simple, it's not part of the problem).

The conversion algorithm was created by Dijkstra and is called the shunting-yard algorithm. The algorithm uses two data structures, an output queue Q and an operator stack S:

while there are tokens to read
    t = the next token
    if isNumber(t) then Q.push(t)
    if isOperator(t) then
        while !S.empty() and isOperator(S.top())
              and priority(S.top()) >= priority(t)
            Q.push(S.pop())
        S.push(t)
    if isLeftParenthesis(t) then S.push(t)
    if isRightParenthesis(t) then
        while !isLeftParenthesis(S.top())
            Q.push(S.pop())
        S.pop() // discard the left parenthesis
while S.empty() == false
    Q.push(S.pop())

The interesting part is the priority comparison in the operator branch. As we know, the operations * and / have higher priority than + and -, so they should be executed first: we pop them from the stack and push them into the output queue. We also pop operators of equal priority, because all four operators are left-associative.

Best Regards

Joe Kidd

Counting factorials of relatively small numbers

We are given a number, let's say 345. It doesn't look big? It is, once you try to calculate its factorial. I will present a solution that may be required in such a problem; it's really a technique for big-number multiplication.

This solution is quite tricky but very easy to implement and explain. I think it's worth being aware of, as it looks simple enough to be an interview problem. We may just keep the big number in an array, least significant digit first, and calculate the multiplication in the way shown below. So let's say we have two numbers A = 345 and B = 15, and A is kept in an array A = [5, 4, 3]. Now we multiply in the following way:

carry = 0
Step 1: t = array[0] * B + carry = 5 * 15 + 0 = 75
        array[0] = t % 10 = 5;  carry = t / 10 = 7
Step 2: t = array[1] * B + carry = 4 * 15 + 7 = 67
        array[1] = t % 10 = 7;  carry = t / 10 = 6
Step 3: t = array[2] * B + carry = 3 * 15 + 6 = 51
        array[2] = t % 10 = 1;  carry = t / 10 = 5

Note the temporary t: we must compute the carry from the old digit, before it gets overwritten.

Because the carry is greater than zero, we need to append its digits to the array, and the result array is [5, 7, 1, 5], which represents the number 5175. You may check that 345 * 15 = 5175 indeed.

Best Regards

Joe Kidd