Spam is a term that has been around for ages and has been experienced by countless users browsing the Internet, checking their inboxes, or visiting their social network accounts. The first two cases are covered very well in the literature and there is not much left to say. The last one, however, is an interesting problem, and the countermeasures used in social networks extend the classical factors known from text spam and spurious page detection.
Spam in social networks
Social networks seem to be a fancier means of communication than email. Extra features like walls, timelines, photo albums and nicer interfaces attract young people in particular. Email is still in use, but it seems to have become a more formal medium reserved for serious conversations.
The idea behind Facebook is a network of trust, where the user decides what he wants to share and with whom. This can be achieved through privacy settings, by sending friend requests only to known people, and by confirming requests only from people one knows. Unfortunately, the social phenomenon of gaining popularity and self-promotion via social platforms has undermined the assumptions of the network of trust. The results presented in  show that 41% of Facebook users accept any invitation they receive, even if they have never met the sender before. Moreover, 45% of users click on links posted by their connections even if they don’t know the person in real life.
Most spammers work in an automated way, using bots that execute commissioned tasks. One of the most widely used tactics against bots is asking the user to provide an input that is easy for a human but causes a lot of problems for a machine. The classic example of such a technique is CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart), which every Internet user has encountered at least once.
However, asking a user to solve a CAPTCHA every time he takes an action would be really frustrating and would drive users away from the service. Therefore, there has to be a clever way of deciding when the extra test should be run. The purpose of this article is to give some examples of countermeasures that might be used to determine whether a user is a human or a spamming bot. When the features are applied with the support of machine learning, we should be able to build a quite strong spam filter. Originally this article was much longer, but I have decided to split it into smaller parts.
Usually the first action taken by a spammer is to send a friend request, so that the spammer gains access to the wall and contacts of the target as soon as the request is accepted. The same result might be obtained when a user allows a Facebook application to access his profile. Moreover, there are webpages that require access to user information, and according to my personal observation this strategy appears to be the most effective and successful.
When this first condition is met and the spammer obtains the access he needs, he may start his campaign. Several categories of spammer behaviour have been distinguished:
– displayer: does not send messages, but only displays them on their own pages (e.g. in the about section).
– bragger: posts messages to their own feed so they are visible to friends.
– poster: publishes messages on the victims’ walls.
– whisperer: sends private messages.
The most popular strategy on Facebook seems to be the poster, as it reaches the widest audience. There is also another way of grouping spammers, related to their spamming activity:
– greedy: spam is included in every piece of content that is published.
– stealthy: spam is published only once in a while, while most of the messages are legitimate.
It seems obvious that the second type of spammer is much harder to detect. Luckily, spammers usually apply simple strategies and stealthy bots are rarely met. I guess this might be related to the cost of maintaining a stealthy mechanism, as extra work has to be done to generate the legitimate content. Nevertheless, the stealthy approach is more dangerous than the greedy one: first, the user might not be aware of the fake identity of the sender; secondly, the sender builds trust, as most of the messages are legitimate, which may pave the way for phishing attempts.
Countermeasures for spam detection
The countermeasures used for spam detection depend strictly on the nature of the service. However, there are some common factors that can be applied across platforms. One of them is the Followers/Following Ratio: a comparison of the number of friends (followers) a user has to the number of requests he has sent. A legitimate user is rejected only exceptionally, as he usually sends requests to people who know him. The opposite happens for a spam account, where most of the requests are rejected. As I mentioned in the previous paragraph, this idea might be extended to the problem of applications and webpages asking for access.
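As a minimal sketch of this feature, the ratio can be computed from two counters. The function name and the example numbers below are illustrative assumptions of mine, not part of any platform’s API:

```python
def acceptance_ratio(accepted_requests, sent_requests):
    """Fraction of sent friend requests that were accepted.

    Values close to 1 suggest a legitimate user whose requests are
    rarely rejected; values close to 0 suggest a spam account.
    """
    if sent_requests == 0:
        return 1.0  # nothing sent yet, so nothing suspicious
    return accepted_requests / sent_requests

print(acceptance_ratio(38, 40))   # 0.95, legitimate-looking
print(acceptance_ratio(12, 400))  # 0.03, spam-looking
```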
A very common technique is to count how many accounts are accessed from one host within a time frame. As already mentioned, spammers usually do not perform all the actions manually, as that would be too repetitive and wearisome for a human being; instead, they run bots that do the work for them. A bot may access one account and, after finishing its job, switch to another one. If the platform detects attempts to access numerous accounts from one host within a short period of time, the behaviour looks suspicious. However, if the bot is slow enough and doesn’t switch between accounts too fast, this measure might not be strong enough.
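A sliding-window counter is one straightforward way to implement this check; the window length and threshold below are made-up values that would need tuning in practice:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 600  # assumed 10-minute window
MAX_ACCOUNTS = 5      # assumed threshold; real values need tuning

# host -> deque of (timestamp, account_id) login events
events = defaultdict(deque)

def suspicious_host(host, account_id, now=None):
    """Record a login and flag hosts touching too many distinct
    accounts within the sliding window."""
    now = time.time() if now is None else now
    queue = events[host]
    queue.append((now, account_id))
    # evict events that have fallen out of the window
    while queue and now - queue[0][0] > WINDOW_SECONDS:
        queue.popleft()
    distinct = {account for _, account in queue}
    # a bot that waits longer than the window between account
    # switches evades exactly this check, as noted above
    return len(distinct) > MAX_ACCOUNTS
```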
Another time-based observation is user activity. Obviously, every user accesses the service at different times and performs different kinds of actions. But it is quite uncommon for legitimate users to log in every night at 2 am and send hundreds of similar messages, or to repeat the same or similar action at regular intervals. This kind of activity might be flagged as requiring further investigation. If there is only a finite number of actions a user can take, we may build a probabilistic model describing typical user behaviour during a session. This technique is called anomaly detection and is extremely useful when it is hard to define what kind of behaviour we consider suspicious, but we have plenty of examples of users using the service in a normal way. I will try to take a closer look at this method soon.
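To make the idea concrete, here is a sketch of one possible probabilistic model: a first-order Markov chain over session actions, scored by average log-likelihood. The action names and training sessions are invented for illustration; a real model would be trained on logged sessions:

```python
import math
from collections import Counter

class SessionModel:
    """First-order Markov model over a finite alphabet of user
    actions (the action names used here are hypothetical)."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.transitions = Counter()  # (action, next_action) counts
        self.totals = Counter()       # outgoing counts per action
        self.actions = set()

    def fit(self, sessions):
        for session in sessions:
            for a, b in zip(session, session[1:]):
                self.transitions[(a, b)] += 1
                self.totals[a] += 1
                self.actions.update((a, b))

    def score(self, session):
        """Average log-probability of a session; unusually low
        values mark the session as a potential anomaly."""
        vocab = len(self.actions) or 1
        logp = 0.0
        for a, b in zip(session, session[1:]):
            p = (self.transitions[(a, b)] + self.smoothing) / \
                (self.totals[a] + self.smoothing * vocab)
            logp += math.log(p)
        return logp / max(len(session) - 1, 1)

model = SessionModel()
model.fit([["login", "post", "like", "logout"],
           ["login", "like", "like", "logout"]])
print(model.score(["login", "post", "like", "logout"]))  # typical
print(model.score(["post", "post", "post", "post"]))     # repetitive burst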
We have already mentioned message similarity. Spammers often post messages or comments that are very similar to each other. If the platform identifies a user sending a set of messages that differ only slightly, or do not differ at all, this might be considered an indicator of spam. When numbers of similar messages are sent from different profiles, this has a professional name: a spam campaign. Here, too, we may use a probabilistic model to describe the language of legitimate messages and check new messages against it. This is also something I will try to describe in one of the next posts.
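One simple way to measure message similarity, sketched below, is the Jaccard coefficient over word shingles; the shingle length and any flagging threshold are assumptions to be tuned:

```python
def shingles(text, k=3):
    """Set of k-word shingles of a message."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Similarity in [0, 1]; near-duplicates score close to 1."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# pairs scoring above an assumed threshold (say 0.8) would be
# counted towards a near-duplicate cluster
print(jaccard("buy cheap watches here now",
              "buy cheap watches here today"))
```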
The ratio of sent messages that contain URLs to the total number of sent messages might also be used as a measure. Additionally, spammers use link-shortening services like tinyurl.com to hide the real identity of the target webpage.
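Both signals can be extracted with a few lines of code; the regular expression and the shortener list below are deliberately simplified and far from complete:

```python
import re

URL_RE = re.compile(r"https?://(\S+)", re.IGNORECASE)
# a few well-known shorteners; the list is illustrative only
SHORTENERS = {"tinyurl.com", "bit.ly", "goo.gl", "t.co"}

def url_features(messages):
    """Fraction of messages containing URLs, and containing
    shortened URLs, over a user's message history."""
    with_url = with_shortener = 0
    for msg in messages:
        hosts = [m.split("/")[0].lower() for m in URL_RE.findall(msg)]
        if hosts:
            with_url += 1
            if any(h in SHORTENERS for h in hosts):
                with_shortener += 1
    n = max(len(messages), 1)
    return with_url / n, with_shortener / n
```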
Spammers need to obtain the names of the users they are going to target somehow. One very common technique is to target users from a prepared list, e.g. adding people with the first name “Clint” or “Joe” and the last name “Kidd” or “Eastwood”. This kind of targeting can be detected by analyzing the social graph. Most people have clusters of friends related to school, work or common interests, and it is very likely that the person they are going to add will be associated with one of these clusters in some way, e.g. through common friends, a common set of interests, or place of residence. When a user tries to follow numbers of people with a very low association score to any of his clusters, it should be considered a potential anomaly. This latter measure seems applicable to Facebook, however, but less so to services like Soundcloud, as they do not provide a rich enough social graph.
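A crude proxy for such an association score is the fraction of mutual friends, as sketched below; the `friends_of` mapping is a hypothetical stand-in for the platform’s social graph:

```python
def association_score(friends_of, user, candidate):
    """Fraction of the user's friends who also know the candidate.

    `friends_of` maps an account id to a set of friend ids.
    """
    mine = friends_of.get(user, set())
    theirs = friends_of.get(candidate, set())
    if not mine:
        return 0.0
    return len(mine & theirs) / len(mine)

graph = {"alice": {"bob", "carol"},
         "bob": {"alice", "carol"},
         "spammer": set()}
# a request from someone sharing no mutual friends scores 0.0
print(association_score(graph, "alice", "spammer"))
```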
A very interesting measure that could also be applied is interaction history. Although an account may establish a large number of social links in the social graph, it usually interacts with only a small subset of its friends. A sudden burst in which the account starts to interact with friends it has not interacted with before may be an indicator of spamming activity.
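One way to quantify such a burst, again as an assumed sketch rather than a known implementation, is the share of interactions going to never-before-contacted friends:

```python
def new_contact_ratio(history, todays_interactions):
    """Share of today's interactions directed at friends the account
    has never contacted before; `history` is the set of friends
    contacted in the past."""
    if not todays_interactions:
        return 0.0
    fresh = [f for f in todays_interactions if f not in history]
    return len(fresh) / len(todays_interactions)

# a legitimate account mostly messages the same small circle, so a
# ratio near 1.0 across many messages looks like a burst
print(new_contact_ratio({"bob", "carol"}, ["dave", "eve", "frank"]))
```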
Spammers are reactive
Unfortunately, there is no way to build an ultimate spam filter that would solve the issue of spam forever. Spammers are reactive and they challenge every successful anti-spam measure. As I have already mentioned, a limit on the number of different accounts accessed from one host within a time range might be circumvented by introducing a delay before switching from one account to another. Another example is a text-based filter, which might be attacked by obfuscating words to make them harder to tokenize, e.g. W4TCH3S instead of WATCHES.
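A filter can push back by normalizing common character substitutions before tokenization; the substitution table below covers only a handful of obvious cases and is purely illustrative:

```python
# illustrative digit/symbol substitutions; real obfuscation is richer
LEET = str.maketrans({"4": "A", "3": "E", "1": "I",
                      "0": "O", "5": "S", "7": "T", "@": "A"})

def normalize(token):
    """Map common substitutions back to letters before tokenization,
    so 'W4TCH3S' matches 'WATCHES'."""
    return token.upper().translate(LEET)

print(normalize("W4TCH3S"))  # -> WATCHES
```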
The factors mentioned here are definitely not exhaustive, and there are surely others that might be applied. Moreover, better results can be obtained by choosing several features and combining them, as it then becomes harder to fool the spam filter by manipulating a single feature.
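As an illustration of combining features, the sketch below feeds a few of the ratios discussed above into a logistic regression classifier (using scikit-learn); the tiny hand-made dataset exists only to show the plumbing:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# each row: [acceptance_ratio, url_ratio, new_contact_ratio]
X = np.array([[0.9, 0.1, 0.1],   # legitimate-looking accounts
              [0.8, 0.0, 0.2],
              [0.1, 0.9, 0.9],   # spam-looking accounts
              [0.2, 0.8, 1.0]])
y = np.array([0, 0, 1, 1])       # 0 = ham, 1 = spam

clf = LogisticRegression().fit(X, y)
# estimated probability that an unseen account is spam
print(clf.predict_proba([[0.15, 0.85, 0.95]])[0][1])
```

An attacker would now have to keep all three features in the legitimate range at once, which is exactly the point of combining them.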
I hope that after reading this article you have a general overview of what the identification of suspicious behaviour may look like and what kinds of countermeasures are used to detect spurious activity.
References and further reading
– “Detecting Spammers on Social Networks”, G. Stringhini, C. Kruegel, G. Vigna
– “Towards Online Spam Filtering in Social Networks”, H. Gao, Y. Chen, K. Lee, D. Palsetia, A. Choudhary
– “Survey on Web Spam Detection: Principles and Algorithms”, N. Spirin, J. Han
– “A Survey of Learning-Based Techniques of Email Spam Filtering”, E. Blanzieri, A. Bryl