How many web pages do you find valuable?
According to worldwidewebsize, more than 14 billion web pages are indexed on the web, yet you probably can’t name even 1 million useful ones.
What are all these other pages for?
Most of these pages are spam. We all know about the spam we receive by email; even Bill Gates has received a lot of it. Here, though, I am talking about search result spam, which is a much larger problem than email spam. Last night I was watching a discussion between Matt Cutts (Google), Harry Shum (Bing) and Rich Skrenta (Blekko). Even though they have their differences, most of the discussion focused on improving search results and on the challenges they face from spam websites and irrelevant search results.
What can we do?
I thought it was important to discuss this problem with computer engineering graduates on this blog. If you can build a good spam filter for search engines, it would be a great help to millions of users. In this article I will talk about the reasons for spam, what the big companies are doing and, last but not least, how you can develop a spam filtering algorithm.
What are these spam websites and why are they polluting our web?
Spam websites are sites whose content is not relevant to the end user. They are polluting our web for MONEY.
For a better perspective, you need to understand the commercial side of it. Most genuine website owners make money through referral systems or advertisements published on their websites. The amount of money they make is directly proportional to the number of visitors they have, and a large chunk of those visitors come from search engines.
To earn quick money, spam website publishers build their sites for search engines and not for end users. After optimization, the site starts receiving traffic. That doesn’t sound so bad, but when a visitor doesn’t find relevant content on the site, it frustrates him, and he concludes either that this particular search engine is not relevant or that there are no significant results for his query on the internet. The worst part is that those end users are us.
Is it possible to optimize a website without having content?
Yes, it is difficult but possible. Take the example of 1995, when Google’s PageRank algorithm did not exist yet. Anyone could optimize a site by using the query as the domain name and repeating the query string on the page.
Example: to optimize a site for the query “xyz” in the 90s, a spam website owner would buy the domain name xyz dot com and repeat “xyz” many times on the page. When someone searched for “xyz”, that site came up as the first result.
Google has made big changes to its algorithms to filter out these sites, but the spammers have become smarter too. They are now targeting Google’s PageRank algorithm. Even though search engines like Blekko are developing algorithms such as Adspam to block spam sites, there is still a lot of room for improvement.
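To make the "repeating strings" trick concrete, here is a small Python sketch that measures how much of a page's text is just the query word repeated. It only illustrates the idea; the function name `keyword_density` is made up for this example, and no real search engine is claimed to work this way.

```python
import re

def keyword_density(text, keyword):
    """Return the share of words in `text` that are exactly `keyword`."""
    # Lowercase everything and keep only runs of letters and digits as words.
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)
```

A page whose text is mostly the query itself gets a density close to 1, while a normal article about the same topic usually stays far below that.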
What are the common things in spam sites?
The common things in spam sites are:
- They all have some type of advertisement or referral associated with them.
- They do not have more than 20 pages of original content.
- Their domain names are very long.
- They link to many irrelevant websites (sometimes more than 20).
- Sometimes their font color and background color are the same.
- Their links send data along with the URL (the “get” method).
How to develop a search engine spam filtering system?
As mentioned above, if we can recognize spam, we can easily filter it.
To filter out most of the spam, look for sites that match the points above: they carry advertisements or referrals, they have no more than 20 pages of content, their domain name is very long, and they either link to irrelevant websites or use the same font and background color. From experience I can say that 99 percent of the time such a website is not there to help the user but to mislead him for money.
So our spam filtering algorithm should do the following tests:
Count the number of words in the URL (use the fopen and count functions)
Most big and useful sites have short domain names, such as Google, Amazon, Bing, Wikipedia or iProject Ideas. Genuine site owners want visitors to remember their website name, while spam sites depend on search engines.
So check whether the domain name has more than 20 words or not (use an if/else statement).
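A minimal Python sketch of this test might look like the following. Some of it is assumption: the function names `domain_words` and `has_long_domain` are made up, "words" are approximated by splitting the host name on dots and hyphens, and the limit of 20 is simply the number used above, passed in as a parameter so you can tune it.

```python
import re
from urllib.parse import urlparse

def domain_words(url):
    """Split the host name of a URL into word-like pieces (hypothetical helper)."""
    # urlparse only fills in the host when a scheme is present, so add one if missing.
    if "://" not in url:
        url = "http://" + url
    host = urlparse(url).hostname or ""
    # Drop a leading "www." since it says nothing about the site itself.
    if host.startswith("www."):
        host = host[4:]
    # Approximation: treat dots and hyphens as word separators.
    return [piece for piece in re.split(r"[.\-]", host) if piece]

def has_long_domain(url, max_words=20):
    """Return True when the domain has more pieces than max_words."""
    return len(domain_words(url)) > max_words
```

Words that are glued together without hyphens would need a dictionary to split, so treat this count as a lower bound.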
Check the number of links on the page that have the “get” method associated with them (use a regular expression)
The difference between a simple URL and a get method URL is this: a simple URL looks like www . xyz . com/ and a get method URL looks like www . xyz .com?abc=4
I put a lot of emphasis on counting URLs with the get method because referring to another site doesn’t normally require sending any data with the URL. For example, if I link to Wikipedia, I don’t need to send get data with it.
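Here is one way this count could be sketched in Python with a regular expression. The pattern is deliberately simple and only catches quoted href attributes; the name `count_get_links` and the sample page are invented for illustration.

```python
import re
from urllib.parse import urlparse

# Simplified pattern: captures the value of quoted href attributes.
HREF_PATTERN = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def count_get_links(html):
    """Count links in the page whose URL carries a query string."""
    links = HREF_PATTERN.findall(html)
    # urlparse(...).query is the part after "?", empty for a simple URL.
    return sum(1 for link in links if urlparse(link).query)

sample = (
    '<a href="http://www.xyz.com/">plain</a>'
    '<a href="http://www.xyz.com?abc=4">with data</a>'
    '<a href="http://shop.example/item?ref=77&id=3">referral</a>'
)
print(count_get_links(sample))  # prints 2
```

A high count on its own proves nothing, since plenty of honest sites use query strings, so it works best as one signal among the others.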
Check the font and background color
If the font color and the background color are the same, it clearly signifies that the owner of the site has something to hide from the user but wants the search engine to read it.
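A rough Python sketch of this check is below. It makes strong simplifying assumptions: it only looks at inline style attributes written with double quotes, and it compares the color values as plain text, so `#fff` and `white` would not be seen as equal. The names `parse_style` and `has_hidden_text` are hypothetical.

```python
import re

# Matches the content of inline style="..." attributes only.
STYLE_PATTERN = re.compile(r'style\s*=\s*"([^"]*)"', re.IGNORECASE)

def parse_style(style):
    """Turn 'color: #fff; background-color: #fff' into a dict of properties."""
    props = {}
    for declaration in style.split(";"):
        if ":" in declaration:
            name, value = declaration.split(":", 1)
            props[name.strip().lower()] = value.strip().lower()
    return props

def has_hidden_text(html):
    """Return True if any inline style uses the same text and background color."""
    for style in STYLE_PATTERN.findall(html):
        props = parse_style(style)
        color = props.get("color")
        background = props.get("background-color") or props.get("background")
        if color is not None and color == background:
            return True
    return False
```

A real filter would also have to read external stylesheets and inherited colors, which this sketch ignores.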
Check whether the links on the page point to relevant websites
You can check their quality by applying the above steps to the linked sites.
In the paragraphs above I have tried to give you some help with spam filters. I hope it was useful. If you have any questions about this project, you can leave a comment.
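To show how the tests could fit together, here is a small Python sketch. It assumes each signal has already been measured somehow and stored in a record. The `SiteSignals` class, its field names and both functions are made up for illustration, and the thresholds of 20 are the numbers from the list of common things in spam sites.

```python
from dataclasses import dataclass

@dataclass
class SiteSignals:
    """Measurements for one site; all field names are hypothetical."""
    has_ads_or_referrals: bool
    original_pages: int
    long_domain: bool
    irrelevant_links: int
    hidden_text: bool

def looks_like_spam(s, max_pages=20, max_links=20):
    """Flag a site only when the signals described above all fire together."""
    return (
        s.has_ads_or_referrals
        and s.original_pages <= max_pages
        and s.long_domain
        and (s.irrelevant_links > max_links or s.hidden_text)
    )

def spam_link_ratio(linked_sites):
    """Share of linked sites that are themselves flagged by the same rule."""
    if not linked_sites:
        return 0.0
    flagged = sum(1 for site in linked_sites if looks_like_spam(site))
    return flagged / len(linked_sites)
```

A page whose outgoing links mostly lead to flagged sites has a ratio close to 1, which is one possible way to decide that its links are not relevant.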