Casual Articles

    Search Engines vs. SEO Spam: Statistical Methods

    High placement in a search engine is critical for the success of any online business. Pages that appear higher in the search results for queries relevant to a site's business receive more targeted traffic. To gain this competitive advantage, Internet companies employ various SEO techniques to optimize the factors that search engines use to rank results. In the best case, SEO specialists create relevant, well-structured, keyword-rich pages that not only please a search engine crawler but also have value for the human visitor. Unfortunately, this strategic approach takes months to produce tangible results, so many search engine optimizers resort to so-called "black-hat" SEO.

    'Black Hat' SEO and Search Engine Spam

    The oldest and simplest "black-hat" SEO strategy is adding a variety of popular keywords to web pages to make them rank high for popular queries. This behavior is easily detected, since such pages generally include unrelated keywords and lack topical focus. With the introduction of term vector analysis, search engines became immune to this sort of manipulation. However, "black-hat" SEO went one step further and created so-called "doorway" pages: tightly focused pages consisting of a bunch of keywords relevant to a single topic. In terms of keyword density such pages are able to rank high in search results, but they are never seen by human visitors, who are redirected to the page intended to receive the traffic. Another trend is the abuse of link-popularity-based ranking algorithms such as PageRank with the help of dynamically generated pages. Each such page receives the minimum guaranteed PageRank, and the small endorsements from thousands of these pages add up to a sizeable PageRank for the target page. Search engines constantly improve their algorithms to minimize the effect of "black-hat" SEO techniques, but SEOs persistently respond with new, more sophisticated and technically advanced tricks, so the process resembles an arms race.

    "Black-hat" SEO is responsible for the immense amount of search engine spam: pages and links created solely to mislead search engines and boost rankings for client web sites. To weed out web spam, search engines can use statistical methods that compute distributions for a variety of page properties. The outlier values in these distributions can be associated with web spam. The ability to identify web spam is extremely valuable to search engines, not just because it allows them to exclude spam pages from their indices but also because those pages can be used to train more sophisticated machine learning algorithms capable of battling web spam with higher precision.

    Using Statistics to Detect Search Engine Spam

    An example of applying statistical methods to detect web spam is presented in the paper "Spam, Damn Spam and Statistics" by Dennis Fetterly, Mark Manasse and Marc Najork from Microsoft. They used two sets of pages downloaded from the Internet. The first set (Set 1) was crawled repeatedly from November 2002 to February 2003 and consisted of 150 million URLs. For each page the researchers recorded the HTTP status, time of download, document length, number of non-markup words, and a vector indicating the changes in page content between downloads. A sample of this set (751 pages) was inspected manually and 61 spam pages were discovered, or 8.1% of the sample, with a confidence interval of 1.95% at 95% confidence.
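    The quoted interval is consistent with the usual normal-approximation margin of error for a sample proportion. The sketch below assumes that this formula (with z = 1.96) is what lies behind the figure, which the article does not state; the function name is illustrative.

```python
import math


def margin_of_error(successes, n, z=1.96):
    """Half-width of the normal-approximation interval for a proportion.

    Assumption: the article's 1.95% figure comes from this standard formula.
    """
    p = successes / n
    return z * math.sqrt(p * (1 - p) / n)


# Sample from the first set: 61 spam pages among 751 inspected pages.
p_set1 = 61 / 751
moe_set1 = margin_of_error(61, 751)
print(f"Set 1 sample: {p_set1:.1%} spam, margin {moe_set1:.2%}")
```

    The same function can be applied to the manual sample from the second set (37 spam pages out of 535).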

    The second set (Set 2) was crawled between July and September 2002 and comprises 429 million pages and 38 million HTTP redirects. For this set the following properties were recorded: the URL and the URLs of outgoing links and, for HTTP redirects, the source and the target URL. Of this set, 535 pages were manually inspected and 37 of them (6.9%) were identified as spam.

    The research concentrates on studying the following properties of web pages:

    • URL properties, including length and percentage of non-alphabetical characters (dashes, digits, dots etc.).
    • Host name resolutions.
    • Linkage properties.
    • Content properties.
    • Content evolution properties.
    • Clustering properties.

    URL Properties

    Search engine optimizers often use numerous automatically generated pages to funnel their low PageRank to a single target page. Since the pages are machine generated, we can expect their URLs to look different from those created by humans. The assumption is that these URLs are longer and include more non-alphabetical characters such as dashes, slashes or digits. When searching for spam pages we should consider the host component only, not the entire URL down to the page name.

    Manual inspection of the 100 longest host names revealed that 80 of them belong to adult sites and 11 to financial and credit-related sites. Therefore, in order to produce a spam identification rule, the length property has to be combined with the percentage of non-alphabetical characters. In the given set, 0.173% of URLs are at least 45 characters long and contain at least 6 dots, 5 dashes or 10 digits, and the vast majority of these pages appear to be spam. By changing the threshold values we can change the number of pages flagged as spam and the number of false positives.
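    The rule above can be sketched as a small predicate. This is a minimal illustration rather than the researchers' implementation: the function name, the parameter names and the example host names are hypothetical, and applying the test to the host component follows the remark made earlier in this section.

```python
from urllib.parse import urlsplit


def looks_machine_generated(url, min_len=45, min_dots=6, min_dashes=5, min_digits=10):
    """Flag a URL whose host name is long AND rich in dots, dashes or digits.

    The default thresholds are the ones quoted in the article; the function
    itself is an illustrative assumption.
    """
    host = urlsplit(url).hostname or ""
    if len(host) < min_len:
        return False
    dots = host.count(".")
    dashes = host.count("-")
    digits = sum(ch.isdigit() for ch in host)
    return dots >= min_dots or dashes >= min_dashes or digits >= min_digits


# Hypothetical examples: a long, dash-heavy host versus a short host with a long path.
suspicious = "http://cheap-loans-fast-credit-cards-online-now.example.com/"
ordinary = "http://www.example.com/a-very-long-path-with-many-dashes-1234567890"
print(looks_machine_generated(suspicious), looks_machine_generated(ordinary))
```

    Raising or lowering the keyword arguments trades the number of flagged pages against the number of false positives, as described above.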

    Host Name Resolutions

    One can notice that Google, given a query q, tends to rank a page higher if the host component of the page's URL contains keywords from q. To exploit this, search engine optimizers stuff pages with URLs containing popular keywords and key phrases and set up DNS servers to resolve these host names to a single IP. Generally, SEOs generate a large number of host names to rank for a wide variety of popular queries.

    This behavior can also be detected relatively easily by observing the number of host names that resolve to a single IP. In the data set, 1,864,807 IP addresses are mapped to only one host name, and 599,632 IPs to two host names. There are also some extreme cases with hundreds of thousands of host names mapped to a single IP, and the record-breaking IP is referred to by 8,967,154 host names.

    To flag pages as spam, a threshold of 10,000 name resolutions was chosen. About 3.46% of the pages in Set 2 are served from IP addresses referred to by 10,000 or more host names, and manual inspection of a sample showed that, with very few exceptions, they were spam. A lower threshold (1,000 name resolutions, or 7.08% of the pages in the set) produces an unacceptable number of false positives.
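    The counting step behind this threshold can be sketched as follows. The input format (pairs of host name and IP address), the function name and the sample addresses are assumptions made for illustration.

```python
from collections import defaultdict


def ips_over_threshold(resolutions, threshold=10_000):
    """Return {ip: number of distinct host names} for IPs at or above the threshold.

    `resolutions` is an iterable of (host_name, ip) pairs; this layout is an
    assumption, not a format described in the article.
    """
    hosts_by_ip = defaultdict(set)
    for host, ip in resolutions:
        hosts_by_ip[ip].add(host)
    return {ip: len(hosts) for ip, hosts in hosts_by_ip.items() if len(hosts) >= threshold}


# Tiny hypothetical data set; a threshold of 3 stands in for the article's 10,000.
sample = [
    ("a.example", "192.0.2.1"),
    ("b.example", "192.0.2.1"),
    ("c.example", "192.0.2.1"),
    ("d.example", "192.0.2.2"),
]
print(ips_over_threshold(sample, threshold=3))
```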

    Linkage Properties

    The Web, consisting of interlinked pages, has the structure of a graph. In graph terminology, the number of outgoing links of a page is its out-degree, while its in-degree is the number of links pointing to the page. By analyzing out-degree and in-degree values it is also possible to detect spam pages, which appear as outliers in the corresponding distributions.

    In the data set, for example, there are 158,290 pages with out-degree 1301, while according to the overall trend only 1,700 such pages are expected. Overall, 0.05% of the pages in Set 2 have out-degrees that are at least three times more common than the Zipfian distribution suggests, and according to manual inspection of a cross section, almost all of them are spam.

    The distribution of in-degrees is calculated similarly. For example, 369,457 pages have an in-degree of 1001, while according to the trend only 2,000 such pages are expected. Overall, 0.19% of the pages in Set 2 have in-degrees that are at least three times more common than the Zipfian distribution would suggest, and the majority of them are spam.
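    One way to find such over-represented degrees is to fit a power law to the degree histogram and compare each observed count with the fitted value. The sketch below assumes an ordinary least-squares fit in log-log space, which is a simplification and not necessarily how the researchers estimated the trend; all names and the synthetic data are illustrative.

```python
import math
from collections import Counter


def fit_power_law(hist):
    """Least-squares line through (log degree, log count); returns (slope, intercept)."""
    xs = [math.log(d) for d in hist]
    ys = [math.log(c) for c in hist.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def overrepresented_degrees(degrees, factor=3.0):
    """Map each degree whose page count is >= factor * fitted count to (observed, expected)."""
    hist = Counter(d for d in degrees if d > 0)
    slope, intercept = fit_power_law(hist)
    flagged = {}
    for d, observed in hist.items():
        expected = math.exp(intercept + slope * math.log(d))
        if observed >= factor * expected:
            flagged[d] = (observed, expected)
    return flagged


# Synthetic degrees following roughly 10000 / d^2, plus an artificial spike at degree 40.
degrees = [d for d in range(1, 51) for _ in range(round(10000 / d**2))] + [40] * 200
flagged = overrepresented_degrees(degrees)
print(sorted(flagged))
```

    Because the outliers themselves take part in the fit, a production version would probably use a more robust estimate of the trend; the factor of three mirrors the criterion quoted above.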

    Content Properties

    Despite the recent measures taken by search engines to diminish the effect of keyword stuffing, this technique is still used by some SEOs, who generate pages filled with meaningless keywords to promote their AdSense pages. Quite often such pages are based on a single template and even have the same number of words, which makes them especially easy to detect using statistical methods.

    For Set 1 the number of non-markup words in each page was recorded, so we can plot the variance of the word count of pages downloaded from a given host name. The variance is plotted on the x-axis and the word count on the y-axis; both axes are drawn on a logarithmic scale. Points on the left side of the graph, marked in blue, represent cases where at least 10 pages from a given host have the same word count. There are 944 such hosts, accounting for 0.21% of the pages in Set 1. A random sample of 200 of these pages was examined manually: 35% were spam, 3.5% contained no text and 41.5% were soft errors (a page with a message indicating that the resource is not currently available, despite the HTTP status code 200 "OK").
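    A minimal sketch of this check is shown below. It interprets the criterion as "a host with at least 10 pages whose word counts have zero variance", which is an assumption about the plot described above; the input layout, the function name and the host names are hypothetical.

```python
from collections import defaultdict
from statistics import pvariance


def uniform_word_count_hosts(pages, min_pages=10):
    """Return hosts with at least `min_pages` pages that all share one word count.

    `pages` is an iterable of (host_name, word_count) pairs (assumed layout).
    """
    counts_by_host = defaultdict(list)
    for host, word_count in pages:
        counts_by_host[host].append(word_count)
    return sorted(
        host
        for host, counts in counts_by_host.items()
        if len(counts) >= min_pages and pvariance(counts) == 0
    )


# Hypothetical data: a template-driven host and a host with varied pages.
pages = [("template.example", 250)] * 12
pages += [("varied.example", 100 + 7 * i) for i in range(12)]
print(uniform_word_count_hosts(pages))
```

    As the manual sample shows, a hit from such a rule is only a candidate: many of the flagged pages were soft errors rather than spam.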

    Content Evolution

    The natural evolution of content on the Web is slow. Over a period of a week, 65% of all pages will not change at all, while only 0.8% will change completely. In contrast, many spam web pages that are generated in response to an HTTP request, independently of the requested URL, will change completely on every download. Therefore, by looking at extreme cases of content mutation, search engines are able to detect web spam.

    The outliers represent IPs serving pages that change completely every week. Set 1 contains 367 such servers with 1,409,353 pages. Manual examination of a sample of 106 pages showed that 103 (97.2%) were spam, 2 were soft errors and 1 was an adult page, counted as a false positive.
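    Given a per-page measure of week-over-week change, the servers in question can be picked out as sketched below. The change fraction (0.0 for an identical page, 1.0 for a completely different one) stands in for the change vector recorded by the researchers, and the function name, input layout and minimum page count are assumptions.

```python
from collections import defaultdict


def fully_changing_servers(observations, min_pages=2, threshold=1.0):
    """Return IPs whose observed pages all changed by at least `threshold`.

    `observations` is an iterable of (ip, change_fraction) pairs, one per page
    (assumed layout).
    """
    changes_by_ip = defaultdict(list)
    for ip, change in observations:
        changes_by_ip[ip].append(change)
    return sorted(
        ip
        for ip, changes in changes_by_ip.items()
        if len(changes) >= min_pages and min(changes) >= threshold
    )


# Hypothetical observations: one server regenerates every page, the other is mostly static.
observations = [
    ("192.0.2.10", 1.0), ("192.0.2.10", 1.0), ("192.0.2.10", 1.0),
    ("192.0.2.20", 0.0), ("192.0.2.20", 0.3), ("192.0.2.20", 1.0),
]
print(fully_changing_servers(observations))
```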

    Clustering Properties

    Automatically generated spam pages tend to look very similar. In fact, as noted above, most of them are based on the same template and differ only in minor details (such as the keywords inserted into the template). Pages with such properties can be detected by applying cluster analysis to the samples.

    To form clusters of similar pages, the 'shingling' algorithm described by Broder et al. [2] will be used. Figure 7 shows the distribution of the cluster sizes on ...
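    The idea behind shingling can be illustrated with a short sketch: each page is reduced to its set of w-word sequences (shingles), and two pages are compared by the Jaccard resemblance of those sets. The version below compares full shingle sets instead of the sampled sketches used at Web scale; the shingle width, the function names and the example texts are assumptions.

```python
def shingles(text, w=4):
    """Return the set of w-word shingles of `text` (lowercased, whitespace-split)."""
    words = text.lower().split()
    if len(words) < w:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard resemblance of the shingle sets of two texts, between 0.0 and 1.0."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Two hypothetical pages built from one template, and an unrelated page.
page_a = "best prices on cheap flights book now and save money on every trip"
page_b = "best prices on cheap hotels book now and save money on every trip"
page_c = "the quick brown fox jumps over the lazy dog"
print(resemblance(page_a, page_b), resemblance(page_a, page_c))
```

    Pages whose pairwise resemblance exceeds a chosen cutoff can then be grouped into clusters; unusually large clusters are candidates for template-generated spam.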

    Boost Your Affiliate Commissions Overnight!
    The ideal world of affiliate marketing does not necessitate having your own website, dealing with customers, refunds, product discovery or upkeep. It is also one of the more effortless means of launching into an online business that will earn you significant income.Whether you have already become involved with an affiliate program or are only just considering the idea, the following tips will help you boost the income that you can earn.The first one may seem obvious, but many people don't do their homework. You need to know what the best program and products are to support.Certainly, you want to support a program that will permit you to attain the greatest profits in the shortest possible time, but there are several other aspects to consider when choosing a program.Don't only select the ones that have a generous commission arrangement and a stable track record of paying their affiliates efficiently but also make sure the products fit in with your target audience.There are thousands of affiliate programs online which means you can afford to be picky. It's of uppermost importance to select wisely in order to help guarantee your success.2. Write free reports or short ebooks to distribute from your site. There is a great possibility that you will be competing with other affiliates that are advertising the same program. If you begin writing short reports related to the product you are promoting, you'll be able to distinguish yourself from the other affiliates.In the reports, provide some useful information for free. If possi
    semblance to an arms race.

    "Black-hat" SEO is responsible for the immense amount of search engine spam -- pages and links created solely to mislead search engines and boost rankings for client web sites. To weed out the web spam search engines can use statistical methods that allow computing distributions for a variety of page properties. The outlier values in these distributions can be associated with web spam. The ability to identify web spam is extremely valuable to search engine not just because it allows excluding spam pages from their indices but also using them to train more sophisticated machine learning algorithms capable to battle web spam with higher precision.

    Using Statistics to Detect Search Engine Spam

    An example of an application of statistical methods to detect web spam is presented in the paper "Spam, Damn Spam and Statistics" by Dennis Fetterly, Mark Manasse and Marc Najork from Microsoft. They used two sets of pages downloaded from the Internet. The first set was crawled repeatedly from November 2002 to February 2003 and consisted from 150 million URLs. For each page the researches recorded HTTP status, time of download, document length, number of non-markup words, and a vector indicating the changes in page content between downloads. A sample of this set (751 pages) was inspected manually and 61 spam pages were discovered, or 8.1% of the set with a confidence interval of 1.95% at 95% confidence.

    Another set was crawled between July and September 2002 and comprises 429 million pages and 38 million HTTP redirects. For this set the following properties were recorded: URL, URLs of outgoing links; for the HTTP redirects - the source and the target URL. 535 pages were manually inspected and 37 of them were identified as spam (6.9%).

    The research concentrates on studying the following properties of web pages:

    • URL properties, including length and percentage of non-alphabetical characters (dashes, digits, dots etc.).
    • Host name resolutions.
    • Linkage properties.
    • Content properties.
    • Content evolution properties.
    • Clustering properties.

    URL Properties

    Search engine optimizers often use numerous automatically generated pages to massively distribute their low PageRank to a single target page. Since the pages are machine generated we can expect their URLs to look differently from those created by humans. The assumptions are that these URLs are longer and include more non-alphabetical characters such as dashes, slashes or digits. When searching for spam pages we should consider the host component only, not the entire URL down to the page name.

    The manual inspection of the 100 longest hostnames had revealed that 80 of them belong to adult site and 11 refer to the financial and credit related sites. Therefore in order to produce a spam identification rule the length property has to be combined with the percentage of non-alphabetical characters. In the given set 0.173% of URLs are at least 45 characters long and contain at least 6 dots, 5 dashes or 10 digits -- and the vast majority of these pages appear to be spam. By changing the threshold values we can change the number of pages flagged as spam and the number of false positives.

    Host Name Resolutions

    One can notice that Google, given a query q, tends to rank a page higher if the host component of the page's URL contains keywords from q. To utilize this search engine optimizers stuff pages with URLs containing popular keywords and keyphrases and set up DNS servers to resolve these URLs to a single IP. Generally SEOs generate a large number of host names to rank for a wide variety of popular queries.

    This behavior can also be relatively easy detected by observing the number of host name resolutions to a single IP. In our set 1,864,807 IP addresses are mapped to only one host name, and 599,632 IPs -- to 2 host names. There are also some extreme cases with hundreds of thousands host names mapped to a single IP, and the record-breaking IP referred by 8,967,154 host names.

    To flag pages as spam a threshold of 10,000 name resolutions was chosen. About 3.46% of the pages in the Set 2 are served from IP addresses referred by 10,000 and more host names and the manual inspection of this sample proved that with very few exceptions they were spam. Lower threshold (1,000 name resolutions or 7.08% pages in the set) produces an unacceptable amount of false positives.

    Linkage Properties

    The Web consisting of interlinked pages has a structure of a graph. Therefore in graph terminology the number of outgoing links of a page can be referred to as the out-degree, while the in-degree equals to the number link pointing to a page. By analyzing out- and in-degrees values it is also possible to detect spam pages which would represent the outliers in the corresponding distributions.

    In our set for example there are 158,290 pages with out-degree 1301, while according to the overall trend only 1,700 such pages are expected. Overall 0.05% of pages in the Set 2 have out-degrees at least three times more than suggested by the Zipfian distribution, and according to the manual inspection of a cross section, almost all of them are spam.

    Similarly the distribution for in-degrees is calculated. For example 369,457 pages have the in-degree of 1001, while according to the trend only 2,000 such pages are expected. Overall, 0.19% of pages in the Set 2 have in-degrees at least three times more common than the Zipfian distribution would suggest, and the majority of them are spam.

    Content Properties

    Despite the recent measures taken by search engines to diminish the effect of keyword stuffing, this technique is still used by some SEOs who generate pages filled with meaningless keywords to promote their AdSense pages. Quite often such pages are based on a single template and even have the same number of words which makes them especially easy to detect using statistical methods.

    For Set 1 the number of non-markup words in each page was recorded, so we can draw the variance of word count in pages downloaded from a given host name. The variance is plotted on the x-axis and the word count is shown on the y-axis, both axes are drawn on a logarithmic scale. Points in the left side of the graph marked with blue represent cases where at list 10 pages from a given host have the same word count. There are 944 such hosts (0.21% of the pages in Set 1). A random sample of 200 these pages was examined manually: 35% were spam, 3.5% contained no text and 41.5% were soft errors (a page with a message indicating that the resource is not currently available, despite the HTTP status code 200 “OK”).

    Content Evolution

    The natural evolution of the content in the Web is slow. In a period of a week 65% of all pages will not change at all, while only 0.8% will change completely. In contrast many spam SEO web pages generated in response to an HTTP request independent of the requested URL will change completely of every download. Therefore by looking into extreme cases of content mutation we search engines are able to detect web spam.

    The outliers represent IPs serving the pages that change completely every week. Set 1 contains 367 such servers with 1,409,353 pages (97.2%). The manual examination of a sample of 106 pages showed that 103 (97.2%) were spam, 2 were soft errors and 1 adult pages counted as a false positive.

    Clustering Properties

    Automatically generated spam pages tend to look very similar. In fact, as already said above, most of them are based on the same model and have only minor differences (like inserting varying keywords into a template). Pages with such properties can be detected by applying clustering analysis to our samples.

    To form clusters of similar pages the 'shingling' algorithm described by Broder et al. [2] will be used. Figure 7 shows the distribution of the cluster sizes on

    What Makes A Small Business Owner's Life So Stressful
    Ego Makes Us Do Crazy Things – Even Work Harder, Longer, Alienate Our Families and Make Less Money.American Express recently did a survey of Canadian small business owners as part of their overall marketing strategy to determine the attitudes, perceptions and insights of small business owners towards starting and running a small business and what keeps them motivated.The results were staggering but not surprising. Here are some of the take away points.Small business owners put in an average of 55 hours a weekSmall business owners spend almost every weekend catching up on work and with no plans to take any time-off but they wouldn't have it any other way.62% of respondents said not taking regular vacations to decompress and re-energize doesn’t bother them.64% said they wouldn’t think twice about doing it all again.The survey goes on to say that the respondents said they wouldn’t have it any other way. The thrill of running a business is derived form having a hand in every part of it.Most small business owners surveyed said the number one reason for not taking more time off was due to not wanting to relinquish control and the worries about not making money while they are out of the office.A small portion of the business owners expressed that more skilled staff would make running the business easier but most of the 39% are not willing to put the necessary time and resources into training and developing staff to help alleviate the strain.This is exactly why so many small business owners are overworked, not
    s etc.).
  • Host name resolutions.
  • Linkage properties.
  • Content properties.
  • Content evolution properties.
  • Clustering properties.

  • URL Properties

    Search engine optimizers often use numerous automatically generated pages to massively distribute their low PageRank to a single target page. Since the pages are machine generated we can expect their URLs to look differently from those created by humans. The assumptions are that these URLs are longer and include more non-alphabetical characters such as dashes, slashes or digits. When searching for spam pages we should consider the host component only, not the entire URL down to the page name.

    The manual inspection of the 100 longest hostnames had revealed that 80 of them belong to adult site and 11 refer to the financial and credit related sites. Therefore in order to produce a spam identification rule the length property has to be combined with the percentage of non-alphabetical characters. In the given set 0.173% of URLs are at least 45 characters long and contain at least 6 dots, 5 dashes or 10 digits -- and the vast majority of these pages appear to be spam. By changing the threshold values we can change the number of pages flagged as spam and the number of false positives.

    Host Name Resolutions

    One can notice that Google, given a query q, tends to rank a page higher if the host component of the page's URL contains keywords from q. To utilize this search engine optimizers stuff pages with URLs containing popular keywords and keyphrases and set up DNS servers to resolve these URLs to a single IP. Generally SEOs generate a large number of host names to rank for a wide variety of popular queries.

    This behavior can also be relatively easy detected by observing the number of host name resolutions to a single IP. In our set 1,864,807 IP addresses are mapped to only one host name, and 599,632 IPs -- to 2 host names. There are also some extreme cases with hundreds of thousands host names mapped to a single IP, and the record-breaking IP referred by 8,967,154 host names.

    To flag pages as spam a threshold of 10,000 name resolutions was chosen. About 3.46% of the pages in the Set 2 are served from IP addresses referred by 10,000 and more host names and the manual inspection of this sample proved that with very few exceptions they were spam. Lower threshold (1,000 name resolutions or 7.08% pages in the set) produces an unacceptable amount of false positives.

    Linkage Properties

    The Web consisting of interlinked pages has a structure of a graph. Therefore in graph terminology the number of outgoing links of a page can be referred to as the out-degree, while the in-degree equals to the number link pointing to a page. By analyzing out- and in-degrees values it is also possible to detect spam pages which would represent the outliers in the corresponding distributions.

    In our set for example there are 158,290 pages with out-degree 1301, while according to the overall trend only 1,700 such pages are expected. Overall 0.05% of pages in the Set 2 have out-degrees at least three times more than suggested by the Zipfian distribution, and according to the manual inspection of a cross section, almost all of them are spam.

    Similarly the distribution for in-degrees is calculated. For example 369,457 pages have the in-degree of 1001, while according to the trend only 2,000 such pages are expected. Overall, 0.19% of pages in the Set 2 have in-degrees at least three times more common than the Zipfian distribution would suggest, and the majority of them are spam.

    Content Properties

    Despite the recent measures taken by search engines to diminish the effect of keyword stuffing, this technique is still used by some SEOs who generate pages filled with meaningless keywords to promote their AdSense pages. Quite often such pages are based on a single template and even have the same number of words which makes them especially easy to detect using statistical methods.

    For Set 1 the number of non-markup words in each page was recorded, so we can draw the variance of word count in pages downloaded from a given host name. The variance is plotted on the x-axis and the word count is shown on the y-axis, both axes are drawn on a logarithmic scale. Points in the left side of the graph marked with blue represent cases where at list 10 pages from a given host have the same word count. There are 944 such hosts (0.21% of the pages in Set 1). A random sample of 200 these pages was examined manually: 35% were spam, 3.5% contained no text and 41.5% were soft errors (a page with a message indicating that the resource is not currently available, despite the HTTP status code 200 “OK”).

    Content Evolution

    The natural evolution of the content in the Web is slow. In a period of a week 65% of all pages will not change at all, while only 0.8% will change completely. In contrast many spam SEO web pages generated in response to an HTTP request independent of the requested URL will change completely of every download. Therefore by looking into extreme cases of content mutation we search engines are able to detect web spam.

    The outliers represent IP addresses serving pages that change completely every week. Set 1 contains 367 such servers, with 1,409,353 pages. A manual examination of a sample of 106 pages showed that 103 (97.2%) were spam, 2 were soft errors and 1 was an adult page, counted as a false positive.
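
    The idea can be illustrated with a small sketch. The article does not define "change completely", so the code below assumes a deliberately strict reading: two snapshots of a page share no words at all. Real pages usually share common words, so a production system would need a softer similarity measure. The function names and the (ip, old_text, new_text) input format are hypothetical.

```python
from collections import defaultdict


def word_set(text):
    """Lower-cased set of whitespace-separated words."""
    return set(text.lower().split())


def changed_completely(old_text, new_text):
    """True when two non-empty snapshots have no words in common.

    Assumption: zero word overlap stands in for "changed completely".
    """
    old_words, new_words = word_set(old_text), word_set(new_text)
    return bool(old_words or new_words) and old_words.isdisjoint(new_words)


def fully_mutating_servers(snapshots, min_pages=2):
    """snapshots: iterable of (ip, old_text, new_text), one entry per page.

    Returns the set of IP addresses where every observed page changed
    completely between the two downloads.
    """
    total = defaultdict(int)
    changed = defaultdict(int)
    for ip, old_text, new_text in snapshots:
        total[ip] += 1
        if changed_completely(old_text, new_text):
            changed[ip] += 1
    return {ip for ip, n in total.items() if n >= min_pages and changed[ip] == n}
```

    Grouping by IP address rather than by page follows the text above, where the outliers are servers whose pages all mutate between downloads.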

    Clustering Properties

    Automatically generated spam pages tend to look very similar. In fact, as noted above, most of them are based on the same template and differ only slightly (for example, by inserting varying keywords into the template). Pages with such properties can be detected by applying cluster analysis to our samples.

    To form clusters of similar pages, the 'shingling' algorithm described by Broder et al. [2] is used. Figure 7 shows the distribution of cluster sizes for near-duplicate pages in Set 1. The horizontal axis shows the size of the cluster (the number of pages in the near-equivalence class), and the vertical axis shows how many such clusters Set 1 contains.
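
    For readers unfamiliar with shingling, here is a simplified sketch. It compares full shingle sets pairwise with the Jaccard coefficient, which is only practical for small samples; Broder et al. [2] work with compact sketches of the shingle sets to make the approach scale. The shingle width of 4 words, the 0.5 threshold and all function names are assumptions chosen for illustration.

```python
from collections import Counter


def shingles(text, w=4):
    """Set of all runs of w consecutive words (w-shingles) in the text."""
    words = text.lower().split()
    if len(words) < w:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b):
    """Jaccard coefficient of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_near_duplicates(docs, w=4, threshold=0.5):
    """docs: {doc_id: text}. Returns a list of clusters (lists of doc ids).

    Documents are linked when their resemblance reaches the threshold;
    clusters are the connected components, tracked with union-find.
    """
    ids = list(docs)
    sets = {i: shingles(docs[i], w) for i in ids}
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for n, a in enumerate(ids):
        for b in ids[n + 1:]:
            if resemblance(sets[a], sets[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def cluster_size_distribution(clusters):
    """Map cluster size to the number of clusters of that size."""
    return Counter(len(c) for c in clusters)
```

    The output of cluster_size_distribution() corresponds to the two axes described above: cluster size against the number of clusters of that size.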

    The outliers fall into two groups. The first group did not contain any spam pages; the pages in this group relate to the duplicate content issue instead. The second group, by contrast, is populated predominantly by spam documents: 15 of the 20 largest clusters were spam, containing 2,080,112 pages (1.38% of all pages in Set 1).

    To Sum Up

    The methods described above are examples of a fairly simple statistical approach to spam detection. Real-life algorithms are much more sophisticated and are based on machine learning technologies, which allow search engines to detect and battle spam with relatively high efficiency at an acceptable rate of false positives. Applying spam detection techniques enables search engines to produce more relevant results and ensures fairer competition, based on the quality of web resources and not on technical tricks.

    References:

    1. Dennis Fetterly, Mark Manasse, Marc Najork. "Spam, Damn Spam, and Statistics: Using statistical analysis to locate spam web pages" (2004). Microsoft Research. Available at: http://research.microsoft.com/~najork/webdb2004.pdf

    2. A. Broder, S. Glassman, M. Manasse, and G. Zweig. "Syntactic Clustering of the Web". In 6th International World Wide Web Conference, April 1997.

    Graphics omitted. The full version of the article can be found here: Search Engines vs. SEO Spam: Statistical Methods
