The Wikipedia Model
Posted by russvirante
This post was originally in YOUmoz, and was promoted to the main blog because it provides great value and interest to our community. The author’s views are entirely his or her own and may not reflect the views of SEOmoz, Inc.
As an SEO agency, Virante has always prided itself on having research-based answers to the questions presented by our clients. A year or so ago, I caught myself referring to a site as having "a great looking natural link profile" without really having any numbers or analysis to describe exactly what that profile should look like. Sure, I could point out a spam link or two, or what looked like a paid link, but could we computationally analyze a backlink profile to determine how "natural" it was?
We dove into this question several months ago while developing automated methods to identify link spam and link graph manipulation. This served dual purposes: we wanted to make sure our clients were conforming to an ideal link model to prevent penalties and, at the same time, wanted to be able to determine the extent to which competitors were scamming their way to SEO success.
Building the Ideal Link Model
The solution was quite simple, actually. We used Wikipedia’s natural link profile to create an expected, ideal link data set and then created tools to compare the Wikipedia data to individual websites…
- Select 500+ random Wikipedia articles
- Request the top 10,000 links from Open Site Explorer for each Wikipedia article
- Spider and Index each of those backlink pages
- Build tools to analyze each backlink on individual metrics
Once the data was acquired, we merely had to identify the different metrics we would like to compare against our client’s and their competitors’ sites and then analyze the data set accordingly. What follows are three example metrics we have used and the tools for you to analyze them yourself.
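The four-step collection process above can be sketched in a few lines of Python. Note that `fetch_backlinks` and `spider_page` are hypothetical stand-ins for an Open Site Explorer request and an HTML fetch, not real APIs:

```python
# Sketch of the data-collection pipeline: sample articles, pull their
# backlinks, and spider each backlink page for later analysis.
import random

def build_wikipedia_dataset(article_urls, fetch_backlinks, spider_page,
                            sample_size=500, max_links=10000):
    """Collect (article, backlink-page HTML) pairs for a random sample
    of Wikipedia articles. fetch_backlinks and spider_page are injected
    so any backlink source and crawler can be plugged in."""
    sample = random.sample(article_urls, sample_size)
    dataset = []
    for article in sample:
        for backlink_url in fetch_backlinks(article)[:max_links]:
            dataset.append((article, spider_page(backlink_url)))
    return dataset
```

With the raw pages in hand, each of the metrics below is just a different way of counting over this dataset.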
Link Proximity Analysis
Your site will be judged by the company it keeps. One of the first and most obvious characteristics to look at is what we call Link Proximity. Most paid and spam links tend to be lumped together on a page, such as 20 backlinks stuffed into a blog comment or a sponsored link list in a sidebar. Thus, if we can create an expected, ideal link proximity curve from Wikipedia's link profile, we can compare it with any site's to identify likely link manipulation.
The first step in this process was to create the ideal link proximity graph. Using the Wikipedia backlink dataset, we determined how many OTHER links occurred within 300 characters before or after each Wikipedia link on the page. If no other links were found, we recorded a 1; if one other link was found, we recorded a 2; and so on. We determined that about 40% of the time, the Wikipedia link was by itself in the content. About 28% of the time there was one other link near it. The numbers continued to descend from there.
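A minimal sketch of this counting scheme, assuming raw HTML pages and a naive regex for anchor tags (a real crawler would use a proper HTML parser):

```python
# Tally link proximity: for each occurrence of the target link, count
# other anchors within a 300-character window on either side.
import re
from collections import Counter

ANCHOR_RE = re.compile(r'<a\s', re.IGNORECASE)

def proximity_counts(pages, target_href):
    """Return a Counter where key 1 = the target link stood alone,
    key 2 = one other link nearby, and so on (matching the convention
    described above)."""
    counts = Counter()
    for html in pages:
        for m in re.finditer(re.escape(target_href), html):
            window = html[max(0, m.start() - 300):m.end() + 300]
            # subtract the target's own anchor tag from the window total
            others = max(len(ANCHOR_RE.findall(window)) - 1, 0)
            counts[others + 1] += 1
    return counts
```

Normalizing the tallies to percentages gives the distribution that can then be plotted against a site's own backlink pages.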
Finally, we plotted these numbers and created a tool to compare individual websites to Wikipedia's model. Below is a graph of a known paid-link user's link proximity compared to Wikipedia's. As you will see, nearly the same percentage of their links are standalone. However, the paid-link user shows a spike at five proximal links that is substantially higher than Wikipedia's average.
Even though paid links represent only ~25% of their link profile, we were able to detect this anomaly quite easily. Here is the Link Proximity Analysis tool so that you can analyze your own site.
White Hat Takeaway: If you are relying on link methods that place your link in a list of others (paid links, spam, blogrolls, etc.), your links can be easily identified. While I can't speak for Google, if I were writing the algorithm, I would stop passing value from any link with 5+ proximal links that falls more than one standard deviation above the mean. Go ahead and run the tool on your site to determine whether it looks suspicious, and make sure that you are within about 18% of Wikipedia's pages for 4+ proximal links.
Source Link Depth Analysis
The goal of Paid Links is to boost link juice. The almighty PageRank continues to be the primary metric which link buyers use to determine the cost of a link. Who buys a PR0 link these days? It just so happens that PageRank tends to be highest on the homepage of sites, so most Paid Links also tend to come from the homepage. This is another straightforward method for finding link graph manipulation – just determine what percentage of the links come from homepages vs. internal pages.
Once again, we began by looking at the top 10,000 backlinks for each of the 500 random Wikipedia pages. We then tallied the folder depth of each linking page. For example, a link from http://www.cnn.com would score a 1, while one from http://www.cnn.com/politics would score a 2. We created a graph of the frequency at which each depth occurred and then created a tool to compare this ideal model to that of individual websites.
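The depth scoring is easy to reproduce. Here is a small sketch using Python's standard library, following the convention above (homepage = 1, one folder deep = 2):

```python
# Score each backlink source by folder depth and build the distribution
# that gets plotted against the Wikipedia curve.
from collections import Counter
from urllib.parse import urlparse

def link_depth(url):
    """Folder depth of a linking page: http://www.cnn.com -> 1,
    http://www.cnn.com/politics -> 2."""
    segments = [s for s in urlparse(url).path.split('/') if s]
    return 1 + len(segments)

def depth_distribution(backlink_urls):
    """Share of backlinks observed at each depth."""
    counts = Counter(link_depth(u) for u in backlink_urls)
    total = sum(counts.values())
    return {depth: n / total for depth, n in counts.items()}
```

The share of depth-1 sources is the homepage-link percentage discussed below.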
Below is an example of a known paid-link user’s site.
As you can see, 79% of their top links come from the homepages of websites, compared to Wikipedia's articles, which average around 30%. SEOmoz, on the other hand, receives only 40% of its links from homepages, well within the standard deviation, and Virante receives 29%. Here is the Source Link Depth Analysis tool so that you can compare your site to Wikipedia's.
White Hat Takeaway: If your link strategy involves getting links primarily from the homepages of websites, the pattern will be easily discernible. Run the tool and determine whether you are safely within 15% of Wikipedia’s pages in terms of homepage links.
Domain Links per Page Analysis
Yet another characteristic we wanted to look at was the number of links per page pointing to the same domain. Certain types of link manipulation like regular press releases, article syndication, or blog reviews tend to build links two and three at a time, all pointing to the same domain. A syndicated article might link to the homepage and two product pages, for example. Our goal was to compare the expected number of links to Wikipedia pages from a linking page to the actual number of links to a particular website, looking for patterns and outliers along the way.
We began again with the same Wikipedia dataset, this time counting the number of links to Wikipedia from each linking page. We tallied up these occurrences and created an expected curve. Finally, we created a tool to compare this curve against that of individual sites.
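A rough sketch of this tally, again assuming raw HTML pages and a naive href regex rather than a production parser:

```python
# For each linking page, count how many of its hrefs point at the target
# domain, then tally the distribution of those per-page counts.
import re
from collections import Counter
from urllib.parse import urlparse

HREF_RE = re.compile(r'href=["\']([^"\']+)["\']', re.IGNORECASE)

def domain_links_per_page(pages, target_domain):
    """Return a Counter: key 1 = pages linking to the domain once,
    key 3 = pages linking to it three times, etc."""
    distribution = Counter()
    for html in pages:
        hits = sum(1 for href in HREF_RE.findall(html)
                   if urlparse(href).netloc.endswith(target_domain))
        if hits:
            distribution[hits] += 1
    return distribution
```

A spike at a particular count, like the three-links spike below, is the kind of outlier this comparison surfaces.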
The example below is a site that heavily relied on paid blog reviews. As you will see, there is a sharp spike in links from pages with three inbound links to the domain.
Caveat: Chances are when you run this tool you will see a spike at position #1. It is worth pointing out that the majority of website homepages tend to fall in this category. When you run this tool, as with the others, you should probably take a second to look at your competitors as well. Is your site closer to Wikipedia’s model than your competitors? That is the question you should be asking first.
White Hat Takeaway: Is your link strategy creating patterns in domain links per page? A natural link graph will show great variation here. Moreover, it is not uncommon for authoritative sites to receive 10+ links to their pages from a single linking page. This should be expected: if your site is the authority, it would make sense for it to be cited several times on a thorough page about your subject matter. Here is the Multiple Links Analysis tool to compare your site to Wikipedia's.
What to Do?
First things first, take every result you get with a grain of salt. We have no reason to believe that Google is using Wikipedia’s backlink profile to model what is and is not acceptable, nor do we pretend to believe that Google is using these metrics. More importantly, just because your site diverges in one way or another from these models does not mean that you are actually trying to manipulate the link graph. If anything, it demonstrates the following…
- If you are manipulating the link graph, it is pretty easy to see it. If Virante can see it, so can Google.
- If you are still ranking despite this manipulation, it is probably because Google hasn't caught up with you yet, or because you have enough natural links to rank despite those that have been devalued.
So, what should you do with these results? If you are using a third-party SEO company to acquire links, take a hard look at what they have done and whether it differs greatly from what a natural link profile might look like. Better yet, run the tools on your competitors as well to see how far off you are compared to them. You don't have to be the best on the Internet, just the best for your keyword.
Tool Links One More Time: