Wednesday, August 5th, 2020

Posted by Nick Gerner

Update: Why did my Domain Authority change?

In this index update we re-calibrated our Domain Authority metric to better reflect the relationships between all domains on the Internet. This means that many websites’ Domain Authority (DA) changed.

Not to worry! If your Domain Authority went down, so did the Domain Authority of other domains with similar link profiles. Don't think of it as something bad happening to your site; think of it as a change in how we view the entire Internet. You can read more about why we did this in the section below.


The good news is, we have an index update for you!  And it's a couple of days sooner than previously announced.  The bad news is that things were, as you might have noticed, a little rocky this morning.  We had more traffic to the API than ever before, and, being a scrappy startup, we all jumped into action.  Fortunately, thanks to Amazon Web Services, we quickly scaled up our infrastructure and are serving better than ever.  I apologize for any issues this might have caused.

I’ve got a few things to say in this post, so you can skip forward if you like:

Page Authority and Domain Authority Change

 

We’ve gotten a lot of feedback about Page Authority and Domain Authority.  We’re excited about these metrics and are using them to power a lot of what we do: sorting links, crawl selection, keyword difficulty.  But lately, it seemed as if things were getting a little… clumpy.  We were packing our numbers too closely together to give a real sense of the spread in authority over the web.  I’ll defer to Ben and Rand, who are working on this a lot, but just to give you a taste:

SEOmoz PA is now 80

Amazon's PA is 89

As you can see, we've pulled apart a lot of great sites.  This spread (SEOmoz at a PA of 80 versus Amazon.com at 89, for example) better reflects the real differences in authority across the web.
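To make the idea of "un-clumping" concrete, here's a minimal sketch of one way to spread a tightly packed score distribution by mapping raw scores onto percentile ranks. This is purely illustrative and is not the actual Page Authority or Domain Authority calculation.

```python
# Illustrative only: NOT the actual PA/DA formula. This just shows how a
# clumped set of raw scores can be spread out by mapping each score to its
# percentile rank on a 1-100 scale.
def rescale_to_percentiles(raw_scores):
    ranked = sorted(raw_scores)
    n = len(ranked)

    def percentile(score):
        # fraction of scores at or below this one, scaled to 1-100
        at_or_below = sum(1 for s in ranked if s <= score)
        return max(1, round(100 * at_or_below / n))

    return {score: percentile(score) for score in raw_scores}

clumped = [62, 63, 63, 64, 64, 64, 65, 90]   # most sites packed together
print(rescale_to_percentiles(clumped))
# {62: 12, 63: 38, 64: 75, 65: 88, 90: 100} -- same ordering, more spread
```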

This will have effects across tools, including Keyword Difficulty.  So take a minute to check those out and make sure that what we’re showing you matches your intuition.

New API Dashboard

It's actually very fitting that we should have more traffic than ever before, because we've been hard at work on better serving one of our biggest API consumers: you!  Today we're launching our SEOmoz API Dashboard.

SEOmoz API Dashboard

This dashboard will be the place to go to manage your SEOmoz API account.  Right now it shows all of your API usage.  That gives you visibility into your API consumption, which is critical if you're on the paid plan; and if you're on the free plan, it gives you some idea of how heavily your tools are being used.  As we improve what we offer, both in the API and to support application development, you'll see more and more here.

Stats on Nofollow vs Rel=Canonical

Last week I had a great chat with Eric Enge at Stone Temple Consulting.  We talked a little bit about the usage of nofollow and rel=canonical over the last year (a big year for both!), but I didn’t have anything concrete to share at the time.  I dug into it and it’s pretty interesting:

rel=nofollow vs rel=canonical usage

As you can see, rel=canonical is really taking off.  Since we started keeping good stats on its usage in our data, it has grown by about 50% in just six months!  We now see rel=canonical used more than either internal or external nofollows.  Internal nofollows fell off quite a bit about eight months ago, but have been reasonably stable since then.
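For the curious, here's a minimal sketch (standard library only) of how one might tally rel=canonical and nofollow usage on a single page; a production crawler obviously handles far more edge cases than this.

```python
# Minimal sketch: count rel=canonical and nofollow markup on one page using
# only the Python standard library. Real-world tag handling is much messier.
from html.parser import HTMLParser

class RelCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.canonical = 0
        self.nofollow = 0

    def handle_starttag(self, tag, attrs):
        rel = (dict(attrs).get("rel") or "").lower()
        if tag == "link" and "canonical" in rel:
            self.canonical += 1
        elif tag == "a" and "nofollow" in rel:
            self.nofollow += 1

page = ('<link rel="canonical" href="http://example.com/page"/>'
        '<a href="/tmp" rel="nofollow">temp page</a>')
counter = RelCounter()
counter.feed(page)
print(counter.canonical, counter.nofollow)  # 1 1
```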

My hypothesis (without supporting data at the moment) is that two mindsets are winning:

  • use rel=canonical right away
  • if rel=nofollow is already working, leave it

I’ll leave it to the expert SEOs to debate this (in the comments, please!), but that could well be sound advice.

Index Update Stats

Here are some charts and graphs of the data we’ve updated since last month.

Pages in Fresh Linkscape Index

We’re staying on course with our current update rate for pages.  We’ve got updated information for about 43 billion pages.

Links in the Fresh Linkscape Index

And we have a corresponding update for links to and from those pages.

Domains in the Fresh Linkscape Index

We've got two areas of focus for our data updates:

  • Include domains we previously thought of as niche
  • Go deep on domains that are highly authoritative

This is actually a big initiative we’ve been working on this year.  And we’re already seeing great improvements to our data quality.

 I hope you enjoy the data.  As always, feedback is much appreciated!


Posted by Nick Gerner

Last week we updated the Linkscape index, and we're doing it again this week.  As I've pointed out in the past, up-to-date data is critical, so we're pushing everyone around here just about as hard as we can to provide it.  This time we've got updated information on over 43 billion URLs, 275 million subdomains, 77 million root domains, and 445 billion links.  For those keeping track, the next update should be around April 15.

I’ve got three important points in this post.  So for your click-y enjoyment:

Fresh Index Size Over Time

If you’ve been keeping track, you may have noticed a drop in pages and links in our index in the last two or three months.  You’ll notice that I call these graphs "Fresh Index Size", by which I mean that these numbers by and large reflect only what we verified in the prior month.  So what happened to those links?

Linkscape Index Size: Pages

Linkscape Index Size: Links

Note: "March – 2" is the most recent update (since we had two updates this month!)

At the end of January, in response to user feedback, we changed our methodology around what we update and include.  One of the things we hear a lot is, "awesome index, but where’s my site?"  Or perhaps, "great links, but I know this site links to me, where is it?" Internally we also discovered a number of sites that generate technically distinct content, but with no extra value for our index.  One of my favorite examples of such a site is tnid.org.  So we cut pages like those, and made an extra effort to include sites which previously had been excluded.  And the results are good:

Linkscape Index Size: Domains

I’m actually really excited about this because our numbers are now very much in line with Netcraft’s survey of active sites.  But more importantly, I hope you are pleased too.

Linkscape Processing Pipeline

I’ve been spending time with Kate, our new VP of Engineering, bringing her up to speed about our technology.  In addition to announcing the updated data, I also wanted to share some of our discussions.  Below is a diagram of our monthly (well, 3-5 week) pipeline.

Linkscape Index Pipeline

You can think of the open web as having an essentially endless supply of URLs to crawl, representing many petabytes of content. From that we select a much smaller set of pages whose content we update on a monthly basis.  In large part this is due to politeness considerations: there are about 2.6 million seconds in a month, and most sites won't tolerate a bot fetching more than about one page per second.  So we can only get updated content for so many pages per month.
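To put a rough number on that politeness constraint, here's the back-of-the-envelope arithmetic from the paragraph above as a tiny script; the one-second delay is an assumption for illustration, not a measured figure.

```python
# Back-of-the-envelope crawl budget: a polite crawler fetching at most one
# page per second from a site is capped by the number of seconds in a month.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60   # ~2.6 million seconds
CRAWL_DELAY_SECONDS = 1.0               # assumed polite delay per request

pages_per_site_per_month = int(SECONDS_PER_MONTH / CRAWL_DELAY_SECONDS)
print(f"{SECONDS_PER_MONTH:,} seconds -> at most "
      f"{pages_per_site_per_month:,} pages per site per month")
# 2,592,000 seconds -> at most 2,592,000 pages per site per month
```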

From the updated content we fetch, we discover a very large amount of new content, representing a petabyte or more of new data. From this we merge non-canonical forms and remove duplicates, and we synthesize some powerful metrics like Page Authority, Domain Authority, and mozRank.
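Here's a minimal sketch of what the "merge non-canonical forms, remove duplicates" step might look like in miniature; the normalization rules and the content-hash dedupe below are simplified assumptions, not Linkscape's actual pipeline.

```python
# Toy canonicalization + dedup: normalize obvious URL variants, then keep
# only one copy of identical content (identified by a hash). Real pipelines
# handle www vs non-www, rel=canonical hints, near-duplicates, and more.
import hashlib
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    parts = urlsplit(url.lower())
    path = parts.path.rstrip("/") or "/"
    # drop query strings and fragments (e.g. session parameters) entirely
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))

def dedupe(pages):
    """pages: iterable of (url, html) -> {canonical_url: html}"""
    seen_fingerprints, kept = set(), {}
    for url, html in pages:
        fingerprint = hashlib.sha1(html.encode("utf-8")).hexdigest()
        canonical = normalize_url(url)
        if fingerprint in seen_fingerprints or canonical in kept:
            continue
        seen_fingerprints.add(fingerprint)
        kept[canonical] = html
    return kept

pages = [("http://Example.com/a/?sid=123", "<p>hi</p>"),
         ("http://example.com/a", "<p>hi</p>")]
print(list(dedupe(pages)))  # ['http://example.com/a']
```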

Once we've got that data prepared, we drop our old (by then out-of-date) data and push the updated information to our API.  On roughly a monthly basis we turn over about 50 billion URLs, representing hundreds of terabytes of information.

What Happened To Last Week’s Update

In the spirit of TAGFEE, I feel like I need to take some responsibility for last week’s late update, and explain what happened.

One of our big goals is to provide fresh data.  One way we can do that is to shorten the time between getting raw content and processing it; that corresponds to the "Newly Discovered Content" section of the chart above.  For the last update we doubled the size of our infrastructure.  Doubling the number of computers analyzing and synthesizing data also increased the coordination required between them: if everyone has to talk to everyone else and you double the number of people, you roughly quadruple the number of relationships. This caused lots of problems we had to deal with at various times.
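The quadrupling follows directly from counting pairs; here's the arithmetic as a short sketch (the cluster size is just an example).

```python
# All-to-all coordination grows with the number of pairs, n*(n-1)/2, so
# doubling the machines roughly quadruples the communication paths.
def pairs(n):
    return n * (n - 1) // 2

n = 40                            # hypothetical cluster size
print(pairs(n), pairs(2 * n))     # 780 vs 3160 -- roughly 4x
```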

Another nasty side-effect was that machine failures became even more common than we had experienced before.  If you know anything about Amazon Web Services and Elastic Compute Cloud (EC2), then you know those instances go down a lot :)  So we needed an extra four days to get the data out.

Fortunately, we've taken this as an opportunity to improve our infrastructure, fault tolerance, and lots of other good tech start-up buzzwords, which is one of the reasons we were able to get this update out so quickly after the previous one.

As always, we really appreciate feedback, so keep it coming!


Posted by Nick Gerner

I know, I promised a Linkscape update by last week.  And I missed it.  But there’s an update today!  Do you forgive me?  No?  Not enough?  Well how about doubling the volume of data available in our free API?  You might have gotten a totally awesome email last week announcing that the free SEOmoz API is now serving up to 1,000 links.  This email was so awesome I just had to share it (nice work Scott!)

 
This is the same free API that’s powering tons of internal reporting tools and plenty of tools you might have already seen.  This includes Carter Cole’s SEO Site Tools toolbar which went volcanic last month.  And he’s not even showing lists of links.  So by some math there’s 1000 times more power available!
 
But seriously, there have been comparisons made between what we’re doing and what you can do with Yahoo! Site Explorer.  The Yahoo! Site Explorer API offers up to 1,000 links.  And there’s no reason we can’t do the same.
 
What do you get with the free API?  You get a lot:
  • Up to 1,000 links to a page, subdomain or root domain (sorted by Page Authority of the linking page)
  • Anchor text for those 1,000 links
  • Aggregate anchor text counts across all links in our index
  • HTTP status code
  • nofollow indicators
  • Plenty of metrics for data junkies
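If you want to play with the data programmatically, here's a minimal sketch of fetching a link list over HTTP. The endpoint path, query parameters, and response shape below are placeholders of my own, not the documented API; the real URL structure and required authentication live in the API documentation.

```python
# Illustrative only: the endpoint path, parameters, and response shape below
# are placeholders, not the documented SEOmoz API. Check the API docs for the
# real URL structure and the required authentication parameters.
import json
from urllib.parse import quote
from urllib.request import urlopen

API_BASE = "http://lsapi.seomoz.com/linkscape"     # assumed base URL
TARGET = "www.seomoz.org"

def fetch_links(target, limit=1000):
    url = f"{API_BASE}/links/{quote(target, safe='')}?Limit={limit}"
    with urlopen(url) as response:
        return json.loads(response.read())

links = fetch_links(TARGET)
print(f"got {len(links)} link records")
```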

We’ve got a community submissions page on our wiki, and we love to share neat apps.  So if you build something on our API, send it our way and we’ll make sure the community hears about it.

 


Posted by Nick Gerner

Since the launch of Open Site Explorer and our API update, Chas, Ben and I have invested a lot of time and energy into improving the freshness and completeness of Linkscape's data.  I'm pleased to announce that we've updated the Linkscape index with crawl data that's between two and five weeks old, the freshest it's ever been.  We've also changed how we select pages, in order to get deeper coverage on important domains and waste less time on prolific but unimportant domains.

You may recall Rand’s recent post about prioritizing the best pages to crawl, and mine about churn in the web. We’ve applied some of the principles from these posts to our own crawling and indexing. Rand discussed how crawlers might discover good content on a domain by selecting well-linked-to entry points:

In the past, we selected pages to crawl based purely on mozRank.  That turned out to favor some unsavory elements (you know who you are :P ).  Now we look at each domain and determine how authoritative it is.  From there we select pages using the principle illustrated above: highly linked-to pages (the homepage, category pages, important pieces of deep content) link to other important pages we should crawl.  From intuition and experience, we believe this produces crawling behavior much closer to a search engine's.
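As a rough illustration of that selection idea, here's a toy sketch that gives each domain a crawl budget proportional to its authority and then picks its best-linked pages first. The scoring and budget formula are my own simplifications, not Linkscape's actual selection logic.

```python
# Toy crawl selection: budget per domain scales with domain authority, and
# within a domain the most authoritative pages are selected first.
def select_pages(domains, total_budget):
    """domains: {name: {"authority": float, "pages": [(url, page_score), ...]}}"""
    total_authority = sum(d["authority"] for d in domains.values()) or 1.0
    selected = []
    for name, d in domains.items():
        budget = int(total_budget * d["authority"] / total_authority)
        best_first = sorted(d["pages"], key=lambda p: p[1], reverse=True)
        selected.extend(url for url, _ in best_first[:budget])
    return selected

domains = {
    "bigsite.com":  {"authority": 90, "pages": [("bigsite.com/", 9.1),
                                                ("bigsite.com/category", 7.4),
                                                ("bigsite.com/deep-content", 5.2)]},
    "tinyblog.net": {"authority": 30, "pages": [("tinyblog.net/", 4.0),
                                                ("tinyblog.net/post", 2.1)]},
}
print(select_pages(domains, total_budget=4))
# ['bigsite.com/', 'bigsite.com/category', 'bigsite.com/deep-content', 'tinyblog.net/']
```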

In a past post, I discussed the importance of fresh data.  After all, if 25% of pages on the web disappear after one month, data collected two or more months ago just isn’t actionable.

From now on, we’re focusing on that first bar in the graph above. By the time our data approaches that second bar (meaning most of it is out of date), we should have an index update for you.  If and when we show you historical data, we’ll mark it as such.

What this means for you is that all our tools powered by Linkscape will provide fresher, more relevant data, and we’ll have better coverage than ever.  This includes things like:

As well as products and tools developed outside SEOmoz using either the free or paid API:

There are plenty more.  In fact, you could build one too!

Because I know how much everyone likes numbers, here are some stats from our latest index:

  • URLs: 43,813,674,337
  • Subdomains: 251,428,688
  • Root Domains: 69,881,887
  • Links: 9,204,328,536,611

Our last index update was on January 17th.  You might recall some bigger numbers in the last update.  Because of the changes to our crawl selection, our latest index should exclude a lot of duplicate content, spam pages, link farms, and spider traps while keeping high quality content.

Our next update is scheduled for March 11. But we’ll update the index before then if the data is ready early :)

As always, keep the feedback coming.  With our own toolset relying on this data, and dozens of partners using our API to develop their own applications, it’s critical that we hear what you guys think.

NOTE: we’re still updating the top 500 list at the moment.  We’ll tweet when that’s ready.


Posted by Nick Gerner

The launch of Open Site Explorer last week opens up a lot of link data, filters, and anchor text to a much wider audience than we’ve ever had before.  In that same vein, today we’re announcing our new and improved SEOmoz Free API.

Any registered (it’s free) SEOmoz member can visit our API Portal and get an API key that gives you access to:

  • Data for any URL in our index including
    • Domain and Page Authority
    • mozRank
    • total link count
    • external, followed link count
  • The first 500 links to any page, sub domain or domain
  • Filtering on those links: 301s, Follows, External, etc.
  • The first 3 domains linking to any page, sub domain or domain
  • The first 3 anchor text terms or phrases in links to any page, sub domain or domain
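Once you have link records back from the API, most of the filters in the list above amount to simple predicates on each record. Here's a hedged sketch; the field names are assumptions for illustration, not the API's actual response schema.

```python
# Sketch of client-side filtering over link records (followed vs nofollow,
# external vs internal, 301s). Field names here are illustrative assumptions.
links = [
    {"source": "a.com/x", "target": "b.com/", "followed": True,  "http_status": 200},
    {"source": "a.com/y", "target": "b.com/", "followed": False, "http_status": 200},
    {"source": "c.com/z", "target": "b.com/", "followed": True,  "http_status": 301},
]

followed_external = [l for l in links
                     if l["followed"] and not l["source"].startswith("b.com")]
redirects = [l for l in links if l["http_status"] == 301]
print(len(followed_external), len(redirects))  # 2 1
```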

You’re welcome to use this data for private or publicly-facing purposes. We already have a variety of partners integrating this data including:

Check out some sample code and applications on the wiki.

Our idea is that getting this data into the hands of webmasters makes everyone better off: we're excited about our new authority scores, marketers are thirsty for metrics, and users of all kinds of tools are better off with a deeper look at real data.  The free package covers you for up to a million links per month, which you're free to use for any purpose, from consulting to building an SEO campaign management suite.

API Cartoon

In addition to the free API (which I think is quite powerful already), we’re expanding our paid API offering. The paid API includes everything above, but also includes:

  • Additional metrics:
    • number of domains that link to you
    • mozTrust
    • number of links to all pages on your domain
    • and more
  • A deeper look at links, way beyond the first 500 (first 100k for each sort per page, domain or sub domain)
  • Plenty of sorts on links:
    • domain authority
    • page authority
    • linking root domains
  • Way more anchor text terms and phrases (up to 100k per page, domain or sub domain if you’ve got that many)

This is exactly the same API powering Open Site Explorer.  So if you think OSE missed a feature, or should include other data sources, you can build it over again and do an even better job :)   If you do, drop me a line and I’ll take a look. We’d love to share partner apps on our wiki, Twitter, the blog, and elsewhere.

We don’t even have an attribution requirement. Although, we have a tasty 15% discount if you do cite us as a source ;)

To sign up, just contact us, and we’ll start the process.

EDIT: The paid API is available outside of a PRO membership.  A PRO membership buys the tools, and content, and sweet sweet badge.  The paid API is extra.  Of course, the free API is both free and full of awesome.


Posted by Nick Gerner

As we rapidly approach the end of 2009 and the opening of 2010, we've got a much-anticipated index update ready to roll out, gang.  Say it with me: "twenty-ten".  Oh yeah, I'm so gonna get a flying car and a cyberpunk android :)    …Ahem.  I thought this would be a great time to take a look back at the year and ask, "Where did all those pages go?"  Being a data-driven kind of guy, I want to look at some numbers about churn, freshness, and what they mean for the size of the web and of web indexes over the last year, and for the hundreds of billions, indeed trillion-plus, URLs we've gotten our hands on.

This index update has a lot going on, so I’ve broken things out section by section:

An Analysis of the Web’s Churn Rate

Not too long ago, at SMX East, I heard Joachim Kupke (senior software engineer on Google’s indexing team) say that "a majority of the web is duplicate content". I made great use of that point at a Jane and Robot meet up shortly after.  Now, I’d like to add my own corollary to that statement: "most of the web is short-lived".

Churn on the Web

 

After just a single month, a full 25% of the URLs we've seen are what we call "unverifiable".  By that I mean that the content was either duplicate, included session parameters, or for some reason could not be retrieved (verified) again (404s, 500s, etc.).  Six months later, 75% of the tens of billions of URLs we've seen are "unverifiable", and a year later only 20% qualify for "verified" status. As Rand noted earlier this week, Google's doing a lot of verifying themselves.
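In code terms, the bookkeeping behind "verified" boils down to a per-URL check like the sketch below; the specific rules and field names are simplified assumptions on my part, not our exact criteria.

```python
# Toy "verified vs. unverifiable" check: a URL stays verified only if it was
# re-fetched successfully and isn't a duplicate or session-parameter variant.
def is_verified(record, seen_fingerprints):
    if record["http_status"] != 200:
        return False                        # 404s, 500s, fetch failures
    if "sessionid=" in record["url"].lower():
        return False                        # session-parameter variant
    if record["content_hash"] in seen_fingerprints:
        return False                        # duplicate content
    seen_fingerprints.add(record["content_hash"])
    return True

crawl = [
    {"url": "http://example.com/a",              "http_status": 200, "content_hash": "h1"},
    {"url": "http://example.com/a?sessionid=42", "http_status": 200, "content_hash": "h1"},
    {"url": "http://example.com/gone",           "http_status": 404, "content_hash": None},
]
seen = set()
verified = [r for r in crawl if is_verified(r, seen)]
print(f"{len(verified)}/{len(crawl)} verified")   # 1/3 verified
```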

To visualize this dramatic churn, imagine the web six months ago…

the web six months ago

Using Joachim’s point, plus what we’ve observed, that six-month old content today looks something like this:

what remains of the six-month-old web

What this means for you as a marketer is that some of the links you build and the content you share across the web are not permanent. If you engage heavily with high-churn portions of the web, the statistics you monitor over time can vary pretty wildly. It's important to understand the difference between getting links (and republishing content) in places that will make a splash now but fade away, versus engaging in ways that last.  Of course, both are important (high-churn areas may drive traffic that turns into more permanent value), but the distinction shouldn't be overlooked.

Canonicalization, De-Duping & Choosing Which Pages to Keep

Regarding Linkscape’s indices, we capture both of these cases:

  • We’ve got an up-to-date crawl including fresh content that’s making waves right now. Blogscape helps power this, monitoring 10 million+ feeds and sending those back to Linkscape for inclusion in our crawl.
  • We include the lasting content which will continue to support your SEO efforts by analyzing which sites and pages are "unverifiable" and removing these from each new index. This is why our index growth isn’t cumulative — we re-crawl the web each cycle to make sure that the links + data you’re seeing are fresh and verifiable.

To put it another way, consider the quality of most of the pages on the web, as measured, for instance, by mozRank:

Most Pages are Junk (via mozRank)

I think the graph speaks for itself. The vast majority of pages have very little "importance" as defined by a measure of link juice. So it doesn't surprise me (now, at least) that most of these junk pages disappear before too long.  Of course, there are still plenty of really important pages that do stick around.

But what does this say about the pages we're keeping?  First off, let's set aside the pages we saw over a year ago (as shown above, likely fewer than a fifth of them remain on the web).  In just the past 12 months, we've seen between 500 billion and well over 1 trillion pages, depending on how you count it (via Danny at Search Engine Land).

Linkscape URLs in the last year

So in just a year we've provided 500 billion unique URLs through Linkscape and the Linkscape-powered tools (Competitive Link Finder, Visualization, Backlink Analysis, etc.). What's more, this represents less than half of the URLs we've seen in total, as the "scrubbing" we do for each index cuts approximately 50% of the "junk" (through canonicalization, de-duping, and outright tossing of spam, among other things). There are likely many trillions of URLs out there, but the engines (and Linkscape) certainly don't want anything close to all of them in an index.

Linkscape’s December Index Update:

From this latest index (compiled over approx. the last 30 days) we’ve included:

  • 47,652,586,788 unique URLs (47.6 billion)
  • 223,007,523 subdomains (223 million)
  • 58,587,013 root domains (58.6 million)
  • 547,465,598,586 links (547 billion)

We’ve checked that all of these URLs and links existed within the last month or so.  And I call out this notion of "verified" because we believe that’s what matters for a lot of reasons:

I hope you’ll agree. Or, at least, share your thoughts :)

New Updates to the Free & Paid Versions of our API

I also want to give a shout out to Sarah, who's been hard at work repackaging our site intelligence API suite.  She's got all kinds of great stuff planned for early next year, including tons of data in our free APIs.  Plus, she's dropped the prices on our paid suite by nearly 90%.

Both of these items are great news for some of our many partners, including:

Thanks to these partners we’ve doubled the traffic to our APIs to over 4 million hits per day, more than half of which are from external partners!  We’re really excited to be working with so many of you.
