Sunday, October 22nd, 2017

Post-Panda, Your Original Content is Being Outranked by Scrapers & Partners

Posted by BryanCrow

A weird thing has happened as a result of Panda, something you might have expected Google’s Search Quality testers to catch before rolling the update out. Due to the domain-wide nature of the signal, high-quality, original content produced by the websites that were negatively impacted is now being ranked below the exact same content republished by the partners to whom they syndicate. Even more egregious, it is also being outranked by scrapers who effectively steal and republish the same content without permission or credit.

I have seen this briefly mentioned by observers, but I haven’t seen the phenomenon documented in detail either in the SEO press or in the Panda Google forum. The purpose of this post is to transparently share data from the site WonderHowTo.com (of which I am the CTO) and to locate others experiencing a similar phenomenon.

Pre Panda

For three years, we at WonderHowTo organized the sprawling world of HowTo with taxonomical zeal and very human curation. By January, we had grown to more than 10 million monthly unique visitors. As our community formed, we began to shift our efforts toward covering timely news in the HowTo space (there is astounding innovation each day among the 427 subcategories we follow).

Our journalistic cred grew, and at the beginning of the year, two fantastic syndication partners, Business Insider and Huffington Post, recognized our quality and eagerly published our articles in their sections (primarily Technology). On occasion, we noticed that our articles were outranked by our partners, but over the course of a few days, Google always got it right, recognizing the source as WonderHowTo. For the record, pre-Panda, we cannot recall one instance when a scraper outranked us with our own content in Google. Never. There seemed to be order in the universe.

Post Panda

Our Google traffic fell by 40%. Among our 1 million indexed pages, we experienced plenty of displaced rankings. Before getting into the what, how, & why, one thing has stood out as alarmingly egregious: Original content created by us is no longer able to rise to the top above our partners or even scrapers who republish our content. Ever. Panda branded us the Rosa Parks of content, forcing us to the back of Google’s ranking bus, along with all the other sites which fit its profiling.

Crediting the Original Source – Google vs Bing

I took a look at the articles we’re promoting on our home page and syndicating to Business Insider and Huffington Post. As I mentioned earlier, our articles also tend to get scraped and republished on dozens of sites within minutes of being published. Post-Panda, it turns out Bing is doing a better (though still imperfect) job of ranking the original source (WonderHowTo) above the scrapers & syndication partners. Here are examples from a few recent posts (for simplicity, I searched for each article’s exact title):

"How To Remove Your Name and Profile Picture from Facebook’s Social Ads"

Original Source is #9 on Google

"Transform Your Android Home Screen into a 3D Environment with the SPB Shell 3D Launcher App"

Original Source is #7 on Google

"How to Add a Dislike Button to Your Facebook Page"

Original Source is #14 on Google

The larger implication is that if Google cannot rank the source first when searching for the exact title, then the source will also lose out on traffic from any additional keyword variations that the very same content ends up receiving on scraper and partner sites.

Deconstructing The Panda Damage

Our process has always revolved around human curation with the goal of weeding out anything low quality, so it seemed odd that the hit would be so large. We did a deep analysis on a variety of signals (article word count, title word count, number of links, embedded media, number of comments, number of favorites, bounce rate, etc.) to try to determine which individual pieces of content were getting hit the most.

We separated the content that gained the most traffic to compare against the content that had lost the most traffic, comparing signals & looking for trends. The results seemed random. Very short video descriptions would rank quite well, while long, detailed original transcriptions and guides were suffering. Every time we thought we’d found an influencing signal, we’d go on to find enough exceptions to negate it.

It became abundantly clear that Panda does not work by filtering out individual low quality content as was originally implied. It works by punishing entire domain names if an undetermined percentage of the content on that site meets the undefined "low-quality" criteria. Soon after we came to this realization, Google confirmed it in a statement to Search Engine Land, and an interview with WIRED.

This Site-Wide Approach Punishes High Quality Results

With this signal hitting an entire site instead of just its individual low quality content, the results fundamentally oppose the stated goal of search quality and fairness in attribution. The collateral damage results in Google burying the original source of high quality content, promoting those who steal, scrape, and republish above them. Furthermore, it ends up demoting other top quality results simply because of the domain on which the content resides. It’s counter-intuitive to think that prejudicially branding every piece of a particular site’s content, past, present and future is an effective way to promote top quality results.

Trying To Resolve Your Site-Wide Demotion

Within a week, several search analysis reports started popping up with post-mortem breakdowns. Most were fundamentally flawed in that they only looked at the number of ranking places each site would lose, without taking search volume and click-through rate into account. The bottom line is that the difference between ranking 1st and ranking 2nd is mammoth. As such, if your site ranked #1 for a couple hundred popular queries and you got flagged by Panda, the bulk of your traffic loss would come from those #1 positions changing to positions #2 through #10. Shifts between #4 and #8 don’t make nearly as much of a difference. But I digress.
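To make the point about ranking position concrete, here is a minimal sketch of how a lost position could be weighted by search volume and click-through rate rather than counted as one place. The click-through rates in ASSUMED_CTR are hypothetical numbers chosen only for illustration, not measured data, and the function names are made up for this example.

```python
# Hypothetical click-through rates by ranking position (illustrative only,
# not measured data): position 1 takes most clicks, the tail very few.
ASSUMED_CTR = {1: 0.40, 2: 0.15, 3: 0.10, 4: 0.06, 5: 0.05,
               6: 0.04, 7: 0.03, 8: 0.03, 9: 0.02, 10: 0.02}


def estimated_clicks(monthly_searches, position):
    """Estimate monthly clicks for a query at a given ranking position."""
    # Positions beyond the first page are treated as receiving no clicks.
    return monthly_searches * ASSUMED_CTR.get(position, 0.0)


def traffic_loss(monthly_searches, old_position, new_position):
    """Estimated clicks lost when a query moves from old to new position."""
    return (estimated_clicks(monthly_searches, old_position)
            - estimated_clicks(monthly_searches, new_position))


# The same one-place drop costs far more at the top of the page.
top_drop = traffic_loss(10_000, 1, 2)   # #1 -> #2
mid_drop = traffic_loss(10_000, 4, 5)   # #4 -> #5
print(top_drop, mid_drop)
```

Under these assumed rates, a one-place drop from #1 costs many times more clicks than a one-place drop from #4, which is why counting lost places alone understates the damage.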

A consensus has been forming across the web that if you remove duplicate and otherwise low-quality content from your site, or do the work of telling Google not to index it, your classification as low-quality under Panda will be lifted. The idea that you can get out from under this cloud started to gain traction as a couple of standout examples started showing up.

Find Your "Problem Content"

The vast majority of content on WonderHowTo was written by our team of editors, researchers, and curators. It has always been our policy to write original descriptions for the videos our curators approve for our library so as to ensure authenticity, accuracy, and relevance. It is part of the added value we bring to the table when embedding how-to videos from YouTube, Vimeo, or any of the other 17,000 creators we’ve curated in our hunt for useful and excellent HowTos. (Talented video creators often produce an excellent tutorial with zero regard to title or description, rendering it invisible to search. To these compelling voices, we have sent a steady stream of deserved traffic.)

Over the years we have also consummated one-off agreements with a handful of partners who requested that we use their own specific descriptions, word-for-word, when including their content on our site. As was the Pre-Panda norm, Google would always rank the original source 1st, so there was no need for any one-off no-index tags to keep rankings in their correct place.

With the growing consensus that such republishing could be a major signal in getting a domain flagged, it seemed apparent that our biggest problem might be this content from our partners. After auditing our library, we found that about 16% of our content had been republished word for word from one of these partners. We would have to noindex these to take them out of search visibility.
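An audit like this can be sketched in a few lines. The example below is not our actual tooling; it assumes you can export your own page descriptions and the descriptions supplied by partners, and it only catches word-for-word matches after normalizing whitespace and case. The function names, URLs, and sample text are hypothetical.

```python
import hashlib
import re


def fingerprint(text):
    """Hash a description after collapsing whitespace and case."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_republished(our_pages, partner_descriptions):
    """Return URLs of our pages whose description matches a partner's text.

    our_pages: dict mapping page URL -> description shown on our site.
    partner_descriptions: iterable of descriptions supplied by partners.
    """
    partner_prints = {fingerprint(d) for d in partner_descriptions}
    return [url for url, description in our_pages.items()
            if fingerprint(description) in partner_prints]


# Hypothetical sample data for illustration.
pages = {
    "/how-to/tie-a-tie": "Learn to tie a tie in three steps.",
    "/how-to/bake-bread": "Our editors walk through a simple loaf.",
}
partner_text = ["Learn  to tie a tie in three steps."]

duplicates = find_republished(pages, partner_text)
share = len(duplicates) / len(pages)
print(duplicates, share)
```

The share computed at the end is the kind of figure quoted above: the fraction of the library that is an exact republication and therefore a candidate for noindexing.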

Enact Your Sweeping Changes to Remove Your Problem Content

Once you’ve identified all your problematic content, it’s time to noindex it. Digital Inspiration made a number of similar changes and saw its rankings restored within two weeks. Here are the changes we made to WonderHowTo as of March 25, 2011:

1. Duplicate Content from Syndication Partnerships
We added a robots noindex meta tag to each page where content was republished from one of our partners.

2. Related Video Pages
We realized that the pages we have that show all the related videos to a particular video were allowed to be indexed. So, we added a robots noindex meta tag to each of those pages.

3. Un-embedded Video Pages
When we don’t embed a how-to video from around the web that we feel meets our quality guidelines for inclusion in our library, we provide a link for people to watch that video on the source site. As people who land on these pages from a Google search may find them to be intermediary pages, we think these may be tripping the signal as well. So, we added a robots noindex meta tag to each of those pages.

4. Tag Pages
According to Digital Inspiration, allowing tag pages with inadequate content to be indexed may also trip this flag, so we added a robots noindex meta tag to all topic pages with fewer than 4 useful videos on them.

5. Page Link Count
I read that too many links on a page may have also been a signal. So, we cut the limit of the number of related topics to show on any given page down by 50%.
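Changes 1 through 4 above boil down to a per-page decision about whether to emit the robots noindex meta tag. Here is a minimal sketch of that decision, assuming a simplified page model. The Page class, its field names, and the page kinds are hypothetical and are not our production code; the meta tag itself is the standard robots noindex tag.

```python
from dataclasses import dataclass

# Standard robots meta tag asking search engines not to index the page.
NOINDEX_TAG = '<meta name="robots" content="noindex">'
MIN_USEFUL_VIDEOS = 4  # tag pages with fewer useful videos are noindexed


@dataclass
class Page:
    kind: str                        # "article", "video", "related" or "tag"
    partner_duplicate: bool = False  # text republished word for word
    embedded: bool = True            # False when we only link to the source
    useful_videos: int = 0           # only meaningful for tag pages


def should_noindex(page):
    """Apply the four noindex rules described in the list above."""
    if page.partner_duplicate:                       # 1. syndicated duplicates
        return True
    if page.kind == "related":                       # 2. related video pages
        return True
    if page.kind == "video" and not page.embedded:   # 3. un-embedded videos
        return True
    if page.kind == "tag" and page.useful_videos < MIN_USEFUL_VIDEOS:
        return True                                  # 4. thin tag pages
    return False


def robots_meta(page):
    """Return the tag to place in the page head, or an empty string."""
    return NOINDEX_TAG if should_noindex(page) else ""


print(robots_meta(Page(kind="tag", useful_videos=2)))
print(repr(robots_meta(Page(kind="article"))))
```

Keeping the rules in one function makes it easy to reverse any single change later if it turns out not to be a signal.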

Wait for your Changes to Take Effect

Within a week, Google had re-crawled enough of our content to start removing the no-indexed pages from the index. We knew this would result in an additional drop in search traffic, but the hope was to rectify the side effect of Google ranking our high-quality content lower than the scrapers who republish it.

We are hopeful that the changes we’ve made will remove this site-wide flag, or that Google will tweak the algorithm to target only low-quality content as opposed to an entire site. But as of today (4/19/2011), the problem still exists. Google continues to drive people who search for our content to the republished versions on our partners’ sites and on the sites that scrape us without permission or attribution. Our search traffic has declined (now partially because of our noindexing changes), and our high-quality content continues to be outranked by less helpful results.

If you have a site that is experiencing a similar phenomenon, let us know in the comments. This behavior seems contrary to the fundamentals of search quality, and Panda specifically. Without making some noise about it, it may never be corrected.
