Machine Learning and Link Spam: My Brush With Insanity
Posted by wrttnwrd
This post was originally in YouMoz, and was promoted to the main blog because it provides great value and interest to our community. The author’s views are entirely his or her own and may not reflect the views of SEOmoz, Inc.
Know someone who thinks they’re smart? Tell them to build a machine learning tool. If they majored in, say, History in college, within 30 minutes they’ll be curled up in a ball, rocking back and forth while humming the opening bars of “Oklahoma.”
Sometimes, though, the alternative is rooting through 250,000 web pages by hand, checking them for compliance with Google’s TOS. Doing that will skip you right past the rocking-and-humming stage, and launch you right into the writing-with-crayons-between-your-toes phase.
Those were my two choices six months ago. Several companies came to Portent asking for help with Penguin/manual penalties. They all, for one reason or another, had dirty link profiles.
Link analysis, the hard way. Back when I was a kid…
I did the first link profile review by hand, like this:
- Download a list of all external linking pages from SEOmoz, MajesticSEO, and Google Webmaster Tools.
- Remove obviously bad links by analyzing URLs. Face it: if a linking page is on a domain like “FreeLinksDirectory.com” or “ArticleSuccess.com,” it’s gotta go.
- Analyze the domain and page trustrank and trustflow. Throw out anything with a zero, unless it’s on a list of ‘whitelisted’ domains.
- Grab thumbnails of each remaining linking page, using Python, Selenium, and PhantomJS. You don’t have to do this step, but it helps if you’re going to get help from other folks.
- Get some poor bugger, er, a faithful Portent team member to review the thumbnails, quickly checking off whether they’re forums, blatant link spam, or something else.
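The two cheap filters in that list (drop obviously bad domains, then drop zero-trust pages unless the domain is whitelisted) can be sketched in a few lines. This is a minimal illustration, not the script used for the review: the domain lists, the sample links, and the `keep_for_review` helper are all made up, and the trust score is assumed to be a plain number pulled from whichever metrics export you use.

```python
from urllib.parse import urlparse

# Hypothetical lists for illustration; the post only names two example domains.
BLOCKED_DOMAINS = {"freelinksdirectory.com", "articlesuccess.com"}
WHITELISTED_DOMAINS = {"example.org"}


def domain_of(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def keep_for_review(url, trust_score):
    """Apply the two cheap filters before any human looks at the page."""
    domain = domain_of(url)
    if domain in BLOCKED_DOMAINS:
        return False  # obviously bad domain: drop it
    if trust_score == 0 and domain not in WHITELISTED_DOMAINS:
        return False  # zero trust and not whitelisted: drop it
    return True


# Made-up (url, trust_score) pairs standing in for the downloaded link list.
links = [
    ("http://www.FreeLinksDirectory.com/widgets", 12),
    ("http://example.org/press/launch", 0),
    ("http://unknown-blog.example.net/post", 0),
    ("http://decent-site.example.com/review", 35),
]
remaining = [url for url, score in links if keep_for_review(url, score)]
```

Only the whitelisted zero-score page and the page with a real trust score survive; everything left in `remaining` still needs the thumbnail review.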
After all of that prep work, my final review still took 10+ hours of eye-rotting agony.
There had to be a better way. I knew just enough about machine learning to realize it had possibilities, so I dove in. After all, how hard can it be?
Machine learning: the basic concept
The concept of machine learning isn’t that hard to grasp:
- Take a large dataset you need to classify. It could be book titles, people’s names, Facebook posts, or, for me, linking web pages.
- Define the categories. In this case, I’m looking for ‘spam’ and ‘good.’
- Get a collection of those items and classify them by hand. Or, if you’re really lucky, you find a collection that someone else classified for you. The Natural Language Toolkit, for example, has a movie reviews corpus you can use for sentiment analysis. This is your training set.
- Pick the right machine learning tool (hah).
- Configure it correctly (hahahahahahaha heee heeeeee sniff haa haaa… sorry, I’m ok… ha ha haaaaaaauuuugh).
- Feed in your training set, with the features — the item attributes used for classification — pre-selected. The tool will find patterns, if it can (giggle).
- Use the tool to compare each item in your dataset to the training set.
- The tool returns a classification of each item, plus its confidence in the classification and, if it’s really cool, the features that were most critical in that classification.
If you ignore the hysterical laughter, the process seems pretty simple. Alas, the laughter is a dead giveaway: these eight steps are easy the same way “Fly to moon, land on moon, fly home” is three easy steps.
Note: At this point, you could go ahead and use a pre-built toolset like BigML, Datameer, or Google’s Prediction API. Or, you could decide to build it all by hand. Which is what I did. You know, because I have so much spare time. If you’re unsure, keep reading. If this story doesn’t make you run, screaming, to the pre-built tools, start coding. You have my blessings.
The ingredients: Python, NLTK, scikit-learn
I sketched out the process for IIS (Is It Spam, not Internet Information Services) like this:
- Download a list of all external linking pages from SEOmoz, MajesticSEO, and Google Webmaster Tools.
- Use a little Python script to scrape the content of those pages.
- Get the SEOmoz and MajesticSEO metrics for each linking page.
- Build any additional features I wanted to use. I needed to calculate the reading grade level and links per word, for example. I also needed to pull out all meaningful words, and a count of those words.
- Finally, compare each result to my training set.
To do all of this, I needed a programming language, some kind of natural language processing (to figure out meaningful words, clean up HTML, etc.) and a machine learning algorithm that I could connect to the programming language.
I’m already a bit of a Python hacker (not a programmer – my code makes programmers cry), so Python was the obvious choice of programming language.
I’d dabbled a little with the Natural Language Toolkit (NLTK). It’s built for Python, and would easily filter out stop words, clean up HTML, and do all the other stuff I needed.
For my machine learning toolset, I picked a Python library called scikit-learn, mostly because there were tutorials out there that I could actually read.
I smushed it all together using some really-not-pretty Python code, and connected it to a MongoDB database for storage.
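To make the feature-building step concrete, here is a rough sketch of two of the features mentioned above: links per word and a count of meaningful words. It deliberately uses only the standard library (`html.parser`) so it runs anywhere; the tiny stop-word list, the `PageStats` class, and the sample HTML are assumptions for illustration, and NLTK's stop-word corpus would be the fuller choice in practice.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# A tiny illustrative stop-word list; NLTK ships a much fuller one.
STOP_WORDS = {"a", "an", "and", "the", "for", "of", "to", "in", "is", "our", "at"}


class PageStats(HTMLParser):
    """Collect visible text and count <a href> links while parsing HTML."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.link_count = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a" and dict(attrs).get("href"):
            self.link_count += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_parts.append(data)


def page_features(html):
    """Return word count, links per word, and meaningful-word counts."""
    parser = PageStats()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", " ".join(parser.text_parts).lower())
    meaningful = [w for w in words if w not in STOP_WORDS]
    return {
        "word_count": len(words),
        "links_per_word": parser.link_count / len(words) if words else 0.0,
        "meaningful_words": Counter(meaningful),
    }


# Made-up page content for demonstration.
sample = (
    "<html><body><p>Cheap wedding dresses and cheap wedding rings.</p>"
    '<a href="http://example.com/a">cheap dresses</a> '
    '<a href="http://example.com/b">wedding rings</a></body></html>'
)
features = page_features(sample)
```

A page with two links in eleven words scores high on links per word, which is exactly the kind of signal a pure text filter never sees.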
A word about the training set
The training set makes or breaks the model. A good training set means your bouncing baby machine learning program has a good teacher. A bad training set means it’s got Edna Krabappel.
And accuracy alone isn’t enough. A training set also has to cover the full range of possible classification scenarios. One ‘good’ and one ‘spam’ page aren’t enough. You need hundreds or thousands to provide a nice range of possibilities. Otherwise, the machine learning program will stagger around, unable to classify items outside the narrow training set.
Luckily, our initial hand-review reinclusion method gave us a set of carefully-selected spam and good pages. That was our initial training set. Later on, we dug deeper and grew the training set by running Is It Spam and hand-verifying good and bad page results.
That worked great on Is It Spam 2.0. It didn’t work so well on 1.0.
First attempt: fail
For my first version of the tool, I used a Bayesian Filter as my machine learning tool. I figured, hey, it works for e-mail spam, why not SEO spam?
Apparently, I was already delirious at that point. Bayesian filtering works for e-mail spam about as well as fishing with a baseball bat. It does occasionally catch spam. It also misses a lot of it, dumps legitimate e-mail into spam folders, and generally amuses serious spammers the world over.
But, in my madness, I forgot all about these little problems. Is It Spam 1.0 seemed pretty great at first. Initial tests showed 75% accuracy. That may not sound great, but with accurate confidence data, it could really streamline link profile reviews. I was the proud papa of a baby machine learning tool.
But Bayesian filters can be ‘poisoned.’ If you feed the filter a training set where 90% of the spam pages talk about weddings, it’s possible the tool will begin seeing all wedding-related content as spam. That’s exactly what happened in my case: I fed in 10,000 or so pages of spammy wedding links (we do a lot of work in the wedding industry). On the next test run, Is It Spam decided that anything matrimonial was spam. Accuracy fell to 50%.
Since we tend to use the tool to evaluate sites in specific verticals, this would never work. Every test would likely poison the filter. We could build the training set to millions of pages, but my pointy little head couldn’t contemplate the infrastructure required to handle that.
The real problem with a pure Bayesian approach is that there’s really only one feature: The content of the page. It ignores things like links, page trust and authority.
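The poisoning effect is easy to reproduce on a toy scale. The sketch below is a bare-bones naive Bayes text classifier with add-one smoothing, written from scratch for illustration; it is not the filter Is It Spam 1.0 used, and every training sentence in it is invented. It trains once on a small balanced set, then again after piling on spam examples from a single vertical, and scores the same legitimate page both times.

```python
import math
from collections import Counter


def train(docs):
    """docs: list of (text, label). Returns per-label word counts and doc counts."""
    word_counts = {"spam": Counter(), "good": Counter()}
    doc_counts = Counter()
    for text, label in docs:
        word_counts[label].update(text.lower().split())
        doc_counts[label] += 1
    return word_counts, doc_counts


def spam_probability(text, model):
    """Naive Bayes with add-one smoothing; returns P(spam | text)."""
    word_counts, doc_counts = model
    vocab = set(word_counts["spam"]) | set(word_counts["good"])
    log_scores = {}
    for label in ("spam", "good"):
        total = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / sum(doc_counts.values()))
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        log_scores[label] = score
    # Turn the two log scores into a probability for 'spam'.
    return 1 / (1 + math.exp(log_scores["good"] - log_scores["spam"]))


# Invented training examples.
balanced = [
    ("cheap pills buy now", "spam"),
    ("free links directory submit", "spam"),
    ("wedding photographer portfolio and pricing", "good"),
    ("recipe for lemon cake", "good"),
]
# Hypothetical poisoning: lots of spam examples from one vertical.
poisoned = balanced + [("cheap wedding dresses wedding rings wedding favors", "spam")] * 20

legit_page = "wedding photographer"
before = spam_probability(legit_page, train(balanced))
after = spam_probability(legit_page, train(poisoned))
```

With the balanced set the page scores well under 0.5; after the wedding-heavy spam is added, the same page scores well over it. The word "wedding" has become a spam signal, and the lopsided class counts push the prior toward spam as well.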
Oops. Back to the drawing board. I sent my little AI in for counseling, and a new brain.
Second attempt: a qualified success
My second test used logistic regression. This machine learning model uses numeric data, not text. So, I could feed it more features. After the first exercise, this actually wasn’t too horrific. A few hours of work got me a tool that evaluates:
- Page TrustFlow and CitationFlow (from MajesticSEO – I’m adding SEOmoz metrics now)
- Links per word
- Page Flesch-Kincaid reading grade level
- Page Flesch-Kincaid reading ease
- Words per page
- Syllables per page
- Characters per page
- A few other seemingly-random bits, like images per page, misspellings, and grammar errors
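For reference, the two readability features in that list come from standard published formulas, both driven by words per sentence and syllables per word. The sketch below is an approximation under loud assumptions: the syllable counter is a crude vowel-run heuristic rather than a dictionary lookup, "characters" means letters inside words, and the `readability` helper is a name made up for this example.

```python
import re


def count_syllables(word):
    """Rough heuristic: count runs of vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def readability(text):
    """Compute word, syllable, and character counts plus both Flesch scores."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text has no words to score")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return {
        "words": len(words),
        "syllables": syllables,
        "characters": sum(len(w) for w in words),
        # Flesch reading ease: higher means easier to read.
        "reading_ease": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        # Flesch-Kincaid grade level: approximate US school grade.
        "grade_level": 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
    }


scores = readability("The cat sat on the mat. The dog ran to the park.")
```

Very simple text like the sample lands at a grade level below zero, which is normal for these formulas. Real pages need the HTML stripped first, or the markup will wreck the counts.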
This time, the tool worked a lot better. With vertical-specific training sets, it ran with 85%+ accuracy.
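Here is roughly what logistic regression does with numeric features like those. This is a from-scratch, standard-library sketch using plain batch gradient descent, included so the mechanics are visible; the tool described here was built on scikit-learn, and a real project would use that library's implementation. The feature values, the scaling, and the function names are all invented for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, learning_rate=0.1, epochs=2000):
    """Batch gradient descent on log loss. Returns (weights, bias)."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias


def spam_confidence(x, model):
    """Return the model's estimated probability that a page is spam."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Made-up, pre-scaled features: [trust_flow / 100, links_per_word * 10, grade_level / 20]
training_rows = [
    [0.02, 0.9, 0.20], [0.05, 0.7, 0.25], [0.00, 0.8, 0.15],  # spam
    [0.45, 0.1, 0.50], [0.60, 0.2, 0.55], [0.35, 0.1, 0.45],  # good
]
training_labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = good

model = train_logistic(training_rows, training_labels)
spammy = spam_confidence([0.03, 0.8, 0.2], model)
clean = spam_confidence([0.50, 0.1, 0.5], model)
```

The output is a probability, so it doubles as the confidence score: anything near 0 or 1 can be handled in bulk, and anything near 0.5 goes to a human. The learned weights also hint at which features drove the call.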
When I tried to use the tool for more general tests, though, my coded kid tripped over its big, adolescent feet. Some of the funnier results:
- It saw itself as spam.
- It thought Rand’s blog was a swirling black hole of spammy despair.
False positives remain a big problem if we try to build a training set outside a single vertical.
Disappointing. But the tool chugs along happily within verticals, so we continue using it for that. We build a custom training set for each client, then run the trained tool against the remaining links. The result is a relatively clear report.
Results and next steps
With little IIS learning to walk, we’ve cut the brute-force portion of large link profile evaluations from 30 hours to 3 hours. Not. Too. Shabby.
I tried to launch a public version of Is It Spam, but folks started using it to do real link profile evaluations, without checking their results. That scared the crap out of me, so I took the tool down until we cure the false positives problem.
I think we can address the false positives issue by adding a few features to the classification set:
- Bayesian filtering: Instead of depending on a Bayesian classification as 100% of the formula, we’ll use the Bayesian score as one more feature.
- Grammar scoring: Anyone know a decent grammar testing algorithm in Python? If so, let me know. I’d love to add grammar quality as a feature.
- Anchor text: This matters a lot. The next generation of the tool needs to score the relevant link based on the anchor text. Is it a name (like in a byline)? Or is it a phrase (like in a keyword-stuffed link)?
- Link position: This may matter, too. It’s another great feature that could help with spam detection. It might lead to more false positives, though. If Is It Spam sees a large number of spammy links in press release body copy, it may start rating other links located in body copy as spam, too. We’ll test to see if the other features are enough to help with this.
If I'm lucky, one or more of these changes may yield a tool that can evaluate pages across different verticals. If I'm lucky.
This is by far the most challenging development project I've ever tried. I probably wore another 10 years' enamel off my teeth in just six weeks. But it's been productive:
- When you start digging into automated page analysis and machine learning, you learn a lot about how computers evaluate language. That's awfully relevant if you're a 21st Century marketer.
- I uncovered an interesting pattern in Google's Penguin implementation. This is based on my fumbling about with machine learning, so take it with a grain of salt, but have a look here.
- We learned that there is no such thing as a spammy page. There are only spammy links. One link from a particular page may be totally fine: For example, a brand link from a press release page. Another link from that same page may be spam: For example, a keyword-stuffed link from the same press release.
- We've reduced time required for an initial link profile evaluation by a factor of ten.
It's also been a great humility-building exercise.