Beating the Scraper Caper
June 16th, 2008 · by David Bradley
I’ve been running websites since 1996. My first one was on one of those Freenet systems at a US university that you could only access with the Lynx text-only browser using a telnet connection, another was on webspace provided with my original home ISP, and yet others were not strictly speaking websites at all but gopher spaces…I could go on (and, some might say, frequently do). Anyway, during the last 12 or so years I’ve seen a gradual increase in the number of sites trying to cash in with minimal effort of their own on the work of dedicated webmasters like myself.
The advent of RSS newsfeeds gave these guys a much simpler way in: they could exploit all our hard work by simply syndicating content, tacking on a few adverts, and then sitting back while Google spidered “their” content and helped them monetize their sites. Most bloggers and webmasters know these as scraper sites, although that term originally described a method of grabbing updates from sites without an RSS feed, purely for personal use, as opposed to copying someone else’s site.
There are lots of problems with scraper sites as they have come to be known. First, they’re more than likely breaching copyright law, although that’s a little dubious, as RSS stands for Really Simple Syndication and it could be argued that having other sites display one’s content is 50% of its raison d’être. So, with copyright in mind, most bloggers hate scrapers with a vengeance. Depending on the scraper’s scruples, they may remove all mention of your site and internalize all links so that you gain absolutely nothing from their existence. Moreover, if Google for whatever reason comes to view the scraped content as the original, then the scraper might overtake you in the search engine results pages (SERPs), which again is not good.
A simple way to try to work around this latter issue is to make sure you have some hard-coded links to your other content in each post you produce and to use a plugin, such as RSS Footer, which will add a link back to the original post wherever it’s being scraped. This does not work, of course, if the scraper strips all HTML from the newsfeed before displaying it on their site. But those that do that generally also lose the images associated with a post, so it’s not the most common approach.
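To make the footer idea concrete, here is a minimal Python sketch of the kind of thing such a plugin does to each feed item. It is not the RSS Footer plugin’s actual code; the function name `add_feed_footer`, its parameters, and the wording of the footer are all assumptions made up for illustration.

```python
from html import escape


def add_feed_footer(item_html: str, post_url: str, post_title: str, site_name: str) -> str:
    """Append an attribution paragraph that links back to the original post.

    Hypothetical helper: the name, parameters and footer wording are
    illustrative assumptions, not the RSS Footer plugin's real behaviour.
    """
    footer = '<p>This post, <a href="{url}">{title}</a>, originally appeared on {site}.</p>'.format(
        url=escape(post_url, quote=True),  # escape &, <, > and quotes inside the attribute
        title=escape(post_title),
        site=escape(site_name),
    )
    # The footer travels inside the item's HTML, so a scraper that republishes
    # the feed verbatim republishes the backlink too.
    return item_html + "\n" + footer
```

A scraper that strips all HTML will still strip this link, so it only helps against sites that republish the feed markup as-is.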
My Sciencebase site is a member of the always-enlightening DNA Network group of blogs, which has recently been scraped by a rather garish site. A fellow DNA Network blogger, Ricardo Vidal of My Biotech Life (after a tip-off from William Gunn of the Synthesis blog), alerted members to the existence of this ugly splog (spam blog), which was rendering everyone’s feed. He encouraged us all to complain to the host and log the fact that this blog was breaking the law and the host’s terms of service. He also suggested adding some code to our .htaccess files to prevent the scraper from scraping.
Now, as far as I can tell, for sites that render their feeds through Feedburner, scrapers pull the feed directly from Feedburner. There’s nothing you can do on your own site’s hosting, with .htaccess or anything else, to prevent the scraper from accessing all the information it needs from Feedburner, other than the images.
Subsequent correspondence via the network’s discussion group regarding this suggested that some success can be had redirecting scrapers to a “syndication not allowed on this site” message as well as running antileech plugins. However, I think that approach really only works if Feedburner is not handling your feed.
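The redirect idea can be sketched in a few lines of Python. This is not an .htaccess rule and not any particular antileech plugin; it simply illustrates the decision such a rule makes. The blocked address range, the user-agent fragment, and the function names are placeholders invented for the example, and it assumes your own server (not Feedburner) is the one answering the feed request.

```python
from ipaddress import ip_address, ip_network

# Placeholder values for illustration only: a documentation-reserved address
# range and a made-up user-agent fragment, not real scraper details.
BLOCKED_NETWORKS = [ip_network("203.0.113.0/24")]
BLOCKED_AGENT_FRAGMENTS = ["examplescraperbot"]

NOTICE = "Syndication not allowed on this site."


def is_blocked(remote_addr: str, user_agent: str) -> bool:
    """Return True if the request matches a blocked network or user agent."""
    addr = ip_address(remote_addr)
    if any(addr in network for network in BLOCKED_NETWORKS):
        return True
    agent = (user_agent or "").lower()
    return any(fragment in agent for fragment in BLOCKED_AGENT_FRAGMENTS)


def serve_feed(remote_addr: str, user_agent: str, feed_xml: str):
    """Return (status, body): the notice for blocked clients, the feed otherwise."""
    if is_blocked(remote_addr, user_agent):
        return 403, NOTICE
    return 200, feed_xml
```

Calling `serve_feed("203.0.113.7", "Mozilla/5.0", feed)` returns the notice with a 403 status, while any other visitor gets the feed. A scraper can of course change its address or user-agent string, which is one reason this kind of blocking tends to be a running battle.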
There are lots of splogs on Google’s Blogger (Blogspot), and these blogs have a “flag blog” button at the very top. I have used this in the past, but under US copyright law you have to follow up any such complaint with an official written complaint, which could ultimately lead to a court case that you may or may not win.
Gunn hints at the notion that Feedburner adding a “block this site” function would be very useful as the scraper in question really “looks crappy and potentially screws with your Google rankings if the splog doesn’t have rel=nofollow set and they get spidered.” I think I’ve addressed that issue above. He adds, however, that, “It’s not copyright that I’m worried about. That is the whole point of RSS, and hell, I’ve even got a creative commons license applied to my feed via Feedburner! My concern is effectively revoking said permission for those who violate the guidelines. Any juice that comes in via splog isn’t juice I want, period. Finally, it’s worth mentioning that not every scraper pulls its content from Feedburner (some actually scrape) and not everyone uses Feedburner, so it’s in those cases where antileech plugins or .htaccess rules are useful, and they’re also useful for dealing with hotlinkers in general.”
Splogs come and go, and nothing much is effective in eradicating them all. Where a site has lifted one of my sites wholesale, rather than simply syndicating my feed, and so passed itself off as my original site, I have gone straight to the site’s host with a stiff legal letter backed by a solicitor’s address and seen the splog disappear without trace, replaced by a standard hosting company holding page within a day.
Google’s Sven Naumann recently blogged about how well Google does at spotting the original, definitive source for sites that are scraped. But, it is an uphill battle nevertheless. I think, however, that there are ways to fight the battle not by sharpening swords, but by honing words, that could in some cases benefit you as a scraping victim. And, I’ve found it to work very effectively in two cases of scraping of my Sciencebase site, which resulted in a massive traffic spike for a couple of separate, unrelated posts being scraped by two separate splogs.
Instead of sending threatening cease and desist letters to the scrapers, I approached them with a compromise. After all, they’re not going to go away and they’ll only register a new domain and re-scrape my sites if they feel the keyword densities suit their monetization methods. So, my deal was to ask them to recode their scraping algorithm to make sure that all hard-coded links to Sciencebase were retained and that they gave due and proper credit to my site as original content source, with a proper hard backlink with dofollow.
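If you strike a deal like this, it is worth checking now and then that the scraper is keeping its side of the bargain. The following Python sketch, using only the standard library, lists the links on a saved copy of a scraped page that point back to your domain and separates those marked rel="nofollow" from those that are not. The class and function names are invented for this example, and `example.org` stands in for your own domain.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class BacklinkAuditor(HTMLParser):
    """Collect links to one domain, split by whether they carry rel="nofollow".

    Hypothetical helper written for illustration.
    """

    def __init__(self, original_domain: str):
        super().__init__()
        self.original_domain = original_domain.lower()
        self.followed = []    # links search engines may follow ("dofollow")
        self.nofollowed = []  # links marked rel="nofollow"

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href") or ""
        host = (urlparse(href).hostname or "").lower()
        # Accept the bare domain and any subdomain of it (e.g. www.).
        if host != self.original_domain and not host.endswith("." + self.original_domain):
            return
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollowed.append(href)
        else:
            self.followed.append(href)


def audit_backlinks(page_html: str, original_domain: str):
    """Return (followed, nofollowed) lists of links pointing at original_domain."""
    auditor = BacklinkAuditor(original_domain)
    auditor.feed(page_html)
    auditor.close()
    return auditor.followed, auditor.nofollowed
```

An empty `followed` list for a page that carries your content suggests the links were stripped or nofollowed, and that it may be time to revisit the arrangement.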
As I say, this worked well in two cases, where traffic for a couple of old posts skyrocketed because the scraper had been using AdWords to spread the word about his site. Moreover, the splogs actually had PageRank, albeit around 3/4, which meant that with those dofollow links there was a little link juice seeping back into Sciencebase. It was an enforced compromise, rather like an unwanted marriage of convenience, one might say, and I suspect we’ll end up divorced soon enough, but for the time being there are ongoing benefits.
I’m sure Sciencetext readers will have their own answers to being a victim of splogging and scraping, so do let us know your thoughts on the above and your experiences in reversing the scraper caper.

6 responses so far ↓
Thanks for putting this up. Hopefully someone at Feedburner will eventually allow a per-feed IP ban.
That would certainly be very useful. Some of these scrapers really are awful and even getting back a little link juice from them is not worth the potential damage to the reputation of one’s site if new visitors make the wrong assumption about the scraped site.
Google is actually proactive with scrapers. A while back, Google stopped paying out AdSense cheques to bloggers outside the US and Canada, possibly due to legal jurisdiction (or lack of it). Most scrapers are from outside North America. Anything from North America is easy to shut down with a phone call or email to the ISP.
Now it is a bother to continually beat down on those guys. It’s like email spam. They just keep coming back.
Rudy, that’s good to know. It’s easy to forget (when one is running Adblock Plus and NoScript) that these guys are running ads on those sites (I just never see them!)
Unfortunately, this is a reality on the Internet: people look for loopholes, but loopholes are always closed sooner or later.
When one loophole closes, another opens to take its place.