Via Twitter, blogs, or talking with our people, you may have heard us mention the ‘scraping’ problem we have. In short, individuals and companies are using automated methods to harvest (or ‘scrape’) our data. They do it via a wide variety of methods but most boil down to a couple methods involving a stupid amount of requests made to our web server.
This is bad for everyone, including you. First, it grinds our poor server to a stand-still at times, even after several upgrades to larger hosting plans with more resources. Second, it violates our license as many of these people scraping our data are using it in a commercial capacity without returning anything to the project. Third, it forces us to remove functionality that you liked and may have been using in an acceptable manner. Over the years we’ve had to limit the API, restrict the information / tools you see unauthenticated (e.g. RSS feed, ‘browse’, ‘advanced search’), and implement additional protections to stop the scraping.
So just how bad is it? We enabled some CloudFlare protection mechanisms a few weeks back and then looked at the logs.
- The attacks against OSVDB.org were so numerous, the logs being generated by CloudFlare were too big to be managed by their customer dashboard application. They quickly fixed that problem, which is great. Apparently they hadn’t run into this before, even for the HUGE sites getting DDoS’d. Think about it.
- We were hit by requests with no user agent (a sign of someone scraping us via automated means) 1,060,599 times in a matter of days…
- We got hit by 1,843,180 SQL injection attack attempts, trying to dump our entire database in a matter of weeks…
- We got hit by ‘generic’ web app attacks only 688,803 times in a matter of weeks….
- In the two-hour period of us chatting about the new protection mechanisms and looking at logs, we had an additional ~ 130,000 requests with no user-agent.
To put that in perspective, DatalossDB was hit only 218 times in the same time period by requests with no user agent. We want to be open and want to help everyone with security information. But we also need for them to play by the rules.
For years, we have used Typo3 for our blog, hosted on one of our servers. It isn’t bad software at all, I actually like it. That changes entirely when it sits behind Cloudflare. Despite our server being up and reachable, Cloudflare frequently reports the blog offline. When logged in as an administrator and posting a new blog or comment, Cloudflare challenges me by demanding a CAPTCHA, despite no similar suspicious activity. Having to struggle with a service designed to protect, that becomes a burden is bad. Add to that the administrative overhead of managing servers and blog software and it only takes away from time better spent maintaining the database.
With that, we have migrated over to the managed WordPress offering to free up time and reduce headache. The one downside to this is that Typo3 does not appear to have an export feature that WP recognizes. We have backfilled blogs back to early 2007 and will slowly get to the rest. The other downside is in migrating, comments left on previous blogs can’t be migrated in any semblance of their former self. Time permitting, we may cut/paste them over as a single new comment to preserve community feedback.
The upside, we’re much more likely to resume blogging, and with greater frequency. I currently have 17 drafts going, some dating back a year or more. Yes, the old blogging setup was that bad.