How bad is the scraping problem?

Via Twitter, blogs, or talking with our people, you may have heard us mention the ‘scraping’ problem we have. In short, individuals and companies are using automated methods to harvest (or ‘scrape’) our data. They do it in a wide variety of ways, but most boil down to a couple of methods that involve sending an absurd number of requests to our web server.

This is bad for everyone, including you. First, it grinds our poor server to a standstill at times, even after several upgrades to larger hosting plans with more resources. Second, it violates our license, as many of the people scraping our data are using it in a commercial capacity without returning anything to the project. Third, it forces us to remove functionality that you liked and may have been using in an acceptable manner. Over the years we’ve had to limit the API, restrict the information and tools available to unauthenticated visitors (e.g. RSS feed, ‘browse’, ‘advanced search’), and implement additional protections to stop the scraping.

So just how bad is it? We enabled some CloudFlare protection mechanisms a few weeks back and then looked at the logs.

  • The attacks against OSVDB.org were so numerous that the logs generated by CloudFlare were too big for their customer dashboard application to manage. They quickly fixed that problem, which is great. Apparently they hadn’t run into this before, even for the HUGE sites getting DDoS’d. Think about it.
  • In a matter of days, we were hit 1,060,599 times by requests with no user agent (a sign of someone scraping us via automated means)…
  • In a matter of weeks, we got hit by 1,843,180 SQL injection attack attempts trying to dump our entire database…
  • In a matter of weeks, we got hit by ‘generic’ web app attacks only 688,803 times…
  • In the two-hour period we spent chatting about the new protection mechanisms and looking at logs, we had an additional ~130,000 requests with no user agent.

To put that in perspective, DatalossDB was hit only 218 times in the same time period by requests with no user agent. We want to be open and want to help everyone with security information. But we also need everyone to play by the rules.
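
For anyone who wants to check their own site for the same ‘no user agent’ pattern, here is a minimal sketch in Python. It is not how we or CloudFlare produced the numbers above. It assumes a web server access log in the common Apache/nginx "combined" format, and the sample IP addresses and request paths are made up for illustration.

```python
import re
from collections import Counter

# Assumed input: Apache/nginx "combined" log format, which ends with
# "referer" "user-agent". Other log formats would need a different pattern.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def count_missing_user_agent(lines):
    """Return (parsed_total, missing_total, Counter of missing-UA hits per IP)."""
    total = 0
    missing = 0
    by_ip = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not look like combined-format entries
        total += 1
        agent = match.group("agent").strip()
        # Servers typically log a missing header as "-" or an empty string.
        if agent in ("", "-"):
            missing += 1
            by_ip[match.group("ip")] += 1
    return total, missing, by_ip


# Hypothetical sample lines (documentation-range IPs, invented paths).
sample = [
    '203.0.113.5 - - [10/Oct/2013:13:55:36 +0000] "GET /a HTTP/1.1" 200 512 "-" "-"',
    '203.0.113.5 - - [10/Oct/2013:13:55:37 +0000] "GET /b HTTP/1.1" 200 498 "-" ""',
    '198.51.100.7 - - [10/Oct/2013:13:55:40 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]

total, missing, by_ip = count_missing_user_agent(sample)
print(f"{missing} of {total} requests had no user agent")
for ip, hits in by_ip.most_common(10):
    print(ip, hits)
```

A missing user agent is only a hint, not proof of scraping, so counts like these are best read alongside request rate and the pages being requested.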
