The Scraping Problem and Ethics

[2014-05-09 Update: We'd like to thank both McAfee and S21sec for promptly reaching out to work with us and to inform us that they are both investigating the incident, and taking steps to ensure that future access and data use complies with our license.]

Every day we get requests for an account on OSVDB, and every day we have to turn more and more people away. In many cases the intended use is clearly commercial, so we tell them they can license our data via our commercial partner Risk Based Security. While we were a fully open project for many years, the volunteer model we wanted didn’t work out. People wanted our data, but largely did not want to give their time or resources. A few years back we restricted exports and limited the API due to ongoing abuse from a variety of organizations. Our current model is designed to be free for individual, non-commercial use. Anything else requires a license and paying for the access and data usage. This is the only way we can keep the project going and continue to provide superior vulnerability intelligence.

As more and more organizations rely on automated scraping of our data in violation of our license, we have been forced to restrict some of the information we provide. As the systematic abuse rises, one of our only options is to further restrict the information while trying to strike a balance between helping the end user and crippling commercial (ab)use. We spend about half an hour a week looking at our logs to identify abusive behavior and block the offenders from accessing the database, to help curb those using our data without a proper license. In most cases we simply identify and block them, and move on. In other cases, it is a stark reminder of just how far security companies will go to take our information. Today brought us two different cases which illustrate what we’re facing, and why their unethical actions ultimately hurt the community as we further restrict access to our information.

This is not new in the VDB world. Secunia has recently restricted almost all unauthenticated free access to their database while SecurityFocus’ BID database continues to have a fraction of the information they make available to paying customers. Quite simply, the price of aggregating and normalizing this data is high.

In the first case, we received a routine request for an account from a commercial security company, S21sec, that wanted to use our data to augment their services:

From: Marcos xxxxxx (xxxxxxx@s21sec.com)
To: moderators osvdb.org
Date: Thu, 16 May 2013 11:26:28 +0200
Subject: [OSVDB Mods] Request for account on OSVDB.org

Hello,

I’m working on e-Crime and Malware Research for S21Sec (www.s21sec.com), a lead IT Security company from Spain. I would like to obtain an API key to use in research of phishing cases we need to investigate phishing and compromised sites. We want to use tools like “cms-explorer” and create our own internal tools.

Regards,

S21sec

*Marcos xxxxxx*
/e-Crime///

Tlf: +34 902 222 521
http://www.s21sec.com , blog.s21sec.com

As with most requests like this, they received a form letter reply indicating that our commercial partner would be in touch to figure out licensing:

From: Brian Martin (brian opensecurityfoundation.org)
To: Marcos xxxxxx (xxxxxxx@s21sec.com)
Cc: RBS Sales (sales riskbasedsecurity.com)
Date: Thu, 16 May 2013 15:26:04 -0500 (CDT)
Subject: Re: [OSVDB Mods] Request for account on OSVDB.org

Marcos,

The use you describe is considered commercial by the Open Security
Foundation (OSF).

We have partnered with Risk Based Security (in the CC) to handle
commercial licensing. In addition to this, RBS provides a separate portal
with more robust features, including an expansive watch list capability,
as well as a considerably more powerful API and database export options.
The OSVDB API is very limited in the number of calls due to a wide variety
of abuse over the years, and also why the free exports are no longer
available. RBS also offers additional analysis of vulnerabilities
including more detailed technical notes on conditions for exploitation and
more.

[..]

Thanks,

Brian Martin
OSF / OSVDB

He came back pretty quickly saying that he had no budget for this, and didn’t even wait to get a price quote or discuss options:

From: Marcos xxxxxx (xxxxxxx@s21sec.com)
Date: Mon, May 20, 2013 at 10:55 AM
Subject: Re: [OSVDB Mods] Request for account on OSVDB.org
To: Brian Martin (brian opensecurityfoundation.org)
Cc: RBS Sales (sales riskbasedsecurity.com)

Thanks for the answer, but I have no budget to get the license.

We figured that was the end of it really. Instead, jump to today when we noticed someone scraping our data and trying to hide their tracks to a limited degree. Standard enumeration of our entries, but they were forging the user-agent:

88.84.65.5 - - [07/May/2014:09:37:06 -0500] "GET /show/osvdb/106231 HTTP/1.1" 200 20415 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:26.0) Gecko/20100101 Firefox/26.0"
88.84.65.5 - - [07/May/2014:09:37:06 -0500] "GET /show/osvdb/106232 HTTP/1.1" 200 20489 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
88.84.65.5 - - [07/May/2014:09:37:07 -0500] "GET /show/osvdb/106233 HTTP/1.1" 200 20409 "-" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
88.84.65.5 - - [07/May/2014:09:37:08 -0500] "GET /show/osvdb/106235 HTTP/1.1" 200 20463 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.76 Safari/537.36"

Visiting that IP told us who it was:

[Image: s21-warn]

So after requesting data, and hearing that it would require a commercial license, they figured they would just scrape the data and use it without paying. They made 3,600 requests between 09:18:30 and 09:43:19.
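The pattern in the log excerpt above (one address, ascending entry IDs, a different user-agent on nearly every request) is the kind of thing a short script can flag. What follows is only a rough sketch, not the tooling we actually run: it assumes the server writes Apache-style Combined Log Format, and the names LOG_RE, summarize, and looks_like_enumeration, as well as the min_hits threshold, are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed layout (Combined Log Format):
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)

def summarize(lines):
    """Group requests by client IP, collecting requested paths and user-agents."""
    stats = defaultdict(lambda: {"paths": [], "agents": set()})
    for line in lines:
        match = LOG_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed format
        entry = stats[match.group("ip")]
        entry["paths"].append(match.group("path"))
        entry["agents"].add(match.group("agent"))
    return stats

def looks_like_enumeration(paths, min_hits=4):
    """True when the trailing numeric IDs in the paths strictly increase.

    min_hits is an arbitrary illustrative threshold, not a tuned value.
    """
    tails = [p.rsplit("/", 1)[-1] for p in paths]
    ids = [int(t) for t in tails if t.isdigit()]
    return len(ids) >= min_hits and all(a < b for a, b in zip(ids, ids[1:]))

# The four log lines quoted above.
SAMPLE = [
    '88.84.65.5 - - [07/May/2014:09:37:06 -0500] "GET /show/osvdb/106231 HTTP/1.1" 200 20415 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:26.0) Gecko/20100101 Firefox/26.0"',
    '88.84.65.5 - - [07/May/2014:09:37:06 -0500] "GET /show/osvdb/106232 HTTP/1.1" 200 20489 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"',
    '88.84.65.5 - - [07/May/2014:09:37:07 -0500] "GET /show/osvdb/106233 HTTP/1.1" 200 20409 "-" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"',
    '88.84.65.5 - - [07/May/2014:09:37:08 -0500] "GET /show/osvdb/106235 HTTP/1.1" 200 20463 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.76 Safari/537.36"',
]

for ip, entry in summarize(SAMPLE).items():
    print(ip, len(entry["paths"]), len(entry["agents"]),
          looks_like_enumeration(entry["paths"]))
```

On the four quoted lines this reports one address with four requests, four distinct user-agents, and ascending IDs. A real browser does not change its user-agent between page loads, so several agents from one address within seconds is a strong hint of forgery, though shared proxies and NAT gateways can produce the same picture and any such heuristic needs a human look before blocking.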

The second case, substantially more offensive, involves security giant McAfee. They approached us last year about obtaining a commercial feed of our data, which culminated in a one-hour phone call with someone who ran an internal VDB there. On the call, we discussed our methodology and our data set. While we had superior numbers to any other solution, they were hung up on the fact that we weren’t fully automated. The fact that we did a lot of our process manually struck them as odd. In addition to that, we employed fewer people than they did to aggregate and maintain the data. McAfee couldn’t wrap their heads around this, saying there was “no way” we could maintain the data we do. We offered them a free 30-day trial to utilize our entire data set and to come back to us if they still thought it was lacking.

They didn’t even give it a try. Instead they walked away thinking our solution must be inferior. Jump to today…

161.69.163.20 - - [04/May/2014:07:22:14 -0500] "GET /90703 HTTP/1.1" 200 6042 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"
161.69.163.20 - - [04/May/2014:07:22:16 -0500] "GET /90704 HTTP/1.1" 200 6040 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"
161.69.163.20 - - [04/May/2014:07:22:18 -0500] "GET /90705 HTTP/1.1" 200 6039 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"
161.69.163.20 - - [04/May/2014:07:22:20 -0500] "GET /90706 HTTP/1.1" 200 6052 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"

They made 2,219 requests between 06:25:24 on May 4 and 21:18:26 on May 6. Excuse us, you clearly didn’t want to try our service back then. If you would like to give it a shot, then we kindly ask you to contact RBS so that you can do it using our API, customer portal, and/or exports as intended.
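For a sense of scale, the two windows quoted in this post work out to very different average rates. The arithmetic below is a quick sketch using only the counts and timestamps given above; the year and the S21sec date are taken from the log excerpts, and average_rate is just an illustrative helper name.

```python
from datetime import datetime

def average_rate(count, start, end):
    """Average requests per second over the window from start to end."""
    fmt = "%Y-%m-%d %H:%M:%S"
    window = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return count / window.total_seconds()

# Counts and times as quoted in the post; dates assumed from the log excerpts.
s21sec = average_rate(3600, "2014-05-07 09:18:30", "2014-05-07 09:43:19")
mcafee = average_rate(2219, "2014-05-04 06:25:24", "2014-05-06 21:18:26")

print(f"S21sec: {s21sec:.2f} requests/second")
print(f"McAfee: one request every {1 / mcafee:.0f} seconds on average")
```

That comes to roughly 2.4 requests per second for S21sec, against an average of about one request every 102 seconds for McAfee. Since the McAfee lines quoted above are only two seconds apart, that low average suggests the requests came in bursts spread across the three days rather than as a steady trickle.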

Overall, it is entirely frustrating and disappointing to see security companies who sell their services based on reputation and integrity, and who claim to have ethics, completely disregard them in favor of saving a buck.

[Image: mcafee-ethics]

26 responses

  1. This whole post says to me that McAfee has no interest in making positive contributions to the community that it has taken so very much from. Sit on your hoarded data like a bunch of shysters if you want.

  2. Is there a page that describes the costs? Or is it on a per request basis?

    1. Licensing is based on the data needed (e.g. all of it vs subset), how it is used (e.g. internal only, external, product integration), etc.

      1. So, hellish and arbitrary. If you can’t post your prices, you are trying to extract as much as you possibly can based on the user via a nightmarish negotiation (extraction) process.

        Every day you turn more and more people away. Dump your third party, post easy-to-understand prices, give people a way to sign up without human interaction, and enjoy the flow of income.

      2. If only it were so easy… (it’s not). If we do a one-price-fits-all model, then a company with a bigger customer base and more marketing reach can simply re-sell our data, embed it in their product, redistribute in various fashions. If you say “make the license prevent that”, then you will start to see the issue. In some cases, redistribution may be OK, but that pricing scheme would be much higher letting us “enjoy the flow of income” even more.

      3. If you make the process of paying anywhere near equivalent to the process of scraping, scraping will win. I have worked at many large (3000+) companies where I was given small (~10k) arbitrary monthly purchases for my team — but no power to do any sort of negotiation or custom arrangements. “Off The Shelf” only.

        This is commonplace. It isn’t my money, it is a corporate credit card. I really don’t care about giving you your fair share, I really don’t. But I do care about massive inconvenience and a horrific bloodletting negotiation process. I have to bring in my manager’s manager… the lawyer… two other project managers… etc.

        Why can’t you have prices for “internal use only” right there, with a call for redistribution? You need to have a way to let someone give you money as soon as they want to give you money. It is real simple: shut up and take my money! Let people pay you for internal, get hooked on your great data… so great and awesome and complete they want to put it in their product. Now they have an actual REASON to go through the hell that is negotiation.

        You are ending the discussion with this idiotic policy before it begins.

  3. This is one of the reasons why you have to set some kind of price for data or software, even if you have an open data set or an open source project. Companies are *built* to externalize as many costs as possible

    1. Yes. I couldn’t agree more. Even if it is only for internal use, put a price on it so people can give you money. Parsing a site is annoying and means custom development, which costs money.

      If you have a wonderful API, that people can sign up for in seconds… you avoid the entire stealing step, people are lazy. Corporate credit card + wonderful API > breakable bots nonsense.

      1. I appreciate the feedback and have relayed it to the sales side and requested they look into coming up with a structured pricing scheme like this.

  4. they suck at stealing content since they could easily do the scraping in ways that are harder to block. i am curious though, do you have any type of thresholds or triggers for abuse? shouldn’t those lame scraping actions be blocked?

    1. We have some measures in place yes, but people are getting creative in their methods of scraping. We currently have at least 3 distributed botnets slowly scraping the database for example.

  5. As long as you offer data publicly, there will be others who will try to scrape it, with more advanced tools where you won’t actually be able to track the source. Like a pool of multiple IPs that will parse JS and everything else (phantomjs bot).
    So if you really want to make people not scrape your data, better to not offer it in the first place, right?

    1. Right. For those asking why we can’t protect the data fully, it goes back to OSF’s mission statement about making the data available for free, to help the average consumer. We want to provide the information for such purposes while not feeding organizations who will profit heavily off our work and not contribute back. Ultimately we may have to restrict more information, but we are doing everything we can to avoid that.

  6. If your data is publicly accessible, McAfee certainly wasn’t doing the ethical thing, but you must shoulder the blame. If you make data available freely on the web, you don’t get to decide who accesses it. Use some kind of login system for accessing data, then you can choose to either give away logins or charge a fee, based on personal or corporate use.

    Ironic that weev is criticizing McAfee for scraping website data… ROTFLOL!

    1. We can’t decide who accesses it, but we can and have set forth a license that dictates how the data is used which is our bigger concern. Both companies explicitly asked us for this data last year to be used for commercial purposes. Meaning, they want to use our data in their commercial model to help them make money. Our license says they have to pay us in such a case. That is our issue.

      1. Your license is meaningless. If the data is publicly accessible, you have no control over who accesses it and how. You believe that they plan to use your data for their own profit, but you have no evidence of that. For all you know, they may simply be trying to get a decent sample to evaluate the data before buying.

        As far as the license goes, I note that I can search your database, obtaining and viewing large numbers of results, without ever seeing or agreeing to a license. How do you propose that this license is in any way legally binding to me if I misuse your data?

        Further, it wouldn’t be difficult for someone to scrape your data without revealing who they are and without you ever knowing where the data ended up.

        I’ll say it again: if you value this data, and expect to be able to sell it, you need to protect it behind a login system rather than making it publicly available.

      2. Wrong. During the talks with McAfee it was explicitly outlined that a 30 day free trial of the data was available, done so via a different portal, using a different API (or optional exports). Re: your search, the footer of every page has links to our license, ToS, and privacy policy. Finally, this entire ordeal is likely going to result in just that; we are strongly considering closing access to all vulnerability information and only offering some statistics, researcher profiles, etc. Our desire to help the average consumer is being trumped by people that want to take our work and not contribute back to the project in any fashion.

  7. Did you send them a bill for the information they accessed?

  8. I’m sure this post will be censored, but I don’t agree that there’s an ethical issue with scraping or accessing publicly available information. This counter-article does a good job of mounting a rational argument:

    http://blog.erratasec.com/2014/05/no-mcafee-didnt-violate-ethics-scraping.html

    Please change the name of your organization to CSVDB (closed source vulnerability database).

    1. Unfortunately Rob was making a ‘strawman’ argument on Twitter last night and is dancing around some topics in his response. First, Rob and others do not understand how robots.txt works, as evidenced by their Tweets and comments. The notion that every web agent honors it is ridiculous. For example, if someone uses ‘wget’ it will not look for that file (remember the adage, laws only stop good people). Second, if he is so sure that having a different robots.txt would have made things all better, he should consider that neither McAfee nor S21sec tried to fetch it to see what it “backed up” as far as our license goes. Third, Rob is a nice guy, but he doesn’t understand contract law in the U.S. Fourth and finally, everyone can cherry pick on the word ‘open’ but seriously, our industry is supposed to be filled to the brim with highly intelligent people, no? The term is “Open Sourced”, meaning that is the term that applies to our data aggregation.

      1. > For example, if someone uses ‘wget’ it will not look for that file
        Except it does, unless you tell it explicitly not to.

        You also didn’t go into Rob’s argument that crawling your site only means they are looking at your data, not that they are using it commercially. That they didn’t take up your offer to test it looks bad, but in a company with several thousand employees it is not that unlikely that whoever downloads it now doesn’t even know that offer existed.

      2. McAfee has a 6+ person team internally that does the exact same thing we do. If a McAfee employee needs bulk vulnerability data, they can simply ask for it internally. Resorting to scripts to pull our data, then additional scripts to normalize it presumably, is very odd. So even if that person didn’t know we were in discussion, they should know about their own offerings.

        And again, robots.txt is such a non-starter on all of this. Rob’s silly notion that the presence of any text in robots.txt makes something more or less official is absurd. That file only stops ‘good’ people from crawling it. Neither company made a request to robots.txt, so his entire point is further irrelevant.

      3. Oh, by default, wget does not look at robots.txt

        root@osvdb:~/logs# wget http://attrition.org/mywgettest
        --2014-05-08 13:11:39-- http://attrition.org/mywgettest
        [..]
        HTTP request sent, awaiting response... 404 Not Found
        2014-05-08 13:11:40 ERROR 404: Not Found.

        forced ~$ tail -f /home/admin/access_log | grep mywgettest
        osvdb.org - - [08/May/2014:13:11:43 -0500] "GET /mywgettest HTTP/1.0" 404 1755 "-" "Wget/1.12 (linux-gnu)"

        Nothing else. No hits to robots.txt there.

  9. nnnnnnnnn@mailinator.com

    If there is no price listed must be expensive. Coming from a country where they themselves have double standards and no ethics, but cry ethics when it suits them, the US. It doesn’t surprise me.

  10. I sent in an ethics complaint to the link at the bottom of this page, probably nothing will come of it and I don’t really have the details.

  11. Comments are being closed at this point. If you would like to provide further feedback, you are welcome to email moderators-at-osvdb.org.
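
For context on the robots.txt exchange above: wget's documentation describes robots.txt handling as part of recursive retrieval, which is consistent with the single-URL test in the comments showing no request for that file. Either way, the file is advisory. The sketch below uses Python's standard urllib.robotparser to show that the check happens entirely on the client side; the robots.txt contents, the "ExampleBot" agent name, and the example.org URLs are hypothetical and are not OSVDB's actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /show/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def polite_fetch_allowed(user_agent, url):
    """What a well-behaved client asks before requesting; nothing enforces it."""
    return parser.can_fetch(user_agent, url)

# A cooperative crawler would skip the first URL and fetch the second.
print(polite_fetch_allowed("ExampleBot", "http://example.org/show/osvdb/106231"))
print(polite_fetch_allowed("ExampleBot", "http://example.org/"))
```

The first call prints False and the second True. A scraper that never calls such a check, or ignores the answer, is not slowed down at all, which is why robots.txt only stops clients that choose to cooperate.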
