I'm intending to mine some public-domain data from a government website.


Cort Haus
21-07-2009, 15:49:34
It will ultimately involve hitting around 25,000 pages of a govt department website. All these pages are public domain, but I'm not sure how they'd feel about their site being hit from the same IP several times a second for about an hour.

Would they say "ello, ello, ello. what's all this then?" and start investigating on grounds of 'Crown Copyright' and commercial use? Or is it fair game under Freedom of Information? The information I am accessing is public domain and published in the public interest as part of government policy.

King_Ghidra
21-07-2009, 15:56:19
I would guess fair access is somewhat different from intensive querying/mining. But the worst that will happen is some security software will jump up and down and block you as a threat; I doubt you'll get Inspector Knacker at your front door.

Cort Haus
21-07-2009, 15:57:25
It occurs to me however that a serious question like this might be completely inappropriate for this forum. :-|

Funko
21-07-2009, 15:58:10
Most of the sites have copyright info, e.g.

http://www.number10.gov.uk/footer/copyright

Material on this site is subject to Crown copyright protection unless otherwise indicated. The material may be downloaded to file or printer without requiring specific prior permission. Any other proposed use of the material is subject to the approval of Her Majesty’s Stationery Office (HMSO).

Funko
21-07-2009, 15:59:49
Linked from there this looks like it has the info you might need, but I can't be arsed reading it. Good luck with that. :D

http://www.opsi.gov.uk/advice/psi-regulations/index

Cort Haus
21-07-2009, 16:00:02
King_Ghidra wrote: "I would guess fair access is somewhat different from intensive querying/mining. But the worst that will happen is some security software will jump up and down and block you as a threat; I doubt you'll get Inspector Knacker at your front door."

Hmmm. If they'll block me they'll block the proxy server which would be baaad news. I shall proceed with caution.
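
For what it's worth, "proceeding with caution" could look something like the sketch below. It assumes Python and only the standard library; the names `throttled_fetch`, `fetch_url`, `estimated_duration_hours` and the User-Agent string are made up for illustration, and the one-second gap is a guess at what a site would tolerate, not a figure from anyone's terms of use.

```python
import time
import urllib.request


def fetch_url(url, timeout=30.0):
    """Download one page with the standard library (no retries)."""
    # The User-Agent value is a placeholder; use something that identifies you.
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-research-script"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def throttled_fetch(urls, min_interval=1.0, fetch=fetch_url,
                    sleep=time.sleep, clock=time.monotonic):
    """Yield (url, body) pairs, starting requests at least min_interval seconds apart."""
    last_start = None
    for url in urls:
        if last_start is not None:
            # Wait out whatever is left of the interval since the last request began.
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        yield url, fetch(url)


def estimated_duration_hours(page_count, min_interval):
    """Lower bound on the run time, ignoring the time each download takes."""
    return page_count * min_interval / 3600.0
```

At one request per second, 25,000 pages take roughly seven hours rather than one, which is the price of not looking like an attack. Doing the whole lot in an hour works out to about seven requests a second.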

Cort Haus
21-07-2009, 16:08:04
This is the specific site: http://www.dcsf.gov.uk/copyright/

I'll only be using a tiny fraction of each web-page, but I'll be downloading all the pages (one per school) to get at it.
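
Pulling one small fragment out of each downloaded page might be sketched like this, again assuming Python's standard library. The element id `pupils` and the sample HTML are invented for illustration; the real pages will be laid out differently, and the helper names `ElementTextExtractor` and `extract_text` are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ElementTextExtractor(HTMLParser):
    """Collect the text inside the first-level element carrying a given id.

    Limitation: assumes non-void tags inside the target are properly closed.
    """

    def __init__(self, element_id):
        super().__init__(convert_charrefs=True)
        self.element_id = element_id
        self.depth = 0      # 0 means we are outside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif dict(attrs).get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_text(html, element_id):
    """Return the whitespace-normalised text of the element, or '' if absent."""
    parser = ElementTextExtractor(element_id)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


sample = ('<html><body><h1>School</h1>'
          '<div id="pupils"><b>Pupils:</b> 412<br> on roll</div>'
          '<p>footer</p></body></html>')
print(extract_text(sample, "pupils"))
```

Keeping only the extracted fragment and discarding the rest of each page also keeps the stored data down to the tiny fraction that is actually needed.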

Cort Haus
21-07-2009, 16:08:51
Thanks for the links, Funko.

Funko
21-07-2009, 16:10:48
Sounds like it depends whether it's for private use/research in which case it's fine, or commercial use in which case you need a licence.

Funko
21-07-2009, 16:11:22
Or if you aren't republishing it, also fine.

Japher
21-07-2009, 16:20:59
If it's a US gov site, they probably won't notice.

Cort Haus
21-07-2009, 17:35:31
The aim is not to republish, but to use it in commercially-oriented research. I'll probably do some testing from my home account, and possibly collect all the data from there.

Oerdin
21-07-2009, 22:22:02
Japher wrote: "If it's a US gov site, they probably won't notice."

Especially with all the attacks the Chinese government routinely launches against US government websites.

Cort Haus
22-07-2009, 15:59:09
The answer is that we need a PSI licence.

http://www.opsi.gov.uk/click-use/psi-licence-information/index

I'll still be running the script from home though. My CEO will hurl me from the roof of the building if I get our proxy server blocked from the guvmint site.

Japher
22-07-2009, 16:50:55
didilibobidido

MDA
22-07-2009, 18:29:13
You could try asking the government if it's OK. I'm sure you'd get an answer in 3-5 years.

VetLegion
23-07-2009, 14:21:55
Or connect to a neighbour's WiFi network and do it that way, just in case.

Drekkus
23-07-2009, 14:34:00
ACHTUNG MINEN!!!

Cort Haus
23-07-2009, 19:31:25
Well, I pulled what turned out to be about 20,000 pages this afternoon and haven't been disappeared or even blocked yet. Huzzah!