Mirroring a website for data scraping

rrricanes

#1

Hello all,

I have a package, rrricanes, that scrapes the NHC website for hurricane data and returns it in tidy formats for analysis and plotting.

There have consistently been issues with portions of the website responding slowly or not working at all. Right now I’m having issues retrieving GIS data, for the second time this month. This causes the package to break.

I do have a companion package, rrricanesdata, that helps avoid some of these issues by archiving the data. But it does not cover recent data.

My question to the community is: how can I best overcome this major obstacle? I was considering building a “mirror” of the National Hurricane Center website (minus graphics, and only of the data meant for the package). That way, using Python, I could grab the most recent data and ensure it is always available in a consistent format. Because of file-size issues, I may only use it to cover the data that rrricanesdata does not.
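As a sketch of what such a mirror script might look like (the URL and directory names here are illustrative, not actual NHC endpoints), the core is just mapping each remote path onto a local tree and downloading whatever should be kept:

```python
import urllib.parse
import urllib.request
from pathlib import Path

def url_to_local_path(url: str, dest_root: str) -> Path:
    """Map a remote URL onto a local mirror path, preserving the path tree."""
    parsed = urllib.parse.urlparse(url)
    # Strip the leading "/" so the remote path joins cleanly under dest_root.
    return Path(dest_root) / parsed.netloc / parsed.path.lstrip("/")

def mirror_file(url: str, dest_root: str) -> Path:
    """Download one file into the mirror, creating directories as needed."""
    dest = url_to_local_path(url, dest_root)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as resp:
        dest.write_bytes(resp.read())
    return dest
```

Pointing a scheduled job at a list of such URLs would keep the mirror current without touching anything the package doesn’t need.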

Is this considered a valid option? Or are there other, better ways I should go about handling this?

I do not want my package to be considered in any way an alternative in emergency situations. But having to wait on NOAA tech support to resolve issues requires a great deal of patience (and I don’t expect them to fix things right away on my behalf).

Tim Trice


#2

One option is to mirror pertinent components of various NOAA web path trees
to GH and pull data from GH pages. You could pretty easily (if hackishly) set up a
Travis-CI job to do this roughly hourly and only mirror what you don’t already
have or what’s changed. NOTE: some of the files may be bigger than the GH
max, so this might not work.

A non-free side option to ^^ is to do the same but to S3. It’ll likely be a
few Starbucks-latte $ per month. However, Amazon may be open to these being
part of the freely available data sets they do gratis hosting for.

You could also do something like I did for tigris (not sure if markdown
makes it through via e-mail replies) and have a local user-dir cache, then
only update diffs. That could even be a SQLite DB with BLOBs. You could
combine that with either of the above options as well.
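A minimal sketch of the SQLite-with-BLOBs cache idea (the schema and function names are hypothetical, not taken from tigris or rrricanes):

```python
import sqlite3

def open_cache(db_path=":memory:"):
    """Open (or create) the cache database with a single key/blob table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache "
        "(url TEXT PRIMARY KEY, body BLOB, fetched_at TEXT)"
    )
    return conn

def put(conn, url, body: bytes):
    """Store (or refresh) one cached response as raw bytes."""
    conn.execute(
        "INSERT OR REPLACE INTO cache (url, body, fetched_at) "
        "VALUES (?, ?, datetime('now'))",
        (url, body),
    )
    conn.commit()

def get(conn, url):
    """Return the cached bytes for a URL, or None on a cache miss."""
    row = conn.execute("SELECT body FROM cache WHERE url = ?", (url,)).fetchone()
    return row[0] if row else None
```

On a cache miss you’d fall back to the live site (or the mirror) and then `put` the result, so repeated requests never hit NHC twice.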

data.world might be open to regular updates to hurricane data sets. They
are, however, less likely to be around as long as GitHub or Amazon.

NOTE: more moving parts also == more places things can go wrong.


#3

Are there any licensing issues to worry about?

I do like the S3 option; similarly, DigitalOcean now has Spaces.

Are the files behind an FTP or an HTTP server? It seems like that answer leads to different methods for collecting the data.

One issue that matters is how often you want to update the files in your store. Every day? Once a month? If not that often, maybe you can run a script from your own machine; if very often, then Bob’s Travis suggestion is a good one. Or a Heroku setup to run once a day or so. If you want to run it from a full Ubuntu server, I can get you on one of mine.


#4

Ooh, I should pick your brain sometime @sckott about Spaces. I got into the
beta but never had time to check it out (also because S3 at $DAYJOB is free, so
I didn’t have a ton of impetus :wink:).


#5

(sounds good to me, haven’t played much with it yet either, but will soon)


#6

@sckott and @hrbrmstr, thank you both for responding so quickly. Apologies I could not respond earlier.

No, sir, this is public domain data.

I have since learned from a NOAA meteorologist that both exist, and that FTP may make it a bit faster to get the data.

I don’t want to cache segments of the site as that would just preserve any errors. Unless I’m misunderstanding you.

I consider this a viable option for storing the raw text and zip files. I’m unsure of the cost and have no ballpark metrics to even get a reasonable estimate. But I could certainly do that and CF.

The problem I have with that is, as I understand it, granting public access shows the names of your private buckets as well (the names, not the data). That bugs me for some reason.

Does SQLite store zips?
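(A quick sketch suggests yes: SQLite stores arbitrary bytes as BLOBs, so zip files round-trip fine. Everything below is illustrative.)

```python
import io
import sqlite3
import zipfile

# Build a small zip archive entirely in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("advisory.txt", "hypothetical advisory text")
zip_bytes = buf.getvalue()

# Store the archive as a BLOB, then read it back out.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE archives (name TEXT PRIMARY KEY, body BLOB)")
conn.execute("INSERT INTO archives VALUES (?, ?)", ("al012023", zip_bytes))

(restored,) = conn.execute(
    "SELECT body FROM archives WHERE name = 'al012023'"
).fetchone()
with zipfile.ZipFile(io.BytesIO(restored)) as zf:
    text = zf.read("advisory.txt").decode()
```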

I explored the data.world route prior to rrricanesdata; I don’t believe it would help much on the current path due to size limitations.


So, the issue recently has been the GIS section of the website being changed or down. This wipes out half the functionality of rrricanes. However, I could drop the dependency on the GIS data to an extent and build the functionality in myself: a “build your own GIS”.

For example, there are two datasets that can already be built from the advisory products: spatial point and spatial line dataframes (past track and forecast). So, technically, I do not need those types of GIS datasets.

Another is the forecast cone; the cone radii are standard values, not based on dynamic input. So it could be calculated and drawn on the fly (and much faster than downloading and moving to ggplot).
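To illustrate the “drawn on the fly” idea: given a forecast point and a cone radius, the ring of cone vertices is simple geometry. The real radii are NHC-published constants that would have to be embedded per year and forecast hour; the numbers used below are placeholders, not real cone radii:

```python
import math

def cone_circle(lat, lon, radius_nm, n_points=72):
    """Approximate a circle of given radius (nautical miles) around a point,
    returned as (lat, lon) vertices. One nautical mile is one minute of
    latitude; longitude offsets are scaled by cos(latitude). Good enough for
    plotting, not for navigation."""
    radius_deg = radius_nm / 60.0
    points = []
    for i in range(n_points):
        theta = 2 * math.pi * i / n_points
        dlat = radius_deg * math.cos(theta)
        dlon = radius_deg * math.sin(theta) / math.cos(math.radians(lat))
        points.append((lat + dlat, lon + dlon))
    return points
```

Unioning these circles along the forecast track (one per forecast hour, radius growing with lead time) would reproduce the familiar cone shape without any GIS download.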

Watches and warnings are trickier; that text exists within the advisories but is not parsed (and is inconsistent from year to year). While it could theoretically be done, I’m not sure there is an advantage one way or the other.

Other GIS datasets cannot be built “on the fly”: storm surge products and wind speed probabilities (more detailed than the text product) require data not available in rrricanes.


I feel like I’m going down a rabbit hole a bit here. In short, my thinking is to archive the data in a more structured, consistent format: if the last advisory’s GIS is unavailable, you still have the previous one, whereas, as it stands now, you could have access to none of it if the NHC site breaks again.

Archive the data, but then, if you’re going to do that, why not go ahead and parse it (including the GIS)? Move it into CSV or a database format. That makes accessing the data much faster, which removes the need for (or modifies the purpose of) rrricanesdata.
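As a sketch of what that parsed output could look like (field names and values are entirely hypothetical), writing advisory records out to CSV is straightforward:

```python
import csv
import io

# Hypothetical parsed advisory records; the fields are illustrative only.
records = [
    {"storm_key": "AL012023", "adv": "1", "lat": 25.0, "lon": -80.0, "wind_kt": 35},
    {"storm_key": "AL012023", "adv": "2", "lat": 25.5, "lon": -80.5, "wind_kt": 40},
]

# Write to an in-memory buffer; a real archive job would write to a file.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["storm_key", "adv", "lat", "lon", "wind_kt"])
writer.writeheader()
writer.writerows(records)
csv_text = out.getvalue()
```

A flat CSV per storm (or a single database) is trivial for R to read, which is what would make the archived format faster than re-scraping HTML.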

This means the same for recon data and forecast model data, when added. And this is ultimately where I’m becoming concerned. Bob, you’re correct: more moving parts means more places things can go wrong.

But I feel that the two tools, as they stand now, while good, are too heavily dependent on what I know is an unreliable source. And I have to overcome that.