I am currently developing a package I would like to submit to rOpenSci at some point this year. The package obtains current and historical data for hurricanes in the Atlantic and east Pacific oceans (eventually, central Pacific as well).
I already have a similar package on CRAN, HURDAT. However, HURDAT is a re-analysis project, so its data may differ slightly (though it covers far more storms).
I’m seeking guidance/advice on the best way to finish off the new package, rrricanes. Here’s where it stands now:
- Scrapes real-time and archived data for tropical cyclones back to 1998, Atlantic and east Pacific (east of 140W, though not strictly).
- Parses three products in detail: Forecast/Advisory, Strike Probabilities, Wind Speed Probabilities.
- Loosely parses Storm Discussions, Public Advisories, Position Estimates, and Updates.
It is very slow, specifically with the Forecast/Advisory products. These products contain the bulk of the data. However, the format is fairly free-form, and more so the older the product is. On top of that, the format has changed slightly over the years. I’ve used a lot of regex (for me, anyway) to handle the different combinations.
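To make the regex work concrete, here is a minimal sketch in base R of pulling a few fields out of a Forecast/Advisory block. The sample text is illustrative only, and `extract_groups()` and `parse_advisory()` are hypothetical names, not functions from rrricanes. Real advisories vary enough by year that the patterns would need more alternatives than shown here.

```r
# Illustrative text only; real advisories vary by year and by storm.
advisory <- paste(
  "HURRICANE CENTER LOCATED NEAR 25.4N  76.0W AT 01/0300Z",
  "ESTIMATED MINIMUM CENTRAL PRESSURE  942 MB",
  "MAX SUSTAINED WINDS 120 KT WITH GUSTS TO 145 KT.",
  sep = "\n"
)

# Hypothetical helper: return the capture groups of the first match,
# or NA values when the pattern is not found.
extract_groups <- function(pattern, text, n_groups) {
  m <- regmatches(text, regexec(pattern, text, perl = TRUE))[[1]]
  if (length(m) == 0) return(rep(NA_character_, n_groups))
  m[-1]  # drop the full match, keep only the groups
}

# Hypothetical parser: one row per advisory text.
parse_advisory <- function(text) {
  pos <- extract_groups(
    "CENTER LOCATED NEAR\\s+([0-9.]+)([NS])\\s+([0-9.]+)([EW])", text, 4)
  wind <- extract_groups(
    "MAX SUSTAINED WINDS\\s+([0-9]+) KT WITH GUSTS TO\\s+([0-9]+) KT", text, 2)
  pres <- extract_groups("MINIMUM CENTRAL PRESSURE\\s+([0-9]+) MB", text, 1)

  # Southern latitudes and western longitudes become negative.
  lat <- as.numeric(pos[1]) * ifelse(pos[2] == "S", -1, 1)
  lon <- as.numeric(pos[3]) * ifelse(pos[4] == "W", -1, 1)

  data.frame(
    lat = lat,
    lon = lon,
    wind_kt = as.integer(wind[1]),
    gust_kt = as.integer(wind[2]),
    pressure_mb = as.integer(pres[1])
  )
}

parse_advisory(advisory)
```

Returning `NA` for a missing field, rather than failing, keeps one odd product from stopping a whole year's run, which matters when the older products are the least regular.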
Additionally, it’s accessing HTML pages which may or may not be available right away. On my somewhat slow machine, it may take 10-15 minutes to get one year’s worth of data in a nice clean structure. No good!
Originally I had avoided even thinking of keeping the datasets within the package; they’re about 30-40MB (in csv format) for just the current three core products (forecast/advisory, (wind|strike) probabilities). But, I don’t think anyone would realistically want to use it just because of how slow it is. Hell, I’m not even sure I want to use it and I’m writing it for me!!!
So, that being said, these are my thoughts:
- A separate repo to hold data alone. Not in a package repo; just a simple dataset GitHub repo. There would be three types of datasets:
  - A summary dataset by year (cyclone name, key, start date, end date)
  - A dataset for each year by product; one dataset for all forecast/advisory data or strike probabilities, etc.
  - A dataset for each storm by product.
The last item is optional; I’m not sold on it yet. The largest dataset for an annual product will be about 700x125, so I don’t see any reason someone couldn’t just import that and do individual storm analysis off that.
And then the actual package repo. The package would maintain the same functionality of being able to scrape data. But, it would be given additional functionality to just pull datasets from GitHub.
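As a rough sketch of what "pull from GitHub, otherwise scrape" could look like in base R: everything below is an assumption for illustration, including the placeholder repo URL, the `year/product.csv` layout, the product codes, and the function names `dataset_url()` and `get_dataset()`.

```r
# Hypothetical location of the data repo; USER and DATA-REPO are placeholders.
data_base_url <- "https://raw.githubusercontent.com/USER/DATA-REPO/master"

# Build the path to one annual product file, assuming a year/product.csv layout.
# The product codes here are made up for the example.
dataset_url <- function(year,
                        product = c("fstadv", "prblty", "wndprb"),
                        base_url = data_base_url) {
  product <- match.arg(product)
  stopifnot(is.numeric(year), length(year) == 1, year >= 1998)
  sprintf("%s/%d/%s.csv", base_url, as.integer(year), product)
}

# Try the prebuilt CSV first; if it cannot be read and a scraper function
# is supplied, fall back to scraping. Otherwise re-raise the error.
get_dataset <- function(year, product, scrape_fun = NULL,
                        base_url = data_base_url) {
  url <- dataset_url(year, product, base_url = base_url)
  tryCatch(
    utils::read.csv(url, stringsAsFactors = FALSE),
    error = function(e) {
      if (is.null(scrape_fun)) stop(e)
      scrape_fun(year, product)
    }
  )
}
```

With something like this, a call such as `get_dataset(2005, "fstadv", scrape_fun = my_scraper)` (where `my_scraper` stands in for the existing scraping code) would only hit the slow path when the prebuilt file is missing.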
Of course, this means I would need to update the GH dataset repo multiple times a day during an active cyclone but I can either do that on my local machine or on AWS.
All of this while keeping in mind that I would like to submit the package to rOpenSci as, from what I’ve read, it seems eligible but is currently lacking.
My future plans for this package are not limited to storm advisory data. I would also like to add:
- GIS archives (back to 2008)
- reconnaissance data
- computer forecast models (not the larger ones but simple forecast plots)
- ship/buoy observations
I welcome any advice or suggestions on the best path forward here. Thank you.