Advice on further development of package

timtrice · May 25, 2017, 8:11pm

I am currently developing a package I would like to submit to rOpenSci at some point this year. The package obtains current and historical data for hurricanes in the Atlantic and east Pacific oceans (eventually, central Pacific as well).

I already have a similar package on CRAN, HURDAT. However, HURDAT is a re-analysis project, so its data may be slightly different (though it covers far more storms).

I’m seeking guidance/advice on the best way to finish off the former project, rrricanes. Here’s where it stands now:

- Beta
- Scrapes real-time and archived data for tropical cyclones back to 1998, Atlantic and east Pacific (east of 140W unstrict).
- Parses in detail three products: Forecast/Advisory, Strike Probabilities, Wind Speed Probabilities.
- Parses loosely: Storm Discussions, Public Advisory, Position Estimates, Updates.

It is very slow, specifically with Forecast/Advisory products. These products contain the bulk of the data. However, the format is rather free, and it gets worse the older the product is. Plus, it has changed slightly over the years. I’ve used a lot of regex (for me, anyway) to handle the different combinations.

Additionally, it’s accessing HTML pages which may or may not be available right away. On my somewhat slow machine, it may take 10-15 minutes to get one year’s worth of data in a nice clean structure. No good!

Originally I had avoided even thinking of keeping the datasets within the package; they’re about 30-40MB (in csv format) for just the current three core products (forecast/advisory, (wind|strike) probabilities). But, I don’t think anyone would realistically want to use it just because of how slow it is. Hell, I’m not even sure I want to use it and I’m writing it for me!!!

So, that being said, these are my thoughts:

A separate repo to hold data alone. Not a package repo, just a simple dataset GitHub repo. There would be three types of datasets:
- A summary dataset by year (cyclone name, key, start date, end date)
- A dataset for each year by product; one dataset for all forecast/advisory data or strike probabilities, etc.
- A dataset for each storm by product.

The latter item is optional; I’m not sold on it yet. The largest dataset for an annual product will be about 700x125 so I don’t see any reason someone couldn’t just import that and do individual storm analysis off that.

And then the actual package repo. The package would maintain the same functionality of being able to scrape data. But, it would be given additional functionality to just pull datasets from GitHub.
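
As a rough sketch of that "pull datasets from GitHub" path, assuming the data repo stores one CSV per year, basin, and product. The `dataset_url()` and `load_dataset()` helpers, the `<user>/<data-repo>` placeholder, and the file naming scheme are all hypothetical here, not the package's actual API or repo layout.

```r
# Hypothetical layout: <base_url>/<product>/<year>_<basin>_<product>.csv
# The default base_url is a placeholder, not a real repository.
dataset_url <- function(year, basin, product,
                        base_url = "https://raw.githubusercontent.com/<user>/<data-repo>/master") {
  sprintf("%s/%s/%d_%s_%s.csv",
          base_url, product, as.integer(year), basin, product)
}

# Read one pre-scraped dataset; read.csv() accepts a URL as well as a path.
load_dataset <- function(url) {
  utils::read.csv(url, stringsAsFactors = FALSE)
}

# Usage (requires a real base_url):
# fstadv <- load_dataset(dataset_url(2008, "AL", "fstadv", base_url = my_repo))
```

Keeping the URL construction separate from the download makes it easy to fall back to the scraping functions when a file is missing.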

Of course, this means I would need to update the GH dataset repo multiple times a day during an active cyclone but I can either do that on my local machine or on AWS.

All of this while keeping in mind that I would like to submit the package to rOpenSci; from what I’ve read, it seems eligible but is currently lacking.

My future plans for this package are not limited to storm advisory data. I would also like to add:

- GIS archives (back to 2008)
- reconnaissance data
- computer forecast models (not the larger ones but simple forecast plots)
- ship/buoy observations

I welcome any advice or suggestions on the best path forward here. Thank you.

Tim Trice

sckott · May 29, 2017, 4:46pm

What does this mean exactly?

How often is the data updated? Trying to get a sense of whether it makes sense to keep data in the package - is the data updated very often, or more like once per month or less?

30-40 MB is def. too much to put in a pkg, as you prob. know

In which ways?

I want to see if we can speed up your download times - can you point to that code that you say is slow?
rnoaa does buoy data - though it may be different from the buoy data you’re talking about.

timtrice · May 29, 2017, 7:06pm

sckott, thank you for responding!

Let me first point out I renamed the package to rrricanes and updated the repo links above to reflect that. It is also now connected to rrricanesdata which holds most of the data in Rda format. I renamed it as I would like to push markdown reports of active storms to twitter and rrricanes was available. Thank purrr for the inspiration!

The scraping utility gets timeout errors frequently. It’s not just my local connection as Travis has timed out occasionally as well. I plan on writing a tryCatch with maybe three attempts to reconnect on timeouts but it’s not a priority at this point.
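
A minimal sketch of that retry idea, assuming any error (including a timeout) should trigger another attempt. `with_retries()` is a hypothetical helper, not a function in the package.

```r
# Hypothetical helper: call f(...) up to max_attempts times,
# pausing `wait` seconds between failed attempts.
with_retries <- function(f, ..., max_attempts = 3, wait = 1) {
  for (attempt in seq_len(max_attempts)) {
    # Wrap the value in a list so a successful result is never
    # mistaken for a caught error condition.
    result <- tryCatch(list(value = f(...)), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result$value)
    }
    message(sprintf("Attempt %d of %d failed: %s",
                    attempt, max_attempts, conditionMessage(result)))
    if (attempt < max_attempts) Sys.sleep(wait)
  }
  # All attempts failed: re-signal the last error.
  stop(result)
}

# Usage (url is the link to a product page):
# page <- with_retries(xml2::read_html, url)
```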

Data for years prior to the current year will never be updated or changed. That theoretically could go into the package. The Rda files with compression_level 9 sit at 1.3MB right now, but that’s only for forecast data and strike probabilities.

Data for the current season would be updated at a minimum every 6 hours for active storms and possibly more frequently during drastic changes or landfalling situations. My thinking here was running a cron job that would grab the new data and push it to the data repo. But users will still have the scraping functions available as a backup.

I haven’t entirely read over the expectations of the community for new projects. Particularly since I’m the only user and the package is still in testing, I just don’t know if the community would accept it in its current state. Although I have run some validation tests and removed a significant chunk of bugs, some data quality may still be lacking because of one little thing that throws off the regex.

Two cases I just found: some advisories used \r and \n instead of the \n found in most products. Another instance: some products are in proper case while others are in upper case.
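
One way to guard against both cases is to normalize the raw text before any regex runs. This is only a sketch: `normalize_product()` is a hypothetical helper, and upper-casing everything assumes the patterns are written against upper-case text and that the original casing is not needed later.

```r
# Hypothetical pre-processing step applied to raw product text.
normalize_product <- function(txt) {
  # Convert CRLF and lone CR line endings to LF.
  txt <- gsub("\r\n?", "\n", txt)
  # Use one casing so a single set of patterns matches all years.
  toupper(txt)
}

normalize_product("Hurricane Ike\r\nAdvisory Number 1\n")
```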

The Forecast/Advisory products are by far the slowest. Here is an example:

# Get list of Atlantic storms for 2008
system.time(al2008 <- get_storms(year = 2008, basin = "AL"))

   user  system elapsed
  0.192   0.000   0.779

# Load AL092008 (Ike)
system.time(al092008 <- al2008 %>% slice(9) %>% .$Link %>% get_fstadv())

   user  system elapsed
  9.388   0.196  42.029

The new function pulling post-scraped datasets from rrricanesdata for an entire year:

system.time(al2008.fstadv <- load_storm_data(years = 2008, basins = "AL", products = "fstadv"))

   user  system elapsed
  0.112   0.008   1.036

There are 125 variables. During scraping, some are transformed (138.3W in the text to -138.3 in the dataframe or date/time values like 08/0900Z transformed to proper ymd_hms). Variables such as wind radius are moved from a long format to a wide format so each advisory has only one row in one table.
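
The two transformations mentioned above might look roughly like this in base R. Both helpers are hypothetical and are not the package's code (which refers to ymd_hms); the sketch assumes the year and month are known from elsewhere in the advisory and ignores forecasts that roll over into the next month.

```r
# "138.3W" -> -138.3, "45.0E" -> 45 (west longitudes become negative).
parse_lon <- function(x) {
  value <- as.numeric(sub("[EW]$", "", x))
  ifelse(grepl("W$", x), -value, value)
}

# "08/0900Z" (day/hour-minute, UTC) -> POSIXct, given year and month.
parse_obs_time <- function(x, year, month) {
  day    <- as.integer(substr(x, 1, 2))
  hour   <- as.integer(substr(x, 4, 5))
  minute <- as.integer(substr(x, 6, 7))
  ISOdatetime(year, month, day, hour, minute, 0, tz = "UTC")
}

parse_lon("138.3W")
parse_obs_time("08/0900Z", year = 2008, month = 9)
```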

If you’re only getting data for one storm, this isn’t an issue. It might take a minute tops (depending on the life of the storm). But to get a whole season of data just for one basin will take several minutes.

It could also be because I’m using rvest. Initially grabbing the data and extracting the product text could possibly be done better.

I have this bookmarked and thank you for reminding me! Ship and buoy data will add up so I’m not sure yet how I’m going to handle it. Once I get the GIS data added though I’d like to move to the reconnaissance data as I believe there is more value there.

Apologies if some of my responses are long-winded!

Tim Trice

sckott · June 1, 2017, 5:57pm

Thanks for the responses.

I’ll have a look at some of the code and see if I can see something.

A nice way to look at what parts of your code are slow is profvis (rstudio/profvis on GitHub: "Visualize R profiling data") - in case you didn’t know about it.

No problem that you think you are the only user - it’s entirely possible others are using it. We do want submitted pkgs to have tests though, so mind that rule. There’s in general no issue with an early-stage pkg being submitted, and in fact it’s nice to get a pkg in review at an earlier stage to get feedback early rather than after the maintainer is in a way wedded to the pkg structure, conventions, etc.

timtrice · June 1, 2017, 6:29pm

Scott, that would be very helpful! Thank you. I apologize if the repo seems disorganized. I am working on getting a solid structure. Currently I’m updating the docs on the master branch to reflect the name change and doing minor cleaning.

And I’ll check out profvis. I did use it a bit when I first created the package, and I do remember fetching the URLs was the most time-consuming part. But I still think I can clean up the code there a bit to make it slightly faster. I’ll use it again to test various ideas.

The package is definitely tested. Probably too much (I removed some tests in version 0.1.1). Many of the tests were to handle the minor fluctuations from one advisory to another (i.e., Adv 1 and Adv 1…COR), to make sure that by fixing one thing I didn’t break the other. I’m trying to think of a way to have a develop branch and a test branch so the develop branch won’t take so long to build. Not sure if that’s feasible or even acceptable; just an idea.

I think I will go ahead and submit the package and see what the community thinks. I’m a bit overwhelmed with it, in the sense that I know what I need to do but I may not be documenting it correctly or following proper procedures. Additionally, I know how things are supposed to work, but I’m not sure I’m expressing it correctly to others.

Thanks again for the advice and any help you can offer!

Tim Trice

sckott · June 5, 2017, 9:03pm

Thanks for your submission to onboarding. https://github.com/ropensci/onboarding/issues/118

We’ll be getting to it soon with editor checks and so on.
