rnoaa: Getting county level rain data

Hi there,

Using the homr page on the NOAA website, I downloaded a list of all the US weather stations. I then dropped any non-active stations and any stations whose IDs don't start with US1, and merged this data with a FIPS code dataset. My plan is to get rain data for one day for every station and then collapse it to the county level.
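For reference, the filtering step looks roughly like this. This is just a sketch with made-up data; the column names `station_id` and `status` are placeholders for whatever the homr download actually calls them:

```r
# Toy stations table standing in for the homr download
# (column names here are placeholders, not the real homr names)
stations <- data.frame(
  station_id = c("US1AKAB0058", "USC00023160", "US1ALBW0095"),
  status     = c("ACTIVE", "CLOSED", "ACTIVE"),
  stringsAsFactors = FALSE
)

# keep active stations whose ID starts with US1
keep <- stations$status == "ACTIVE" & startsWith(stations$station_id, "US1")
stations <- stations[keep, ]
```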

Here is my R code so far (the question is also posted on Stack Overflow):

Grabbing the necessary column:

df <- dataframe$ghcnd

This gives me an output like:

[1] "GHCND:US1AKAB0058" "GHCND:US1AKAB0015" "GHCND:US1AKAB0021" "GHCND:US1AKAB0061"
 [5] "GHCND:US1AKAB0055" "GHCND:US1AKAB0038" "GHCND:US1AKAB0051" "GHCND:US1AKAB0052"
 [9] "GHCND:US1AKAB0060" "GHCND:US1AKAB0065" "GHCND:US1AKAB0062" "GHCND:US1AKFN0016"
[13] "GHCND:US1AKFN0018" "GHCND:US1AKFN0015" "GHCND:US1AKFN0011" "GHCND:US1AKFN0013"
[17] "GHCND:US1AKFN0030" "GHCND:US1AKJB0011" "GHCND:US1AKJB0014" "GHCND:US1AKKP0005"
[21] "GHCND:US1AKMS0011" "GHCND:US1AKMS0019" "GHCND:US1AKMS0012" "GHCND:US1AKMS0020"
[25] "GHCND:US1AKMS0018" "GHCND:US1AKMS0014" "GHCND:US1AKPW0001" "GHCND:US1AKSH0002"
[29] "GHCND:US1AKVC0006" "GHCND:US1AKWH0012" "GHCND:US1AKWP0001" "GHCND:US1AKWP0002"
[33] "GHCND:US1ALAT0014" "GHCND:US1ALAT0013" "GHCND:US1ALBW0095" "GHCND:US1ALBW0087"
[37] "GHCND:US1ALBW0020" "GHCND:US1ALBW0066" "GHCND:US1ALBW0031" "GHCND:US1ALBW0082"
[41] "GHCND:US1ALBW0099" "GHCND:US1ALBW0040" "GHCND:US1ALBW0004" "GHCND:US1ALBW0085"
[45] "GHCND:US1ALBW0009" "GHCND:US1ALBW0001" "GHCND:US1ALBW0094" "GHCND:US1ALBW0013"
[49] "GHCND:US1ALBW0079" "GHCND:US1ALBW0060"

In reality, I have about 22,000 weather stations. This is just showing the first 50.

rnoaa code

library(rnoaa)
options("noaakey" = Sys.getenv("noaakey"))
Sys.getenv("noaakey")

weather <- ncdc(datasetid = 'GHCND', stationid = df, var = 'PRCP', startdate = "2020-05-30",
                enddate = "2020-05-30", add_units = TRUE)

Which produces the following error:
Error: Request-URI Too Long (HTTP 414)

When I subset df to just, say, the first 100 entries, the request goes through, but I can't get data for more than the first 25 stations. That's confusing, because the package details say I should be able to run 10,000 queries a day.

Loop Attempt

for (i in 1:length(df)){
  weather2<-ncdc(datasetid = 'GHCND', stationid=df1[1],var='PRCP',startdate ='2020-06-30',enddate='2020-06-30',
          add_units = TRUE)
  
}

But this just produces the warning "Sorry, no data found".

If anyone could give advice on what to try next, that would be great 🙂


Thanks for your question @woodhouse123

The 414 error is not specific to NOAA or rnoaa; it's a generic error for when the URL is too long. What you have with ncdc(datasetid = 'GHCND', stationid = df, var = 'PRCP', startdate = "2020-05-30", enddate = "2020-05-30", add_units = TRUE) is a very long URL because df is a long vector. There's not an easy way for rnoaa to help users avoid this.

As for only getting 25 results: that's because the default limit is 25 for ncdc(). You can see it if you look at the documentation or print out the function:

ncdc <- function(datasetid = NULL, datatypeid = NULL, stationid = NULL,
  locationid = NULL, startdate = NULL, enddate = NULL,
  sortfield = NULL, sortorder = NULL, limit = 25, offset = NULL,
  token = NULL, includemetadata = TRUE, add_units = FALSE, ...)
{

Also, the loop probably wouldn't work because df1 is not defined anywhere (unless you meant df?). And since your loop index is i, you'd want df[i], not df1[1].
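One more thing: each pass through your loop overwrites weather2, so even a fixed loop would keep only the last result. You want to assign into a list instead. A toy demo of the difference (no API calls here, just fake IDs):

```r
# Toy demo: overwriting in a loop keeps only the last value,
# while assigning into a list keeps every result
ids <- c("a", "b", "c")

res <- NULL
for (i in seq_along(ids)) {
  res <- ids[i]          # overwritten on every iteration
}
# res is now just "c", the last id

out <- vector("list", length(ids))
for (i in seq_along(ids)) {
  out[[i]] <- ids[i]     # each result kept in its own slot
}
# out holds all three ids
```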


@sckott

I’ve just re-run it like this:

df1 <- df[1:100] ## Splitting dataframe. Too big otherwise

for (i in 1:length(df1)){
  weather<-ncdc(datasetid = 'GHCND', stationid=df1[i],var='PRCP',startdate ='2020-06-30',enddate='2020-06-30',
                add_units = TRUE)
  
}

I get a bunch of warnings for no data, and then a dataset that has only a single row. The observation is the 100th station from df1.

So maybe this is just a problem with my loop?

@sckott
Alternatively, this works and seems to be the most queries I can do at once.

df1 <- df[1:125] ## Splitting dataframe. Too big otherwise


weather <- ncdc(datasetid = 'GHCND', stationid = df1, var = 'PRCP', startdate = "2020-05-30",
                enddate = "2020-05-30", add_units = TRUE, limit = 125)

If I had a way to do this in a loop so that I get the next 125, and then the next 125, and so on until I've done all ~22,000, that would be great…

Maybe something like this:

# say you want to split up your station ids into chunks of 5 each, as an example
z <- split(df, ceiling(seq_along(df)/5))
out <- list()
for (i in seq_along(z)) {
  out[[i]] <- ncdc(datasetid = 'GHCND', stationid = z[[i]], var = 'PRCP', 
    startdate = "2020-05-30", enddate = "2020-05-30", 
    add_units = TRUE, limit = 125)
}
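To see what the split() trick is doing there, here's a toy run with 12 fake IDs in chunks of 5 (the grouping index ceiling(seq_along(df)/5) assigns 1 to the first five elements, 2 to the next five, and so on):

```r
# split a vector into chunks of 5 via an integer grouping index
ids <- paste0("station", 1:12)
chunks <- split(ids, ceiling(seq_along(ids) / 5))
# chunks is a list of 3 vectors with 5, 5, and 2 ids
```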

@sckott Thanks so much! This is really helpful. I think I’ll be able to get something like this to work. Been making some adaptations. I will keep in touch!


OK I got it working I believe!

z <- split(df, ceiling(seq_along(df)/100))
out <- list()
for (i in seq_along(z)) {
  out[[i]] <- ncdc(datasetid = 'GHCND', stationid = z[[i]], var = 'PRCP', 
                   startdate = "2020-05-30", enddate = "2020-05-30", 
                   add_units = TRUE, limit = 100)
}

This allows me to download data for all 21,882 stations at once, although I believe some of them have no data.

My output is a list of 219 elements, each of which has two elements.

For instance, the first element has out[[1]]$meta and out[[1]]$data.

What I’m interested in is combining the rows from the 219 out[[i]]$data. Would this require a for loop?

You don't need an explicit for loop:

dplyr::bind_rows(lapply(out, "[[", "data"))
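If you'd rather avoid the dplyr dependency, base R's do.call() + rbind does the same thing when every chunk has the same columns. A toy demo with fake $data elements standing in for the ncdc() results:

```r
# each element mimics one ncdc() result, with a $data data frame
out <- list(
  list(data = data.frame(station = "US1A", value = 10,
                         stringsAsFactors = FALSE)),
  list(data = data.frame(station = "US1B", value = 20,
                         stringsAsFactors = FALSE))
)

# pull out every $data element and stack the rows
combined <- do.call(rbind, lapply(out, "[[", "data"))
```

bind_rows() is more forgiving than rbind() if some chunks came back empty or with differing columns, so it's the safer choice for the real 219-element list.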

Woohoo! That did it!! Thank you!!