Using rnoaa to get a SPDF of NCDC normals

Tags: #<Tag:0x00007fa100a53c48> #<Tag:0x00007fa100a53b08>


Hi all, first post here and new to rOpenSci and in love with rnoaa. I’m teaching time series and spatial stats this quarter and rnoaa has opened up an incredible wealth of new data. I maintain a much smaller package and know all too well how hard it is and how often users confound you with unorthodox applications. So a thanks to start.

Here is what I want to do: make a spatial points data frame (SPDF) of mean annual temperature (MAT) and total annual precipitation (TAP) for the last climate normal for all the NCDC stations in the Pacific Northwest. I thought I would do this via dplyr::filter through state code WA, OR, ID or by a bounding box in sp. The final SPDF would have a data slot with two values in it for all the points: MAT and TAP. I kind of thought I would try something like this to get the stations:

sta <- ncdc_stations()
sta_sp <- sta$data
coordinates(sta_sp) <- ~latitude+longitude
plot(sta_sp) # only 25 points -- the first page of the stations

But I see that sta only contains the first page (25 points) of some 1.3E5 stations. And this returns some 3.2E6 records!

sta_norms<- ncdc(datasetid="NORMAL_ANN",startdate = "2010-01-01",
                 enddate = "2010-01-01")

So, I can see why the calls from rnoaa are conservative. You can’t have people querying millions of points like a dummy (looks in mirror). So clearly I need some help. My initial idea was to filter the data by state or by a bounding box but I’m unsure of how to best proceed. Any suggestions appreciated and sorry for not having more code to fuss with in my example. I’m not sure how to get started. I’ve been flamed by Ripley in my youth so I can handle being told I’m over my head.

Thanks in advance.


Happy to help!

You can search for stations with the extent parameter like ncdc_stations(extent=c(47.5204,-122.2047,47.6139,-122.1065))

All ncdc*() functions have pagination, so yes, 25 is default, but you can set it as high as 1000. And if you need more than that, use the offset parameter (e.g., request 1 [limit=1000, offset=0], request 2 [limit=1000, offset=1000], etc.)

Correct, but mostly that’s to help out NOAA to make sure 100 people aren’t all requesting 1 million data points all at the same time.

i have as well - not fun!

So for a game plan:

  • Query to get stations ncdc_stations with extent parameter (or you can use locationid to give a FIPs code - Use limit and offset params to get all the stations you need
  • Take station ids and feed those into ncdc to get the data you need. ncdc is also paginated, so a similar approach with limit and offset params


Alright. I got it done. A bit clunky yet but big progress:


# stations in Washington that have both PRCP and TAVG for NORMAL-ANN
wa_sta <- ncdc_stations(extent=c(45.33,-124.76,49,-116.92),
                        datatypeid = c("ANN-PRCP-NORMAL","ANN-TAVG-NORMAL"),
# get the normal data
wa_norms <- ncdc(datasetid="NORMAL_ANN",
                 datatypeid = c("ANN-PRCP-NORMAL","ANN-TAVG-NORMAL"),
                 stationid = wa_sta$data$id[1:100], # quits out at n > 128
                 startdate = "2010-01-01",
                 enddate = "2010-01-01",
# staple these together b/c the last call would only do 128 before a
# 500 error
wa_norms2 <- ncdc(datasetid="NORMAL_ANN",
                  datatypeid = c("ANN-PRCP-NORMAL","ANN-TAVG-NORMAL"),
                  stationid = wa_sta$data$id[101:198],
                  startdate = "2010-01-01",
                  enddate = "2010-01-01",

# make into a data.frame with coordinate info
# start with climate
wa_norms_all <- rbind(wa_norms$data[,c(2:4)],
wa_tavg <- filter(wa_norms_all,datatype=="ANN-TAVG-NORMAL")
wa_prcp <- filter(wa_norms_all,datatype=="ANN-PRCP-NORMAL")
# check to make sure stations are all the same
all(wa_tavg$station == wa_prcp$station)
# now combine into a data.frame
wa_norm.df <- data.frame(id=wa_tavg$station,tavg=wa_tavg$value,
# go get station data into a more useful format
wa_sta.df <- wa_sta$data
wa_sta.df <- wa_sta.df[,c(1,4,5,7,9)]

# this is what I want as a data.frame
wa_norm.df <- inner_join(wa_norm.df, wa_sta.df, by = "id")
# spatialize since I said that is what I wanted 
wa_norm.sp <- wa_norm.df
coordinates(wa_norm.sp) <- ~longitude+latitude

So it’s a bit clunky still as code but works fine. The only question I have is why I get an error:

Warning message:
Error: (500) - An error occured while servicing your request.

when running trying to get all the data from ncdc at one go:

wa_norms <- ncdc(datasetid="NORMAL_ANN",
                 datatypeid = c("ANN-PRCP-NORMAL","ANN-TAVG-NORMAL"),
                 stationid = wa_sta$data$id, # quits out at n > 128
                 startdate = "2010-01-01",
                 enddate = "2010-01-01",

But this is great! I’m on my way with this tool. And no flame from BDR…


the problem is the URL is too long - that is, the station ids, and other params, are sent in the URL as query parameters. when the URL is too long you should get a 414 http error resposne saying as much, but i guess NCDC isn’t giving the appropriate response. i’ll ping them and ask to fix it.

Anyway, no way to get around that. You’ll have to pass in a set of IDs that makes the URL short enough - not sure exactly what the character limit is to avoid 414, but looks like < 128 station ids as you said above


Fascinating! Thanks much for your help.