rerddap - get gridded data in month chunks

rerddap
rnoaa

#1

Someone asked how to get data in month-length chunks from an individual dataset with rerddap, a general-purpose R client for ERDDAP servers.

The following are my thoughts on this:

library("rerddap")
library("dplyr")
library("lubridate")

We’ll use the dataset hycom_gom310S, found at http://upwell.pfeg.noaa.gov/erddap/griddap/hycom_gom310S.html

get info on the dataset

(res <- info('hycom_gom310S'))
#> <ERDDAP info> hycom_gom310S 
#>  Dimensions (range):  
#>      time: (2009-04-02T00:00:00Z, 2014-08-30T00:00:00Z) 
#>      latitude: (18.09165, 31.96065) 
#>      longitude: (-98.0, -76.40002) 
#>  Variables:  
#>      emp: 
#>          Units: kg/m2/s 
#>      mld: 
#>          Units: m 
#>      mlp: 
#>          Units: m 
#>      qtot: 
#>          Units: w/m2 
#>      ssh: 
#>          Units: m 
#>      surface_salinity_trend: 
#>          Units: PSU/day 
#>      surface_temperature_trend: 
#>          Units: degC/day 

get start and end times

(times <- res$alldata$NC_GLOBAL %>% 
  filter(attribute_name %in% c('time_coverage_end', 'time_coverage_start')) %>% 
  select(value) %>% 
  .$value)
#> [1] "2014-08-30 UTC" "2009-04-02 UTC"

make month long time periods

times <- lubridate::ymd_hms(times)
starts <- seq(times[2], times[1], by = "1 month")
starts[1:3]
#> [1] "2009-04-02 UTC" "2009-05-02 UTC" "2009-06-02 UTC"
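One detail to watch with this approach: the final chunk's start plus 30 days can run past the dataset's time_coverage_end. A minimal base-R sketch of clamping the chunk end dates, using the hycom_gom310S range shown above:

```r
# Chunk end dates should never exceed the dataset's time_coverage_end;
# pmin() truncates the final chunk instead of overshooting it
coverage_start <- as.Date("2009-04-02")
coverage_end   <- as.Date("2014-08-30")

chunk_starts <- seq(coverage_start, coverage_end, by = "1 month")
chunk_ends   <- pmin(chunk_starts + 30, coverage_end)

tail(chunk_ends, 1)
#> [1] "2014-08-30"
```

Without the `pmin()` the last chunk would end at 2014-09-01, past the data's coverage.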

loop over each month with griddap()

set read=FALSE to only download files

out <- list()

for (i in seq_along(starts)) {
  dates <- c(ymd(starts[i]), ymd(starts[i] + days(30)))
  cat("getting ", as.character(dates[1]), 
      " to ", as.character(dates[2]), "\n")
  out[[i]] <- griddap(
    res,
    time = dates,
    latitude = c(20, 21),
    longitude = c(-80, -81),
    read = FALSE
  )
}

inspect

length(out)

read in netcdf files

out2 <- lapply(out, function(z) ncdf4::nc_open(z$summary$filename))
vapply(out2, class, "")
#> [1] "ncdf4" "ncdf4" "ncdf4"

OR - read in within the for loop

out_dfs <- list()

for (i in seq_along(starts[1:3])) {
  dates <- c(ymd(starts[i]), ymd(starts[i] + days(30)))
  cat("getting ", as.character(dates[1]), 
      " to ", as.character(dates[2]), "\n")
  out_dfs[[i]] <- griddap(
    res,
    time = dates,
    latitude = c(20, 21),
    longitude = c(-80, -81),
    read = TRUE
  )
}

This way the data is already read in when the loop finishes, though reading within the loop does take longer.

Then, bind all together in one data.frame

tbl_df(bind_rows(lapply(out_dfs, "[[", "data")))
#> # A tibble: 67,704 × 10
#>                    time      lat       lon           emp      mld       mlp      qtot
#>                   <chr>    <dbl>     <dbl>         <dbl>    <dbl>     <dbl>     <dbl>
#> 1  2009-04-02T00:00:00Z 19.98216 -81.00000  1.001745e-07 12.47464  9.482147 -193.3023
#> 2  2009-04-02T00:00:00Z 19.98216 -80.96002 -9.975844e-07 11.03450  4.705999 -194.2480
#> 3  2009-04-02T00:00:00Z 19.98216 -80.91998 -2.570382e-07 13.87560  9.874534 -188.5700
#> 4  2009-04-02T00:00:00Z 19.98216 -80.88000 -3.189338e-06 12.41764  4.614823 -190.0845
#> 5  2009-04-02T00:00:00Z 19.98216 -80.83997 -4.883811e-06 16.26543 13.045880 -187.4980
#> 6  2009-04-02T00:00:00Z 19.98216 -80.79999 -7.715149e-06 13.75597  4.588535 -190.6221
#> 7  2009-04-02T00:00:00Z 19.98216 -80.76001 -8.380262e-06 16.91521 12.527599 -187.1993
#> 8  2009-04-02T00:00:00Z 19.98216 -80.71997 -1.041871e-05 12.47726  4.570753 -188.3784
#> 9  2009-04-02T00:00:00Z 19.98216 -80.67999 -1.129382e-05 11.95203  5.426512 -186.7068
#> 10 2009-04-02T00:00:00Z 19.98216 -80.64001 -1.149297e-05 16.31684 10.040488 -183.0806
#> # ... with 67,694 more rows, and 3 more variables: ssh <dbl>, surface_salinity_trend <dbl>,
#> #   surface_temperature_trend <dbl>

#2

Hi Scott,

Thanks so much for the valuable feedback on my query - it is a great help and much appreciated.

I will try the above suggestion and let you know how I get on. I do have an additional question though.

While experimenting with the rerddap package on the hycom datasets available at http://upwell.pfeg.noaa.gov/erddap/ I was a bit stumped by some of the outputs I managed to extract.

I am interested in the domain latitude = c(-11, -38), longitude = c(110, 130), which covers Western Australia. I queried the ‘nrlHycomGLBu008e909S’ dataset, which has the following dimensions:

Dimensions (range):  
     time: (2012-01-01T00:00:00Z, 2013-08-20T00:00:00Z) 
     latitude: (-80.0, 80.0) 
     longitude: (0.0, 359.9200439453125) 
 Variables:  
     surf_el: 
         Units: m 

I expected that using the following would capture the required spatial domain from the dataset (detailed above):

res <- griddap('nrlHycomGLBu008e909S',
           time = c('2012-01-01','2012-01-02'),
           latitude = c(-11,-38),
           longitude = c(110,130),
           store = disk('~/Documents'))

However, that longitude specification actually returned data for South America, and I needed to add 180 degrees (i.e. longitude = c(290, 310)) to capture data for Western Australia.

The reverse occurs when I use the “nrlHycomGLBu008e909S_LonPM180” dataset. Here I have to subtract 180 degrees from this dataset to get data covering Western Australia.

The resulting data has incorrect longitude values in both cases (one 180 degrees greater than expected, and one 180 degrees less than expected). Is this an error on my part with the griddap call I made (both calls were the same except for the dataset queried), or do these datasets have their 0 longitude set at the international date line rather than the Greenwich prime meridian?

I tried looking at the metadata to see what I could make out but couldn’t find anything. I also tested the griddap call against another dataset, ‘hawaii_soest_60de_2b3a_7e11’ (which, I should mention, wasn’t from the Naval Research Laboratory or Naval Oceanographic Office). This worked perfectly with the code below, which specified what I initially thought was the correct spatial domain:

res_test <- griddap('hawaii_soest_60de_2b3a_7e11',
                     time = c('2013-08-21','2013-08-22'),
                     latitude = c(-11,-38),
                     longitude = c(110,130),
                     fields='ssh',
                     store = disk('~/Documents'))

If the problem is not an incorrect griddap call on my part but lies with the datasets themselves - is it possible to alter the resulting netCDF file to have the right longitude values (i.e. instead of longitude values which are 180 degrees greater than they should be)?

Thanks again for your help with the time subsetting/looping. It is greatly appreciated.

Regards, Dave


#3

Hi again,

so I have been playing with the code that Scott so kindly provided to get monthly chunks of grid data.

The extraction works well and grabs the data; however, I think I have hit a bottleneck, as I end up getting the following message:

Error: HTTP Status 500 - There was a (temporary?) problem.  Wait a minute, then try again.  (In a browser, click the Reload button.)
(ERROR from data source: dods.dap.DODSException: Connection cannot be read http://localhost:8080/thredds/dodsC/pacioos/hycom/global.dods?v[115:115][0:9][990:1366][448:698])
(Cause: dods.dap.DODSException: Connection cannot be read http://localhost:8080/thredds/dodsC/pacioos/hycom/global.dods?v[115:115][0:9][990:1366][448:698])

This occurs when I either run the code to grab monthly chunks or daily chunks for the spatial domain I am interested in. When just grabbing daily chunks, I managed to get 114 days of data.
The griddap call was for two variables, using the following:

Get info on the dataset of choice from erddap server

(res <- info('HYCOM_Global_3D'))

Get start and end times of erddap dataset chosen

(times <- res$alldata$NC_GLOBAL %>% 
  filter(attribute_name %in% c('time_coverage_end', 'time_coverage_start')) %>% 
  select(value) %>% 
  .$value)

make month long time periods

times <- lubridate::ymd_hms(times)
starts <- seq(times[2], times[1], by = "1 day")
starts[1:3]

loop over each month with griddap

# set read = FALSE to only download files
out <- list()

for (i in seq_along(starts)) {
  dates <- c(ymd(starts[i]), ymd(starts[i] + days(1)))
  cat("getting ", as.character(dates[1]), 
      " to ", as.character(dates[2]), "\n")
  out[[i]] <- griddap(
    res,
    time = dates,
    latitude = c(-11, -38),
    longitude = c(110, 130),
    fields= c('eastward_sea_water_velocity','northward_sea_water_velocity'),
    read = FALSE,
    url = "http://upwell.pfeg.noaa.gov/erddap/"
  )
}

I assume the error relates either to requesting a subset that is too large, or, in the daily case, to requesting data too many times?
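In the meantime I may try simply retrying failed chunks after a pause, since the error message suggests the problem is temporary. A minimal sketch (with_retry() is a made-up helper name here, not part of rerddap):

```r
# Retry a flaky request a few times, pausing longer after each failure.
# with_retry() is a hypothetical helper, not part of rerddap.
with_retry <- function(fun, tries = 3, pause = 1) {
  for (attempt in seq_len(tries)) {
    res <- tryCatch(fun(), error = function(e) e)
    if (!inherits(res, "error")) return(res)
    if (attempt < tries) Sys.sleep(pause * attempt)
  }
  stop(res)
}
```

Inside the loop the request could then be wrapped as e.g. `out[[i]] <- with_retry(function() griddap(res, time = dates, latitude = c(-11, -38), longitude = c(110, 130), read = FALSE))`, so a single transient failure doesn’t kill the whole run.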

I guess this means I should be trying to request data directly from the OPeNDAP server rather than via the ERDDAP server.

I came across this post, which uses the old ncdf R package (which I believe has been replaced by ncdf4):

https://publicwiki.deltares.nl/display/OET/OPeNDAP+subsetting+with+R

I’m not sure how to implement this using the ncdf4 package. I understand if this is not the appropriate forum, but thought I would continue the post in case someone is able to assist, and any answers could be of benefit to others.
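My rough attempt at translating that approach to ncdf4 looks something like the following. The helper name, the "lon"/"lat" dimension names, and the (lon, lat, time) dimension order are all guesses on my part; they would need checking against the real dataset (nc$var shows the actual layout), and ncdf4 must be installed:

```r
# Sketch of OPeNDAP subsetting with ncdf4: open the remote dataset, find
# the index window covering a lat/lon box, and request only that slab.
# read_opendap_box() is a hypothetical helper; dimension names and order
# (lon, lat, time) are assumptions -- check nc$var for the real dataset.
read_opendap_box <- function(url, var, lon_range, lat_range, time_index = 1) {
  nc <- ncdf4::nc_open(url)
  on.exit(ncdf4::nc_close(nc))
  lon <- ncdf4::ncvar_get(nc, "lon")
  lat <- ncdf4::ncvar_get(nc, "lat")
  lon_idx <- which(lon >= min(lon_range) & lon <= max(lon_range))
  lat_idx <- which(lat >= min(lat_range) & lat <= max(lat_range))
  # start = first index of the slab, count = number of cells per dimension,
  # so only the requested window is pulled over the network
  ncdf4::ncvar_get(nc, var,
                   start = c(min(lon_idx), min(lat_idx), time_index),
                   count = c(length(lon_idx), length(lat_idx), 1))
}
```

A call would then look something like `read_opendap_box("http://.../thredds/dodsC/...", "eastward_sea_water_velocity", c(110, 130), c(-38, -11))`, with a real OPeNDAP URL substituted in.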

Regards, Dave


#4

@dabdo Regarding the longitude values:

I will add some details to the documentation for griddap, extending what I have in the ?search_adv man file:

@param minLon,maxLon (numeric) Minimum and maximum longitude. Some datasets
have longitude values within -180 to 180, others use 0 to 360. If you
specify min and max Longitude within -180 to 180 (or 0 to 360), ERDDAP will
only find datasets that match the values you specify. Consider doing one
search: longitude -180 to 360, or two searches: longitude -180 to 180,
and 0 to 360.

So, there are two possible longitude sets: 1) -180/180, or 2) 0/360

We do check within griddap() that longitude/latitude and other dimension arguments passed in are within the data range (min and max) in the metadata for the dataset. However, with longitude this doesn’t really help for every set of values. E.g., for the nrlHycomGLBu008e909S dataset, the longitude range is basically 0-360. If you put in any positive number within that range it will be sent to the server, and you may get a misleading result. If you put in a negative number, however, you would get an error, e.g.,

res <- griddap('nrlHycomGLBu008e909S',
              time = c('2012-01-01','2012-01-02'),
              latitude = c(-11, -18),
              longitude = c(-110, 120))
#> Error: One or both longitude values (-110, 120) outside data range (0, 359.920043945312)

And with a longitude range that’s -180/180, any value passed in between those two numbers will be passed through.

If you have any ideas about how to automatically make this work, I’d love to hear them.

For now, the best advice is to check whether the dataset expects 0/360 or -180/180, which can be done programmatically:

x <- info('nrlHycomGLBu008e909S')
subset(x$alldata$longitude, attribute_name == "actual_range", "value")$value
#> [1] "0.0, 359.9200439453125"
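Once you know which convention the dataset uses, converting a request between the two is simple arithmetic. A sketch (to_0_360() and to_pm180() are made-up helper names, not rerddap functions):

```r
# Shift longitudes between the -180/180 and 0/360 conventions
to_0_360 <- function(lon) ifelse(lon < 0, lon + 360, lon)
to_pm180 <- function(lon) ifelse(lon > 180, lon - 360, lon)

to_0_360(c(-110, 120))
#> [1] 250 120
to_pm180(c(250, 120))
#> [1] -110  120
```

You could pick the right conversion based on the actual_range check above before building the griddap() call.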

I think we want to make sure the query is correct instead of fixing the file.


#5

Yes, one or the other is my guess.

I have been looking for information on rate limiting for ERDDAP servers, but haven’t found anything yet. I will let you know if I do, and will see what can be done then.

I don’t know the answer to this. It’s possible OPeNDAP doesn’t do this rate limiting, but I don’t know; it could be better, worse, or the same. Did you try it yet?

Yes, ncdf4 is the best NetCDF package to use.