How to avoid space-hogging raster tempfiles?

geospatial
servers

#1

I run a shared lab server that we use for big computational jobs, which often include geospatial routines using the raster package. raster has the neat feature of being able to work with bigger-than-RAM data by storing it on disk. When raster objects get too large to handle in memory, it automatically moves them to disk as temporary files in the /tmp/ directory.

This can result in the rapid accumulation of lots of large tempfiles. Moreover, many users of the server aren’t aware of this, as /tmp/ is a top-level directory that most users never look at. A user doing some ambitious geoprocessing can fill up the hard drive and grind all jobs to a halt pretty easily.

Thanks to some feedback from rOpenSci colleagues, I came up with a 3-part solution to this:

  1. Clear out tempfiles more frequently: By default, raster keeps tempfiles between sessions, removing only those older than a certain age when a session begins. The default age is a week. Frankly, none of my users reuse these tempfiles as far as I can tell. So I changed the age to one hour by adding options(rasterTmpTime = 1) to my Rprofile.site file (the value is in hours). In practice, though, many users leave their RStudio sessions open indefinitely, so in many cases this doesn’t prevent accumulation.

  2. Move items to disk less frequently: raster has some nice machinery to estimate the memory needed for a task and move data to disk if that much memory is not available. However, it turns out that it also has a default upper limit of 100 MB for any task. I have much more RAM than this available, so I also added options(rasterMaxMemory = 1e10) to raise the limit to 10 GB.

  3. Move tempfiles to users’ own home directories: Users weren’t generally aware of the accumulation of temporary files because they were outside of their own directories. I wanted each user to have their own tempfile directory so they could view and delete temporary files they generated.

    In theory this is supposed to be set by the TMPDIR, TMP, or TEMP environment variables, but because of the way this is implemented in R, one can’t set a user-specific path like ~/tmp/ or $HOME/tmp/ in Renviron.site; path expansion is not performed. Thankfully, Simon Urbanek’s unixtools package (GitHub only) has a nice utility for resetting the temp directory. I’ve installed this on my server and added these lines to Rprofile.site:

    # Create a per-user temp directory if needed, then point this R session at it
    tmp_dir <- path.expand("~/tmp")
    if (!dir.exists(tmp_dir)) dir.create(tmp_dir)
    unixtools::set.tempdir(tmp_dir)

    Now all my users have a visible ~/tmp/ directory in their home folder where temporary R session files go. It’s easy for them to see when they are using a lot of space, and easy for me to see who is taking up space with a quick sudo ncdu /home.
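For users who would rather check from inside R than browse the folder, something like the following might do. It is only a sketch: dir_size_mb is a hypothetical helper written for this post, not part of raster or unixtools. The removeTmpFiles() call at the end is raster's own cleanup function, which can be run by hand in a session that has been left open; it is guarded so the snippet still runs where raster is not installed.

```r
# Hypothetical helper: total size (in MB) of all files under a directory.
# Defaults to the current session's temp directory.
dir_size_mb <- function(path = tempdir()) {
  files <- list.files(path, recursive = TRUE, full.names = TRUE,
                      all.files = TRUE)
  sum(file.size(files), na.rm = TRUE) / 1024^2
}

# How much space is this session's temp directory using right now?
dir_size_mb()

# In a long-running session, raster's cleanup can be triggered manually.
if (requireNamespace("raster", quietly = TRUE)) {
  raster::removeTmpFiles(h = 1)  # delete raster temp files older than 1 hour
}
```

Calling dir_size_mb(path.expand("~/tmp")) would report on the per-user directory set up above, assuming it exists.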

P.S.: You’ll find my Rprofile.site and the rest of the configuration in my server Docker image.


#2

Minor comment: I think the reason it doesn’t work to set TMPDIR (not TEMPDIR) in Renviron is that R sets up the temp folder very early in the startup process, well before the Renviron/Rprofile files are processed. TMPDIR (or TMP or TEMP) must be set before launching R, e.g.

$ TMPDIR=$HOME/tmp R -e "tempdir()"

or simply in /etc/bashrc - though that would affect all processes, not just R.
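A quick way to see this from inside a running session is sketched below, using base R only. The ~/tmp path is just the example directory from this thread; the point is that changing the variable after startup leaves tempdir() untouched.

```r
# The variables R consults at startup when choosing its temp directory
Sys.getenv(c("TMPDIR", "TMP", "TEMP"))

before <- tempdir()

# Changing the variable now is too late: the session's temp directory
# was fixed when R started.
old_tmpdir <- Sys.getenv("TMPDIR", unset = NA)
Sys.setenv(TMPDIR = path.expand("~/tmp"))
after <- tempdir()

identical(before, after)  # the temp directory has not moved

# Restore the environment variable to its previous state
if (is.na(old_tmpdir)) Sys.unsetenv("TMPDIR") else Sys.setenv(TMPDIR = old_tmpdir)
```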

Didn’t know about unixtools::set.tempdir() - looks neat.


#3

Great suggestions, thanks! Point 2 seems particularly “promising”.

I’ll add my 2 cents: an alternative I find useful for avoiding the accumulation of large rasters is to explicitly save results to a “tiff” file whenever possible (a temporary one, if I don’t need the result later). Most raster functions (e.g., calc, overlay, crop, mask, etc.) allow this through the filename argument, and results are still accessible in the “usual” ways.
This also lets you specify compression options. For example, using:

filename = tempfile(fileext = ".tif"),
options   = "COMPRESS=LZW"

saves a lot of space thanks to the efficiency of TIFF compression, in particular for categorical data or data containing a lot of NAs. Explicitly setting the datatype when possible, and thus avoiding the “auto” saving as FLT4S, can also help.
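Put together as a complete call, it might look like the sketch below. The raster here is made-up categorical data with many NAs, purely for illustration, and the snippet assumes the raster package can write GeoTIFFs on your system (older versions need rgdal for that). The extra arguments are passed through calc() to writeRaster().

```r
library(raster)

# Small categorical raster with many NAs (made-up data, illustration only)
set.seed(42)
r <- raster(nrows = 500, ncols = 500)
values(r) <- sample(c(NA, 1:5), ncell(r), replace = TRUE,
                    prob = c(0.75, rep(0.05, 5)))

# Write the result of calc() straight to a compressed GeoTIFF instead of
# letting raster pick its own temporary file.
out <- calc(r, fun = function(x) x * 2,
            filename = tempfile(fileext = ".tif"),
            datatype = "INT1U",          # values 2..10 fit in one unsigned byte
            options  = "COMPRESS=LZW")

fromDisk(out)             # is the layer backed by the file on disk?
file.size(filename(out))  # size of the compressed file, in bytes
```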

This does, however, have some performance cost due to compressing and decompressing. I’d be curious to see benchmarks of the trade-off between the slowdown from compression/decompression and the (possible) gains from spending less time reading from and writing to disk.
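In case anyone wants to try, here is a rough sketch of how such a comparison could be set up. It is not a proper benchmark: the data are made up, each step is timed once, and read timings will be affected by the operating system's file cache. Compression is switched off explicitly for the baseline because the default may differ between raster versions.

```r
library(raster)

# Made-up categorical raster with many NAs (illustration only)
set.seed(1)
r <- raster(nrows = 2000, ncols = 2000)
values(r) <- sample(c(NA, 1:5), ncell(r), replace = TRUE,
                    prob = c(0.75, rep(0.05, 5)))

f_plain <- tempfile(fileext = ".tif")
f_lzw   <- tempfile(fileext = ".tif")

# Time the write with and without compression
t_write_plain <- system.time(
  writeRaster(r, f_plain, datatype = "INT1U", options = "COMPRESS=NONE"))
t_write_lzw <- system.time(
  writeRaster(r, f_lzw, datatype = "INT1U", options = "COMPRESS=LZW"))

# Time a full read of each file (cellStats has to touch every cell)
t_read_plain <- system.time(cellStats(raster(f_plain), "sum"))
t_read_lzw   <- system.time(cellStats(raster(f_lzw), "sum"))

# Elapsed seconds per step, and file sizes in bytes
rbind(t_write_plain, t_write_lzw, t_read_plain, t_read_lzw)[, "elapsed"]
file.size(c(f_plain, f_lzw))
```

For numbers worth reporting, one would repeat each step several times and test on rasters and disks representative of the real workload.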