Temporary Files


#1

A couple of months ago I posted about how to store data and metadata read in from standardized instrument files and reports. We’re thinking about going with the option of storing the file header data in temporary files and having an ingest_header() function to read them back in. Our current plan is for the files to go into tempdir() with a name that is a SHA-1 hash of the full file name, with any non-alphanumeric characters replaced with ‘_’. Replacing those characters should keep special characters such as ‘=’ out of the file name, and hashing should keep us under any file name length limits imposed by different OSes.
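A minimal sketch of that plan, to make the question concrete. It assumes the digest package for the SHA-1 hash; header_cache_path() and cache_header() are hypothetical helper names, and the signature of ingest_header() is a guess at this stage rather than a settled design.

```r
library(digest)  # assumption: digest is used for hashing

# Hypothetical helper: map an instrument file to its cache file in tempdir().
header_cache_path <- function(file) {
  full <- normalizePath(file, winslash = "/", mustWork = FALSE)
  # Hash the path string itself (serialize = FALSE), not the file contents.
  key <- digest(full, algo = "sha1", serialize = FALSE)
  # A hex digest is already alphanumeric; this guards against other encodings.
  key <- gsub("[^[:alnum:]]", "_", key)
  file.path(tempdir(), paste0("header_", key, ".rds"))
}

# Hypothetical writer, called by the reader once it has parsed the header.
cache_header <- function(file, header) {
  saveRDS(header, header_cache_path(file))
  invisible(header)
}

# Read the cached header back in during the same session.
ingest_header <- function(file) {
  path <- header_cache_path(file)
  if (!file.exists(path)) {
    stop("No cached header for '", file, "' in this session.")
  }
  readRDS(path)
}
```

Calling cache_header("run 01=a.csv", list(serial = "A1")) and then ingest_header("run 01=a.csv") would return the stored list for as long as the session's tempdir() exists.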

Does anyone have any advice or pitfalls about using tempdir() like this? We couldn’t think of any use cases that posed problems for the approach, but if there are some, we would rather account for them from the beginning than have to start over later.


#2

Hi Joe, good question. I’m not quite sure I fully understand the objective here, so my comments may be off the mark. Why temporary storage? Why write this to disk at all, rather than just maintain it as an R object in memory?

Usually we store on disk because we want persistent storage or because things are too big for memory. (tempdir can be helpful in the latter case, e.g. as in the raster package – but may not be done in an optimal way, as @noamross and others discussed in an earlier thread).

My intuition is that the functionality you imagine should probably be abstracted away from the particular details of the storage (cf. how the memoise package permits its storage to work in memory, on disk, or in the cloud, with a nice interface for users to expire this storage or allow it to expire automatically with timeouts).
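As a rough sketch of what I mean, assuming the memoise package: read_header() below is a made-up stand-in for your real header parser, and the "key=value" file layout is invented purely for illustration.

```r
library(memoise)

# Hypothetical parser: reads two "key=value" lines from the top of a file.
read_header <- function(file) {
  lines <- readLines(file, n = 2)
  list(
    serial   = sub("^serial=", "", lines[1]),
    firmware = sub("^firmware=", "", lines[2])
  )
}

# Disk-backed cache in a subdirectory of tempdir(); gone when the session ends.
read_header_cached <- memoise(
  read_header,
  cache = cache_filesystem(file.path(tempdir(), "header_cache"))
)

# Same function with an in-memory cache instead; only the cache argument changes.
read_header_in_memory <- memoise(read_header, cache = cache_memory())
```

The calling code is the same either way, and forget(read_header_cached) clears the cache if a user wants to expire it by hand.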

Again, sorry if this isn’t any help; that is no doubt down to my lack of understanding of the process.


#3

Since tempdir is cleaned up at the end of the session, the files don’t persist between sessions, and it seems like the data is pretty small, so as Carl says, maybe it makes more sense to just keep it in memory?


#4

@cboettig the issue we’re dealing with is bringing in what are essentially two datasets, of which usually only one will be of interest. Our imagined use case is that most users want to get at the data stored in the file, but in some cases they may want the data stored in the header of the file (e.g. serial number, firmware, calibration constants). We’d like to return the former as a data frame that can be immediately manipulated rather than as the first element of a list to be subset. We thought caching the header data in a temporary file would allow easy access for anyone who does want or need info from the header. Bringing it into memory seems to mean either creating two objects (which we have decided not to do) or creating a list of objects. I will have to look at memoise to see if it is something we could utilize.

@sckott - we’re okay with losing the data at the end of the session. We expect the user to have taken whatever they need and incorporated it into the output dataset.

I just looked at memoise, and it seems like a promising way to solve the problem, though it may end up being something we build toward rather than start with.