Alrighty then. Sorry in advance for the wall of text. I submitted a pull request with changes.
It is awesome getting to work with someone who (unlike me) knows what they’re doing. I have never dug into ‘proper’ organization of functions like this, which is really cool.
Three things:
# 1
The way you have arranged these, the call to classifications() is embedded within the function.
With the data I have in front of me, I need to get the lowest common ancestor from 471 sets of taxon ids from one study. This guarantees at least 471 calls to classification()
, presumably each of which with at least two queries being passed. This can take an appreciable amount of time.
However, there is redundancy in the taxa across the sets: only 756 taxa are unique. Thus, it is far more efficient to make a single call to classification() using only non-redundant ids, then do subsetting of that list.
This might be a fringe case that’s not worth incorporating, but I expect it to be increasingly common in the coming years, thus I think the user should be able to supply a list (output of classification
). I’m not sure where to stick that, given I’m not so familiar with the UseMethod
or '...'
syntax (I suspect it is a necessity in order to ensure generality across taxonomy databases and respective functions?). But I went for it anyway!
# 2
I incorporated my idea about forcing the function to report the name of a specified taxonomic rank shared across taxa. As submitted, this is the behavior I was looking for, but I understand if it might be best practice to keep them separate.
# 3
I am not very familiar with taxonomic databases besides NCBI. I used this line
cseq <- vapply(idsc, function(x) x[1, 1] != "unclassified sequences", logical(1))
To catch taxa whose top level classification is “unclassified sequences” – presumably sequences from bulk environmental samples. This may be a quirk specific to NCBI. Are you familiar with any such issues in other taxon databases?
I get an error when passing ID numbers as numeric:
lc_helper(c(8023, 74940), classifications)
# Error in vapply(idsc, function(x) x[1, 1] != "unclassified sequences", :
# values must be length 1,
# but FUN(X[[1]]) result is length 0
I suppose this is fine provided there is good documentation elsewhere that IDs should be passed as characters.
The same line generates an error if any of the taxa are not found in a database. I added some obscurities as tests to the bottom of the code.