Taxize: Get rank of lowest common taxon


#1

I’ve used taxize to get the classification of a set of NCBI taxon IDs, and for some subset of those, am finding the taxonomic rank (e.g. ‘family’ or ‘subclass’) of the lowest common taxon. This may be used by folks interested in knowing what the level of taxonomic resolution tends to be across the matches returned for a large number of NCBI/blast queries.

In the output of classification(), the rank name can be “no rank”, which totally makes sense, but it would be useful if there were a consistent (numeric?) level indicating the level at which that node lies. I have written a quick hacky workaround, which finds the next best level and prefixes it with “below-”. Happy to share that if useful, but I imagine someone here can cook up something more clever.

Is this functionality already built into taxize, and I missed it in the documentation? Am I going about this all wrong (e.g. finding common node first, then query for that node’s classification)?


#2

Thanks for your question @jimmyodonnell !

Could you please? Curious to know what you’re doing to solve it. Or at least describe what you do (NOTE - that this forum supports syntax highlighting, see http://superuser.com/editing-help for help).

No, not already in the package


#3

Sure thing.

In my case, I have a long(er) vector of taxon IDs, I use taxize::classification() to get the taxonomic hierarchy of each of these, and store it as a list.

I then have short(er) vectors of taxon IDs, and I want to know what is the name, rank, and taxon ID of their lowest common (taxonomic) ancestor.

I wrote this hacky, ugly, etc etc etc, function that gets the job done in a pinch

next_best_taxon <- function(x){
	paste("below-",
		tail(x[,"rank"][!duplicated(x[,"rank"])], n = 1
		), sep = "")
}

LCA <- function(taxid_vec, class_list)
{
	# This function takes a (character) vector of NCBI taxids, 
	# and a list of classification hierarchies (from taxize)
	# outputs the name, rank, and taxid of the (taxonomic) lowest common ancestor
	if(class(taxid_vec) != "character"){
		taxid_vec <- as.character(taxid_vec)
	}
	relevant_class <- class_list[taxid_vec]
	# remove unclassified sequences
	# NOTE THIS IN THE METHODS "We ignored hits belonging to 'unclassified sequences'"
	classified_sequences <- sapply(relevant_class, function(x) x[1,1] != "unclassified sequences")
	relevant_class <- relevant_class[classified_sequences]
	LCA_row <- length(Reduce(intersect, lapply(relevant_class, "[[", 1)))
	# TODO add taxonomic rank retrieval -- e.g. c("class", "family")
	LCA <- relevant_class[[1]][LCA_row,]
	if(LCA[1,"rank"] == "no rank"){
		LCA[1,"rank"] <- next_best_taxon(relevant_class[[1]][1:LCA_row,]) # as.character(LCA_row)
	}
	return(LCA)
}

Example:

some_ncbi_taxon_ids <- c("9031", "9823", "9606", "9470")

some_ncbi_classifications <- classification(some_ncbi_taxon_ids)

LCA(some_ncbi_taxon_ids[c(2,3,4)], some_ncbi_classifications)
#             name        rank      id
# 21 Boreoeutheria below-class 1437010

LCA(some_ncbi_taxon_ids[c(1,2)], some_ncbi_classifications)
#       name            rank    id
# 17 Amniota below-subphylum 32524

#4

Thanks!

Though the function next_best_taxon() is missing, can you include that?


#5

Whoops! Edited above.


#6

@jimmyodonnell did some work on this, to make it more general to some of the types of inputs. Thoughts?

next_best_taxon <- function(x){
  paste("below-",
        tail(x[, "rank"][!duplicated(x[, "rank"])], n = 1
        ), sep = "")
}

lowest_common <- function(...){
  UseMethod("lowest_common")
}

lowest_common.default <- function(ids, db = NULL, ...) {
  class_list <- classification(ids, db = db, ...)
  lc_helper(ids, class_list)
}

lowest_common.uid <- function(ids, ...) {
  class_list <- classification(ids, db = "uid",  ...)
  lc_helper(ids, class_list)
}

lowest_common.tsn <- function(ids, ...) {
  class_list <- classification(ids, db = "itis", ...)
  lc_helper(ids, class_list)
}

lowest_common.gbifid <- function(ids, ...) {
  class_list <- classification(ids, db = "gbif", ...)
  lc_helper(ids, class_list)
}

lc_helper <- function(ids, class_list) {
  idsc <- class_list[ids]
  cseq <- vapply(idsc, function(x) x[1, 1] != "unclassified sequences", logical(1))
  idsc <- idsc[cseq]
  x_row <- length(Reduce(intersect, lapply(idsc, "[[", 1)))
  x <- idsc[[1]][x_row, ]
  if (x[1, "rank"] == "no rank") {
    x[1, "rank"] <- next_best_taxon(idsc[[1]][1:x_row, ])
  }
  return(x)
}

ids <- c("9031", "9823", "9606", "9470")
lowest_common(ids[2:4], db = "ncbi")
#>             name        rank      id
#> 21 Boreoeutheria below-class 1437010

spp <- c("Sus scrofa", "Homo sapiens", "Nycticebus coucang")
lowest_common(spp, db = "ncbi")
#>             name        rank      id
#> 21 Boreoeutheria below-class 1437010

lowest_common(get_uid(spp))
#>             name        rank      id
#> 21 Boreoeutheria below-class 1437010

spp <- c("Sus scrofa", "Homo sapiens", "Nycticebus coucang")
lowest_common(spp, db = "itis")
#>        name       rank     id
#> 10 Eutheria Infraclass 179925


lowest_common(get_tsn(spp))
#>        name       rank     id
#> 10 Eutheria Infraclass 179925


gbifids <- c("2704179", "3119195")
lowest_common(gbifids, db = "gbif")
#>            name   rank id
#> 2 Magnoliophyta phylum 49

spp <- c("Poa annua", "Helianthus annuus")
lowest_common(get_gbifid(spp))
#>            name   rank id
#> 2 Magnoliophyta phylum 49

#7

see also https://github.com/ropensci/taxize/issues/505


#8

Looks awesome! I think it’s a great start. If it seems useful (or if these are easy to write), some things I was considering adding:

  1. Give a warning when “unclassified sequences” are encountered. My feeling is that people should know when data aren’t being used

  2. Accept an argument specifying a character vector of taxonomic rank names (e.g. return_rank = c("class", "order", "family")), triggering either (a) return of the name of the LCA at each of these ranks, or (B) NA or something if the LCA is above this rank. (I might have left a TODO in the original code).

I have never used the issues functionality of github; but I’m happy to give it a shot if you prefer I pull from there to make edits. Might not get to it until later this week though.


#9

Just to clarify, what does LCA mean?

Sounds good, will only apply to NCBI I assume.

hmmm, this almost seems like a different use case, that is, would maybe better fit in a different function. It seems here you want taxonomic names at a specific rank or set of ranks, which seems different from wanting the lowest common rank - does that seem right?

If you’re willing to, otherwise I can make changes as we discuss them


#10

Sorry – LCA == lowest common ancestor. Using the term in the taxonomic sense here. E.g. the lowest common ancestor of Homo sapiens and Homo erectus is Homo.

Indeed; in my case they seem linked, and I imagined it being more efficient to get them at the same time for my purposes. But I do see the argument for keeping them separate. I defer to your opinion here.

Cool; I’ll give it a shot when I get back to this later in the week. I use command line git all the time, but rarely collaboratively.


#11

Thanks for clarification.

We’ll see about whether those can be in the same function or not. not sure yet.

Note the functions is in the package here https://github.com/ropensci/taxize/blob/master/R/lowest_common.R but not exported to users yet, just internal


#12

Gotcha.

I think those options would work in the same function by extracting them during or just after Intersect() is called. But again, you have a much better sense of best practice for building generalizable tools than I do!


#13

Let me know if you’re getting to this - or if I should keep working on it. Either way you’ll get credit for your contribution :slightly_smiling:


#14

Sweet; thanks! I can make some time this afternoon or tomorrow morning, but
feel free to forge ahead if you want to get this into the “finished” pile.


#15

Go for it. If no activity by mid next week or so, I’ll get working on it.


#16

Bueno. I just got some new results, so will likely end up digging into it
today.


#17

Just a shout that I’m working on this right now.


#18

awesome, ping me if you have any questions


#19

Alrighty then. Sorry in advance for the wall of text. I submitted a pull request with changes.

It is awesome getting to work with someone who (unlike me) knows what they’re doing. I have never dug into ‘proper’ organization of functions like this, which is really cool.

Three things:

# 1
The way you have arranged these, the call to classifications() is embedded within the function.

With the data I have in front of me, I need to get the lowest common ancestor from 471 sets of taxon ids from one study. This guarantees at least 471 calls to classification(), presumably each of which with at least two queries being passed. This can take an appreciable amount of time.

However, there is redundancy in the taxa across the sets: only 756 taxa are unique. Thus, it is far more efficient to make a single call to classification() using only non-redundant ids, then do subsetting of that list.

This might be a fringe case that’s not worth incorporating, but I expect it to be increasingly common in the coming years, thus I think the user should be able to supply a list (output of classification). I’m not sure where to stick that, given I’m not so familiar with the UseMethod or '...' syntax (I suspect it is a necessity in order to ensure generality across taxonomy databases and respective functions?). But I went for it anyway!

# 2
I incorporated my idea about forcing the function to report the name of a specified taxonomic rank shared across taxa. As submitted, this is the behavior I was looking for, but I understand if it might be best practice to keep them separate.

# 3
I am not very familiar with taxonomic databases besides NCBI. I used this line

cseq <- vapply(idsc, function(x) x[1, 1] != "unclassified sequences", logical(1))

To catch taxa whose top level classification is “unclassified sequences” – presumably sequences from bulk environmental samples. This may be a quirk specific to NCBI. Are you familiar with any such issues in other taxon databases?

I get an error when passing ID numbers as numeric:

lc_helper(c(8023, 74940), classifications)
# Error in vapply(idsc, function(x) x[1, 1] != "unclassified sequences",  : 
#   values must be length 1,
#  but FUN(X[[1]]) result is length 0

I suppose this is fine provided there is good documentation elsewhere that IDs should be passed as characters.

The same line generates an error if any of the taxa are not found in a database. I added some obscurities as tests to the bottom of the code.


#20

I added a test of provided rank name, but it requires getranknames() for every call to lc_helper. This would slow things down tremendously, but I’m not sure how else to go about this. Seems like it’d be nice to report (warning(), not stop()) if the user has supplied an invalid name, but happy to also leave it up to users to check that first.