neotoma & taxize - resolve taxon names from the Neotoma Paleoecological Database

neotoma
taxonomy
taxize


I wanted to get full taxonomic resolution for Neotoma named taxa from ITIS.

library(neotoma)
library(taxize)

The big taxonomy table in Neotoma is organized hierarchically, but the hierarchy is often morphological (for the plant taxa in particular) rather than taxonomic/phylogenetic.

neotoma_taxa <- neotoma::get_table("taxa")

This gives us the full taxonomic table in Neotoma, but there are some odd taxon names in there, especially given the number of cf. taxa or undifferentiated types (identified as undiff.). I use a straightforward regular expression replacement to get rid of most of them. As with most things in Neotoma, 80% of the data is fairly straightforward, 10% can be handled with minor fixes, and the last 10% is really “special case” data. The regular expression deals with the middle 10%.

get_class <- function(x) {
  
  # This just clears up some of the "uncertainty" fields in the taxon names.
  # This doesn't catch things like "sensu stricto" and others.
  taxa <- gsub("(\\?|\\-type|cf\\.\\s|aff\\.|\\sundiff\\.)", "", x, perl=TRUE)
  
  taxize::classification(taxa, db="itis", rows = 1)
  
}
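To see what the replacement does and does not catch, here is a small sketch that applies the same pattern to a few taxon names. The names are made up for illustration and are not pulled from the Neotoma table.

```r
# Same pattern as in get_class(); the names below are illustrative only.
pattern <- "(\\?|\\-type|cf\\.\\s|aff\\.|\\sundiff\\.)"

examples <- c("cf. Picea", "Pinus undiff.", "Ambrosia-type",
              "Quercus?", "Carex sensu stricto")

# Strip the uncertainty markers; anything not in the pattern is left alone.
cleaned <- gsub(pattern, "", examples, perl = TRUE)
print(cleaned)
```

The first four names reduce to the bare genus, while "sensu stricto" passes through unchanged, which matches the caveat in the function's comment.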

Then this is all looped:

all_taxa_list <- list()

for (i in seq_len(nrow(neotoma_taxa))) {
  
  cat(paste0(neotoma_taxa$TaxaGroupID[i], ": ", neotoma_taxa$TaxonName[i], 
             ' - ', round(i/nrow(neotoma_taxa) * 100, 2), '% complete . . . '))
             
  all_taxa_list[[i]] <- list(neotoma_taxa[i,],
                             suppressMessages(get_class(neotoma_taxa$TaxonName[i])))
  
  if (!is.na(all_taxa_list[[i]][[2]])) {
    cat('Success!\n')
  } else {
    cat('Ugh. :(\n')
  }
  
  # Save after every iteration. It's slower, but a crash part-way through doesn't lose the results so far.
  saveRDS(all_taxa_list, file = "all_taxa.RDS")
}

Suggestions are appreciated, but it works fine :)

It’s looped, rather than passing the whole vector of names to classification, because there are so many taxa. I didn’t want a time-out to crash things unrecoverably.
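One possible way to harden that idea, sketched with base R only: wrap each lookup in tryCatch so a single failure is recorded as NA instead of stopping the run, and reload the saved file on restart so finished entries are skipped. Here lookup_one, taxon_names and checkpoint are hypothetical stand-ins (lookup_one plays the role of get_class and fails on purpose for one name); this is not the code I actually ran.

```r
# Hypothetical stand-in for get_class(); it fails on one name to mimic a time-out.
lookup_one <- function(name) {
  if (name == "bad") stop("simulated time-out")
  toupper(name)
}

taxon_names <- c("picea", "bad", "pinus")   # illustrative input
checkpoint  <- tempfile(fileext = ".RDS")   # illustrative checkpoint path

# Resume from the checkpoint if one exists, otherwise start fresh.
results <- if (file.exists(checkpoint)) readRDS(checkpoint) else list()

for (i in seq_along(taxon_names)) {
  # Skip entries that were already filled in by an earlier run.
  if (i <= length(results) && !is.null(results[[i]])) next

  # A failed lookup becomes NA rather than an error that ends the loop.
  results[[i]] <- tryCatch(lookup_one(taxon_names[i]),
                           error = function(e) NA)

  # Save after every iteration so a crash loses at most one lookup.
  saveRDS(results, file = checkpoint)
}
```

In this sketch, failed entries are stored as NA and count as done, so they are not retried on restart; retrying them would need an extra check.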