Taxonomic databases from R

taxize is a taxonomic toolbelt we work on - it gets data from open web APIs and does a variety of tasks

however, some use cases really want all the data!

web APIs are great, but they aren’t great when you want all the data, or just more data than the API can deliver in the time you want it. This is a relative thing - some APIs are quite fast, and some are slow. The slower the API, the sooner you hit the "this is way too slow, get me out of here" point, and you really just want a database dump

So taxizedb

  • covers: ITIS, The Plant List, COL (to be covered: NCBI, maybe others)
  • downloads SQL databases
  • loads SQL DBs into MySQL or PostgreSQL
  • creates src objects that you can plug into dplyr for easy database queries/manipulation
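
To give a rough feel for that last bullet, here is a minimal sketch of querying a local SQL database through dplyr. It deliberately does not use taxizedb's own functions: it uses plain DBI and dplyr, swaps in an in-memory SQLite database (via RSQLite) as a stand-in for MySQL or PostgreSQL, and the `taxa` table with its columns and values is made up for illustration.

```r
library(DBI)
library(dplyr)  # translating verbs to SQL also needs the dbplyr package installed

# Stand-in for a loaded taxonomic database: in-memory SQLite (an assumption,
# used here only so the sketch runs without a database server)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy table; names and values are invented for illustration
taxa <- data.frame(
  id = 1:3,
  name = c("Genus alpha", "Genus beta", "Genus"),
  taxon_rank = c("species", "species", "genus"),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "taxa", taxa)

# tbl() returns a lazy table reference; the dplyr verbs are translated to SQL
# and nothing is pulled into R until collect()
res <- tbl(con, "taxa") %>%
  filter(taxon_rank == "species") %>%
  select(id, name) %>%
  arrange(id) %>%
  collect()

dbDisconnect(con)
```

The point of the pattern is that the query runs inside the database, so only the rows you ask for ever come into R.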

Let me know what you think. I’m sure there will be

A bit of history: We’ve been trying to integrate locally stored SQL DBs into taxize for a while now, see https://github.com/ropensci/taxize/issues?q=label%3Asql+is%3Aclosed - but it just seems a bit intractable given the complexity of the package already, the SQL dependency packages that would come on top of that, and the fact that we could replicate the web API calls with very few of the DBs

p.s., maybe a better name is appropriate

Hi, I keep trying to download records species by species using occ_data. I am still running into some difficulties that might be of general interest. Basically, I often get data not for the species I am asking for, but for a different one. For example,

library(rgbif)
# `limit` and `cols` are assumed to be defined earlier in the script
key <- name_backbone(name = "Xantusia arizonae", kingdom = "animals")$speciesKey
if (!is.null(key)) {
  r <- occ_data(taxonKey = key, hasCoordinate = TRUE,
                limit = limit, basisOfRecord = "PRESERVED_SPECIMEN")
  FF <- as.matrix(r$data[, cols])
}

This gives data for Xantusia vigilis. The two have been treated as synonyms in the past, but it seems clear that they have not been for the last 15 years.

http://reptile-database.reptarium.cz/species?genus=Xantusia&species=arizonae

A similar thing happened for 54 of the 178 species I requested. That makes the data repository a bit harder to use, because you often have per-species data that you want to associate with the distribution of the requested species. It also raises doubts about the validity of the downloaded records (am I getting records for those species, or for names formerly considered synonyms?). Is there any workaround or initiative to mitigate this problem?

Cheers and thanks for your helpful work!
Agus

can you please put this in a new issue here: https://github.com/ropensci/rgbif/issues? thanks!
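
In the meantime, one way to at least spot these cases is to compare the name you asked for with the name the backbone match resolved to before downloading anything. The sketch below is only that, a sketch: `check_backbone_match` is a hypothetical helper, not part of rgbif, and it assumes the match object exposes `species` and `status` fields. The `mock` list is hand-written to mirror the case described above and is not real API output.

```r
# Hypothetical helper: compare a requested name with a backbone match.
# `match` is assumed to be a list (or one-row data frame) with the fields
# `species` and `status`.
check_backbone_match <- function(requested, match) {
  matched <- match$species
  if (is.null(matched) || is.na(matched)) {
    # No species-level match at all
    return(list(requested = requested, matched = NA_character_,
                status = NA_character_, same_name = FALSE))
  }
  status <- if (is.null(match$status)) NA_character_ else match$status
  list(requested = requested, matched = matched, status = status,
       same_name = identical(tolower(matched), tolower(requested)))
}

# Mock match object (invented values, including the key)
mock <- list(species = "Xantusia vigilis", status = "SYNONYM", speciesKey = 1L)

chk <- check_backbone_match("Xantusia arizonae", mock)
if (!chk$same_name) {
  message("Requested ", chk$requested, " but matched ", chk$matched,
          " (status: ", chk$status, ")")
}
```

With rgbif you would pass the result of `name_backbone(name = "Xantusia arizonae")` as `match`, assuming it carries those two fields, and then skip or flag any species where `same_name` is FALSE. This only detects the mismatch; it does not resolve which taxonomy is right.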