traits::ncbi_byname to DNAbin objects

sckott · September 21, 2017, 11:28pm

A user asked via email how to take output of traits::ncbi_byname() and make DNAbin format objects.

DNAbin is from the ape package - there’s many things you can do with an object of class DNAbin once you have it from what I understand.

We have 3 functions in the traits package to search for NCBI sequence data

ncbi_byname() - Retrieve gene sequences from NCBI by taxon name and gene names
ncbi_byid() - Retrieve gene sequences from NCBI by accession number
ncbi_searcher() - Search for gene sequences from NCBI by taxon names, taxonomic IDs, and more

Using the ncbi_byname() function, we can do:

library(traits)
library(ape)
species <- c("Colletes similis","Halictus ligatus","Perdita californica")
out <- ncbi_byname(taxa=species, gene = c("coi", "co1"), seqrange = "1:2000")
bins <- lapply(out, function(z) {
  mat <- t(as.matrix(strsplit(tolower(z$sequence), "")[[1]]))
  rownames(mat) <- z$taxon
  as.DNAbin(mat)
})

Look at one of them

bins[[1]]
#> 1 DNA sequence in binary format stored in a matrix.
#> 
#> Sequence length: 658 
#> 
#> Label:
#> Colletes similis
#> 
#> Base composition:
#>     a     c     g     t 
#> 0.323 0.113 0.108 0.457

To write to a file you could do:

write.dna(bins[[1]], (f <- tempfile()))

The above makes a separate DNAbin object for each sequence. From what I understand i think we’d have to have the sequences all the same length to make a single DNAbin object with all sequences, but am I wrong on that?

Better ways to do this?

dwinter · September 22, 2017, 12:03am

This looks good to me. One small point, you can make DNAbin object where the sequences having different lengths. Using the code example about you can create a list of matrices:

seq_list <- lapply(out, function(z) {
  mat <- t(as.matrix(strsplit(tolower(z$sequence), "")[[1]]))
  rownames(mat) <- z$taxon
  mat
})
as.DNAbin(seq_list)

3 DNA sequences in binary format stored in a list.

Mean sequence length: 973.667 
   Shortest sequence: 658 
    Longest sequence: 1478 

Labels:  

Base composition:
    a     c     g     t 
0.348 0.121 0.103 0.428

Functions that expect an alignment as input will either fail or behave strangely with this sort of DNAbin. But other functins work fine. For instance, it’s possible to write these records to file (as long as you specify the fasta format)

write.dna(as.DNAbin(seq_list), tempfile(), format="fasta")

sckott · September 22, 2017, 12:20am

Ah perfect, thanks David!! This is great.

Topic	Replies	Views
Traits v0.1 now on CRAN Package Use Questions	1055	June 19, 2015
Combining occurrence data from rgbif::occ_data/occ_search UseCases r , rgbif	1379	April 12, 2019
A rentrez paper, and how to use the NCBI's new API keys Blog r , literature , community , ncbi , data-access	696	March 20, 2018
Checklist Recipe - How we created a template to standardize species data Blog r , rgbif , gbif	761	November 20, 2018
Taxize v0.6 is on CRAN Package Use Questions	1288	June 19, 2015

traits::ncbi_byname to DNAbin objects

Related topics