traits::ncbi_byname to DNAbin objects

sequences
ape
ncbi
r
Tags: #<Tag:0x00007f57f6093580> #<Tag:0x00007f57f6093440> #<Tag:0x00007f57f6093300> #<Tag:0x00007f57f6093058>

#1

A user asked via email how to take output of traits::ncbi_byname() and make DNAbin format objects.

DNAbin is from the ape package - there’s many things you can do with an object of class DNAbin once you have it from what I understand.

We have 3 functions in the traits package to search for NCBI sequence data

  • ncbi_byname() - Retrieve gene sequences from NCBI by taxon name and gene names
  • ncbi_byid() - Retrieve gene sequences from NCBI by accession number
  • ncbi_searcher() - Search for gene sequences from NCBI by taxon names, taxonomic IDs, and more

Using the ncbi_byname() function, we can do:

library(traits)
library(ape)
species <- c("Colletes similis","Halictus ligatus","Perdita californica")
out <- ncbi_byname(taxa=species, gene = c("coi", "co1"), seqrange = "1:2000")
bins <- lapply(out, function(z) {
  mat <- t(as.matrix(strsplit(tolower(z$sequence), "")[[1]]))
  rownames(mat) <- z$taxon
  as.DNAbin(mat)
})

Look at one of them

bins[[1]]
#> 1 DNA sequence in binary format stored in a matrix.
#> 
#> Sequence length: 658 
#> 
#> Label:
#> Colletes similis
#> 
#> Base composition:
#>     a     c     g     t 
#> 0.323 0.113 0.108 0.457

To write to a file you could do:

write.dna(bins[[1]], (f <- tempfile()))

The above makes a separate DNAbin object for each sequence. From what I understand i think we’d have to have the sequences all the same length to make a single DNAbin object with all sequences, but am I wrong on that?

Better ways to do this?


#2

This looks good to me. One small point, you can make DNAbin object where the sequences having different lengths. Using the code example about you can create a list of matrices:

seq_list <- lapply(out, function(z) {
  mat <- t(as.matrix(strsplit(tolower(z$sequence), "")[[1]]))
  rownames(mat) <- z$taxon
  mat
})
as.DNAbin(seq_list)
3 DNA sequences in binary format stored in a list.

Mean sequence length: 973.667 
   Shortest sequence: 658 
    Longest sequence: 1478 

Labels:  

Base composition:
    a     c     g     t 
0.348 0.121 0.103 0.428 

Functions that expect an alignment as input will either fail or behave strangely with this sort of DNAbin. But other functins work fine. For instance, it’s possible to write these records to file (as long as you specify the fasta format)

write.dna(as.DNAbin(seq_list), tempfile(), format="fasta")

#3

Ah perfect, thanks David!! This is great.