Rentrez - problem using the web_history object

This is probably a question for an NCBI E-utils person rather than the Rentrez wrapper writer or this community, but I’m struggling to find enough info.

I’m trying to assemble a set of metadata for sequences deposited in Nuccore with Feature:Source /country=New Zealand, starting with the big list of IDs returned using a general query for ‘New Zealand’ (as it doesn’t seem possible to query the feature table directly). For that I need use_history=TRUE and step through the large set of results using the list stored on the history server. However, if I subsequently use the web_history object I don’t get the expected return data set.
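For anyone following along, here is a minimal sketch of what stepping through a history-server result set might look like. The helper names `chunk_starts` and `fetch_in_chunks`, the chunk size of 200 and the default `rettype` are assumptions made for illustration; only `entrez_fetch()` and the `count` and `web_history` elements of the search result come from rentrez.

```r
# Zero-based retstart offsets needed to page through `total` records
# in chunks of `chunk_size`. Hypothetical helper, not part of rentrez.
chunk_starts <- function(total, chunk_size) {
  if (total <= 0) return(integer(0))
  seq(0, total - 1, by = chunk_size)
}

# Fetch every record behind a history-server search, one chunk at a time.
# `search` is the object returned by entrez_search(..., use_history = TRUE).
# Hypothetical helper; needs the rentrez package and network access to run.
fetch_in_chunks <- function(search, db = "nuccore", chunk_size = 200,
                            rettype = "gbc") {
  starts <- chunk_starts(search$count, chunk_size)
  lapply(starts, function(start) {
    rentrez::entrez_fetch(db = db, web_history = search$web_history,
                          rettype = rettype,
                          retstart = start, retmax = chunk_size)
  })
}
```

Calling `fetch_in_chunks(seq_NZ)` on a search object like the one further down should return a list with one character string per chunk.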

To simplify, here is an E-utils web query that returns a single nuccore record:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore&term=new+zealand+AND+ddbj_embl_genbank[filter]+AND+gerhardtia+pseudosaponacea[Organism]
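A query URL like the one above can also be assembled in R, which helps when testing variations of the search term outside of Rentrez. This is only a sketch: `esearch_url` is a hypothetical helper, and it percent-encodes spaces and brackets instead of using `+`, which E-utils should treat the same way.

```r
# Build an ESearch URL from a database name and a plain-text query.
# `esearch_url` is a hypothetical helper, not part of rentrez.
esearch_url <- function(db, term, use_history = FALSE) {
  base <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
  # reserved = TRUE percent-encodes spaces, brackets and other reserved characters
  url <- paste0(base, "?db=", db,
                "&term=", utils::URLencode(term, reserved = TRUE))
  # usehistory=y asks E-utils to store the result set on the history server
  if (use_history) url <- paste0(url, "&usehistory=y")
  url
}

query <- "new zealand AND ddbj_embl_genbank[filter] AND gerhardtia pseudosaponacea[Organism]"
cat(esearch_url("nuccore", query), "\n")
```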

If I use Rentrez and the history server then something like …

seq_NZ <- entrez_search(
  db = "nuccore",
  term = "new+zealand AND ddbj_embl_genbank[filter] AND gerhardtia+pseudosaponacea[Organism]",
  retmax = 0,
  use_history = TRUE
)

Then if I use the web history object to fetch the record…

seqrecs <- entrez_fetch(db = "nuccore", web_history = seq_NZ$web_history, rettype = "xml", retmax = 1, parsed = TRUE)

I get back what looks like the associated popset of records and not the single record (for the single queried ID) I was expecting. What silly simple thing am I doing wrong?

(Hope you don’t mind, I did a small edit to put code in code blocks. See Welcome to rOpenSci Discuss for some discussion of markdown)

paging @dwinter

Kia ora JC!

This does seem like something happening on the NCBI’s end. Playing around a little, it also affects “normal” (that is, non-web-history) queries. I don’t think sending one ID and getting back 8 records is expected behaviour, so it’s worth letting them know about it.

Depending on what you want to get from the records, I can suggest at least one workaround for now. If you download records in “gbc” format, which is an XML-ification of the GenBank format, you do get one record per ID. You can’t return a parsed object (for now; I’ll start an issue for this and other cases where XML records are not called XML), but it’s easy enough to create one and retrieve information from it:

seqrecs_gb_xml <- entrez_fetch(db = "nuccore", web_history = seq_NZ$web_history, rettype = "gbc")
parsed <- XML::xmlTreeParse(seqrecs_gb_xml, useInternalNodes=TRUE)
parsed["//INSDSeq_taxonomy"]
[[1]]
<INSDSeq_taxonomy>Eukaryota; Fungi; Dikarya; Basidiomycota; Agaricomycotina; Agaricomycetes; Agaricomycetidae; Agaricales; Lyophyllaceae; Gerhardtia</INSDSeq_taxonomy> 

attr(,"class")
[1] "XMLNodeSet"
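Building on the same idea, here is a rough sketch of pulling the accession and a source-feature qualifier out of every record at once, which is closer to the original goal of collecting /country values. `source_qualifier` is a hypothetical helper, and the element names assume the INSDSeq XML layout that “gbc” returns; the qualifier name is a parameter because it is taken from the question and records may label it differently.

```r
library(XML)

# Return one row per INSDSeq record: its primary accession and the value of
# one qualifier on the "source" feature (NA when the qualifier is absent).
# Hypothetical helper, not part of rentrez or XML.
source_qualifier <- function(xml_text, qualifier = "country") {
  doc <- xmlParse(xml_text, asText = TRUE)
  records <- getNodeSet(doc, "//INSDSeq")
  # XPath relative to a single record, restricted to the source feature
  qual_xpath <- sprintf(
    ".//INSDFeature[INSDFeature_key='source']//INSDQualifier[INSDQualifier_name='%s']/INSDQualifier_value",
    qualifier
  )
  rows <- lapply(records, function(rec) {
    acc <- xpathSApply(rec, "./INSDSeq_primary-accession", xmlValue)
    val <- xpathSApply(rec, qual_xpath, xmlValue)
    data.frame(
      accession = if (length(acc)) acc[[1]] else NA_character_,
      value     = if (length(val)) val[[1]] else NA_character_,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}
```

With the object from above, `source_qualifier(seqrecs_gb_xml)` should give a data frame that can then be filtered on the `value` column.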

Hope that helps, and let me know if you have other questions.

David

EDIT to add a link to the issue regarding parsing differently named XML files on the fly: Rentrez - problem using the web_history object

Great response. Thank you. I will have a play!
