TL;DR: “Chunking” a vector can make large processing jobs more manageable or, as in the example below, serve as a workaround for API query limits
E-Utilities provides an interface (API) for accessing NCBI databases such as PubMed, GenBank, etc. There are a variety of ways (clients) to leverage this service. Since I typically develop in R, I’ve been using the rentrez package.
The API limits are typically pretty forgiving, and there are even methods to store a “web history” for search results so you don’t have to query again to retrieve them.
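For example, a minimal sketch of that web-history approach with rentrez might look something like the following (for very large result sets you may still need to page through the results):
library(rentrez)
# ask NCBI to keep the search results server-side (a "web history")
results <- entrez_search("pubmed", term = "zika", use_history = TRUE)
# request summaries using the stored history instead of sending every ID
esummaries <- entrez_summary("pubmed", web_history = results$web_history)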
That said, I have encountered an issue with trying to harvest summary XML files for a list of > 1000 records.
The workflow for retrieving these “esummary” documents is:
- Retrieve a list of IDs for your search term(s) with entrez_search()
- Use that list of IDs to retrieve summaries for each record with entrez_summary()
- Extract field(s) of interest from each summary with extract_from_esummary()
The process above is detailed in the code below:
library(rentrez)
# search pubmed for articles on zika
results <- entrez_search("pubmed", term = "zika", retmax = 9999)
# retrieve an esummary document for each ID returned by the search
esummaries <- entrez_summary("pubmed", id = results$ids)
# pull the full journal name out of each summary
journals <- extract_from_esummary(esummaries, elements = "fulljournalname")
Unfortunately, for a query that returns a large number of results (e.g. “zika”), the entrez_summary()
command will fail with something like:
Error in entrez_check(response) : HTTP failure 414, the request is too large. For large requests, try using web history as described in the tutorial
Now for the “chunking” …
The key here is to partition the single large request into a series of smaller requests so the database can handle them one at a time.
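To make the chunking step concrete, here is a toy illustration of base R’s split(): it partitions the index positions of a 12-element vector into groups of at most 5 (the real code below uses the same idea with groups of 500):
ids <- 1:12
# split the index positions into chunks of at most 5
chunks <- split(seq_along(ids), ceiling(seq_along(ids) / 5))
chunks
# $`1`
# [1] 1 2 3 4 5
# $`2`
# [1]  6  7  8  9 10
# $`3`
# [1] 11 12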
The code that follows implements a technique for splitting the vector of results (IDs) into groups of 500, and is based on a Stack Overflow answer to a similar question.
library(rentrez)
# search for articles on zika in pubmed
results <- entrez_search("pubmed", term = "zika", retmax = 9999)
# create an index that splits the ID positions into groups of 500
bigqueryindex <-
  split(seq_along(results$ids), ceiling(seq_along(results$ids) / 500))
# create an empty list to hold summary contents
esummaries <- list()
# loop through the list of ids 500 at a time and pause for 5 seconds in between queries
for (i in bigqueryindex) {
  esummaries[i] <- entrez_summary("pubmed", id = results$ids[i])
  Sys.sleep(5)
}
# apply over the list to extract the fulljournalname field from each esummary
journals <- sapply(esummaries, function(x) extract_from_esummary(x, elements = "fulljournalname"))
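As a quick sanity check on the result, you could tabulate the extracted journal names to see which journals show up most often:
# count how many zika records come from each journal and show the top few
head(sort(table(journals), decreasing = TRUE))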