Chunking in R

TL;DR: “Chunking” a vector can make processing easier or, as in the example below, serve as a workaround for API query limits.

E-Utilities provides an interface (API) for accessing NCBI databases such as PubMed and GenBank. There are a variety of ways (clients) to use this service. Since I typically develop in R, I’ve been using the rentrez package.

The API limits are typically pretty forgiving, and there are even methods to store a “web history” for search results so you don’t have to query again to retrieve them.

That said, I have run into a problem when trying to harvest summary XML files for a list of more than 1,000 records.

The workflow for retrieving these “esummary” documents is:

  1. Retrieve a list of IDs for your search term(s) with entrez_search()
  2. Use that list of IDs to retrieve summaries for each record with entrez_summary()
  3. Extract field(s) of interest from each summary with extract_from_esummary()

The process above is detailed in the code below:


library(rentrez)

results <- entrez_search("pubmed", term = "zika", retmax = 9999)

esummaries <- entrez_summary("pubmed", id = results$ids)

journals <- extract_from_esummary(esummaries, elements = "fulljournalname")

Unfortunately, for a query that returns a large number of results (e.g. “zika”), the entrez_summary() call will fail with something like:

Error in entrez_check(response) : HTTP failure 414, the request is too large. For large requests, try using web history as described in the tutorial
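
The error message points to web history as one way out. Before turning to chunking, here is a minimal sketch of that route. It assumes use_history = TRUE on the search and pages through the stored result set with retmax and retstart. The page size of 500 and the one-second pause are my own assumptions, not documented limits, and running it needs network access to NCBI.

```r
library(rentrez)

# ask NCBI to keep the result set on its server instead of sending every ID back
results <- entrez_search("pubmed", term = "zika", use_history = TRUE)

# assumed page size; E-utilities offsets (retstart) are zero-based
page_size <- 500
starts <- seq(0, results$count - 1, by = page_size)

# fetch the summaries one page at a time from the stored result set
pages <- lapply(starts, function(s) {
  Sys.sleep(1)  # assumed courtesy pause between requests
  entrez_summary("pubmed", web_history = results$web_history,
                 retmax = page_size, retstart = s)
})
```

Each element of pages holds the summaries for one page, so any field extraction would be applied page by page.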

Now for the “chunking” …

The key here is to partition the single request into a series of smaller requests so the database can handle them one at a time.

The code that follows implements a technique for splitting the vector of results (IDs) into groups of 500, and is based on a Stack Overflow answer to a similar question.
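
To see the splitting idea on its own, here is a small sketch that needs no API access. The ids vector is a made-up stand-in for results$ids, and chunk_size is a name I introduce for illustration.

```r
# made-up stand-in for results$ids: 1,234 fake IDs
ids <- as.character(seq_len(1234))
chunk_size <- 500

# positions 1..500 map to group 1, 501..1000 to group 2, and so on
group <- ceiling(seq_along(ids) / chunk_size)

# split() returns a list with one character vector per group
chunks <- split(ids, group)

lengths(chunks)  # 500, 500 and 234 IDs in groups 1, 2 and 3
```

The last group simply holds whatever is left over, so the vector length does not have to be a multiple of the chunk size.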


library(rentrez)

# search for articles on zika in pubmed
results <- entrez_search("pubmed", term = "zika", retmax = 9999)

# create an index that splits the positions of the ids into groups of 500
bigqueryindex <- 
    split(seq_along(results$ids), ceiling(seq_along(results$ids)/500))

# create an empty list to hold summary contents
esummaries <- list()

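# loop through the list of ids 500 at a time and pause for 5 seconds in between queries
for (i in bigqueryindex) {
    # always_return_list = TRUE keeps the result a list even if a chunk holds a single id
    esummaries[unlist(i)] <- entrez_summary("pubmed", id = results$ids[unlist(i)],
                                            always_return_list = TRUE)
    Sys.sleep(5)
}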

# apply over the list to extract the fulljournalname field from the esummary
sapply(esummaries, function(x) extract_from_esummary(x, elements = "fulljournalname"))
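
The same pattern can be wrapped in a small reusable helper. This is only a sketch: chunk_apply() is a hypothetical function I made up for illustration, not part of rentrez, and the example call uses sum() on plain numbers so it runs without network access.

```r
# hypothetical helper: apply f to x in chunks of `size`, pausing between calls
chunk_apply <- function(x, size, f, pause = 0) {
  groups <- split(x, ceiling(seq_along(x) / size))
  out <- vector("list", length(groups))
  for (k in seq_along(groups)) {
    out[[k]] <- f(groups[[k]])
    # skip the pause after the final chunk
    if (pause > 0 && k < length(groups)) Sys.sleep(pause)
  }
  out
}

# toy usage: sum the numbers 1 to 10 in chunks of 4
totals <- chunk_apply(1:10, size = 4, f = sum)
unlist(totals)  # 10 26 19
```

With rentrez, f could be something like function(ids) entrez_summary("pubmed", id = ids) and pause could be set to 5 to mirror the loop above.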
