I am a novice at web scraping with R and I am stuck on this problem: I want to use R to submit a search query to PubMed, then download a CSV file from the results page. The CSV file can be accessed by clicking 'Send to', which opens a dropdown menu; from there I need to select the 'File' radio button, change the 'Format' option to 'CSV' (option 6), and finally click the 'Create File' button to start the download.

A few notes:
1. Yes, this type of remote search and download complies with NCBI's policies.
2. Why don't you use the easyPubMed package? I have already tried this and am using it for another portion of my work. However, using this package to retrieve search results misses some of the article metadata that the CSV download includes.

I have viewed these related issues: Download csv file from webpage after submitting form from dropdown using rvest package in R, R Download .csv file tied to input boxes and a "click" button, Using R to "click" a download file button on a webpage.

I feel that the previous solutions provided by @hrbrmstr contain the answer, but I just can't put the pieces together to download the CSV file.

I think the elegant solution to this problem is a two-step process: 1) POST a search request to PubMed and GET the results, and 2) submit a second POST request to the results page (or somehow navigate within it) with the desired options selected, in order to download the CSV file. I have tried the following with a toy search query ("hello world", with quotes, which currently returns 6 results)...

library(httr)
library(rvest) # rvest re-exports the magrittr pipe (%>%)

query <- '"hello world"'
url <- 'https://www.ncbi.nlm.nih.gov/pubmed/'

# inspect the search form -- the query is entered via the 'term' field
html_form(html_session(url))

# post the search and retrieve the results page
session <- POST(url, body = list(term = query), encode = 'form')

# scrape results to check that above worked
content(session) %>% html_nodes('#maincontent > div > div:nth-child(5)') %>% 
  html_text()
content(session) %>% html_nodes('#maincontent > div > div:nth-child(5)') %>% 
  html_nodes('p') %>% html_text()

# view html nodes of dropdown menu -- how to 'click' these via R?
content(session) %>% html_nodes('#sendto > a')
content(session) %>% html_nodes('#send_to_menu > fieldset > ul > li:nth-child(1) > label')
content(session) %>% html_nodes('#file_format')
content(session) %>% html_nodes('#submenu_File > button')

# submit request to download CSV file
POST(session$url, # I know this doesn't work, but I would hope something similar is possible
     encode='form',
     body=list('EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.SendTo'='File',
               'EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.FFormat'=6,
               'EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.SendToSubmit'=1),
     write_disk('results.csv'))

The last call above fails: a file named results.csv is written, but it contains the HTML of the results page returned by the POST request rather than the CSV data. How should I edit that last call to get the desired CSV file?

A possible hack is to skip straight to the results page. In other words, I know that submitting the "hello world" search returns the following URL: https://www.ncbi.nlm.nih.gov/pubmed/?term=%22hello+world%22. So I can extrapolate from here and build the results URLs from my search queries, if necessary.
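For example, here is a minimal sketch of building such a URL from the query string with base R's URLencode (the spaces come out as %20 rather than the + shown in the browser-generated URL):

# build a results URL directly from the search query defined above
results_url <- paste0('https://www.ncbi.nlm.nih.gov/pubmed/?term=',
                      URLencode(query, reserved = TRUE))
results_url
# [1] "https://www.ncbi.nlm.nih.gov/pubmed/?term=%22hello%20world%22"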

I have tried inserting this URL into the line above, but it still doesn't return the desired CSV file. I can view the form fields using the command below...

# view form options on the results page
html_form(html_session('https://www.ncbi.nlm.nih.gov/pubmed/?term=%22hello+world%22'))

Or, can I expand the URL knowing the form options above? Something like...

url2 <- 'https://www.ncbi.nlm.nih.gov/pubmed/?term=%22hello+world%22&EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.SendTo=File&EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.FFormat=6&EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.SendToSubmit=1'
POST(url2,write_disk('results2.csv'))

I expect to download a CSV file with 6 results containing article metadata; however, I am getting the HTML of the results page instead.

Any help is greatly appreciated! Thank you.

kstew
  • I think using the `middlechild` pkg and mitmproxy might be the answer to finding how to post the right request, as per the [YT video](https://www.youtube.com/watch?v=thr0vFRtK5g) suggested by @hrbrmstr. But I am wary of using mitmproxy for this -- any suggestions or comments on using this? – kstew Jul 16 '19 at 03:12
  • using `RSelenium` could answer the need to click on the dropdown and click the radiobuttons maybe – denis Aug 13 '19 at 21:24
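Following up on denis's suggestion, below is a rough, untested sketch of what driving the (old) PubMed results page with RSelenium might look like. The CSS selectors are the ones from the question above; the Firefox setup and the assumption that the CSV choice is the option with value 6 (taken from FFormat=6) would need to be verified locally:

library(RSelenium)

# start a Selenium server plus browser (assumes a local Firefox install)
rd <- rsDriver(browser = "firefox")
remDr <- rd$client

# open the results page for the "hello world" query
remDr$navigate("https://www.ncbi.nlm.nih.gov/pubmed/?term=%22hello+world%22")

# open the 'Send to' dropdown, choose 'File', pick the CSV format, and create the file
remDr$findElement("css selector", "#sendto > a")$clickElement()
remDr$findElement("css selector",
                  "#send_to_menu > fieldset > ul > li:nth-child(1) > label")$clickElement()
remDr$findElement("css selector", "#file_format > option[value='6']")$clickElement() # CSV; value assumed from FFormat=6
remDr$findElement("css selector", "#submenu_File > button")$clickElement()

# note: the CSV lands in the browser's download directory, not in the R session
remDr$close()
rd$server$stop()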

2 Answers


If I reframe your question as: "I want to use R to submit a search query to PubMed and then download the same information that the CSV download option on the results page provides."

Then, I think you can skip the scraping and web UI automation and go directly to the API (NCBI's E-utilities) that NIH has provided for this purpose.

The first portion of this R code conducts the same search ("hello world") and gets the same results in JSON format (feel free to paste the search_url link in a browser to verify).

library(httr)
library(jsonlite)
library(tidyverse)

# Search for "hello world"
search_url <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=%22hello+world%22&format=json"

# Search for results
search_result <- GET(search_url)

# Extract the content
search_content <- content(search_result, 
                          type = "application/json",
                          simplifyVector = TRUE)

# search_content$esearchresult$idlist
# [1] "29725961" "28103545" "27567633" "25955529" "22999052" "19674957"

# Get a vector of the search result IDs
result_ids <- search_content$esearchresult$idlist

# Get a summary for id 29725961 (the first one).
summary_url <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&version=2.0&id=29725961&format=json"

summary_result <- GET(summary_url)

# Extract the content
summary_content <- content(summary_result, 
                          type = "application/json")

Presumably, you could take it from here since the list summary_content has the information you need, just in a different format (I verified by visual inspection).
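For example, a few fields can be pulled straight out of the nested list (the field names below are inferred from the same JSON structure that the tidying code further down relies on):

# peek at some fields for the first result
summary_content$result[["29725961"]]$title
summary_content$result[["29725961"]]$uid
sapply(summary_content$result[["29725961"]]$authors, `[[`, "name")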

However, in an attempt to comply with the spirit of your original question (gimme a CSV, using R, by pulling from NCBI), here are some of the steps you could use to reproduce the exact same CSV that you could get from the PubMed Web UI for humans.

# Quickie cleanup (thanks to Tony ElHabr)
# https://www.r-bloggers.com/converting-nested-json-to-a-tidy-data-frame-with-r/
summary_untidy <- enframe(unlist(summary_content))

# Get rid of *some* of the fluff...
summary_tidy <- summary_untidy %>% 
  filter(grepl("result.29725961", name)) %>% 
  mutate(name = sub("result.29725961.", "", name))

# Convert the multiple author records into a single comma-separated string.
authors <- summary_tidy %>% 
  filter(grepl("^authors.name$", name)) %>% 
  summarize(pasted = paste(value, collapse = ", "))

# Begin to construct a data frame that has the same information as the downloadable CSV
summary_csv <- tibble(
  Title = summary_tidy %>% filter(name == "title") %>% pull(value),
  URL = sprintf("/pubmed/%s", summary_tidy %>% filter(name == "uid") %>% pull(value)),
  Description = pull(authors, pasted),
  Details = "... and so on, and so on, and so on... "
)

# Write the sample data frame to a csv.
write_csv(summary_csv, path = "just_like_the_search_page_csv.csv")

I was not familiar with the easyPubMed package that you mentioned, but I was inspired to use the NCBI API by digging through the easyPubMed code. It is entirely possible that you could fix/adapt some of the easyPubMed code to pull the additional metadata that you're hoping to get from pulling a bunch of CSVs. (There isn't a lot there. It is only 500 lines of code that define 8 functions.)

Heck, if you do manage to adapt the easyPubMed code to extract the additional metadata, I'd recommend giving your changes back to the authors so they can improve their package!

D. Woods
  • Hi @D. Woods, thanks for your thoughtful answer. I had not considered this approach. I am trying it out now to see if it will work for my purposes. Note, I have to add "&retmax=10000" or some other large number to the first URL so that all the Pubmed ID's are returned (i.e., my actual searches are returning 1000+ hits). – kstew Aug 08 '19 at 16:27
  • @kstew, I hope it works for you. I didn't notice it in the API documentation, but many APIs handle the 1000s of hits problem by providing a page at a time (e.g., 1000 per page or something like that). Then, you just need to keep pulling the next page until you're done. – D. Woods Aug 08 '19 at 21:28
  • The API is only returning 20 results per page. So I think the easiest solution for me will be to add 'retmax' and return all the results at once. My problem now is that the API will only allow me to `GET(summary_url)` 3 times, i.e., I need to loop over `GET()` for each result, but the API limits the requests. Do you know a workaround for this or how to delay the `GET()` requests to meet the API limit? – kstew Aug 08 '19 at 21:43
  • @kstew I don't have any experience with this particular API, but with others that I've pulled data from, it is not uncommon for me to use a `Sys.sleep(5)` between `GET()` calls. Some APIs explicitly specify the maximum number of calls per minute in their documentation. If this one doesn't, it might be worth reaching out to the humans behind the API to see if they know (and would publish) that information. – D. Woods Aug 09 '19 at 02:56
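Putting those last two comments together, here is a rough sketch of the looped esummary calls; the retmax value, the 0.5-second pause, and the rate limit (roughly 3 E-utilities requests per second without an API key) are assumptions to check against NCBI's documentation:

library(httr)

# search with retmax so that all matching IDs come back in one call
search_url <- paste0(
  "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
  "?db=pubmed&term=%22hello+world%22&retmax=10000&format=json"
)
result_ids <- content(GET(search_url), type = "application/json",
                      simplifyVector = TRUE)$esearchresult$idlist

# loop over the IDs, pausing between calls to stay under the rate limit
summaries <- list()
for (id in result_ids) {
  summary_url <- paste0(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi",
    "?db=pubmed&version=2.0&format=json&id=", id
  )
  summaries[[id]] <- content(GET(summary_url), type = "application/json")
  Sys.sleep(0.5) # adjust to taste; batching several IDs per request also cuts down the number of calls
}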

Using the easyPubMed package:

library(easyPubMed)
# download the matching PubMed records (returns the names of the files written to disk)
out <- batch_pubmed_download(pubmed_query_string = "hello world")
# parse the first downloaded file into a data frame of article/author metadata
DF <- table_articles_byAuth(pubmed_data = out[1])
write.csv(DF, "helloworld.csv")

See the vignette and help files in easyPubMed for more info.

Other packages are pubmed.mineR, rentrez, and RISmed on CRAN; annotate on Bioconductor; and Rcupcake on GitHub.

G. Grothendieck
  • Thanks, @g-grothendieck. I probably should've explored the `easyPubMed` functions a bit more before rolling my own solution/answer. – D. Woods Aug 09 '19 at 02:58
  • Hi Grothendieck, thanks for the answer. Unfortunately, this is what I initially tried (and indicated in my question) but this method does not include the additional metadata included in the CSV. Thanks for suggesting other packages, which I will look into. – kstew Aug 09 '19 at 15:26