I am a novice at web scraping with R and I am stuck on the following problem: I want to use R to submit a search query to PubMed and then download a CSV file from the results page. On the site, the CSV file is reached by clicking 'Send to' (which opens a dropdown menu), selecting the 'File' radio button, changing the 'Format' option to 'CSV' (option 6), and finally clicking the 'Create File' button to start the download.
A few notes:
1. Yes, this type of remote search and download complies with NCBI's policies.
2. To preempt "Why don't you just use the easyPubMed package?": I have already tried it and am using it for another part of my work. However, retrieving search results through that package misses some of the article metadata that the CSV download includes.
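For context, this is roughly how I'm retrieving records with easyPubMed in the other part of my work (a minimal sketch from memory; `get_pubmed_ids()` and `fetch_pubmed_data()` are the package functions as I'm using them):

```r
library(easyPubMed)

# run the same toy query through easyPubMed for comparison
ids <- get_pubmed_ids('"hello world"')
xml <- fetch_pubmed_data(ids, format = "xml")
# ...parse the XML from here; this route misses some of the
# fields that the site's CSV export includes
```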
I have viewed these related questions: Download csv file from webpage after submitting form from dropdown using rvest package in R; R Download .csv file tied to input boxes and a "click" button; Using R to "click" a download file button on a webpage. I feel that the previous solutions provided by @hrbrmstr contain the answer, but I just can't put the pieces together to download the CSV file.
I think the elegant solution to this problem is a two-step process: 1) POST a search request to PubMed and GET the results, and 2) submit a second POST request to the results page (or somehow navigate within it) with the desired options selected to download the CSV file. I have tried the following with a toy search query ("hello world", with quotes, which presently returns 6 results)...
library(httr)
library(rvest)

query <- '"hello world"'
url <- 'https://www.ncbi.nlm.nih.gov/pubmed/'
html_form(html_session(url)) # inspect the search form; the query field is named 'term'
# post search and retrieve results
session <- POST(url,body = list(term=query),encode='form')
# scrape results to check that above worked
content(session) %>% html_nodes('#maincontent > div > div:nth-child(5)') %>%
html_text()
content(session) %>% html_nodes('#maincontent > div > div:nth-child(5)') %>%
html_nodes('p') %>% html_text()
# view html nodes of dropdown menu -- how to 'click' these via R?
content(session) %>% html_nodes('#sendto > a')
content(session) %>% html_nodes('#send_to_menu > fieldset > ul > li:nth-child(1) > label')
content(session) %>% html_nodes('#file_format')
content(session) %>% html_nodes('#submenu_File > button')
# submit request to download the CSV file
POST(session$url, # I know this doesn't work, but I would hope something similar is possible
     encode = 'form',
     body = list(
       'EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.SendTo' = 'File',
       'EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.FFormat' = 6,
       'EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.SendToSubmit' = 1
     ),
     write_disk('results.csv'))
The last line above fails: a file named results.csv is written, but it contains the HTML returned by the POST request rather than CSV data. Ideally, how do I edit that last call to get the desired CSV file?
A possible hack is to skip straight to the results page. In other words, I know that submitting the "hello world" search returns the following URL: https://www.ncbi.nlm.nih.gov/pubmed/?term=%22hello+world%22. So I can extrapolate from here and build the results URLs directly from my search queries, if necessary.
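To generalize that hack, here is a small helper I could use to build a results URL from an arbitrary query (`build_results_url()` is my own name for it; it just percent-encodes the query with base R's `URLencode()`):

```r
# hypothetical helper: percent-encode a query and append it to the PubMed search URL
build_results_url <- function(query) {
  paste0('https://www.ncbi.nlm.nih.gov/pubmed/?term=',
         URLencode(query, reserved = TRUE))
}

build_results_url('"hello world"')
# "https://www.ncbi.nlm.nih.gov/pubmed/?term=%22hello%20world%22"
```

The site itself shows `+` for spaces, but `%20` should decode to the same query string on the server side.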
I have tried substituting this URL into the POST call above, but it still doesn't return the desired CSV file. I can view the form fields using the command below...
# view form options on the results page
html_form(html_session('https://www.ncbi.nlm.nih.gov/pubmed/?term=%22hello+world%22'))
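To make that inspection a bit more systematic, here is a sketch (using rvest's `html_session()`/`html_form()` as above) that lists the name of each form on the results page, so I can spot the display-bar form holding the SendTo/FFormat fields:

```r
library(rvest)

sess  <- html_session('https://www.ncbi.nlm.nih.gov/pubmed/?term=%22hello+world%22')
forms <- html_form(sess)

# print each form's name to locate the one containing
# the SendTo / FFormat / SendToSubmit fields
vapply(forms, function(f) f$name, character(1))
```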
Or, can I build an expanded URL from the form options above? Something like...
url2 <- 'https://www.ncbi.nlm.nih.gov/pubmed/?term=%22hello+world%22&EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.SendTo=File&EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.FFormat=6&EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.SendToSubmit=1'
POST(url2,write_disk('results2.csv'))
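When debugging these attempts, I have been sanity-checking what the server actually sends back with httr's response helpers (a quick diagnostic sketch; `url2` is the expanded URL built above):

```r
library(httr)

resp <- POST(url2)

# confirm what came back: I see "text/html" here, not "text/csv"
http_type(resp)
status_code(resp)
headers(resp)[["content-type"]]
```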
I expect to download a CSV file containing the article metadata for the 6 results; instead, both attempts return the HTML of the results page.
Any help is greatly appreciated! Thank you.