
I'm trying to automate a process that involves downloading .zip files from a couple of web pages and extracting the .csvs they contain. The challenge is that the .zip file names, and thus the link addresses, change weekly or annually, depending on the page. Is there a way to scrape the current link addresses from those pages so I can then feed those addresses to a function that downloads the files?

One of the target pages is this one: http://www.acleddata.com/data/realtime-data-2015/. The file I want to download is the second bullet under the header "2015 Realtime Complete All Africa File"---i.e., the zipped .csv. As I write, that file is labeled "Realtime 2015 All Africa File (updated 11th July 2015)(csv)" on the web page, and the link address that I want is http://www.acleddata.com/wp-content/uploads/2015/07/ACLED-All-Africa-File_20150101-to-20150711_csv.zip, but that should change later today, because the data are updated each Monday---hence my challenge.
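To be clear, the downloading and extracting part isn't the obstacle. Once I have a current address, I'd hand it to something like this (a rough sketch using base R; the object names are just placeholders):

url <- "http://www.acleddata.com/wp-content/uploads/2015/07/ACLED-All-Africa-File_20150101-to-20150711_csv.zip"
tmp <- tempfile(fileext = ".zip")
download.file(url, tmp, mode = "wb")          # binary mode so the archive isn't mangled on Windows
csv.name <- unzip(tmp, list = TRUE)$Name[1]   # name of the .csv inside the archive
acled <- read.csv(unz(tmp, csv.name), stringsAsFactors = FALSE)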

I tried but failed to automate extraction of that .zip file name with 'rvest' and the SelectorGadget extension in Chrome. Here's how that went:

> library(rvest)
> realtime.page <- "http://www.acleddata.com/data/realtime-data-2015/"
> realtime.html <- html(realtime.page)
> realtime.link <- html_node(realtime.html, xpath = "//ul[(((count(preceding-sibling::*) + 1) = 7) and parent::*)]//li+//li//a")
> realtime.link
[1] NA

The XPath in that call to html_node() came from SelectorGadget: I highlighted just the "(csv)" portion of the "Realtime 2015 All Africa File (updated 11th July 2015)(csv)" entry in green, then clicked enough other highlighted bits of the page to eliminate all the yellow and leave only red and green.

Did I make a small mistake in that process, or am I just entirely on the wrong track here? As you can tell, I have zero experience with HTML and web-scraping, so I'd really appreciate some assistance.

ulfelder
    Try `realtime.html %>% html_node(xpath = "/html/body/div/div/div/div[1]/div/article/div/ul[1]/li[2]/a") %>% html_attr("href")`. I used Firebug to extract the xpath. – lukeA Jul 20 '15 at 13:04
  • Yes, that works, thank you very much. I will now go try to figure out how to use Firebug. – ulfelder Jul 20 '15 at 13:21
  • @lukeA I was able to use Inspect Element in Chrome to see the html for the other page with a data set I want and, based on your debugged example, to figure out how to write an xpath that worked for it. So: thanks again! – ulfelder Jul 20 '15 at 14:26
  • The XPath can also be found in your browser by right-clicking the required element and selecting Inspect. A panel will open to the right with the element highlighted. Right-click that element, go to Copy, and choose Copy XPath. – Nebulloyd Mar 25 '20 at 17:09

1 Answer


I think you're trying to do too much in a single XPath expression. I'd attack the problem in a sequence of smaller steps:

library(rvest)
library(stringr)
page <- html("http://www.acleddata.com/data/realtime-data-2015/")

page %>%
  html_nodes("a") %>%       # find all links
  html_attr("href") %>%     # get the url
  str_subset("\\.xlsx") %>% # find those that end in xlsx
  .[[1]]                    # look at the first one
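
Since the address changes each week, you could wrap those same steps in a small helper and feed its result to your download step. A sketch (the function name is arbitrary, and it assumes the page keeps linking directly to the archive files):

get_current_link <- function(page_url, pattern) {
  # uses rvest and stringr, loaded above
  html(page_url) %>%          # read_html(page_url) in newer versions of rvest
    html_nodes("a") %>%       # find all links on the page
    html_attr("href") %>%     # get their addresses
    str_subset(pattern) %>%   # keep those matching the file type
    .[[1]]                    # take the first match
}

get_current_link("http://www.acleddata.com/data/realtime-data-2015/", "\\.zip")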
hadley
  • Very efficient, thank you. For my task, I needed the url for the first .zip file, but it was easy to get that by replacing "\\.xlsx" with "\\.zip". Voila. – ulfelder Jul 20 '15 at 19:57
  • ... a warning recommends using 'xml2::read_html' instead of html(). Works fine. – Tiziano Mar 19 '20 at 12:20
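
For reference, with current versions of rvest the deprecated html() call in the answer becomes read_html() (re-exported from xml2); the rest of the pipeline is unchanged. A minimal sketch:

library(rvest)  # read_html() is re-exported from xml2
page <- read_html("http://www.acleddata.com/data/realtime-data-2015/")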