
I'm trying to automate a process that involves downloading .zip files from a couple of web pages and extracting the .csvs they contain. The challenge is that the .zip file names, and thus the link addresses, change weekly or annually, depending on the page. Is there a way to scrape the current link addresses from those pages so I can then feed those addresses to a function that downloads the files?

One of the target pages is this one: http://www.acleddata.com/data/realtime-data-2015/. The file I want to download is the second bullet under the header "2015 Realtime Complete All Africa File"---i.e., the zipped .csv. As I write, that file is labeled "Realtime 2015 All Africa File (updated 11th July 2015)(csv)" on the web page, and the link address that I want is http://www.acleddata.com/wp-content/uploads/2015/07/ACLED-All-Africa-File_20150101-to-20150711_csv.zip, but that should change later today, because the data are updated each Monday---hence my challenge.
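To be clear, the downloading and extracting part isn't the obstacle. Once I have a current address, I'd hand it to something like this (a rough sketch using base R; the object names are just placeholders):

url <- "http://www.acleddata.com/wp-content/uploads/2015/07/ACLED-All-Africa-File_20150101-to-20150711_csv.zip"
tmp <- tempfile(fileext = ".zip")
download.file(url, tmp, mode = "wb")          # binary mode so the archive isn't mangled on Windows
csv.name <- unzip(tmp, list = TRUE)$Name[1]   # name of the .csv inside the archive
acled <- read.csv(unz(tmp, csv.name), stringsAsFactors = FALSE)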

I tried but failed to automate extraction of that .zip file name with 'rvest' and the SelectorGadget extension in Chrome. Here's how that went:

> library(rvest)
> realtime.page <- "http://www.acleddata.com/data/realtime-data-2015/"
> realtime.html <- html(realtime.page)
> realtime.link <- html_node(realtime.html, xpath = "//ul[(((count(preceding-sibling::*) + 1) = 7) and parent::*)]//li+//li//a")
> realtime.link
[1] NA

The XPath in that call to html_node() came from SelectorGadget: I highlighted just the "(csv)" portion of the "Realtime 2015 All Africa File (updated 11th July 2015)(csv)" entry in green, then clicked enough other highlighted bits of the page to eliminate all the yellow and leave only red and green.

Did I make a small mistake in that process, or am I just entirely on the wrong track here? As you can tell, I have zero experience with HTML and web-scraping, so I'd really appreciate some assistance.

ulfelder
    Try `realtime.html %>% html_node(xpath = "/html/body/div/div/div/div[1]/div/article/div/ul[1]/li[2]/a") %>% html_attr("href")`. I used Firebug to extract the xpath. – lukeA Jul 20 '15 at 13:04
  • Yes, that works, thank you very much. I will now go try to figure out how to use Firebug. – ulfelder Jul 20 '15 at 13:21
  • @lukeA I was able to use Inspect Element in Chrome to see the html for the other page with a data set I want and, based on your debugged example, to figure out how to write an xpath that worked for it. So: thanks again! – ulfelder Jul 20 '15 at 14:26
  • The XPath can also be found in your browser by right-clicking the required element and selecting Inspect. A panel will open to the right with the element highlighted. Right-click that element, go to Copy, and choose Copy XPath. – Nebulloyd Mar 25 '20 at 17:09

1 Answer


I think you're trying to do too much in a single XPath expression. I'd attack the problem in a sequence of smaller steps:

library(rvest)
library(stringr)
page <- html("http://www.acleddata.com/data/realtime-data-2015/")

page %>%
  html_nodes("a") %>%       # find all links
  html_attr("href") %>%     # get the url
  str_subset("\\.xlsx") %>% # find those that end in xlsx
  .[[1]]                    # look at the first one
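
Since the address changes each week, you could wrap those same steps in a small helper and feed its result to your download step. A sketch (the function name is arbitrary, and it assumes the page keeps linking directly to the archive files):

get_current_link <- function(page_url, pattern) {
  # uses rvest and stringr, loaded above
  html(page_url) %>%          # read_html(page_url) in newer versions of rvest
    html_nodes("a") %>%       # find all links on the page
    html_attr("href") %>%     # get their addresses
    str_subset(pattern) %>%   # keep those matching the file type
    .[[1]]                    # take the first match
}

get_current_link("http://www.acleddata.com/data/realtime-data-2015/", "\\.zip")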
hadley
  • Very efficient, thank you. For my task, I needed the url for the first .zip file, but it was easy to get that by replacing "\\.xlsx" with "\\.zip". Voila. – ulfelder Jul 20 '15 at 19:57
  • ... a warning recommends using 'xml2::read_html' instead of html(). Works fine. – Tiziano Mar 19 '20 at 12:20
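
For reference, with current versions of rvest the deprecated html() call in the answer becomes read_html() (re-exported from xml2); the rest of the pipeline is unchanged. A minimal sketch:

library(rvest)  # read_html() is re-exported from xml2
page <- read_html("http://www.acleddata.com/data/realtime-data-2015/")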