html_nodes returning two results for a link

Question

I'm trying to use R to fetch all the links to data files on the Eurostat website. While my code currently "works", I seem to get a duplicate result for every link.

Note: I use download.file to get around my company's firewall, per this answer.

library(dplyr)
library(rvest)

myurl <- "http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?dir=data&sort=1&sort=2&start=all"

download.file(myurl, destfile = "eurofull.html")

content <- read_html("eurofull.html")

links <- content %>% 
  html_nodes("a") %>% # Note: I don't know the significance of "a"; this was trial and error
  html_attr("href") %>% 
  data.frame()

# keep only the ".tsv.gz" links (fixed = TRUE matches the dots literally)
files <- filter(links, grepl(".tsv.gz", ., fixed = TRUE))

Looking at the top of the data frame:

files$.[1:6]

[1] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Faact_ali01.tsv.gz
[2] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali01.tsv.gz
[3] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Faact_ali02.tsv.gz
[4] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali02.tsv.gz
[5] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Faact_eaa01.tsv.gz
[6] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa01.tsv.gz

The only difference between 1 and 2 is that 1 says "...file=data..." while 2 says "...downfile=data...". This pattern continues for all pairs down the dataframe.

If I download 1 and 2 and read the files into R, a check with identical() confirms they are the same.

Why are two links to the same data being returned? Is there a way (other than filtering for "downfile") to only return one of the links?

That's the way they have built the site. Since the links are contained in a table, you could attempt to return only the one column of the table. https://community.rstudio.com/t/whats-the-most-interesting-use-of-rvest-youve-seen-in-the-wild/745/7 — seasmith, Apr 17 '18 at 07:22
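The single-column idea from the comment above can be sketched as follows. This is only an illustration: the HTML is a made-up two-row table, and the assumption that the download link sits in the second cell of each row has not been checked against the real Eurostat page.

```r
library(rvest)

# Hypothetical listing; the real page layout may differ.
listing <- read_html('<table>
  <tr>
    <td><a href="/listing?sort=1&amp;file=data%2Fa.tsv.gz">a.tsv.gz</a></td>
    <td><a href="/listing?sort=1&amp;downfile=data%2Fa.tsv.gz">Download</a></td>
  </tr>
  <tr>
    <td><a href="/listing?sort=1&amp;file=data%2Fb.tsv.gz">b.tsv.gz</a></td>
    <td><a href="/listing?sort=1&amp;downfile=data%2Fb.tsv.gz">Download</a></td>
  </tr>
</table>')

# "td:nth-child(2) a" keeps only anchors inside the second cell of each row,
# so each file contributes one link instead of two.
second_col <- listing %>%
  html_nodes("td:nth-child(2) a") %>%
  html_attr("href")
```

If the real table puts the links in a different column, only the index inside nth-child() would need to change.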

score 1 · Accepted Answer · answered Apr 17 '18 at 09:05

As noted, you can just do some better node targeting. This uses XPath rather than CSS selectors and picks only the links with downfile in the href:

html_nodes(content, xpath = ".//a[contains(@href, 'downfile')]") %>% 
  html_attr("href") %>% 
  sprintf("http://ec.europa.eu%s", .) %>%  # hrefs already start with "/"
  head()
## [1] "http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali01.tsv.gz"
## [2] "http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali02.tsv.gz"
## [3] "http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa01.tsv.gz"
## [4] "http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa02.tsv.gz"
## [5] "http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa03.tsv.gz"
## [6] "http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa04.tsv.gz"
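
For comparison, roughly the same targeting can be written as a CSS selector. The sketch below runs on a small made-up HTML fragment (not the real Eurostat markup) so it is self-contained. It also shows why the original html_nodes("a") call returned pairs: "a" matches every anchor element on the page. The use of xml2::url_absolute() to resolve the relative hrefs is an alternative to sprintf(), shown here as one option rather than the answer's method.

```r
library(rvest)

# Hypothetical fragment with two links per file, mimicking the pattern in the question.
demo <- read_html('<table>
  <tr>
    <td><a href="/listing?sort=1&amp;file=data%2Fa.tsv.gz">a.tsv.gz</a></td>
    <td><a href="/listing?sort=1&amp;downfile=data%2Fa.tsv.gz">Download</a></td>
  </tr>
  <tr>
    <td><a href="/listing?sort=1&amp;file=data%2Fb.tsv.gz">b.tsv.gz</a></td>
    <td><a href="/listing?sort=1&amp;downfile=data%2Fb.tsv.gz">Download</a></td>
  </tr>
</table>')

# "a" selects every <a> element, so both links in each row come back.
all_links <- demo %>% html_nodes("a") %>% html_attr("href")

# CSS attribute-substring selector: anchors whose href contains "downfile".
down_links <- demo %>% html_nodes("a[href*='downfile']") %>% html_attr("href")

# Resolve the relative hrefs against the site root.
full_links <- xml2::url_absolute(down_links, "http://ec.europa.eu/")
```

Here all_links has four entries and down_links has two, one per file.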
