
I use the XML package to get the links from this URL.

# Parse HTML URL
v1WebParse <- htmlParse(v1URL)
# Read links and get the quotes of the companies from the href
t1Links <- data.frame(xpathSApply(v1WebParse, '//a', xmlGetAttr, 'href'))

While this method is very efficient, I've used rvest and it seems faster at parsing a web page than XML. I tried html_nodes and html_attrs but I can't get it to work.

hrbrmstr
capm

4 Answers


Despite my comment, here's how you can do it with rvest. Note that we need to read in the page with htmlParse first since the site has the content-type set to text/plain for that file and that tosses rvest into a tizzy.

library(rvest)
library(XML)

pg <- htmlParse("http://www.bvl.com.pe/includes/empresas_todas.dat")
pg %>% html_nodes("a") %>% html_attr("href")

##   [1] "/inf_corporativa71050_JAIME1CP1A.html" "/inf_corporativa10400_INTEGRC1.html"  
##   [3] "/inf_corporativa66100_ACESEGC1.html"   "/inf_corporativa71300_ADCOMEC1.html"  
## ...
## [273] "/inf_corporativa64801_VOLCAAC1.html"   "/inf_corporativa58501_YURABC11.html"  
## [275] "/inf_corporativa98959_ZNC.html"  

That further illustrates rvest's XML package underpinnings.

UPDATE

rvest::read_html() can handle this directly now:

pg <- read_html("http://www.bvl.com.pe/includes/empresas_todas.dat")
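Putting the update together with the extraction step above, the whole task collapses to a single pipeline (a sketch assuming rvest 0.3 or later, where read_html() superseded html()):

```r
library(rvest)

# read_html() copes with the text/plain content type directly,
# so the XML::htmlParse() workaround is no longer needed
pg <- read_html("http://www.bvl.com.pe/includes/empresas_todas.dat")
pg %>% html_nodes("a") %>% html_attr("href")
```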
hrbrmstr
  • You are right, for the node extraction `rvest` uses `XML`. I'll discuss in the chat the difference in times for the sites on which I used the packages. Thanks for the reply. – capm Dec 30 '14 at 06:02

I know you're looking for an rvest answer, but here's another way using the XML package that might be more efficient than what you're doing.

Have you seen the getLinks() function in example(htmlParse)? I use this modified version from the examples to get href links. It's a handler function so we can collect the values as they are read, saving on memory and increasing efficiency.

library(XML)

links <- function(URL)
{
    # Build a SAX-style handler: the <a> handler appends each href
    # to a closed-over vector as the node is read
    getLinks <- function() {
        links <- character()
        list(a = function(node, ...) {
                 links <<- c(links, xmlGetAttr(node, "href"))
                 node
             },
             links = function() links)
    }
    h1 <- getLinks()
    htmlTreeParse(URL, handlers = h1)
    h1$links()
}

links("http://www.bvl.com.pe/includes/empresas_todas.dat")
#  [1] "/inf_corporativa71050_JAIME1CP1A.html"
#  [2] "/inf_corporativa10400_INTEGRC1.html"  
#  [3] "/inf_corporativa66100_ACESEGC1.html"  
#  [4] "/inf_corporativa71300_ADCOMEC1.html"  
#  [5] "/inf_corporativa10250_HABITAC1.html"  
#  [6] "/inf_corporativa77900_PARAMOC1.html"  
#  [7] "/inf_corporativa77935_PUCALAC1.html"  
#  [8] "/inf_corporativa77600_LAREDOC1.html"  
#  [9] "/inf_corporativa21000_AIBC1.html"     
#  ...
#  ...
Rich Scriven
  • Great help, I didn't check the examples in `htmlParse`, but I modified my code with your suggestion. In this case `XML` works great but it takes longer to fetch historical prices from this [web](http://www.bvl.com.pe/jsp/cotizacion.jsp?fec_inicio=20100101&fec_fin=20141130&nemonico=SIDERC1) than `rvest` does. – capm Dec 04 '14 at 16:22
  • Prices? Your question says you're trying to get the links – Rich Scriven Dec 22 '14 at 05:58
  • Yes, from [this web](http://www.bvl.com.pe/includes/empresas_todas.dat) I tried to get all the links from the site, while on [this site](http://www.bvl.com.pe/jsp/cotizacion.jsp?fec_inicio=20100101&fec_fin=20141130&nemonico=SIDERC1) I tried to parse a table that contains historical prices for the SIDERC1 quote. I used `XML` on both sites but I could only use `rvest` on the latter. – capm Dec 30 '14 at 05:23
# Option 1
library(XML)  # getHTMLLinks() is in XML, not RCurl
getHTMLLinks('http://www.bvl.com.pe/includes/empresas_todas.dat')

# Option 2
library(rvest)
library(pipeR) # %>>% will be faster than %>%
html("http://www.bvl.com.pe/includes/empresas_todas.dat") %>>% html_nodes("a") %>>% html_attr("href")
RYO ENG Lian Hu

Richard's answer works for HTTP pages but not for the HTTPS page I needed (Wikipedia). I substituted RCurl's getURL() function to fetch the page first, as below:

library(RCurl)
library(XML)

links <- function(URL)
{
  getLinks <- function() {
    links <- character()
    list(a = function(node, ...) {
      links <<- c(links, xmlGetAttr(node, "href"))
      node
    },
    links = function() links)
  }
  h1 <- getLinks()
  # Fetch over HTTPS first, then parse the downloaded text
  xData <- getURL(URL)
  htmlTreeParse(xData, handlers = h1)
  h1$links()
}
bshor