
I use the XML package to get the links from this URL.

# Parse HTML URL
v1WebParse <- htmlParse(v1URL)
# Read links and get the quotes of the companies from the href
t1Links <- data.frame(xpathSApply(v1WebParse, '//a', xmlGetAttr, 'href'))

While this method is very efficient, I've used rvest and it seems faster at parsing a web page than XML. I tried html_nodes and html_attrs but I can't get it to work.

hrbrmstr
capm

4 Answers


Despite my comment, here's how you can do it with rvest. Note that we need to read in the page with htmlParse first since the site has the content-type set to text/plain for that file and that tosses rvest into a tizzy.

library(rvest)
library(XML)

pg <- htmlParse("http://www.bvl.com.pe/includes/empresas_todas.dat")
pg %>% html_nodes("a") %>% html_attr("href")

##   [1] "/inf_corporativa71050_JAIME1CP1A.html" "/inf_corporativa10400_INTEGRC1.html"  
##   [3] "/inf_corporativa66100_ACESEGC1.html"   "/inf_corporativa71300_ADCOMEC1.html"  
## ...
## [273] "/inf_corporativa64801_VOLCAAC1.html"   "/inf_corporativa58501_YURABC11.html"  
## [275] "/inf_corporativa98959_ZNC.html"  

That further illustrates rvest's XML package underpinnings.

UPDATE

rvest::read_html() can handle this directly now:

pg <- read_html("http://www.bvl.com.pe/includes/empresas_todas.dat")
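Putting the update together with the extraction step above, the whole task collapses to a single pipeline (a sketch assuming rvest 0.3 or later, where read_html() superseded html()):

```r
library(rvest)

# read_html() copes with the text/plain content type directly,
# so the XML::htmlParse() workaround is no longer needed
pg <- read_html("http://www.bvl.com.pe/includes/empresas_todas.dat")
pg %>% html_nodes("a") %>% html_attr("href")
```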
hrbrmstr
  • You are right, for the node extraction `rvest` uses `XML`. I'll discuss in the chat the difference in times for the sites on which I used the packages. Thanks for the reply. – capm Dec 30 '14 at 06:02

I know you're looking for an rvest answer, but here's another way using the XML package that might be more efficient than what you're doing.

Have you seen the getLinks() function in example(htmlParse)? I use this modified version from the examples to get href links. It's a handler function so we can collect the values as they are read, saving on memory and increasing efficiency.

library(XML)

links <- function(URL)
{
    # Build a SAX-style handler: the <a> handler appends each href
    # to a closed-over vector as the node is read
    getLinks <- function() {
        links <- character()
        list(a = function(node, ...) {
                 links <<- c(links, xmlGetAttr(node, "href"))
                 node
             },
             links = function() links)
    }
    h1 <- getLinks()
    htmlTreeParse(URL, handlers = h1)
    h1$links()
}

links("http://www.bvl.com.pe/includes/empresas_todas.dat")
#  [1] "/inf_corporativa71050_JAIME1CP1A.html"
#  [2] "/inf_corporativa10400_INTEGRC1.html"  
#  [3] "/inf_corporativa66100_ACESEGC1.html"  
#  [4] "/inf_corporativa71300_ADCOMEC1.html"  
#  [5] "/inf_corporativa10250_HABITAC1.html"  
#  [6] "/inf_corporativa77900_PARAMOC1.html"  
#  [7] "/inf_corporativa77935_PUCALAC1.html"  
#  [8] "/inf_corporativa77600_LAREDOC1.html"  
#  [9] "/inf_corporativa21000_AIBC1.html"     
#  ...
#  ...
Rich Scriven
  • Great help, I didn't check the examples in `htmlParse`, but I modified my code with your suggestion. In this case `XML` works great but it takes longer to fetch historical prices from this [web](http://www.bvl.com.pe/jsp/cotizacion.jsp?fec_inicio=20100101&fec_fin=20141130&nemonico=SIDERC1) than `rvest` does. – capm Dec 04 '14 at 16:22
  • Prices? Your question says you're trying to get the links – Rich Scriven Dec 22 '14 at 05:58
  • Yes, from [this web](http://www.bvl.com.pe/includes/empresas_todas.dat) I tried to get all the links from the site, while on [this site](http://www.bvl.com.pe/jsp/cotizacion.jsp?fec_inicio=20100101&fec_fin=20141130&nemonico=SIDERC1) I tried to parse a table that contains historical prices for the SIDERC1 quote. I used `XML` on both sites but I could only use `rvest` on the latter. – capm Dec 30 '14 at 05:23
# Option 1
library(XML)  # getHTMLLinks() is in XML, not RCurl
getHTMLLinks('http://www.bvl.com.pe/includes/empresas_todas.dat')

# Option 2
library(rvest)
library(pipeR) # %>>% will be faster than %>%
html("http://www.bvl.com.pe/includes/empresas_todas.dat") %>>% html_nodes("a") %>>% html_attr("href")
RYO ENG Lian Hu

Richard's answer works for HTTP pages but not for the HTTPS page I needed (Wikipedia). I substituted RCurl's getURL() function to fetch the page first, as below:

library(RCurl)
library(XML)

links <- function(URL)
{
  getLinks <- function() {
    links <- character()
    list(a = function(node, ...) {
      links <<- c(links, xmlGetAttr(node, "href"))
      node
    },
    links = function() links)
  }
  h1 <- getLinks()
  # Fetch over HTTPS first, then parse the downloaded text
  xData <- getURL(URL)
  htmlTreeParse(xData, handlers = h1)
  h1$links()
}
bshor