
I'm trying to web-scrape Obama's speeches page to create word clouds and the like. When I try it for, say, 1, 5, or 10 different pages (speeches) separately, not in a loop, the code works. But with the loop I created (below), the resulting object contains nothing (NULL).

Can somebody help me, please?

library(wordcloud)
library(tm)
library(XML)
library(RCurl)

site <- "http://obamaspeeches.com/"
url <- readLines(site)

h <- htmlTreeParse(file = url, asText = TRUE, useInternalNodes = TRUE, 
    encoding = "utf-8")

# getting the phrases that will form the web addresses for the speeches
teste <- data.frame(h[42:269, ])
teste2 <- teste[grep("href=", teste$h.42.269...), ]
teste2 <- as.data.frame(teste2)
teste3 <- gsub("^.*href=", "", teste2[, "teste2"])
teste3 <- as.data.frame(teste3)
teste4 <- gsub("^/", "", teste3[, "teste3"])
teste4 <- as.data.frame(teste4)
teste5 <- gsub(">.*$", "", teste4[, "teste4"])
teste5 <- as.data.frame(teste5)

# loop to read pages

l <- vector(mode = "list", length = nrow(teste5))
i <- 1
for (i in nrow(teste5)) {
    site <- paste("http://obamaspeeches.com/", teste5[i, ], sep = "")
    url <- readLines(site)
    l[[i]] <- url
    i <- i + 1
}

str(l)

1 Answer


The rvest package makes this considerably simpler, handling both the scraping and the parsing, although a bit of knowledge of CSS or XPath selectors is necessary. It's also a much better approach than running regular expressions over HTML, which is discouraged in favor of a proper HTML parser (like rvest!).

If you're trying to scrape a bunch of sub-pages, you can make a vector of URLs and then lapply across it to scrape and parse each page. The advantage of this approach (over a for loop) is that it returns a list with an item for each iteration, which is much easier to work with afterwards. If you want to go full Hadleyverse, you can use purrr::map instead, which lets you turn it all into one big sequential chain (see the sketch after the code below).

library(rvest)

baseurl <- 'http://obamaspeeches.com/' 

         # For this website, get the HTML,
links <- baseurl %>% read_html() %>% 
    # select <a> nodes that are children of <table> nodes that are aligned left,
    html_nodes(xpath = '//table[@align="left"]//a') %>% 
    # and get the href (link) attribute of that node.
    html_attr('href')

            # Loop across the links vector, applying a function that
speeches <- lapply(links, function(url){
    # pastes the URL onto the base URL,
    paste0(baseurl, url) %>% 
    # fetches the HTML for that page,
    read_html() %>% 
    # selects <table> nodes with a width of 610,
    html_nodes(xpath = '//table[@width="610"]') %>% 
    # get the text, trimming whitespace on the ends,
    html_text(trim = TRUE) %>% 
    # and break the text back into lines, trimming excess whitespace for each.
    textConnection() %>% readLines() %>% trimws()
})
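
For reference, here is a rough sketch of the purrr::map version mentioned above. It assumes the same XPath selectors still match the site's layout and that purrr is installed; it is untested against the live page.

library(rvest)
library(purrr)

baseurl <- 'http://obamaspeeches.com/'

speeches <- baseurl %>% 
    # get the index page and pull the speech links, as before,
    read_html() %>% 
    html_nodes(xpath = '//table[@align="left"]//a') %>% 
    html_attr('href') %>% 
    # then map across the links; like lapply, map returns a list with one item per link
    map(~ paste0(baseurl, .x) %>% 
            read_html() %>% 
            html_nodes(xpath = '//table[@width="610"]') %>% 
            html_text(trim = TRUE) %>% 
            textConnection() %>% 
            readLines() %>% 
            trimws())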
alistaire
  • if I wanted to do a similar process but was looking for keywords in multiple urls and wanted to pull back the keywords, would it be faster to run lapply with a function that uses regex, or is there something in rvest that is more efficient at parsing specific words? I'm running into an issue where it takes a really long time to process 50 or 100 urls. – Matt W. May 23 '17 at 13:31
  • Are you trying to parse the URL or the page it leads to? If the former it's a different question, but check out `httr::parse_url` or the like. – alistaire May 23 '17 at 16:23
  • I'm trying to parse the page it leads to and pull back all the html on that page. Or at least search the html on that page for keywords and pull back the keywords it finds. – Matt W. May 23 '17 at 20:18
  • You should probably ask a question, but to get you started, you can pull the HTML by using `lapply` or `purrr::map` to apply `read_html` across a vector of URLs. Where to go beyond that requires context. – alistaire May 23 '17 at 23:20
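
To illustrate the last comment, a minimal sketch of that keyword approach might look like the following. The `keywords` vector is hypothetical, and `links` is the vector of hrefs scraped in the answer above; note that most of the runtime for 50 or 100 pages is likely the HTTP requests themselves rather than the parsing or the regex.

library(rvest)
library(purrr)

baseurl <- 'http://obamaspeeches.com/'

# hypothetical keywords -- substitute whatever terms you're searching for
keywords <- c('hope', 'economy', 'war')

found <- map(links, function(link){
    # one HTTP request per page, then flatten the page to plain text
    page_text <- paste0(baseurl, link) %>% read_html() %>% html_text()
    # keep only the keywords that appear somewhere on this page
    keywords[map_lgl(keywords, ~ grepl(.x, page_text, ignore.case = TRUE))]
})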