I'm scraping this website using the rvest package. When I iterate my function too many times I get "Error in open.connection(x, "rb") : Timeout was reached". I have searched for similar questions, but the answers seem to lead to dead ends. I suspect that the problem is server-side and that the website has a built-in restriction on how many times I can visit the page. How do I investigate this hypothesis?

The code: I have the links to the underlying web pages and want to construct a data frame with the information extracted from the associated web pages. I have simplified my scraping function a bit as the problem is still occurring with a simpler function:

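library(rvest)    # read_html(), html_nodes(), html_text(), %>%
library(stringr)  # str_split(), str_replace_all(), str_trim()
library(purrr)    # map_df()

scrape_test = function(link) {

  # The course id and semester are the 5th and 6th pieces of the URL
  slit <- str_split(link, "/") %>%
    unlist()
  id <- slit[5]
  sem <- slit[6]

  # Download the page and extract the cleaned-up text of its <h2> nodes
  name <- link %>%
    read_html(encoding = "UTF-8") %>%
    html_nodes("h2") %>%
    html_text() %>%
    str_replace_all("\r\n", "") %>%
    str_trim()

  return(data.frame(id, sem, name))
}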

I use map_df() from the purrr package to iterate the function over the links:

test.data = links %>%
  map_df(scrape_test)

Now, if I iterate the function over only 50 links I receive no error. But when I increase the number of links I encounter the aforementioned error. Furthermore, I get the following warnings:

  • "In bind_rows_(x, .id) : Unequal factor levels: coercing to character"
  • "closing unused connection 4 (link)"

EDIT: The following code, which creates the links object, can be used to reproduce my results:

links <- c(rep("http://karakterstatistik.stads.ku.dk/Histogram/NMAK13032E/Winter-2013/B2", 100))
ScrapeGoat
  • I forgot a "link %>%" as I simplified my function directly in stack overflow, sorry about that. The problem lies in the multiple iterations where it suddenly gives the error when you reach too many links. – ScrapeGoat Aug 20 '16 at 16:57
  • Maybe try inserting a `Sys.sleep(x)` (insert your `x` of choice) into your function to pause in between requests? – Weihuang Wong Aug 20 '16 at 17:02
  • Yes, that would definitely be a possible solution. How do I know what "x" should be? Are there any tricks or is it trial and error? However, as I want to iterate over 1000 links as a start and possibly 32000, I probably need another solution, since setting x=2 would take a long while to run. – ScrapeGoat Aug 20 '16 at 17:13
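
One way to look into the rate-limit hypothesis raised in the question is to log how long each request takes and which ones fail, then compare runs with different pauses. The sketch below is only an illustration in base R: `probe_requests()` and its arguments are hypothetical names, not part of rvest, and `fetch` stands for whatever function downloads a page.

```r
# Hypothetical helper (not part of rvest): call `fetch` on every URL and
# record the elapsed time and any error message for each request.
probe_requests <- function(urls, fetch, pause = 0) {
  rows <- lapply(seq_along(urls), function(i) {
    start <- Sys.time()
    # NA means the request succeeded; otherwise keep the error message
    msg <- tryCatch({
      fetch(urls[i])
      NA_character_
    }, error = function(e) conditionMessage(e))
    seconds <- as.numeric(difftime(Sys.time(), start, units = "secs"))
    Sys.sleep(pause)  # optional pause between requests
    data.frame(request = i, ok = is.na(msg), seconds = seconds,
               message = msg, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Possible usage with the `links` vector from the question (needs network):
# log <- probe_requests(links, function(u) rvest::read_html(u), pause = 0)
# log[!log$ok, ]   # which requests failed, and with what message
```

If the failures start at roughly the same request number on repeated runs, and disappear or move later when `pause` is increased, that would be consistent with a limit on the server side. It is not proof, since a slow network can produce the same timeout message.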

1 Answer

With large scraping tasks I would usually do a for-loop, which helps with troubleshooting. Create an empty list for your output:

d <- vector("list", length(links))

Here I do a for-loop, with a tryCatch block so that if the output is an error, we wait a couple of seconds and try again. We also include a counter that moves on to the next link if we're still getting an error after five attempts. In addition, we have if (!(links[i] %in% names(d))) so that if we have to break the loop, we can skip the links we've already scraped when we restart the loop.

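for (i in seq_along(links)) {
  # Skip links that were already scraped in an earlier, interrupted run
  if (!(links[i] %in% names(d))) {
    cat(paste("Doing", links[i], "..."))
    ok <- FALSE
    counter <- 0
    # Try each link at most five times
    while (!ok && counter < 5) {
      counter <- counter + 1
      out <- tryCatch({
          scrape_test(links[i])
        },
        error = function(e) {
          Sys.sleep(2)  # wait before the next attempt
          e             # return the error object instead of stopping
        }
      )
      if (inherits(out, "error")) {
        cat(".")
      } else {
        ok <- TRUE
        cat(" Done.")
      }
    }
    cat("\n")
    # Store the result (or the last error) under the link's name
    d[[i]] <- out
    names(d)[i] <- links[i]
  }
}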
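
The fixed two-second wait in the loop above could also be replaced by a wait that grows after each failure, which keeps successful requests fast and only slows down when the server starts refusing. This is a minimal sketch of that idea in base R, not something from the original answer; `retry_with_backoff()`, `max_attempts` and `base_wait` are hypothetical names.

```r
# Hypothetical helper: call f(...) up to `max_attempts` times, doubling the
# wait after every failure (base_wait, 2 * base_wait, 4 * base_wait, ...).
retry_with_backoff <- function(f, ..., max_attempts = 5, base_wait = 1) {
  for (attempt in seq_len(max_attempts)) {
    out <- tryCatch(f(...), error = function(e) e)
    if (!inherits(out, "error")) {
      return(out)  # success: hand back the result immediately
    }
    if (attempt < max_attempts) {
      Sys.sleep(base_wait * 2^(attempt - 1))
    }
  }
  out  # still failing: return the last error object
}

# Possible usage inside the loop, in place of the while block:
# out <- retry_with_backoff(scrape_test, links[i])
```

As in the loop above, a link that keeps failing ends up as an error object rather than a data frame, so the caller still has to check for that.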
Weihuang Wong
  • Whoa, that function is nice! I just ran the code but I hit a continuous error so had to break it. How do I identify where it broke? And how do I start it from the breaking point. I dont think I understand the `if (!(links[i] %in% names(d)))`. – ScrapeGoat Aug 20 '16 at 18:58
  • Because we're using a for-loop, an object `i` is created in the global environment and updated with each iteration. So if we break the loop manually, we can inspect the last value of `i` by simply calling `i` from the console; hence `links[i]` will return the link where the loop broke. The `if(...)` says only do all the stuff in the encapsulated block if `links[i]` does not already exist in the `names` of list elements; and we only set `names(d)[i] – Weihuang Wong Aug 20 '16 at 19:10
  • Okay, so if I break it manually, it automatically continues where it broke? That is so cool, whoa! What is the easiest way to convert the created list to a data frame? – ScrapeGoat Aug 20 '16 at 19:25
  • And too bad I can't upvote your answer! It is _really really_ great! Thanks a lot – ScrapeGoat Aug 20 '16 at 19:28
  • `do.call(rbind, d)` (where `d` is the name of the list). But this only works if every element in `d` is a dataframe with the same number of columns. E.g. it won't work if, as in the code above, we assign an "error" object to the list element if `scrape_test()` is still failing after 5 attempts, and the list is a mix of dataframes and error objects. In this case, you have to remove the error objects by iterating through the list before you call `do.call(rbind, ...)`. – Weihuang Wong Aug 20 '16 at 19:31
  • It is impossible for me to convert the data frame list into one single data frame as I can't seem to exclude non data frame elements. Even when I try iteration. Should I post a new question? – ScrapeGoat Aug 20 '16 at 21:45
  • There really should be a built-in retry setting. It's a bit silly one has to do this. – CoderGuy123 Dec 01 '16 at 06:58
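
Following the comment above on `do.call(rbind, d)`, one way to drop the non-data-frame elements before binding is to test each element with `is.data.frame()`. The sketch below uses a small made-up list in place of the real `d`; the values in it are invented for illustration only.

```r
# Toy stand-in for the list `d` built by the loop: two scraped results,
# one error object, and one element that was never filled in (NULL).
d <- list(
  link_a = data.frame(id = "ID1", sem = "Winter-2013", name = "Course A",
                      stringsAsFactors = FALSE),
  link_b = simpleError("Timeout was reached"),
  link_c = data.frame(id = "ID2", sem = "Summer-2014", name = "Course B",
                      stringsAsFactors = FALSE),
  link_d = NULL
)

# TRUE for elements that are data frames, FALSE for errors and NULLs
is_df <- vapply(d, is.data.frame, logical(1))

# Bind only the data frames; the list names become row names, so reset them
result <- do.call(rbind, d[is_df])
rownames(result) <- NULL

# Links that still need to be scraped again
failed <- names(d)[!is_df]
```

Keeping `failed` around makes it possible to rerun the scraper on just those links later.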