
Simple question: this code, x <- read_html(url), sometimes hangs and keeps reading the page for an unbounded amount of time. I don't know how to handle this, for example by setting some maximum time to wait for a response. I could use try, catch, or similar to retry, but the call just hangs and nothing happens. Does anyone know how to deal with this?

There's nothing wrong with the page itself; this only happens occasionally, and when I retry manually it works.

JJJ
Peter.k
  • Possible duplicate of [Comatose web crawler in R (w/ rvest)](https://stackoverflow.com/questions/32883512/comatose-web-crawler-in-r-w-rvest) – Kim Jun 10 '18 at 08:00

1 Answer


You can wrap read_html in the GET() function from the httr package, which lets you pass a timeout() setting.

For example, if your original code was:

library(rvest)
library(dplyr)

my_url <- "https://stackoverflow.com/questions/48722076/how-to-set-timeout-in-rvest"
x <- my_url %>% read_html(.)

then you could replace it with:

library(httr)

# Allow 10 seconds
my_url %>% GET(., timeout(10)) %>% read_html

# Allow 30 seconds
my_url %>% GET(., timeout(30)) %>% read_html
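
If you do this in several places, you could wrap the pattern in a small helper function; the name read_html_timeout below is just an illustration, not part of rvest or httr:

library(httr)
library(rvest)

# Hypothetical helper: fetch a page, but give up after `seconds` instead of hanging
read_html_timeout <- function(url, seconds = 10) {
  resp <- GET(url, timeout(seconds))
  stop_for_status(resp)  # turn HTTP errors (404, 500, ...) into R errors
  read_html(resp)
}

page <- read_html_timeout(my_url, seconds = 10)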

Example

To put it to the test, try setting an extremely short timeout period (e.g. a hundredth of a second)

# Allow an unreasonably short amount of time so the request errors rather than hangs indefinitely

my_url %>% GET(., timeout(0.01)) %>% read_html

# Error in curl::curl_fetch_memory(url, handle = handle) : 
#   Timeout was reached: Resolving timed out after 10 milliseconds

You can find some more examples here
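
As an aside, if you want the same timeout applied to every request without passing timeout() each time, httr's set_config() can set it globally. A minimal sketch; reset_config() restores the defaults:

library(httr)

set_config(timeout(10))               # every subsequent httr request times out after 10 seconds
x <- my_url %>% GET(.) %>% read_html
reset_config()                        # restore the default behaviour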

Using it in a loop (e.g. skip to the next URL if one times out)

Try running this code. It assumes you have a number of URLs to visit (3 in this case); the second URL below will delay 3 seconds before returning its HTML, which is a handy way to test the functionality you're looking for. We set the timeout to 2 seconds, so we know that request will fail. If an error occurs, tryCatch() runs the handler supplied as its error argument; in this case it simply assigns 'Timed out!' to that list element.


my_urls <- c("https://stackoverflow.com/questions/48722076/how-to-set-timeout-in-rvest",
             "http://httpbin.org/delay/3", # This url will delay 3 seconds
             "http://httpbin.org/delay/1") 

x <- list()

# Set timeout for 2 seconds (so second url will fail)
for (i in seq_along(my_urls)) {

  print(paste0("Scraping url number ", i))

  tryCatch(x[[i]] <- my_urls[i] %>% GET(., timeout(2)) %>% read_html,
           error = function(e) { x[[i]] <<- "Timed out!" } )
  
}

Now we inspect the output: the first and third sites returned content, while the second timed out.

# > x
# [[1]]
# {xml_document}
# <html itemscope="" itemtype="http://schema.org/QAPage" class="html__responsive">
#   [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<title>r - how to set timeout ...
# [2] <body class="question-page unified-theme">\r\n    <div id="notify-container"></div>\r\n    <div id="custom ...
# 
# [[2]]
# [1] "Timed out!"
# 
# [[3]]
# {xml_document}
# <html>
# [1] <body><p>{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {}, \n  "headers": {\n    "Accept": ...


Obviously you can set the timeout value to whatever you want; 30 to 60 seconds could be sensible depending on the use case.
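
Since the question also mentions retrying, it may be worth knowing that httr has a RETRY() function that re-issues a request several times before giving up; combined with timeout() you get "wait at most N seconds per attempt, try up to M times". A minimal sketch (the numbers are only examples):

library(httr)
library(rvest)

# Up to 3 attempts, each allowed at most 10 seconds
resp <- RETRY("GET", my_url, timeout(10), times = 3, pause_base = 1)
page <- read_html(resp)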

stevec