You can fetch the page with the GET function from the httr package, give it a timeout, and pipe the response into read_html.
For example, if your original code was
library(rvest)
library(dplyr)
my_url <- "https://stackoverflow.com/questions/48722076/how-to-set-timeout-in-rvest"
x <- my_url %>% read_html(.)
then you could replace it with
library(httr)
# Allow 10 seconds
my_url %>% GET(., timeout(10)) %>% read_html
# Allow 30 seconds
my_url %>% GET(., timeout(30)) %>% read_html
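If you need this in several places, the same pattern can be wrapped in a small function. This is only a sketch: the name read_html_with_timeout and its seconds argument are made up for illustration, and the call to stop_for_status() is an extra step (not in the code above) that turns HTTP error responses into R errors before parsing.

```r
# Hypothetical helper (name and arguments are illustrative, not part of httr or rvest):
# fetch a page with a time limit, then parse the response.
read_html_with_timeout <- function(url, seconds = 10) {
  # Validate inputs before touching the network
  stopifnot(is.character(url), length(url) == 1,
            is.numeric(seconds), length(seconds) == 1, seconds > 0)

  # GET() raises an R error if the request takes longer than `seconds`
  response <- httr::GET(url, httr::timeout(seconds))

  # Optional extra step: raise an error on 4xx/5xx instead of parsing an error page
  httr::stop_for_status(response)

  rvest::read_html(response)
}
```

It would then be called as read_html_with_timeout(my_url, seconds = 30).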
Example
To put it to the test, try setting an extremely short timeout period (e.g. a hundredth of a second):
# Allow an unreasonably short amount of time so the request errors rather than hangs indefinitely
my_url %>% GET(., timeout(0.01)) %>% read_html
# Error in curl::curl_fetch_memory(url, handle = handle) :
# Timeout was reached: Resolving timed out after 10 milliseconds
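Because the timeout surfaces as an ordinary R error, it can be caught with tryCatch() and told apart from other failures by its message. The sketch below assumes the message contains "Timeout" or "timed out", as in the output above; the exact wording comes from curl and may differ between versions. The names is_timeout_error and fetch_or_null are hypothetical.

```r
# Hypothetical helper: TRUE if a condition's message looks like a curl timeout.
# Assumption: the message mentions "timeout" or "timed out" (wording may vary).
is_timeout_error <- function(e) {
  grepl("timeout|timed out", conditionMessage(e), ignore.case = TRUE)
}

# Hypothetical wrapper: returns the parsed page, or NULL if the request timed out.
# Any other error (e.g. a malformed URL) is re-raised unchanged.
fetch_or_null <- function(url, seconds = 10) {
  tryCatch(
    rvest::read_html(httr::GET(url, httr::timeout(seconds))),
    error = function(e) {
      if (is_timeout_error(e)) return(NULL)
      stop(e)
    }
  )
}
```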
Using it in a loop (e.g. skip to the next URL if one times out)
Try running this code. It assumes you have several URLs to visit (3 in this case). The second URL below delays 3 seconds before returning its HTML, which makes it a great way to test the functionality you're looking for. We set the timeout to 2 seconds so we know that request will fail. tryCatch() evaluates its first argument and, if that raises an error, runs the function supplied as its error argument; in this case the handler simply assigns 'Timed out!' to the list element.
my_urls <- c("https://stackoverflow.com/questions/48722076/how-to-set-timeout-in-rvest",
"http://httpbin.org/delay/3", # This url will delay 3 seconds
"http://httpbin.org/delay/1")
x <- list()
# Set timeout for 2 seconds (so second url will fail)
for (i in seq_along(my_urls)) {
  print(paste0("Scraping url number ", i))
  tryCatch(x[[i]] <- my_urls[i] %>% GET(., timeout(2)) %>% read_html,
           error = function(e) { x[[i]] <<- "Timed out!" })
}
Now we inspect the output: the first and third sites returned content, and the second timed out.
# > x
# [[1]]
# {xml_document}
# <html itemscope="" itemtype="http://schema.org/QAPage" class="html__responsive">
# [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<title>r - how to set timeout ...
# [2] <body class="question-page unified-theme">\r\n <div id="notify-container"></div>\r\n <div id="custom ...
#
# [[2]]
# [1] "Timed out!"
#
# [[3]]
# {xml_document}
# <html>
# [1] <body><p>{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {}, \n "headers": {\n "Accept": ...
Obviously you can set the timeout value to whatever you want. 30 - 60 seconds could be sensible depending on the use.
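The loop can also be written without <<- by having each iteration return its own result. The following is a minimal sketch under a few assumptions: scrape_all and its fetch argument are hypothetical names, failures are recorded as NULL rather than the string 'Timed out!', and fetch is only a parameter so the function can be tried with a stand-in instead of a real request.

```r
# Hypothetical helper: visit every URL, keeping NULL for any that error or time out.
scrape_all <- function(urls, seconds = 30,
                       fetch = function(url, seconds) {
                         # Default fetcher: same GET + timeout + read_html pattern
                         rvest::read_html(httr::GET(url, httr::timeout(seconds)))
                       }) {
  results <- lapply(urls, function(url) {
    # An error in one URL does not stop the others; it just yields NULL
    tryCatch(fetch(url, seconds), error = function(e) NULL)
  })
  names(results) <- urls
  results
}
```

With the URLs above, pages <- scrape_all(my_urls, seconds = 2) should leave the entry for the delayed URL as NULL, and names(pages)[vapply(pages, is.null, logical(1))] lists the URLs that failed.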